601_finalpro

Author

Sai Padma Pothula

Published

December 21, 2022

R Markdown

Introduction This data set describes about Bike buyers from Pacific, Europe and North America. Data can be collected from previous buyers records. Analysing and modelling these datasets gives us an idea of what kind of people are buying bikes.Based on this data, we can predict who would be likely to purchase a bike using a classification algorithm. Our potential target variable is “Purchased.Bike”, which is binary (Yes = 1, No = 0). It is not very easy to read this data because you should have a clear understanding on how certain variables are impacting some variables. This data will give us an insight on income, occupation, age, Marital status which I believe are major factors in purchasing a bike. I am interested to study this bike buyers dataset. However there might some difficulties to identify some patterns. I have never studied any data sets related to automobiles. I would like to have a hands on experience on automobile related things. I believe this will help me understand more about data. We will be analysing data like acquiring, examining, querying the data. Then, we will visualise the data and determine needs for cleaning that is the most important phase of any data project. After completion of data understanding phase, we will prepare the data. In the data preparation phase, we will determine how to use the data set. For example, correction, removing or replacing.

Data Description The data has been provided in the form of a CSV file, which contains the following information:

ID - An identifier column for each record
Marital Status - Is the record for a person who is Married, or Single
Gender - Is the record for a person who is Male, Female, or NA (not given)
Income - Income level of the person. Values given in integer dollars
Children - Number of children for the person
Education - Education level of the person
Occupation - Occupation that the person currently has
Home Owner - Is the person a home owner (Yes) or not (No)? NA indicates no data available
Cars - Number of cars that the person owns
Commute Distance - Distance to commute to ????
Region - Region the person is from
Age - Age of the person
Purchased Bike - Did the person purchase a bike (Yes) or not? (No)

Data Exploration

For the purposes of building a supervised classification algorithm, we set our target variable as Purchased Bike, which is 1 if the person did purchase a bike and 0 if the person did not.

We would now like to explore all the variables we have to understand their distributions, any outliers / missing values, and which are the best that can be used as feature variables.

These have been explored in the Jupyter notebook, with relevant observations noted in the markdown cells.

library('tidyverse')

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2

Warning: package 'ggplot2' was built under R version 4.2.2

Warning: package 'tibble' was built under R version 4.2.2

Warning: package 'stringr' was built under R version 4.2.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library('ggplot2')
bike_buyers = read_csv('_data/bike_buyers.csv')

Rows: 1000 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Marital Status, Gender, Education, Occupation, Home Owner, Commute ...
dbl (5): ID, Income, Children, Cars, Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bike_buyers

# A tibble: 1,000 × 13
      ID Marital S…¹ Gender Income Child…² Educa…³ Occup…⁴ Home …⁵  Cars Commu…⁶
   <dbl> <chr>       <chr>   <dbl>   <dbl> <chr>   <chr>   <chr>   <dbl> <chr>  
 1 12496 Married     Female  40000       1 Bachel… Skille… Yes         0 0-1 Mi…
 2 24107 Married     Male    30000       3 Partia… Cleric… Yes         1 0-1 Mi…
 3 14177 Married     Male    80000       5 Partia… Profes… No          2 2-5 Mi…
 4 24381 Single      <NA>    70000       0 Bachel… Profes… Yes         1 5-10 M…
 5 25597 Single      Male    30000       0 Bachel… Cleric… No          0 0-1 Mi…
 6 13507 Married     Female  10000       2 Partia… Manual  Yes         0 1-2 Mi…
 7 27974 Single      Male   160000       2 High S… Manage… <NA>        4 0-1 Mi…
 8 19364 Married     Male    40000       1 Bachel… Skille… Yes         0 0-1 Mi…
 9 22155 <NA>        Male    20000       2 Partia… Cleric… Yes         2 5-10 M…
10 19280 Married     Male       NA       2 Partia… Manual  Yes         1 0-1 Mi…
# … with 990 more rows, 3 more variables: Region <chr>, Age <dbl>,
#   `Purchased Bike` <chr>, and abbreviated variable names ¹`Marital Status`,
#   ²Children, ³Education, ⁴Occupation, ⁵`Home Owner`, ⁶`Commute Distance`

summary(cars)

     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.