DACSS 601: Data Science Fundamentals - FALL 2022

601_finalpro


Author

Sai Padma Pothula

Published

December 21, 2022

R Markdown

Introduction

This data set describes bike buyers from the Pacific, Europe, and North America. Data of this kind can be collected from the records of previous buyers. Analysing and modelling it gives us an idea of what kind of people buy bikes, and with a classification algorithm we can predict who is likely to purchase one. Our target variable is “Purchased Bike”, which is binary (Yes = 1, No = 0).

The data is not easy to read at a glance, because interpreting it requires a clear understanding of how the variables influence one another. It offers insight into income, occupation, age, and marital status, which I believe are major factors in purchasing a bike.

I am interested in studying this bike buyers data set, although some patterns may be difficult to identify. I have never studied a data set related to automobiles, and I would like hands-on experience in that area. I believe this will help me understand more about data.

The analysis starts with data understanding: acquiring, examining, and querying the data, then visualising it to determine what cleaning is needed, which is the most important phase of any data project. Data preparation follows, in which we decide how to use the data set, for example by correcting, removing, or replacing values.

Data Description

The data is provided as a CSV file, which contains the following columns:

  • ID - An identifier column for each record
  • Marital Status - Whether the person is Married or Single
  • Gender - Whether the person is Male or Female; NA means not given
  • Income - Income level of the person, in whole dollars
  • Children - Number of children the person has
  • Education - Education level of the person
  • Occupation - Current occupation of the person
  • Home Owner - Whether the person is a home owner (Yes) or not (No); NA means no data available
  • Cars - Number of cars the person owns
  • Commute Distance - Distance the person commutes, stored as a text range rather than a number
  • Region - Region the person is from
  • Age - Age of the person
  • Purchased Bike - Whether the person purchased a bike (Yes) or not (No)
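
The column list above can double as a quick sanity check once a file is loaded. The following is a minimal sketch in base R. The helper `missing_columns()` and the toy data frame are hypothetical and not part of the original analysis.

```r
# Column names as listed in the data description.
expected_columns <- c(
  "ID", "Marital Status", "Gender", "Income", "Children", "Education",
  "Occupation", "Home Owner", "Cars", "Commute Distance", "Region", "Age",
  "Purchased Bike"
)

# Hypothetical helper: names in `expected` that are absent from `df`.
missing_columns <- function(df, expected = expected_columns) {
  setdiff(expected, names(df))
}

# Made-up two-column frame; check.names = FALSE keeps the space in the name.
toy <- data.frame(
  ID = 1:2,
  `Marital Status` = c("Married", "Single"),
  check.names = FALSE
)

absent <- missing_columns(toy)
print(absent)
```

If the real file matches the description, `missing_columns(bike_buyers)` should return an empty character vector.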

Data Exploration

For the purposes of building a supervised classification algorithm, we set our target variable as Purchased Bike, encoded as 1 if the person purchased a bike (Yes) and 0 if the person did not (No).
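
The Yes/No to 1/0 encoding of the target could look like the sketch below. It uses base R on made-up rows. The function `encode_purchase()` and the new column name `purchased` are assumptions for illustration, not names from the original project.

```r
# Made-up rows; only the column name `Purchased Bike` comes from the data set.
toy_buyers <- data.frame(
  ID = 1:4,
  `Purchased Bike` = c("Yes", "No", "No", "Yes"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: map "Yes" to 1 and "No" to 0.
# Missing or unexpected values become NA instead of being guessed.
encode_purchase <- function(x) {
  lookup <- c(Yes = 1L, No = 0L)
  unname(lookup[x])
}

toy_buyers$purchased <- encode_purchase(toy_buyers[["Purchased Bike"]])
print(toy_buyers)
```

Leaving unexpected labels as NA is a deliberate choice here: it surfaces data problems at the cleaning stage rather than silently counting them as "No".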

We now explore all the variables to understand their distributions, identify any outliers and missing values, and decide which are best suited for use as feature variables.
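
As a rough illustration of this kind of exploration, the sketch below counts missing values per column and flags numeric outliers with the common 1.5 × IQR rule. The toy values are made up, and the IQR rule is an assumed choice, not necessarily the method used in the notebook.

```r
# Made-up rows using a few column names from the data description.
toy_buyers <- data.frame(
  Gender = c("Female", "Male", NA, "Male", "Female"),
  Income = c(40000, 30000, 70000, NA, 20000),
  `Home Owner` = c("Yes", NA, "Yes", "No", "Yes"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(toy_buyers))
print(missing_per_column)

# Frequency table for a categorical column, keeping NA as its own group.
print(table(toy_buyers$Gender, useNA = "ifany"))

# Hypothetical helper: TRUE where a value lies outside the 1.5 * IQR fences.
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  !is.na(x) & (x < q[[1]] - fence | x > q[[2]] + fence)
}

print(iqr_outliers(toy_buyers$Income))
```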

These have been explored in the Jupyter notebook, with relevant observations noted in the markdown cells.

library('tidyverse')
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'tibble' was built under R version 4.2.2
Warning: package 'stringr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
bike_buyers <- read_csv('_data/bike_buyers.csv')
Rows: 1000 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Marital Status, Gender, Education, Occupation, Home Owner, Commute ...
dbl (5): ID, Income, Children, Cars, Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike_buyers
# A tibble: 1,000 × 13
      ID Marital S…¹ Gender Income Child…² Educa…³ Occup…⁴ Home …⁵  Cars Commu…⁶
   <dbl> <chr>       <chr>   <dbl>   <dbl> <chr>   <chr>   <chr>   <dbl> <chr>  
 1 12496 Married     Female  40000       1 Bachel… Skille… Yes         0 0-1 Mi…
 2 24107 Married     Male    30000       3 Partia… Cleric… Yes         1 0-1 Mi…
 3 14177 Married     Male    80000       5 Partia… Profes… No          2 2-5 Mi…
 4 24381 Single      <NA>    70000       0 Bachel… Profes… Yes         1 5-10 M…
 5 25597 Single      Male    30000       0 Bachel… Cleric… No          0 0-1 Mi…
 6 13507 Married     Female  10000       2 Partia… Manual  Yes         0 1-2 Mi…
 7 27974 Single      Male   160000       2 High S… Manage… <NA>        4 0-1 Mi…
 8 19364 Married     Male    40000       1 Bachel… Skille… Yes         0 0-1 Mi…
 9 22155 <NA>        Male    20000       2 Partia… Cleric… Yes         2 5-10 M…
10 19280 Married     Male       NA       2 Partia… Manual  Yes         1 0-1 Mi…
# … with 990 more rows, 3 more variables: Region <chr>, Age <dbl>,
#   `Purchased Bike` <chr>, and abbreviated variable names ¹​`Marital Status`,
#   ²​Children, ³​Education, ⁴​Occupation, ⁵​`Home Owner`, ⁶​`Commute Distance`
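
The first ten rows printed above already contain missing entries (Marital Status, Gender, Income, Home Owner). The two preparation options named in the introduction, removing or replacing values, might be sketched as follows. The rows are made up, and the median and "Unknown" replacements are assumptions for illustration rather than the choices made in the project.

```r
# Made-up rows with gaps similar to those visible in the printed data.
toy_buyers <- data.frame(
  Gender = c("Female", "Male", NA, "Male"),
  Income = c(40000, NA, 70000, 30000),
  stringsAsFactors = FALSE
)

# Option 1: remove every row that has at least one missing value.
complete_rows <- toy_buyers[complete.cases(toy_buyers), ]

# Option 2: replace missing values instead of dropping rows.
imputed <- toy_buyers
# Numeric column: fill with the median of the observed values.
imputed$Income[is.na(imputed$Income)] <- median(imputed$Income, na.rm = TRUE)
# Categorical column: make missingness an explicit category.
imputed$Gender[is.na(imputed$Gender)] <- "Unknown"

print(complete_rows)
print(imputed)
```

Removing rows is simpler but shrinks the data, while replacing keeps every record at the cost of introducing assumed values. Which is appropriate depends on how much is missing in each column.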
