Homework3

Data Reading

Hanyu
4/3/2022

Read data

library(dplyr)
data <- read.csv(file = "business_data.csv")
head(data)
  User_ID Product_ID Gender   Age Occupation City_Category
1 1000001  P00069042      F  0-17         10             A
2 1000001  P00248942      F  0-17         10             A
3 1000001  P00087842      F  0-17         10             A
4 1000001  P00085442      F  0-17         10             A
5 1000002  P00285442      M   55+         16             C
6 1000003  P00193542      M 26-35         15             A
  Stay_In_Current_City_Years Marital_Status Product_Category_1
1                          2              0                  3
2                          2              0                  1
3                          2              0                 12
4                          2              0                 12
5                         4+              0                  8
6                          3              0                  1
  Product_Category_2 Product_Category_3 Purchase
1                 NA                 NA     8370
2                  6                 14    15200
3                 NA                 NA     1422
4                 14                 NA     1057
5                 NA                 NA     7969
6                  2                 NA    15227

Clean data

data <- data[, -11]     # Drop column 11 (Product_Category_3) because it has too many missing values
data[is.na(data)] <- 0  # Fill the remaining missing values (in Product_Category_2) with 0
head(data)
  User_ID Product_ID Gender   Age Occupation City_Category
1 1000001  P00069042      F  0-17         10             A
2 1000001  P00248942      F  0-17         10             A
3 1000001  P00087842      F  0-17         10             A
4 1000001  P00085442      F  0-17         10             A
5 1000002  P00285442      M   55+         16             C
6 1000003  P00193542      M 26-35         15             A
  Stay_In_Current_City_Years Marital_Status Product_Category_1
1                          2              0                  3
2                          2              0                  1
3                          2              0                 12
4                          2              0                 12
5                         4+              0                  8
6                          3              0                  1
  Product_Category_2 Purchase
1                  0     8370
2                  6    15200
3                  0     1422
4                 14     1057
5                  0     7969
6                  2    15227
glimpse(data)
Rows: 537,577
Columns: 11
$ User_ID                    <int> 1000001, 1000001, 1000001, 100000…
$ Product_ID                 <chr> "P00069042", "P00248942", "P00087…
$ Gender                     <chr> "F", "F", "F", "F", "M", "M", "M"…
$ Age                        <chr> "0-17", "0-17", "0-17", "0-17", "…
$ Occupation                 <int> 10, 10, 10, 10, 16, 15, 7, 7, 7, …
$ City_Category              <chr> "A", "A", "A", "A", "C", "A", "B"…
$ Stay_In_Current_City_Years <chr> "2", "2", "2", "2", "4+", "3", "2…
$ Marital_Status             <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, …
$ Product_Category_1         <int> 3, 1, 12, 12, 8, 1, 1, 1, 1, 8, 5…
$ Product_Category_2         <dbl> 0, 6, 0, 14, 0, 2, 8, 15, 16, 0, …
$ Purchase                   <int> 8370, 15200, 1422, 1057, 7969, 15…
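
The decision to drop Product_Category_3 can be checked by measuring the share of missing values in each column before cleaning. The following is a minimal sketch on a small made-up data frame; `toy` and its values are illustrative assumptions, not the real file. It also drops the column by name rather than by position, an alternative that does not depend on column order.

```r
# Toy data standing in for the real file (values are made up for illustration)
toy <- data.frame(
  Product_Category_1 = c(3L, 1L, 12L, 8L),
  Product_Category_2 = c(NA, 6L, NA, 14L),
  Product_Category_3 = c(NA, 14L, NA, NA),
  Purchase           = c(8370L, 15200L, 1422L, 1057L)
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))
print(na_share)

# Drop the sparse column by name instead of by position
toy$Product_Category_3 <- NULL

# Replace the remaining missing values with 0
toy[is.na(toy)] <- 0

# No missing values should remain
stopifnot(!anyNA(toy))
```

Running `colMeans(is.na(data))` on the real data before the cleaning step would show whether Product_Category_3 is in fact much sparser than Product_Category_2.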

Introduction

The raw dataset has 12 variables and 537,577 rows. The cleaned data keeps 11 of them, since Product_Category_3 was dropped. The 12 variables are:

  1. User_ID: The shopper’s ID code.
  2. Product_ID: The product’s ID code.
  3. Gender: The shopper’s gender (F or M).
  4. Age: The shopper’s age group (for example 0-17, 26-35, 55+).
  5. Occupation: The shopper’s occupation, stored as an integer code.
  6. City_Category: The category (A, B, or C) of the city where the shopper lives.
  7. Stay_In_Current_City_Years: The number of years the shopper has lived in the current city.
  8. Marital_Status: The shopper’s marital status, coded 0 or 1.
  9. Product_Category_1: The category of the product purchased.
  10. Product_Category_2: A second category the product may also belong to.
  11. Product_Category_3: A third category the product may also belong to (dropped during cleaning).
  12. Purchase: The purchase amount.
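
The `glimpse()` output shows that several categorical variables are stored as character or integer columns. Converting them to factors before analysis can help. The sketch below does this on a few made-up rows. The Age levels listed are only those visible in the sample output, and the full dataset may contain others, so `age_levels` is an assumption to adjust.

```r
# Made-up rows mirroring the column types shown by glimpse()
toy <- data.frame(
  Gender         = c("F", "M", "M"),
  Age            = c("0-17", "55+", "26-35"),
  City_Category  = c("A", "C", "A"),
  Marital_Status = c(0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Age groups have a natural order, so use an ordered factor.
# Assumption: only the levels seen in the sample output are listed here.
age_levels <- c("0-17", "26-35", "55+")
toy$Age <- factor(toy$Age, levels = age_levels, ordered = TRUE)

# Unordered categories
toy$Gender         <- factor(toy$Gender)
toy$City_Category  <- factor(toy$City_Category)
toy$Marital_Status <- factor(toy$Marital_Status)  # 0/1 codes; their meaning is not stated in the source

str(toy)
```

Any Age value missing from `age_levels` becomes `NA` after conversion, so it is worth comparing `unique(data$Age)` against the level vector first.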

Potential research questions

  1. Which product categories are the most popular?
  2. How does the level of spending differ by gender?
  3. How does the level of spending differ across city categories?
  4. How does the level of spending differ between age groups?
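
As a rough illustration of how these questions could be approached with dplyr, the sketch below counts purchases per product category and averages the purchase amount by gender. The data frame `toy` and its values are made up for illustration; the summaries shown are one possible starting point, not results from the real data.

```r
library(dplyr)

# Made-up purchases standing in for the cleaned data
toy <- data.frame(
  Gender             = c("F", "F", "M", "M", "M"),
  Product_Category_1 = c(3L, 1L, 8L, 1L, 1L),
  Purchase           = c(8370, 15200, 7969, 15227, 19215)
)

# Question 1: number of purchases per product category, most frequent first
popular <- toy %>% count(Product_Category_1, sort = TRUE)

# Question 2: average purchase amount and number of purchases by gender
by_gender <- toy %>%
  group_by(Gender) %>%
  summarise(mean_purchase = mean(Purchase), n = n())

print(popular)
print(by_gender)
```

Questions 3 and 4 would follow the same pattern, grouping by `City_Category` or `Age` instead of `Gender`.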

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Hanyu (2022, May 19). Data Analytics and Computational Social Science: Homework3. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomhanyu902257/

BibTeX citation

@misc{hanyu2022homework3,
  author = {Hanyu},
  title = {Data Analytics and Computational Social Science: Homework3},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomhanyu902257/},
  year = {2022}
}