Data Reading
User_ID Product_ID Gender Age Occupation City_Category
1 1000001 P00069042 F 0-17 10 A
2 1000001 P00248942 F 0-17 10 A
3 1000001 P00087842 F 0-17 10 A
4 1000001 P00085442 F 0-17 10 A
5 1000002 P00285442 M 55+ 16 C
6 1000003 P00193542 M 26-35 15 A
Stay_In_Current_City_Years Marital_Status Product_Category_1
1 2 0 3
2 2 0 1
3 2 0 12
4 2 0 12
5 4+ 0 8
6 3 0 1
Product_Category_2 Product_Category_3 Purchase
1 NA NA 8370
2 6 14 15200
3 NA NA 1422
4 14 NA 1057
5 NA NA 7969
6 2 NA 15227
data <- data[, -11] # Drop Product_Category_3 (column 11) because it has too many missing values
data[is.na(data)] <- 0 # Fill the remaining missing values (in Product_Category_2) with 0
head(data)
User_ID Product_ID Gender Age Occupation City_Category
1 1000001 P00069042 F 0-17 10 A
2 1000001 P00248942 F 0-17 10 A
3 1000001 P00087842 F 0-17 10 A
4 1000001 P00085442 F 0-17 10 A
5 1000002 P00285442 M 55+ 16 C
6 1000003 P00193542 M 26-35 15 A
Stay_In_Current_City_Years Marital_Status Product_Category_1
1 2 0 3
2 2 0 1
3 2 0 12
4 2 0 12
5 4+ 0 8
6 3 0 1
Product_Category_2 Purchase
1 0 8370
2 6 15200
3 0 1422
4 14 1057
5 0 7969
6 2 15227
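The two cleaning steps above can be tried on a small stand-in before running them on the full data. The sketch below is illustrative only: `toy`, `na_share`, and `toy_clean` are hypothetical names, and the values are copied from the six rows printed above. It measures the share of missing values per column, drops Product_Category_3 by name rather than by position, and zero-fills what is left.

```r
# Hypothetical toy data frame built from the six rows printed above.
toy <- data.frame(
  Product_Category_1 = c(3, 1, 12, 12, 8, 1),
  Product_Category_2 = c(NA, 6, NA, 14, NA, 2),
  Product_Category_3 = c(NA, 14, NA, NA, NA, NA),
  Purchase           = c(8370, 15200, 1422, 1057, 7969, 15227)
)

# Share of missing values in each column (is.na() returns a logical matrix).
na_share <- colMeans(is.na(toy))
print(na_share)

# Drop the column by name; unlike data[, -11] this does not depend on column order.
toy_clean <- toy[, names(toy) != "Product_Category_3"]

# Replace every remaining NA with 0.
toy_clean[is.na(toy_clean)] <- 0
print(toy_clean)
```

Filling with 0 treats a missing Product_Category_2 as "no second category", which is only safe if 0 is not itself a valid category code; that is an assumption worth checking against the data dictionary.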
glimpse(data)
Rows: 537,577
Columns: 11
$ User_ID <int> 1000001, 1000001, 1000001, 100000…
$ Product_ID <chr> "P00069042", "P00248942", "P00087…
$ Gender <chr> "F", "F", "F", "F", "M", "M", "M"…
$ Age <chr> "0-17", "0-17", "0-17", "0-17", "…
$ Occupation <int> 10, 10, 10, 10, 16, 15, 7, 7, 7, …
$ City_Category <chr> "A", "A", "A", "A", "C", "A", "B"…
$ Stay_In_Current_City_Years <chr> "2", "2", "2", "2", "4+", "3", "2…
$ Marital_Status <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, …
$ Product_Category_1 <int> 3, 1, 12, 12, 8, 1, 1, 1, 1, 8, 5…
$ Product_Category_2 <dbl> 0, 6, 0, 14, 0, 2, 8, 15, 16, 0, …
$ Purchase <int> 8370, 15200, 1422, 1057, 7969, 15…
The original dataset has 12 variables and 537,577 observations. After dropping Product_Category_3, 11 variables remain, as listed in the glimpse() output above.
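A quick way to confirm which column a positional drop such as `data[, -11]` removes is to check the names before and after. The sketch below assumes a hypothetical placeholder frame (`toy`, filled with zeros) that only shares the 12 column names shown in the output above; it is not the real data.

```r
# The 12 column names, in the order printed by head() above.
cols <- c("User_ID", "Product_ID", "Gender", "Age", "Occupation",
          "City_Category", "Stay_In_Current_City_Years", "Marital_Status",
          "Product_Category_1", "Product_Category_2", "Product_Category_3",
          "Purchase")

# Hypothetical placeholder frame: 6 rows of zeros with those column names.
toy <- as.data.frame(matrix(0, nrow = 6, ncol = length(cols),
                            dimnames = list(NULL, cols)))

dims_before <- dim(toy)      # rows and columns before the drop
dropped     <- names(toy)[11] # name of the column at position 11
toy         <- toy[, -11]
dims_after  <- dim(toy)      # rows and columns after the drop

print(dims_before)
print(dropped)
print(dims_after)
```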
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hanyu (2022, May 19). Data Analytics and Computational Social Science: Homework3. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomhanyu902257/
BibTeX citation
@misc{hanyu2022homework3,
  author = {Hanyu, },
  title = {Data Analytics and Computational Social Science: Homework3},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomhanyu902257/},
  year = {2022}
}