Basic Data Wrangling
PROBLEM 1. Identify the dataset you will be using for the final project. The dataset may come from the course datasets, or it may be a dataset you found on your own. We encourage students to find datasets pertaining to their own interests. You should also identify the variables in the dataset, including what type of data each variable is.
The dataset I will be using contains detailed climate data from NOAA's National Centers for Environmental Information (NCEI). More specifically, it contains information on temperature and precipitation in the United States. It can be found at https://www.ncei.noaa.gov/data/climdiv/ under access/ > -pcpncy-v1.0.0-20220304.
The data required significant cleaning. In the original data set, the row names are coded. Certain digits of the row names represent which state, metric (temperature or precipitation), and year are associated with that particular row. The 12 columns represent the 12 months of that year. The 12 column variables are of type double, because they contain the precipitation or temperature measurements for that month. We will revisit the variables and variable types after we re-organize the data to make it more interpretable.
PROBLEM 2. Read in/clean the data set. Using R packages like readr/readxl, dplyr, and tidyr, import the data set to R and clean it as needed.
During the process of reading in the data set, I ran into some challenges. To download the data, I clicked a link that brought me to a page displaying the data set. I am accustomed to downloading data in .csv format, but the data set provided was not a .csv and contained no commas.
At first, I tried selecting the data, copying it, and pasting it into Notepad. I saved it as a .txt file, but this did not work because the file was “too large to import”. Initially I thought this meant I would have to cut down the number of rows or columns, but I decided to try converting the file to a .csv first.
I found an application that converted files from the format .txt to .csv, and after converting the data set into a .csv I was able to import it appropriately. Below, I read in the .csv file and print the first 6 rows.
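In hindsight, the manual .txt-to-.csv conversion could likely have been avoided: readr can parse whitespace-delimited files directly. A minimal sketch, using a tiny stand-in file written to a temporary path (the layout and values mirror the rows printed below; the real file would be passed in place of the temp file):

```r
library(readr)

# Two sample records in the raw layout: an 11-digit coded ID followed by
# 12 monthly values (copied from the rows printed in this post).
sample_path <- tempfile(fileext = ".txt")
writeLines(c(
  "01001011895  7.03  2.96  8.36  3.53  3.96  5.40  3.92  3.36  0.73  2.03  1.44  3.66",
  "01001011896  5.86  5.42  5.54  3.98  3.77  6.24  4.38  2.57  0.82  1.66  2.89  1.94"
), sample_path)

# read_table() handles runs of whitespace; reading the ID as character
# keeps the leading zero of the state code from being dropped.
climate_raw <- read_table(sample_path,
                          col_names = c("code", month.name),
                          col_types = cols(code = col_character(),
                                           .default = col_double()))
```

Keeping the coded ID as a character column is worth considering in general, since converting it to a number silently strips the leading zero.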
X01001011895 X7.03 X2.96 X8.36 X3.53 X3.96 X5.40 X3.92 X3.36 X0.73
1 1001011896 5.86 5.42 5.54 3.98 3.77 6.24 4.38 2.57 0.82
2 1001011897 3.27 6.63 10.94 4.35 0.81 1.57 3.96 5.02 0.87
3 1001011898 2.33 2.07 2.60 4.56 0.54 3.13 5.80 6.02 1.51
4 1001011899 5.80 6.94 3.35 2.22 2.93 2.31 6.80 2.90 0.63
5 1001011900 3.18 9.07 5.77 7.14 1.63 7.36 3.35 3.85 4.74
6 1001011901 5.20 4.39 6.35 4.61 5.44 2.24 2.79 5.58 3.75
X2.03 X1.44 X3.66
1 1.66 2.89 1.94
2 0.75 1.84 4.38
3 3.21 6.66 3.91
4 3.02 1.98 5.25
5 5.92 4.09 4.89
6 1.01 2.07 7.55
As mentioned, the data is difficult to interpret because it has not yet been cleaned: the column names are garbled and the identifiers are coded. My next step was to clean the data to make it more interpretable.
I chose this data set partly because of these structural challenges. In addition to being interested in climate data, I thought it would be good practice to learn how to reshape a data set into the form I wanted.
Although data sets can take on many forms, in the most standard case, the rows tend to represent observations, and the columns represent their attributes. Here this is not the case. The biggest challenge was tackling the coded row names, but before that I took care of smaller issues.
head(climate)
X01001011895 X7.03 X2.96 X8.36 X3.53 X3.96 X5.40 X3.92 X3.36 X0.73
1 1001011896 5.86 5.42 5.54 3.98 3.77 6.24 4.38 2.57 0.82
2 1001011897 3.27 6.63 10.94 4.35 0.81 1.57 3.96 5.02 0.87
3 1001011898 2.33 2.07 2.60 4.56 0.54 3.13 5.80 6.02 1.51
4 1001011899 5.80 6.94 3.35 2.22 2.93 2.31 6.80 2.90 0.63
5 1001011900 3.18 9.07 5.77 7.14 1.63 7.36 3.35 3.85 4.74
6 1001011901 5.20 4.39 6.35 4.61 5.44 2.24 2.79 5.58 3.75
X2.03 X1.44 X3.66
1 1.66 2.89 1.94
2 0.75 1.84 4.38
3 3.21 6.66 3.91
4 3.02 1.98 5.25
5 5.92 4.09 4.89
6 1.01 2.07 7.55
For example, as seen above, the original data set has no real column names: the first row of measurements was misread as the header and sits where the column names should be. We don’t want that; column names are useful, and the measurement values for the first observation belong in the first row. Thus I planned to rename the columns and then move the first row back to where it belongs.
First, I changed the column names to represent the months of the year.
I will print the data set before and after this last step, to compare.
X01001011895 X7.03 X2.96 X8.36 X3.53 X3.96 X5.40 X3.92 X3.36 X0.73
1 1001011896 5.86 5.42 5.54 3.98 3.77 6.24 4.38 2.57 0.82
2 1001011897 3.27 6.63 10.94 4.35 0.81 1.57 3.96 5.02 0.87
3 1001011898 2.33 2.07 2.60 4.56 0.54 3.13 5.80 6.02 1.51
4 1001011899 5.80 6.94 3.35 2.22 2.93 2.31 6.80 2.90 0.63
5 1001011900 3.18 9.07 5.77 7.14 1.63 7.36 3.35 3.85 4.74
6 1001011901 5.20 4.39 6.35 4.61 5.44 2.24 2.79 5.58 3.75
X2.03 X1.44 X3.66
1 1.66 2.89 1.94
2 0.75 1.84 4.38
3 3.21 6.66 3.91
4 3.02 1.98 5.25
5 5.92 4.09 4.89
6 1.01 2.07 7.55
#data set with column names
head(climate2)
a January February March April May June July August
1 1001011896 5.86 5.42 5.54 3.98 3.77 6.24 4.38 2.57
2 1001011897 3.27 6.63 10.94 4.35 0.81 1.57 3.96 5.02
3 1001011898 2.33 2.07 2.60 4.56 0.54 3.13 5.80 6.02
4 1001011899 5.80 6.94 3.35 2.22 2.93 2.31 6.80 2.90
5 1001011900 3.18 9.07 5.77 7.14 1.63 7.36 3.35 3.85
6 1001011901 5.20 4.39 6.35 4.61 5.44 2.24 2.79 5.58
September October November December
1 0.82 1.66 2.89 1.94
2 0.87 0.75 1.84 4.38
3 1.51 3.21 6.66 3.91
4 0.63 3.02 1.98 5.25
5 4.74 5.92 4.09 4.89
6 3.75 1.01 2.07 7.55
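The renaming step above can also be done programmatically: R's built-in `month.name` vector supplies the twelve month names, so none of them need to be typed by hand. A minimal sketch on a one-row stand-in data frame shaped like the imported data (one ID column plus twelve monthly value columns):

```r
# Stand-in: 13 columns whose default names (V1..V13) play the role of the
# garbled originals; the values 1:13 are placeholders.
df <- as.data.frame(matrix(1:13, nrow = 1))

# First column becomes "a", the remaining twelve take the month names.
names(df) <- c("a", month.name)
```

The same one-liner applied to the real data frame would produce the column names shown in `climate2` above.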
Next, I will transfer the original column names into the first row, and print out the resulting data frame. During the first step, I was careful to save the contents of the first row, which made it easy to manually transfer the information:
library(tidyverse)

climate3 <- climate2
climate3 <- climate3 %>%
  add_row("a" = 01001011895, "January" = 7.03, "February" = 2.96, "March" = 8.36,
          "April" = 3.53, "May" = 3.96, "June" = 5.40, "July" = 3.92,
          "August" = 3.36, "September" = 0.73, "October" = 2.03,
          "November" = 1.44, "December" = 3.66, .before = 1)
head(climate3)
a January February March April May June July August
1 1001011895 7.03 2.96 8.36 3.53 3.96 5.40 3.92 3.36
2 1001011896 5.86 5.42 5.54 3.98 3.77 6.24 4.38 2.57
3 1001011897 3.27 6.63 10.94 4.35 0.81 1.57 3.96 5.02
4 1001011898 2.33 2.07 2.60 4.56 0.54 3.13 5.80 6.02
5 1001011899 5.80 6.94 3.35 2.22 2.93 2.31 6.80 2.90
6 1001011900 3.18 9.07 5.77 7.14 1.63 7.36 3.35 3.85
September October November December
1 0.73 2.03 1.44 3.66
2 0.82 1.66 2.89 1.94
3 0.87 0.75 1.84 4.38
4 1.51 3.21 6.66 3.91
5 0.63 3.02 1.98 5.25
6 4.74 5.92 4.09 4.89
After that, the data set had column names, and the first row was placed where it belongs.
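Both fixes stem from the same root cause: the first record was consumed as the header on import. An alternative worth knowing is to tell readr there is no header row, which avoids both manual steps at once. A sketch on a tiny stand-in .csv (only two months, so it runs on its own; the real file and full month list would be substituted):

```r
library(readr)

# Stand-in for the converted .csv: no header row, just data records.
path <- tempfile(fileext = ".csv")
writeLines(c("01001011895,7.03,2.96",
             "01001011896,5.86,5.42"), path)

# Supplying col_names directly keeps the first record as data instead of
# letting it become the header; "cdd" = character, double, double.
clim <- read_csv(path, col_names = c("a", "January", "February"),
                 col_types = "cdd")
```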
Next I worked on dealing with the coded row names. Though the set-up of the original data frame technically works, it is difficult to perform analyses on it.
As mentioned in problem 1, each row name is a coded number. According to the data set description, the first two digits represent the state, the next three represent the division, the sixth represents the metric (the climate measurement, like temperature), and the seventh to tenth represent the year. There is a legend on the website which lists the code number for each state, division, and metric. My goal was to convert the data set, such that instead of relying on coded row names, it would contain columns that represented the state, division, metric, and year. I perform this data tidying process below:
#Change the row names:
#(1) Make the row names contain the values in column a.
climate4 <- climate3
rownames(climate4) = climate4$a
#(2) Eliminate the now unnecessary column a
climate5 <- climate4[,-(1),drop=FALSE]
#Add columns to represent the digits in the row names.
climate6 <- climate5 %>%
  mutate(state = as.numeric(substr(rownames(climate5), 1, 2)),
         division = as.numeric(substr(rownames(climate5), 3, 5)),
         metric = as.numeric(substr(rownames(climate5), 6, 6)),
         year = as.numeric(substr(rownames(climate5), 7, 10)))
#Add another column to explicitly describe what metric number codings 1, 2, 27, and 28 refer to.
climate7 <- climate6 %>%
mutate(metric_name = case_when(
metric == 1 ~ "Precipitation",
metric == 2 ~ "Average Temperature",
metric == 27 ~ "Maximum Temperature",
metric == 28 ~ "Minimum Temperature"
))
#Rename row names to simply be ID numbers from 1 to 400,636.
climate8 <- climate7
rownames(climate8) = 1:400636
head(climate8)
January February March April May June July August September
1 7.03 2.96 8.36 3.53 3.96 5.40 3.92 3.36 0.73
2 5.86 5.42 5.54 3.98 3.77 6.24 4.38 2.57 0.82
3 3.27 6.63 10.94 4.35 0.81 1.57 3.96 5.02 0.87
4 2.33 2.07 2.60 4.56 0.54 3.13 5.80 6.02 1.51
5 5.80 6.94 3.35 2.22 2.93 2.31 6.80 2.90 0.63
6 3.18 9.07 5.77 7.14 1.63 7.36 3.35 3.85 4.74
October November December state division metric year metric_name
1 2.03 1.44 3.66 10 10 1 1895 Precipitation
2 1.66 2.89 1.94 10 10 1 1896 Precipitation
3 0.75 1.84 4.38 10 10 1 1897 Precipitation
4 3.21 6.66 3.91 10 10 1 1898 Precipitation
5 3.02 1.98 5.25 10 10 1 1899 Precipitation
6 5.92 4.09 4.89 10 10 1 1900 Precipitation
Now the data frame is tidied. The row names are simply ID numbers from 1 to 400,636, and I added columns for state, division, metric, year, and metric name. The first 12 columns of each row contain the precipitation or temperature values from January to December for a specific state, division, and year. The metric column gives the coded number for the metric reported in that row; because it is coded, I included an additional column, metric_name, which states whether that metric is precipitation, average temperature, maximum temperature, or minimum temperature. I chose to keep the state and division columns coded, which means it will be useful to have the legend available during further analysis of the data set.
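If the coded state and division columns become tiresome to work with, the legend from the website could be typed into a small lookup table and joined on. A minimal sketch of the mechanics (the code-to-name pairs below are placeholders, not taken from the actual legend):

```r
library(dplyr)

# Placeholder fragment of a state legend -- substitute the real
# code/name pairs from the NCEI website before relying on it.
state_legend <- tibble::tibble(
  state      = c(1, 2),
  state_name = c("Alabama", "Arizona")
)

# Stand-in rows shaped like climate8's state/year columns.
rows <- tibble::tibble(state = c(1, 2), year = c(1895, 1896))

# left_join keeps every data row and attaches the matching name.
decoded <- left_join(rows, state_legend, by = "state")
```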
PROBLEM 3. Identify potential research questions that your data set can help answer.
head(climate8)
January February March April May June July August September
1 7.03 2.96 8.36 3.53 3.96 5.40 3.92 3.36 0.73
2 5.86 5.42 5.54 3.98 3.77 6.24 4.38 2.57 0.82
3 3.27 6.63 10.94 4.35 0.81 1.57 3.96 5.02 0.87
4 2.33 2.07 2.60 4.56 0.54 3.13 5.80 6.02 1.51
5 5.80 6.94 3.35 2.22 2.93 2.31 6.80 2.90 0.63
6 3.18 9.07 5.77 7.14 1.63 7.36 3.35 3.85 4.74
October November December state division metric year metric_name
1 2.03 1.44 3.66 10 10 1 1895 Precipitation
2 1.66 2.89 1.94 10 10 1 1896 Precipitation
3 0.75 1.84 4.38 10 10 1 1897 Precipitation
4 3.21 6.66 3.91 10 10 1 1898 Precipitation
5 3.02 1.98 5.25 10 10 1 1899 Precipitation
6 5.92 4.09 4.89 10 10 1 1900 Precipitation
Possible research questions I might explore include:
- Has the average amount of precipitation in California changed over time?
- Has the average temperature increased in certain states over time? Has it become more erratic (i.e., more extreme highs and lows)?
- Which state had the highest precipitation in recent years? The highest temperature?
- Which state/division sees the greatest range of precipitation over the course of a year? What further research might help us understand the implications of these results?
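As a sketch of how the first question could be approached: reshape the monthly columns to long format with tidyr, filter to the precipitation rows for the state of interest, and average by year. The stand-in tibble below uses only two months and two years so it runs on its own (its values are copied from the rows printed above; the real state code for California would come from the legend):

```r
library(dplyr)
library(tidyr)

# Stand-in for two rows of climate8, trimmed to two monthly columns.
clim <- tibble::tibble(
  January = c(7.03, 5.86), February = c(2.96, 5.42),
  state = c(4, 4), year = c(1895, 1896),
  metric_name = "Precipitation"
)

# One row per (year, month) observation, then a yearly average.
yearly <- clim %>%
  pivot_longer(January:February, names_to = "month", values_to = "value") %>%
  filter(metric_name == "Precipitation") %>%
  group_by(year) %>%
  summarise(mean_precip = mean(value))
```

On the full data set, the same pipeline over all twelve months (and with a filter on the state code) would give one average precipitation value per year, ready for plotting against time.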
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Laidler (2022, May 4). Data Analytics and Computational Social Science: Homework 3. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httprpubscomericalaidler898200/
BibTeX citation
@misc{laidler2022homework,
  author = {Laidler, Erica},
  title = {Data Analytics and Computational Social Science: Homework 3},
  url = {https://github.com/DACSS/dacss_course_website/posts/httprpubscomericalaidler898200/},
  year = {2022}
}