Homework_2

Reading in Data

Cynthia Hester
09-29-2021

Reading in the first data set

Reading in, or importing, data files into RStudio is the necessary first step before any cleaning or tidying can happen. Once the imported data is cleaned, it is better suited for exploration.

Data formats are not homogeneous; they come in many different flavors. Whether the data is in CSV, SPSS, XLSX, SAS, TXT, Stata, or HTML, or in one of many other formats, there is usually an R package that can read it in.
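
As a rough illustration of the point above, the sketch below pairs a few file extensions with reader functions that are commonly used for them. The pairing is my own assumption for illustration, not an exhaustive list, and `suggest_reader()` and the file name `survey.dta` are hypothetical names made up for this example.

```r
# Illustrative lookup from file extension to a commonly used reader function.
# The pairing is an assumption for this sketch, not an official list.
readers <- c(
  csv      = "readr::read_csv",
  txt      = "utils::read.table",
  xlsx     = "readxl::read_excel",
  sav      = "haven::read_sav",   # SPSS
  sas7bdat = "haven::read_sas",   # SAS
  dta      = "haven::read_dta",   # Stata
  html     = "rvest::read_html"
)

# suggest_reader() is a hypothetical helper written for this example.
# It returns NA when the extension is not in the lookup.
suggest_reader <- function(path) {
  ext <- tolower(tools::file_ext(path))
  unname(readers[ext])
}

suggest_reader("survey.dta")
```

The helper only returns the name of a function as text; actually calling that reader still requires the corresponding package to be installed.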

The first data set I will read in comes from the datasets package that is included with R. It is the mtcars (Motor Trend Car Road Tests) data set, which was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 cars (1973-74 models).

This R chunk loads the datasets package and prints summary statistics for the mtcars data set.
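
The code for this chunk is not shown in the post, so the following is a reconstruction of what it might look like, assuming only base R.

```r
# The datasets package ships with R and is normally attached at startup;
# calling library() here just makes the dependency explicit.
library(datasets)

# Load mtcars into the workspace and print per-column summary statistics.
data(mtcars)
summary(mtcars)
```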

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

This R chunk uses skim() from the skimr package as an alternative to summary(). skim() gives a more comprehensive overview of the mtcars data set, including missing-value counts, standard deviations, and a small inline histogram for each variable (the hist column).
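
A minimal sketch of such a chunk, assuming the skimr package has been installed from CRAN. The `requireNamespace()` guard is my addition so the code does nothing, rather than failing, when skimr is missing.

```r
# skimr is a separate CRAN package (assumed installed with install.packages("skimr")).
if (requireNamespace("skimr", quietly = TRUE)) {
  # skim() returns one summary row per column of the data frame.
  overview <- skimr::skim(mtcars)
  print(overview)
}
```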

Table 1: Data summary
Name mtcars
Number of rows 32
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 20.09 6.03 10.40 15.43 19.20 22.80 33.90 ▃▇▅▁▂
cyl 0 1 6.19 1.79 4.00 4.00 6.00 8.00 8.00 ▆▁▃▁▇
disp 0 1 230.72 123.94 71.10 120.83 196.30 326.00 472.00 ▇▃▃▃▂
hp 0 1 146.69 68.56 52.00 96.50 123.00 180.00 335.00 ▇▇▆▃▁
drat 0 1 3.60 0.53 2.76 3.08 3.70 3.92 4.93 ▇▃▇▅▁
wt 0 1 3.22 0.98 1.51 2.58 3.33 3.61 5.42 ▃▃▇▁▂
qsec 0 1 17.85 1.79 14.50 16.89 17.71 18.90 22.90 ▃▇▇▂▁
vs 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
am 0 1 0.41 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
gear 0 1 3.69 0.74 3.00 3.00 4.00 4.00 5.00 ▇▁▆▁▂
carb 0 1 2.81 1.62 1.00 2.00 2.00 4.00 8.00 ▇▂▅▁▁

This R chunk shows how granular skim() can be by summarizing only selected columns, here hp and wt.
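
A sketch of the column selection, again assuming skimr is installed. skim() accepts unquoted column names after the data argument; the guard around the call is my addition.

```r
# Summarize only the hp and wt columns of mtcars.
if (requireNamespace("skimr", quietly = TRUE)) {
  selected <- skimr::skim(mtcars, hp, wt)
  print(selected)
}
```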

Table 2: Data summary
Name mtcars
Number of rows 32
Number of columns 11
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hp 0 1 146.69 68.56 52.00 96.50 123.00 180.00 335.00 ▇▇▆▃▁
wt 0 1 3.22 0.98 1.51 2.58 3.33 3.61 5.42 ▃▃▇▁▂

This R chunk provides the column names of the mtcars dataset using the colnames() function.
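
A reconstruction of this chunk using base R only; the variable name `cols` is mine.

```r
# colnames() returns the column names of a data frame as a character vector.
cols <- colnames(mtcars)
cols
```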

 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"  
[10] "gear" "carb"

This R chunk introduces the dim() function, which reports the dimensions of the data set. It shows that this data frame has 32 rows and 11 columns.
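
A sketch of the dimension check in base R. The nrow() and ncol() calls are my addition, included to show how each dimension can be read on its own.

```r
# dim() returns c(rows, columns) for a data frame.
dim(mtcars)

# nrow() and ncol() give each dimension separately.
nrow(mtcars)
ncol(mtcars)
```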

[1] 32 11

This R chunk shows a generic visualization of the mtcars object using the plot() function. For a data frame, plot() draws a scatterplot matrix of every pair of columns.
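
A sketch of the plotting call in base R. The second call, which plots only three columns, is my own addition; with all 11 columns the panels get quite small.

```r
# Calling plot() on a data frame draws a scatterplot matrix of all column pairs.
plot(mtcars)

# A narrower view of three columns is often easier to read.
three_cols <- mtcars[, c("mpg", "hp", "wt")]
plot(three_cols)
```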

The second data set comes from the course CSV file eggs_tidy.

I wanted to try reading in data from an external file in CSV format.

This first R chunk reads in the eggs_tidy CSV data.
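
The course file itself is not included in this post, so the sketch below writes a small stand-in CSV to a temporary file and reads it back with base R's read.csv(). The two rows are copied from the tibble printout in this post and cover only the columns whose values are visible there. The original chunk may have used a different reader function, and `csv_path` and `eggs_sample` are names made up for this example.

```r
# Stand-in file: two rows copied from the tibble printout in this post,
# limited to the columns whose values are shown there.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "month,year,large_half_dozen,large_dozen,extra_large_half_dozen",
  "January,2004,126,230,132",
  "March,2004,131,225,137"
), csv_path)

# Read the CSV into a data frame, keeping month as character rather than factor.
eggs_sample <- read.csv(csv_path, stringsAsFactors = FALSE)
str(eggs_sample)
```

With the real file, the temporary path would be replaced by the location of eggs_tidy on disk.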

This R chunk summarizes the eggs_tidy data set.

    month                year      large_half_dozen  large_dozen   
 Length:120         Min.   :2004   Min.   :126.0    Min.   :225.0  
 Class :character   1st Qu.:2006   1st Qu.:129.4    1st Qu.:233.5  
 Mode  :character   Median :2008   Median :174.5    Median :267.5  
                    Mean   :2008   Mean   :155.2    Mean   :254.2  
                    3rd Qu.:2011   3rd Qu.:174.5    3rd Qu.:268.0  
                    Max.   :2013   Max.   :178.0    Max.   :277.5  
 extra_large_half_dozen extra_large_dozen
 Min.   :132.0          Min.   :230.0    
 1st Qu.:135.8          1st Qu.:241.5    
 Median :185.5          Median :285.5    
 Mean   :164.2          Mean   :266.8    
 3rd Qu.:185.5          3rd Qu.:285.5    
 Max.   :188.1          Max.   :290.0    

This R chunk summarizes the data set using the skim() function.

Table 3: Data summary
Name eggs_tidy
Number of rows 120
Number of columns 6
_______________________
Column type frequency:
character 1
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
month 0 1 3 9 0 12 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2008.50 2.88 2004 2006.00 2008.5 2011.0 2013.00 ▇▇▇▇▇
large_half_dozen 0 1 155.17 22.59 126 129.44 174.5 174.5 178.00 ▆▁▁▁▇
large_dozen 0 1 254.20 18.55 225 233.50 267.5 268.0 277.50 ▅▂▁▁▇
extra_large_half_dozen 0 1 164.22 24.68 132 135.78 185.5 185.5 188.13 ▆▁▁▁▇
extra_large_dozen 0 1 266.80 22.80 230 241.50 285.5 285.5 290.00 ▅▂▁▁▇

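This R chunk uses the tibble function to display the data as a tibble, a more readable kind of data frame. The printout shows only the first ten rows and the columns that fit on screen, along with the type of each column.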
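
Since the eggs_tidy file is not available here, the sketch below illustrates the same idea on mtcars, assuming the tibble package is installed. Which tibble function the original chunk called is not shown; as_tibble() is my assumption, and `cars_tbl` is a name made up for this example.

```r
# tibble is a separate CRAN package (assumed installed).
if (requireNamespace("tibble", quietly = TRUE)) {
  # as_tibble() converts a data frame to a tibble; by default it drops row names.
  cars_tbl <- tibble::as_tibble(mtcars)

  # Printing a tibble shows the first ten rows plus each column's type.
  print(cars_tbl)
}
```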

# A tibble: 120 x 6
   month      year large_half_dozen large_dozen extra_large_half_dozen
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>
 1 January    2004             126         230                    132 
 2 February   2004             128.        226.                   134.
 3 March      2004             131         225                    137 
 4 April      2004             131         225                    137 
 5 May        2004             131         225                    137 
 6 June       2004             134.        231.                   137 
 7 July       2004             134.        234.                   137 
 8 August     2004             134.        234.                   137 
 9 September  2004             130.        234.                   136.
10 October    2004             128.        234.                   136.
# ... with 110 more rows, and 1 more variable:
#   extra_large_dozen <dbl>

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Hester (2021, Sept. 29). DACSS 601 Fall 2021: Homework_2. Retrieved from https://mrolfe.github.io/DACSS601Fall21/posts/2021-09-29-reading-in-data-hw2/

BibTeX citation

@misc{hester2021homework_2,
  author = {Hester, Cynthia},
  title = {DACSS 601 Fall 2021: Homework_2 },
  url = {https://mrolfe.github.io/DACSS601Fall21/posts/2021-09-29-reading-in-data-hw2/},
  year = {2021}
}