DACSS 601: Data Science Fundamentals - FALL 2022

Challenge 1 Instructions

challenge_1
railroads
faostat
wildbirds
Author

Tejaswini_Ketineni

Published

August 21, 2022

Code
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

I work with one data set: railroad_2012_clean_county.csv

Reading the Data

First, the data set (railroad_2012_clean_county) is read in.

Code
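# read_csv() comes from readr, which library(tidyverse) already loads;
# readxl is only needed for Excel files
railroad_2012_clean_county <- read_csv("_data/railroad_2012_clean_county.csv")
# View() opens an interactive viewer, so it is skipped in the rendered post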
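
If the column types matter, they can be stated explicitly at read time rather than guessed. The sketch below uses base R's `read.csv()` with `colClasses` instead of `read_csv()`, and reads a small temporary file written on the spot. The file contents and the names `csv_path` and `toy` are assumptions for illustration only, not the railroad data.

```r
# Write a tiny hypothetical CSV to a temporary file (not the railroad file)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("state,county,total_employees",
    "AK,ANCHORAGE,7",
    "AK,JUNEAU,3"),
  csv_path
)

# Declare the type of each column by name instead of letting R guess
toy <- read.csv(
  csv_path,
  colClasses = c(
    state = "character",
    county = "character",
    total_employees = "numeric"
  )
)

# Inspect the resulting structure
str(toy)
```

Declaring the types up front makes a malformed file fail at read time instead of silently producing a column of the wrong type.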

The head() function shows the first six rows and gives a first look at the data.

Code
head(railroad_2012_clean_county)
# A tibble: 6 × 3
  state county               total_employees
  <chr> <chr>                          <dbl>
1 AE    APO                                2
2 AK    ANCHORAGE                          7
3 AK    FAIRBANKS NORTH STAR               2
4 AK    JUNEAU                             3
5 AK    MATANUSKA-SUSITNA                  2
6 AK    SITKA                              1
Code
rows_and_columns_ds1 <- dim(railroad_2012_clean_county)
rows_and_columns_ds1
[1] 2930    3

The data set has 2930 rows and 3 columns.

Code
names_col <- colnames(railroad_2012_clean_county)
names_col
[1] "state"           "county"          "total_employees"

The three columns are state, county, and total_employees.

Code
sum(is.na(railroad_2012_clean_county))
[1] 0
Code
sum(is.null(railroad_2012_clean_county))
[1] 0

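There are no missing (NA) values in the data set. Note that is.null() only tests whether the whole object is NULL, so the is.na() count is the meaningful check.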
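
A minimal sketch of a per-column missing-value check, using a made-up data frame `toy` (hypothetical, not the railroad data). It also shows that `is.null()` returns a single `FALSE` for any data frame, whatever it contains, which is why its sum is always 0.

```r
# Hypothetical toy data with two missing cells
toy <- data.frame(
  state = c("AK", "AK", "TX"),
  county = c("ANCHORAGE", NA, "HARRIS"),
  total_employees = c(7, 2, NA)
)

# is.null() looks at the object as a whole, so this sums to 0
# even though toy contains missing values
null_check <- sum(is.null(toy))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Total number of missing cells
total_na <- sum(is.na(toy))
total_na
```

The per-column view is useful on real data because it shows where the gaps are, not just whether any exist.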

Code
summary(railroad_2012_clean_county)
    state              county          total_employees  
 Length:2930        Length:2930        Min.   :   1.00  
 Class :character   Class :character   1st Qu.:   7.00  
 Mode  :character   Mode  :character   Median :  21.00  
                                       Mean   :  87.18  
                                       3rd Qu.:  65.00  
                                       Max.   :8207.00  
Code
library(data.table)
data_railroad <- data.table(railroad_2012_clean_county)
data_railroad[, .(distinct_states = length(unique(state)))]
   distinct_states
1:              53
Code
data_railroad[, .(distinct_county = length(unique(county)))]
   distinct_county
1:            1709

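There are 53 distinct state codes (the 50 states, DC, and the military codes AE and AP) and 1709 distinct county names. Because the same county name can occur in more than one state, 1709 counts names rather than counties.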
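
To illustrate the difference between distinct county names and distinct state-county pairs, here is a small sketch in base R. The data frame `toy` and its values are invented for the example and do not come from the railroad file.

```r
# Hypothetical toy data: the county name JACKSON occurs in three states
toy <- data.frame(
  state = c("AL", "GA", "GA", "TX"),
  county = c("JACKSON", "JACKSON", "FULTON", "JACKSON"),
  total_employees = c(5, 12, 40, 3)
)

# Distinct state codes
n_states <- length(unique(toy$state))

# Distinct county names, ignoring which state they belong to
n_county_names <- length(unique(toy$county))

# Distinct (state, county) combinations, i.e. actual counties
n_pairs <- nrow(unique(toy[, c("state", "county")]))

c(states = n_states, county_names = n_county_names, pairs = n_pairs)
```

Counting pairs rather than names avoids undercounting when a name is shared across states.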

Code
(table(railroad_2012_clean_county$state))

 AE  AK  AL  AP  AR  AZ  CA  CO  CT  DC  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY 
  1   6  67   1  72  15  55  57   8   1   3  67 152   3  99  36 103  92  95 119 
 LA  MA  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH  OK  OR 
 63  12  24  16  78  86 115  78  53  94  49  89  10  21  29  12  61  88  73  33 
 PA  RI  SC  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
 65   5  46  52  91 221  25  92  14  39  69  53  22 
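
Note that `table()` on the state column counts rows (county records) per state, not employees. A sketch of how the two quantities differ, using an invented data frame `toy` (hypothetical values, not the railroad data) and base R's `tapply()` to total employees by state:

```r
# Hypothetical toy data: GA has more county rows, TX has more employees
toy <- data.frame(
  state = c("GA", "GA", "GA", "TX", "TX"),
  county = c("A", "B", "C", "D", "E"),
  total_employees = c(1, 2, 3, 100, 200)
)

# Number of rows (county records) per state
rows_per_state <- table(toy$state)
rows_per_state

# Total employees per state: sum total_employees within each state
employees_per_state <- tapply(toy$total_employees, toy$state, sum)

# States ordered from most to fewest employees
sort(employees_per_state, decreasing = TRUE)
```

In this toy example GA leads on row count while TX leads on employees, so the two rankings need not agree.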

Description of the data

The data set has 2930 rows and 3 columns: state, county, and total_employees. A check for missing values finds none, so the data set is clean in that respect. The summary statistics show that total_employees ranges from 1 to 8207, with a median of 21 and a mean of 87.18, so the distribution is strongly right-skewed. There are 53 distinct state codes and 1709 distinct county names. The table() call on the state column counts rows (county records) per state, not employees: Texas (TX) with 221 and Georgia (GA) with 152 have the most county records, while Armed Forces Europe (AE), Armed Forces Pacific (AP), and the District of Columbia (DC) have one each. Only eight state codes (AE, AK, AP, CT, DC, DE, HI, RI) have fewer than 10 county records.
