Code
library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Prachiti Parkar
March 8, 2023
I have selected faostat_livestock dataset to do my analysis on.
# A tibble: 3 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1961 1961 Head
2 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1962 1962 Head
3 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1963 1963 Head
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
Using R commands I am exploring the data and coming up with the description from the data.
There are total 82117 rows and 14 columns The data ranges from year 1961 to 2018
Domain Code Domain Area Code Area
Length:82116 Length:82116 Min. : 1.0 Length:82116
Class :character Class :character 1st Qu.: 73.0 Class :character
Mode :character Mode :character Median : 146.0 Mode :character
Mean : 912.7
3rd Qu.: 221.0
Max. :5504.0
Element Code Element Item Code Item
Min. :5111 Length:82116 Min. : 866 Length:82116
1st Qu.:5111 Class :character 1st Qu.: 976 Class :character
Median :5111 Mode :character Median :1034 Mode :character
Mean :5111 Mean :1018
3rd Qu.:5111 3rd Qu.:1096
Max. :5111 Max. :1126
Year Code Year Unit Value
Min. :1961 Min. :1961 Length:82116 Min. :0.000e+00
1st Qu.:1976 1st Qu.:1976 Class :character 1st Qu.:1.250e+04
Median :1991 Median :1991 Mode :character Median :2.247e+05
Mean :1990 Mean :1990 Mean :1.163e+07
3rd Qu.:2005 3rd Qu.:2005 3rd Qu.:2.377e+06
Max. :2018 Max. :2018 Max. :1.490e+09
NA's :1301
Flag Flag Description
Length:82116 Length:82116
Class :character Class :character
Mode :character Mode :character
Conducting exploratory data analysis, using dplyr commands such as group_by()
, select()
, filter()
, and summarise()
.
Grouping by the Area and finding the mean_value
# A tibble: 253 × 2
Area mean_value
<chr> <dbl>
1 Afghanistan 3597216.
2 Africa 78159910.
3 Albania NA
4 Algeria 2575444.
5 American Samoa 5399.
6 Americas 95795716.
7 Angola 1123117.
8 Antigua and Barbuda 7877.
9 Argentina 12879607.
10 Armenia 167587.
# … with 243 more rows
Table output by selecting Columns “Area”, “Year”, and “Value”.
# A tibble: 82,116 × 3
Area Year Value
<chr> <dbl> <dbl>
1 Afghanistan 1961 1300000
2 Afghanistan 1962 851850
3 Afghanistan 1963 1001112
4 Afghanistan 1964 1150000
5 Afghanistan 1965 1300000
6 Afghanistan 1966 1200000
7 Afghanistan 1967 1200000
8 Afghanistan 1968 1328000
9 Afghanistan 1969 1250000
10 Afghanistan 1970 1300000
# … with 82,106 more rows
I wanted to check out how any data rows / points are available for Afghanistan and Ireland in the year of 1962 for the item Asses. There are 2 datapoints available.
# A tibble: 2 × 14
Domai…¹ Domain Area …² Area Eleme…³ Element Item …⁴ Item Year …⁵ Year Unit
<chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 QA Live … 2 Afgh… 5111 Stocks 1107 Asses 1962 1962 Head
2 QA Live … 104 Irel… 5111 Stocks 1107 Asses 1962 1962 Head
# … with 3 more variables: Value <dbl>, Flag <chr>, `Flag Description` <chr>,
# and abbreviated variable names ¹`Domain Code`, ²`Area Code`,
# ³`Element Code`, ⁴`Item Code`, ⁵`Year Code`
Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for Value within the data set.
#deleting the rows with Null values
data2 <- na.omit(data)
# Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
data2%>%
group_by(Year)%>%
summarize(mean.Value = mean(Value, na.rm = TRUE), median.Value = median(Value, na.rm = TRUE), min.Value = min(Value, na.rm = TRUE), max.Value = max(Value, na.rm = TRUE), sd.Value = sd(Value, na.rm = TRUE), var.Value = var(Value, na.rm = TRUE), IQR.Value = IQR(Value, na.rm = TRUE))
# A tibble: 58 × 8
Year mean.Value median.Value min.Value max.Value sd.Value var.Va…¹ IQR.V…²
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1961 14080955. 129000 10 994268736 66803024. 4.46e15 2.14e6
2 1962 14085663. 130000 10 997193122 66008462. 4.36e15 2.01e6
3 1963 14568714. 154000 10 999696719 67268611. 4.53e15 2.49e6
4 1964 15101322. 160000 10 1013486149 68974136. 4.76e15 2.61e6
5 1965 15558689. 140000 10 1030878735 70820667. 5.02e15 2.75e6
6 1966 15481029. 155000 10 1040688491 72204592. 5.21e15 2.42e6
7 1967 15804831. 148000 10 1059154631 72456707. 5.25e15 2.54e6
8 1968 15526705. 105850 10 1074022618 72343610. 5.23e15 2.25e6
9 1969 15203756. 116412. 12 1080720926 71629580. 5.13e15 2.15e6
10 1970 15475993. 103000 12 1081641464 72217288. 5.22e15 2.24e6
# … with 48 more rows, and abbreviated variable names ¹var.Value, ²IQR.Value
Some Conclusion from seeing the data summary
Some insights: There is no need of both Year Code and Year, either can be used to save space
---
title: "Challenge 2"
author: "Prachiti Parkar"
description: "Data wrangling and Data Sumarization to find statistics in data"
date: "03/08/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- challenge_2
- faostat
- faostat_livestock
- Prachiti Parkar
---
```{r}
#| label: setup
#| warning: false
#| message: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
## Read in the Data
I have selected faostat_livestock dataset to do my analysis on.
```{r}
# Importing the faostat_livestock dataset and reading them
library(readr)
data <-read_csv("_data/FAOSTAT_livestock.csv")
#Lets see the first 3 rows of the dataset
head(data, 3)
```
## Describe the data
Using R commands I am exploring the data and coming up with the description from the data.
There are total 82117 rows and 14 columns
The data ranges from year 1961 to 2018
### Checking the variation in data for each column
```{r}
#| label: summary
# Check the variation in data for each column
summary(data)
```
## Provide Grouped Summary Statistics
Conducting exploratory data analysis, using dplyr commands such as `group_by()`, `select()`, `filter()`, and `summarise()`.
```{r}
#install.packages("dplyr")
library(dplyr)
```
Grouping by the Area and finding the mean_value
```{r}
data %>%
group_by(Area) %>%
summarise(mean_value = mean(Value))
```
Table output by selecting Columns "Area", "Year", and "Value".
```{r}
data %>%
select(Area, Year, Value)
```
I wanted to check out how any data rows / points are available for Afghanistan and Ireland in the year of 1962 for the item Asses. There are 2 datapoints available.
```{r}
data %>%
filter(Area == "Afghanistan" | Area == "Ireland",Item == "Asses", Year == 1962)
```
## Table Value Statistics
Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for Value within the data set.
```{r}
#deleting the rows with Null values
data2 <- na.omit(data)
# Find the central tendency (mean, median, mode) and dispersion (standard deviation, mix/max/quantile) for different subgroups within the data set.
data2%>%
group_by(Year)%>%
summarize(mean.Value = mean(Value, na.rm = TRUE), median.Value = median(Value, na.rm = TRUE), min.Value = min(Value, na.rm = TRUE), max.Value = max(Value, na.rm = TRUE), sd.Value = sd(Value, na.rm = TRUE), var.Value = var(Value, na.rm = TRUE), IQR.Value = IQR(Value, na.rm = TRUE))
```
Some Conclusion from seeing the data summary
1. The Data is Collected from the year 1961 - 2018
2. The Element column just has one value Stocks and the Element Code also has only 1 value - 5111
3. The Area Code varies from 1 - 5504
4. The Value Column has a lot of variance
Some insights:
There is no need of both Year Code and Year, either can be used to save space