DACSS 601: Data Science Fundamentals - FALL 2022

HW2

Introduction to Visualization
Author: Lai Wei

Published: November 15, 2022

  • Goal of this HW: gain experience working with external data, dplyr, and the pipe operator.

Background for mmr_2015.csv: The maternal mortality ratio (MMR) is defined as the number of maternal deaths per 100,000 live births. The UN maternal mortality estimation group produces estimates of the MMR for all countries in the world.

In this HW, I use mmr_2015.csv, a data set containing a subset of the (real) data used to generate the United Nations maternal mortality estimates published in 2015. The variables in mmr_2015.csv are as follows:

  • Iso = ISO code
  • Name = country name
  • Year = observation year
  • MMR = observed maternal mortality ratio, defined as (number of maternal deaths / number of live births) × 100,000
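
As a quick check of the MMR definition, the ratio can be computed directly from counts. This is only a sketch: the function name `mmr_ratio` and the counts are made up for illustration and are not taken from mmr_2015.csv.

```r
# Hypothetical helper: maternal deaths per 100,000 live births.
mmr_ratio <- function(maternal_deaths, live_births) {
  stopifnot(all(live_births > 0))
  maternal_deaths / live_births * 1e5
}

# Illustrative counts only (not real data): 150 deaths among 60,000 live births.
example_mmr <- mmr_ratio(150, 60000)
example_mmr  # 250
```
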
library(tidyverse)  # attaches ggplot2, dplyr, readr, and the other core packages
library(babynames)  # must be installed first, e.g. install.packages("babynames")
  1. Using mmr_2015.csv: Read in mmr_2015.csv. Then construct a graph that shows the observed values of the MMR plotted against year (starting in 2000) for India and Thailand, as in the example Figure 1 below. Use the pipe operator so that the graph follows from a multi-line command that starts with “mmr %>%”. Hint 1: Use data transformation functions to filter rows with i. year >= 2000 and ii. countries India and Thailand only. Hint 2: Use ggplot() to display the data.
# Adjust this path to wherever mmr_2015.csv is stored on your machine.
mmr <- read.csv("D:/Umass Amherst/BIOSTATS 597D/HW/mmr_2015.csv")
# Start the pipeline with "mmr %>%", keep India and Thailand from 2000 onward, then plot.
mmr %>%
  filter(country %in% c("India", "Thailand"), year >= 2000) %>%
  ggplot(aes(x = year, y = mmr, color = country)) +
  geom_point()
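
The variable list earlier in this post names the columns Iso, Name, Year and MMR, while the code refers to `country`, `year` and `mmr`. If the file on disk follows the variable list, the columns would need renaming before the filter can work. The sketch below assumes that mapping (Name becomes `country`, everything else is lower-cased) and uses a made-up two-row tibble in place of the real file; the values are invented.

```r
library(dplyr)

# Toy stand-in for the raw file, using the column names from the variable list.
# Values are invented for illustration.
raw <- tibble(
  Iso  = c("IND", "THA"),
  Name = c("India", "Thailand"),
  Year = c(2001, 2002),
  MMR  = c(300, 40)
)

# Assumed mapping: lower-case every name, then call the country column `country`.
mmr_clean <- raw %>%
  rename_with(tolower) %>%
  rename(country = name)

names(mmr_clean)  # "iso" "country" "year" "mmr"
```

Checking `names()` right after reading the file is a cheap way to confirm which of the two naming schemes the data actually uses.
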
  2. Using babynames, as used in the lecture slides:

Reproduce the example Figure 2 below where babynames was filtered to include only those rows with year > 1975, sex equal to male, and either prop > 0.025 or n > 50000. Note that the y-axis starts at zero.

babynames %>%
  filter(year > 1975, sex == "M", prop > 0.025 | n > 50000) %>%
  ggplot(aes(x = year, y = prop, color = name)) +
  geom_point(size = 2) +
  geom_line() +
  expand_limits(y = 0)  # force the y-axis to include zero
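
In `filter()`, comma-separated conditions are combined with AND, while `|` means OR within a single condition. The toy tibble below (invented rows that only mimic the babynames columns) shows which rows survive the same three conditions.

```r
library(dplyr)

# Invented rows mimicking the babynames columns; not real data.
toy <- tibble(
  year = c(1970, 1980, 1980, 1990, 1990),
  sex  = c("M", "M", "F", "M", "M"),
  name = c("A", "B", "C", "D", "E"),
  n    = c(60000, 60000, 60000, 1000, 1000),
  prop = c(0.030, 0.010, 0.030, 0.030, 0.001)
)

# Row must be after 1975 AND male AND (prop > 0.025 OR n > 50000).
kept <- toy %>%
  filter(year > 1975, sex == "M", prop > 0.025 | n > 50000)

kept$name  # "B" "D"
```

Row A fails the year condition, C fails the sex condition, and E meets neither the `prop` nor the `n` threshold.
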
  3. Construct and print a tibble that shows the countries sorted by their average observed MMR (rounded to zero digits), with the country with the highest average MMR listed first, as in example Figure 3 below:
data1 <- mmr %>%
  group_by(country) %>%
  summarise(ave = round(mean(mmr), 0)) %>%  # average MMR per country, rounded to zero digits
  arrange(desc(ave))                        # highest average first
data1
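
One detail worth knowing when rounding averages to zero digits: R's `round()` follows the IEC 60559 convention and rounds exact halves to the nearest even number, so an average of exactly 2.5 becomes 2, not 3. The `round_half_up` helper below is a hypothetical alternative for non-negative values, shown only for comparison.

```r
halves <- c(0.5, 1.5, 2.5, 3.5)

round(halves, 0)  # 0 2 2 4

# Hypothetical helper: round exact halves upward (intended for non-negative input).
round_half_up <- function(x) floor(x + 0.5)

round_half_up(halves)  # 1 2 3 4
```
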
  4. Continuing with the mmr data set:

Part a: For each year

  • first calculate the mean observed value for each country (to allow for settings where countries may have more than one value per year; note that this is not the case in this data set)
  • then rank countries by increasing MMR for each year.

Calculate the mean ranking across all years, extract the mean ranking for the 10 countries with the lowest ranking across all years, and print the resulting table.

lowest10 <- mmr %>%
  group_by(year, country) %>%
  summarise(mean_mmr = mean(mmr, na.rm = TRUE), .groups = "drop") %>%  # mean per country and year
  group_by(year) %>%
  mutate(yearly_rank = rank(mean_mmr)) %>%      # rank 1 = lowest MMR in that year
  group_by(country) %>%
  summarise(mean_rank = mean(yearly_rank)) %>%  # mean ranking across all years
  arrange(mean_rank) %>%
  head(10)                                      # 10 countries with the lowest mean ranking
lowest10
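
The steps in Part a (mean per country and year, rank within year, mean rank per country) can be checked by hand on a small example. The sketch below uses invented values for three hypothetical countries; the column names `mean_mmr`, `yearly_rank` and `mean_rank` are chosen here for illustration.

```r
library(dplyr)

# Invented values for three hypothetical countries over two years.
toy <- tibble(
  country = c("A", "B", "C", "A", "B", "C"),
  year    = c(2000, 2000, 2000, 2001, 2001, 2001),
  mmr     = c(10, 50, 30, 40, 20, 60)
)

mean_ranks <- toy %>%
  group_by(year, country) %>%
  summarise(mean_mmr = mean(mmr), .groups = "drop") %>%  # one value per country-year
  group_by(year) %>%
  mutate(yearly_rank = rank(mean_mmr)) %>%               # 1 = lowest MMR that year
  group_by(country) %>%
  summarise(mean_rank = mean(yearly_rank)) %>%
  arrange(mean_rank)

mean_ranks
```

In 2000 the ranks are A = 1, C = 2, B = 3; in 2001 they are B = 1, A = 2, C = 3. The mean ranks are therefore A = 1.5, B = 2 and C = 2.5.
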

Part b: Do the same thing, but now with rankings calculated separately for two periods, with period 1 referring to years < 2000 and period 2 referring to years >= 2000.

For each period

  • first calculate the mean observed value for each country (to allow for settings where countries may have more than 1 value per period)
  • then rank countries by increasing MMR for each period.

Calculate the mean ranking across all periods, extract the 10 countries with the lowest ranking across all periods, and print the table.

lowest10_periods <- mmr %>%
  mutate(period = if_else(year < 2000, 1L, 2L)) %>%  # period 1: before 2000, period 2: 2000 onward
  group_by(period, country) %>%
  summarise(mean_mmr = mean(mmr, na.rm = TRUE), .groups = "drop") %>%  # mean per country and period
  group_by(period) %>%
  mutate(period_rank = rank(mean_mmr)) %>%      # rank 1 = lowest MMR in that period
  group_by(country) %>%
  summarise(mean_rank = mean(period_rank)) %>%  # mean ranking across periods
  arrange(mean_rank) %>%
  head(10)                                      # 10 countries with the lowest mean ranking
lowest10_periods
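
For Part b, one point to watch is that a country observed in only one period has its mean ranking based on that single period. The sketch below illustrates this with invented values. The `n_periods` column is an addition made here to make the issue visible; the assignment does not ask for it.

```r
library(dplyr)

# Invented values; hypothetical country "C" has no observation before 2000.
toy <- tibble(
  country = c("A", "A", "B", "B", "C"),
  year    = c(1995, 2005, 1996, 2006, 2007),
  mmr     = c(100, 95, 50, 90, 10)
)

period_ranks <- toy %>%
  mutate(period = if_else(year < 2000, 1L, 2L)) %>%
  group_by(period, country) %>%
  summarise(mean_mmr = mean(mmr), .groups = "drop") %>%
  group_by(period) %>%
  mutate(period_rank = rank(mean_mmr)) %>%  # 1 = lowest MMR in that period
  group_by(country) %>%
  summarise(
    mean_rank = mean(period_rank),
    n_periods = n()                         # how many periods the mean is based on
  ) %>%
  arrange(mean_rank)

period_ranks
```

Here C ranks first with a mean rank of 1 from a single period, followed by B (1.5) and A (2.5), each based on two periods.
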