Final Project Assignment#1: Project & Data Description

Project & Data Description
Author

Sean Conway, Meredith Rolfe & Erico Yu

Published

March 31, 2023

Important Formatting & Submission Notes:

  1. Use this file as your template: start your own writing from the section “Part 1. Introduction”.

  2. Please make the following changes to the above YAML header:

    • Change the “title” to “Final Project Assignment#1: First Name Last Name”;

    • Change the “author” to your name;

    • Change the “date” to the current date in the “MM-DD-YYYY” format;

  3. Submission:

    • Delete the unnecessary sections (“Overview”, “Tasks”, “Special Note”, and “Evaluation”).
    • In the posts folder of your local 601_Spring_2023 project, create a folder named “FirstNameLastName_FinalProjectData”, and save your final project dataset(s) in this folder. DO NOT save the dataset(s) to the _data folder which stores the dataset(s) for challenges.
    • Render and submit the file to the blog post like a regular challenge.

Overview of the Final Project

The goal is to tell a coherent and focused story with your data, which answers a question (or questions) that a researcher, or current or future employer, might want to have answered. The goal might be to understand a source of covariance, make a recommendation, or understand change over time. We don’t expect you to reach a definitive conclusion in this analysis. Still, you are expected to tell a data-driven story using evidence to support the claims you are making on the basis of the exploratory analysis conducted over the past term.

In this final project, statistical analyses are not required, but any students who wish to include these may do so. However, your primary analysis should center around visualization rather than inferential statistics. Many scientists only compute statistics after a careful process of exploratory data analysis and data visualization. Statistics are a way to gauge your certainty in your results - NOT A WAY TO DISCOVER MEANINGFUL DATA PATTERNS. Do not run a multiple regression with numerous predictors and report which predictors are significant!!

Tasks of Assignment#1

This assignment is the first component of your final project. Together with the later assignments, it makes up a short paper/report. In this assignment, you should introduce your dataset(s) and explain how you plan to present them. This assignment should include the following components:

  1. A clear description of the dataset(s) that you are using.

  2. What “story” do you want to present to the audience? In other words, what “question(s)” would you like to answer with this dataset(s)?

  3. The Plan for Further Analysis and Visualization.

We will have a special class meeting on April 12 to review and discuss students’ proposed datasets for the final project. If you want your project to be discussed in class, please submit this assignment before April 12.

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Part 1. Introduction

In this part, you should introduce the dataset(s) and your research questions.

  1. Dataset(s) Introduction:

    • identify the source of the dataset(s): who or which organization collected the dataset(s); some datasets also tell you how and when they were collected;

    • a description of the “cases” represented by the dataset(s); in other words, what does each row represent?

    • Erico’s hint: the website of the dataset(s) usually has a brief introduction of the above information; you can also look for the “user manual” document that comes with the dataset(s).

    • For reference, you can check out the “Introduction” section of this final project as an example of a dataset introduction.

  2. What questions would you like to answer with this dataset(s)?

Part 2. Describe the data set(s)

This part contains both a coding and a storytelling component.

In the coding component, you should:

  1. read the dataset;

    • (optional) If you have multiple dataset(s) you want to work with, you should combine these datasets at this step.

    • (optional) If your dataset is too big (for example, it contains too many variables/columns that may not be useful for your analysis), you may want to subset the data just to include the necessary variables/columns.

  2. present the descriptive information of the dataset(s) using the functions in Challenges 1, 2, and 3;

    • for example: dim(), length(unique()), head();
  3. compute summary statistics of the dataset(s); in particular, show the basic statistics (min, max, mean, median, etc.) for the variables you are interested in.
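
The coding steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `athletes` data frame and its column names are hypothetical stand-ins for your own dataset, which you would normally read from a file (for example with `read_csv()`) rather than type in by hand.

```r
# Hypothetical toy data standing in for a real file; the data frame name
# and all column names are assumptions made for illustration only.
athletes <- data.frame(
  athlete_id = c(1, 1, 2, 3, 3, 4),
  sport      = c("Swimming", "Swimming", "Athletics",
                 "Rowing", "Rowing", "Athletics"),
  year       = c(2012, 2016, 2016, 2012, 2016, 2016),
  age        = c(22, 26, 30, 24, 28, NA),
  placement  = c(3, 1, 5, 2, 2, 8)
)

# Descriptive information: size, number of distinct cases, first rows
dim(athletes)                          # number of rows and columns
n_athletes <- length(unique(athletes$athlete_id))
n_athletes                             # number of distinct athletes
head(athletes, 3)

# Summary statistics for the variables of interest
summary(athletes[, c("age", "placement")])

# The same statistics computed explicitly; na.rm = TRUE skips missing ages
age_stats <- c(
  min    = min(athletes$age, na.rm = TRUE),
  max    = max(athletes$age, na.rm = TRUE),
  mean   = mean(athletes$age, na.rm = TRUE),
  median = median(athletes$age, na.rm = TRUE)
)
age_stats
```

Note that `dim()` counts rows, while `length(unique())` counts distinct cases. When the two numbers differ, as they do here, it is a sign that a row is not the same thing as an athlete, which is worth explaining in your written description.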

In the storytelling component, you should describe the basic information of the dataset(s) and the variables in a way that corresponds to your descriptive and summary statistics in the above coding component. DO NOT simply report the number of rows. Instead, describe the dataset(s) fully by specifying what each row and column mean. In other words, your description should be comprehensive and detailed enough for readers to picture or envision the dataset(s) in their brains.

  • For example, suppose I use a dataset of all the athletes who participated in the Olympic Games. Here is how I describe the basic information of the data: “the case of this dataset is an individual athlete, represented by each row in the dataset. The dataset includes individual (e.g., gender, age, height, weight, race) and event performance (e.g., final placement) information for all athletes (22,398) competing in all events (e.g., Male 400m Free, Female …) in all Olympic Games since 1922 (24 Winter and 28 Summer Games). Athletes appearing in the dataset competed in anywhere from 1-11 distinct events (of 198 possible) during 1-5 distinct Olympic competitions, for a total of XXX, XXX athlete-event-Olympic-year observations. XXX countries are represented, etc.”

  • Erico’s hint: as I mentioned above, sometimes a dataset is too large, and it is difficult to present and explain all the variables/columns (especially if you run summary statistics for the whole dataset). In this case, you will have to make a decision to select the most important variables/columns to discuss. For example, the Olympic dataset I mentioned above as an example contains more than 50 columns. For clarity of data presentation, I may just focus on 6 items/columns of individual athletes (gender, age, weight, height, race, nationality) and the column of final placement that are most relevant to answer my specific research questions. By doing so, you can just present the tables of the summary statistics of these 7 variables/columns without showing too much information and confusing the readers.

  • For a good example, check out the Data Description section of the student’s final project mentioned above. As you can see, the student describes the dataset after running a few descriptive statistics. You can also see the weekly challenge solutions by Professor Rolfe for other examples of clear, concise data descriptions.
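
As a rough sketch of the hint about large datasets, the snippet below keeps only the columns needed for the research question and then computes the kind of counts used in the Olympic description above. The `olympics` data frame, its columns, and its values are all hypothetical and exist only to make the example runnable.

```r
# Hypothetical toy data; every name and value here is an assumption.
olympics <- data.frame(
  athlete  = c("A", "A", "A", "B", "C", "C"),
  event    = c("100m", "200m", "100m", "Marathon", "100m", "Long Jump"),
  year     = c(2012, 2012, 2016, 2016, 2016, 2016),
  country  = c("X", "X", "X", "Y", "Z", "Z"),
  age      = c(21, 21, 25, 30, 27, 27),
  shoe_fit = c("ok", "ok", "ok", "ok", "ok", "ok")  # not needed for analysis
)

# Keep only the columns relevant to the research question
keep <- c("athlete", "event", "year", "country", "age")
olympics_small <- olympics[, keep]

# Counts that feed directly into the written description
n_obs       <- nrow(olympics_small)                    # athlete-event-year rows
n_athletes  <- length(unique(olympics_small$athlete))  # distinct athletes
n_countries <- length(unique(olympics_small$country))  # distinct countries

# Number of distinct events each athlete competed in, and its range
events_per_athlete <- tapply(
  olympics_small$event,
  olympics_small$athlete,
  function(x) length(unique(x))
)
range(events_per_athlete)
```

Each of these numbers maps onto one phrase of the prose description (total observations, number of athletes, number of countries, events per athlete), so the text and the code output stay consistent.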

Part 3. The Tentative Plan for Visualization

  1. Briefly describe what data analyses (please see the special note on statistics in the next section) and visualizations you plan to conduct to answer the research questions you proposed above.

  2. Explain why you chose these specific data analyses and visualizations. In other words, how do such types of statistics or graphs (see the R Gallery) help you answer specific questions? For example, how can a bivariate visualization reveal the relationship between two variables, or how does a line graph of variables over time present the pattern of development?

  3. If you plan to conduct specific data analyses and visualizations, describe how you need to process and prepare the data so that it is tidy.

    • What do you need to do to mutate the datasets (convert date data, create a new variable, pivot the data format, etc.)?

    • How are you going to deal with the missing data/NAs and outliers? And why do you choose this way to deal with NAs?

  4. (Optional) It is encouraged, but optional, to include a coding component of tidy data in this part.
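
A minimal sketch of the data preparation steps listed above (converting dates, pivoting, creating a new variable, and handling NAs) might look like the following. The `medals_wide` data frame and its columns are hypothetical, and dropping rows with missing values is only one possible choice; whichever approach you take, you should justify it in your write-up.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide-format data: one row per country, one column per year.
# All names and values are assumptions made for illustration only.
medals_wide <- data.frame(
  country     = c("X", "Y", "Z"),
  first_games = c("07-27-2012", "08-05-2016", "07-27-2012"),
  medals_2012 = c(10, NA, 4),
  medals_2016 = c(12, 7, NA)
)

medals_tidy <- medals_wide %>%
  # Convert the MM-DD-YYYY text into a proper Date
  mutate(first_games = as.Date(first_games, format = "%m-%d-%Y")) %>%
  # Pivot so that each row is one country-year observation
  pivot_longer(
    cols         = starts_with("medals_"),
    names_to     = "year",
    names_prefix = "medals_",
    values_to    = "medals"
  ) %>%
  # Create new variables: a numeric year and a flag for missing counts
  mutate(
    year           = as.integer(year),
    medals_missing = is.na(medals)
  )

# Inspect how much is missing before deciding what to do about it
n_missing <- sum(medals_tidy$medals_missing)

# One option among several: keep only the rows with an observed count
medals_complete <- medals_tidy %>%
  filter(!medals_missing)
```

Counting the NAs before removing anything makes it easier to explain, in the storytelling component, how many observations were affected and why the chosen treatment is reasonable.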

Special Note on the role of statistics

Statistical analyses are not required, but any students who wish to include these may do so. However, your primary analysis should center around visualization rather than inferential statistics. Many scientists only compute statistics after a careful process of exploratory data analysis and data visualization. Statistics are a way to gauge your certainty in your results - NOT A WAY TO DISCOVER MEANINGFUL DATA PATTERNS. Do not run a multiple regression with numerous predictors and report which predictors are significant!!

Remember: The goal is to tell a story about the data. For example, you might identify sports where winning athletes are younger or older than average and then try to see if you can find some sort of pattern that accounts for this difference. Or perhaps you might compare a country’s performance to GDP and see if this changes over time. The goal is not to get a significant statistical result but to identify an interesting pattern in the data and then extract some sort of meaningful recommendation or information from it.
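
To make the athlete-age example concrete, here is a small sketch that compares the average age of winners in each sport with the overall average and shows the comparison in a plot. The `winners` data frame is invented for illustration; the sports and ages in it are assumptions, not real results.

```r
library(ggplot2)

# Hypothetical data on winning athletes; all values are made up.
winners <- data.frame(
  sport = rep(c("Gymnastics", "Shooting", "Swimming"), each = 3),
  age   = c(18, 20, 19, 35, 40, 33, 22, 24, 23)
)

# Overall average age of winners, and the average within each sport
overall_mean <- mean(winners$age)
age_by_sport <- aggregate(age ~ sport, data = winners, FUN = mean)

# Positive values mean winners in that sport are older than average
age_by_sport$diff_from_overall <- age_by_sport$age - overall_mean
age_by_sport

# Visualize the age distribution per sport against the overall mean
p <- ggplot(winners, aes(x = sport, y = age)) +
  geom_boxplot() +
  geom_hline(yintercept = overall_mean, linetype = "dashed") +
  labs(
    title = "Age of winning athletes by sport (toy data)",
    x     = "Sport",
    y     = "Age (years)"
  )
print(p)
```

The plot, rather than a significance test, carries the story here: the dashed line marks the overall average, so sports whose boxes sit clearly above or below it are the ones worth investigating further.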

Evaluation

You will be evaluated on both the quality of your source code and your written report, with a greater emphasis on the clarity and details of the description of your dataset(s) and your research questions.