library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Final Project Assignment#1: Fnu Avinesh Krishnan
Important Formatting & Submission Notes:
Use this file as the template to work on: start your own writing from Section “Part.1”
Please make the following changes to the above YAML header:
Change the “title” to “Final Project Assignment#1: First Name Last Name”;
Change the “author” to your name;
Change the “date” to the current date in the “MM-DD-YYYY” format;
Submission:
- Delete the unnecessary sections (“Overview”, “Tasks”, “Special Note”, and “Evaluation”).
- In the posts folder of your local 601_Spring_2023 project, create a folder named “FirstNameLastName_FinalProjectData”, and save your final project dataset(s) in this folder. DO NOT save the dataset(s) to the _data folder which stores the dataset(s) for challenges.
- Render and submit the file to the blog post like a regular challenge.
Part 1. Introduction
In this part, you should introduce the dataset(s) and your research questions.
- Dataset(s) Introduction:
The exponential growth of the global population has resulted in a strain on essential resources such as healthcare, food, and shelter, which in turn has led to an increase in the incidence of genetic disorders. Genetic disorders are health conditions that typically result from DNA mutations or changes in the overall structure or number of chromosomes. Hereditary illnesses are becoming more prevalent due to insufficient awareness regarding the necessity of genetic testing. Tragically, such illnesses frequently result in the premature death of children, underscoring the vital significance of genetic testing during pregnancy.
The dataset contains the following features:
- Patient Id: Represents the unique identification number of a patient
- Patient Age: Represents the age of a patient
- Genes in mother's side: Represents a gene defect in the patient's mother
- Inherited from father: Represents a gene defect in the patient's father
- Maternal gene: Represents a gene defect in the patient's maternal side of the family
- Paternal gene: Represents a gene defect in the patient's paternal side of the family
- Blood cell count (mcL): Represents the blood cell count of a patient
- Patient First Name: Represents a patient's first name
- Family Name: Represents a patient's family name or surname
- Father's name: Represents a patient's father's name
- Mother's age: Represents a patient's mother's age
- Father's age: Represents a patient's father's age
- Institute Name: Represents the medical institute where a patient was born
- Location of Institute: Represents the location of the medical institute
- Status: Represents whether a patient is deceased
- Respiratory Rate (breaths/min): Represents a patient's respiratory breathing rate
- Heart Rate (rates/min): Represents a patient's heart rate
- Test 1 to Test 5: Represents different (masked) tests that were conducted on a patient
- Parental consent: Represents whether a patient's parents approved the treatment plan
- Follow-up: Represents a patient's level of risk (how intense their condition is)
- Gender: Represents a patient's gender
- Birth asphyxia: Represents whether a patient suffered from birth asphyxia
- Autopsy shows birth defect (if applicable): Represents whether a patient's autopsy showed any birth defects
- Place of birth: Represents whether a patient was born in a medical institute or home
- Folic acid details (peri-conceptional): Represents the periconceptional folic acid supplementation details of a patient
- H/O serious maternal illness: Represents an unexpected outcome of labor and delivery that resulted in significant short or long-term consequences to a patient's mother
- H/O radiation exposure (x-ray): Represents whether a patient has any radiation exposure history
- H/O substance abuse: Represents whether a parent has a history of drug addiction
- Assisted conception IVF/ART: Represents the type of treatment used for infertility
- History of anomalies in previous pregnancies: Represents whether the mother had any anomalies in her previous pregnancies
- No. of previous abortion: Represents the number of abortions that a mother had
- Birth defects: Represents whether a patient has birth defects
- White Blood cell count (thousand per microliter): Represents a patient's white blood cell count
- Blood test result: Represents a patient's blood test results
- Symptom 1 to Symptom 5: Represents (masked) different types of symptoms that a patient had
- Genetic Disorder: Represents the genetic disorder that a patient has
- Disorder Subclass: Represents the subclass of the disorder
The genome and genetics dataset was collected by the National Center for Biotechnology Information (NCBI). It includes patients with various genetic disorders, which can be broadly categorized into three groups: mitochondrial genetic inheritance disorders, single-gene inheritance diseases, and multifactorial genetic inheritance disorders. These groups are only a broad classification of the conditions included in the dataset. Each row holds the medical information of one patient, including age, maternal and paternal gene, blood cell count, respiratory and heart rate, test results, previous abortions, presence of symptoms, birth defects, and the genetic disorder. My study aims to analyse the relationships between the different genetic factors, to extract insights from the correlations between medical precursors and test results, and to predict the specific genetic disorder and its subclass for each patient from their medical information.
- What questions do you like to answer with this dataset(s)?
Part 2. Describe the data set(s)
This part contains both a coding and a storytelling component.
In the coding component, you should:
- read the dataset;
library(readr)
data <- read_csv("/Users/avineshkrishnan/Desktop/601_Spring_2023/posts/FnuAvineshKrishnan_FinalProjectData/train_genetic_disorders.csv")
view(data)
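As a small illustration of the reading step, the sketch below writes a tiny CSV to a temporary file and reads it back with `read_csv()`. The three rows and their values are made up for the example and are not taken from the real file; the point is only that column names containing spaces are kept as they are and must be wrapped in backticks, and that `NA` cells are parsed as missing values.

```r
library(readr)

# Hypothetical miniature file; the values are illustrative only
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Patient Id,Patient Age,Status",
    "PID0x1,2,Alive",
    "PID0x2,NA,Deceased"
  ),
  tmp
)

# read_csv() keeps the original column names, spaces included
toy <- read_csv(tmp, show_col_types = FALSE)

dim(toy)             # number of rows and columns
toy$`Patient Age`    # backticks are needed because of the space
```

For the real dataset, a path relative to the project folder would probably make the post easier to render on another machine than the absolute path used here.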
- present the descriptive information of the dataset(s) using the functions in Challenges 1, 2, and 3;
- for examples: dim(), length(unique()), head();
dim(data)
[1] 22083 45
head(data)
colnames(data)
 [1] "Patient Id"
 [2] "Patient Age"
 [3] "Genes in mother's side"
 [4] "Inherited from father"
 [5] "Maternal gene"
 [6] "Paternal gene"
 [7] "Blood cell count (mcL)"
 [8] "Patient First Name"
 [9] "Family Name"
[10] "Father's name"
[11] "Mother's age"
[12] "Father's age"
[13] "Institute Name"
[14] "Location of Institute"
[15] "Status"
[16] "Respiratory Rate (breaths/min)"
[17] "Heart Rate (rates/min"
[18] "Test 1"
[19] "Test 2"
[20] "Test 3"
[21] "Test 4"
[22] "Test 5"
[23] "Parental consent"
[24] "Follow-up"
[25] "Gender"
[26] "Birth asphyxia"
[27] "Autopsy shows birth defect (if applicable)"
[28] "Place of birth"
[29] "Folic acid details (peri-conceptional)"
[30] "H/O serious maternal illness"
[31] "H/O radiation exposure (x-ray)"
[32] "H/O substance abuse"
[33] "Assisted conception IVF/ART"
[34] "History of anomalies in previous pregnancies"
[35] "No. of previous abortion"
[36] "Birth defects"
[37] "White Blood cell count (thousand per microliter)"
[38] "Blood test result"
[39] "Symptom 1"
[40] "Symptom 2"
[41] "Symptom 3"
[42] "Symptom 4"
[43] "Symptom 5"
[44] "Genetic Disorder"
[45] "Disorder Subclass"
length(unique(data$Status))
[1] 3
unique(data$Status)
[1] "Alive" "Deceased" NA
length(unique(data$`Genetic Disorder`))
[1] 4
unique(data$`Genetic Disorder`)
[1] "Mitochondrial genetic inheritance disorders"
[2] NA
[3] "Multifactorial genetic inheritance disorders"
[4] "Single-gene inheritance diseases"
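Note that `length(unique())` counts `NA` as a value of its own, which is why `Status` reports 3 values and `Genetic Disorder` reports 4. A minimal sketch of an alternative, assuming a small made-up tibble in place of the real data: `count()` shows how many rows fall in each category (including `NA`), and `n_distinct(..., na.rm = TRUE)` counts only the real categories.

```r
library(dplyr)

# Hypothetical toy data; the rows are illustrative, not from the real file
toy <- tibble::tibble(
  Status = c("Alive", "Deceased", NA, "Alive", "Alive"),
  `Genetic Disorder` = c(
    "Mitochondrial genetic inheritance disorders",
    NA,
    "Single-gene inheritance diseases",
    "Mitochondrial genetic inheritance disorders",
    "Multifactorial genetic inheritance disorders"
  )
)

# count() tabulates every level, with NA shown as its own row
status_counts <- toy %>% count(Status)
status_counts

# unique() treats NA as a value, n_distinct(na.rm = TRUE) does not
n_with_na <- length(unique(toy$`Genetic Disorder`))
n_categories <- n_distinct(toy$`Genetic Disorder`, na.rm = TRUE)
c(n_with_na, n_categories)
```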
- conduct summary statistics of the dataset(s); especially show the basic statistics (min, max, mean, median, etc.) for the variables you are interested in.
summary(data)
Patient Id Patient Age Genes in mother's side
Length:22083 Min. : 0.000 Length:22083
Class :character 1st Qu.: 3.000 Class :character
Mode :character Median : 7.000 Mode :character
Mean : 6.975
3rd Qu.:11.000
Max. :14.000
NA's :2440
Inherited from father Maternal gene Paternal gene
Length:22083 Length:22083 Length:22083
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Blood cell count (mcL) Patient First Name Family Name
Min. :4.093 Length:22083 Length:22083
1st Qu.:4.763 Class :character Class :character
Median :4.899 Mode :character Mode :character
Mean :4.899
3rd Qu.:5.034
Max. :5.610
NA's :1072
Father's name Mother's age Father's age Institute Name
Length:22083 Min. :18.00 Min. :20.00 Length:22083
Class :character 1st Qu.:26.00 1st Qu.:31.00 Class :character
Mode :character Median :35.00 Median :42.00 Mode :character
Mean :34.52 Mean :41.94
3rd Qu.:43.00 3rd Qu.:53.00
Max. :51.00 Max. :64.00
NA's :6790 NA's :6761
Location of Institute Status Respiratory Rate (breaths/min)
Length:22083 Length:22083 Length:22083
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Heart Rate (rates/min Test 1 Test 2 Test 3
Length:22083 Min. :0 Min. :0 Min. :0
Class :character 1st Qu.:0 1st Qu.:0 1st Qu.:0
Mode :character Median :0 Median :0 Median :0
Mean :0 Mean :0 Mean :0
3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
Max. :0 Max. :0 Max. :0
NA's :3091 NA's :3125 NA's :3113
Test 4 Test 5 Parental consent Follow-up
Min. :1 Min. :0 Length:22083 Length:22083
1st Qu.:1 1st Qu.:0 Class :character Class :character
Median :1 Median :0 Mode :character Mode :character
Mean :1 Mean :0
3rd Qu.:1 3rd Qu.:0
Max. :1 Max. :0
NA's :3121 NA's :3144
Gender Birth asphyxia
Length:22083 Length:22083
Class :character Class :character
Mode :character Mode :character
Autopsy shows birth defect (if applicable) Place of birth
Length:22083 Length:22083
Class :character Class :character
Mode :character Mode :character
Folic acid details (peri-conceptional) H/O serious maternal illness
Length:22083 Length:22083
Class :character Class :character
Mode :character Mode :character
H/O radiation exposure (x-ray) H/O substance abuse Assisted conception IVF/ART
Length:22083 Length:22083 Length:22083
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
History of anomalies in previous pregnancies No. of previous abortion
Length:22083 Min. :0
Class :character 1st Qu.:1
Mode :character Median :2
Mean :2
3rd Qu.:3
Max. :4
NA's :3126
Birth defects White Blood cell count (thousand per microliter)
Length:22083 Min. : 3.000
Class :character 1st Qu.: 5.419
Mode :character Median : 7.473
Mean : 7.485
3rd Qu.: 9.529
Max. :12.000
NA's :3118
Blood test result Symptom 1 Symptom 2 Symptom 3
Length:22083 Min. :0.000 Min. :0.000 Min. :0.0000
Class :character 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000
Mode :character Median :1.000 Median :1.000 Median :1.0000
Mean :0.592 Mean :0.553 Mean :0.5374
3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.0000
Max. :1.000 Max. :1.000 Max. :1.0000
NA's :3128 NA's :3184 NA's :3075
Symptom 4 Symptom 5 Genetic Disorder Disorder Subclass
Min. :0.0000 Min. :0.0000 Length:22083 Length:22083
1st Qu.:0.0000 1st Qu.:0.0000 Class :character Class :character
Median :0.0000 Median :0.0000 Mode :character Mode :character
Mean :0.4974 Mean :0.4608
3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
NA's :3096 NA's :3127
- The dataset contains 22,083 patient records and 45 variables. It includes medical information such as blood cell count, test results, respiratory and heart rate, maternal and paternal genes, and anomalies in previous pregnancies or birth defects, for patients who are either alive or deceased. The patients are children up to the age of 14 who suffer from mitochondrial genetic inheritance disorders, multifactorial genetic inheritance disorders, or single-gene inheritance diseases.
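`summary()` reports every column at once. For the variables of interest, a grouped summary may be easier to read. The sketch below is one possible way to do this, using a made-up tibble with two of the disorder categories; the ages are illustrative and the output column names (`mean_age`, `median_age`, `n_missing_age`) are my own choices.

```r
library(dplyr)

# Hypothetical toy data; ages are illustrative only
toy <- tibble::tibble(
  `Patient Age` = c(2, 7, NA, 11, 14),
  `Genetic Disorder` = c(
    "Mitochondrial genetic inheritance disorders",
    "Mitochondrial genetic inheritance disorders",
    "Single-gene inheritance diseases",
    "Single-gene inheritance diseases",
    "Single-gene inheritance diseases"
  )
)

# Basic statistics of age within each disorder category,
# ignoring missing ages but reporting how many there were
age_by_disorder <- toy %>%
  group_by(`Genetic Disorder`) %>%
  summarise(
    n = n(),
    mean_age = mean(`Patient Age`, na.rm = TRUE),
    median_age = median(`Patient Age`, na.rm = TRUE),
    n_missing_age = sum(is.na(`Patient Age`))
  )

age_by_disorder
```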
Part 3. The Tentative Plan for Visualization
- Briefly describe what data analyses (please see the special note on statistics in the next section) and visualizations you plan to conduct to answer the research questions you proposed above.
I plan to analyze the correlations and dependencies among a patient's medical factors and how they relate to the type of genetic disorder the patient suffers from.
For visualization, I want to plot density curves of patient age and blood cell count to see their spread, a histogram of the number of previous abortions, scatter plots of blood cell count against age, and bar graphs of the counts of heart rate, respiratory rate, radiation exposure, and anomalies in previous pregnancies. I also want to explore box plots of folic acid details and of the counts of autopsies showing birth defects and of birth asphyxia.
The most important insight should come from box plots or histograms of the counts of each disorder subclass, together with scatter plots that show the relationship between the genetic disorder and each of the other features.
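Two of the planned plots can be sketched as follows. This is only an outline on made-up data: the `age` and `subclass` columns and the subclass labels "X", "Y", "Z" are placeholders, and a bar chart (`geom_bar()`) is used for the subclass counts because it is the counterpart of a histogram for a categorical column.

```r
library(ggplot2)

# Hypothetical toy data; subclass labels are placeholders
toy <- data.frame(
  age = c(1, 3, 3, 6, 7, 9, 11, 12, 14),
  subclass = c("X", "Y", "X", "Z", "X", "Y", "Y", "X", "Z")
)

# Density curve showing the spread of a numeric variable
p_age <- ggplot(toy, aes(x = age)) +
  geom_density() +
  labs(x = "Patient age", y = "Density")

# Bar chart of how many patients fall in each category
p_subclass <- ggplot(toy, aes(x = subclass)) +
  geom_bar() +
  labs(x = "Disorder subclass", y = "Number of patients")

p_age
p_subclass
```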
- Explain why you choose to conduct these specific data analyses and visualizations. In other words, how do such types of statistics or graphs (see the R Gallery) help you answer specific questions? For example, how can a bivariate visualization reveal the relationship between two variables, or how does a linear graph of variables over time present the pattern of development?
Histogram: Histograms visualize the distribution of a numerical variable, such as the number of previous abortions. Several other variables of interest (folic acid details, heart and respiratory rate, radiation exposure, anomalies in previous pregnancies, autopsies showing birth defects, birth asphyxia, and disorder subclass) are stored as categories in this dataset, so the equivalent view for them is a bar chart of the count in each category.
Scatter Plots: Scatter plots help in understanding and visualizing the relationship between two numerical variables. We can use them to examine the relationship between patient age and blood cell count, and to explore the relationship between a genetic disorder and each of the medical factors.
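As a rough sketch of how these two chart types could be built with ggplot2, assuming made-up data and simplified column names (`age`, `blood_cell_count`, `previous_abortions`, `disorder`) that stand in for the real ones: the scatter plot colours the points by disorder, which is one possible way to bring a categorical outcome into a bivariate plot.

```r
library(ggplot2)

# Hypothetical toy data with simplified column names
toy <- data.frame(
  age = c(1, 3, 5, 7, 9, 11, 13, 14),
  blood_cell_count = c(4.6, 4.8, 4.9, 5.0, 4.7, 5.1, 4.9, 5.2),
  previous_abortions = c(0, 1, 2, 2, 3, 4, 2, 1),
  disorder = rep(c("Mitochondrial", "Single-gene"), times = 4)
)

# Bivariate view: two numeric variables, coloured by disorder
p_scatter <- ggplot(toy, aes(x = age, y = blood_cell_count, colour = disorder)) +
  geom_point() +
  labs(x = "Patient age", y = "Blood cell count (mcL)", colour = "Disorder")

# Univariate view: distribution of a numeric count, one bin per value
p_hist <- ggplot(toy, aes(x = previous_abortions)) +
  geom_histogram(binwidth = 1) +
  labs(x = "No. of previous abortions", y = "Number of patients")

p_scatter
p_hist
```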
- If you plan to conduct specific data analyses and visualizations, describe how do you need to process and prepare the tidy data.
- I need to remove patient records that lack a patient ID, age, maternal gene, or paternal gene, as these are key factors in determining the type of genetic disorder. I also need to create two new columns that combine the test results and the symptoms, respectively. Missing values in the birth defects, folic acid details, and autopsy columns can be replaced with the average value (for these categorical columns, the most frequent category), as these do not have a very strong effect on the genetic disorder subclass.
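The three steps described above could look roughly like this. It is a sketch on a made-up tibble, not the final cleaning code: the `Birth defects` values, the helper `most_common()`, and the new column names `test_total` and `symptom_total` are assumptions for illustration, and summing with `na.rm = TRUE` treats a missing test or symptom as 0.

```r
library(dplyr)

# Hypothetical toy data; all values are illustrative
toy <- tibble::tibble(
  `Patient Id` = c("P1", NA, "P3", "P4", "P5"),
  `Patient Age` = c(2, 5, NA, 9, 12),
  `Maternal gene` = c("Yes", "No", "Yes", "No", "Yes"),
  `Paternal gene` = c("No", "No", "Yes", "Yes", "No"),
  `Test 1` = c(0, 0, 0, 0, 0),
  `Test 4` = c(1, 1, 1, NA, 1),
  `Symptom 1` = c(1, 0, 1, NA, 1),
  `Symptom 2` = c(0, 1, 1, 1, 1),
  `Birth defects` = c("Singular", NA, "Multiple", NA, "Singular")
)

# Hypothetical helper: most frequent non-missing value of a vector
most_common <- function(x) {
  names(which.max(table(x)))
}

tidy <- toy %>%
  # 1. drop records that lack any of the key fields
  filter(
    !is.na(`Patient Id`),
    !is.na(`Patient Age`),
    !is.na(`Maternal gene`),
    !is.na(`Paternal gene`)
  ) %>%
  mutate(
    # 2. combine the test and symptom indicators into two totals
    test_total = rowSums(across(starts_with("Test")), na.rm = TRUE),
    symptom_total = rowSums(across(starts_with("Symptom")), na.rm = TRUE),
    # 3. fill missing categories with the most frequent one
    `Birth defects` = coalesce(`Birth defects`, most_common(`Birth defects`))
  )

tidy
```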
- (Optional) It is encouraged, but optional, to include a coding component of tidy data in this part.
count_na <- data %>%
  is.na() %>%
  sum()
count_na
[1] 138434
count_na_age <- data %>%
  select(`Patient Age`) %>%
  is.na() %>%
  sum()
count_na_age
[1] 2440
data %>%
  select(`Patient Id`) %>%
  n_distinct()
[1] 21012
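The counts above give the total number of missing cells and the number of missing ages. A per-column breakdown can show where the remaining gaps are. The sketch below uses a made-up tibble and base R's `colSums()` on the logical matrix returned by `is.na()`; it also shows that `n_distinct()` counts `NA` as a distinct value unless `na.rm = TRUE` is set, which may matter when interpreting the number of unique patient IDs.

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only
toy <- tibble::tibble(
  `Patient Id` = c("P1", NA, "P3", "P3"),
  `Patient Age` = c(2, NA, NA, 9),
  Status = c("Alive", "Deceased", NA, "Alive")
)

# Missing values per column, largest first
na_by_column <- sort(colSums(is.na(toy)), decreasing = TRUE)
na_by_column

# The per-column counts add up to the overall total
total_na <- sum(is.na(toy))
total_na

# Distinct IDs with and without NA counted as a value
ids_with_na <- n_distinct(toy$`Patient Id`)
ids_without_na <- n_distinct(toy$`Patient Id`, na.rm = TRUE)
c(ids_with_na, ids_without_na)
```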