Final Project Assignment#1: Fnu Avinesh Krishnan

final_Project_assignment_1
final_project_data_description
Project & Data Description
Author

Fnu Avinesh Krishnan

Published

April 11, 2023

Important Formatting & Submission Notes:

  1. Use this file as the template to work on: start your own writing from Section “Part.1”

  2. Please make the following changes to the above YAML header:

    • Change the “title” to “Final Project Assignment#1: First Name Last Name”;

    • Change the “author” to your name;

    • Change the “date” to the current date in the “MM-DD-YYYY” format;

  3. Submission:

    • Delete the unnecessary sections (“Overview”, “Tasks”, “Special Note”, and “Evaluation”).
    • In the posts folder of your local 601_Spring_2023 project, create a folder named “FirstNameLastName_FinalProjectData”, and save your final project dataset(s) in this folder. DO NOT save the dataset(s) to the _data folder which stores the dataset(s) for challenges.
    • Render and submit the file to the blog post like a regular challenge.
library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Part 1. Introduction

In this part, you should introduce the dataset(s) and your research questions.

  1. Dataset(s) Introduction:

The exponential growth of the global population has resulted in a strain on essential resources such as healthcare, food, and shelter, which in turn has led to an increase in the incidence of genetic disorders. Genetic disorders are health conditions that typically result from DNA mutations or changes in the overall structure or number of chromosomes. Hereditary illnesses are becoming more prevalent due to insufficient awareness regarding the necessity of genetic testing. Tragically, such illnesses frequently result in the premature death of children, underscoring the vital significance of genetic testing during pregnancy.

The dataset contains the following features:

- Patient Id: Represents the unique identification number of a patient
- Patient Age: Represents the age of a patient
- Genes in mother's side: Represents a gene defect in the patient's mother
- Inherited from father: Represents a gene defect in the patient's father
- Maternal gene: Represents a gene defect in the patient's maternal side of the family
- Paternal gene: Represents a gene defect in the patient's paternal side of the family
- Blood cell count (mcL): Represents the blood cell count of a patient
- Patient First Name: Represents a patient's first name
- Family Name: Represents a patient's family name or surname
- Father's name: Represents a patient's father's name
- Mother's age: Represents a patient's mother's age
- Father's age: Represents a patient's father's age
- Institute Name: Represents the medical institute where a patient was born
- Location of Institute: Represents the location of the medical institute
- Status: Represents whether a patient is deceased
- Respiratory Rate (breaths/min): Represents a patient's respiratory rate
- Heart Rate (rates/min): Represents a patient's heart rate
- Test 1 to Test 5: Represents different (masked) tests that were conducted on a patient
- Parental consent: Represents whether a patient's parents approved the treatment plan
- Follow-up: Represents a patient's level of risk (how intense their condition is)
- Gender: Represents a patient's gender
- Birth asphyxia: Represents whether a patient suffered from birth asphyxia
- Autopsy shows birth defect (if applicable): Represents whether a patient's autopsy showed any birth defects
- Place of birth: Represents whether a patient was born in a medical institute or at home
- Folic acid details (peri-conceptional): Represents the periconceptional folic acid supplementation details of a patient
- H/O serious maternal illness: Represents an unexpected outcome of labor and delivery that resulted in significant short or long-term consequences to a patient's mother
- H/O radiation exposure (x-ray): Represents whether a patient has any radiation exposure history
- H/O substance abuse: Represents whether a parent has a history of drug addiction
- Assisted conception IVF/ART: Represents the type of treatment used for infertility
- History of anomalies in previous pregnancies: Represents whether the mother had any anomalies in her previous pregnancies
- No. of previous abortion: Represents the number of abortions that a mother had
- Birth defects: Represents whether a patient has birth defects
- White Blood cell count (thousand per microliter): Represents a patient's white blood cell count
- Blood test result: Represents a patient's blood test results
- Symptom 1 to Symptom 5: Represents different (masked) types of symptoms that a patient had
- Genetic Disorder: Represents the genetic disorder that a patient has
- Disorder Subclass: Represents the subclass of the disorder

The genome and genetics dataset is collected by the National Center for Biotechnology Information (NCBI). The dataset includes patients with various genetic disorders, which can be broadly categorized into three groups: mitochondrial genetic inheritance disorders, single-gene inheritance diseases, and multifactorial genetic inheritance disorders. Each row holds the medical information of one patient: age, maternal and paternal gene, blood cell count, respiratory and heart rate, test results, previous abortions, presence of symptoms, birth defects, and the genetic disorder. My study aims to analyse the relationship between different genetic factors, extract insights from the correlation between medical precursors and test results, and predict the specific genetic disorder and its subclass for each patient based on their medical information.

  2. What questions would you like to answer with the dataset(s)?

Part 2. Describe the data set(s)

This part contains both a coding and a storytelling component.

In the coding component, you should:

  1. read the dataset;
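# readr is already attached by library(tidyverse); loading it again is harmless
library(readr)
# Read the training data from the final project data folder
data <- read_csv("/Users/avineshkrishnan/Desktop/601_Spring_2023/posts/FnuAvineshKrishnan_FinalProjectData/train_genetic_disorders.csv")
# Open the data in the interactive viewer
view(data)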
  1. present the descriptive information of the dataset(s) using the functions in Challenges 1, 2, and 3;

    • for examples: dim(), length(unique()), head();
    dim(data)
    [1] 22083    45
    head(data)
    colnames(data)
     [1] "Patient Id"                                      
     [2] "Patient Age"                                     
     [3] "Genes in mother's side"                          
     [4] "Inherited from father"                           
     [5] "Maternal gene"                                   
     [6] "Paternal gene"                                   
     [7] "Blood cell count (mcL)"                          
     [8] "Patient First Name"                              
     [9] "Family Name"                                     
    [10] "Father's name"                                   
    [11] "Mother's age"                                    
    [12] "Father's age"                                    
    [13] "Institute Name"                                  
    [14] "Location of Institute"                           
    [15] "Status"                                          
    [16] "Respiratory Rate (breaths/min)"                  
    [17] "Heart Rate (rates/min"                           
    [18] "Test 1"                                          
    [19] "Test 2"                                          
    [20] "Test 3"                                          
    [21] "Test 4"                                          
    [22] "Test 5"                                          
    [23] "Parental consent"                                
    [24] "Follow-up"                                       
    [25] "Gender"                                          
    [26] "Birth asphyxia"                                  
    [27] "Autopsy shows birth defect (if applicable)"      
    [28] "Place of birth"                                  
    [29] "Folic acid details (peri-conceptional)"          
    [30] "H/O serious maternal illness"                    
    [31] "H/O radiation exposure (x-ray)"                  
    [32] "H/O substance abuse"                             
    [33] "Assisted conception IVF/ART"                     
    [34] "History of anomalies in previous pregnancies"    
    [35] "No. of previous abortion"                        
    [36] "Birth defects"                                   
    [37] "White Blood cell count (thousand per microliter)"
    [38] "Blood test result"                               
    [39] "Symptom 1"                                       
    [40] "Symptom 2"                                       
    [41] "Symptom 3"                                       
    [42] "Symptom 4"                                       
    [43] "Symptom 5"                                       
    [44] "Genetic Disorder"                                
    [45] "Disorder Subclass"                               
    length(unique(data$Status))
    [1] 3
    unique(data$Status)
    [1] "Alive"    "Deceased" NA        
    length(unique(data$`Genetic Disorder`))
    [1] 4
    unique(data$`Genetic Disorder`)
    [1] "Mitochondrial genetic inheritance disorders" 
    [2] NA                                            
    [3] "Multifactorial genetic inheritance disorders"
    [4] "Single-gene inheritance diseases"            
  2. conduct summary statistics of the dataset(s); especially show the basic statistics (min, max, mean, median, etc.) for the variables you are interested in.

summary(data)
  Patient Id         Patient Age     Genes in mother's side
 Length:22083       Min.   : 0.000   Length:22083          
 Class :character   1st Qu.: 3.000   Class :character      
 Mode  :character   Median : 7.000   Mode  :character      
                    Mean   : 6.975                         
                    3rd Qu.:11.000                         
                    Max.   :14.000                         
                    NA's   :2440                           
 Inherited from father Maternal gene      Paternal gene     
 Length:22083          Length:22083       Length:22083      
 Class :character      Class :character   Class :character  
 Mode  :character      Mode  :character   Mode  :character  
                                                            
                                                            
                                                            
                                                            
 Blood cell count (mcL) Patient First Name Family Name       
 Min.   :4.093          Length:22083       Length:22083      
 1st Qu.:4.763          Class :character   Class :character  
 Median :4.899          Mode  :character   Mode  :character  
 Mean   :4.899                                               
 3rd Qu.:5.034                                               
 Max.   :5.610                                               
 NA's   :1072                                                
 Father's name       Mother's age    Father's age   Institute Name    
 Length:22083       Min.   :18.00   Min.   :20.00   Length:22083      
 Class :character   1st Qu.:26.00   1st Qu.:31.00   Class :character  
 Mode  :character   Median :35.00   Median :42.00   Mode  :character  
                    Mean   :34.52   Mean   :41.94                     
                    3rd Qu.:43.00   3rd Qu.:53.00                     
                    Max.   :51.00   Max.   :64.00                     
                    NA's   :6790    NA's   :6761                      
 Location of Institute    Status          Respiratory Rate (breaths/min)
 Length:22083          Length:22083       Length:22083                  
 Class :character      Class :character   Class :character              
 Mode  :character      Mode  :character   Mode  :character              
                                                                        
                                                                        
                                                                        
                                                                        
 Heart Rate (rates/min     Test 1         Test 2         Test 3    
 Length:22083          Min.   :0      Min.   :0      Min.   :0     
 Class :character      1st Qu.:0      1st Qu.:0      1st Qu.:0     
 Mode  :character      Median :0      Median :0      Median :0     
                       Mean   :0      Mean   :0      Mean   :0     
                       3rd Qu.:0      3rd Qu.:0      3rd Qu.:0     
                       Max.   :0      Max.   :0      Max.   :0     
                       NA's   :3091   NA's   :3125   NA's   :3113  
     Test 4         Test 5     Parental consent    Follow-up        
 Min.   :1      Min.   :0      Length:22083       Length:22083      
 1st Qu.:1      1st Qu.:0      Class :character   Class :character  
 Median :1      Median :0      Mode  :character   Mode  :character  
 Mean   :1      Mean   :0                                           
 3rd Qu.:1      3rd Qu.:0                                           
 Max.   :1      Max.   :0                                           
 NA's   :3121   NA's   :3144                                        
    Gender          Birth asphyxia    
 Length:22083       Length:22083      
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
                                      
 Autopsy shows birth defect (if applicable) Place of birth    
 Length:22083                               Length:22083      
 Class :character                           Class :character  
 Mode  :character                           Mode  :character  
                                                              
                                                              
                                                              
                                                              
 Folic acid details (peri-conceptional) H/O serious maternal illness
 Length:22083                           Length:22083                
 Class :character                       Class :character            
 Mode  :character                       Mode  :character            
                                                                    
                                                                    
                                                                    
                                                                    
 H/O radiation exposure (x-ray) H/O substance abuse Assisted conception IVF/ART
 Length:22083                   Length:22083        Length:22083               
 Class :character               Class :character    Class :character           
 Mode  :character               Mode  :character    Mode  :character           
                                                                               
                                                                               
                                                                               
                                                                               
 History of anomalies in previous pregnancies No. of previous abortion
 Length:22083                                 Min.   :0               
 Class :character                             1st Qu.:1               
 Mode  :character                             Median :2               
                                              Mean   :2               
                                              3rd Qu.:3               
                                              Max.   :4               
                                              NA's   :3126            
 Birth defects      White Blood cell count (thousand per microliter)
 Length:22083       Min.   : 3.000                                  
 Class :character   1st Qu.: 5.419                                  
 Mode  :character   Median : 7.473                                  
                    Mean   : 7.485                                  
                    3rd Qu.: 9.529                                  
                    Max.   :12.000                                  
                    NA's   :3118                                    
 Blood test result    Symptom 1       Symptom 2       Symptom 3     
 Length:22083       Min.   :0.000   Min.   :0.000   Min.   :0.0000  
 Class :character   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0000  
 Mode  :character   Median :1.000   Median :1.000   Median :1.0000  
                    Mean   :0.592   Mean   :0.553   Mean   :0.5374  
                    3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.0000  
                    Max.   :1.000   Max.   :1.000   Max.   :1.0000  
                    NA's   :3128    NA's   :3184    NA's   :3075    
   Symptom 4        Symptom 5      Genetic Disorder   Disorder Subclass 
 Min.   :0.0000   Min.   :0.0000   Length:22083       Length:22083      
 1st Qu.:0.0000   1st Qu.:0.0000   Class :character   Class :character  
 Median :0.0000   Median :0.0000   Mode  :character   Mode  :character  
 Mean   :0.4974   Mean   :0.4608                                        
 3rd Qu.:1.0000   3rd Qu.:1.0000                                        
 Max.   :1.0000   Max.   :1.0000                                        
 NA's   :3096     NA's   :3127                                          
  • The dataset includes medical information such as blood cell count, test results, respiratory and heart rate, maternal and paternal genes, and previous abnormalities or birth defects for 22,083 patient records, covering both living and deceased patients. The patients are children up to the age of 14 who suffer from mitochondrial genetic inheritance disorders, multifactorial genetic inheritance disorders, or single-gene inheritance diseases.
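
As a complement to `summary(data)`, a per-disorder summary can show how the variables of interest differ between groups. The following is only a minimal sketch: the `toy` tibble is hypothetical (made-up values that reuse three of the real column names), and the choice of statistics is an assumption, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy rows that reuse three real column names;
# the values are made up for illustration only.
toy <- tibble(
  `Patient Age` = c(2, 5, NA, 11, 14, 7),
  `Blood cell count (mcL)` = c(4.8, 4.9, 5.0, NA, 4.7, 5.1),
  `Genetic Disorder` = c(
    "Mitochondrial genetic inheritance disorders",
    "Single-gene inheritance diseases",
    "Mitochondrial genetic inheritance disorders",
    NA,
    "Multifactorial genetic inheritance disorders",
    "Single-gene inheritance diseases"
  )
)

# Count records and summarise age and blood cell count per disorder,
# ignoring missing values inside each group.
disorder_summary <- toy %>%
  group_by(`Genetic Disorder`) %>%
  summarise(
    n_records = n(),
    mean_age = mean(`Patient Age`, na.rm = TRUE),
    median_blood_count = median(`Blood cell count (mcL)`, na.rm = TRUE),
    .groups = "drop"
  )

disorder_summary
```

Replacing `toy` with `data` should give the same table for the full dataset, with the records that have no recorded disorder grouped under `NA`.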

Part 3. The Tentative Plan for Visualization

  1. Briefly describe what data analyses (please see the special note on statistics in the next section) and visualizations you plan to conduct to answer the research questions you proposed above.
  • I plan on analyzing the correlation and dependencies between each medical factor of a patient and how it affects the type of genetic disorder they suffer from.

  • On the visualization side, I want to plot density curves of patient age and blood cell count to see their spread, histograms of the distribution of previous abortions, scatter plots of blood cell count against age, and bar graphs of counts of heart and respiratory rate, radiation exposure, and anomalies in previous pregnancies. I also want to explore box plots of folic acid details and counts of autopsies showing birth defects and birth asphyxia.

  • The most important insight to be derived is to have box plots or histograms for the counts of disorder subclass and make scatter plots to see the relationship between genetic disorder and each of the features.

  1. Explain why you choose to conduct these specific data analyses and visualizations. In other words, how do such types of statistics or graphs (see the R Gallery) help you answer specific questions? For example, how can a bivariate visualization reveal the relationship between two variables, or how does a linear graph of variables over time present the pattern of development?
  • Histogram: Histograms can be used to visualize the distributions of numerical variables. For this dataset, we can use them to visualize the distribution of previous abortions, folic acid details, counts of heart and respiratory rate, radiation exposure, anomalies in previous pregnancies, autopsies showing birth defects, birth asphyxia, and disorder subclass.

  • Scatter Plots: Scatter plots are useful for understanding and visualizing the relationship between two numerical variables. We can use these plots to analyze the distributions of patient age and blood cell count, and to explore the relationship between a genetic disorder and each of the medical factors.
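
The two plot types described above could be drawn with ggplot2 roughly as follows. This is a sketch under stated assumptions: the `toy` data frame and its column names (`age`, `blood_count`, `previous_abortions`, `disorder`) are hypothetical stand-ins with randomly generated values, not the real dataset.

```r
library(ggplot2)

# Hypothetical toy data; column names and values are made up for illustration.
set.seed(1)
toy <- data.frame(
  age = sample(0:14, 50, replace = TRUE),
  blood_count = rnorm(50, mean = 4.9, sd = 0.2),
  previous_abortions = sample(0:4, 50, replace = TRUE),
  disorder = sample(c("Mitochondrial", "Multifactorial", "Single-gene"),
                    50, replace = TRUE)
)

# Histogram: distribution of a single numeric variable.
p_hist <- ggplot(toy, aes(x = previous_abortions)) +
  geom_histogram(binwidth = 1, colour = "white") +
  labs(x = "No. of previous abortions", y = "Count")

# Scatter plot: relationship between two numeric variables,
# with colour separating the disorder groups.
p_scatter <- ggplot(toy, aes(x = age, y = blood_count, colour = disorder)) +
  geom_point(alpha = 0.7) +
  labs(x = "Patient age", y = "Blood cell count (mcL)",
       colour = "Genetic disorder")

p_hist
p_scatter
```

Mapping the disorder to colour is one way to bring the categorical outcome into a plot of two numerical variables.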

  1. If you plan to conduct specific data analyses and visualizations, describe how you need to process and prepare the tidy data.
  • I need to remove patient records without a patient id, age, maternal gene, or paternal gene, as these are key factors in determining the type of genetic disorder. I also need to create two new columns combining the test results and the symptoms. Missing values in the birth defects, folic acid details, and autopsy columns can be replaced with the average value, as these do not have a very strong effect on the genetic disorder subclass.
  1. (Optional) It is encouraged, but optional, to include a coding component of tidy data in this part.
count_na<-data %>%
  is.na() %>% 
  sum()
count_na
[1] 138434
count_na_age<-data %>%
  select(`Patient Age`) %>% 
  is.na() %>% 
  sum()
count_na_age
[1] 2440
data %>%
  select(`Patient Id`) %>%
  n_distinct()
[1] 21012
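
The cleaning steps planned in this part can be sketched on a small example. Everything here is illustrative: the `toy` tibble uses made-up values, `most_common()` is a hypothetical helper, and `n_symptoms` is an assumed name for the combined symptom column. Because `Birth defects` is a character column, the sketch fills gaps with the most frequent value rather than a numeric average; that substitution is an assumption, not the method stated in the plan.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows reusing a few real column names; values are made up.
toy <- tibble(
  `Patient Id` = c("P1", "P2", NA, "P4"),
  `Patient Age` = c(3, NA, 8, 12),
  `Maternal gene` = c("Yes", "No", "Yes", "No"),
  `Paternal gene` = c("No", "No", "Yes", "Yes"),
  `Symptom 1` = c(1, 0, 1, 1),
  `Symptom 2` = c(0, 0, 1, NA),
  `Birth defects` = c("Singular", NA, "Multiple", NA)
)

# Hypothetical helper: most frequent non-missing value of a vector.
most_common <- function(x) {
  names(sort(table(x), decreasing = TRUE))[1]
}

tidy_toy <- toy %>%
  # Keep only records where the key identifying and genetic fields are present.
  drop_na(`Patient Id`, `Patient Age`, `Maternal gene`, `Paternal gene`) %>%
  # Combine the symptom flags into one count per patient (assumed rule: sum).
  mutate(n_symptoms = rowSums(across(starts_with("Symptom")), na.rm = TRUE)) %>%
  # Fill gaps in a categorical column with its most common value.
  mutate(`Birth defects` = replace_na(`Birth defects`,
                                      most_common(`Birth defects`)))

tidy_toy
```

The same pattern would extend to the `Test 1` to `Test 5` columns and to the folic acid and autopsy columns.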