Final Project check in 1

Author

Thrishul Pola

Published

March 22, 2023

PREDICTING STUDENT PERFORMANCE IN EXAMS USING REGRESSION ALGORITHMS

Dataset

The dataset being used is sourced from Kaggle: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?resource=download

Project Description:

The objective of this project is to predict students' math scores from the other features in the dataset using several regression algorithms. The dataset covers 1,000 students of a particular grade and records their scores in math, reading, and writing, each out of 100. It also contains the features gender, race/ethnicity, parental level of education, lunch type, and test preparation course.

The first step in this project is to explore the dataset and perform data preprocessing tasks such as handling missing values, encoding categorical variables, and scaling the data. After preprocessing the data, the next step is to perform exploratory data analysis to gain insights into the relationship between the features and the target variable.
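The preprocessing steps named above can be sketched in base R. This is a minimal illustration on a few made-up rows, not the project's actual pipeline: the column names are hypothetical snake_case versions of the CSV headers, the missing value is planted for demonstration, and dropping incomplete rows is only one of several ways to handle missing data.

```r
# Toy rows with hypothetical column names (not the real CSV);
# one reading score is deliberately set to NA for illustration.
toy <- data.frame(
  gender        = c("female", "male", "female", "male", "female", "male"),
  lunch_type    = c("standard", "free/reduced", "standard",
                    "standard", "free/reduced", "free/reduced"),
  reading_score = c(72, 57, NA, 78, 64, 60),
  math_score    = c(72, 47, 90, 76, 64, 38),
  stringsAsFactors = FALSE
)

# 1. Handle missing values: count them per column, then drop incomplete rows
#    (an assumption; imputation would be another option).
missing_counts <- colSums(is.na(toy))
complete <- toy[complete.cases(toy), ]

# 2. Encode categorical variables as 0/1 dummy columns.
#    model.matrix() adds an intercept column, which is dropped here.
dummies <- model.matrix(~ gender + lunch_type, data = complete)[, -1, drop = FALSE]

# 3. Scale the numeric predictor to mean 0 and standard deviation 1.
reading_scaled <- as.numeric(scale(complete$reading_score))

preprocessed <- data.frame(
  dummies,
  reading_score = reading_scaled,
  math_score    = complete$math_score
)
print(preprocessed)
```

The target variable is left unscaled in this sketch so that error metrics stay in score points.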

In the next step, various regression algorithms such as Linear Regression, Polynomial Regression, Ridge Regression, Lasso Regression, and ElasticNet Regression will be trained on the preprocessed dataset. The performance of each algorithm will be evaluated using metrics such as Mean Squared Error, Root Mean Squared Error, and R-squared.
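To make the three evaluation metrics concrete, the following sketch computes them by hand in base R. The function names (calc_mse, calc_rmse, calc_r2) and the predicted values are hypothetical and chosen only for illustration; in the project itself, packaged equivalents could be used instead.

```r
# Mean Squared Error: average of the squared prediction errors.
calc_mse <- function(actual, predicted) mean((actual - predicted)^2)

# Root Mean Squared Error: square root of MSE, in the same units as the scores.
calc_rmse <- function(actual, predicted) sqrt(calc_mse(actual, predicted))

# R-squared: 1 minus the ratio of residual to total sum of squares.
calc_r2 <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

# Four math scores and hypothetical predictions, each off by 2 points.
actual    <- c(72, 69, 90, 47)
predicted <- c(70, 71, 88, 49)

calc_mse(actual, predicted)   # 4
calc_rmse(actual, predicted)  # 2
calc_r2(actual, predicted)    # about 0.983
```

Lower MSE and RMSE indicate a better fit, while R-squared is better the closer it is to 1.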

After evaluating each algorithm, the best-performing one will be selected and used to make predictions on the test dataset. These predictions will be evaluated with the same metrics.
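One way the selection step could look is sketched below in base R. The data are simulated, the 60/20/20 split into training, validation, and test sets is an assumption (the project does not specify its split), and only two candidate models are compared to keep the example short.

```r
set.seed(12345)

# Simulated data standing in for the real scores (illustration only).
n <- 300
sim <- data.frame(reading_score = runif(n, 40, 100))
sim$math_score <- 5 + 0.9 * sim$reading_score + rnorm(n, sd = 5)

# Assumed 60/20/20 split: train / validation / test.
shuffled <- sample(seq_len(n))
train <- sim[shuffled[1:180], ]
valid <- sim[shuffled[181:240], ]
test  <- sim[shuffled[241:300], ]

# RMSE of a fitted model on a given data set.
rmse_of <- function(model, data) {
  sqrt(mean((data$math_score - predict(model, newdata = data))^2))
}

# Candidate models, fitted on the training set only.
candidates <- list(
  linear    = lm(math_score ~ reading_score, data = train),
  quadratic = lm(math_score ~ poly(reading_score, 2), data = train)
)

# Pick the candidate with the lowest validation RMSE,
# then report its error once on the untouched test set.
valid_rmse <- sapply(candidates, rmse_of, data = valid)
best_name  <- names(which.min(valid_rmse))
test_rmse  <- rmse_of(candidates[[best_name]], test)

print(valid_rmse)
print(best_name)
print(test_rmse)
```

Choosing the model on a validation set rather than on the test set keeps the final test score an honest estimate of performance on unseen students.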

Finally, the project will conclude with a summary of the findings and recommendations for improving student performance in Math. This project will be beneficial for educators and policymakers to identify the factors that influence student performance in Math and take appropriate actions to improve it.

VARIABLE DESCRIPTIONS:

  1. gender: gender of the student (male/female)

  2. race: race/ethnicity group of the student (e.g., group A, group B, group C)

  3. parental level of education: highest educational qualification held by any of the student's parents

  4. lunch_type: type of lunch package selected for the student (standard or free/reduced)

  5. test_prep: whether the student completed the test preparation course (completed/none)

  6. math_score: score in math (our target variable)

  7. reading_score: score in reading

  8. writing_score: score in writing

All scores are taken out of 100.
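The short names used in the list above differ from the column headers in the CSV file (for example, `math score` and `test preparation course`). A minimal sketch of one way to rename the columns in base R follows; the two toy rows copy values shown in the data preview, and the name parental_education is a hypothetical choice, since the list does not give a short name for that column.

```r
# Column headers as they appear in the CSV file.
csv_names <- c("gender", "race/ethnicity", "parental level of education",
               "lunch", "test preparation course",
               "math score", "reading score", "writing score")

# Two toy rows with the CSV headers applied.
toy <- data.frame(
  c("female", "male"),
  c("group B", "group A"),
  c("bachelor's degree", "associate's degree"),
  c("standard", "free/reduced"),
  c("none", "none"),
  c(72, 47), c(72, 57), c(74, 44),
  stringsAsFactors = FALSE
)
names(toy) <- csv_names

# Mapping from CSV header to short name ("parental_education" is hypothetical).
short_names <- c(
  "race/ethnicity"              = "race",
  "parental level of education" = "parental_education",
  "lunch"                       = "lunch_type",
  "test preparation course"     = "test_prep",
  "math score"                  = "math_score",
  "reading score"               = "reading_score",
  "writing score"               = "writing_score"
)

# Rename only the columns listed in the mapping.
idx <- match(names(short_names), names(toy))
names(toy)[idx] <- unname(short_names)

print(names(toy))
```

Names without spaces or slashes can be used in model formulas without backticks.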

Hypothesis

There is a significant correlation between features available in the dataset, such as parental level of education, test preparation course, and lunch type, and students' math scores. By using various regression algorithms, it is possible to predict math scores with reasonable accuracy from the available features. Furthermore, regression algorithms can identify the most influential factors that contribute to student performance in Math, allowing educators and policymakers to take appropriate action to improve it.
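A first check of this hypothesis could use standard tests from base R before any model is trained. The sketch below runs on simulated data with a built-in effect, so its results say nothing about the real dataset; the column names and effect sizes are assumptions made for illustration.

```r
set.seed(12345)

# Simulated data: math score depends on reading score and on test preparation.
n <- 100
sim <- data.frame(test_prep = rep(c("none", "completed"), each = n / 2))
sim$reading_score <- rnorm(n, mean = 70, sd = 10)
sim$math_score <- 10 + 0.8 * sim$reading_score +
  ifelse(sim$test_prep == "completed", 5, 0) + rnorm(n, sd = 6)

# Pearson correlation between two numeric scores, with a significance test.
cor_result <- cor.test(sim$math_score, sim$reading_score)

# Welch two-sample t-test: do mean math scores differ by test preparation?
prep_result <- t.test(math_score ~ test_prep, data = sim)

print(cor_result$estimate)
print(cor_result$p.value)
print(prep_result$p.value)
```

For categorical features with more than two levels, such as parental level of education, a one-way ANOVA via aov() would be the analogous check.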

Importing Libraries

Code
set.seed(12345)
library(caret)
Warning: package 'caret' was built under R version 4.2.3
Loading required package: ggplot2
Loading required package: lattice
Code
library(Metrics)
Warning: package 'Metrics' was built under R version 4.2.3

Attaching package: 'Metrics'
The following objects are masked from 'package:caret':

    precision, recall
Code
# The caret package provides a wide range of functions for training and evaluating machine learning models, including R2() for the R-squared score; the Metrics package provides error metrics such as mse() and rmse()

library(glmnet)
Warning: package 'glmnet' was built under R version 4.2.3
Loading required package: Matrix
Loaded glmnet 4.1-7
Code
# The glmnet package fits regularized regression models: glmnet() with alpha = 0 gives Ridge regression and alpha = 1 gives Lasso regression

# For cross-validation, cv.glmnet() performs k-fold cross-validation with a specified number of folds (nfolds)

# Values of alpha between 0 and 1 give ElasticNet regression
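As a small illustration of the calls described in these comments, the sketch below fits cross-validated Ridge, Lasso, and ElasticNet models with cv.glmnet(). The predictor matrix and response are simulated, and nfolds = 5 and alpha = 0.5 are arbitrary choices, not settings taken from the project.

```r
library(glmnet)
set.seed(12345)

# Simulated predictors and response (illustration only).
n <- 100
x <- matrix(rnorm(n * 5), nrow = n, ncol = 5)
y <- 2 * x[, 1] - x[, 2] + rnorm(n)

# alpha selects the penalty: 0 = Ridge, 1 = Lasso, in between = ElasticNet.
ridge_cv <- cv.glmnet(x, y, alpha = 0,   nfolds = 5)
lasso_cv <- cv.glmnet(x, y, alpha = 1,   nfolds = 5)
enet_cv  <- cv.glmnet(x, y, alpha = 0.5, nfolds = 5)

# lambda.min is the penalty strength with the lowest cross-validated error.
print(c(ridge = ridge_cv$lambda.min,
        lasso = lasso_cv$lambda.min,
        enet  = enet_cv$lambda.min))

# Coefficients and predictions at the selected lambda.
print(coef(lasso_cv, s = "lambda.min"))
lasso_pred <- predict(lasso_cv, newx = x, s = "lambda.min")
```

Note that glmnet expects a numeric matrix, so categorical features need to be dummy-encoded before fitting.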
Code
library(readr)
StudentsPerformance <- read_csv("_data/StudentsPerformance.csv", show_col_types = FALSE)
str(StudentsPerformance)
spc_tbl_ [1,000 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ gender                     : chr [1:1000] "female" "female" "female" "male" ...
 $ race/ethnicity             : chr [1:1000] "group B" "group C" "group B" "group A" ...
 $ parental level of education: chr [1:1000] "bachelor's degree" "some college" "master's degree" "associate's degree" ...
 $ lunch                      : chr [1:1000] "standard" "standard" "standard" "free/reduced" ...
 $ test preparation course    : chr [1:1000] "none" "completed" "none" "none" ...
 $ math score                 : num [1:1000] 72 69 90 47 76 71 88 40 64 38 ...
 $ reading score              : num [1:1000] 72 90 95 57 78 83 95 43 64 60 ...
 $ writing score              : num [1:1000] 74 88 93 44 75 78 92 39 67 50 ...
 - attr(*, "spec")=
  .. cols(
  ..   gender = col_character(),
  ..   `race/ethnicity` = col_character(),
  ..   `parental level of education` = col_character(),
  ..   lunch = col_character(),
  ..   `test preparation course` = col_character(),
  ..   `math score` = col_double(),
  ..   `reading score` = col_double(),
  ..   `writing score` = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
Code
summary(StudentsPerformance)
    gender          race/ethnicity     parental level of education
 Length:1000        Length:1000        Length:1000                
 Class :character   Class :character   Class :character           
 Mode  :character   Mode  :character   Mode  :character           
                                                                  
                                                                  
                                                                  
    lunch           test preparation course   math score     reading score   
 Length:1000        Length:1000             Min.   :  0.00   Min.   : 17.00  
 Class :character   Class :character        1st Qu.: 57.00   1st Qu.: 59.00  
 Mode  :character   Mode  :character        Median : 66.00   Median : 70.00  
                                            Mean   : 66.09   Mean   : 69.17  
                                            3rd Qu.: 77.00   3rd Qu.: 79.00  
                                            Max.   :100.00   Max.   :100.00  
 writing score   
 Min.   : 10.00  
 1st Qu.: 57.75  
 Median : 69.00  
 Mean   : 68.05  
 3rd Qu.: 79.00  
 Max.   :100.00  

Proposed Models

Various regression algorithms such as Linear Regression, Polynomial Regression, Ridge Regression, Lasso Regression, and ElasticNet Regression will be trained on the preprocessed dataset. The performance of each algorithm will be evaluated using metrics such as Mean Squared Error, Root Mean Squared Error, and R-squared.
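The proposed models are closely related, which the following objective makes explicit. It is written in the parameterization documented for the glmnet package (for a Gaussian response), with N observations, intercept \beta_0, coefficient vector \beta, penalty strength \lambda, and mixing parameter \alpha; the notation is introduced here for illustration and does not appear elsewhere in this report.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Bigr]
```

Setting \lambda = 0 gives ordinary Linear Regression, \alpha = 0 gives Ridge, \alpha = 1 gives Lasso, and 0 < \alpha < 1 gives ElasticNet. Polynomial Regression uses the same least-squares objective with polynomial terms of the predictors added to x_i.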