PREDICTING STUDENT PERFORMANCE IN EXAMS USING REGRESSION ALGORITHMS
Dataset
The dataset being used is sourced from Kaggle: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?resource=download
Project Description:
The objective of this project is to predict the math score of students based on the features available in the dataset using various regression algorithms. The dataset consists of data related to students of a particular grade and their scores in Maths, Reading and Writing specified out of 100. The dataset also contains additional features such as gender, race/ethnicity, parental level of education, lunch type, and test preparation course.
The first step in this project is to explore the dataset and perform data preprocessing tasks such as handling missing values, encoding categorical variables, and scaling the data. After preprocessing the data, the next step is to perform exploratory data analysis to gain insights into the relationship between the features and the target variable.
In the next step, various regression algorithms such as Linear Regression, Polynomial Regression, Ridge Regression, Lasso Regression, and ElasticNet Regression will be trained on the preprocessed dataset. The performance of each algorithm will be evaluated using metrics such as Mean Squared Error, Root Mean Squared Error, and R-squared.
After evaluating the performance of each algorithm, the best performing algorithm will be selected and used to make predictions on the test dataset. The predictions will be evaluated using the same metrics used to evaluate the performance of the algorithms.
Finally, the project will conclude with a summary of the findings and recommendations for improving student performance in Math. This project will be beneficial for educators and policymakers to identify the factors that influence student performance in Math and take appropriate actions to improve it.
VARIABLE DESCRIPTIONS:
gender: specifies gender of the student(male/female)
race: specifies race of the student(group A,group B,group C)
parental level of education: specifies highest educational qualification of any parent of each student
lunch_type: standard/reduced,the type of lunch package selected for the student
test_prep: specifies if the test preparation course was completed by the student or not
math_score: specifies score in math(our target variable)
reading_score: specifies score in reading
writing_score: specifies score in writing
All scores are taken out of 100.
##Hypothesis
There is a significant correlation between the features available in the dataset, such as parental level of education, test preparation course, and lunch type, and the math score of students. By using various regression algorithms, it is possible to predict the math score of students with reasonable accuracy based on the available features. Furthermore, the use of regression algorithms can identify the most influential factors that contribute to student performance in Math, allowing educators and policymakers to take appropriate actions to improve student performance
Importing Libraries
Code
set.seed(12345)library(caret)
Warning: package 'caret' was built under R version 4.2.3
Loading required package: ggplot2
Loading required package: lattice
Code
library(Metrics)
Warning: package 'Metrics' was built under R version 4.2.3
Attaching package: 'Metrics'
The following objects are masked from 'package:caret':
precision, recall
Code
#The caret package provides a wide range of functions for training and evaluating machine learning models, while the Metrics package provides various metrics for evaluating model performance, including the R-squared score library(glmnet)
Warning: package 'glmnet' was built under R version 4.2.3
Loading required package: Matrix
Loaded glmnet 4.1-7
Code
#The glmnet package provides functions for fitting regularized regression models, including Ridge regression (glmnet function with alpha = 0) and Lasso regression (glmnet function with alpha = 1)#To perform cross-validation, the cv.glmnet function which performs k-fold cross-validation with a specified number of folds (nfolds)#Similarly, performing Lasso regression by setting alpha = 1 in the glmnet function
gender race/ethnicity parental level of education
Length:1000 Length:1000 Length:1000
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
lunch test preparation course math score reading score
Length:1000 Length:1000 Min. : 0.00 Min. : 17.00
Class :character Class :character 1st Qu.: 57.00 1st Qu.: 59.00
Mode :character Mode :character Median : 66.00 Median : 70.00
Mean : 66.09 Mean : 69.17
3rd Qu.: 77.00 3rd Qu.: 79.00
Max. :100.00 Max. :100.00
writing score
Min. : 10.00
1st Qu.: 57.75
Median : 69.00
Mean : 68.05
3rd Qu.: 79.00
Max. :100.00
Proposed Models
various regression algorithms such as Linear Regression, Polynomial Regression, Ridge Regression, Lasso Regression, and ElasticNet Regression will be trained on the preprocessed dataset. The performance of each algorithm will be evaluated using metrics such as Mean Squared Error, Root Mean Squared Error, and R-squared.
Source Code
---title: "Final Project check in 1"author: "Thrishul Pola"description: "Final Project Check in 1"date: "3/22/2023"format: html: df-print: paged css: styles.css toc: true code-fold: true code-copy: true code-tools: truecategories: - Final_Project_Checkin_1---## PREDICTING STUDENT PERFORMANCE IN EXAMS USING REGRESSION ALGORITHMS# DatasetThe dataset being used is sourced from Kaggle: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?resource=downloadProject Description:The objective of this project is to predict the math score of students based on the features available in the dataset using various regression algorithms. The dataset consists of data related to students of a particular grade and their scores in Maths, Reading and Writing specified out of 100. The dataset also contains additional features such as gender, race/ethnicity, parental level of education, lunch type, and test preparation course.The first step in this project is to explore the dataset and perform data preprocessing tasks such as handling missing values, encoding categorical variables, and scaling the data. After preprocessing the data, the next step is to perform exploratory data analysis to gain insights into the relationship between the features and the target variable.In the next step, various regression algorithms such as Linear Regression, Polynomial Regression, Ridge Regression, Lasso Regression, and ElasticNet Regression will be trained on the preprocessed dataset. The performance of each algorithm will be evaluated using metrics such as Mean Squared Error, Root Mean Squared Error, and R-squared.After evaluating the performance of each algorithm, the best performing algorithm will be selected and used to make predictions on the test dataset. The predictions will be evaluated using the same metrics used to evaluate the performance of the algorithms.Finally, the project will conclude with a summary of the findings and recommendations for improving student performance in Math. This project will be beneficial for educators and policymakers to identify the factors that influence student performance in Math and take appropriate actions to improve it.## VARIABLE DESCRIPTIONS:1) gender: specifies gender of the student(male/female)2) race: specifies race of the student(group A,group B,group C)3) parental level of education: specifies highest educational qualification of any parent of each student4) lunch_type: standard/reduced,the type of lunch package selected for the student5) test_prep: specifies if the test preparation course was completed by the student or not6) math_score: specifies score in math(our target variable)7) reading_score: specifies score in reading8) writing_score: specifies score in writingAll scores are taken out of 100.##Hypothesis There is a significant correlation between the features available in the dataset, such as parental level of education, test preparation course, and lunch type, and the math score of students. By using various regression algorithms, it is possible to predict the math score of students with reasonable accuracy based on the available features. Furthermore, the use of regression algorithms can identify the most influential factors that contribute to student performance in Math, allowing educators and policymakers to take appropriate actions to improve student performance## Importing Libraries```{r}set.seed(12345)library(caret)library(Metrics)#The caret package provides a wide range of functions for training and evaluating machine learning models, while the Metrics package provides various metrics for evaluating model performance, including the R-squared score library(glmnet)#The glmnet package provides functions for fitting regularized regression models, including Ridge regression (glmnet function with alpha = 0) and Lasso regression (glmnet function with alpha = 1)#To perform cross-validation, the cv.glmnet function which performs k-fold cross-validation with a specified number of folds (nfolds)#Similarly, performing Lasso regression by setting alpha = 1 in the glmnet function``````{r}library(readr)StudentsPerformance <-read_csv("_data/StudentsPerformance.csv", show_col_types =FALSE)str(StudentsPerformance)``````{r}summary(StudentsPerformance)```## Proposed Modelsvarious regression algorithms such as Linear Regression, Polynomial Regression, Ridge Regression, Lasso Regression, and ElasticNet Regression will be trained on the preprocessed dataset. The performance of each algorithm will be evaluated using metrics such as Mean Squared Error, Root Mean Squared Error, and R-squared.