Final Project Qmd file

hw1
challenge1
my name
dataset
ggplot2
Author

Paritosh G

Published

May 25, 2023

Calling the Libraries.

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(stringr)
library(rmarkdown)
library(knitr)

Reading the Data

Code
df <- read.csv("posts/_data/Bullying_2018.csv", sep = ";")
Warning in file(file, "rt"): cannot open file 'posts/_data/Bullying_2018.csv':
No such file or directory
Error in file(file, "rt"): cannot open the connection
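Because this read fails, no data frame named `df` is created, and R falls back to the built-in function `stats::df` (the F-distribution density). That is why the later messages mention a 'closure' or an object of class "function". A minimal sketch of a guarded read is shown below; the helper name `read_bullying` and the temporary demo file are assumptions for illustration, and the correct path depends on the directory from which the document is rendered.

```r
# Hypothetical helper: stop with a clear message if the file is missing,
# instead of letting later chunks fail on the built-in function `df`.
read_bullying <- function(path) {
  if (!file.exists(path)) {
    stop("Data file not found: ", path,
         " (working directory: ", getwd(), ")")
  }
  read.csv(path, sep = ";")
}

# Demo on a small temporary file that mimics the ';'-separated layout.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("record;Sex;Felt_lonely", "1;Male;Never", "2;Female; "), demo_path)
demo <- read_bullying(demo_path)
str(demo)

# Without a successful read, `df` is just the F-density function.
is.function(stats::df)
```

Printing `getwd()` in the error message makes it easier to see whether a relative path such as `posts/_data/...` is being resolved from the expected folder.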

Replacing White Spaces with “NA”

Cells that contain only a single space (" ") are replaced with “NA”.

Code
df[df == " "] <- NA
Error in df == " ": comparison (==) is possible only for atomic and list types

Returning Relevant Numbers

Keeping the relevant numbers and removing strings such as “time”, “times”, “day”, “days”, “or more”, etc., as they will not be helpful in model creation.

Using both str_sub and sub

Code
df$Custom_Age <- str_sub(df$Custom_Age,1,2)
Error in df$Custom_Age: object of type 'closure' is not subsettable
Code
df$Physically_attacked <- sub(" .*", "", df$Physically_attacked)
Error in df$Physically_attacked: object of type 'closure' is not subsettable
Code
df$Physical_fighting <- sub(" .*","",df$Physical_fighting)
Error in df$Physical_fighting: object of type 'closure' is not subsettable
Code
df$Close_friends <- str_sub(df$Close_friends,1,2)
Error in df$Close_friends: object of type 'closure' is not subsettable
Code
df$Miss_school_no_permission <- sub(" .*","",df$Miss_school_no_permission)
Error in df$Miss_school_no_permission: object of type 'closure' is not subsettable
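The two extraction approaches can be compared on a few toy strings. This is only a sketch: the example responses are assumptions rather than values copied from the dataset, and base R's `substr()` is used as a stand-in for `str_sub()` so that the snippet has no package dependency.

```r
# Toy responses (assumed for illustration; not copied from the dataset).
responses <- c("0 times", "1 time", "2 or 3 times", "12 or more times", NA)

# sub(" .*", "", x) deletes everything from the first space onward.
first_token <- sub(" .*", "", responses)

# substr(x, 1, 2) keeps the first two characters, like str_sub(x, 1, 2);
# a one-digit value keeps its trailing space, which as.integer() tolerates.
first_two <- substr(responses, 1, 2)

as.integer(first_token)
as.integer(first_two)
```

Note that a range such as "2 or 3 times" is reduced to its lower bound by either approach, which may or may not be the desired coding.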

Recoding frequency adverbs to number

Columns such as “Felt_lonely”, “Other_students_kind_and_helpful”, and “Parents_understand_problems” have five levels of text responses, which are replaced with numbers as follows.

The order of the levels is meaningful: as the integer increases, the intensity increases as well.

1) “Never” <- 1,

2) “Rarely” <- 2,

3) “Sometimes” <- 3,

4) “Most of the time” <- 4,

5) “Always” <- 5.

Code
df[df == "Never"] <- 1
Error in df == "Never": comparison (==) is possible only for atomic and list types
Code
df[df == "Rarely"] <- 2
Error in df == "Rarely": comparison (==) is possible only for atomic and list types
Code
df[df == "Sometimes"] <- 3
Error in df == "Sometimes": comparison (==) is possible only for atomic and list types
Code
df[df == "Most of the time"] <- 4
Error in df == "Most of the time": comparison (==) is possible only for atomic and list types
Code
df[df == "Always"] <- 5
Error in df == "Always": comparison (==) is possible only for atomic and list types
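An alternative way to express this recoding is a named lookup vector applied only to the relevant columns, which leaves other columns untouched and yields integers directly. The sketch below runs on a toy data frame whose values are assumptions; only the mapping itself comes from the list above.

```r
# Lookup table taken from the mapping listed above.
likert_levels <- c("Never" = 1L, "Rarely" = 2L, "Sometimes" = 3L,
                   "Most of the time" = 4L, "Always" = 5L)

# Toy data frame (assumed values) with two Likert columns and one other column.
toy <- data.frame(
  Felt_lonely = c("Never", "Always", "Sometimes", NA),
  Other_students_kind_and_helpful = c("Rarely", "Most of the time", "Never", "Always"),
  Sex = c("Male", "Female", "Female", "Male"),
  stringsAsFactors = FALSE
)

# Recode only the named columns; unmatched or missing answers become NA.
likert_cols <- c("Felt_lonely", "Other_students_kind_and_helpful")
toy[likert_cols] <- lapply(toy[likert_cols], function(x) unname(likert_levels[x]))
str(toy)
```

Restricting the recode to named columns avoids accidentally changing a cell in an unrelated column that happens to contain the same text.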

Replacing “Yes” with 1 and “No” with 0 in the columns

“Bullied_on_school_property_in_past_12_months”, “Bullied_not_on_school_property_in_past_12_months”, “Cyber_bullied_in_past_12_months”, “Most_of_the_time_or_always_felt_lonely”, “Missed_classes_or_school_without_permission”.

“1” stands for “Yes” (for the bullying columns, bullied)

“0” stands for “No” (for the bullying columns, not bullied)

Code
df[df == "Yes"] <- 1
Error in df == "Yes": comparison (==) is possible only for atomic and list types
Code
df[df == "No"] <- 0
Error in df == "No": comparison (==) is possible only for atomic and list types

Converting columns to factors, depending upon the requirement of the model

Code
df$Bullied_on_school_property_in_past_12_months <- as.factor(df$Bullied_on_school_property_in_past_12_months)
Error in df$Bullied_on_school_property_in_past_12_months: object of type 'closure' is not subsettable
Code
df$Bullied_not_on_school_property_in_past_12_months <- as.factor(df$Bullied_not_on_school_property_in_past_12_months)
Error in df$Bullied_not_on_school_property_in_past_12_months: object of type 'closure' is not subsettable
Code
df$Cyber_bullied_in_past_12_months <- as.factor(df$Cyber_bullied_in_past_12_months)
Error in df$Cyber_bullied_in_past_12_months: object of type 'closure' is not subsettable
Code
df$Sex <- as.factor(df$Sex)
Error in df$Sex: object of type 'closure' is not subsettable
Code
df$Most_of_the_time_or_always_felt_lonely <- as.factor(df$Most_of_the_time_or_always_felt_lonely)
Error in df$Most_of_the_time_or_always_felt_lonely: object of type 'closure' is not subsettable
Code
df$Missed_classes_or_school_without_permission <- as.factor(df$Missed_classes_or_school_without_permission)
Error in df$Missed_classes_or_school_without_permission: object of type 'closure' is not subsettable

Converting columns to integers, depending upon the requirement of the model.

Code
df$Custom_Age <- as.integer(df$Custom_Age)
Error in df$Custom_Age: object of type 'closure' is not subsettable
Code
df$Physically_attacked <- as.integer(df$Physically_attacked)
Error in df$Physically_attacked: object of type 'closure' is not subsettable
Code
df$Physical_fighting <- as.integer(df$Physical_fighting)
Error in df$Physical_fighting: object of type 'closure' is not subsettable
Code
df$Close_friends <- as.integer(df$Close_friends)
Error in df$Close_friends: object of type 'closure' is not subsettable
Code
df$Miss_school_no_permission <- as.integer(df$Miss_school_no_permission)
Error in df$Miss_school_no_permission: object of type 'closure' is not subsettable
Code
df$Felt_lonely <- as.integer(df$Felt_lonely)
Error in df$Felt_lonely: object of type 'closure' is not subsettable
Code
df$Other_students_kind_and_helpful <- as.integer(df$Other_students_kind_and_helpful)
Error in df$Other_students_kind_and_helpful: object of type 'closure' is not subsettable
Code
df$Parents_understand_problems <- as.integer(df$Parents_understand_problems)
Error in df$Parents_understand_problems: object of type 'closure' is not subsettable
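The repeated one-column conversions can also be written once per target type. The following is a sketch on a toy data frame with assumed values; the column names are taken from the steps above, and only a subset is shown.

```r
# Toy data frame after the text clean-up step (values are assumptions).
toy <- data.frame(
  Sex = c("Male", "Female", "Female"),
  Cyber_bullied_in_past_12_months = c("1", "0", "1"),
  Custom_Age = c("13", "15", "11"),
  Felt_lonely = c("1", "5", NA),
  stringsAsFactors = FALSE
)

factor_cols  <- c("Sex", "Cyber_bullied_in_past_12_months")
integer_cols <- c("Custom_Age", "Felt_lonely")

# Apply one conversion to each group of columns instead of repeating lines.
toy[factor_cols]  <- lapply(toy[factor_cols], as.factor)
toy[integer_cols] <- lapply(toy[integer_cols], as.integer)

sapply(toy, class)
```

Keeping the column names in two vectors makes it easy to see at a glance which variables the model treats as categorical and which as numeric.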

Deleting the columns which seem irrelevant to the model or which have a high number of missing values

The columns we are deleting represent whether the student is overweight, underweight, or obese. They are deleted for the following reasons:

1) There are multiple responses where the entries in all 3 columns are “NA”, so it is not possible to interpret them as yes/no.

2) There are multiple responses where the entries in all 3 columns are “No”, and we do not have a 4th option to interpret them.

3) If we keep these 3 columns and delete all rows which contain at least one “NA”, we will have to delete about 20k rows out of 56981. If we do the same after deleting the 3 columns, only about 7k rows containing at least one “NA” need to be deleted.

Code
df <- df[, -c(16:18)] 
Error in df[, -c(16:18)]: object of type 'closure' is not subsettable

Deleting all rows that contain “NA” in columns 2 to 15

Code
df <- df[complete.cases(df),]
Error in complete.cases(df): invalid 'type' (closure) of argument
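The trade-off between keeping sparse columns and keeping rows can be checked by counting incomplete rows before and after the drop. This is a sketch on toy data: the values and the two weight-related column names are hypothetical, and the columns are dropped by name rather than by position as in the code above.

```r
# Toy data (assumed): the last two columns are mostly missing.
toy <- data.frame(
  Sex         = c("Male", "Female", NA, "Male", "Female"),
  Felt_lonely = c(1L, 2L, 3L, 4L, 5L),
  Were_underweight = c(NA, "No", NA, NA, "No"),   # hypothetical column name
  Were_overweight  = c(NA, "No", NA, "Yes", NA),  # hypothetical column name
  stringsAsFactors = FALSE
)

# Rows lost if every column must be complete.
sum(!complete.cases(toy))

# Rows lost after dropping the sparse columns by name rather than by position.
sparse_cols <- c("Were_underweight", "Were_overweight")
reduced <- toy[, !(names(toy) %in% sparse_cols)]
sum(!complete.cases(reduced))

# Missing values per column, useful for deciding what to drop.
colSums(is.na(toy))
```

Dropping by name is less fragile than an index range such as `16:18`, which silently removes the wrong columns if the column order ever changes.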

Visualizations

Analysing the gender-wise pattern for students who experienced bullying in the past 12 months

Code
df %>% 
  select(Bullied_on_school_property_in_past_12_months,Sex) %>% 
  ggplot(aes(x = Bullied_on_school_property_in_past_12_months,fill = Sex)) +
  geom_bar() +
  scale_y_continuous(limits = c(0,42000), breaks = seq(0,42000,by=3000)) +
  geom_hline(yintercept = c(4500), linetype = "dashed", color = "Brown")
Error in UseMethod("select"): no applicable method for 'select' applied to an object of class "function"
  • About 23.26% of females experienced bullying (6219 out of 26732), while among males 4453 out of 24022 experienced bullying, which is about 18.54%.
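The percentages can be recomputed directly from the counts quoted above. The snippet below is a small arithmetic sketch; the vector names are chosen for illustration only.

```r
# Counts reported in the text above.
bullied <- c(Female = 6219, Male = 4453)
total   <- c(Female = 26732, Male = 24022)

# Share of each sex that reported being bullied, in percent.
round(100 * bullied / total, 2)

# Overall share across both sexes.
round(100 * sum(bullied) / sum(total), 2)
```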

Tables showing the numbers plotted in the visualisation above.

Code
df %>% 
  count(Sex) 
Error in UseMethod("count"): no applicable method for 'count' applied to an object of class "function"
Code
df %>% 
  select(Bullied_on_school_property_in_past_12_months,Sex) %>% 
  filter(Bullied_on_school_property_in_past_12_months == 1) %>% 
  count(Sex)
Error in UseMethod("select"): no applicable method for 'select' applied to an object of class "function"
Code
df %>% 
ggplot(aes(fill = Bullied_on_school_property_in_past_12_months, x = Custom_Age)) +
  geom_bar() +
  scale_x_continuous(breaks = seq(from = 10, to = 19, by = 1)) +
  scale_y_continuous(breaks = seq(from = 0, to = 12000, by = 500))
Error in `ggplot()`:
! `data` cannot be a function.
ℹ Have you misspelled the `data` argument in `ggplot()`

From age 12 to 15, over 20% of students overall have experienced bullying, and at age 16 about 19.45% of students have experienced bullying. The bar for age 12 is not visible, as only a very small number of responses were present for age 12.

Tables showing the values plotted in the visualisation above.

Code
df %>% 
 count(Custom_Age)
Error in UseMethod("count"): no applicable method for 'count' applied to an object of class "function"
Code
df %>% 
  select(Bullied_on_school_property_in_past_12_months,Custom_Age) %>% 
  filter(Bullied_on_school_property_in_past_12_months == 1) %>% 
  count(Custom_Age)
Error in UseMethod("select"): no applicable method for 'select' applied to an object of class "function"
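The two count tables can be combined into a share per age group in one step. This is a sketch using base R on a toy data frame; the values and the shortened column name `Bullied` are assumptions.

```r
# Toy data (assumed values): age and a 0/1 bullying indicator.
toy <- data.frame(
  Custom_Age = c(13, 13, 14, 14, 14, 16, 16, 16, 16, 16),
  Bullied    = c(1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
)

# Counts per age and outcome.
counts <- table(toy$Custom_Age, toy$Bullied)

# Row-wise proportions: within each age, the share in each outcome.
shares <- prop.table(counts, margin = 1)
round(100 * shares[, "1"], 1)
```

Using `margin = 1` normalises within each age, so the result reads as the percentage of students of that age who were bullied.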

Model Building

  • Building a model using every variable that seems even slightly logical
Code
logistic <- glm(Bullied_on_school_property_in_past_12_months ~ Custom_Age + Sex + Felt_lonely + Close_friends + Miss_school_no_permission + Other_students_kind_and_helpful + Parents_understand_problems + Most_of_the_time_or_always_felt_lonely, family = binomial, data = df)
Error in model.frame.default(formula = Bullied_on_school_property_in_past_12_months ~ : 'data' must be a data.frame, environment, or list
Code
summary(logistic)
Error in summary(logistic): object 'logistic' not found
  • Getting rid of the variables which are insignificant: “Sex”, “Parents_understand_problems”, “Most_of_the_time_or_always_felt_lonely”.
Code
logistic_U1 <- glm(Bullied_on_school_property_in_past_12_months ~ Custom_Age + Felt_lonely + Close_friends + Miss_school_no_permission + Other_students_kind_and_helpful, family = binomial, data = df)
Error in model.frame.default(formula = Bullied_on_school_property_in_past_12_months ~ : 'data' must be a data.frame, environment, or list
Code
summary(logistic_U1)
Error in summary(logistic_U1): object 'logistic_U1' not found
  • The AIC score drops by 3 units after removing 3 variables.

  • Removing the variables Miss_school_no_permission and Close_friends.

Code
logistic_U2 <- glm(Bullied_on_school_property_in_past_12_months ~  Felt_lonely + Custom_Age + Other_students_kind_and_helpful, family = binomial, data = df)
Error in model.frame.default(formula = Bullied_on_school_property_in_past_12_months ~ : 'data' must be a data.frame, environment, or list
Code
summary(logistic_U2)
Error in summary(logistic_U2): object 'logistic_U2' not found
  • Removing the variable Custom_Age. The remaining variables are the ones with the lowest p-values; Custom_Age had an equally low p-value, and the reason for removing both variables (Custom_Age and Parents_understand_problems) is discussed below.
Code
logistic_U3 <- glm(Bullied_on_school_property_in_past_12_months ~  Felt_lonely +  Other_students_kind_and_helpful, family = binomial, data = df)
Error in model.frame.default(formula = Bullied_on_school_property_in_past_12_months ~ : 'data' must be a data.frame, environment, or list
Code
summary(logistic_U3)
Error in summary(logistic_U3): object 'logistic_U3' not found
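The step-by-step removal of variables can be checked by comparing nested models with AIC and a likelihood-ratio test. The sketch below uses simulated data, not the survey data; the `noise` predictor and the coefficients used for simulation are assumptions made for illustration.

```r
set.seed(1)  # simulated data only; not the survey data
n <- 500
sim <- data.frame(
  Felt_lonely = sample(1:5, n, replace = TRUE),
  Other_students_kind_and_helpful = sample(1:5, n, replace = TRUE),
  noise = rnorm(n)  # hypothetical predictor with no true effect
)
# Simulate the outcome from a known logistic relationship.
p <- plogis(-2 + 0.4 * sim$Felt_lonely - 0.2 * sim$Other_students_kind_and_helpful)
sim$Bullied <- rbinom(n, size = 1, prob = p)

full    <- glm(Bullied ~ Felt_lonely + Other_students_kind_and_helpful + noise,
               family = binomial, data = sim)
reduced <- glm(Bullied ~ Felt_lonely + Other_students_kind_and_helpful,
               family = binomial, data = sim)

# Lower AIC is preferred; the penalty is 2 per estimated coefficient.
AIC(full, reduced)

# Likelihood-ratio test for the dropped term.
anova(reduced, full, test = "Chisq")
```

Since each coefficient adds 2 to the AIC, dropping a term lowers the AIC by at most 2 per term and raises it whenever the term reduced the deviance by more than 2.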

Two variables, Custom_Age and Parents_understand_problems, might be statistically significant, but in practice they might not be logical.

Custom_Age: As we saw above, most victims of bullying are found in the age group from about 12 to 16, which is in line with most research papers. Secondly, as students grow older, around 17 to 18, they learn to differentiate right from wrong: the victim starts acknowledging the bullying, and the bully starts to realize that the behavior might hinder their growth in society. Thirdly, our data is female-focused and most likely reflects female-to-female bullying; this form of bullying involves isolating, threatening, and passing on false information, which could occur at any age.

Parents_understand_problems: This variable is statistically significant, but logically it is unlikely to be applicable, for the following reasons.

Firstly, whether or not parents understand the problems, it would not protect students from getting bullied, as the act is carried out by a third party in the absence of the parents.

Secondly, suppose the student complains and the parents act by raising the issue with the concerned authorities. Irrespective of the action taken by the authorities against the bully, our dataset only records whether the victim experienced bullying within the past 12 months or longer ago, without taking note of the frequency. Meanwhile, whether the parents fail to understand the problem of bullying or some other problem faced by the child is not clearly defined in the dataset.

Thirdly, suppose the parents understand that their child is being bullied and raise a complaint with the teacher, but the teacher fails to solve the problem. There are then not many options apart from changing class or school.

Fourthly, there is not much clarity on what type of problems the parents did not understand, and even if they acknowledge the problem of bullying, they cannot do much about it. Hence the variable is omitted.

  • Using the same variables, Felt_lonely and Other_students_kind_and_helpful, to build two other models, for “Cyber_bullied_in_past_12_months” and “Bullied_not_on_school_property_in_past_12_months”.
Code
logistic_2_a <- glm(Bullied_not_on_school_property_in_past_12_months ~  Felt_lonely +  Other_students_kind_and_helpful, family = binomial, data = df)
Error in model.frame.default(formula = Bullied_not_on_school_property_in_past_12_months ~ : 'data' must be a data.frame, environment, or list
Code
summary(logistic_2_a)
Error in summary(logistic_2_a): object 'logistic_2_a' not found
  • Keep in mind that the variable “Bullied_not_on_school_property_in_past_12_months” is 1 (yes/true) when a student was bullied outside of school premises, or inside school premises more than 12 months ago.
Code
logistic_3_a <- glm(Cyber_bullied_in_past_12_months ~  Felt_lonely +  Other_students_kind_and_helpful, family = binomial, data = df)
Error in model.frame.default(formula = Cyber_bullied_in_past_12_months ~ : 'data' must be a data.frame, environment, or list
Code
summary(logistic_3_a)
Error in summary(logistic_3_a): object 'logistic_3_a' not found
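`summary()` of a binomial `glm` reports coefficients on the log-odds scale, which can be hard to read. Exponentiating them gives odds ratios. The sketch below uses simulated data, not the survey data, and the simulation coefficients are assumptions.

```r
set.seed(42)  # simulated data only; not the survey data
n <- 1000
sim <- data.frame(Felt_lonely = sample(1:5, n, replace = TRUE))
sim$Bullied <- rbinom(n, 1, plogis(-2.5 + 0.5 * sim$Felt_lonely))

fit <- glm(Bullied ~ Felt_lonely, family = binomial, data = sim)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios.
odds_ratios <- exp(coef(fit))

# Wald 95% intervals, also exponentiated.
intervals <- exp(confint.default(fit))

cbind(odds_ratio = odds_ratios, intervals)
```

An odds ratio above 1 for a predictor means the odds of the outcome rise with each one-step increase in that predictor; below 1 means they fall.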

Plotting models

Target variable is “Bullied_on_school_property_in_past_12_months”

Code
logistic_1.a <- glm(Bullied_on_school_property_in_past_12_months ~ Felt_lonely, family = binomial, data = df)
Error in model.frame.default(formula = Bullied_on_school_property_in_past_12_months ~ : 'data' must be a data.frame, environment, or list
Code
x <- seq(min(df$Felt_lonely), max(df$Felt_lonely), length.out=100)
Error in df$Felt_lonely: object of type 'closure' is not subsettable
Code
y <- predict(logistic_1.a, newdata=data.frame(Felt_lonely=x), type="response")
Error in predict(logistic_1.a, newdata = data.frame(Felt_lonely = x), : object 'logistic_1.a' not found
Code
plot(x, y, type="l", xlab="Felt lonely", ylab="Probability of being bullied")
Error in plot(x, y, type = "l", xlab = "Felt lonely", ylab = "Probability of being bullied"): object 'x' not found
  • The kid who never felt lonely had a lower probability of getting bullied than the one who always felt lonely. (1 -> Never, 2 -> Rarely, 3 -> Sometimes, 4 -> Most of the time, 5 -> Always.)
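Since the predictor only takes the values 1 to 5, the fitted probabilities can also be read off at those five levels instead of from a curve. This is a sketch on simulated data, not the survey data; the simulation coefficients are assumptions.

```r
set.seed(7)  # simulated data only; not the survey data
n <- 1000
sim <- data.frame(Felt_lonely = sample(1:5, n, replace = TRUE))
sim$Bullied <- rbinom(n, 1, plogis(-2.5 + 0.5 * sim$Felt_lonely))
fit <- glm(Bullied ~ Felt_lonely, family = binomial, data = sim)

# Predict at the five answer levels rather than on a fine grid.
levels_df <- data.frame(Felt_lonely = 1:5)
levels_df$Probability <- predict(fit, newdata = levels_df, type = "response")

# The same numbers by hand: inverse logit of intercept + slope * level.
by_hand <- plogis(coef(fit)[1] + coef(fit)[2] * levels_df$Felt_lonely)

levels_df
```

The hand calculation shows what `type = "response"` does: it applies the inverse logit to the linear predictor, so the probabilities move monotonically in the direction of the slope's sign.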
Code
logistic_1.b <- glm(Bullied_on_school_property_in_past_12_months ~ Other_students_kind_and_helpful, family = binomial, data = df)
Error in model.frame.default(formula = Bullied_on_school_property_in_past_12_months ~ : 'data' must be a data.frame, environment, or list
Code
x <- seq(min(df$Other_students_kind_and_helpful), max(df$Other_students_kind_and_helpful), length.out=100)
Error in df$Other_students_kind_and_helpful: object of type 'closure' is not subsettable
Code
y <- predict(logistic_1.b, newdata=data.frame(Other_students_kind_and_helpful=x), type="response")
Error in predict(logistic_1.b, newdata = data.frame(Other_students_kind_and_helpful = x), : object 'logistic_1.b' not found
Code
plot(x, y, type="l", xlab="Other_Students_kind_and_helpful", ylab="Probability of being bullied")
Error in plot(x, y, type = "l", xlab = "Other_Students_kind_and_helpful", : object 'x' not found
  • A child to whom other students were “never” kind and helpful had a higher probability of being a victim of bullying than one to whom other students were “always” kind and helpful. (1 -> Never, 2 -> Rarely, 3 -> Sometimes, 4 -> Most of the time, 5 -> Always.)

Target variable is “Bullied_not_on_school_property_in_past_12_months”

  • Keep in mind that the variable “Bullied_not_on_school_property_in_past_12_months” is 1 (yes/true) when a student was bullied outside of school premises, or inside school premises more than 12 months ago.
Code
logistic_2_a <- glm(Bullied_not_on_school_property_in_past_12_months ~ Felt_lonely , family = binomial, data = df)
Error in model.frame.default(formula = Bullied_not_on_school_property_in_past_12_months ~ : 'data' must be a data.frame, environment, or list
Code
summary(logistic_2_a)
Error in summary(logistic_2_a): object 'logistic_2_a' not found
Code
x <- seq(min(df$Felt_lonely), max(df$Felt_lonely), length.out=100)
Error in df$Felt_lonely: object of type 'closure' is not subsettable
Code
y <- predict(logistic_2_a, newdata=data.frame(Felt_lonely=x), type="response")
Error in predict(logistic_2_a, newdata = data.frame(Felt_lonely = x), : object 'logistic_2_a' not found
Code
plot(x, y, type="l", xlab="Felt lonely", ylab="Probability of being bullied")
Error in plot(x, y, type = "l", xlab = "Felt lonely", ylab = "Probability of being bullied"): object 'x' not found
  • The kid who never felt lonely had a lower probability of getting bullied than the one who always felt lonely. (1 -> Never, 2 -> Rarely, 3 -> Sometimes, 4 -> Most of the time, 5 -> Always.)
Code
logistic_2_b <- glm(Bullied_not_on_school_property_in_past_12_months ~ Other_students_kind_and_helpful , family = binomial, data = df)
Error in model.frame.default(formula = Bullied_not_on_school_property_in_past_12_months ~ : 'data' must be a data.frame, environment, or list
Code
summary(logistic_2_b)
Error in summary(logistic_2_b): object 'logistic_2_b' not found
Code
# Evaluate the model on an evenly spaced grid of predictor values
kind_grid <- seq(min(df$Other_students_kind_and_helpful), max(df$Other_students_kind_and_helpful), length.out=100)
predictions <- data.frame(Other_students_kind_and_helpful = kind_grid,
Probability = predict(logistic_2_b, newdata=data.frame(Other_students_kind_and_helpful=kind_grid), type="response"))
Error in df$Other_students_kind_and_helpful: object of type 'closure' is not subsettable
Code
#Plot the data and the predicted probabilities
ggplot(predictions, aes(x=Other_students_kind_and_helpful, y=Probability)) +
geom_point() +
geom_smooth(method="glm", se=TRUE, method.args = list(family=binomial)) +
xlab("Other students kind and helpful") +
ylab("Probability of being bullied") +
scale_y_continuous(limits = c(0,1), breaks = seq(0,1,0.1))
Error in ggplot(predictions, aes(x = Other_students_kind_and_helpful, : object 'predictions' not found
  • A child to whom other students were “never” kind and helpful had a higher probability of being a victim of bullying than one to whom other students were “always” kind and helpful. (1 -> Never, 2 -> Rarely, 3 -> Sometimes, 4 -> Most of the time, 5 -> Always.)

Using ggplot to plot the “Felt_lonely” variable from logistic_2_a; earlier this was done using the plot function.

Code
logistic_2_c <- glm(Bullied_not_on_school_property_in_past_12_months ~ Felt_lonely , family = binomial, data = df)
Error in model.frame.default(formula = Bullied_not_on_school_property_in_past_12_months ~ : 'data' must be a data.frame, environment, or list
Code
summary(logistic_2_c)
Error in summary(logistic_2_c): object 'logistic_2_c' not found
Code
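# Evaluate the model on an evenly spaced grid of predictor values
lonely_grid <- seq(min(df$Felt_lonely), max(df$Felt_lonely), length.out=100)
predictions <- data.frame(Felt_lonely = lonely_grid,
Probability = predict(logistic_2_c, newdata=data.frame(Felt_lonely=lonely_grid), type="response"))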
Error in df$Felt_lonely: object of type 'closure' is not subsettable
Code
#Plot the data and the predicted probabilities
ggplot(predictions, aes(x=Felt_lonely, y=Probability)) +
geom_point() +
geom_smooth(method="glm", se=TRUE, method.args = list(family=binomial)) +
xlab("Felt lonely") +
ylab("Probability of being bullied") +
scale_y_continuous(limits = c(0,1), breaks = seq(0,1,0.1))
Error in ggplot(predictions, aes(x = Felt_lonely, y = Probability)): object 'predictions' not found
  • The kid who never felt lonely had a lower probability of getting bullied than the one who always felt lonely. (1 -> Never, 2 -> Rarely, 3 -> Sometimes, 4 -> Most of the time, 5 -> Always.)

  • The plots for cyber bullying provide the same information as the plots for the previous 2 target variables, so they are not shown. The plots have also helped us see the individual effects of the levels, or frequency adverbs (1 -> Never, 2 -> Rarely, 3 -> Sometimes, 4 -> Most of the time, 5 -> Always), for the variables “Felt_lonely” and “Other_students_kind_and_helpful”. Hence, it is not required to convert these two variables to factors and create a model and summary again to learn about the effects of individual levels.