Final Project Part-1

Author

Niharika Pola

Published

December 1, 2022

Code
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.2
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(ggplot2)
library(readr)
library(sqldf)
Warning: package 'sqldf' was built under R version 4.2.2
Loading required package: gsubfn
Warning: package 'gsubfn' was built under R version 4.2.2
Loading required package: proto
Warning: package 'proto' was built under R version 4.2.2
Loading required package: RSQLite
Warning: package 'RSQLite' was built under R version 4.2.2
Code
library(data.table)
Warning: package 'data.table' was built under R version 4.2.2

Attaching package: 'data.table'

The following objects are masked from 'package:dplyr':

    between, first, last

The following object is masked from 'package:purrr':

    transpose

Background/Motivation

What does it take for a country or a continent to be happy? Is it the economy, life expectancy, freedom, or trust in the government? Which factors affect a country's or a continent's overall happiness? Can we predict the happiness score? Curiosity about these questions led me to explore the World Happiness data for 2022.

“This year marks the 10th anniversary of the World Happiness Report, which uses global survey data to report how people evaluate their own lives in more than 150 countries worldwide. The World Happiness Report 2022 reveals a bright light in dark times. The pandemic brought not only pain and suffering but also an increase in social support and benevolence. As we battle the ills of disease and war, it is essential to remember the universal desire for happiness and the capacity of individuals to rally to each other’s support in times of great need.” - World Happiness Report 2022

Research Question

The World Happiness data measures the happiness of each country's population and summarizes it as a score that indicates how happy that population is.

The data set relates happiness to several variables: GDP per capita, freedom to make life choices, healthy life expectancy, perceptions of corruption, generosity, and social support.

In this study, I aim to answer the following research questions:

  1. Which variables or factors affect the world's happiness, with a focus on individual countries and continents? This includes analyzing the correlations among the most influential variables.
  2. Which model predicts the happiness score most accurately?

Hypothesis

I wish to test the following hypotheses:

  1. Better economy of a country would lead to happiness
  2. Longer life expectancy would lead to happiness
  3. Having family/social support leads to happiness
  4. Freedom leads to happiness
  5. People’s trust in the Government leads to happiness
  6. Generosity leads to happiness

Model

I will first look at each explanatory variable individually and then put all of the variables together in a single model.
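A minimal sketch of what "individually, then together" could look like with linear regression. This is an assumption about the modeling approach, not the final model of this project. The data frame `toy` holds made-up numbers, not values from the 2022 dataset, and the column names follow the renamed variables used later in this report.

```r
# Toy data: six made-up rows, NOT values from the 2022 dataset.
toy <- data.frame(
  Happiness.Score = c(7.8, 7.1, 6.3, 5.4, 4.6, 3.9),
  Economy         = c(1.9, 1.7, 1.5, 1.1, 0.9, 0.6),
  Freedom         = c(0.74, 0.65, 0.70, 0.50, 0.55, 0.30)
)

# Step 1: one explanatory variable at a time
fit_economy <- lm(Happiness.Score ~ Economy, data = toy)
fit_freedom <- lm(Happiness.Score ~ Freedom, data = toy)

# Step 2: all explanatory variables together
fit_all <- lm(Happiness.Score ~ Economy + Freedom, data = toy)

# Collect R-squared so the three fits can be compared side by side
r2 <- c(
  economy = summary(fit_economy)$r.squared,
  freedom = summary(fit_freedom)$r.squared,
  all     = summary(fit_all)$r.squared
)
print(round(r2, 3))
```

R-squared never decreases when a predictor is added, so a comparison between the single-variable and combined models would normally rely on adjusted R-squared or a held-out test set instead.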

Data Preparation

Reading the data set

Code
primary <- read.csv("project datasets/2022.csv")
head(primary)
Code
str(primary)
'data.frame':   147 obs. of  15 variables:
 $ RANK                                      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Country                                   : chr  "Finland" "Denmark" "Iceland" "Switzerland" ...
 $ Happiness.score                           : num  7821 7636 7557 7512 7415 ...
 $ Whisker.high                              : num  7.89 7.71 7.65 7.59 7.47 7.5 7.45 7.44 7.43 7.28 ...
 $ Whisker.high.1                            : num  7886 7710 7651 7586 7471 ...
 $ Whisker.low                               : num  7.76 7.56 7.46 7.44 7.36 7.31 7.32 7.29 7.3 7.12 ...
 $ Whisker.low.1                             : num  7756 7563 7464 7437 7359 ...
 $ Dystopia..1.83....residual                : chr  "2518.00" "2226.00" "2320.00" "2153.00" ...
 $ Explained.by..GDP.per.capita              : chr  "1892.00" "1953.00" "1936.00" "2026.00" ...
 $ Explained.by..Social.support              : chr  "1258.00" "1243.00" "1320.00" "1226.00" ...
 $ Explained.by..Healthy.life.expectancy     : chr  "0,775" "0,777" "0,803" "0,822" ...
 $ Explained.by..Freedom.to.make.life.choices: chr  "0,736" "0,719" "0,718" "0,677" ...
 $ Explained.by..Generosity                  : chr  "0,109" "0,188" "0,270" "0,147" ...
 $ Explained.by..Perceptions.of.corruption   : chr  "0,534" "0,532" "0,191" "0,461" ...
 $ X                                         : num  0.78 NA NA NA NA NA NA NA NA NA ...
Code
str(primary$Country)
 chr [1:147] "Finland" "Denmark" "Iceland" "Switzerland" "Netherlands" ...
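The `str(primary)` output shows two things that need attention before any modeling: several columns were read as character with a decimal comma (for example "0,775"), and other values appear multiplied by 1000 (a happiness score of 7821, although the score is measured on a scale of 0 to 10). A minimal sketch of one possible repair follows. It assumes that every value above 10 is a number scaled by 1000, which should be checked against the raw file. `fix_scaled_decimal` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: turn "0,775" / "1892.00" / 7821 into 0.775 / 1.892 / 7.821.
# ASSUMPTION: genuine values never exceed 10, so anything larger was scaled by 1000.
fix_scaled_decimal <- function(x) {
  # Replace a decimal comma with a decimal point, then convert to numeric
  x <- as.numeric(sub(",", ".", as.character(x), fixed = TRUE))
  # Undo the factor of 1000 where it was evidently applied; NA stays NA
  ifelse(!is.na(x) & x > 10, x / 1000, x)
}

# Values taken from the first row of the str() output above
cleaned <- fix_scaled_decimal(c("0,775", "1892.00", "7821", NA))
print(cleaned)
```

If the file was exported with a decimal comma throughout, reading it with the `dec = ","` argument of `read.csv()` may be a simpler alternative, but that does not address the scaled values.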

The dataset I have chosen is the World Happiness 2022 dataset from Kaggle. It gives the happiness rank and happiness score of 147 countries around the world, together with seven factors: GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, perceptions of corruption, and the Dystopia residual. A higher value of each factor corresponds to a higher level of happiness. Dystopia is the opposite of utopia and has the lowest happiness level. It serves as a reference point that shows how far each country is from the least happy country imaginable.

Source of the data: the World Happiness Report 2022 uses data from the Gallup World Poll surveys conducted from 2019 to 2021. The scores are based on answers to the main life evaluation question asked in the poll.

Some of the variable names are not clear enough, so I decided to rename several of them. I will also remove the whisker low and whisker high variables, because they give only the lower and upper bounds of the confidence interval for the happiness score and are not needed for visualization or prediction.

The next step is to add a continent column to the dataset. I want to compare continents to discover whether different factors play a significant role in achieving a higher happiness score in each of them. The six continents in this dataset are Asia, Africa, North America, South America, Europe, and Australia. I then move the continent column to the second position, because the dataset reads better with this arrangement. Finally, the continent variable should be a factor so that it is easier to work with in visualizations.

Preparation of the data

Code
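# Changing the name of columns
# Note: str(primary) reports 15 columns, starting with RANK and then Country.
# This vector supplies 12 names in a different order, so the labels are shifted
# against the data and the last three columns are left with NA names.
colnames(primary) <- c("Country", "Happiness.Rank", "Happiness.Score",
                       "Whisker.High", "Whisker.Low", "Economy", "Family",
                       "Life.Expectancy", "Freedom", "Generosity",
                       "Trust", "Dystopia.Residual")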


# Country: Name of countries
# Happiness.Rank: Rank of the country based on the Happiness Score
# Happiness.Score: Happiness measurement on a scale of 0 to 10
# Whisker.High: Upper confidence interval of happiness score
# Whisker.Low: Lower confidence interval of happiness score
# Economy: The value of all final goods and services produced within a nation in a given year
# Family: Importance of having a family
# Life.Expectancy: Importance of health and amount of time people expect to live
# Freedom: Importance of freedom in each country
# Generosity: The quality of being kind and generous
# Trust: Perception of corruption in a government
# Dystopia.Residual: Plays as a reference

# Deleting unnecessary columns (Whisker.high and Whisker.low)

primary <- primary[, -c(4,5)]
Code
primary$Continent <- NA

primary$Continent[which(primary$Country %in% c("Israel", "United Arab Emirates", "Singapore", "Thailand", "Taiwan Province of China",
                                   "Qatar", "Saudi Arabia", "Kuwait", "Bahrain", "Malaysia", "Uzbekistan", "Japan",
                                   "South Korea", "Turkmenistan", "Kazakhstan", "Turkey", "Hong Kong S.A.R., China", "Philippines",
                                   "Jordan", "China", "Pakistan", "Indonesia", "Azerbaijan", "Lebanon", "Vietnam",
                                   "Tajikistan", "Bhutan", "Kyrgyzstan", "Nepal", "Mongolia", "Palestinian Territories",
                                   "Iran", "Bangladesh", "Myanmar", "Iraq", "Sri Lanka", "Armenia", "India", "Georgia",
                                   "Cambodia", "Afghanistan", "Yemen", "Syria"))] <- "Asia"
primary$Continent[which(primary$Country %in% c("Norway", "Denmark", "Iceland", "Switzerland", "Finland",
                                   "Netherlands", "Sweden", "Austria", "Ireland", "Germany",
                                   "Belgium", "Luxembourg", "United Kingdom", "Czech Republic",
                                   "Malta", "France", "Spain", "Slovakia", "Poland", "Italy",
                                   "Russia", "Lithuania", "Latvia", "Moldova", "Romania",
                                   "Slovenia", "North Cyprus", "Cyprus", "Estonia", "Belarus",
                                   "Serbia", "Hungary", "Croatia", "Kosovo", "Montenegro",
                                   "Greece", "Portugal", "Bosnia and Herzegovina", "Macedonia",
                                   "Bulgaria", "Albania", "Ukraine"))] <- "Europe"
primary$Continent[which(primary$Country %in% c("Canada", "Costa Rica", "United States", "Mexico",  
                                   "Panama","Trinidad and Tobago", "El Salvador", "Belize", "Guatemala",
                                   "Jamaica", "Nicaragua", "Dominican Republic", "Honduras",
                                   "Haiti"))] <- "North America"
primary$Continent[which(primary$Country %in% c("Chile", "Brazil", "Argentina", "Uruguay",
                                   "Colombia", "Ecuador", "Bolivia", "Peru",
                                   "Paraguay", "Venezuela"))] <- "South America"
primary$Continent[which(primary$Country %in% c("New Zealand", "Australia"))] <- "Australia"
primary$Continent[which(is.na(primary$Continent))] <- "Africa"

view(primary)

# Moving the continent column's position in the dataset to the second column

primary <- primary %>% select(Country,Continent, everything())
Error in `select()`:
! Names repair functions can't return `NA` values.
Code
#Renaming the final dataframe to happy

happy <- primary
str(happy)
'data.frame':   147 obs. of  14 variables:
 $ Country          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Happiness.Rank   : chr  "Finland" "Denmark" "Iceland" "Switzerland" ...
 $ Happiness.Score  : num  7821 7636 7557 7512 7415 ...
 $ Economy          : num  7.76 7.56 7.46 7.44 7.36 7.31 7.32 7.29 7.3 7.12 ...
 $ Family           : num  7756 7563 7464 7437 7359 ...
 $ Life.Expectancy  : chr  "2518.00" "2226.00" "2320.00" "2153.00" ...
 $ Freedom          : chr  "1892.00" "1953.00" "1936.00" "2026.00" ...
 $ Generosity       : chr  "1258.00" "1243.00" "1320.00" "1226.00" ...
 $ Trust            : chr  "0,775" "0,777" "0,803" "0,822" ...
 $ Dystopia.Residual: chr  "0,736" "0,719" "0,718" "0,677" ...
 $ NA               : chr  "0,109" "0,188" "0,270" "0,147" ...
 $ NA.1             : chr  "0,534" "0,532" "0,191" "0,461" ...
 $ NA.2             : num  0.78 NA NA NA NA NA NA NA NA NA ...
 $ Continent        : chr  "Africa" "Africa" "Africa" "Africa" ...
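The `select()` error and the `str(happy)` output above both trace back to the renaming step: the name vector is shorter than the data frame and in a different order, so `Country` ends up holding the rank, some columns are named `NA`, and every row falls through to "Africa". A minimal sketch of one way to avoid this is to rename by original column name instead of by position. The `toy` data frame is a hypothetical stand-in that reuses the first three rows and a subset of the columns reported by `str(primary)`, with the values rescaled to the 0 to 10 range as an assumption. `new_names` and `continent_of` are illustrative names.

```r
# Hypothetical stand-in for the first rows of 2022.csv (subset of columns).
toy <- data.frame(
  RANK                         = 1:3,
  Country                      = c("Finland", "Denmark", "Iceland"),
  Happiness.score              = c(7.821, 7.636, 7.557),
  Whisker.high                 = c(7.89, 7.71, 7.65),
  Whisker.low                  = c(7.76, 7.56, 7.46),
  Explained.by..GDP.per.capita = c(1.892, 1.953, 1.936),
  stringsAsFactors = FALSE
)

# Map original names to new names; columns not listed (the whiskers) are dropped
new_names <- c(
  RANK                         = "Happiness.Rank",
  Country                      = "Country",
  Happiness.score              = "Happiness.Score",
  Explained.by..GDP.per.capita = "Economy"
)
clean <- toy[, names(new_names)]
names(clean) <- unname(new_names[names(clean)])

# Look up the continent by country name; unmatched countries stay NA
# instead of silently becoming "Africa"
continent_of <- c(Finland = "Europe", Denmark = "Europe", Iceland = "Europe")
clean$Continent <- factor(unname(continent_of[clean$Country]))

# Move Country and Continent to the front
front <- c("Country", "Continent")
clean <- clean[, c(front, setdiff(names(clean), front))]
str(clean)
```

Leaving unmatched countries as `NA` makes spelling differences between the lookup and the dataset visible, which a default continent would hide.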