Final Project - Assignment 2

Final Project 2
thrishul
Author

Thrishul

Published

May 8, 2023

Introduction

This data science project aims to explore and analyze a comprehensive dataset of job titles, salaries, employment types, company sizes and locations within the field of data science. With a focus on understanding current trends and patterns, this project seeks to provide insights into the current state of the data science job market. Through analyzing the distribution of job titles, seniority levels, salaries, and employment types, we can gain an understanding of the most common roles, skillsets, and experience required in the field. Additionally, examining the location and size of companies can provide insights into which regions are leading in the data science industry and what size of companies are more likely to hire data scientists. This project aims to provide valuable insights to job seekers, employers, and researchers in the data science field.

Dataset Description

work_year: This column indicates the year in which the salary was paid to the employee. It allows us to analyze trends in salaries over time and compare salaries between different years.

experience_level: This column indicates the experience level of the employee in the job during the year. It allows us to analyze how experience level affects salaries and identify common experience levels for different job titles.

employment_type: This column indicates the type of employment for the role, whether it is Contract, Freelance, Full-Time, or Part-Time. It allows us to analyze the prevalence of different employment types in the data science field.

job_title: This column indicates the role worked in during the year. It allows us to analyze the most common job titles in the data science field and identify trends in job titles over time.

salary: This column indicates the total gross salary amount paid to the employee in the specified currency. It allows us to analyze salary ranges, identify outliers, and compare salaries between different job titles and experience levels.

salary_currency: This column indicates the currency of the salary paid as an ISO 4217 currency code. It allows us to convert salaries to a common currency for analysis and comparison.

salaryinusd: This column indicates the salary in USD. It allows us to compare salaries in a common currency and analyze the impact of currency exchange rates on salaries.

employee_residence: This column indicates the employee’s primary country of residence during the work year as an ISO 3166 country code. It allows us to analyze the geographic distribution of employees and identify common countries of residence for different job titles and experience levels.

remote_ratio: This column indicates the overall amount of work done remotely by the employee during the year. It allows us to analyze the prevalence of remote work in the data science field and identify common remote work ratios for different job titles and experience levels.

company_location: This column indicates the country of the employer’s main office or contracting branch. It allows us to analyze the geographic distribution of companies and identify common countries where data science jobs are located.

company_size: This column indicates the median number of people that worked for the company during the year. It allows us to analyze the size distribution of companies and identify common company sizes for different job titles and experience levels.

Loading Necessary Packages

Code
pacman :: p_load(pacman,stats, dplyr, knitr, ggplot2, plotly, psych, gridExtra,
                 waffle, emojifont,tidyr, tidytext, wordcloud, GGally, viridis,
                tidyverse, rnaturalearth, rnaturalearthdata) 

options(warn=-1) # warnings will be suppressed and will not be displayed in the console or any output

Data Reading

Code
df <- read.table("_data/ds_salaries.csv", sep = ",", header = T)

# Nr. of instances
cat("Number of instances:", nrow(df))
Number of instances: 3755
Code
# View data
head(df, 10)
   work_year experience_level employment_type                job_title salary
1       2023               SE              FT Principal Data Scientist  80000
2       2023               MI              CT              ML Engineer  30000
3       2023               MI              CT              ML Engineer  25500
4       2023               SE              FT           Data Scientist 175000
5       2023               SE              FT           Data Scientist 120000
6       2023               SE              FT        Applied Scientist 222200
7       2023               SE              FT        Applied Scientist 136000
8       2023               SE              FT           Data Scientist 219000
9       2023               SE              FT           Data Scientist 141000
10      2023               SE              FT           Data Scientist 147100
   salary_currency salary_in_usd employee_residence remote_ratio
1              EUR         85847                 ES          100
2              USD         30000                 US          100
3              USD         25500                 US          100
4              USD        175000                 CA          100
5              USD        120000                 CA          100
6              USD        222200                 US            0
7              USD        136000                 US            0
8              USD        219000                 CA            0
9              USD        141000                 CA            0
10             USD        147100                 US            0
   company_location company_size
1                ES            L
2                US            S
3                US            S
4                CA            M
5                CA            M
6                US            L
7                US            L
8                CA            M
9                CA            M
10               US            M
Code
# Inspect data
check <- function(data) {
  l <- list()
  columns <- names(data)
  for (col in columns) {
    instances <- sum(!is.na(data[[col]]))
    dtypes <- class(data[[col]])
    unique <- length(unique(data[[col]]))
    sum_null <- sum(is.na(data[[col]]))
    duplicates <- sum(duplicated(data))
    l[[length(l) + 1]] <- c(col, dtypes, instances, unique, sum_null, duplicates)
  }
  data_check <- as.data.frame(do.call(rbind, l))
  names(data_check) <- c("column", "dtype", "instances", "unique", "sum_null", "duplicates")
  return(data_check)
}

check(df)
               column     dtype instances unique sum_null duplicates
1           work_year   integer      3755      4        0       1171
2    experience_level character      3755      4        0       1171
3     employment_type character      3755      4        0       1171
4           job_title character      3755     93        0       1171
5              salary   integer      3755    815        0       1171
6     salary_currency character      3755     20        0       1171
7       salary_in_usd   integer      3755   1035        0       1171
8  employee_residence character      3755     78        0       1171
9        remote_ratio   integer      3755      3        0       1171
10   company_location character      3755     72        0       1171
11       company_size character      3755      3        0       1171

Inspecting the Data

Code
str(df)
'data.frame':   3755 obs. of  11 variables:
 $ work_year         : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
 $ experience_level  : chr  "SE" "MI" "MI" "SE" ...
 $ employment_type   : chr  "FT" "CT" "CT" "FT" ...
 $ job_title         : chr  "Principal Data Scientist" "ML Engineer" "ML Engineer" "Data Scientist" ...
 $ salary            : int  80000 30000 25500 175000 120000 222200 136000 219000 141000 147100 ...
 $ salary_currency   : chr  "EUR" "USD" "USD" "USD" ...
 $ salary_in_usd     : int  85847 30000 25500 175000 120000 222200 136000 219000 141000 147100 ...
 $ employee_residence: chr  "ES" "US" "US" "CA" ...
 $ remote_ratio      : int  100 100 100 100 100 0 0 0 0 0 ...
 $ company_location  : chr  "ES" "US" "US" "CA" ...
 $ company_size      : chr  "L" "S" "S" "M" ...
Code
describe(df)
                    vars    n      mean        sd median   trimmed      mad
work_year              1 3755   2022.37      0.69   2022   2022.47     1.48
experience_level*      2 3755      3.47      0.91      4      3.69     0.00
employment_type*       3 3755      3.00      0.13      3      3.00     0.00
job_title*             4 3755     40.58     18.43     34     39.47    13.34
salary                 5 3755 190695.57 671676.50 138000 138630.70 62269.20
salary_currency*       6 3755     18.41      4.06     20     19.56     0.00
salary_in_usd          7 3755 137570.39  63055.63 135000 134946.56 59304.00
employee_residence*    8 3755     67.15     19.24     76     71.95     0.00
remote_ratio           9 3755     46.27     48.59      0     45.34     0.00
company_location*     10 3755     63.12     17.50     71     67.56     0.00
company_size*         11 3755      1.92      0.39      2      1.97     0.00
                     min      max    range  skew kurtosis       se
work_year           2020     2023        3 -1.02     1.12     0.01
experience_level*      1        4        3 -1.75     2.00     0.01
employment_type*       1        4        3 -8.08   152.71     0.00
job_title*             1       93       92  0.65     0.53     0.30
salary              6000 30400000 30394000 28.91  1145.43 10961.13
salary_currency*       1       20       19 -2.33     3.91     0.07
salary_in_usd       5132   450000   444868  0.54     0.83  1029.01
employee_residence*    1       78       77 -1.92     2.09     0.31
remote_ratio           0      100      100  0.15    -1.92     0.79
company_location*      1       72       71 -1.96     2.22     0.29
company_size*          1        3        2 -0.72     2.93     0.01

EDA

Univariate Analysis

Analysis for Work Year

Code
# Analysis for work_year #

wy_categ <- as.factor(ifelse(df$work_year  == 2020, '2020',
                            ifelse(df$work_year == 2021, '2021', 
                                   ifelse(df$work_year == 2022, '2022', 
                                          ifelse(df$work_year== 2023, '2023', '2020')))))

options(repr.plot.width=16, repr.plot.height=8)
my_palette <- c("#F8EDED", "#F6DFEB", "#E4BAD4", "#CE97B0")

wy_barchart <- ggplot(data.frame(wy_categ), aes(x = wy_categ)) +
  geom_bar(aes(fill = wy_categ))  +
  scale_fill_manual(values = my_palette) +
  ggtitle("Bar Chart for Work Year") +
  xlab("Year") +
  ylab("Frequency") +
  labs(fill = "Year") +
  stat_count(geom = "text", aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() 
  theme(
    plot.title = element_text(color = "black", size = 20, face = "bold"),
    plot.subtitle = element_text(color = "#F6CD90",size = 12, face = "bold"),
    plot.caption = element_text(face = "italic"))
List of 3
 $ plot.title   :List of 11
  ..$ family       : NULL
  ..$ face         : chr "bold"
  ..$ colour       : chr "black"
  ..$ size         : num 20
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi FALSE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ plot.subtitle:List of 11
  ..$ family       : NULL
  ..$ face         : chr "bold"
  ..$ colour       : chr "#F6CD90"
  ..$ size         : num 12
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi FALSE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 $ plot.caption :List of 11
  ..$ family       : NULL
  ..$ face         : chr "italic"
  ..$ colour       : NULL
  ..$ size         : NULL
  ..$ hjust        : NULL
  ..$ vjust        : NULL
  ..$ angle        : NULL
  ..$ lineheight   : NULL
  ..$ margin       : NULL
  ..$ debug        : NULL
  ..$ inherit.blank: logi FALSE
  ..- attr(*, "class")= chr [1:2] "element_text" "element"
 - attr(*, "class")= chr [1:2] "theme" "gg"
 - attr(*, "complete")= logi FALSE
 - attr(*, "validate")= logi TRUE
Code
data <- data.frame(
  category=c("2020", "2021", "2022", "2023"),
  count=c(76, 230, 1664, 1785)
)
 
# Compute percentages
data$fraction <- data$count / sum(data$count)

# Compute the cumulative percentages (top of each rectangle)
data$ymax <- cumsum(data$fraction)

# Compute the bottom of each rectangle
data$ymin <- c(0, head(data$ymax, n=-1))

# Compute label position
data$labelPosition <- (data$ymax + data$ymin) / 2

# Compute a good label
data$label <- paste0(round(data$fraction*100, 1), "%")

wy_piechart <- ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
  geom_rect() +
  geom_label( x=3.5, aes(y=labelPosition, label=label), size=6) +
  ggtitle("Pie Chart for Work Year") +
  scale_fill_manual(values = my_palette)  +
  coord_polar(theta="y") +
  theme_void() +
  labs(fill = "Year") +
  theme(
  plot.title = element_text(color = "#383335", size = 20, face = "bold"))

wy_barchart 

Code
wy_piechart

Findings:

The dataset indicates that the majority of observations, around 47.54% (1785), are from the year 2023, followed by 44.31% (1664) from 2022. A smaller percentage, 6% (230), corresponds to the year 2021, while only 2% (76) corresponds to the year 2020.

Analysis for Experience Level

Code
# Analysis for experience_level #

# table(df$experience_level)

el_categ <- as.factor(ifelse(df$experience_level == 'EN', 'Entry-level',
                            ifelse(df$experience_level == 'MI', 'Mid-level', 
                                   ifelse(df$experience_level == 'SE', 'Senior-level', 
                                          ifelse(df$experience_level == 'EX', 'Executive-level', '')))))

options(repr.plot.width=16, repr.plot.height=8)
my_palette <- c("#FFF2F2", "#E5E0FF", "#8EA7E9", "#7286D3")

el_barchart <- ggplot(data.frame(el_categ), aes(x = el_categ)) +
  geom_bar(aes(fill = el_categ))  +
  scale_fill_manual(values = my_palette) +
  ggtitle("Bar Chart for Experience Level") +
  xlab("Level") +
  ylab("Frequency") +
  labs(fill = "Level") +
  stat_count(geom = "text", aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  theme(
    plot.title = element_text(color = "#383335", size = 20, face = "bold"),
    plot.subtitle = element_text(color = "#F6CD90",size = 12, face = "bold"),
    plot.caption = element_text(face = "italic"))

data <- data.frame(
  category=c("Entry-level", "Mid-level", "Senior-level", "Executive-level"),
  count=c(320, 805, 2516, 114)
)
 
# Compute percentages
data$fraction <- data$count / sum(data$count)

# Compute the cumulative percentages (top of each rectangle)
data$ymax <- cumsum(data$fraction)

# Compute the bottom of each rectangle
data$ymin <- c(0, head(data$ymax, n=-1))

# Compute label position
data$labelPosition <- (data$ymax + data$ymin) / 2

# Compute a good label
data$label <- paste0(round(data$fraction*100, 1), "%")

el_piechart <- ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
  geom_rect() +
  geom_label( x=3.5, aes(y=labelPosition, label=label), size=6) +
  ggtitle("Pie Chart for Experience Level") +
  scale_fill_manual(values = my_palette)  +
  coord_polar(theta="y") +
  theme_void() +
  labs(fill = "Level") +
  theme(
  plot.title = element_text(color = "black", size = 20, face = "bold"))

el_barchart

Code
el_piechart

Findings:

The dataset includes four seniority levels, with the Senior-level category having the most representation, followed by the Mid-level category. The Entry-level category is the next in terms of representation, while the Executive-level category has the least representation.

Analysis for Job Title

Code
# Analysis for job_title #

# table(df$job_title)
cat("The dataset grasps", length(unique(df$job_title))," distinct job titles.")
The dataset grasps 93  distinct job titles.
Code
# Compute the top 20 job titles in descending order
top20_job_titles <- head(sort(table(df$job_title), decreasing = T), 20)

# Create a bar plot with plotly
plot_ly(x = top20_job_titles, y = names(top20_job_titles), type = "bar",
               text = top20_job_titles, orientation = 'h', 
               marker = list(color = "#E1CCEC")) %>%
               layout(title = "Top 20 Jobs in Data Science", xaxis = list(title = "Count"), 
               yaxis = list(title = "Job Title"))

Findings:

There are a total of 93 job titles present in the dataset, with Data Engineer, Data Scientist, Data Analyst, and Machine Learning Engineer being the most commonly occurring ones.

Analysis for Salary

Code
# Analysis for salary_in_usd #

options(scipen = 999)

        # Central Tendencies
# Mean
cat("Mean:", mean(df$salary_in_usd))
Mean: 137570.4
Code
cat("\n")
Code
# Median
cat("Median:", median(df$salary_in_usd))
Median: 135000
Code
cat("\n")
Code
# Mode
mode <- function(x){
  ta <- table(x)
  tam <- max(ta)
  if(all(ta==tam))
    mod <- NA
  else
    if(is.numeric(x))
      mod <- as.numeric(names(ta)[ta==tam])
  else
    mod <- names(ta)[ta==tam]
  return(mod)
}

cat("Mode:", mode(df$salary_in_usd))
Mode: 100000
Code
cat("\n")
Code
    # Measure of Variability

# Std. Dev
cat("Standard Deviation:", sd(df$salary_in_usd))
Standard Deviation: 63055.63
Code
cat("\n")
Code
# Variance
cat("Variance:", var(df$salary_in_usd))
Variance: 3976011879
Code
cat("\n")
Code
# IQR
cat("Interquartile Range:") 
Interquartile Range:
Code
quantile(df$salary_in_usd)
    0%    25%    50%    75%   100% 
  5132  95000 135000 175000 450000 
Code
options(repr.plot.width=18, repr.plot.height=6) 
require(gridExtra)

salary_hist <- ggplot(df, aes(x = salary_in_usd)) +
  geom_histogram(color = '#5c082c', fill ='#F5B0CB', bins=30) +
  labs(title = "Histogram for Salary ($)",x = "Salary",y = "Count") +
  theme_classic() +
  theme(
    plot.title = element_text(color = "#383335", size = 20, face = "bold"),
    plot.subtitle = element_text(color = "#F5B0CB",size = 12, face = "bold"),
    plot.caption = element_text(face = "italic"))

salary_boxplot <- ggplot(df, aes_string(x = df$salary_in_usd)) +
    geom_boxplot(outlier.colour = "#F5B0CB", outlier.shape = 11, outlier.size = 2, col = "#5c082c", notch = F) +
    labs(title = "Box Plot for Salary ($)",x = "Salary") +
    theme_classic() +
    theme(
    plot.title = element_text(color = "#383335", size = 20, face = "bold"))

salary_hist

Code
salary_boxplot

Code
plot(density(df$salary_in_usd),
     col="#F5B0CB",
     main="Density Plot for Salary",
     xlab="Salary",
     ylab="Density")
polygon(density(df$salary_in_usd),
        col="#F5B0CB")

Findings:

The range of salaries in the Data Science domain is between 5,132 USD and 450,000 USD. Moreover, the average or mean salary in this field is around 137,570 USD. This information suggests that there is a wide range of salaries in the field of Data Science, with some individuals earning significantly more than others.

Analysis for Employment Type

Code
# Analysis for employment_type #


# table(df$employment_type)

et_categ <- as.factor(ifelse(df$employment_type == "CT", "Contract",
                            ifelse(df$employment_type == "FL", "Freelance", 
                                   ifelse(df$employment_type == "FT", "Full-Time", 
                                          ifelse(df$employment_type == "PT", "Part-Time", "")))))

options(repr.plot.width=16, repr.plot.height=8)
my_palette <- c("#FEDEFF", "#93C6E7", "#AEE2FF", "#B9F3FC")

et_barchart <- ggplot(data.frame(et_categ), aes(x = et_categ)) +
  geom_bar(aes(fill = et_categ))  +
  scale_fill_manual(values = my_palette) +
  ggtitle("Bar Chart for Employment Type") +
  xlab("Type") +
  ylab("Frequency") +
  labs(fill = "Level") +
  stat_count(geom = "text", aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  theme(
    plot.title = element_text(color = "#383335", size = 20, face = "bold"),
    plot.subtitle = element_text(color = "#F6CD90",size = 12, face = "bold"),
    plot.caption = element_text(face = "italic"))

data <- data.frame(
  category=c("Contract", "Freelance", "Full-Time", "Part-Time"),
  count=c(10, 10, 3718, 17)
)
 
# Compute percentages
data$fraction <- data$count / sum(data$count)

# Compute the cumulative percentages (top of each rectangle)
data$ymax <- cumsum(data$fraction)

# Compute the bottom of each rectangle
data$ymin <- c(0, head(data$ymax, n=-1))

# Compute label position
data$labelPosition <- (data$ymax + data$ymin) / 2

# Compute a good label
data$label <- paste0(round(data$fraction*100, 1), "%")

et_piechart <- ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
  geom_rect() +
  geom_label( x=3.5, aes(y=labelPosition, label=label), size=6) +
  ggtitle("Pie Chart for Employment Type") +
  scale_fill_manual(values = my_palette)  +
  coord_polar(theta="y") +
  theme_void() +
  labs(fill = "Type") +
  theme(
  plot.title = element_text(color = "#383335", size = 20, face = "bold"))

et_barchart

Code
et_piechart

Findings:

The dataset includes data on four types of employment: Contract, Freelance, Full-Time, and Part-Time. Based on the graphs, it is clear that the majority of people who contributed to the dataset are employed on a Full-Time basis. This information indicates that Full-Time employment is the most common form of employment for those in the Data Science field.

Analysis for % Remote Work

Code
# Analysis for remote_ratio #
# table(df$remote_ratio)

rr_categ <- as.factor(ifelse(df$remote_ratio  == 0, "On-site",
                            ifelse(df$remote_ratio == 50, "Hybrid", 
                                ifelse(df$remote_ratio == 100, "Remote", ""))))

options(repr.plot.width=16, repr.plot.height=8)
my_palette <- c("#f0dfd1", "#f0c9a8", "#edb482")

rr_barchart <- ggplot(data.frame(rr_categ ), aes(x = rr_categ)) +
    geom_bar(aes(fill = rr_categ ))  +
    scale_fill_manual(values = my_palette) +
    labs(fill = "Type") +
    stat_count(geom = "text", aes(label = after_stat(count)), vjust = -0.5) +
    labs(title = "Bar Chart for  % Remote Work",x = "Type", y="Frequency") +
    theme_classic() +
    theme(
    plot.title = element_text(color = "#383335", size = 20, face = "bold"),
    plot.subtitle = element_text(color = "#F6CD90",size = 12, face = "bold"),
    plot.caption = element_text(face = "italic"))

data <- data.frame(
  category=c("On-site", "Hybrid", "Remote"),
  count=c(1923, 189, 1643)
)
 
# Compute percentages
data$fraction <- data$count / sum(data$count)

# Compute the cumulative percentages (top of each rectangle)
data$ymax <- cumsum(data$fraction)

# Compute the bottom of each rectangle
data$ymin <- c(0, head(data$ymax, n=-1))

# Compute label position
data$labelPosition <- (data$ymax + data$ymin) / 2

# Compute a good label
data$label <- paste0(round(data$fraction*100, 1), "%")

rr_piechart <- ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
  geom_rect() +
  geom_label( x=3.5, aes(y=labelPosition, label=label), size=6) +
  ggtitle("Pie Chart for % Remote Work") +
  scale_fill_manual(values = my_palette)  +
  coord_polar(theta="y") +
  theme_void() +
  labs(fill = "Type") +
  theme(
  plot.title = element_text(color = "#383335", size = 20, face = "bold"))

rr_barchart

Code
rr_piechart

Findings:

According to the dataset, approximately 51.2% (1923 observations) of individuals are working On-site, which means they are working at a physical location provided by their employer. Around 43.8% (1643 observations) of people are working in remote mode. Finally, 5% (189 observations) of individuals are working hybrid. This information suggests that a significant portion of the individuals in the Data Science field have some level of flexibility in their work arrangements.

Analysis for Companies Size

Code
# Analysis for company_size #
# table(df$company_size)

cs_categ <- as.factor(ifelse(df$company_size  == "L", "Large",
                            ifelse(df$company_size == "M", "Medium", 
                                ifelse(df$company_size == "S", 'Small', ''))))

options(repr.plot.width=16, repr.plot.height=8)
my_palette <- c("#C8E3D4", "#96C7C1", "#89B5AF")

cs_barchart <- ggplot(data.frame(cs_categ ), aes(x = cs_categ)) +
  geom_bar(aes(fill = cs_categ ))  +
  scale_fill_manual(values = my_palette) +
  ggtitle("Bar Chart for Companies Size") +
  xlab("Size") +
  ylab("Frequency") +
  labs(fill = "Type") +
  stat_count(geom = "text", aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  theme(
  plot.title = element_text(color = "#383335", size = 20, face = "bold"),
  plot.subtitle = element_text(color = "#F6CD90",size = 12, face = "bold"),
  plot.caption = element_text(face = "italic"))

data <- data.frame(
  category=c("Large", "Medium", "Small"),
  count=c(454, 3153, 148)
)
 
# Compute percentages
data$fraction <- data$count / sum(data$count)

# Compute the cumulative percentages (top of each rectangle)
data$ymax <- cumsum(data$fraction)

# Compute the bottom of each rectangle
data$ymin <- c(0, head(data$ymax, n=-1))

# Compute label position
data$labelPosition <- (data$ymax + data$ymin) / 2

# Compute a good label
data$label <- paste0(round(data$fraction*100, 1), "%")

cs_piechart <- ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
  geom_rect() +
  geom_label( x=3.5, aes(y=labelPosition, label=label), size=6) +
  ggtitle("Pie Chart for Companies Size") +
  scale_fill_manual(values = my_palette)  +
  coord_polar(theta="y") +
  theme_void() +
  labs(fill = "Size") +
  theme(
  plot.title = element_text(color = "#383335", size = 20, face = "bold"))

cs_barchart

Code
cs_piechart

Findings:

A large portion of the companies, around 84% (3153 observations), are categorized as Medium size. Additionally, 12.1% (454 observations) of the companies are classified as Large size, while only 3.9% (148 observations) are considered Small size. This implies that the majority of the companies in the dataset have a medium-sized workforce.

Analysis for Company Location

Code
# Analysis for company_location #

# table(df$company_location)

cat("The dataset grasps", length(unique(df$company_location))," distinct company locations.")
The dataset grasps 72  distinct company locations.
Code
# Compute the top 20 company locations in descending order
top20_company_location <- head(sort(table(df$company_location), decreasing = T), 20)

# Create a bar plot with plotly
plot_ly(x = top20_company_location, y = names(top20_company_location), type = "bar",
               text = top20_job_titles, orientation = 'h', 
               marker = list(color = "#0d1b2a
                             ")) %>%
               layout(title = "Top 20 Data Science Company Locations", xaxis = list(title = "Count"), 
               yaxis = list(title = "Company Location"))

Findings:

The dataset contains information on 72 unique company locations. Upon analyzing the data, it was found that the majority of companies are based in the United States, with the highest number of observations. Great Britain, Canada, Spain, and India are the next most common locations where companies are based, respectively.

Analysis for Employees Residence

Code
# Analysis for employee_residence #

# table(df$employee_residence)

cat("The dataset grasps", length(unique(df$employee_residence))," distinct employees residence.")
The dataset grasps 78  distinct employees residence.
Code
# Compute the top 20 employees residence in descending order
top20_employee_residence <- head(sort(table(df$employee_residence), decreasing = T), 20)

# Create a bar plot with plotly
plot_ly(x = top20_employee_residence, y = names(top20_employee_residence), type = "bar",
               text = top20_job_titles, orientation = 'h', 
               marker = list(color = "#0d1b2a")) %>%
               layout(title = "Top 20 Data Science Employees Residence", xaxis = list(title = "Count"), 
               yaxis = list(title = "Employee Residence"))

Findings:

The dataset includes information on the primary country of residence for employees, with a total of 78 countries represented. The highest number of employees reside in the United States, followed by Great Britain, Canada, Spain, and India, indicating that these countries have a higher concentration of data science jobs or are preferred locations for employees in this field.

Multivariate Analysis

Analysis for Salary by Year

Code
options(scipen = 999)

options(repr.plot.width=20, repr.plot.height=10) 
require(gridExtra)

my_palette <- c("#94d2bd", "#c8b6ff", "#ffb3c6", "#fb6f92")

wy_sl_boxplot <- ggplot(df, aes(x=wy_categ, y=salary_in_usd)) + 
geom_boxplot(fill=my_palette,
             outlier.colour = "#001219", 
             outlier.shape = 11, outlier.size = 2, 
             col = my_palette, 
             notch = F) +
labs(
title = "Distribution of Salaries by Year",
x= "Year",
y="Salary($)") 


wy_sl_hist <- ggplot(df, aes(x=salary_in_usd, fill=wy_categ)) + 
  geom_histogram(color="black", bins = 30) +
  labs(
      title = "Distribution of Salaries by Year",
      x="Salary", 
      y="Frequency",
      fill="Year") +
  scale_fill_manual(values = c("#ffe5ec", "#ffb3c6", "#ff8fab", "#fb6f92")) 


wy_sl_boxplot

Code
wy_sl_hist

The dataset clearly indicates that the salaries in the Data Science domain have been increasing year on year, with a significant rise in 2022. This trend is expected to continue in 2023 as well.

Analysis for Salary by Experience Level

Code
options(scipen = 999)

options(repr.plot.width=20, repr.plot.height=10) 
require(gridExtra)

my_palette <- c("#a8dadc", "#84a98c", "#52796f", "#2f3e46")

el_sl_boxplot <- ggplot(df, aes(x=el_categ, y=salary_in_usd)) + 
geom_boxplot(fill=my_palette,
             outlier.colour = "#84a98c", 
             outlier.shape = 11, outlier.size = 2, 
             col = my_palette, 
             notch = F) +
labs(
title = "Distribution of Salaries by Experience Level",
x= "Level",
y="Salary($)") 


el_sl_hist <- ggplot(df, aes(x=salary_in_usd, fill=el_categ)) + 
  geom_histogram(color="black", bins = 30) +
  labs(
      title = "Distribution of Salaries by Experience Level",
      x="Salary", 
      y="Frequency",
      fill="Level") +
  scale_fill_manual(values = c("#a8dadc", "#84a98c", "#52796f", "#2f3e46")) 


el_sl_boxplot

Code
el_sl_hist

Dataset indicates that the average executive-level salaries are higher than those of entry-level, mid-level, and senior-level positions with an average of 140k to more than 200k per year, while senior level salaries are in between 120k to 180k.

Analysis for Salary by Employment Type

Code
options(scipen = 999)

options(repr.plot.width=20, repr.plot.height=10) 
require(gridExtra)

my_palette <- c("#03045e", "#023e8a", "#0096c7", "#48cae4")

et_sl_boxplot <- ggplot(df, aes(x=et_categ, y=salary_in_usd)) + 
geom_boxplot(fill=my_palette,
             outlier.colour = "#023e8a", 
             outlier.shape = 11, outlier.size = 2, 
             col = my_palette, 
             notch = F) +
labs(
title = "Distribution of Salaries by Employment Type",
x= "Type",
y="Salary($)")

et_sl_hist <- ggplot(df, aes(x=salary_in_usd, fill=et_categ)) + 
  geom_histogram(color="black", bins = 30) +
  labs(
      title = "Distribution of Salaries by Employment Type",
      x="Salary", 
      y="Frequency",
      fill="Type") +
  scale_fill_manual(values = c("#03045e", "#023e8a", "#0096c7", "#48cae4")) 

et_sl_boxplot

Code
et_sl_hist

The data reveals that full-time employees are paid higher salaries compared to other employment types such as contract, freelance and part-time. Freelancers have an average annual salary of around 50 to 60k USD, while most number of part-time employees receive less than 50k USD per year.

Analysis for Salary by Company Size

Code
options(scipen = 999)

options(repr.plot.width=20, repr.plot.height=10) 
require(gridExtra)

my_palette <- c("#C8E3D4", "#96C7C1", "#89B5AF")

cs_sl_boxplot <- ggplot(df, aes(x=cs_categ, y=salary_in_usd)) + 
geom_boxplot(fill=my_palette,
             outlier.colour = "#305F72", 
             outlier.shape = 11, outlier.size = 2, 
             col = my_palette, 
             notch = F) +
labs(
title = "Distribution of Salaries by Company Size",
x= "Size",
y="Salary($)") 

cs_sl_hist <- ggplot(df, aes(x=salary_in_usd, fill=cs_categ)) + 
  geom_histogram(color="black", bins = 30) +
  labs(
      title = "Distribution of Salaries by Company Size",
      x="Salary", 
      y="Frequency",
      fill="Size") +
  scale_fill_manual(values = c("#C8E3D4", "#96C7C1", "#89B5AF")) 

cs_sl_boxplot

Code
cs_sl_hist

Based on the dataset, it can be observed that medium-sized companies tend to pay higher salaries compared to large-sized companies. The average salary for medium-sized companies ranges from 100k to 180k per year, while larger companies pay an average of 60k to 150k per year. This suggests that the size of the company may not always be an accurate indicator of salary levels, and that other factors may come into play such as the industry, location, and job title.

Analysis for Salary by % Remote Work

Code
options(scipen = 999)

options(repr.plot.width=20, repr.plot.height=10) 
require(gridExtra)

my_palette <- c("#f0dfd1", "#f0c9a8", "#edb482")

rr_sl_boxplot <- ggplot(df, aes(x=rr_categ, y=salary_in_usd)) + 
geom_boxplot(fill=my_palette,
             outlier.colour = "#FFD966", 
             outlier.shape = 11, outlier.size = 2, 
             col = my_palette, 
             notch = F) +
labs(
title = "Distribution of Salaries by % Remote Work",
x= "Type",
y="Salary($)")

rr_sl_hist <- ggplot(df, aes(x=salary_in_usd, fill=rr_categ)) + 
  geom_histogram(color="black", bins = 30) +
  labs(
      title = "Distribution of Salaries by % Remote Work",
      x="Salary", 
      y="Frequency",
      fill="Type") +
  scale_fill_manual(values = c("#f0dfd1", "#f0c9a8", "#edb482")) 

rr_sl_boxplot

Code
rr_sl_hist

According to the dataset, the salaries of remote employees are comparable to those of on-site employees, with an average salary range of 90k to 180k USD per year. This suggests that working remotely does not necessarily have a negative impact on salary.

Conclusion

Job Titles: Salaries in the data science field vary based on job titles. Executive-level positions tend to have higher salaries compared to entry-level, mid-level, and senior-level positions.

Seniority Levels: Salaries increase with increasing seniority levels. Executive-level salaries are higher than those of entry-level, mid-level, and senior-level positions, while senior-level salaries are higher than those of entry-level and mid-level positions.

Employment Types: Full-time employment is the most common form of employment in the data science field and typically comes with higher salaries compared to other employment types such as contract, freelance, and part-time.

Company Sizes: Medium-sized companies in the data science field tend to pay higher salaries compared to large-sized companies. This suggests that the size of the company can impact the salary levels offered to data scientists.

Geographic Location: The United States has the highest concentration of data science jobs. Salaries may vary depending on the location, with regions with a higher demand for data scientists offering higher salaries.

Salary Trends: Salaries in the data science domain have been increasing year on year, with a significant rise in 2022. This indicates that there is a positive trend in salary growth, and it is expected to continue in 2023.

Work Arrangements: A significant portion of individuals in the data science field have some level of flexibility in their work arrangements. This suggests that work flexibility may not significantly impact salary levels in the data science job market.

Remote Work: Salaries of remote employees in the data science field are comparable to those of on-site employees. This indicates that the option to work remotely does not have a significant impact on salary levels in the data science job market.

Overall, these findings suggest that the Data Science field is a growing and lucrative industry, with potential for career growth and financial success. However, factors such as employment type, company size, and location may impact an individual’s earning potential.