Blog post 4 describing the initial Data Analysis and data wrangling as a part of the course “Data Science Fundamentals”
id <- data.frame(read.csv("C:/Users/gunde/Downloads/Indian_Startup_Funding.csv",stringsAsFactors = TRUE))
glimpse(id)
Rows: 3,044
Columns: 8
$ Date <fct> 09/01/2020, 13/01/2020, 09/01/2020, 02/01/2~
$ StartupName <fct> BYJUS, Shuttl, Mamaearth, WealthBucket, Fas~
$ IndustryVertical <fct> "E-Tech", "Transportation", "E-commerce", "~
$ SubVertical <fct> "E-learning", "App based shuttle service", ~
$ CityLocation <fct> "Bangalore", "Gurgaon", "Bangalore", "New D~
$ Investors <fct> "Tiger Global Management", "Susquehanna Gro~
$ InvestmentType <fct> Private Equity Round, Series C, Series B, P~
$ AmountUSD <fct> "20,00,00,000", "80,48,394", "1,83,58,860",~
Date StartupName IndustryVertical SubVertical
0 0 0 0
CityLocation Investors InvestmentType AmountUSD
0 0 0 0
#Modifying date variable to proper format
id$Date <- as.Date(id$Date,format("%d/%m/%Y"))
#Checking for changes
str(id)
'data.frame': 3044 obs. of 8 variables:
$ Date : Date, format: "2020-01-09" ...
$ StartupName : Factor w/ 2453 levels "#Fame","121Policy",..: 276 1911 1308 2307 656 1579 2441 568 295 495 ...
$ IndustryVertical: Factor w/ 878 levels "360-degree view creating platform",..: 200 827 190 275 254 440 358 822 191 7 ...
$ SubVertical : Factor w/ 1943 levels "\"Women\\\\'s Fashion Clothing Online Platform\"",..: 461 75 1647 1279 515 1487 1216 28 112 1681 ...
$ CityLocation : Factor w/ 100 levels "Agra","Ahemadabad",..: 5 29 5 62 55 19 29 74 29 5 ...
$ Investors : Factor w/ 2405 levels " Sandeep Aggarwal, Teruhide Sato",..: 2114 2061 1857 2316 1997 468 218 1827 1523 1365 ...
$ InvestmentType : Factor w/ 55 levels "Angel","Angel / Seed Funding",..: 26 43 41 21 36 40 26 40 44 30 ...
$ AmountUSD : Factor w/ 476 levels "1,00,00,00,000",..: 184 451 75 263 129 469 115 386 406 336 ...
## Make a new column for year
id$year <- as.numeric(format(id$Date,"%Y"))
## Year frequency table
yeartable <- table(id$year)
Cleanstartup <- id[complete.cases(id), ]
dim(Cleanstartup)
[1] 3038 9
summary(Cleanstartup)
Date StartupName IndustryVertical
Min. :2015-01-02 Ola Cabs: 8 Consumer Internet: 941
1st Qu.:2015-11-03 Swiggy : 8 Technology : 478
Median :2016-07-15 BYJUS : 7 E-Commerce : 287
Mean :2016-09-22 Paytm : 7 Healthcare : 72
3rd Qu.:2017-06-12 Medinfi : 6 Finance : 63
Max. :2020-01-13 Meesho : 6 Logistics : 32
(Other) :2996 (Other) :1165
SubVertical CityLocation
nan : 931 Bangalore:863
Online Lending Platform : 11 Mumbai :609
Online Pharmacy : 10 New Delhi:424
Food Delivery Platform : 8 Gurgaon :344
Education : 5 Hyderabad:164
Online Education Platform: 5 Chennai :142
(Other) :2068 (Other) :492
Investors InvestmentType
Undisclosed Investors: 104 Private Equity :1355
Ratan Tata : 25 Seed Funding :1350
Indian Angel Network : 24 Seed/ Angel Funding : 60
Shell Foundation : 21 Seed / Angel Funding: 47
Kalaari Capital : 16 Seed\\\\nFunding : 30
Sequoia Capital : 15 Debt Funding : 25
(Other) :2833 (Other) : 171
AmountUSD year
100,000 : 224 Min. :2015
25,000 : 199 1st Qu.:2015
10,00,000: 165 Median :2016
231,046 : 136 Mean :2016
50,000 : 116 3rd Qu.:2017
5,00,000 : 107 Max. :2020
(Other) :2091
Cleanstartup1 <- Cleanstartup
str(Cleanstartup1)
'data.frame': 3038 obs. of 9 variables:
$ Date : Date, format: "2020-01-09" ...
$ StartupName : Factor w/ 2453 levels "#Fame","121Policy",..: 276 1911 1308 2307 656 1579 2441 568 295 495 ...
$ IndustryVertical: Factor w/ 878 levels "360-degree view creating platform",..: 200 827 190 275 254 440 358 822 191 7 ...
$ SubVertical : Factor w/ 1943 levels "\"Women\\\\'s Fashion Clothing Online Platform\"",..: 461 75 1647 1279 515 1487 1216 28 112 1681 ...
$ CityLocation : Factor w/ 100 levels "Agra","Ahemadabad",..: 5 29 5 62 55 19 29 74 29 5 ...
$ Investors : Factor w/ 2405 levels " Sandeep Aggarwal, Teruhide Sato",..: 2114 2061 1857 2316 1997 468 218 1827 1523 1365 ...
$ InvestmentType : Factor w/ 55 levels "Angel","Angel / Seed Funding",..: 26 43 41 21 36 40 26 40 44 30 ...
$ AmountUSD : Factor w/ 476 levels "1,00,00,00,000",..: 184 451 75 263 129 469 115 386 406 336 ...
$ year : num 2020 2020 2020 2020 2020 ...
#Startups city wise
Cleanstartup1$CityLocation <- as.character(Cleanstartup1$CityLocation)
Cleanstartup1$CityLocation[Cleanstartup1$CityLocation != "Bangalore" & Cleanstartup1$CityLocation != "Mumbai" & Cleanstartup1$CityLocation != "New Delhi" & Cleanstartup1$CityLocation != "Gurgaon" & Cleanstartup1$CityLocation != "Pune" & Cleanstartup1$CityLocation != "Hyderabad"] <- "Others"
table(Cleanstartup1$CityLocation)
Bangalore Gurgaon Hyderabad Mumbai New Delhi Others Pune
863 344 164 609 424 528 106
startup2 <- Cleanstartup1
startup2$AmountUSD <- as.numeric(gsub(",","",startup2$AmountUSD))
#converting to lowercase to prevent case-sensitivity
startup2$IndustryVertical <- as.factor(tolower(startup2$IndustryVertical))
#Total amounts raised by startups
Top5 <- as.data.frame(aggregate(AmountUSD ~ IndustryVertical, startup2, sum))
Top5 <- Top5[order(Top5$AmountUSD, decreasing = TRUE), ] %>%
drop_na(IndustryVertical) %>%
head(5)
Top5
IndustryVertical AmountUSD
1 e-commerce 8279033705
2 consumer internet 6318222695
3 transportation 3916632394
4 technology 2267804310
5 finance 1981978000
#plotting top5 industry verticles
ggplot(data = Top5, aes(IndustryVertical, AmountUSD, fill = IndustryVertical)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_y_continuous(labels = unit_format(unit = "M", scale = 0.0000001)) +
labs(x = "Industry Category", y = "Investment Amount (USD)", title = "TOP 5 Start-up Investment in India")
This clearly shows us that e-commerce is the most widely funded Industry vertical followed by consumer internet, transportation, technology and finance.
startup2$IndustryVertical <- as.character(startup2$IndustryVertical)
#clean code
startup2$IndustryVertical[startup2$IndustryVertical != "consumer internet" & startup2$IndustryVertical != "technology" & startup2$IndustryVertical != "e-commerce" & startup2$IndustryVertical != "finance" & startup2$IndustryVertical != "transportation"] <- "OtherSectors"
table(startup2$IndustryVertical)
consumer internet e-commerce finance
942 299 63
OtherSectors technology transportation
1252 478 4
#Creating a dataframe of top startups sorted per IndustryVerticle column
set.seed(5642)
sample_data <- data.frame(name = c("Consumer Internet","Technology","E-Commerce", "Finance", "Transportation", "other sectors") ,
value = c(941,478,287,63,4,1265))
#Creating barplot
plot<-ggplot(sample_data,
aes(name,value)) +
geom_bar(stat = "identity")+ theme_minimal()+
geom_text(aes(label = signif(value)), nudge_y = 3,)
plot+
coord_flip()
This diagram clearly shows us that there are many startups working in the consumer Internet followed by technology, e-commerce, finance, transportation.
If we compare the funding raised it has a very different story all together:
e-commerce start ups are 1/3 in number to consumer internet by they raised almost 4x amounts than consumer internet on average.
The transportation Industry clearly stands out. This clearly explains the supply chain issues.
temp = startup2[complete.cases(startup2),] %>% filter(CityLocation == startup2$CityLocation[3])
#Looking for outliers in the data and removing them
outliers = boxplot(temp$AmountUSD ~ temp$IndustryVertical, plot=FALSE)$out
temp.out = temp[-which(temp$AmountUSD %in% outliers),]
# Plotting a boxplot
p2 = ggplot(temp.out, aes(x = reorder(IndustryVertical, AmountUSD), y = log(AmountUSD))) +
geom_boxplot(outlier.shape = NA, show.legend = FALSE) + coord_flip() +
stat_summary(fun = mean, col = "honeydew4", geom = 'point') +
labs(x = "IndustryVerticle", y = "Amount in USD(Millions)", title = "Plotting outliers per Top5 IndustryVerticle on average", subtitle = "average given by dot")
p2
The graph clearly tells us that transportation industry is clearly the outlier.
Finance is maintaining the balance between no of start ups to Investments which says the fierce competetion in the industry.
#subsetting
fund.type <- startup2[startup2$IndustryVertical %in% c("technology", "consumer internet", "e-commerce"), ] %>%
drop_na(AmountUSD)
#grouping
fund.type <- fund.type %>% group_by(IndustryVertical, InvestmentType) %>%
summarise(count = n()) %>%
mutate(percentage = (round(count/sum(count)*100, 2))) %>%
ungroup()
ggplot(fund.type, aes(x = IndustryVertical, percentage, fill = InvestmentType)) +
geom_col(aes(fill = InvestmentType), position = "stack") +
labs(title = "Investment Category per Industry Vertical", subtitle = "Based on TOP 3 Industry Vertical in India (2015 - 2020)",
x = "Investment Category", y = "Percentage (%)") +
geom_label(aes(label = percentage), position = position_stack(vjust = 0.5))+
theme(legend.position = "bottom", legend.title = element_blank()) +
guides(fill = guide_legend(override.aes = aes(label = ""))) + #remove "a" character in legend
coord_flip()
The Investment types are clearly dominated by two types mainly but the interesting thing is technology and e-commerce are sharing almost the similar share in the type of investments which can be translated to technology is powering the rise of e-commerce and likewise.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gundeti (2022, May 19). Data Analytics and Computational Social Science: 601_Blogpost4 . Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomrahulgdacss601hw4/
BibTeX citation
@misc{gundeti2022601_blogpost4, author = {Gundeti, Rahul}, title = {Data Analytics and Computational Social Science: 601_Blogpost4 }, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomrahulgdacss601hw4/}, year = {2022} }