601_Blogpost4

Blog post 4 describing the initial Data Analysis and data wrangling as a part of the course “Data Science Fundamentals”

Rahul Gundeti (Graduate student, Data Analytics & Computational Social Sciences (DACSS), UMass Amherst.)
2022-05-12
id <- data.frame(read.csv("C:/Users/gunde/Downloads/Indian_Startup_Funding.csv",stringsAsFactors = TRUE))
glimpse(id)
Rows: 3,044
Columns: 8
$ Date             <fct> 09/01/2020, 13/01/2020, 09/01/2020, 02/01/2~
$ StartupName      <fct> BYJUS, Shuttl, Mamaearth, WealthBucket, Fas~
$ IndustryVertical <fct> "E-Tech", "Transportation", "E-commerce", "~
$ SubVertical      <fct> "E-learning", "App based shuttle service", ~
$ CityLocation     <fct> "Bangalore", "Gurgaon", "Bangalore", "New D~
$ Investors        <fct> "Tiger Global Management", "Susquehanna Gro~
$ InvestmentType   <fct> Private Equity Round, Series C, Series B, P~
$ AmountUSD        <fct> "20,00,00,000", "80,48,394", "1,83,58,860",~
#check NA availability for each column
colSums(is.na(id))
            Date      StartupName IndustryVertical      SubVertical 
               0                0                0                0 
    CityLocation        Investors   InvestmentType        AmountUSD 
               0                0                0                0 
#Modifying date variable to proper format
id$Date <- as.Date(id$Date,format("%d/%m/%Y"))

#Checking for changes
str(id)
'data.frame':   3044 obs. of  8 variables:
 $ Date            : Date, format: "2020-01-09" ...
 $ StartupName     : Factor w/ 2453 levels "#Fame","121Policy",..: 276 1911 1308 2307 656 1579 2441 568 295 495 ...
 $ IndustryVertical: Factor w/ 878 levels "360-degree view creating platform",..: 200 827 190 275 254 440 358 822 191 7 ...
 $ SubVertical     : Factor w/ 1943 levels "\"Women\\\\'s Fashion Clothing Online Platform\"",..: 461 75 1647 1279 515 1487 1216 28 112 1681 ...
 $ CityLocation    : Factor w/ 100 levels "Agra","Ahemadabad",..: 5 29 5 62 55 19 29 74 29 5 ...
 $ Investors       : Factor w/ 2405 levels " Sandeep Aggarwal, Teruhide Sato",..: 2114 2061 1857 2316 1997 468 218 1827 1523 1365 ...
 $ InvestmentType  : Factor w/ 55 levels "Angel","Angel / Seed Funding",..: 26 43 41 21 36 40 26 40 44 30 ...
 $ AmountUSD       : Factor w/ 476 levels "1,00,00,00,000",..: 184 451 75 263 129 469 115 386 406 336 ...
## Make a new column for year
id$year <- as.numeric(format(id$Date,"%Y"))
## Year frequency table
yeartable <- table(id$year)
Cleanstartup <- id[complete.cases(id), ]
dim(Cleanstartup)
[1] 3038    9
summary(Cleanstartup)
      Date              StartupName            IndustryVertical
 Min.   :2015-01-02   Ola Cabs:   8   Consumer Internet: 941   
 1st Qu.:2015-11-03   Swiggy  :   8   Technology       : 478   
 Median :2016-07-15   BYJUS   :   7   E-Commerce       : 287   
 Mean   :2016-09-22   Paytm   :   7   Healthcare       :  72   
 3rd Qu.:2017-06-12   Medinfi :   6   Finance          :  63   
 Max.   :2020-01-13   Meesho  :   6   Logistics        :  32   
                      (Other) :2996   (Other)          :1165   
                    SubVertical      CityLocation
 nan                      : 931   Bangalore:863  
 Online Lending Platform  :  11   Mumbai   :609  
 Online Pharmacy          :  10   New Delhi:424  
 Food Delivery Platform   :   8   Gurgaon  :344  
 Education                :   5   Hyderabad:164  
 Online Education Platform:   5   Chennai  :142  
 (Other)                  :2068   (Other)  :492  
                 Investors                 InvestmentType
 Undisclosed Investors: 104   Private Equity      :1355  
 Ratan Tata           :  25   Seed Funding        :1350  
 Indian Angel Network :  24   Seed/ Angel Funding :  60  
 Shell Foundation     :  21   Seed / Angel Funding:  47  
 Kalaari Capital      :  16   Seed\\\\nFunding    :  30  
 Sequoia Capital      :  15   Debt Funding        :  25  
 (Other)              :2833   (Other)             : 171  
     AmountUSD         year     
 100,000  : 224   Min.   :2015  
 25,000   : 199   1st Qu.:2015  
 10,00,000: 165   Median :2016  
 231,046  : 136   Mean   :2016  
 50,000   : 116   3rd Qu.:2017  
 5,00,000 : 107   Max.   :2020  
 (Other)  :2091                 
Cleanstartup1 <- Cleanstartup
str(Cleanstartup1)
'data.frame':   3038 obs. of  9 variables:
 $ Date            : Date, format: "2020-01-09" ...
 $ StartupName     : Factor w/ 2453 levels "#Fame","121Policy",..: 276 1911 1308 2307 656 1579 2441 568 295 495 ...
 $ IndustryVertical: Factor w/ 878 levels "360-degree view creating platform",..: 200 827 190 275 254 440 358 822 191 7 ...
 $ SubVertical     : Factor w/ 1943 levels "\"Women\\\\'s Fashion Clothing Online Platform\"",..: 461 75 1647 1279 515 1487 1216 28 112 1681 ...
 $ CityLocation    : Factor w/ 100 levels "Agra","Ahemadabad",..: 5 29 5 62 55 19 29 74 29 5 ...
 $ Investors       : Factor w/ 2405 levels " Sandeep Aggarwal, Teruhide Sato",..: 2114 2061 1857 2316 1997 468 218 1827 1523 1365 ...
 $ InvestmentType  : Factor w/ 55 levels "Angel","Angel / Seed Funding",..: 26 43 41 21 36 40 26 40 44 30 ...
 $ AmountUSD       : Factor w/ 476 levels "1,00,00,00,000",..: 184 451 75 263 129 469 115 386 406 336 ...
 $ year            : num  2020 2020 2020 2020 2020 ...
#Startups city wise
Cleanstartup1$CityLocation <- as.character(Cleanstartup1$CityLocation)
Cleanstartup1$CityLocation[Cleanstartup1$CityLocation != "Bangalore" & Cleanstartup1$CityLocation != "Mumbai" & Cleanstartup1$CityLocation != "New Delhi" & Cleanstartup1$CityLocation != "Gurgaon" & Cleanstartup1$CityLocation != "Pune" & Cleanstartup1$CityLocation != "Hyderabad"] <- "Others"


table(Cleanstartup1$CityLocation)

Bangalore   Gurgaon Hyderabad    Mumbai New Delhi    Others      Pune 
      863       344       164       609       424       528       106 
startup2 <- Cleanstartup1
startup2$AmountUSD <- as.numeric(gsub(",","",startup2$AmountUSD))
#converting to lowercase to prevent case-sensitivity
startup2$IndustryVertical <- as.factor(tolower(startup2$IndustryVertical))

#Total amounts raised by startups
Top5 <- as.data.frame(aggregate(AmountUSD ~ IndustryVertical, startup2, sum))
Top5 <- Top5[order(Top5$AmountUSD, decreasing = TRUE), ] %>% 
  drop_na(IndustryVertical) %>%
  head(5)
Top5
   IndustryVertical  AmountUSD
1        e-commerce 8279033705
2 consumer internet 6318222695
3    transportation 3916632394
4        technology 2267804310
5           finance 1981978000
#plotting top5 industry verticles
ggplot(data = Top5, aes(IndustryVertical, AmountUSD, fill = IndustryVertical)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_y_continuous(labels = unit_format(unit = "M", scale = 0.0000001)) +
  labs(x = "Industry Category", y = "Investment Amount (USD)", title = "TOP 5 Start-up Investment in India")

This clearly shows us that e-commerce is the most widely funded Industry vertical followed by consumer internet, transportation, technology and finance.

startup2$IndustryVertical <- as.character(startup2$IndustryVertical)
#clean code
startup2$IndustryVertical[startup2$IndustryVertical != "consumer internet" & startup2$IndustryVertical != "technology" & startup2$IndustryVertical != "e-commerce" & startup2$IndustryVertical != "finance" & startup2$IndustryVertical != "transportation"] <- "OtherSectors"
table(startup2$IndustryVertical)

consumer internet        e-commerce           finance 
              942               299                63 
     OtherSectors        technology    transportation 
             1252               478                 4 
#Creating a dataframe of top startups sorted per IndustryVerticle column
set.seed(5642)                            
sample_data <- data.frame(name = c("Consumer Internet","Technology","E-Commerce", "Finance", "Transportation", "other sectors") ,
                          value = c(941,478,287,63,4,1265))
 
#Creating barplot 
plot<-ggplot(sample_data,
             aes(name,value)) +
geom_bar(stat = "identity")+ theme_minimal()+
geom_text(aes(label = signif(value)), nudge_y = 3,)
plot+
coord_flip()

This diagram clearly shows us that there are many startups working in the consumer Internet followed by technology, e-commerce, finance, transportation.

If we compare the funding raised it has a very different story all together:

e-commerce start ups are 1/3 in number to consumer internet by they raised almost 4x amounts than consumer internet on average.

The transportation Industry clearly stands out. This clearly explains the supply chain issues.

temp = startup2[complete.cases(startup2),] %>% filter(CityLocation == startup2$CityLocation[3])

#Looking for outliers in the data and removing them
outliers = boxplot(temp$AmountUSD ~ temp$IndustryVertical, plot=FALSE)$out
temp.out = temp[-which(temp$AmountUSD %in% outliers),]

# Plotting a boxplot 
p2 = ggplot(temp.out, aes(x = reorder(IndustryVertical, AmountUSD), y = log(AmountUSD))) +
  geom_boxplot(outlier.shape = NA, show.legend = FALSE) + coord_flip() +
  stat_summary(fun = mean, col = "honeydew4", geom = 'point') +
  labs(x = "IndustryVerticle", y = "Amount in USD(Millions)", title = "Plotting outliers per Top5 IndustryVerticle on average", subtitle = "average given by dot")
p2

The graph clearly tells us that transportation industry is clearly the outlier.

Finance is maintaining the balance between no of start ups to Investments which says the fierce competetion in the industry.

#subsetting
fund.type <- startup2[startup2$IndustryVertical %in% c("technology", "consumer internet", "e-commerce"), ] %>% 
  drop_na(AmountUSD)

#grouping
fund.type <- fund.type %>% group_by(IndustryVertical, InvestmentType) %>%
  summarise(count = n()) %>%
  mutate(percentage = (round(count/sum(count)*100, 2))) %>% 
  ungroup()

ggplot(fund.type, aes(x = IndustryVertical, percentage, fill = InvestmentType)) +
  geom_col(aes(fill = InvestmentType), position = "stack") +
  labs(title = "Investment Category per Industry Vertical", subtitle = "Based on TOP 3 Industry Vertical in India (2015 - 2020)", 
       x = "Investment Category", y = "Percentage (%)") +
  geom_label(aes(label = percentage), position = position_stack(vjust = 0.5))+
  theme(legend.position = "bottom", legend.title = element_blank()) +
  guides(fill = guide_legend(override.aes = aes(label = ""))) + #remove "a" character in legend
  coord_flip()

The Investment types are clearly dominated by two types mainly but the interesting thing is technology and e-commerce are sharing almost the similar share in the type of investments which can be translated to technology is powering the rise of e-commerce and likewise.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Gundeti (2022, May 19). Data Analytics and Computational Social Science: 601_Blogpost4 . Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomrahulgdacss601hw4/

BibTeX citation

@misc{gundeti2022601_blogpost4,
  author = {Gundeti, Rahul},
  title = {Data Analytics and Computational Social Science: 601_Blogpost4 },
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomrahulgdacss601hw4/},
  year = {2022}
}