601_HW3

Blog post 3 describing the dataset and research questions as a part of the course “Data Science Fundamentals”

Rahul Gundeti (Graduate student, Data Analytics & Computational Social Sciences (DACSS), UMass Amherst.)
2022-05-12

Introduction to Dataset:

The dataset I have chosen to work on relates to the startup ecosystem. The word "startup" has become so popular over the last five years that almost everyone is familiar with it, which is unsurprising given digitization, technological advancement, and people adapting to a modern way of life. Compared with the hardships of earlier times, people now lead far more comfortable lives. Almost everything is at their fingertips, and whenever something is missing, a startup tends to appear to solve that problem.

Many of these companies have become part of our day-to-day life. Some have become so synonymous with what they do that we use the startup's name more than the actual word, much as "Google" replaced the phrase "search the net".

My dataset covers a fraction of those startups: ones that are prominent in India but also have a market presence outside India.

I found the dataset on Kaggle.

What does my dataset contain?

My dataset contains 3,044 rows and 8 columns (the dimensions reported by str() further down).

Date: Time period in the date format (DD/MM/YYYY)

StartupName: Name of the Startup

IndustryVertical: Which industry the startup is working in

SubVertical: Which sub-industry the startup falls under

CityLocation: Where the startup is headquartered

Investors: Who invested in the startup

InvestmentType: What type of investment it was

AmountUSD: The amount the startup raised, in USD

What do I want to do with the dataset?

I am looking to understand certain patterns from my analysis and answer the following questions:

  1. Which startup received the most funding?
  2. Which Industry Vertical received the most funding?
  3. Does location really matter for raising funding?
  4. Which type of funding is most prominent among startups?
  5. Who is the biggest Investor?

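library(ggplot2)  # needed for the ggplot() charts further down

#read.csv() already returns a data frame, so no data.frame() wrapper is needed
id <- read.csv("C:/Users/gunde/Downloads/Indian_Startup_Funding.csv", stringsAsFactors = TRUE)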
head(id)
        Date  StartupName    IndustryVertical
1 09/01/2020        BYJUS              E-Tech
2 13/01/2020       Shuttl      Transportation
3 09/01/2020    Mamaearth          E-commerce
4 02/01/2020 WealthBucket             FinTech
5 02/01/2020       Fashor Fashion and Apparel
6 13/01/2020        Pando           Logistics
                               SubVertical CityLocation
1                               E-learning    Bangalore
2                App based shuttle service      Gurgaon
3    Retailer of baby and toddler products    Bangalore
4                        Online Investment    New Delhi
5              Embroiled Clothes For Women       Mumbai
6 Open-market, freight management platform      Chennai
                  Investors       InvestmentType    AmountUSD
1   Tiger Global Management Private Equity Round 20,00,00,000
2 Susquehanna Growth Equity             Series C    80,48,394
3     Sequoia Capital India             Series B  1,83,58,860
4            Vinod Khatumal         Pre-series A    30,00,000
5   Sprout Venture Partners           Seed Round    18,00,000
6         Chiratae Ventures             Series A    90,00,000
#Checking all columns for NA values
colSums(is.na(id))
            Date      StartupName IndustryVertical      SubVertical 
               0                0                0                0 
    CityLocation        Investors   InvestmentType        AmountUSD 
               0                0                0                0 
summary(id)
         Date        StartupName            IndustryVertical
 02/02/2015:  11   Ola Cabs:   8   Consumer Internet: 942   
 08/07/2015:  11   Swiggy  :   8   Technology       : 478   
 30/11/2016:  11   BYJUS   :   7   E-Commerce       : 287   
 04/10/2016:  10   Paytm   :   7   Healthcare       :  72   
 01/06/2015:   9   Medinfi :   6   Finance          :  63   
 04/05/2016:   9   Meesho  :   6   Logistics        :  32   
 (Other)   :2983   (Other) :3002   (Other)          :1170   
                    SubVertical      CityLocation
 nan                      : 936   Bangalore:863  
 Online Lending Platform  :  11   Mumbai   :612  
 Online Pharmacy          :  10   New Delhi:424  
 Food Delivery Platform   :   8   Gurgaon  :344  
 Education                :   5   Hyderabad:166  
 Online Education Platform:   5   Chennai  :142  
 (Other)                  :2069   (Other)  :493  
                    Investors                 InvestmentType
 Undisclosed Investors   : 104   Private Equity      :1356  
 Ratan Tata              :  25   Seed Funding        :1355  
 Indian Angel Network    :  24   Seed/ Angel Funding :  60  
 Shell Foundation        :  21   Seed / Angel Funding:  47  
 Kalaari Capital         :  16   Seed\\\\nFunding    :  30  
 Group of Angel Investors:  15   Debt Funding        :  25  
 (Other)                 :2839   (Other)             : 171  
     AmountUSD   
 100,000  : 224  
 25,000   : 199  
 10,00,000: 165  
 231,046  : 136  
 50,000   : 117  
 5,00,000 : 108  
 (Other)  :2095  
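
One thing the summary above reveals is that SubVertical holds the literal string "nan" 936 times, which is.na() does not count as missing. Below is a minimal sketch of recoding such placeholders as real NA values. The toy data frame is a made-up stand-in for the real file, and treating "nan" as missing is my assumption about what the placeholder means.

```r
# Toy stand-in for the SubVertical column (illustrative values only)
toy <- data.frame(
  SubVertical = c("E-learning", "nan", "Online Pharmacy", "nan"),
  stringsAsFactors = TRUE
)

# The literal string "nan" is not flagged by is.na()
missing_before <- sum(is.na(toy$SubVertical))

# Recode "nan" as a real missing value, then drop the now-unused factor level
toy$SubVertical[toy$SubVertical == "nan"] <- NA
toy$SubVertical <- droplevels(toy$SubVertical)

missing_after <- sum(is.na(toy$SubVertical))
colSums(is.na(toy))
```

Alternatively, read.csv() accepts an na.strings argument, so passing na.strings = c("NA", "nan") when loading the file should have a similar effect.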
#Converting the Date column from a factor to a proper Date type
id$Date <- as.Date(as.character(id$Date), format = "%d/%m/%Y")

#Checking for changes
str(id)
'data.frame':   3044 obs. of  8 variables:
 $ Date            : Date, format: "2020-01-09" ...
 $ StartupName     : Factor w/ 2453 levels "#Fame","121Policy",..: 276 1911 1308 2307 656 1579 2441 568 295 495 ...
 $ IndustryVertical: Factor w/ 878 levels "360-degree view creating platform",..: 200 827 190 275 254 440 358 822 191 7 ...
 $ SubVertical     : Factor w/ 1943 levels "\"Women\\\\'s Fashion Clothing Online Platform\"",..: 461 75 1647 1279 515 1487 1216 28 112 1681 ...
 $ CityLocation    : Factor w/ 100 levels "Agra","Ahemadabad",..: 5 29 5 62 55 19 29 74 29 5 ...
 $ Investors       : Factor w/ 2405 levels " Sandeep Aggarwal, Teruhide Sato",..: 2114 2061 1857 2316 1997 468 218 1827 1523 1365 ...
 $ InvestmentType  : Factor w/ 55 levels "Angel","Angel / Seed Funding",..: 26 43 41 21 36 40 26 40 44 30 ...
 $ AmountUSD       : Factor w/ 476 levels "1,00,00,00,000",..: 184 451 75 263 129 469 115 386 406 336 ...
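
InvestmentType has 55 levels, and the summary output shows that several of them are spelling variants of the same label ("Seed/ Angel Funding", "Seed / Angel Funding", and a "Seed" label with an escaped newline in it). Before answering which funding type is most prominent, the variants probably need merging. Here is a rough sketch of one way to do that. The helper name normalize_type and the exact cleaning rules are my own assumptions, and the sample strings are illustrative.

```r
# Hypothetical helper: tidy up spacing and stray escape sequences in a label
normalize_type <- function(x) {
  x <- as.character(x)
  x <- gsub("\\\\+n", " ", x)       # literal backslash(es) followed by "n" -> space
  x <- gsub("\\s*/\\s*", " / ", x)  # uniform spacing around "/"
  x <- gsub("\\s+", " ", x)         # collapse repeated whitespace
  trimws(x)
}

# Illustrative sample of messy labels
raw_types <- c("Seed/ Angel Funding", "Seed / Angel Funding",
               "Seed\\\\nFunding", "Seed Funding")

cleaned <- normalize_type(raw_types)
table(cleaned)
```

On the real data this would be applied as id$InvestmentType <- factor(normalize_type(id$InvestmentType)), after which the frequency table should have fewer, cleaner levels.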
#Extracting year and creating a new column
id$year <- as.numeric(format(id$Date,"%Y"))
#Creating year table
id_year <- table(id$year)
id_year

2015 2016 2017 2018 2019 2020 
 931  993  687  309  111    7 
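
The yearly counts above add up to 3,038, which is 6 fewer than the 3,044 rows in the data frame. That suggests a handful of dates may not have matched the "%d/%m/%Y" format and became NA, which table() silently drops. A small sketch of how to check for this follows. The malformed strings are hypothetical examples, not values I have confirmed in the file.

```r
# Illustrative date strings; the last two are deliberately malformed
raw_dates <- c("09/01/2020", "13/01/2020", "12/05.2015", "22/01//2015")

parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# Which raw strings failed to parse?
bad_dates <- raw_dates[is.na(parsed)]
bad_dates

# useNA = "ifany" makes the missing years visible in the table
table(format(parsed, "%Y"), useNA = "ifany")
```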
#Plotting the number of startups funded each year as a pie chart
pie(id_year, edges=10, main = "No. of startups funded each year")
#The table below shows the number of startups funded each year
#2015 2016 2017 2018 2019 2020 
#931  993  687  309  111    7 
#Top 10 startups by number of funding entries
head(sort(table(id$StartupName), decreasing=TRUE),10)

     Ola Cabs        Swiggy         BYJUS         Paytm       Medinfi 
            8             8             7             7             6 
       Meesho      NoBroker         Nykaa     UrbanClap Capital Float 
            6             6             6             6             5 
#Bottom 10 startups by number of funding entries
tail(sort(table(id$StartupName), decreasing=TRUE),10)

              ZoomCar                ZopHop                ZopNow 
                    1                     1                     1 
               Zopper Zovi.com / Little App             ZuperMeal 
                    1                     1                     1 
              Zuppler                 Zuver                Zwayam 
                    1                     1                     1 
              Zzungry 
                    1 
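
The two tables above count how often each startup appears, not how much money it raised. To answer the first research question, AmountUSD has to become numeric first, since it is stored as a factor of comma-formatted strings (some in Indian digit grouping, such as 20,00,00,000). A minimal sketch under those assumptions is below. The startup names and amounts are invented, and the "undisclosed" entry is a guess at the kind of non-numeric value that might appear.

```r
# Toy data with hypothetical startups and amounts
toy <- data.frame(
  StartupName = c("StartupA", "StartupB", "StartupA", "StartupC"),
  AmountUSD   = c("20,00,00,000", "80,48,394", "100,000", "undisclosed"),
  stringsAsFactors = TRUE
)

# Strip commas (works for both Indian and Western grouping), then convert;
# anything non-numeric becomes NA
toy$amount_num <- suppressWarnings(
  as.numeric(gsub(",", "", as.character(toy$AmountUSD)))
)

# Total funding per startup; rows with NA amounts are dropped by the formula interface
totals <- aggregate(amount_num ~ StartupName, data = toy, FUN = sum)
totals <- totals[order(-totals$amount_num), ]
totals
```

Swapping StartupName for IndustryVertical or Investors in the formula would give the same kind of ranking for the second and fifth questions.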
#Creating a data frame of the top 10 industry verticals and their counts
sample_data <- data.frame(name = c("Consumer Internet","Technology","E-Commerce","Healthcare","Finance","Logistics","Saas","Education","Food & Beverage","FinTech") ,
                          value = c(942,478,287,72,63,32,28,25,23,19))

#Creating a horizontal bar plot with the count printed next to each bar
p <- ggplot(sample_data, aes(name, value)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  geom_text(aes(label = value), nudge_y = 3)
p + coord_flip()
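
The counts in the bar plot above were typed in by hand from the summary output. A possible alternative is to build the same name/value data frame straight from the frequency table, so it stays in sync with the data. The sketch below uses a short made-up vector in place of id$IndustryVertical.

```r
# Illustrative stand-in for id$IndustryVertical
verticals <- factor(c("Technology", "E-Commerce", "Technology",
                      "Healthcare", "Technology", "E-Commerce"))

# Top 2 categories here; on the real column this would be 10
top_verticals <- head(sort(table(verticals), decreasing = TRUE), 2)

# Convert the table to a data frame with the column names used in the plot
sample_data <- setNames(
  as.data.frame(top_verticals, stringsAsFactors = FALSE),
  c("name", "value")
)
sample_data
```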

head(sort(table(id$Investors), decreasing=TRUE),10)

   Undisclosed Investors               Ratan Tata 
                     104                       25 
    Indian Angel Network         Shell Foundation 
                      24                       21 
         Kalaari Capital Group of Angel Investors 
                      16                       15 
         Sequoia Capital           Accel Partners 
                      15                       12 
           Brand Capital              Undisclosed 
                      11                       11 
## Visualization using a bar chart
top_investors <- head(sort(table(id$Investors), decreasing = TRUE), 10)
b <- barplot(top_investors, col = rainbow(10, 0.5), las = 2, ylim = c(0, 150), xlab = "Investors", ylab = "No. of Investments")
text(b, top_investors, top_investors, srt = 90, pos = 4)

##Industry vertical frequency table
industrytable_head <- head(sort(table(id$IndustryVertical), decreasing=TRUE),4)
industrytable_head

Consumer Internet        Technology        E-Commerce 
              942               478               287 
       Healthcare 
               72 
#Creating dataframe
i3 <- data.frame(category = c("Consumer Internet","Technology","E-Commerce","Healthcare"),
                          count = c(942,478,287,72))
#Calculating each category's fraction and the boundaries of its ring segment
i3$fraction <- i3$count / sum(i3$count)
i3$ymax <- cumsum(i3$fraction)
i3$ymin <- c(0, head(i3$ymax, n=-1))
i3$labelPosition <- (i3$ymax + i3$ymin) / 2
i3$label <- paste0(i3$category, "\n value: ", i3$count)

# Make the plot
ggplot(i3, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
  geom_rect() +
  geom_label( x=3, aes(y=labelPosition, label=label), size=4) +
  scale_fill_brewer(palette=4) +
  coord_polar(theta="y") +
  xlim(c(1, 5)) +
  theme_void()
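
The same frequency-based approach gives a first look at the third question, whether location matters. The sketch below turns the CityLocation counts from the summary() output into percentage shares. Note that this measures the share of funding entries per city, not the share of money raised, so it is only a rough indication.

```r
# CityLocation counts copied from the summary() output above
city_counts <- c(Bangalore = 863, Mumbai = 612, `New Delhi` = 424,
                 Gurgaon = 344, Hyderabad = 166, Chennai = 142, Other = 493)

# Share of all funding entries per city, in percent
city_share <- round(100 * prop.table(city_counts), 1)
city_share

# City with the largest share
top_city <- names(which.max(city_share))
top_city
```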

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Gundeti (2022, May 19). Data Analytics and Computational Social Science: 601_HW3. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomrahulgdacss601hw3/

BibTeX citation

@misc{gundeti2022601_hw3,
  author = {Gundeti, Rahul},
  title = {Data Analytics and Computational Social Science: 601_HW3},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomrahulgdacss601hw3/},
  year = {2022}
}