Blog post 3 describing the dataset and research questions as a part of the course “Data Science Fundamentals”
The dataset that I have chosen to work on relates to the startup ecosystem. The word "startup" has become so popular in the last 5 years that almost everyone knows it, which is unsurprising given digitization, technological advancement, and people adapting to a modern way of life. Compared with the hardships of earlier times, people now lead very comfortable lives. Everything is at their fingertips, and whenever they feel something is lacking, a startup comes along to solve that problem.
Many companies have become a part of our day-to-day life. Some have become so synonymous with what they do that we use the startup's name more than the actual word, just as "google" replaced the phrase "search on the net".
My dataset covers a fraction of those startups: ones that are prominent in India but also have a market presence outside India.
I found the dataset on Kaggle.
Once loaded into R, the dataset contains 3044 rows and 8 columns.
Date: Date of the record, in DD/MM/YYYY format
StartupName: Name of the startup
IndustryVertical: Which industry the startup works in
SubVertical: Which sub-industry within that vertical the startup falls under
CityLocation: Where the startup is headquartered
Investors: Who invested in the startup
InvestmentType: What type of investment it was
AmountUSD: How much the startup raised, in USD
I am looking to understand certain patterns from my analysis and answer the below questions:
id <- read.csv("C:/Users/gunde/Downloads/Indian_Startup_Funding.csv", stringsAsFactors = TRUE)
head(id)
        Date  StartupName    IndustryVertical
1 09/01/2020        BYJUS              E-Tech
2 13/01/2020       Shuttl      Transportation
3 09/01/2020    Mamaearth          E-commerce
4 02/01/2020 WealthBucket             FinTech
5 02/01/2020       Fashor Fashion and Apparel
6 13/01/2020        Pando           Logistics
                               SubVertical CityLocation
1                               E-learning    Bangalore
2                App based shuttle service      Gurgaon
3    Retailer of baby and toddler products    Bangalore
4                        Online Investment    New Delhi
5              Embroiled Clothes For Women       Mumbai
6 Open-market, freight management platform      Chennai
                  Investors       InvestmentType    AmountUSD
1   Tiger Global Management Private Equity Round 20,00,00,000
2 Susquehanna Growth Equity             Series C    80,48,394
3     Sequoia Capital India             Series B  1,83,58,860
4            Vinod Khatumal         Pre-series A    30,00,000
5   Sprout Venture Partners           Seed Round    18,00,000
6         Chiratae Ventures             Series A    90,00,000
            Date      StartupName IndustryVertical      SubVertical
               0                0                0                0
    CityLocation        Investors   InvestmentType        AmountUSD
               0                0                0                0
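The row of zeros above reads like a per-column count of missing values. As a minimal sketch of one way to get such a count, the snippet below uses a small made-up data frame (`toy` is hypothetical and only mimics two columns of `id`). It also counts the literal string "nan", which the summary output lists 936 times in SubVertical and which `is.na()` does not flag.

```r
# Toy stand-in for two columns of `id` (values are illustrative only)
toy <- data.frame(
  StartupName = c("BYJUS", "Shuttl", "Mamaearth"),
  SubVertical = c("E-learning", "nan", "nan"),
  stringsAsFactors = TRUE
)

# Count true NA values per column
na_counts <- colSums(is.na(toy))

# Count the literal string "nan", which is.na() does not treat as missing
nan_counts <- sapply(toy, function(col) sum(as.character(col) == "nan"))

print(na_counts)
print(nan_counts)
```

Checking both counts side by side makes it easier to spot columns that look complete but still carry placeholder text.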
summary(id)
     Date          StartupName       IndustryVertical
02/02/2015:  11   Ola Cabs:   8   Consumer Internet: 942
08/07/2015:  11   Swiggy  :   8   Technology       : 478
30/11/2016:  11   BYJUS   :   7   E-Commerce       : 287
04/10/2016:  10   Paytm   :   7   Healthcare       :  72
01/06/2015:   9   Medinfi :   6   Finance          :  63
04/05/2016:   9   Meesho  :   6   Logistics        :  32
(Other)   :2983   (Other) :3002   (Other)          :1170
         SubVertical             CityLocation
nan                      : 936   Bangalore:863
Online Lending Platform  :  11   Mumbai   :612
Online Pharmacy          :  10   New Delhi:424
Food Delivery Platform   :   8   Gurgaon  :344
Education                :   5   Hyderabad:166
Online Education Platform:   5   Chennai  :142
(Other)                  :2069   (Other)  :493
          Investors                  InvestmentType
Undisclosed Investors   : 104   Private Equity      :1356
Ratan Tata              :  25   Seed Funding        :1355
Indian Angel Network    :  24   Seed/ Angel Funding :  60
Shell Foundation        :  21   Seed / Angel Funding:  47
Kalaari Capital         :  16   Seed\\\\nFunding    :  30
Group of Angel Investors:  15   Debt Funding        :  25
(Other)                 :2839   (Other)             : 171
   AmountUSD
100,000  : 224
25,000   : 199
10,00,000: 165
231,046  : 136
50,000   : 117
5,00,000 : 108
(Other)  :2095
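The InvestmentType counts above include several spellings of what looks like the same category ("Seed/ Angel Funding", "Seed / Angel Funding", "Seed\\\\nFunding"). Below is a hedged sketch of how such labels could be normalized before counting. The helper `clean_type` and the sample vector are hypothetical, and whether these labels should really be merged is an assumption.

```r
# Hypothetical sample of raw labels, modeled on the summary output.
# "Seed\\\\nFunding" is the text Seed, two literal backslashes, nFunding.
raw_type <- c("Seed Funding", "Seed/ Angel Funding", "Seed / Angel Funding",
              "Seed\\\\nFunding", "Private Equity")

# Hypothetical helper that tidies the label text
clean_type <- function(x) {
  x <- gsub("\\\\+n", " ", x)        # one or more literal backslashes followed by n
  x <- gsub("\\s*/\\s*", " / ", x)   # uniform spacing around slashes
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_type(raw_type)
print(table(cleaned))
```

Applied to a factor column, the same helper would need `as.character()` first, and the result could be turned back into a factor afterwards.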
#Converting the Date column to a proper Date format
id$Date <- as.Date(id$Date, format = "%d/%m/%Y")
#Checking for changes
str(id)
'data.frame': 3044 obs. of 8 variables:
$ Date : Date, format: "2020-01-09" ...
$ StartupName : Factor w/ 2453 levels "#Fame","121Policy",..: 276 1911 1308 2307 656 1579 2441 568 295 495 ...
$ IndustryVertical: Factor w/ 878 levels "360-degree view creating platform",..: 200 827 190 275 254 440 358 822 191 7 ...
$ SubVertical : Factor w/ 1943 levels "\"Women\\\\'s Fashion Clothing Online Platform\"",..: 461 75 1647 1279 515 1487 1216 28 112 1681 ...
$ CityLocation : Factor w/ 100 levels "Agra","Ahemadabad",..: 5 29 5 62 55 19 29 74 29 5 ...
$ Investors : Factor w/ 2405 levels " Sandeep Aggarwal, Teruhide Sato",..: 2114 2061 1857 2316 1997 468 218 1827 1523 1365 ...
$ InvestmentType : Factor w/ 55 levels "Angel","Angel / Seed Funding",..: 26 43 41 21 36 40 26 40 44 30 ...
$ AmountUSD : Factor w/ 476 levels "1,00,00,00,000",..: 184 451 75 263 129 469 115 386 406 336 ...
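In the `str()` output, AmountUSD is a factor because the amounts are stored as text with comma separators (for example "20,00,00,000"). The sketch below shows one way to convert such values to numbers, assuming the commas are pure digit-group separators. `to_number` is a hypothetical helper, and the sample values are copied from the first rows of the data.

```r
# Sample amounts as they appear in the first rows of the data
amount_chr <- factor(c("20,00,00,000", "80,48,394", "1,83,58,860", "30,00,000"))

# Hypothetical helper: drop the commas, then convert to numeric.
# as.character() comes first because as.numeric() on a factor returns level codes.
to_number <- function(x) as.numeric(gsub(",", "", as.character(x)))

amount_num <- to_number(amount_chr)
print(amount_num)
```

Entries that are not numeric text would come back as NA with a warning, so it is worth counting the NA values after the conversion.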
#Extracting year and creating a new column
id$year <- as.numeric(format(id$Date,"%Y"))
#Creating year table
id_year <- table(id$year)
id_year
2015 2016 2017 2018 2019 2020
 931  993  687  309  111    7
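The yearly counts above add up to 3038, while `str()` reports 3044 rows, which suggests that a handful of dates did not match the `%d/%m/%Y` format and became NA. Below is a small sketch of how this could be checked. The date strings are made up for illustration.

```r
# Made-up date strings; the last two do not follow DD/MM/YYYY
raw_dates <- c("09/01/2020", "13/01/2020", "02/02/2015", "05/072018", "12/05.2015")

parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# Number of strings that failed to parse
n_failed <- sum(is.na(parsed))

# Year table that keeps the NA group visible instead of dropping it
year_counts <- table(as.numeric(format(parsed, "%Y")), useNA = "ifany")

print(n_failed)
print(year_counts)
```

Printing `raw_dates[is.na(parsed)]` would list the offending strings so they can be corrected by hand.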
#Plotting the number of startups funded each year as a pie chart
pie(id_year, edges = 10, main = "No. of startups funded each year")
#The table below shows the number of startups funded each year
#2015 2016 2017 2018 2019 2020
# 931  993  687  309  111    7
     Ola Cabs        Swiggy         BYJUS         Paytm       Medinfi
            8             8             7             7             6
       Meesho      NoBroker         Nykaa     UrbanClap Capital Float
            6             6             6             6             5

              ZoomCar                ZopHop                ZopNow
                    1                     1                     1
               Zopper Zovi.com / Little App             ZuperMeal
                    1                     1                     1
              Zuppler                 Zuver                Zwayam
                    1                     1                     1
              Zzungry
                    1
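The counts above look like a frequency table of StartupName sorted from most to least frequent, showing its first and last entries. Below is a minimal sketch of how such a table could be built, following the `sort(table(...))` pattern used elsewhere in this post. The names in `toy_names` are a made-up sample.

```r
# Made-up sample of startup names with repeats
toy_names <- factor(c("Swiggy", "BYJUS", "Swiggy", "Paytm", "BYJUS", "Swiggy", "Zuver"))

# Frequency of each name, most frequent first
name_counts <- sort(table(toy_names), decreasing = TRUE)

print(head(name_counts, 2))   # most frequently listed names
print(tail(name_counts, 2))   # least frequently listed names
```

On the real data, `head(..., 10)` and `tail(..., 10)` would give views similar to the two blocks shown above.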
#Creating a data frame of the top industry verticals by number of funded startups
library(ggplot2)
sample_data <- data.frame(name = c("Consumer Internet","Technology","E-Commerce","Healthcare","Finance","Logistics","Saas","Education","Food & Beverage","FinTech"),
                          value = c(942,478,287,72,63,32,28,25,23,19))
#Creating a horizontal bar plot with count labels
industry_plot <- ggplot(sample_data, aes(name, value)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  geom_text(aes(label = value), nudge_y = 3)
industry_plot +
  coord_flip()
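The counts in `sample_data` are typed in by hand. As a hedged alternative, the same kind of data frame could be derived from a frequency table so that the chart stays in sync with the data. In this sketch `toy_vertical` is a made-up stand-in for `id$IndustryVertical`, and `top_df` is a hypothetical name.

```r
# Made-up stand-in for id$IndustryVertical
toy_vertical <- factor(c(rep("Consumer Internet", 5), rep("Technology", 3),
                         rep("E-Commerce", 2), "Healthcare"))

# Top verticals by frequency
top_tab <- head(sort(table(toy_vertical), decreasing = TRUE), 3)

# Convert to a data frame with the same column names as sample_data
top_df <- data.frame(name = names(top_tab),
                     value = as.vector(top_tab),
                     stringsAsFactors = FALSE)
print(top_df)
```

Because `top_df` has the same `name` and `value` columns, it could be passed to the same `ggplot()` call in place of the hand-typed data frame.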
   Undisclosed Investors               Ratan Tata
                     104                       25
    Indian Angel Network         Shell Foundation
                      24                       21
         Kalaari Capital Group of Angel Investors
                      16                       15
         Sequoia Capital           Accel Partners
                      15                       12
           Brand Capital              Undisclosed
                      11                       11
## Visualization using a bar chart
#Computing the top 10 investors once and reusing the result
top_investors <- head(sort(table(id$Investors), decreasing = TRUE), 10)
b <- barplot(top_investors, col = rainbow(10, 0.5), las = 2, ylim = c(0, 150), xlab = "Investors", ylab = "No. of Investments")
text(b, top_investors, top_investors, srt = 90, pos = 4)
##Industry vertical frequency table
industrytable_head <- head(sort(table(id$IndustryVertical), decreasing=TRUE),4)
industrytable_head
Consumer Internet        Technology        E-Commerce        Healthcare
              942               478               287                72
#Creating a data frame of the top four industry verticals
i3 <- data.frame(category = c("Consumer Internet","Technology","E-Commerce","Healthcare"),
                 count = c(942,478,287,72))
#Calculating each category's fraction and the start and end of its ring segment
i3$fraction <- i3$count / sum(i3$count)
i3$ymax <- cumsum(i3$fraction)
i3$ymin <- c(0, head(i3$ymax, n=-1))
#Placing each label at the midpoint of its segment
i3$labelPosition <- (i3$ymax + i3$ymin) / 2
i3$label <- paste0(i3$category, "\n value: ", i3$count)
# Make the plot
ggplot(i3, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
  geom_rect() +
  geom_label(x=3, aes(y=labelPosition, label=label), size=4) +
  scale_fill_brewer(palette=4) +
  coord_polar(theta="y") +
  xlim(c(1, 5)) +
  theme_void()
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gundeti (2022, May 19). Data Analytics and Computational Social Science: 601_HW3. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomrahulgdacss601hw3/
BibTeX citation
@misc{gundeti2022601_hw3,
  author = {Gundeti, Rahul},
  title = {Data Analytics and Computational Social Science: 601_HW3},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomrahulgdacss601hw3/},
  year = {2022}
}