DACSS 601: Data Science Fundamentals - FALL 2022

HW3

Author

Mariia Dubyk

Published

October 12, 2022

Code
library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE)

HW3

  • Tasks from HW2 (read in a dataset, clean data, provide narrative)
  • Include descriptive statistics (e.g., mean, median, and standard deviation for numerical variables, and frequencies and/or mode for categorical variables)
  • Include relevant visualizations using ggplot2 to complement these descriptive statistics. Be sure to use faceting, coloring, and titles as needed. Each visualization should be accompanied by descriptive text that highlights:
    • the variable(s) used
    • what questions might be answered with the visualizations
    • what conclusions you can draw
  • Use group_by() and summarize() to compute descriptive stats and/or visualizations for any relevant groupings. For example, if you were interested in how average income varies by state, you might compute mean income for all states combined, and then compare this to the range and distribution of mean income for each individual state in the US.
  • Identify limitations of your visualization, such as:
    • What questions are left unanswered with your visualizations
    • What about the visualizations may be unclear to a naive viewer
    • How could you improve the visualizations for the final project

Read in a dataset

I decided to use the same data as in HW2. It comes from an open source, the Armed Conflict Location & Event Data Project (ACLED): https://acleddata.com/about-acled/.

Code
library(readr)
library(summarytools)

Attaching package: 'summarytools'
The following object is masked from 'package:tibble':

    view
Code
# The file is semicolon-separated and its first row holds the column names
protest_orig <- read.csv("_data/protest.csv", header = TRUE, sep = ";")
print(
  dfSummary(protest_orig, 
            varnumbers   = FALSE,
            na.col       = FALSE,
            style        = "multiline",
            plain.ascii  = FALSE,
            headings     = FALSE,
            graph.magnif = .8),
  method = "render"
)
Variable [type] | Valid
  Stats / Values | Freqs (% of Valid)

data_id [integer] | Valid: 3734 (100.0%)
  Mean (sd): 8344785 (451513.8)
  min ≤ med ≤ max: 7458963 ≤ 8471949 ≤ 9589314
  IQR (CV): 745585.8 (0.1)
  3734 distinct values

iso [integer] | Valid: 3734 (100.0%)
  Mean (sd): 490.3 (203.9)
  min ≤ med ≤ max: 100 ≤ 616 ≤ 703
  IQR (CV): 268 (0.4)
  100 | 506 (13.6%)
  203 | 255 (6.8%)
  233 | 116 (3.1%)
  348 | 145 (3.9%)
  428 | 70 (1.9%)
  440 | 158 (4.2%)
  616 | 1925 (51.6%)
  642 | 449 (12.0%)
  703 | 110 (2.9%)

event_id_cnty [character] | Valid: 3734 (100.0%)
  1. BGR1705 | 1 (0.0%)
  2. BGR1706 | 1 (0.0%)
  3. BGR1707 | 1 (0.0%)
  4. BGR1708 | 1 (0.0%)
  5. BGR1709 | 1 (0.0%)
  6. BGR1710 | 1 (0.0%)
  7. BGR1711 | 1 (0.0%)
  8. BGR1712 | 1 (0.0%)
  9. BGR1713 | 1 (0.0%)
  10. BGR1714 | 1 (0.0%)
  [ 3724 others ] | 3724 (99.7%)

event_id_no_cnty [integer] | Valid: 3734 (100.0%)
  Mean (sd): 1960.7 (1110.8)
  min ≤ med ≤ max: 62 ≤ 2053.5 ≤ 3835
  IQR (CV): 1385.5 (0.6)
  2817 distinct values

event_date [character] | Valid: 3734 (100.0%)
  1. 19.12.2021 | 94 (2.5%)
  2. 10.10.2021 | 93 (2.5%)
  3. 10.08.2021 | 83 (2.2%)
  4. 06.11.2021 | 55 (1.5%)
  5. 28.01.2021 | 47 (1.3%)
  6. 29.03.2021 | 45 (1.2%)
  7. 27.01.2021 | 42 (1.1%)
  8. 18.02.2021 | 37 (1.0%)
  9. 29.01.2021 | 37 (1.0%)
  10. 08.03.2021 | 34 (0.9%)
  [ 352 others ] | 3167 (84.8%)

year [integer] | Valid: 3734 (100.0%)
  1 distinct value
  2021 | 3734 (100.0%)

time_precision [integer] | Valid: 3734 (100.0%)
  Mean (sd): 1 (0.1)
  min ≤ med ≤ max: 1 ≤ 1 ≤ 3
  IQR (CV): 0 (0.1)
  1 | 3693 (98.9%)
  2 | 35 (0.9%)
  3 | 6 (0.2%)

event_type [character] | Valid: 3734 (100.0%)
  1. Protests | 3525 (94.4%)
  2. Riots | 128 (3.4%)
  3. Strategic developments | 63 (1.7%)
  4. Violence against civilian | 18 (0.5%)

sub_event_type [character] | Valid: 3734 (100.0%)
  1. Arrests | 19 (0.5%)
  2. Attack | 18 (0.5%)
  3. Change to group/activity | 23 (0.6%)
  4. Disrupted weapons use | 1 (0.0%)
  5. Looting/property destruct | 9 (0.2%)
  6. Mob violence | 96 (2.6%)
  7. Other | 11 (0.3%)
  8. Peaceful protest | 3427 (91.8%)
  9. Protest with intervention | 98 (2.6%)
  10. Violent demonstration | 32 (0.9%)

actor1 [character] | Valid: 3734 (100.0%)
  1. Protesters (Poland) | 1693 (45.3%)
  2. Protesters (Bulgaria) | 483 (12.9%)
  3. Protesters (Romania) | 422 (11.3%)
  4. Protesters (Czech Republi | 244 (6.5%)
  5. Protesters (Belarus) | 134 (3.6%)
  6. Protesters (Hungary) | 132 (3.5%)
  7. Protesters (Lithuania) | 126 (3.4%)
  8. Protesters (Estonia) | 113 (3.0%)
  9. Protesters (Slovakia) | 103 (2.8%)
  10. Rioters (International) | 70 (1.9%)
  [ 51 others ] | 214 (5.7%)

assoc_actor_1 [character] | Valid: 3734 (100.0%)
  1. (Empty string) | 1235 (33.1%)
  2. KOD: Committee for the De | 382 (10.2%)
  3. Women (Poland); Women's S | 200 (5.4%)
  4. Women's Strike; Women (Po | 161 (4.3%)
  5. Labour Group (Bulgaria) | 106 (2.8%)
  6. Labour Group (Romania) | 106 (2.8%)
  7. MMD: Million Moments for | 68 (1.8%)
  8. Labour Group (Poland) | 57 (1.5%)
  9. FFF: Fridays for Future; | 49 (1.3%)
  10. Refugees/IDPs (Internatio | 49 (1.3%)
  [ 440 others ] | 1321 (35.4%)

inter1 [integer] | Valid: 3734 (100.0%)
  Mean (sd): 5.9 (0.7)
  min ≤ med ≤ max: 1 ≤ 6 ≤ 8
  IQR (CV): 0 (0.1)
  1 | 54 (1.4%)
  3 | 20 (0.5%)
  5 | 128 (3.4%)
  6 | 3525 (94.4%)
  8 | 7 (0.2%)

actor2 [character] | Valid: 3734 (100.0%)
  1. (Empty string) | 3415 (91.5%)
  2. Police Forces of Poland ( | 72 (1.9%)
  3. Military Forces of Poland | 69 (1.8%)
  4. Protesters (Poland) | 29 (0.8%)
  5. Civilians (Bulgaria) | 12 (0.3%)
  6. Police Forces of Slovakia | 12 (0.3%)
  7. Police Forces of the Czec | 12 (0.3%)
  8. Civilians (Hungary) | 10 (0.3%)
  9. Civilians (International) | 10 (0.3%)
  10. Civilians (Romania) | 8 (0.2%)
  [ 36 others ] | 85 (2.3%)

assoc_actor_2 [character] | Valid: 3734 (100.0%)
  1. (Empty string) | 3583 (96.0%)
  2. Police Forces of Poland ( | 68 (1.8%)
  3. Refugees/IDPs (Internatio | 6 (0.2%)
  4. MW: All-Poland Youth | 5 (0.1%)
  5. Catholic Christian Group | 4 (0.1%)
  6. Refugees/IDPs (Internatio | 3 (0.1%)
  7. Government of Bulgaria (2 | 2 (0.1%)
  8. Health Workers (Poland) | 2 (0.1%)
  9. LGBT (Lithuania) | 2 (0.1%)
  10. LGBT (Poland) | 2 (0.1%)
  [ 56 others ] | 57 (1.5%)

inter2 [integer] | Valid: 3734 (100.0%)
  Mean (sd): 0.3 (1.1)
  min ≤ med ≤ max: 0 ≤ 0 ≤ 8
  IQR (CV): 0 (4.5)
  0 | 3415 (91.5%)
  1 | 200 (5.4%)
  3 | 5 (0.1%)
  5 | 9 (0.2%)
  6 | 47 (1.3%)
  7 | 57 (1.5%)
  8 | 1 (0.0%)

interaction [integer] | Valid: 3734 (100.0%)
  Mean (sd): 56.9 (11.5)
  min ≤ med ≤ max: 10 ≤ 60 ≤ 80
  IQR (CV): 0 (0.2)
  17 distinct values

region [character] | Valid: 3734 (100.0%)
  1. Europe | 3734 (100.0%)

country [character] | Valid: 3734 (100.0%)
  1. Bulgaria | 506 (13.6%)
  2. Czech Republic | 255 (6.8%)
  3. Estonia | 116 (3.1%)
  4. Hungary | 145 (3.9%)
  5. Latvia | 70 (1.9%)
  6. Lithuania | 158 (4.2%)
  7. Poland | 1925 (51.6%)
  8. Romania | 449 (12.0%)
  9. Slovakia | 110 (2.9%)

admin1 [character] | Valid: 3734 (100.0%)
  1. Mazowieckie | 329 (8.8%)
  2. Sofia City | 226 (6.1%)
  3. Malopolskie | 190 (5.1%)
  4. Dolnoslaskie | 167 (4.5%)
  5. Pomorskie | 156 (4.2%)
  6. Slaskie | 156 (4.2%)
  7. Kujawsko-Pomorskie | 154 (4.1%)
  8. Podlaskie | 145 (3.9%)
  9. Wielkopolskie | 142 (3.8%)
  10. Bucharest | 138 (3.7%)
  [ 140 others ] | 1931 (51.7%)

admin2 [character] | Valid: 3734 (100.0%)
  1. Warszawa | 286 (7.7%)
  2. Sofia | 226 (6.1%)
  3. Municipality of Bucharest | 138 (3.7%)
  4. Praha | 116 (3.1%)
  5. Vilniaus | 115 (3.1%)
  6. Krakow | 99 (2.7%)
  7. Wroclaw | 88 (2.4%)
  8. (Empty string) | 87 (2.3%)
  9. Tallinn | 78 (2.1%)
  10. Gdansk | 70 (1.9%)
  [ 503 others ] | 2431 (65.1%)

admin3 [logical] | Valid: 0 (0.0%)
  All NA's

location [character] | Valid: 3734 (100.0%)
  1. Warsaw | 286 (7.7%)
  2. Sofia | 220 (5.9%)
  3. Bucharest | 138 (3.7%)
  4. Prague | 116 (3.1%)
  5. Vilnius | 105 (2.8%)
  6. Krakow | 99 (2.7%)
  7. Wroclaw | 88 (2.4%)
  8. Tallinn | 78 (2.1%)
  9. Gdansk | 70 (1.9%)
  10. Poznan | 70 (1.9%)
  [ 613 others ] | 2464 (66.0%)

latitude [numeric] | Valid: 3734 (100.0%)
  Mean (sd): 50 (4.2)
  min ≤ med ≤ max: 41.4 ≤ 50.7 ≤ 59.4
  IQR (CV): 5.5 (0.1)
  615 distinct values

longitude [numeric] | Valid: 3734 (100.0%)
  Mean (sd): 21 (3.7)
  min ≤ med ≤ max: 12.4 ≤ 21 ≤ 28.8
  IQR (CV): 5.5 (0.2)
  611 distinct values

geo_precision [integer] | Valid: 3734 (100.0%)
  Mean (sd): 1.1 (0.3)
  min ≤ med ≤ max: 1 ≤ 1 ≤ 3
  IQR (CV): 0 (0.3)
  1 | 3497 (93.7%)
  2 | 211 (5.7%)
  3 | 26 (0.7%)

source [character] | Valid: 3734 (100.0%)
  1. Wyborcza | 261 (7.0%)
  2. Committee for the Defence | 223 (6.0%)
  3. Fakti.bg | 182 (4.9%)
  4. Tvn24 | 128 (3.4%)
  5. Charter-97 | 123 (3.3%)
  6. Ogolnopolski Strajk Kobie | 105 (2.8%)
  7. Agerpres | 100 (2.7%)
  8. Dnes.bg | 98 (2.6%)
  9. Adevarul | 83 (2.2%)
  10. FridaysForFuture | 80 (2.1%)
  [ 611 others ] | 2351 (63.0%)

source_scale [character] | Valid: 3734 (100.0%)
  1. National | 2679 (71.7%)
  2. Other | 340 (9.1%)
  3. New media | 320 (8.6%)
  4. Regional | 180 (4.8%)
  5. Subnational | 83 (2.2%)
  6. International | 38 (1.0%)
  7. Subnational-National | 25 (0.7%)
  8. Other-National | 20 (0.5%)
  9. National-Regional | 19 (0.5%)
  10. National-International | 15 (0.4%)
  [ 6 others ] | 15 (0.4%)

notes [character] | Valid: 3734 (100.0%)
  1. On 10 June 2021, Parbesze | 2 (0.1%)
  2. On 10 June 2021, workers | 2 (0.1%)
  3. On 20 June 2021, activist | 2 (0.1%)
  4. On 5 June 2021, around 60 | 2 (0.1%)
  5. Around 10 December 2021 ( | 1 (0.0%)
  6. Around 10 January 2021 (a | 1 (0.0%)
  7. Around 11 May 2021 (as re | 1 (0.0%)
  8. Around 12 November 2021 ( | 1 (0.0%)
  9. Around 12 November 2021 ( | 1 (0.0%)
  10. Around 13 December 2021 ( | 1 (0.0%)
  [ 3720 others ] | 3720 (99.6%)

fatalities [integer] | Valid: 3734 (100.0%)
  Mean (sd): 0 (0.1)
  min ≤ med ≤ max: 0 ≤ 0 ≤ 5
  IQR (CV): 0 (31.9)
  0 | 3728 (99.8%)
  1 | 4 (0.1%)
  2 | 1 (0.0%)
  5 | 1 (0.0%)

timestamp [integer] | Valid: 3734 (100.0%)
  Mean (sd): 1629855755 (11258369)
  min ≤ med ≤ max: 1610472167 ≤ 1631042322 ≤ 1666732458
  IQR (CV): 18745403 (0)
  341 distinct values

iso3 [character] | Valid: 3734 (100.0%)
  1. BGR | 506 (13.6%)
  2. CZE | 255 (6.8%)
  3. EST | 116 (3.1%)
  4. HUN | 145 (3.9%)
  5. LTU | 158 (4.2%)
  6. LVA | 70 (1.9%)
  7. POL | 1925 (51.6%)
  8. ROM | 449 (12.0%)
  9. SVK | 110 (2.9%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

Brief description of the data

The dataset contains information about protests and acts of political violence (including violence by state actors) in nine EU countries of Eastern and Central Europe in 2021.

  • The data frame includes 3734 cases. Each case is an event (a protest, terrorist attack, armed forces action, etc.).
  • There are 31 variables, which fall into two groups:
  1. Variables that describe the event: date, event type, sub-event type, types of actors and their interaction (“inter1”, “inter2”, “interaction”), actor names, place, source of information, notes (details of what happened during the event), and number of fatalities. These are columns 5-6, 8-24, and 26-29.

  2. Variables that hold numeric and character identifiers of the country and event. These are columns 1-4, 7, 25, and 30-31.
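
The two column groups above can be checked with a small sketch in base R. It assumes that the columns of the file appear in the same order as in the dfSummary() output; the vector `cols` is typed in by hand from that output and is not read from the data.

```r
# Column names as listed in the dfSummary() output, assumed to be in file order.
cols <- c(
  "data_id", "iso", "event_id_cnty", "event_id_no_cnty", "event_date",
  "year", "time_precision", "event_type", "sub_event_type", "actor1",
  "assoc_actor_1", "inter1", "actor2", "assoc_actor_2", "inter2",
  "interaction", "region", "country", "admin1", "admin2", "admin3",
  "location", "latitude", "longitude", "geo_precision", "source",
  "source_scale", "notes", "fatalities", "timestamp", "iso3"
)

# Column positions quoted in the text for each group of variables.
event_idx <- c(5:6, 8:24, 26:29)
id_idx    <- c(1:4, 7, 25, 30:31)

event_cols <- cols[event_idx]
id_cols    <- cols[id_idx]

# The two groups should not overlap and should cover all 31 columns.
covers_all <- setequal(c(event_idx, id_idx), seq_along(cols)) &&
  length(intersect(event_idx, id_idx)) == 0

print(event_cols)
print(id_cols)
print(covers_all)
```

Under that assumption the first group holds 23 columns and the second group holds 8, which together account for all 31 variables.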

Clean the data

  • Remove certain columns, the same as in HW2.

I also decided to remove columns 10-11 and 13-14 (actor1, assoc_actor_1, actor2, assoc_actor_2). They contain information about specific organizations (name, country). It seems to me that this kind of data needs qualitative analysis or some categorization. The variables inter1 and inter2 also describe the actors: they sort them into 5 (inter1) and 6 (inter2) groups. A more detailed categorization (such as the ideology of the organization) is probably needed, but that is a task for a bigger research project or a question of how the data is gathered.

  • Organize dates (in a different way than in HW2)

  • Replace empty strings with NA

  • Recode variables according to the codebook https://acleddata.com/acleddatanew/wp-content/uploads/2021/11/ACLED_Codebook_v1_January-2021.pdf

  • Rename variables

Code
# Leave selected columns
protests <- select(protest_orig, "data_id", "event_date", "event_type", "sub_event_type", "inter1", "inter2", "interaction", "country", "admin1", "location", "latitude", "longitude", "source_scale", "fatalities")

# Organize dates (as.Date() is part of base R, so no extra package is needed)
protests$event_date <- as.Date(protests$event_date, format = "%d.%m.%Y")
class(protests$event_date)
[1] "Date"
Code
# Replace empty strings with NA in every character column
# (na_if() works on vectors, so it is applied column by column)
library(dplyr)
protests <- protests %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))

# Recode variables according to the codebook
protests <- protests %>%
  mutate(inter1 = case_when(
         inter1 == 1 ~ "State Forces",
         inter1 == 3 ~ "Political Militias",
         inter1 == 5 ~ "Rioters",
         inter1 == 6 ~ "Protesters",
         inter1 == 8 ~ "External/Other Forces"))

# Code 0 in inter2 means there was no second actor, so it becomes NA
protests <- protests %>%
  mutate(inter2 = case_when(
         inter2 == 1 ~ "State Forces",
         inter2 == 3 ~ "Political Militias",
         inter2 == 5 ~ "Rioters",
         inter2 == 6 ~ "Protesters",
         inter2 == 7 ~ "Civilians",
         inter2 == 8 ~ "External/Other Forces",
         inter2 == 0 ~ NA_character_))
protests <- protests %>%
  mutate(interaction = case_when(interaction == 10 ~ "sole military action",
         interaction == 13 ~ "military versus political militia",
         interaction == 15 ~ "military versus rioters",
         interaction == 16 ~ "military versus protesters",
         interaction == 17 ~ "military versus civilians",
         interaction == 18 ~ "military versus other",
         interaction == 33 ~ "political militia versus political militia",
         interaction == 37 ~ "political militia versus civilians",
         interaction == 50 ~ "sole rioter action",
         interaction == 55 ~ "rioters versus rioters",
         interaction == 56 ~ "rioters versus protesters",
         interaction == 57 ~ "rioters versus civilians",
         interaction == 58 ~ "rioters versus others",
         interaction == 60 ~ "sole protester action",
         interaction == 66 ~ "protesters versus protesters",
         interaction == 78 ~ "other actor versus civilians",
         interaction == 80 ~ "sole other action"))

# Rename
protests <- protests %>%
  rename("actor1_type" = "inter1", "actor2_type" = "inter2")

# Reorder factor levels (most frequent categories first)
library(forcats)
protests <- protests %>%
  mutate(sub_event_type = fct_relevel(sub_event_type, "Peaceful protest", "Protest with intervention", "Mob violence", "Violent demonstration", "Change to group/activity", "Arrests", "Attack", "Other", "Looting/property destruction", "Disrupted weapons use"))
protests <- protests %>%
  mutate(actor1_type = fct_relevel(actor1_type, "Protesters", "Rioters", "State Forces", "Political Militias", "External/Other Forces"))
protests <- protests %>%
   mutate(actor2_type = fct_relevel(actor2_type, "State Forces", "Civilians", "Protesters", "Rioters", "Political Militias", "External/Other Forces"))
protests <- protests %>%
  mutate(country = fct_relevel(country, "Poland", "Bulgaria", "Romania", "Czech Republic", "Lithuania", "Hungary", "Estonia", "Slovakia", "Latvia"))
table(protests$source_scale)

            International                  National    National-International 
                       38                      2679                        15 
        National-Regional                 New media        New media-National 
                       19                       320                         3 
                    Other       Other-International            Other-National 
                      340                         3                        20 
          Other-New media         Other-Subnational                  Regional 
                        1                         2                       180 
   Regional-International               Subnational Subnational-International 
                        2                        83                         4 
     Subnational-National 
                       25 
Code
protests <- protests %>%
  mutate(source_scale = fct_relevel(source_scale, "National", "Other", "New media", "Regional", "Subnational", "International", "Subnational-National", "Other-National", "National-Regional", "National-International", "Subnational-International", "New media-National", "Other-International", "Regional-International", "Other-Subnational", "Other-New media"))
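
The recoding of the actor codes can also be sketched with a named lookup vector instead of case_when(). This is only an alternative illustration in base R, not the method used in the cleaning step. The codes and labels are the ones used in this post (with the label spelled "Political Militias"), and the sample vector `inter2` is made up.

```r
# Codes and labels for the actor type, as used in the cleaning step.
actor_labels <- c(
  "1" = "State Forces",
  "3" = "Political Militias",
  "5" = "Rioters",
  "6" = "Protesters",
  "7" = "Civilians",
  "8" = "External/Other Forces"
)

# Hypothetical sample of inter2 codes; 0 means there was no second actor.
inter2 <- c(0, 1, 7, 6, 0, 3)

# Index the lookup by the code converted to text. Codes that are missing
# from the lookup (here 0) come back as NA, so no extra NA step is needed.
actor2_type <- unname(actor_labels[as.character(inter2)])

print(actor2_type)
```

A lookup vector keeps the codebook mapping in one place, which may help if the same labels are needed for both inter1 and inter2.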

Provide a narrative about the data set

The dataset gives information about protests and acts of political violence in nine countries of Eastern and Central Europe (all of them members of the European Union) in 2021. The countries included in the data frame are Bulgaria, Czech Republic, Estonia, Hungary, Latvia, Lithuania, Poland, Romania, and Slovakia. The cases are events, so for each event we know when and where it happened, who took part in it, whether it was peaceful or violent, how many fatalities it had, and which source reported it.

  1. What happened? The variables “event_type”, “sub_event_type” and “fatalities” describe how the political event is categorized. “event_type” and “sub_event_type” are categorical variables that give the type of the event and basically reflect different levels and types of radicalization during the protest.

  2. When did the event happen? The variable “event_date” contains the day, month and year of the event.

  3. Where did it happen? Several categorical variables help to understand where the protest took place: first the “country”, second the administrative division (“admin1”), and third the city or village (“location”). The columns “longitude” and “latitude” may be used to draw a map of events.

  4. Who participated?

  • The variables “actor1_type” and “actor2_type” give the type of group that took part in the event. actor1_type has 5 types and actor2_type has 6. “interaction” contains 17 types of interaction, which are different combinations of actors. These variables are categorical.

It is important to understand that actors are divided into two main groups. The first is the main actor, the one who organizes the protest (columns “actor1”, “assoc_actor_1”, “inter1”). The second is the one who plays the role of the opposition at the event, for example the police or an opposing organization (“actor2”, “assoc_actor_2”, “inter2”). The interaction between them is categorized in the variable “interaction”.

Descriptive statistics

In 2021 the most frequent type of event was the protest: 3525 of 3734 events, or 94.4% of all events (variable “event_type”). Protests and Strategic developments do not involve violence, while Riots and Violence against civilians are radical actions that involve different types of violent behavior. The next chart shows the proportions of these 4 event types across all nine countries.

*Strategic developments are non-violent actions of state and non-state groups that could potentially use violence (for example, peace agreements). This type is recorded to track all activities of potentially violent groups.

The chart answers the question of which type of political activism and political violence is most frequent in Eastern European countries. We see that the vast majority of events are peaceful, which suggests some level of political stability in the region. It is important to remember that the data refers to Eastern European countries that are members of the European Union, so the conclusion is representative not of Eastern Europe in general but of a geographical group of countries within the EU.
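
The percentages quoted for the event types can be reproduced from the counts in the summary table. The sketch below is base R and types the four counts in by hand rather than reading them from `protests`; the split into non-violent types follows the description in this section.

```r
# Event counts by type, taken from the summary table.
event_counts <- c(
  "Protests" = 3525,
  "Riots" = 128,
  "Strategic developments" = 63,
  "Violence against civilians" = 18
)

total_events <- sum(event_counts)                         # all events in the data
event_share  <- round(100 * prop.table(event_counts), 1)  # percent of all events

# Share of events in the two types described as non-violent.
nonviolent <- event_counts[c("Protests", "Strategic developments")]
nonviolent_share <- round(100 * sum(nonviolent) / total_events, 1)

print(total_events)
print(event_share)
print(nonviolent_share)
```

With these counts, Protests come to 94.4% of all events, and Protests and Strategic developments together come to 96.1%.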

Code
library(viridis)
Loading required package: viridisLite
Code
# One stacked bar that counts the rows per event type; coord_polar() below turns it into a pie
ggplot(protests, aes(x = "", fill = event_type)) + 
  geom_bar() + 
  guides(fill = guide_legend(title = "Event Type")) +
  scale_fill_viridis_d() +
  coord_polar(theta = "y") + 
  theme_void() + ggtitle("Chart 1 Proportion of different event types") +
  theme(plot.title = element_text(hjust = 0.5))

The four event types are divided into 10 subcategories, which make up the variable “sub_event_type”. The next two charts show the proportion of each sub-event type within each event type (chart 2) and the number of events of each sub-event type (chart 3), both for all countries in 2021.

A “Protest” can be “peaceful” or “with intervention” (meaning that peaceful protesters are met with intervention by some opposing group). The vast majority of protests are peaceful (chart 2). “Peaceful protest” also accounted for 91.8% of all events (3427 of 3734) in the nine countries in 2021 (chart 3).

“Riots” are divided into “mob violence” (violent action of one group against another group) and “violent demonstration” (vandalism, road-blocking, using barricades, etc.). Mob violence is more frequent than violent demonstration (chart 2).

“Strategic developments” has 5 categories. The most frequent of them are “change to group/activity” (refers to state forces) and “arrests” (chart 2).

“Violence against civilians” contains only “Attack” (chart 2). Attacks against civilians are not frequent: 18 cases of 3734 (chart 3), which is 0.5% of all events.

So these two charts answer the question of which subcategory dominates within each event type and which subcategory is most common overall. I think that the prevalence of peaceful protests and the relatively small percentage of protests with intervention are a sign of stability, freedom of political activism and safety of protesters. It is also interesting that Riots more often involve violence of one group against another than vandalism. This fact may be interesting for further analysis.
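
The two views described above (share within an event type and share of all events) can be sketched as a grouped calculation. The counts come from the summary table, and the assignment of sub-event types to event types follows the description in this section; the data frame `sub_counts` is built by hand for illustration and is not part of the original analysis.

```r
sub_counts <- data.frame(
  event_type = c(
    "Protests", "Protests",
    "Riots", "Riots",
    rep("Strategic developments", 5),
    "Violence against civilians"
  ),
  sub_event_type = c(
    "Peaceful protest", "Protest with intervention",
    "Mob violence", "Violent demonstration",
    "Change to group/activity", "Arrests", "Other",
    "Looting/property destruction", "Disrupted weapons use",
    "Attack"
  ),
  n = c(3427, 98, 96, 32, 23, 19, 11, 9, 1, 18)
)

# Total per event type, repeated on every row of that type.
type_total <- ave(sub_counts$n, sub_counts$event_type, FUN = sum)

# Percent within the event type (chart 2) and percent of all events (chart 3).
sub_counts$pct_within_type <- round(100 * sub_counts$n / type_total, 1)
sub_counts$pct_of_all      <- round(100 * sub_counts$n / sum(sub_counts$n), 1)

print(sub_counts)
```

In this sketch peaceful protests are 97.2% of all Protests and 91.8% of all events, while mob violence is 75% of Riots.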

Code
# geom_bar(position = "fill") counts the rows and scales each bar to a total of 1
ggplot(protests, aes(x = event_type, fill = sub_event_type)) +
  geom_bar(position = "fill") +
  coord_flip() +
  xlab("Event type") +
  ylab("Share of events") +
  theme_bw() +
  scale_fill_viridis(discrete = TRUE) +
  ggtitle("Chart 2 Proportion of sub event type in four event types") +
  labs(fill = "Sub event type") +
  theme(plot.title = element_text(hjust = 0.5))

Code
ggplot(protests, aes(sub_event_type)) +
  geom_bar(fill = "#440154ff") +
  coord_flip() +
  xlab("Sub event type") +
  ylab("Number of events") +
  theme_bw() +
  ggtitle("Chart 3 Sub event type") +
  theme(plot.title = element_text(hjust = 0.5))

The next chart shows the number of events by country. We can clearly see the difference between Poland and the other countries: Polish events make up 51.6% of all events in the region (1925 of 3734).
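
To put the difference between Poland and the other countries in numbers, the per-country counts from the summary table can be compared with their mean and median. This is a base R sketch with the counts typed in by hand; it is not a step from the original analysis.

```r
# Number of events per country, taken from the summary table.
country_counts <- c(
  "Poland" = 1925, "Bulgaria" = 506, "Romania" = 449,
  "Czech Republic" = 255, "Lithuania" = 158, "Hungary" = 145,
  "Estonia" = 116, "Slovakia" = 110, "Latvia" = 70
)

# Share of all events that took place in each country, in percent.
country_share <- round(100 * prop.table(country_counts), 1)

# Descriptive statistics of the number of events per country.
mean_events   <- round(mean(country_counts), 1)
median_events <- median(country_counts)
sd_events     <- round(sd(country_counts), 1)

print(country_share)
print(c(mean = mean_events, median = median_events, sd = sd_events))
```

The mean (414.9 events per country) is far above the median (158), which is what one would expect when a single country accounts for more than half of the events.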

Code
ggplot(protests, aes(country)) +
  geom_bar(fill = "#440154ff") +
  theme_light() +
  theme(axis.text = element_text(size = 6)) +
  ggtitle("Chart 4 Number of events by country in 2021") +
  ylab("Number of events") +
  xlab("Country") +
  theme(plot.title = element_text(hjust = 0.5))

Charts 5 and 6 answer the question of which type of event is more common in each country and who the actor is at these events. Chart 5 shows which event type was more frequent in each country. For example, there were more events of violence against civilians than in other countries. Latvia has more strategic developments. Chart 6 shows who is most often an actor in political activism and political violence.
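
The shares that a filled bar chart displays can also be computed as a table. The sketch below uses a small made-up data frame (`toy`), because the real per-country counts by event type are not listed in this post; only the technique carries over, not the numbers.

```r
# Hypothetical event-level rows; the real data has one row per event.
toy <- data.frame(
  country = c("Poland", "Poland", "Poland", "Poland",
              "Latvia", "Latvia", "Hungary", "Hungary"),
  event_type = c("Protests", "Protests", "Protests", "Riots",
                 "Protests", "Strategic developments", "Protests", "Riots")
)

# Number of events for every combination of country and event type.
counts <- table(toy$country, toy$event_type)

# Divide each row by its total, so every country sums to 1.
shares <- prop.table(counts, margin = 1)

print(round(shares, 2))
```

Applied to the real data, the same two calls on `protests$country` and `protests$event_type` would give the values behind chart 5.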

Code
# geom_bar(position = "fill") counts the rows and scales each bar to a total of 1
ggplot(protests, aes(x = country, fill = event_type)) +
  geom_bar(position = "fill") +
  xlab("") +
  ylab("Share of events") +
  theme_bw() +
  theme(axis.text = element_text(size = 6)) +
  scale_fill_viridis(discrete = TRUE) +
  ggtitle("Chart 5 Event type by country") +
  labs(fill = "Event type") +
  theme(plot.title = element_text(hjust = 0.5))

Code
# Same approach as chart 5, with the type of the first actor as fill
ggplot(protests, aes(x = country, fill = actor1_type)) +
  geom_bar(position = "fill") +
  coord_flip() +
  xlab("") +
  ylab("Share of events") +
  theme_bw() +
  scale_fill_viridis(discrete = TRUE) +
  ggtitle("Chart 6 Actor type by country") +
  labs(fill = "Actor") +
  theme(plot.title = element_text(hjust = 0.5))

The next chart shows in which season of 2021 different types of events were more frequent. It gives information only about 2021. It should answer the question of whether there are periods of the year when protests, riots or other events are less common. The data do not contain enough cases to draw a conclusion, but the graph reflects the idea that weather influences protest activity.
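
A count of events per month is one way to look at the seasonal question directly. The sketch below is base R and uses only the ten busiest dates listed in the summary table (567 of the 3734 events), so it illustrates the calculation and says nothing reliable about seasons; the data frame `busy_days` is built by hand.

```r
# The ten most frequent event dates and their counts, from the summary table.
busy_days <- data.frame(
  event_date = c("19.12.2021", "10.10.2021", "10.08.2021", "06.11.2021",
                 "28.01.2021", "29.03.2021", "27.01.2021", "18.02.2021",
                 "29.01.2021", "08.03.2021"),
  n = c(94, 93, 83, 55, 47, 45, 42, 37, 37, 34)
)

# Parse day.month.year text into Date values, then pull out the month number.
busy_days$event_date <- as.Date(busy_days$event_date, format = "%d.%m.%Y")
busy_days$month <- format(busy_days$event_date, "%m")

# Sum the event counts within each month.
events_by_month <- tapply(busy_days$n, busy_days$month, sum)

print(events_by_month)
```

On the full data, the same idea would be to add a month column to `protests` and count rows per month and event type.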

Code
# Boxplots of event dates: one row per country, one box per event type
ggplot(protests, aes(x = event_date, y = country, color = event_type)) +
  geom_boxplot() +
  labs(title = "Chart 7 Changes in number of events during 2021",
       x = "Date",
       y = "Country") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + labs(color = "Event type")

The last chart shows which type of event is more frequent in different kinds of media. For example, international media have a bigger proportion of reports about riots. This may be due to different reasons, but I think it would be worth exploring further. Radicals tend to seek media attention, so we may look at where they get it.

Code
# geom_bar(position = "fill") counts the rows and scales each bar to a total of 1
ggplot(protests, aes(x = source_scale, fill = event_type)) +
  geom_bar(position = "fill") +
  coord_flip() +
  xlab("Scale of source") +
  ylab("Share of events") +
  theme_bw() +
  scale_fill_viridis(discrete = TRUE) +
  ggtitle("Chart 8 Scale of source and news about different event types") +
  labs(fill = "Event type") +
  theme(plot.title = element_text(hjust = 0.5))

  • The visualizations do not show how the two actors interact or in which country state actors are more active against protesters.
  • When I finished working with the data, I understood that the charts are very simple because the data is categorical, so I am not sure whether I can choose this dataset for the final project.
Source Code
---
title: "HW3"
author: "Mariia Dubyk"
desription: "HW3"
date: "10/12/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - HW3
---

```{r}
#| label: setup
#| warning: false

library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE)
```
HW3
- Tasks from HW2 (Read in a dataset, clean data, provide narrative)
- Include descriptive statistics (e.g, mean, median, and standard deviation for numerical variables, and frequencies and/or mode for categorical variables
- Include relevant visualizations using ggplot2 to complement these descriptive statistics. Be sure to use faceting, coloring, and titles as needed. Each visualization should be accompanied by descriptive text that highlights:
     the variable(s) used
     what questions might be answered with the visualizations
     what conclusions you can draw
- Use group_by() and summarize() to compute descriptive stats and/or visualizations for any relevant groupings. For example, if you were interested in how average income varies by state, you might compute mean income for all states combined, and then compare this to the range and distribution of mean income for each individual state in the US.
- Identify limitations of your visualization, such as:
    What questions are left unanswered with your visualizations
    What about the visualizations may be unclear to a naive viewer
    How could you improve the visualizations for the final project

 
## Read in a dataset
I decided to use the same data as for HW2 from open source called The Armed Conflict Location & Event Data Project (ACLED) https://acleddata.com/about-acled/.
```{r}
library(readr)
library(summarytools)
protest_orig <- read.csv("_data/protest.csv", head = TRUE, sep=";")
print(
  dfSummary(protest_orig, 
            varnumbers   = FALSE,
            na.col       = FALSE,
            style        = "multiline",
            plain.ascii  = FALSE,
            headings     = FALSE,
            graph.magnif = .8),
  method = "render"
)
```

## Brief description of the data
The dataset contains information about protests and acts of political violence (including state) in 9 EU countries of Eastern and Central Europe in 2021. 

- The dataframe includes 3734 cases. Each case is event (protest/terrorist attack/armed forces action etc)
- There are 31 variables
(1) Variables contain information about the event: date, event type, sub event type, type of actors and their interaction ("inter1", "inter2", "interaction"), actors names, place, source of information, notes (details of what happened during the protest) and number of fatalities. Columns 5-6,8-24,26-29.

(2) Variables that contain different numeric and character identifiers of the country and event.Columns 1-4,7,25,30-31.

## Clean the data
- Remove certain columns.The same as in HW2.

I also decided to remove columns 10-11, 13-14. They contain information about a specific organization (name, country). It seems to me that this kind of data needs qualitative analysis or some categorization. Variable inter1 and inter2 contain information about actors. They categorize actors to 5 (inter1) and 6 (inter2) groups. Probably some detailed categorization is needed (like ideology of the organization) but that is a task for bigger research or a question of how data is gethered.

- Organize dates (in a different way than in HW2)

- Replace empty sting

- Mutate variables due to the codebook https://acleddata.com/acleddatanew/wp-content/uploads/2021/11/ACLED_Codebook_v1_January-2021.pdf

- rename variables

```{r}
# Leave selected columns
protests <- select(protest_orig, "data_id", "event_date", "event_type", "sub_event_type", "inter1", "inter2", "interaction", "country", "admin1", "location", "latitude", "longitude", "source_scale", "fatalities")

# Organize dates
library(date)
protests$event_date <- as.Date(protests$event_date, format("%d.%m.%Y"))
class(protests$event_date)

# Replace empty sting with NA's
library(dplyr)  
protests <- na_if(protests, '')

# Change variables due to the codebook
protests <- protests %>%
  mutate(inter1 = case_when(
         inter1 == 1 ~ "State Forces",
         inter1 == 3 ~ "Political Militas",
         inter1 == 5 ~ "Rioters",
         inter1 == 6 ~ "Protesters",
         inter1 == 8 ~ "External/Other Forces"))

protests <- protests %>%
  mutate(inter2 = case_when(
         inter2 == 1 ~ "State Forces",
         inter2 == 3 ~ "Political Militas",
         inter2 == 5 ~ "Rioters",
         inter2 == 6 ~ "Protesters",
         inter2 == 7 ~ "Civilians",
         inter2 == 8 ~ "External/Other Forces",
         inter2 == 0 ~ "NA"))
protests$inter2 <- na_if(protests$inter2, 'NA')
protests <- protests %>%
  mutate(interaction = case_when(interaction == 10 ~ "sole military action",
         interaction == 13 ~ "military versus political militia",
         interaction == 15 ~ "military versus rioters",
         interaction == 16 ~ "military versus protesters",
         interaction == 17 ~ "military versus civilians",
         interaction == 18 ~ "military versus other",
         interaction == 33 ~ "political militia versus political militia",
         interaction == 37 ~ "political militia versus civilians",
         interaction == 50 ~ "sole rioter action",
         interaction == 55 ~ "rioters versus rioters",
         interaction == 56 ~ "rioters versus protesters",
         interaction == 57 ~ "rioters versus civilians",
         interaction == 58 ~ "rioters versus others",
         interaction == 60 ~ "sole protester action",
         interaction == 66 ~ "protesters versus protesters",
         interaction == 78 ~ "other actor versus civilians",
         interaction == 80 ~ "sole other action"))

# Rename
protests <- protests %>%
  rename("actor1_type" = "inter1", "actor2_type" = "inter2")

# Reorder variables
library(forcats)
protests <- protests %>%
  mutate(sub_event_type = fct_relevel(sub_event_type, "Peaceful protest", "Protest with intervention", "Mob violence", "Violent demonstration", "Change to group/activity", "Arrests", "Attack", "Other", "Looting/property destruction", "Disrupted weapons use"))
protests <- protests %>%
  mutate(actor1_type = fct_relevel(actor1_type, "Protesters", "Rioters", "State Forces", "Political Militas", "External/Other Forces"))
protests <- protests %>%
   mutate(actor2_type = fct_relevel(actor2_type, "State Forces", "Civilians", "Protesters", "Rioters", "Political Militas", "External/Other Forces"))
protests <- protests %>%
  mutate(country = fct_relevel(country, "Poland", "Bulgaria", "Romania", "Czech Republic", "Lithuania", "Hungary", "Estonia", "Slovakia", "Latvia"))
table(protests$source_scale)
protests <- protests %>%
  mutate(source_scale = fct_relevel(source_scale, "National", "Other", "New media", "Regional", "Subnational", "International", "Subnational-National", "Other-National", "National-Regional", "National-International", "Subnational-International", "New media-National", "Other-International", "Regional-International", "Other-Subnational", "Other-New media"))




```

## Provide a narrative about the data set

The dataset gives information about protests and acts of political violence in 9 countries of Eastern and Central Europe (all counties are in the European Union) in 2021. The countries included in the dataframe are Bulgaria, Czech Republic, Estonia, Hungary, Latvia, Lithuania, Poland, Romania, Slovakia. The cases are events. So we have all information about the protest event. When and where it happened, who took part in it, was it peaceful or violet, how many fatalities it had and from which source we know about it.

(1) What happened?
The variables "event_type", "sub_event_type" and "fatalities" describe how a political event is categorized. "event_type" and "sub_event_type" are categorical variables that give the type of the event and broadly reflect different levels and forms of radicalization during the protest.

(2) When did the event happen? The variable "event_date" contains the day, month and year of the event.

(3) Where did it happen? Several categorical variables identify where the protest took place: first the "country", second the administrative division ("admin1"), and third the city or village ("location"). The columns "longitude" and "latitude" can be used to draw a map of events.

(4) Who participated?
The variables "actor1_type" and "actor2_type" give the type of group that took part in the event: "actor1_type" has 5 types and "actor2_type" has 6. The variable "interaction" contains 17 types of interaction, which are different combinations of actors. All three variables are categorical.

It is important to understand that actors are divided into two main groups. The first is the main actor, the one who organizes the protest (columns "actor1", "assoc_actor" and "inter1", the last renamed above to "actor1_type"). The second is the actor who plays the role of the opposition at the event, for example the police or an opposing organization (columns "actor2", "assoc_actor2" and "inter2", the last renamed above to "actor2_type"). The interaction between them is categorized in the variable "interaction".
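
One way to look at how the two actor groups meet is to count every combination of "actor1_type" and "actor2_type". The block below is only a minimal sketch: `toy_events` is a made-up stand-in for `protests`, and its values are illustrative, not taken from ACLED.

```r
library(dplyr)

# Hypothetical stand-in for `protests` (values are invented for illustration)
toy_events <- data.frame(
  actor1_type = c("Protesters", "Protesters", "Rioters", "Rioters", "Protesters"),
  actor2_type = c("State Forces", "State Forces", "Civilians", "State Forces", "Civilians")
)

# Count each actor1/actor2 combination, most frequent pair first
actor_pairs <- toy_events %>%
  count(actor1_type, actor2_type, name = "n_events", sort = TRUE)

actor_pairs
```

Replacing `toy_events` with `protests` should give the same kind of table for the real data, assuming both columns exist under these names after the renaming step.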

### Descriptive statistics
In 2021 the most frequent type of event was the protest: 3525 of 3734 events, or 94.4% of all events (variable "event_type"). Protests and Strategic developments do not involve violence, while Riots and Violence against civilians are radical actions that involve different types of violent behavior. The next chart shows the proportions of these 4 event types across all nine countries.

*Strategic developments are non-violent actions of state and non-state groups that could potentially use violence (for example, peace agreements). This type is recorded to track all activities of potentially violent groups.

The chart answers the question of which types of political activism and political violence are most frequent in Eastern European countries. The vast majority of events are peaceful, which suggests some level of political stability in the region. It is important to remember that the data cover only Eastern European countries that are members of the European Union, so the conclusion is representative not of Eastern Europe in general but of a geographical group of countries within the EU.
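
Counts and percentages like the ones quoted above can be computed with `count()` and `mutate()`. The block below is a hedged sketch on a small invented data frame (`toy_events` is hypothetical), so its numbers do not match the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for `protests`: one row per event (values are made up)
toy_events <- data.frame(
  event_type = c("Protests", "Protests", "Protests", "Riots",
                 "Strategic developments", "Protests", "Protests",
                 "Violence against civilians", "Protests", "Protests")
)

# Count events per type and express each count as a percentage of all events
event_type_freq <- toy_events %>%
  count(event_type, name = "n_events", sort = TRUE) %>%
  mutate(percent = round(100 * n_events / sum(n_events), 1))

event_type_freq
```

Running the same pipeline on `protests` instead of `toy_events` should return the frequency table behind the pie chart.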
```{r}
library(viridis)
# geom_bar() counts the rows per event type, so no dummy y aesthetic is needed
ggplot(protests, aes(x = "", fill = event_type)) + 
  geom_bar(width = 1) + 
  guides(fill = guide_legend(title = "Event Type")) +
  scale_fill_viridis_d() +
  coord_polar(theta = "y") + 
  theme_void() + ggtitle("Chart 1 Proportion of different event types") +
  theme(plot.title = element_text(hjust = 0.5))
```
The four event types are divided into 10 subcategories, which make up the variable "sub_event_type". The next two charts show:

- the proportion of each sub event type within each event type (for all countries in 2021),
- the number of events of each sub event type (for all countries in 2021).

"Protest" can be "peaceful" or "with intervention" (meaning that some opposing group intervened against peaceful protesters). The vast majority of protests are peaceful (chart 2). "Peaceful protest" also accounted for 91.8% of all events (3427 of 3734) in the nine countries in 2021 (chart 3).

"Riots" are divided into "mob violence" (violent action of one group against another group) and "violent demonstration" (vandalism, road-blocking, using barricades etc). Mob violence is more frequent than violent demonstration (chart 2).

"Strategic development" has 5 categories. The most frequent of them are "Change to group/activity" (refers to state forces) and "Arrests" (chart 2).

"Violence against civilians" contains only "Attack" (chart 2). Attacks against civilians are not frequent: 18 cases of 3734 (chart 3), which is 0.5% of all events.

So these two charts answer the question of which subcategory dominates within each event type and which subcategory is most common overall. I think that the popularity of peaceful protests and the relatively small percentage of interrupted protests are a sign of stability, freedom of political activism and safety of protesters.
It is also interesting that when we refer to Riots we refer more to violence of one group against another than to vandalism. This fact may be interesting for further analysis.
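
The shares discussed here come in two flavours: the share of a subcategory within its event type (chart 2) and its share of all events (chart 3). The sketch below shows one possible way to compute both with `group_by()`. The data frame `toy_events` is invented for illustration and is not the real `protests` data.

```r
library(dplyr)

# Hypothetical stand-in for `protests` (values are made up)
toy_events <- data.frame(
  event_type = c("Protests", "Protests", "Protests", "Protests",
                 "Riots", "Riots", "Riots"),
  sub_event_type = c("Peaceful protest", "Peaceful protest", "Peaceful protest",
                     "Protest with intervention",
                     "Mob violence", "Mob violence", "Violent demonstration")
)

sub_event_share <- toy_events %>%
  count(event_type, sub_event_type, name = "n_events") %>%
  # Share of each subcategory inside its own event type
  group_by(event_type) %>%
  mutate(share_within_type = n_events / sum(n_events)) %>%
  ungroup() %>%
  # Share of each subcategory among all events
  mutate(share_of_all = n_events / sum(n_events))

sub_event_share
```

Keeping the two shares side by side makes it easier to see when a subcategory dominates its own event type but is still rare overall.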
```{r}
# position = "fill" rescales the event counts within each event type to proportions
ggplot(protests, aes(x = event_type, fill = sub_event_type)) +
  geom_bar(position = "fill") +
  coord_flip() +
  xlab("Event type") +
  ylab("Share of events") +
  theme_bw() +
  scale_fill_viridis(discrete = TRUE) +
  ggtitle("Chart 2 Proportion of sub event type in four event types") +
  labs(fill = "Sub event type") +
  theme(plot.title = element_text(hjust = 0.5))
```

```{r}
# Bars use one fixed color, so no fill scale is needed
ggplot(protests, aes(sub_event_type)) +
  geom_bar(fill = "#440154ff") +
  coord_flip() +
  xlab("Sub event type") +
  ylab("Number of events") +
  theme_bw() +
  ggtitle("Chart 3 Sub event type") +
  theme(plot.title = element_text(hjust = 0.5))
```


The next chart shows the number of events by country. There is a clear difference between Poland and the other countries: events in Poland account for 51% of all events in the region.
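
A per-country share like this can be computed with `group_by()` and `summarise()`, and the per-country counts can then be summarised across countries. This is a minimal sketch on invented data (`toy_events` is hypothetical), so the numbers are not those of the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for `protests` (values are made up)
toy_events <- data.frame(
  country = c("Poland", "Poland", "Poland", "Poland",
              "Bulgaria", "Bulgaria", "Romania", "Latvia")
)

# Number of events and share of all events for each country
events_by_country <- toy_events %>%
  group_by(country) %>%
  summarise(n_events = n(), .groups = "drop") %>%
  mutate(share = n_events / sum(n_events)) %>%
  arrange(desc(n_events))

# Mean, median and standard deviation of the per-country counts
across_countries <- events_by_country %>%
  summarise(
    mean_events = mean(n_events),
    median_events = median(n_events),
    sd_events = sd(n_events)
  )

events_by_country
across_countries
```

Comparing each country's count with the mean and median across countries shows how far a single country stands out from the rest.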


```{r}
ggplot(protests, aes(country)) +
  geom_bar(fill = "#440154ff") +
  theme_light() +
  theme(axis.text = element_text(size = 6)) +
  ggtitle("Chart 4 Number of events by country in 2021") +
  ylab("Number of events") +
  xlab("Country") +
  theme(plot.title = element_text(hjust = 0.5))

```
Charts 5 and 6 answer the questions of which type of event is more frequent in each country and who the actors at these events are. Chart 5 shows which event type was more frequent in each country. For example, violence against civilians makes up a larger share of events in some countries than in others, and Latvia has a larger share of strategic developments. Chart 6 shows who is most frequently the actor in political activism and political violence.

```{r}
# position = "fill" rescales the event counts within each country to proportions
ggplot(protests, aes(x = country, fill = event_type)) +
  geom_bar(position = "fill") +
  xlab("") +
  ylab("Share of events") +
  theme_bw() +
  theme(axis.text = element_text(size = 6)) +
  scale_fill_viridis(discrete = TRUE) +
  ggtitle("Chart 5 Event type by country") +
  labs(fill = "Event type") +
  theme(plot.title = element_text(hjust = 0.5))

```


```{r}
# Share of each main actor type among the events of each country
ggplot(protests, aes(x = country, fill = actor1_type)) +
  geom_bar(position = "fill") +
  coord_flip() +
  xlab("") +
  ylab("Share of events") +
  theme_bw() +
  scale_fill_viridis(discrete = TRUE) +
  ggtitle("Chart 6 Actor type by country") +
  labs(fill = "Actor") +
  theme(plot.title = element_text(hjust = 0.5))
```
The next chart shows during which part of 2021 the different types of event were more frequent. It covers only 2021. It should answer the question of whether there are periods of the year when protests, riots or other events are less frequent. The data do not contain enough cases to support firm conclusions, but the graph reflects the idea that weather influences protest activity.
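
The seasonal question can also be approached by counting events per month and event type. The block below is a sketch under two assumptions: that "event_date" is stored as a `Date`, and that the invented `toy_events` data frame stands in for `protests`.

```r
library(dplyr)

# Hypothetical stand-in for `protests` (dates and types are made up)
toy_events <- data.frame(
  event_date = as.Date(c("2021-01-15", "2021-01-20", "2021-04-03",
                         "2021-07-11", "2021-07-12", "2021-07-30",
                         "2021-11-02")),
  event_type = c("Protests", "Riots", "Protests",
                 "Protests", "Protests", "Riots", "Protests")
)

# Extract the month number from the date and count events per month and type
monthly_counts <- toy_events %>%
  mutate(month = as.integer(format(event_date, "%m"))) %>%
  count(month, event_type, name = "n_events")

monthly_counts
```

A table like this could be plotted as lines or bars over the months, which shows changes in the number of events more directly than the spread of dates alone.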
```{r}
ggplot(protests, aes(x = event_date, y = country, color = event_type)) +
  geom_boxplot() +
  labs(title = "Chart 7 Changes in number of events during 2021",
       x = "Date",
       y = "Country") + # the y aesthetic is country, not a count
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(color = "Event type")
```
The last chart shows which type of event is more frequent in sources of different scale. For example, international media have a larger proportion of reports about riots. There may be different reasons for this, and I think it is worth exploring further. Radicals tend to seek media attention, so we may look at where they get it.
```{r}
# Share of each event type among the events reported at each source scale
ggplot(protests, aes(x = source_scale, fill = event_type)) +
  geom_bar(position = "fill") +
  coord_flip() +
  xlab("Scale of source") +
  ylab("Share of events") +
  theme_bw() +
  scale_fill_viridis(discrete = TRUE) +
  ggtitle("Chart 8 Scale of source and news about different event types") +
  labs(fill = "Event type") +
  theme(plot.title = element_text(hjust = 0.5))
```
Limitations of the visualizations:

- The visualizations do not show how the two actors interact or in which country state actors engage more actively with protesters.
- When I finished working with the data I realized that the charts are very simple because the data are categorical, so I am not sure whether I can choose this dataset for the final project.