DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

(Untitled)

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Read in a dataset
  • Brief description of the data
  • Clean the data
  • Clean the data: continuing
  • Clean the data: continuing
  • Provide a narrative about the data set
  • Identify potential research questions that your dataset can help answer
library(tidyverse)
library(ggplot2)

knitr::opts_chunk$set(echo = TRUE)

Read in a dataset

I decided to use a dataset from open source called The Armed Conflict Location & Event Data Project (ACLED) https://acleddata.com/about-acled/. One of my sphere of interests is social movements and protests but I have never worked with qualitative data related to this topic. So I decided to find related dataset.

library(readr)
protest_orig <- read.csv("_data/protest.csv", head = TRUE, sep=";")
library(summarytools)

Attaching package: 'summarytools'
The following object is masked from 'package:tibble':

    view
print(
  dfSummary(protest_orig, 
            varnumbers   = FALSE,
            na.col       = FALSE,
            style        = "multiline",
            plain.ascii  = FALSE,
            headings     = FALSE,
            graph.magnif = .8),
  method = "render"
)
Variable Stats / Values Freqs (% of Valid) Graph Valid
data_id [integer]
Mean (sd) : 8344785 (451513.8)
min ≤ med ≤ max:
7458963 ≤ 8471949 ≤ 9589314
IQR (CV) : 745585.8 (0.1)
3734 distinct values 3734 (100.0%)
iso [integer]
Mean (sd) : 490.3 (203.9)
min ≤ med ≤ max:
100 ≤ 616 ≤ 703
IQR (CV) : 268 (0.4)
100:506(13.6%)
203:255(6.8%)
233:116(3.1%)
348:145(3.9%)
428:70(1.9%)
440:158(4.2%)
616:1925(51.6%)
642:449(12.0%)
703:110(2.9%)
3734 (100.0%)
event_id_cnty [character]
1. BGR1705
2. BGR1706
3. BGR1707
4. BGR1708
5. BGR1709
6. BGR1710
7. BGR1711
8. BGR1712
9. BGR1713
10. BGR1714
[ 3724 others ]
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
3724(99.7%)
3734 (100.0%)
event_id_no_cnty [integer]
Mean (sd) : 1960.7 (1110.8)
min ≤ med ≤ max:
62 ≤ 2053.5 ≤ 3835
IQR (CV) : 1385.5 (0.6)
2817 distinct values 3734 (100.0%)
event_date [character]
1. 19.12.2021
2. 10.10.2021
3. 10.08.2021
4. 06.11.2021
5. 28.01.2021
6. 29.03.2021
7. 27.01.2021
8. 18.02.2021
9. 29.01.2021
10. 08.03.2021
[ 352 others ]
94(2.5%)
93(2.5%)
83(2.2%)
55(1.5%)
47(1.3%)
45(1.2%)
42(1.1%)
37(1.0%)
37(1.0%)
34(0.9%)
3167(84.8%)
3734 (100.0%)
year [integer] 1 distinct value
2021:3734(100.0%)
3734 (100.0%)
time_precision [integer]
Mean (sd) : 1 (0.1)
min ≤ med ≤ max:
1 ≤ 1 ≤ 3
IQR (CV) : 0 (0.1)
1:3693(98.9%)
2:35(0.9%)
3:6(0.2%)
3734 (100.0%)
event_type [character]
1. Protests
2. Riots
3. Strategic developments
4. Violence against civilian
3525(94.4%)
128(3.4%)
63(1.7%)
18(0.5%)
3734 (100.0%)
sub_event_type [character]
1. Arrests
2. Attack
3. Change to group/activity
4. Disrupted weapons use
5. Looting/property destruct
6. Mob violence
7. Other
8. Peaceful protest
9. Protest with intervention
10. Violent demonstration
19(0.5%)
18(0.5%)
23(0.6%)
1(0.0%)
9(0.2%)
96(2.6%)
11(0.3%)
3427(91.8%)
98(2.6%)
32(0.9%)
3734 (100.0%)
actor1 [character]
1. Protesters (Poland)
2. Protesters (Bulgaria)
3. Protesters (Romania)
4. Protesters (Czech Republi
5. Protesters (Belarus)
6. Protesters (Hungary)
7. Protesters (Lithuania)
8. Protesters (Estonia)
9. Protesters (Slovakia)
10. Rioters (International)
[ 51 others ]
1693(45.3%)
483(12.9%)
422(11.3%)
244(6.5%)
134(3.6%)
132(3.5%)
126(3.4%)
113(3.0%)
103(2.8%)
70(1.9%)
214(5.7%)
3734 (100.0%)
assoc_actor_1 [character]
1. (Empty string)
2. KOD: Committee for the De
3. Women (Poland); Women's S
4. Women's Strike; Women (Po
5. Labour Group (Bulgaria)
6. Labour Group (Romania)
7. MMD: Million Moments for
8. Labour Group (Poland)
9. FFF: Fridays for Future;
10. Refugees/IDPs (Internatio
[ 440 others ]
1235(33.1%)
382(10.2%)
200(5.4%)
161(4.3%)
106(2.8%)
106(2.8%)
68(1.8%)
57(1.5%)
49(1.3%)
49(1.3%)
1321(35.4%)
3734 (100.0%)
inter1 [integer]
Mean (sd) : 5.9 (0.7)
min ≤ med ≤ max:
1 ≤ 6 ≤ 8
IQR (CV) : 0 (0.1)
1:54(1.4%)
3:20(0.5%)
5:128(3.4%)
6:3525(94.4%)
8:7(0.2%)
3734 (100.0%)
actor2 [character]
1. (Empty string)
2. Police Forces of Poland (
3. Military Forces of Poland
4. Protesters (Poland)
5. Civilians (Bulgaria)
6. Police Forces of Slovakia
7. Police Forces of the Czec
8. Civilians (Hungary)
9. Civilians (International)
10. Civilians (Romania)
[ 36 others ]
3415(91.5%)
72(1.9%)
69(1.8%)
29(0.8%)
12(0.3%)
12(0.3%)
12(0.3%)
10(0.3%)
10(0.3%)
8(0.2%)
85(2.3%)
3734 (100.0%)
assoc_actor_2 [character]
1. (Empty string)
2. Police Forces of Poland (
3. Refugees/IDPs (Internatio
4. MW: All-Poland Youth
5. Catholic Christian Group
6. Refugees/IDPs (Internatio
7. Government of Bulgaria (2
8. Health Workers (Poland)
9. LGBT (Lithuania)
10. LGBT (Poland)
[ 56 others ]
3583(96.0%)
68(1.8%)
6(0.2%)
5(0.1%)
4(0.1%)
3(0.1%)
2(0.1%)
2(0.1%)
2(0.1%)
2(0.1%)
57(1.5%)
3734 (100.0%)
inter2 [integer]
Mean (sd) : 0.3 (1.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 8
IQR (CV) : 0 (4.5)
0:3415(91.5%)
1:200(5.4%)
3:5(0.1%)
5:9(0.2%)
6:47(1.3%)
7:57(1.5%)
8:1(0.0%)
3734 (100.0%)
interaction [integer]
Mean (sd) : 56.9 (11.5)
min ≤ med ≤ max:
10 ≤ 60 ≤ 80
IQR (CV) : 0 (0.2)
17 distinct values 3734 (100.0%)
region [character] 1. Europe
3734(100.0%)
3734 (100.0%)
country [character]
1. Bulgaria
2. Czech Republic
3. Estonia
4. Hungary
5. Latvia
6. Lithuania
7. Poland
8. Romania
9. Slovakia
506(13.6%)
255(6.8%)
116(3.1%)
145(3.9%)
70(1.9%)
158(4.2%)
1925(51.6%)
449(12.0%)
110(2.9%)
3734 (100.0%)
admin1 [character]
1. Mazowieckie
2. Sofia City
3. Malopolskie
4. Dolnoslaskie
5. Pomorskie
6. Slaskie
7. Kujawsko-Pomorskie
8. Podlaskie
9. Wielkopolskie
10. Bucharest
[ 140 others ]
329(8.8%)
226(6.1%)
190(5.1%)
167(4.5%)
156(4.2%)
156(4.2%)
154(4.1%)
145(3.9%)
142(3.8%)
138(3.7%)
1931(51.7%)
3734 (100.0%)
admin2 [character]
1. Warszawa
2. Sofia
3. Municipality of Bucharest
4. Praha
5. Vilniaus
6. Krakow
7. Wroclaw
8. (Empty string)
9. Tallinn
10. Gdansk
[ 503 others ]
286(7.7%)
226(6.1%)
138(3.7%)
116(3.1%)
115(3.1%)
99(2.7%)
88(2.4%)
87(2.3%)
78(2.1%)
70(1.9%)
2431(65.1%)
3734 (100.0%)
admin3 [logical]
All NA's
0 (0.0%)
location [character]
1. Warsaw
2. Sofia
3. Bucharest
4. Prague
5. Vilnius
6. Krakow
7. Wroclaw
8. Tallinn
9. Gdansk
10. Poznan
[ 613 others ]
286(7.7%)
220(5.9%)
138(3.7%)
116(3.1%)
105(2.8%)
99(2.7%)
88(2.4%)
78(2.1%)
70(1.9%)
70(1.9%)
2464(66.0%)
3734 (100.0%)
latitude [numeric]
Mean (sd) : 50 (4.2)
min ≤ med ≤ max:
41.4 ≤ 50.7 ≤ 59.4
IQR (CV) : 5.5 (0.1)
615 distinct values 3734 (100.0%)
longitude [numeric]
Mean (sd) : 21 (3.7)
min ≤ med ≤ max:
12.4 ≤ 21 ≤ 28.8
IQR (CV) : 5.5 (0.2)
611 distinct values 3734 (100.0%)
geo_precision [integer]
Mean (sd) : 1.1 (0.3)
min ≤ med ≤ max:
1 ≤ 1 ≤ 3
IQR (CV) : 0 (0.3)
1:3497(93.7%)
2:211(5.7%)
3:26(0.7%)
3734 (100.0%)
source [character]
1. Wyborcza
2. Committee for the Defence
3. Fakti.bg
4. Tvn24
5. Charter-97
6. Ogolnopolski Strajk Kobie
7. Agerpres
8. Dnes.bg
9. Adevarul
10. FridaysForFuture
[ 611 others ]
261(7.0%)
223(6.0%)
182(4.9%)
128(3.4%)
123(3.3%)
105(2.8%)
100(2.7%)
98(2.6%)
83(2.2%)
80(2.1%)
2351(63.0%)
3734 (100.0%)
source_scale [character]
1. National
2. Other
3. New media
4. Regional
5. Subnational
6. International
7. Subnational-National
8. Other-National
9. National-Regional
10. National-International
[ 6 others ]
2679(71.7%)
340(9.1%)
320(8.6%)
180(4.8%)
83(2.2%)
38(1.0%)
25(0.7%)
20(0.5%)
19(0.5%)
15(0.4%)
15(0.4%)
3734 (100.0%)
notes [character]
1. On 10 June 2021, Parbesze
2. On 10 June 2021, workers
3. On 20 June 2021, activist
4. On 5 June 2021, around 60
5. Around 10 December 2021 (
6. Around 10 January 2021 (a
7. Around 11 May 2021 (as re
8. Around 12 November 2021 (
9. Around 12 November 2021 (
10. Around 13 December 2021 (
[ 3720 others ]
2(0.1%)
2(0.1%)
2(0.1%)
2(0.1%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
3720(99.6%)
3734 (100.0%)
fatalities [integer]
Mean (sd) : 0 (0.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 5
IQR (CV) : 0 (31.9)
0:3728(99.8%)
1:4(0.1%)
2:1(0.0%)
5:1(0.0%)
3734 (100.0%)
timestamp [integer]
Mean (sd) : 1629855755 (11258369)
min ≤ med ≤ max:
1610472167 ≤ 1631042322 ≤ 1666732458
IQR (CV) : 18745403 (0)
341 distinct values 3734 (100.0%)
iso3 [character]
1. BGR
2. CZE
3. EST
4. HUN
5. LTU
6. LVA
7. POL
8. ROM
9. SVK
506(13.6%)
255(6.8%)
116(3.1%)
145(3.9%)
158(4.2%)
70(1.9%)
1925(51.6%)
449(12.0%)
110(2.9%)
3734 (100.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-20

Brief description of the data

The dataset I use in this task contains information about protests and acts of political violence in 9 countries of Eastern and Central Europe in 2021.

  • The dataframe includes 3734 cases. Each case is a protest event.
  • There are 31 variables
  1. Variables that contain information about the event: date, event type, sub event type, actors (participants), place, source of information, notes (details of what happened during the protest) and number of fatalities.
  2. Variables that contain different numeric and character identifiers of the country, event, type, actors, interaction.

Clean the data

From the first sight, the data does not need much cleaning. For example, we do not see much missing data. However, it makes sense to

  • Remove certain columns. (1) For example, we can see name of the country and character identifier so we can leave one of them. (2) Also there is column “notes” which contains details of the event. It may be used only for qualitative analysis. (3) “Admin3” has all NA’s. “Admin2” is the same as “location” but in English, so I leave “location” and remove “admin2”. (4) All cases happened in Europe so it probably makes sense to remove column “region”. I will leave latitude and longitude in case one want to visualize a map. (5) I will remove some of the identifiers.

  • Organize dates

  • Replace empty sting

  • Mutate variables due to the codebook https://acleddata.com/acleddatanew/wp-content/uploads/2021/11/ACLED_Codebook_v1_January-2021.pdf

  • Transform variables with information about actors. In current dataset many actor (participant groups) has information about origin country in brackets. It may be some country or international. I would like to make from information in brackets another column.

# Leave selected columns
protest_data <- select(protest_orig, "data_id", "event_date", "year", "event_type", "sub_event_type", "actor1", "assoc_actor_1", "inter1", "actor2", "assoc_actor_2", "inter2", "interaction", "country", "admin1", "location", "latitude", "longitude", "source", "source_scale", "fatalities")

# Organize dates
library(stringr)
protest_data <- select(protest_data, !contains("year"))
protest_data <- protest_data %>%
  mutate(event_date=str_remove(event_date,".2021"))%>%
  mutate(event_date=str_replace_all(event_date, c(".12" = " Dec", ".11" = " Nov", ".10" = " Oct", ".09" = " Sep", ".08" = " Aug", ".07" = " Jul", ".06" = " Jun", ".05" = " May", ".04" = " Apr", ".03" = " Mar", ".02" = " Feb", ".01" = " Jan"))) %>%
  separate(col=event_date, into=c("day", "month"), sep=" ", remove = TRUE)


# Replace empty sting with NA's
library(dplyr)  
protest_data <- na_if(protest_data, '')

# Mutate variables due to the codebook
protest_data <- protest_data %>%
  mutate(actor1_type = case_when(
         inter1 == 1 ~ "State Forces",
         inter1 == 3 ~ "Political Militas",
         inter1 == 5 ~ "Rioters",
         inter1 == 6 ~ "Protesters",
         inter1 == 8 ~ "External/Other Forces"))
protest_data <- protest_data %>%
  mutate(actor2_type = case_when(
         inter2 == 1 ~ "State Forces",
         inter2 == 3 ~ "Political Militas",
         inter2 == 5 ~ "Rioters",
         inter2 == 6 ~ "Protesters",
         inter2 == 7 ~ "Civilians",
         inter2 == 8 ~ "External/Other Forces",
         inter2 == 0 ~ "NA"))
protest_data <- na_if(protest_data, 'NA')
protest_data <- select(protest_data, !contains("Inter1"))
protest_data <- select(protest_data, !contains("Inter2"))
protest_data <- protest_data %>%
  mutate(interact_type = case_when(interaction == 10 ~ "sole military action",
         interaction == 13 ~ "military versus political militia",
         interaction == 15 ~ "military versus rioters",
         interaction == 16 ~ "military versus protesters",
         interaction == 17 ~ "military versus civilians",
         interaction == 18 ~ "military versus other",
         interaction == 33 ~ "political militia versus political militia",
         interaction == 37 ~ "political militia versus civilians",
         interaction == 50 ~ "sole rioter action",
         interaction == 55 ~ "rioters versus rioters",
         interaction == 56 ~ "rioters versus protesters",
         interaction == 57 ~ "rioters versus civilians",
         interaction == 58 ~ "rioters versus others",
         interaction == 60 ~ "sole protester action",
         interaction == 66 ~ "protesters versus protesters",
         interaction == 78 ~ "other actor versus civilians",
         interaction == 80 ~ "sole other action"))
protest_data <- select(protest_data, !contains("Interaction"))

Clean the data: continuing

Variables with actors (actor1, assoc_actor_1, actor2, assoc_actor_2) are not clear. Some of them include countries in brackets, some contain several organizations or movements.

For actor1” and “actor2” I decided to separate name and country.

Next tasks: - Turn variables with information about actor1 into two variables (add country_from) - Turn variables with information about actor2 into two variables (add country_from2)

# Turn variables with information about actor1 into two variables (add country_from)
## Remove " " to prepare for separation
protest_data <- protest_data %>%
  mutate(actor1=str_replace_all(actor1, c (" the Czech Republic" = " Czechia", "Czech Republic" = "Czechia", " the United Kingdom" = " UK", "United Kingdom" = "UK"))) %>%
  mutate(actor1=str_replace_all(actor1, c ("Police Forces of" = "Police_Forces", "Military Forces of" = "Military_Forces", "Private Security Forces" = "Private_Security_Forces", "Unidentified Armed Group" = "Unidentified_Armed_Group", "Government of" = "Government"))) %>%
  mutate(actor1=str_replace_all(actor1, c("PNL: National Liberal Party" = "PLN:_National_Liberal_Party (Poland)"))) %>%
  mutate(actor1=str_replace_all(actor1, c (" Prison Guards" = " Prison_Guards", " Border Guards" = " Border_Guards", " Border Police" = " Border_Police")))

## Separate columns and tidy data
protest_data <- protest_data %>%
  separate(col=actor1, into=c("actor_1", "country_from", "info3", "info4"), sep=" ", remove = FALSE)
Warning: Expected 4 pieces. Missing pieces filled with `NA` in 3730 rows [1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
protest_data <- select(protest_data, !contains("info3"))
protest_data <- protest_data %>%
  unite("actor_1", actor_1, info4, remove = TRUE)
protest_data <- protest_data %>%
  mutate(actor_1 = str_remove(actor_1, "_NA"))
protest_data <- select(protest_data, !contains("info4"))
protest_data <- protest_data %>%
  mutate(country_from=str_replace_all(country_from, "\\*|\\(|\\)", ""))
protest_data <- protest_data %>%
  mutate(actor_1=str_replace_all(actor_1, c("_" = " ")))
protest_data  <-  rename(protest_data, "actor1_combined" = "actor1")

# Turn variables with information about actor2 into two variables (add country_from2)
## Remove " " to prepare for separation
protest_data <- protest_data %>%
  mutate(actor2=str_replace_all(actor2, c (" the Czech Republic" = " Czechia", "Czech Republic" = "Czechia", " the United Kingdom" = " UK", "United Kingdom" = "UK"))) %>%
  mutate(actor2=str_replace_all(actor2, c ("Police Forces of" = "Police_Forces", "Military Forces of" = "Military_Forces", "Private Security Forces" = "Private_Security_Forces", "Unidentified Armed Group" = "Unidentified_Armed_Group"))) %>%
  mutate(actor2=str_replace_all(actor2, c("TISAP: There is Such A People" = "TISAP:_There_is_Such_A_People", "GERB: Citizens for European Development of" = "GERB:_Citizens_for_European_Development")))

## Separate columns and tidy data
protest_data <- protest_data %>%
  separate(col=actor2, into=c("actor_2", "country_from2", "info3", "info4"), sep=" ", remove = FALSE)
Warning: Expected 4 pieces. Additional pieces discarded in 1 rows [1981].
Warning: Expected 4 pieces. Missing pieces filled with `NA` in 311 rows [1, 2,
4, 14, 15, 24, 28, 29, 30, 37, 48, 55, 67, 68, 69, 70, 159, 180, 191, 192, ...].
protest_data <- select(protest_data, !contains("info3"))
protest_data <- protest_data %>%
  unite("actor_2", actor_2, info4, remove = TRUE)
protest_data <- protest_data %>%
  mutate(actor_2 = str_remove(actor_2, "_NA"))
protest_data <- select(protest_data, !contains("info4"))
protest_data <- protest_data %>%
  mutate(country_from2=str_replace_all(country_from2, "\\*|\\(|\\)", "")) %>%
  mutate(actor_2=str_replace_all(actor_2, c("_" = " ")))
protest_data <- rename(protest_data, "actor2_combined" = "actor2")
protest_data <- na_if(protest_data, 'NA')

# Unify countries names
protest_data <- protest_data %>%
  mutate(actor1_combined = str_replace_all(actor1_combined, c (" Czechia" = " the Czech Republic", "Czechia" = "Czech Republic", " UK" = " the United Kingdom", "UK" = "United Kingdom"))) %>%
  mutate(country_from = str_replace_all(country_from, c (" Czechia" = " the Czech Republic", "Czechia" = "Czech Republic", " UK" = " the United Kingdom", "UK" = "United Kingdom"))) %>%
  mutate(actor2_combined = str_replace_all(actor2_combined, c (" Czechia" = " the Czech Republic", "Czechia" = "Czech Republic", " UK" = " the United Kingdom", "UK" = "United Kingdom"))) %>%
  mutate(country_from2 = str_replace_all(country_from2, c (" Czechia" = " the Czech Republic", "Czechia" = "Czech Republic", " UK" = " the United Kingdom", "UK" = "United Kingdom")))

Clean the data: continuing

Columns “assoc_actor_1” and “assoc_actor_2” contain information about organizations, movements or institutions which took part in the protest and countries they are from. But one cell may contain several organisation and several countries. So first I am going to separate each organization.

Another problem is that unlike “actor1” and “actor2”, “assoc_actor_1” and “assoc_actor_2” have many names of specific organizations. Sometimes name include countries sometimes country is written in brackets. So to me it seems impossible to separate information about organization and country because we need to check a lot of strings and write down the countries by hand. So for now I think one who will work with such dataset need to work with a specific names. I decided to leave them as well as “actor1_combined” and “actor2_combined” but I am not sure if it is right thing to do.

# Turn variables with information about assoc_actor_1 into several variables
## Separate organizations
protest_data <- protest_data %>%
  separate(col=assoc_actor_1, into=c("assoc_actor1_A", "assoc_actor1_B", "assoc_actor1_C", "assoc_actor1_D", "assoc_actor1_E", "assoc_actor1_F", "assoc_actor1_G", "assoc_actor1_H", "assoc_actor1_I", "assoc_actor1_J"), sep=";", remove = TRUE)
Warning: Expected 10 pieces. Missing pieces filled with `NA` in 2497 rows [1, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 19, 20, 21, 22, 24, ...].
# Turn variables with information about assoc_actor_2 into several variables
## Separate organizations
protest_data <- protest_data %>%
  separate(col=assoc_actor_2, into=c("assoc_actor2_A", "assoc_actor2_B", "assoc_actor2_C", "assoc_actor2_D", "assoc_actor2_E"), sep=";", remove = TRUE)
Warning: Expected 5 pieces. Missing pieces filled with `NA` in 147 rows [1,
4, 14, 15, 24, 28, 29, 30, 48, 55, 67, 68, 69, 70, 159, 180, 191, 192, 199,
203, ...].
# Relocate columns
protest_data <- protest_data %>%
  relocate(actor1_type, .after = sub_event_type) %>%
  relocate(actor2_type, .after = assoc_actor1_J) %>%
  relocate(interact_type, .after = assoc_actor2_E)

Provide a narrative about the data set

The dataset gives information about protests and acts of political violence in 9 countries of Eastern and Central Europe in 2021. The countries included in the dataframe are Bulgaria, Czech Republic, Estonia, Hungary, Latvia, Lithuania, Poland, Romania, Slovakia. The cases are events. So we have all information about the protest event. When and where it happened, who took part in it, was it peaceful or violet, how many fatalities it had and from which source we know about it.

  1. What happened? Variables “event_type”, “sub_event_type” and “fatalities” gives information about how we categorize political event. “event_type”, “sub_event_type” are categorical variables which mean type of the event and basically refer to different levels and types of radicalization during the protest.
ggplot(protest_data, aes(event_type)) + geom_bar()

table(protest_data$fatalities)

   0    1    2    5 
3728    4    1    1 
  1. When the event happened? Variables “day” and “month” contains information about event date.

  2. Where it happened? There are several variables which help to understand where protest took place. They all are categorical. First, we know “country”, second, administrative division (“admin1”),third, city or village (“location”). Columns “longitude” and “latitude” may be used to visualize a map of events.

ggplot(protest_data, aes(country)) + geom_bar()

  1. Who participated?
  • Variables “actor1_type” and “actor2_type” give information about type of group which took part in the protest. Actor one has 5 types, actor two 6. “Interact_type” contains 17 types of interaction which are different combinations of “actor1_type” and “actor2_type”. These variables are categorical.

  • Variables “actor_1” and “actor_2” contain more specific information about groups which participated. “Country_from1” and “country_from2” refer to countries these groups are from.

  • These values were taken from “actor1_combined” and “actor2_combined” which contain information about group and county in one column.

  • From “Assoc_actor1_A” to “Assoc_actor1_J” and from “Assoc_actor2_A” to “Assoc_actor2_E” we can see names of organizations or more specific information about groups.

table(protest_data$actor1_type)

External/Other Forces     Political Militas            Protesters 
                    7                    20                  3525 
              Rioters          State Forces 
                  128                    54 
  1. Also we have source of data in “source” which indicates name of the paper or web source. “source_scale” refers to type of source.

Identify potential research questions that your dataset can help answer

Data may be used to see how different countries react to different types of violence at political protests. Also we can do some descriptive statistics and look which type of protest in which countries happened more often in 2021.

I think it might be also interesting to look if news about more violent protests went to national or international scale of papers and magazines. And if news for example about peaceful protest are as important for national scale media as violent.