A new article created using the Distill format.
Load a built-in dataset from the igraphdata package.
#Load the data.
library(igraph)
library(igraphdata)
data("enron",package="igraphdata")
ls()
[1] "enron"
This is the Enron email dataset, which was made public by the U.S. Department of Justice. Let’s look at some of its basic descriptive statistics.
vcount(enron)
[1] 184
ecount(enron)
[1] 125409
[1] 681.5707
[1] 3.72443
#Find network features:
is_bipartite(enron)
[1] FALSE
is_directed(enron)
[1] TRUE
is_weighted(enron)
[1] FALSE
#display vertex attributes
vertex_attr_names(enron)
[1] "Email" "Name" "Note"
#display edge attributes
edge_attr_names(enron)
[1] "Time" "Reciptype" "Topic" "LDC_topic"
[1] "albert.meyers" "a..martin" "andrea.ring" "andrew.lewis"
[5] "andy.zipper" "a..shankman"
[1] "Albert Meyers" "Thomas Martin" "Andrea Ring"
[4] "Andrew Lewis" "Andy Zipper" "Jeffrey Shankman"
[1] "Employee, Specialist" "Vice President"
[3] "NA" "Director"
[5] "Vice President, Enron Online" "President, Enron Global Mkts"
[1] "1979-12-31 21:00:00" "1979-12-31 21:00:00" "1979-12-31 21:00:00"
[4] "1979-12-31 21:00:00" "1979-12-31 21:00:00" "1979-12-31 21:00:00"
[1] "to" "to" "cc" "cc" "bcc" "bcc"
[1] 1 1 3 3 3 3
[1] 0 -1 -1 -1 -1 -1
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 1.000 1.711 3.000 3.000
Length Class Mode
0 NULL NULL
This network has 184 nodes and 125409 edges. It is a non-bipartite, directed, and unweighted network. Each node is an employee of Enron. Each edge represents an email sent from one person to another. (Note that the network is directed, so an edge runs “from…to…” rather than “between”.) On average, each employee sent 681.57 messages, or 3.72 messages to each potential recipient (colleague).
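The two averages quoted above follow from the node and edge counts. The commands that produced them are not shown, so the sketch below is only one way to reproduce the numbers, assuming the second figure is the edge count divided by the number of ordered pairs of distinct employees. The variable names are illustrative.

```r
# Counts taken from the output above.
n_nodes <- 184
n_edges <- 125409

# Average number of edges (sent messages) per employee.
avg_sent <- n_edges / n_nodes

# Average number of edges per ordered pair of distinct employees,
# i.e. per potential sender-recipient combination.
avg_per_pair <- n_edges / (n_nodes * (n_nodes - 1))

round(avg_sent, 2)
round(avg_per_pair, 2)
```

The second figure can exceed 1 only because the same pair of employees can be linked by many emails.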
Every node has three attributes: Email, Name, and Note. “Email” is the person’s email address (omitting the domain name). “Name” is the person’s real name. “Note” mainly records position and department.
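Attribute previews like the ones printed earlier can be produced with head() on a vertex or edge attribute. The commands used for this post are not shown, so this is a minimal sketch on a made-up three-person graph. The people, addresses, and the data frame names `people` and `emails` are hypothetical.

```r
library(igraph)

# Hypothetical vertex table: the first column becomes the vertex name.
people <- data.frame(
  name  = c("a", "b", "c"),
  Email = c("ann.lee", "bob.ray", "cy.fox"),
  Note  = c("Director", "NA", "Vice President"),
  stringsAsFactors = FALSE
)

# Hypothetical edge table: one row per email, with a recipient type.
emails <- data.frame(
  from      = c("a", "a", "b"),
  to        = c("b", "c", "c"),
  Reciptype = c("to", "cc", "bcc"),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(emails, directed = TRUE, vertices = people)

vertex_attr_names(g)   # names of the vertex attributes
head(V(g)$Email)       # first few values of a vertex attribute
head(E(g)$Reciptype)   # first few values of an edge attribute
```

V() and E() select vertices and edges, and `$` extracts one attribute as a plain vector, so head() and summary() work on it as usual.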
Each edge has four attributes: Time, Reciptype, Topic, and LDC_topic. “Time” is a string recording when the email was sent. For further analysis, the string needs to be converted to a date-time or numeric value (which we haven’t learned yet). “Reciptype” records how the recipient received the email: ‘to’, ‘cc’, or ‘bcc’.
(I am not sure what Topic and LDC_topic represent. They may be categories assigned by the publisher to describe the topic of each email.)
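Converting the Time strings into something numeric can be done with base R date-time parsing. This is a minimal sketch, assuming the strings follow the "YYYY-MM-DD HH:MM:SS" layout seen in the output and that UTC is an acceptable time zone (the real time zone of the data is not stated here). The second timestamp is made up for illustration.

```r
# First value copied from the output above; second value is hypothetical.
time_strings <- c("1979-12-31 21:00:00", "2001-05-14 09:30:00")

# Parse the text into date-time objects (time zone is an assumption).
parsed <- as.POSIXct(time_strings, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Numeric form: seconds since 1970-01-01 00:00:00 UTC.
seconds <- as.numeric(parsed)

# Date-times can also be compared or subtracted directly.
days_apart <- as.numeric(difftime(parsed[2], parsed[1], units = "days"))
```

With the real data, the same call could be applied to the whole attribute vector, for example the values returned by E(enron)$Time.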
#Classify all dyads in the network:
igraph::dyad_census(enron)
$mut
[1] 30600
$asym
[1] 64208
$null
[1] -77972
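There are 30600 mutual dyads and 64208 asymmetric dyads. The null count of -77972 is not a real count. The three values sum to 16836, which is the number of node pairs (184 × 183 / 2), so the null count is simply what remains after subtracting the other two. Because the network contains many parallel edges (repeated emails between the same pair), the mutual and asymmetric counts exceed the number of pairs and the remainder becomes negative. Collapsing parallel edges before taking the census, for example with simplify(), should give interpretable counts.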
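A dyad census is easiest to read on a graph with at most one edge per direction between any two nodes. The sketch below uses a made-up three-node graph with one repeated edge to show how parallel edges can be collapsed first. The toy edges are hypothetical and this is not the exact procedure used in this post.

```r
library(igraph)

# Toy directed graph: 1->2 appears twice, plus 2->1 and 1->3.
g <- make_graph(c(1, 2,  1, 2,  2, 1,  1, 3), directed = TRUE)

# TRUE if at least one edge is repeated.
any_multiple(g)

# Collapse repeated edges (and drop self-loops, if any).
g_simple <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)

# Census on the simple graph: pair (1,2) is mutual,
# pair (1,3) is asymmetric, pair (2,3) is null.
dc <- dyad_census(g_simple)
dc$mut
dc$asym
dc$null
```

On a simple graph the three counts are non-negative and add up to the number of node pairs, here choose(3, 2) = 3.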
#Classify all triads in the network:
igraph::triad_census(enron)
[1] 700234 19530 249694 8409 2695 5176 7060 13227 1180
[10] 59 6781 1023 1137 786 2782 1611
sum(igraph::triad_census(enron))
[1] 1021384
(700234 / 1021384)
[1] 0.6855737
((19530 + 249694) / 1021384)
[1] 0.2635874
#get global clustering coefficient(i.e. network transitivity):
transitivity(enron, type="global")
[1] 0.3725138
##get average local clustering coefficient:
transitivity(enron, type="average")
[1] 0.5055302
#find average shortest path for network
mean_distance(enron,directed=FALSE)
[1] 2.085787
The output above shows the census of the 16 triad types. Almost 69% of triads are empty, and 26% contain only one connected pair (a single asymmetric or mutual tie). This implies a relatively loosely connected network. To learn more about connectivity, let’s look at transitivity. The global clustering coefficient is 0.3725, which means that about 37% of connected triplets are closed, which is relatively loose. The average local clustering coefficient is 0.5055. This means that, on average, if two nodes are connected to the same node (two people each exchange email with the same person), they have about a 51% chance of being connected to each other (communicating with each other by email).
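The two triad percentages can be rechecked from the census counts alone. This is a small sketch that copies the counts from the output and labels them with the triad type codes in the order given in igraph's documentation. The variable names are illustrative.

```r
# Triad census counts copied from the output above.
census <- c(700234, 19530, 249694, 8409, 2695, 5176, 7060, 13227,
            1180, 59, 6781, 1023, 1137, 786, 2782, 1611)
names(census) <- c("003", "012", "102", "021D", "021U", "021C",
                   "111D", "111U", "030T", "030C", "201",
                   "120D", "120U", "120C", "210", "300")

# Every set of three nodes has exactly one type,
# so the total should equal choose(184, 3).
total <- sum(census)
isTRUE(all.equal(total, choose(184, 3)))

# Share of empty triads, and of triads with a single connected pair
# (one asymmetric tie, "012", or one mutual tie, "102").
share_empty   <- census[["003"]] / total
share_one_tie <- (census[["012"]] + census[["102"]]) / total
round(100 * c(empty = share_empty, one_pair = share_one_tie), 1)
```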
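The average local clustering coefficient is much larger than the global one. The global coefficient is dominated by nodes with many connections, while the average local coefficient gives every node equal weight. The gap therefore suggests that a few highly connected employees act as hubs whose many contacts are mostly not connected to one another, while employees with few contacts sit in tighter groups. In other words, the alters of employees with few email contacts are more likely to be connected to each other than the alters of employees with many contacts. (I am not sure whether I understand this correctly. The Tuesday lecture recording link is wrong, so I couldn’t watch it.)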
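The difference between the two clustering measures can be seen on a small made-up graph: one hub that belongs to three separate triangles. This is only an illustrative sketch. The toy graph is hypothetical and is not meant to resemble the Enron network.

```r
library(igraph)

# Vertex 1 is a hub that closes three separate triangles:
# {1,2,3}, {1,4,5} and {1,6,7}.
g <- make_graph(c(1, 2,  1, 3,  2, 3,
                  1, 4,  1, 5,  4, 5,
                  1, 6,  1, 7,  6, 7), directed = FALSE)

# Local coefficient per vertex: the hub has 15 pairs of neighbours
# but only 3 of them are connected; every other vertex has 1 of 1.
local_cc <- transitivity(g, type = "local")

# Global: closed triplets over all connected triplets
# (the hub contributes most of the triplets).
global_cc <- transitivity(g, type = "global")

# Average local: plain mean of the per-vertex values.
avg_local_cc <- transitivity(g, type = "average")

c(global = global_cc, average_local = avg_local_cc)
```

Here the hub's low local value pulls the global coefficient down far more than it pulls down the average, which is the same direction of gap as in the output above.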
The average geodesic distance is about 2.09 (ignoring edge direction). This means that, on average, a person needs only about one intermediary to reach a randomly chosen person in the network by email.
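To make the link between average distance and intermediaries concrete, here is a sketch on a made-up star graph, where every leaf reaches every other leaf through the centre. The graph is hypothetical and chosen only because the distances are easy to check by hand.

```r
library(igraph)

# Star with one centre and four leaves, undirected.
g <- make_star(5, mode = "undirected")

# All pairwise shortest-path lengths.
d <- distances(g)

# 4 centre-leaf pairs at distance 1, 6 leaf-leaf pairs at distance 2.
manual_mean <- mean(d[upper.tri(d)])

# igraph's built-in average over all connected pairs.
builtin_mean <- mean_distance(g, directed = FALSE)

c(manual = manual_mean, builtin = builtin_mean)
```

A distance of 2 between two nodes means exactly one intermediary, so an average close to 2 indicates that most pairs are either directly connected or one step removed.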
#Number of components
igraph::components(enron)$no
[1] 3
#Size of each component
igraph::components(enron)$csize
[1] 182 1 1
There are only 3 components in the network. 182 of the nodes are in the major component, and the other two nodes are isolates.
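The way the component output is read can be sketched on a made-up graph with two isolates. The toy edges are hypothetical. For a directed graph, components() uses weak connectivity by default, which is what this sketch relies on.

```r
library(igraph)

# Five vertices; only 1, 2 and 3 are linked, so 4 and 5 are isolates.
g <- make_graph(c(1, 2,  2, 3), n = 5, directed = TRUE)

comp <- components(g)   # weakly connected components by default

comp$no                 # number of components
comp$csize              # size of each component
comp$membership         # component id for every vertex

# Isolates are the vertices with no edges at all.
isolates <- which(degree(g) == 0)
isolates
```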
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Li (2022, Feb. 17). Data Analytics and Computational Social Science: Short Assignment 2. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomyli210813861702/
BibTeX citation
@misc{li2022short,
  author = {Li, Yifan},
  title = {Data Analytics and Computational Social Science: Short Assignment 2},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomyli210813861702/},
  year = {2022}
}