Analyzing the Enron Emails dataset from the igraphdata package
“Enron Emails.R” is a file in the course repository that builds a network of emails between Enron employees from the igraphdata package. According to the import script, this is a large, unweighted, directed network with employees as nodes and emails as edges.
The import script also indicates that there are no node attributes. I found that there were, in fact, node attributes in the igraph dataset: what seem to be job titles in ‘Note’ and email addresses without the domain name in ‘Email’. That was not relevant for this assignment.
Additionally, the import script indicates that topic and time information is stored as edge attributes. This is correct. Another thing I learned about the dataset while working on this assignment is that the LDC details data frame serves as a codebook for the topic codes in the edge list, which is useful for future reference.
The import script has created three objects that represent the network: network_edgelist (a data frame of an edge list and edge attributes), network_igraph (an igraph object), and network_statnet (a network object compatible with statnet packages like sna & ergm).
With that contextual introduction, I’ll go back to the start, and execute the import script. I also look at the R Documentation to view the detailed information on this data set via: enron {igraphdata}
I load the libraries for statnet, igraph, and igraphdata.
Next, I read the data into the environment. This imports the data as an igraph object (network_igraph).
Then, I create the edgelist
network_edgelist <- as.data.frame(as_edgelist(network_igraph))
and add edge attributes to the edge list
This collects details about the attribute “LDC Details” into a data frame
LDC_details <- data.frame(LDC_topic_name = network_igraph$LDC_names, LDC_topic_desc = network_igraph$LDC_desc, LDC_topic = 1:32)
The data frame can then be added as details to the edge list
network_edgelist <- merge(network_edgelist, LDC_details, by = 'LDC_topic', all.x = TRUE)
and then re-ordered within the edge list
network_edgelist <- network_edgelist[c(2:5,1,6,7)]
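The merge-and-reorder step can be illustrated on a toy example. This is only a sketch with made-up data: the column names from and to, the three-topic codebook, and the topic labels are all hypothetical. It shows that with all.x = TRUE an edge whose topic code is missing from the codebook (such as -1) is kept with NA details, and that merge() sorts rows by the key by default, so the edge order can change.

```r
# Hypothetical toy edge list with a topic code per edge (-1 = no topic).
edges <- data.frame(
  from      = c(1L, 2L, 3L),
  to        = c(2L, 3L, 1L),
  LDC_topic = c(1L, -1L, 3L)
)

# Hypothetical codebook: one row per known topic code.
codebook <- data.frame(
  LDC_topic      = 1:3,
  LDC_topic_name = c("topic_a", "topic_b", "topic_c")
)

# Left join: keep every edge, attach topic details where the code is known.
merged <- merge(edges, codebook, by = "LDC_topic", all.x = TRUE)

# merge() puts the key column first, so move the endpoints back to the front.
merged <- merged[c("from", "to", "LDC_topic", "LDC_topic_name")]
print(merged)
```

Selecting columns by name rather than by position, as done here, is one way to make the reordering robust if the number of attribute columns changes.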
Now I can create a statnet network object from our edge list
network_statnet <- network(as.matrix(network_edgelist[1:2]), matrix.type = "edgelist", directed = TRUE)
and add attributes to the statnet network object
network_statnet%e%'Time' <- as.character(network_edgelist$Time)
network_statnet%e%'Reciptype' <- as.character(network_edgelist$Reciptype)
network_statnet%e%'Topic' <- as.character(network_edgelist$Topic)
network_statnet%e%'LDC_topic' <- as.character(network_edgelist$LDC_topic)
network_statnet%e%'LDC_topic_name' <- as.character(network_edgelist$LDC_topic_name)
network_statnet%e%'LDC_topic_desc' <- as.character(network_edgelist$LDC_topic_desc)
Finally, I can clean up and remove any unnecessary objects if I no longer need the details as a reference, as in this assignment.
rm(LDC_details)
Now, I’ll take a first look at the network
plot(network_statnet)
That’s interesting, but doesn’t tell me much about the network yet except that I may expect to see 2 isolates.
Using tools to inspect the network data and confirm the objects created through the import script are present
ls()
[1] "network_edgelist" "network_igraph" "network_statnet"
I’ll inspect vertices and edges using commands in both igraph and statnet
vcount(network_igraph)
[1] 184
ecount(network_igraph)
[1] 125409
print(network_statnet)
Network attributes:
vertices = 184
directed = TRUE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 3010
missing edges= 0
non-missing edges= 3010
Vertex attribute names:
vertex.names
Edge attribute names not shown
There is quite a difference between the number of edges in the igraph network (125,409) and the statnet network (3,010), which suggests that something is a bit off in the way the data was converted between the two network packages.
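One likely explanation, which I have not verified against the import script, is that the igraph object stores every email as its own edge, while the statnet object was built with multiple = FALSE (as the printout shows), so repeated sender-recipient pairs end up as a single edge. The toy sketch below uses made-up data and base R only to show how the two counts diverge:

```r
# Hypothetical toy data: each row is one email (sender, recipient).
emails <- data.frame(
  from = c(1, 1, 1, 2, 3, 3),
  to   = c(2, 2, 3, 1, 1, 1)
)

# Multigraph view: every email is an edge.
n_emails <- nrow(emails)

# Simple-graph view: each ordered sender-recipient pair counts once.
n_pairs <- nrow(unique(emails))

cat("emails:", n_emails, " unique pairs:", n_pairs, "\n")
```

If this explanation holds, running nrow(unique(network_edgelist[1:2])) on the real edge list should come out at 3010. I have not run that check.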
Looking at more comparisons in the two network files, I can look at the network features.
is_bipartite(network_igraph)
[1] FALSE
is_directed(network_igraph)
[1] TRUE
is_weighted(network_igraph)
[1] FALSE
vertex_attr_names(network_igraph)
[1] "Email" "Name" "Note"
edge_attr_names(network_igraph)
[1] "Time" "Reciptype" "Topic" "LDC_topic"
Looking at the same features of the statnet network with the appropriate commands
print(network_statnet)
Network attributes:
vertices = 184
directed = TRUE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 3010
missing edges= 0
non-missing edges= 3010
Vertex attribute names:
vertex.names
Edge attribute names not shown
network::list.vertex.attributes(network_statnet)
[1] "na" "vertex.names"
network::list.edge.attributes(network_statnet)
[1] "LDC_topic" "LDC_topic_desc" "LDC_topic_name"
[4] "na" "Reciptype" "Time"
[7] "Topic"
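I want to look at specific attribute data. First using igraph, here are the first six values of each vertex attribute (Email, Name, Note) followed by each edge attribute (Time, Reciptype, Topic, LDC_topic):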
[1] "albert.meyers" "a..martin" "andrea.ring" "andrew.lewis"
[5] "andy.zipper" "a..shankman"
[1] "Albert Meyers" "Thomas Martin" "Andrea Ring"
[4] "Andrew Lewis" "Andy Zipper" "Jeffrey Shankman"
[1] "Employee, Specialist" "Vice President"
[3] "NA" "Director"
[5] "Vice President, Enron Online" "President, Enron Global Mkts"
[1] "1979-12-31 21:00:00" "1979-12-31 21:00:00" "1979-12-31 21:00:00"
[4] "1979-12-31 21:00:00" "1979-12-31 21:00:00" "1979-12-31 21:00:00"
[1] "to" "to" "cc" "cc" "bcc" "bcc"
[1] 1 1 3 3 3 3
[1] 0 -1 -1 -1 -1 -1
Next, using statnet
head(network_statnet %v% "na")
[1] FALSE FALSE FALSE FALSE FALSE FALSE
network_statnet %v% "vertex.names"
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
[17] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
[49] 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
[81] 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
[113] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
[129] 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
[161] 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
[177] 177 178 179 180 181 182 183 184
head(network_statnet %e% "LDC_topic")
[1] "-1" "-1" "-1" "-1" "-1" "-1"
head(network_statnet %e% "LDC_topic_desc")
[1] NA NA NA NA NA NA
head(network_statnet %e% "LDC_topic_name")
[1] NA NA NA NA NA NA
head(network_statnet %e% "na")
[1] FALSE FALSE FALSE FALSE FALSE FALSE
head(network_statnet %e% "Reciptype")
[1] "to" "cc" "cc" "bcc" "bcc" "to"
head(network_statnet %e% "Time")
[1] "1979-12-31 21:00:00" "1979-12-31 21:00:00" "1979-12-31 21:00:00"
[4] "1979-12-31 21:00:00" "1979-12-31 21:00:00" "1979-12-31 21:00:00"
head(network_statnet %e% "Topic")
[1] "1" "3" "3" "3" "3" "3"
Clearly, there are differences in how the vertices are represented in igraph vs. statnet. For example, the employee names and email handles are node attributes in igraph, but in statnet the vertices are identified only by numbers.
Next, I want to look at the dyad census in igraph
igraph::dyad.census(network_igraph)
$mut
[1] 30600
$asym
[1] 64208
$null
[1] -77972
and in statnet
sna::dyad.census(network_statnet)
Mut Asym Null
[1,] 913 1184 14739
The dyad census gives vastly different results in the two packages, including an impossible negative count of null dyads in igraph, and I am not yet sure how or why they differ so much.
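A plausible reason, offered as an untested assumption, is that igraph's census is computed on the multigraph, so repeated emails inflate the mutual and asymmetric counts and push the null count (total dyads minus the other two) below zero. Both outputs do sum to the 16,836 possible dyads among 184 nodes. The base R sketch below, on made-up data, computes a dyad census by hand after collapsing repeated emails into unique sender-recipient pairs (it assumes there are no self-loops):

```r
# Hypothetical toy data: 4 people, some repeated emails.
emails <- data.frame(
  from = c(1, 1, 1, 2, 3),
  to   = c(2, 2, 3, 1, 4)
)
n_nodes <- 4

# Collapse repeated emails into unique ordered pairs.
pairs <- unique(emails)

# A pair is reciprocated if its reverse also appears.
forward  <- paste(pairs$from, pairs$to)
backward <- paste(pairs$to, pairs$from)
reciprocated <- forward %in% backward

mutual     <- sum(reciprocated) / 2   # each mutual dyad appears twice
asymmetric <- sum(!reciprocated)
null_dyads <- choose(n_nodes, 2) - mutual - asymmetric

# Sanity check on the totals reported above for the Enron network.
enron_dyads  <- choose(184, 2)
igraph_sum   <- 30600 + 64208 - 77972
statnet_sum  <- 913 + 1184 + 14739
```

In the toy data this gives 1 mutual, 2 asymmetric, and 3 null dyads. For the reported Enron output, the statnet census is also consistent with the edge count: 2 * 913 + 1184 = 3010.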
Next I’ll look at the triad census in igraph
igraph::triad.census(network_igraph)
[1] 700234 19530 249694 8409 2695 5176 7060 13227 1180
[10] 59 6781 1023 1137 786 2782 1611
and in statnet
sna::triad.census(network_statnet)
003 012 102 021D 021U 021C 111D 111U 030T 030C 201
[1,] 700234 150250 118974 8409 2695 5176 7060 13227 1180 59 6781
120D 120U 120C 210 300
[1,] 1023 1137 786 2782 1611
Using the igraph data, the Enron network has 184 vertices, so to check that the triad census is working correctly, I compare the number of possible triads with the sum of the census counts:
#possible triads in network
184*183*182/6
[1] 1021384
sum(igraph::triad.census(network_igraph))
[1] 1021384
Similarly, the statnet network has 184 vertices, so I run the same comparison on the statnet census:
#possible triads in network
184*183*182/6
[1] 1021384
sum(sna::triad.census(network_statnet))
[1] 1021384
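Now I’m getting somewhere! I don’t yet know exactly how the triad census informs my interpretation, but both totals match the number of possible triads, so every triad is being counted. The two packages do still disagree on how triads are split between the second and third classes (012 and 102).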
Looking next at the global transitivity in statnet:
gtrans(network_statnet)
[1] 0.3580924
Looking next at the network transitivity in igraph:
transitivity(network_igraph)
[1] 0.3725138
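They are not the same, but the difference is not unreasonable given that the two packages compute the measure differently: as far as I can tell, igraph's transitivity() ignores edge direction, while sna's gtrans() treats the network as directed by default.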
Next, I tried to look at the ego (local) transitivity for the employees whose names appeared at the top of the igraph node information, but I could not get the command to run. Looking at it again, my attempt had mismatched quotation marks, and the names are stored in the Name vertex attribute rather than in igraph's special name attribute, so they cannot be used directly as vertex IDs. A corrected version, which I have not yet run and will need more time to explore, would select the vertices by attribute:
#transitivity(network_igraph, type = "local", vids = V(network_igraph)[Name %in% c("Albert Meyers", "Thomas Martin", "Andrea Ring", "Andrew Lewis", "Andy Zipper", "Jeffrey Shankman")])
However, I can look at global vs. average local transitivity
transitivity(network_igraph, type="global")
[1] 0.3725138
transitivity(network_igraph, type="average")
[1] 0.5055302
This tells me that the average local transitivity (0.51) is noticeably higher than the global transitivity (0.37). From my still naive network knowledge, this suggests that the network as a whole is fairly loosely knit, while many individual employees sit in tightly connected neighborhoods, pointing to a more connected sub-network.
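To make the two measures concrete, here is a minimal base R sketch on a made-up five-node undirected graph (not the Enron data). Global transitivity is the share of connected triples that are closed, pooled over the whole graph. Average local transitivity is the mean of each node's own closed share, skipping nodes with fewer than two neighbors, which I believe matches igraph's default handling.

```r
# Hypothetical toy graph: a triangle (1-2-3) plus two pendant nodes on node 3.
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(3, 5))
n <- 5
adj <- matrix(0, n, n)
adj[edges] <- 1
adj[edges[, 2:1]] <- 1   # add the reverse direction to make it undirected

degree <- rowSums(adj)

# diag(A^3) counts closed 3-step walks; halve it to get triangles per node.
triangles_at_node <- diag(adj %*% adj %*% adj) / 2

# Number of neighbor pairs (connected triples centered on each node).
triples_at_node <- degree * (degree - 1) / 2

global_transitivity <- sum(triangles_at_node) / sum(triples_at_node)

# NaN for nodes with fewer than two neighbors; dropped from the average.
local_transitivity <- triangles_at_node / triples_at_node
average_local <- mean(local_transitivity, na.rm = TRUE)

cat("global:", global_transitivity, " average local:", average_local, "\n")
```

In this toy graph the hub (node 3) contributes most of the open triples, which drags the global value down, while the two low-degree triangle members each have a local value of 1 and pull the average up. The same kind of imbalance could be behind the gap in the Enron network.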
Looking at the geodesic distance:
average.path.length(network_igraph,directed=T)
[1] 2.390464
This tells me that, on average, the shortest path between two connected employees is about 2.4 steps long.
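To show what this number averages over, here is a small base R sketch on a made-up directed path graph (not the Enron data). It computes all shortest-path lengths with the Floyd-Warshall algorithm and averages only over ordered pairs that are actually connected, which, as far as I understand, is what igraph does by default.

```r
# Hypothetical toy graph: a directed path 1 -> 2 -> 3 -> 4.
n <- 4
adj <- matrix(0, n, n)
adj[cbind(1:3, 2:4)] <- 1

# Start with direct edges at distance 1, everything else unreachable.
dist <- matrix(Inf, n, n)
dist[adj == 1] <- 1
diag(dist) <- 0

# Floyd-Warshall: allow each node k in turn as an intermediate stop.
for (k in seq_len(n)) {
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      dist[i, j] <- min(dist[i, j], dist[i, k] + dist[k, j])
    }
  }
}

# Average over ordered pairs of distinct nodes with a finite path.
connected <- is.finite(dist) & dist > 0
average_path_length <- mean(dist[connected])
print(average_path_length)
```

Because the graph is directed, node 4 cannot reach anyone, so only six of the twelve ordered pairs enter the average.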
Getting to look at the components of the network in igraph:
names(igraph::components(network_igraph))
[1] "membership" "csize" "no"
igraph::components(network_igraph)$no
[1] 3
igraph::components(network_igraph)$csize
[1] 182 1 1
This shows that there are 3 components in the network: 182 of the 184 nodes make up the giant component, and the remaining 2 nodes are isolates.
Finally, I get my answer on isolates.
isolates(network_statnet)
[1] 72 118
Since I know that the nodes are Enron employees and they are assigned numbers in the statnet network, running the isolate command tells me that employee #72 and #118 are indeed the 2 isolates viewed in the initial graphic representation of the network.
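As a cross-check on what counts as an isolate, the idea can be sketched in base R on made-up data: a node is an isolate when it neither sends nor receives any edge, so its row and column in the adjacency matrix are both empty. The edge list and node count below are hypothetical.

```r
# Hypothetical toy network: 5 people, only 1, 2 and 3 exchange emails.
edges <- data.frame(from = c(1, 2, 2), to = c(2, 1, 3))
n <- 5

adj <- matrix(0, n, n)
adj[cbind(edges$from, edges$to)] <- 1

# Out-degree (row sums) plus in-degree (column sums).
total_degree <- rowSums(adj) + colSums(adj)

# Isolates are the nodes with no ties in either direction.
isolated_nodes <- which(total_degree == 0)
print(isolated_nodes)
```

Applied to the Enron network, the same logic should point to the two single-node components that igraph reported, although I have only confirmed this through the isolates() output above.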
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Becvar (2022, Feb. 17). Data Analytics and Computational Social Science: Week 2 Assignment. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httprpubscomkbecenron/
BibTeX citation
@misc{becvar2022week, author = {Becvar, Kristina}, title = {Data Analytics and Computational Social Science: Week 2 Assignment}, url = {https://github.com/DACSS/dacss_course_website/posts/httprpubscomkbecenron/}, year = {2022} }