Week 2 Assignment

Analyzing the Enron Emails dataset from the igraphdata package

Kristina Becvar
2022-02-04

“Enron Emails.R” is a script in the course repository that imports a network of emails between Enron employees from the igraphdata package. According to the import script, this is a large, unweighted, directed network with employees as nodes and emails as edges.

The import script also indicates that there are no node attributes. I found that there are, in fact, node attributes in the igraph dataset: ‘Name’, what seem to be job titles in ‘Note’, and email addresses without the domain name in ‘Email’. These were not relevant for this assignment, however.

Additionally, the import script indicates that topic and time information is stored as edge attributes. This is correct. Another thing I learned about the dataset while working on this assignment is that the LDC details data frame serves as a codebook for the topic codes in the edge list, which is useful for future reference.

The import script has created three objects that represent the network: network_edgelist (a data frame of an edge list and edge attributes), network_igraph (an igraph object), and network_statnet (a network object compatible with statnet packages like sna & ergm).

With that context, I’ll go back to the start and execute the import script. I also consult the R documentation for detailed information on this data set via: enron {igraphdata}

I load the libraries for statnet, igraph, and igraphdata

Next, I read the data into the environment. This loads the data as an igraph object

data("enron", package = "igraphdata")
network_igraph <- enron
rm(enron)

Then, I create the edgelist

network_edgelist <- as.data.frame(as_edgelist(network_igraph))

and add edge attributes to the edge list

network_edgelist <- cbind(network_edgelist,
                          Time      = E(network_igraph)$Time,
                          Reciptype = E(network_igraph)$Reciptype,
                          Topic     = E(network_igraph)$Topic,
                          LDC_topic = E(network_igraph)$LDC_topic)

This collects the LDC topic names and descriptions, which are stored as graph attributes, into a data frame

LDC_details <- data.frame(LDC_topic_name = network_igraph$LDC_names, LDC_topic_desc = network_igraph$LDC_desc, LDC_topic = 1:32)

The data frame can then be added as details to the edge list

network_edgelist <- merge(network_edgelist, LDC_details, by = 'LDC_topic', all.x = TRUE)

and then re-ordered within the edge list

network_edgelist <- network_edgelist[c(2:5,1,6,7)]
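One detail of the merge step is worth keeping in mind: by default, merge() sorts its result by the key column, so the rows of the edge list no longer follow the original edge order. A minimal base R sketch on a made-up three-row edge list (the data frames toy_edges and toy_details are hypothetical, not part of the import script) illustrates the effect:

```r
# Hypothetical toy data standing in for network_edgelist and LDC_details
toy_edges <- data.frame(
  V1        = c("a", "b", "c"),
  V2        = c("b", "c", "a"),
  LDC_topic = c(2, -1, 1)
)
toy_details <- data.frame(
  LDC_topic      = 1:2,
  LDC_topic_name = c("topic one", "topic two")
)

# Left join on the topic code; unmatched codes (here -1) get NA
toy_merged <- merge(toy_edges, toy_details, by = "LDC_topic", all.x = TRUE)

# Move the key column back behind the two endpoint columns
toy_merged <- toy_merged[c(2:3, 1, 4)]

print(toy_merged)
```

In the toy result the row with topic -1 comes first and has no topic name. If the same thing happens with the Enron data, attribute vectors taken from the merged edge list would not line up position by position with E(network_igraph).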

Now I can create a statnet network object from our edge list

network_statnet <- network(as.matrix(network_edgelist[1:2]), matrix.type = "edgelist", directed = TRUE)

and add attributes to the statnet network object

network_statnet%e%'Time' <- as.character(network_edgelist$Time)
network_statnet%e%'Reciptype' <- as.character(network_edgelist$Reciptype)
network_statnet%e%'Topic' <- as.character(network_edgelist$Topic)
network_statnet%e%'LDC_topic' <- as.character(network_edgelist$LDC_topic)
network_statnet%e%'LDC_topic_name' <- as.character(network_edgelist$LDC_topic_name)
network_statnet%e%'LDC_topic_desc' <- as.character(network_edgelist$LDC_topic_desc)

Finally, I can clean up and remove any unnecessary objects if I no longer need the details as a reference, as in this assignment.

rm(LDC_details)

Now, I’ll take a first look at the network

plot(network_statnet)

That’s interesting, but doesn’t tell me much about the network yet except that I may expect to see 2 isolates.

Next, I inspect the environment to confirm that the objects created by the import script are present

ls()
[1] "network_edgelist" "network_igraph"   "network_statnet" 

I’ll inspect vertices and edges using commands in both igraph and statnet

vcount(network_igraph)
[1] 184
ecount(network_igraph)
[1] 125409
print(network_statnet) 
 Network attributes:
  vertices = 184 
  directed = TRUE 
  hyper = FALSE 
  loops = FALSE 
  multiple = FALSE 
  bipartite = FALSE 
  total edges= 3010 
    missing edges= 0 
    non-missing edges= 3010 

 Vertex attribute names: 
    vertex.names 

 Edge attribute names not shown 

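There is quite a difference between the number of edges in the igraph network (125,409) and the statnet network (3,010), which leads me to believe there is something a bit off with the way the data was processed between the two packages. The statnet printout reports multiple = FALSE, so a likely explanation is that the igraph object keeps every email as a separate edge, while the network object keeps at most one edge per sender-recipient pair.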
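Whether repeated sender-recipient pairs explain a gap like this can be checked by counting the distinct pairs in an edge list. The sketch below uses a made-up edge list (toy_el is hypothetical) rather than the Enron data, so it only illustrates the check:

```r
# Hypothetical toy edge list: the pair 1 -> 2 appears three times
toy_el <- data.frame(from = c(1, 1, 1, 2, 3),
                     to   = c(2, 2, 2, 3, 1))

n_rows  <- nrow(toy_el)           # every row counts as an edge
n_pairs <- nrow(unique(toy_el))   # each ordered pair counts once

cat("rows:", n_rows, " distinct pairs:", n_pairs, "\n")
```

Applied to the first two columns of network_edgelist, the same comparison would show how many of the rows are repeats. Self-loops might also play a role, since the network object reports loops = FALSE.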

To compare the two network objects further, I look at their basic features, starting with igraph.

is_bipartite(network_igraph)
[1] FALSE
is_directed(network_igraph)
[1] TRUE
is_weighted(network_igraph)
[1] FALSE
vertex_attr_names(network_igraph)
[1] "Email" "Name"  "Note" 
edge_attr_names(network_igraph)
[1] "Time"      "Reciptype" "Topic"     "LDC_topic"

Looking at the same features of the statnet network with the appropriate commands

print(network_statnet)
 Network attributes:
  vertices = 184 
  directed = TRUE 
  hyper = FALSE 
  loops = FALSE 
  multiple = FALSE 
  bipartite = FALSE 
  total edges= 3010 
    missing edges= 0 
    non-missing edges= 3010 

 Vertex attribute names: 
    vertex.names 

 Edge attribute names not shown 
network::list.vertex.attributes(network_statnet)
[1] "na"           "vertex.names"
network::list.edge.attributes(network_statnet)
[1] "LDC_topic"      "LDC_topic_desc" "LDC_topic_name"
[4] "na"             "Reciptype"      "Time"          
[7] "Topic"         

I want to look at specific attribute data. First using igraph

head(V(network_igraph)$Email)
[1] "albert.meyers" "a..martin"     "andrea.ring"   "andrew.lewis" 
[5] "andy.zipper"   "a..shankman"  
head(V(network_igraph)$Name)
[1] "Albert Meyers"    "Thomas Martin"    "Andrea Ring"     
[4] "Andrew Lewis"     "Andy Zipper"      "Jeffrey Shankman"
head(V(network_igraph)$Note)
[1] "Employee, Specialist"         "Vice President"              
[3] "NA"                           "Director"                    
[5] "Vice President, Enron Online" "President, Enron Global Mkts"
head(E(network_igraph)$Time)
[1] "1979-12-31 21:00:00" "1979-12-31 21:00:00" "1979-12-31 21:00:00"
[4] "1979-12-31 21:00:00" "1979-12-31 21:00:00" "1979-12-31 21:00:00"
head(E(network_igraph)$Reciptype)
[1] "to"  "to"  "cc"  "cc"  "bcc" "bcc"
head(E(network_igraph)$Topic)
[1] 1 1 3 3 3 3
head(E(network_igraph)$LDC_topic)
[1]  0 -1 -1 -1 -1 -1

Next, using statnet

head(network_statnet %v% "na")
[1] FALSE FALSE FALSE FALSE FALSE FALSE
network_statnet %v% "vertex.names"
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
 [17]  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32
 [33]  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48
 [49]  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64
 [65]  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80
 [81]  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96
 [97]  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112
[113] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
[129] 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
[161] 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
[177] 177 178 179 180 181 182 183 184
head(network_statnet %e% "LDC_topic")
[1] "-1" "-1" "-1" "-1" "-1" "-1"
head(network_statnet %e% "LDC_topic_desc")
[1] NA NA NA NA NA NA
head(network_statnet %e% "LDC_topic_name")
[1] NA NA NA NA NA NA
head(network_statnet %e% "na")
[1] FALSE FALSE FALSE FALSE FALSE FALSE
head(network_statnet %e% "Reciptype")
[1] "to"  "cc"  "cc"  "bcc" "bcc" "to" 
head(network_statnet %e% "Time")
[1] "1979-12-31 21:00:00" "1979-12-31 21:00:00" "1979-12-31 21:00:00"
[4] "1979-12-31 21:00:00" "1979-12-31 21:00:00" "1979-12-31 21:00:00"
head(network_statnet %e% "Topic")
[1] "1" "3" "3" "3" "3" "3"

Clearly, there are differences in how the vertices are represented in igraph vs. statnet. For example, employee names are stored as a node attribute in igraph, but in statnet the nodes are identified only by number.

Next, I want to look at the dyad census in igraph

igraph::dyad.census(network_igraph)
$mut
[1] 30600

$asym
[1] 64208

$null
[1] -77972

and in statnet

sna::dyad.census(network_statnet)
     Mut Asym  Null
[1,] 913 1184 14739

The dyad census gives vastly different results in the two packages, and I am not yet sure why. A negative count of null dyads cannot be right, though, which suggests that the repeated edges in the igraph object are distorting the igraph result.
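A quick sanity check is available for both results: a network with n nodes has n(n-1)/2 dyads, and the three census categories should add up to that number. The sketch below simply re-enters the counts printed above:

```r
n <- 184
possible_dyads <- n * (n - 1) / 2

# Counts copied from the two outputs above
igraph_census  <- c(mut = 30600, asym = 64208, null = -77972)
statnet_census <- c(mut = 913,   asym = 1184,  null = 14739)

possible_dyads        # 16836
sum(igraph_census)    # 16836
sum(statnet_census)   # 16836
```

Both sets of counts add up to the right total, but only the statnet counts are all non-negative and no larger than that total.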

Next I’ll look at the triad census in igraph

igraph::triad.census(network_igraph)
 [1] 700234  19530 249694   8409   2695   5176   7060  13227   1180
[10]     59   6781   1023   1137    786   2782   1611

and in statnet

sna::triad.census(network_statnet)
        003    012    102 021D 021U 021C 111D  111U 030T 030C  201
[1,] 700234 150250 118974 8409 2695 5176 7060 13227 1180   59 6781
     120D 120U 120C  210  300
[1,] 1023 1137  786 2782 1611

The Enron network has 184 vertices, so to check that the igraph triad census accounts for every triad, I compare the number of possible triads with the sum of the census:

#possible triads in network
184*183*182/6
[1] 1021384
sum(igraph::triad.census(network_igraph))
[1] 1021384

Similarly, I run the same check on the statnet network, which also has 184 vertices:

#possible triads in network
184*183*182/6
[1] 1021384
sum(sna::triad.census(network_statnet))
[1] 1021384

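Now I’m getting somewhere! I don’t yet know exactly how the triad census informs my interpretations, but I know that both packages account for every possible triad. The two censuses agree in every class except 012 and 102, where the counts differ (19530 and 249694 in igraph versus 150250 and 118974 in statnet).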
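Matching totals do not guarantee matching classes, so it can help to compare the two censuses class by class. A small sketch, re-entering the counts printed above and assuming that igraph returns the classes in the same order as the statnet column labels:

```r
# Triad class labels, taken from the statnet output
# (assumption: igraph uses the same ordering)
classes <- c("003", "012", "102", "021D", "021U", "021C", "111D", "111U",
             "030T", "030C", "201", "120D", "120U", "120C", "210", "300")

igraph_triads  <- c(700234, 19530, 249694, 8409, 2695, 5176, 7060, 13227,
                    1180, 59, 6781, 1023, 1137, 786, 2782, 1611)
statnet_triads <- c(700234, 150250, 118974, 8409, 2695, 5176, 7060, 13227,
                    1180, 59, 6781, 1023, 1137, 786, 2782, 1611)
names(igraph_triads) <- names(statnet_triads) <- classes

# Classes where the two packages disagree
differing <- classes[igraph_triads != statnet_triads]
differing

# Both totals should equal the number of possible triads
choose(184, 3)
sum(igraph_triads)
sum(statnet_triads)
```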

Looking next at the global transitivity in statnet:

gtrans(network_statnet)
[1] 0.3580924

Looking next at the network transitivity in igraph:

transitivity(network_igraph)
[1] 0.3725138

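They are not the same, but the difference is not unreasonable given that the two functions calculate transitivity differently: by default, sna’s gtrans() takes edge direction into account, whereas igraph’s transitivity() treats a directed graph as undirected.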

Next, I wanted the local (ego) transitivity for the employees whose names appeared at the top of the igraph node information, but I could not get the command below to run. A likely cause is that indexing V(network_igraph) with character strings looks vertices up by a ‘name’ attribute, while this dataset stores the names in ‘Name’. I will need to take more time to explore this.

#transitivity(network_igraph, type="local", vids=V(network_igraph)[c("Albert Meyers", "Thomas Martin", "Andrea Ring", "Andrew Lewis", "Andy Zipper", "Jeffrey Shankman")])
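One way to request local transitivity for particular people is to look up their vertex positions through the Name attribute first. The sketch below runs on a made-up four-node graph (toy_g and the wanted vector are hypothetical) and has not been run against the Enron object:

```r
library(igraph)

# Hypothetical toy graph: four fully connected nodes with a Name attribute
toy_g <- make_full_graph(4)
V(toy_g)$Name <- c("Albert Meyers", "Thomas Martin",
                   "Andrea Ring", "Andrew Lewis")

# Find vertex positions by matching on the Name attribute
wanted <- c("Albert Meyers", "Andrea Ring")
ids <- which(V(toy_g)$Name %in% wanted)

# Local transitivity for just those vertices
local_trans <- transitivity(toy_g, type = "local", vids = ids)
local_trans
```

Because every node in the toy graph is connected to every other node, each requested vertex has a local transitivity of 1.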

However, I can look at global vs. average local transitivity

transitivity(network_igraph, type="global")
[1] 0.3725138
transitivity(network_igraph, type="average")
[1] 0.5055302

The average local transitivity (0.51) is noticeably higher than the global transitivity (0.37). From my still naive network knowledge, this suggests that the network as a whole is fairly loose, while some nodes sit in more tightly connected neighborhoods.
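The gap between the two measures comes from how they average. Global transitivity is the share of all connected triples that are closed, so high-degree nodes dominate it, while average local transitivity gives every node equal weight. A base R sketch on a made-up undirected "bow tie" graph (two triangles sharing one node; the matrix A is hypothetical) shows the same pattern:

```r
# Hypothetical bow-tie graph: triangles 1-2-3 and 3-4-5 share node 3
edges <- rbind(c(1, 2), c(2, 3), c(1, 3), c(3, 4), c(4, 5), c(3, 5))
A <- matrix(0, nrow = 5, ncol = 5)
A[edges] <- 1
A[edges[, 2:1]] <- 1   # make the adjacency matrix symmetric

k  <- rowSums(A)                 # degree of each node
a3 <- diag(A %*% A %*% A)        # closed walks of length 3 per node

local_trans  <- a3 / (k * (k - 1))          # closed share of each node's triples
global_trans <- sum(a3) / sum(k * (k - 1))  # closed share of all triples

local_trans
mean(local_trans)
global_trans
```

Here the shared node has a local value of 1/3 but holds most of the connected triples, so it pulls the global value (0.6) below the average of the local values (about 0.87).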

Looking at the geodesic distance:

average.path.length(network_igraph,directed=T)
[1] 2.390464

This tells me that, on average, the shortest path between two employees is about 2.4 steps long.

Getting to look at the components of the network in igraph:

names(igraph::components(network_igraph))
[1] "membership" "csize"      "no"        
igraph::components(network_igraph)$no 
[1] 3
igraph::components(network_igraph)$csize
[1] 182   1   1

It shows that there are 3 components in the network: 182 of the 184 nodes make up the giant component, and the other 2 are isolates.

Finally, I get my answer on isolates.

isolates(network_statnet)
[1]  72 118

Since I know that the nodes are Enron employees and they are assigned numbers in the statnet network, running the isolate command tells me that employee #72 and #118 are indeed the 2 isolates viewed in the initial graphic representation of the network.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Becvar (2022, Feb. 17). Data Analytics and Computational Social Science: Week 2 Assignment. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httprpubscomkbecenron/

BibTeX citation

@misc{becvar2022week,
  author = {Becvar, Kristina},
  title = {Data Analytics and Computational Social Science: Week 2 Assignment},
  url = {https://github.com/DACSS/dacss_course_website/posts/httprpubscomkbecenron/},
  year = {2022}
}