# Create a directory for data if it does not exist
if (!dir.exists("data")) {
  dir.create("data")
}

# Download the dataset
url <- "https://snap.stanford.edu/data/email-Eu-core.txt.gz"
destfile <- "data/email-Eu-core.txt.gz"

# Download if file does not already exist
if (!file.exists(destfile)) {
  download.file(url, destfile)
  cat("Dataset downloaded successfully!\n")
} else {
  cat("Dataset already exists. Skipping download.\n")
}
Note: File Format
The dataset is in .txt.gz format (compressed text file). R can read this directly without manual decompression!
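To illustrate the point about compressed files, here is a minimal sketch that writes a tiny gzip-compressed edge list to a temporary file and reads it back. The file contents and the names `tmp`, `toy_edges`, and `first_lines` are made up for illustration; only base R functions are used.

```r
# Write a small gzip-compressed edge list to a temporary file (toy data).
tmp <- tempfile(fileext = ".txt.gz")
con <- gzfile(tmp, open = "wt")
writeLines(c("0 1", "0 2", "1 2"), con)
close(con)

# read.table() opens the path in read mode, so the gzip compression
# is detected and handled without a separate decompression step.
toy_edges <- read.table(tmp, header = FALSE, col.names = c("from", "to"))
print(toy_edges)

# A gzfile() connection is handy for peeking at the raw lines.
con <- gzfile(tmp, open = "rt")
first_lines <- readLines(con, n = 2)
close(con)
print(first_lines)

# Clean up the temporary file.
unlink(tmp)
```

The same pattern should apply to the downloaded file: pass its path to `read.table()` as-is.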
Step 3: Load the Network Data
Now we will read the edge list and examine its structure.
# Read the edge list
# The file contains two columns: FromNodeId ToNodeId
edge_data <- read.table("data/email-Eu-core.txt.gz",
                        header = FALSE,
                        col.names = c("from", "to"))

# View the first few rows
cat("First 10 edges in the network:\n")
head(edge_data, 10)

# Check the structure
cat("\nDataset dimensions:\n")
cat("Number of edges:", nrow(edge_data), "\n")
cat("Number of columns:", ncol(edge_data), "\n")
Expected Output:
First 10 edges in the network:
from to
1 0 1
2 0 2
3 0 3
4 0 4
5 0 5
...
Dataset dimensions:
Number of edges: 25571
Number of columns: 2
Tip: Understanding Edge Lists
An edge list is the simplest way to represent a network:
- Each row represents one edge (connection)
- First column: source node (sender)
- Second column: target node (receiver)
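As a small illustration of the edge-list idea, the sketch below builds a toy edge list by hand and counts its edges and nodes using base R only. The data and variable names are hypothetical and are not taken from the email dataset.

```r
# A toy edge list (hypothetical data): each row is one directed edge.
toy_edges <- data.frame(
  from = c(0, 0, 1, 2),
  to   = c(1, 2, 2, 0)
)

# The number of edges is the number of rows.
n_edges <- nrow(toy_edges)

# The nodes are the distinct IDs that appear in either column.
nodes <- sort(unique(c(toy_edges$from, toy_edges$to)))
n_nodes <- length(nodes)

cat("Edges:", n_edges, "\n")
cat("Nodes:", n_nodes, "\n")
```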
Step 4: Create a Network Object
We will use igraph to create a graph object from the edge list.
# Assumes igraph is already loaded with library(igraph)

# Create a directed graph from the edge list
g <- graph_from_data_frame(edge_data, directed = TRUE)

# Display basic information
cat("Network created successfully!\n\n")
cat("Network Summary:\n")
cat(" Number of nodes:", vcount(g), "\n")
cat(" Number of edges:", ecount(g), "\n")
cat(" Is directed:", is_directed(g), "\n")
cat(" Is weighted:", is_weighted(g), "\n")
Expected Output:
Network created successfully!
Network Summary:
Number of nodes: 1005
Number of edges: 25571
Is directed: TRUE
Is weighted: FALSE
Important: Directed vs Undirected
This network is directed because each email has a sender and a receiver. The direction matters!
Edge A -> B: Person A sent an email to Person B
This is different from B -> A
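To make the distinction concrete, here is a minimal base R sketch on a toy edge list. The helper `has_edge()` is hypothetical (it is not an igraph function) and simply checks whether a given directed edge appears in the data.

```r
# Toy directed edge list (hypothetical): A emailed B, but B never emailed A.
# A and C emailed each other, so that pair is reciprocal.
toy_edges <- data.frame(
  from = c("A", "A", "C"),
  to   = c("B", "C", "A"),
  stringsAsFactors = FALSE
)

# Illustrative helper: is there a directed edge u -> v in the edge list?
has_edge <- function(edges, u, v) {
  any(edges$from == u & edges$to == v)
}

a_to_b <- has_edge(toy_edges, "A", "B")
b_to_a <- has_edge(toy_edges, "B", "A")
a_to_c <- has_edge(toy_edges, "A", "C")
c_to_a <- has_edge(toy_edges, "C", "A")

cat("A -> B:", a_to_b, "\n")
cat("B -> A:", b_to_a, "\n")
cat("A -> C:", a_to_c, "\n")
cat("C -> A:", c_to_a, "\n")
```

In an undirected network, A -> B and B -> A would be the same edge, so this one-way relationship would be lost.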
Step 5: Compute Node Degree
Now let us calculate degree centrality for all nodes in the network.
In-Degree and Out-Degree
In a directed network, each node has two types of degree:
In-degree: Number of incoming edges (emails received)
Out-degree: Number of outgoing edges (emails sent)
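The two definitions above can be sketched with igraph's `degree()` function and its `mode` argument. The example uses a small hypothetical graph (`g_toy`) so the counts are easy to check by hand; on the email network, the same calls would presumably be applied to `g`.

```r
library(igraph)

# Toy directed edge list (hypothetical data, not from the email dataset).
toy_edges <- data.frame(
  from = c("A", "A", "B", "C"),
  to   = c("B", "C", "C", "A")
)
g_toy <- graph_from_data_frame(toy_edges, directed = TRUE)

# In-degree: number of incoming edges per node.
in_deg <- degree(g_toy, mode = "in")

# Out-degree: number of outgoing edges per node.
out_deg <- degree(g_toy, mode = "out")

# Total degree: incoming plus outgoing edges.
total_deg <- degree(g_toy, mode = "all")

# Collect the results in one table for inspection.
degree_table <- data.frame(
  node         = V(g_toy)$name,
  in_degree    = in_deg,
  out_degree   = out_deg,
  total_degree = total_deg
)
print(degree_table)
```

Here node C receives edges from both A and B, so it has the highest in-degree, while A sends two edges and has the highest out-degree.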
Degree Distribution: What percentage of nodes have in-degree greater than 50?
Network Density: Calculate the network density. Is this a sparse or dense network?
Isolates: How many nodes have total degree of 0? What might this represent?
Asymmetry: Identify nodes where the difference between in-degree and out-degree is greater than 100. What does this tell you about their communication pattern?
Quartiles: What is the in-degree at the 25th, 50th, and 75th percentiles?
Bonus Challenge: Create a function that takes a node ID and returns its degree rank (1st, 2nd, 3rd, etc.) for both in-degree and out-degree.
Key Takeaways
Important: Summary
Loading network data: Edge lists are a simple and common format for network data
Graph creation: graph_from_data_frame() converts edge lists to network objects
Degree centrality: Measures how connected each node is
In-degree: incoming connections
Out-degree: outgoing connections
Degree distribution: Often heavy-tailed, roughly following a power law (many nodes with low degree, few with high degree)
Correlation analysis: Helps reveal the relationship between different centrality measures, such as in-degree and out-degree
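As a reminder of what the last two points look like in practice, here is a minimal base R sketch. The degree vectors are made-up toy values, not results from the email network; on real data they would come from `degree(g, mode = "in")` and `degree(g, mode = "out")`. Spearman rank correlation is one reasonable choice for skewed degree data and is used here as an assumption, not as the only option.

```r
# Hypothetical degree vectors for six nodes (toy values, not from the dataset).
in_deg  <- c(1, 1, 2, 1, 5, 2)
out_deg <- c(1, 2, 2, 1, 4, 2)

# Degree distribution: how many nodes have each in-degree value.
in_dist <- table(in_deg)
print(in_dist)

# Share of nodes at each in-degree value.
in_share <- prop.table(in_dist)
print(round(in_share, 3))

# Rank correlation between in-degree and out-degree.
rho <- cor(in_deg, out_deg, method = "spearman")
cat("Spearman correlation:", round(rho, 3), "\n")
```

In this toy example most nodes have a low in-degree and a single node has a much higher one, which is the qualitative shape a heavy-tailed distribution takes.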
Next Steps
Now that you can ingest network data and compute basic centrality measures, you are ready to:
Explore other centrality measures (betweenness, closeness, eigenvector)
Analyze community structure in networks
Compare multiple networks
Work with larger and more complex datasets
Check out Problem Set 1 to apply these skills to the Wikipedia Vote Network!