library(igraph)
library(ggplot2)
library(dplyr)
# Download the dataset
url <- "https://snap.stanford.edu/data/wiki-Vote.txt.gz"
download.file(url, destfile = "wiki-Vote.txt.gz")
# Read the edge list (skip comment lines that start with #)
edge_data <- read.table("wiki-Vote.txt.gz",
comment.char = "#",
col.names = c("from", "to"))
# Create a directed graph
g <- graph_from_data_frame(edge_data, directed = TRUE)
# Basic network statistics
cat("Number of nodes:", vcount(g), "\n")
cat("Number of edges:", ecount(g), "\n")
cat("Is directed:", is_directed(g), "\n")Problem Set 1: Degree Centrality and Degree Distribution
Week 2 - Network Analytics
Overview
In this problem set, you will work with real-world network data from Wikipedia to compute degree centrality measures and analyze degree distributions. This exercise will help you understand:
- How to load and process directed network data
- The difference between in-degree and out-degree centrality
- How to compute and visualize degree distributions
- What degree distributions reveal about network structure
Dataset: Wikipedia Vote Network
Description: This network represents voting relationships among Wikipedia users during administrator elections from the site’s inception through January 2008.
Network Properties:
- Nodes: 7,115 Wikipedia users
- Edges: 103,689 directed edges
- Edge meaning: A directed edge from user i to user j means that user i voted on user j
- Network type: Directed, unweighted
Task 1: Download and Load the Data
First, download the dataset and load it into R.
Task 2: Compute In-Degree and Out-Degree
Calculate the in-degree and out-degree for all nodes in the network.
Questions:
- What is the maximum in-degree? What does this represent in the context of Wikipedia voting?
- What is the maximum out-degree? What does this represent?
- How many nodes have in-degree of 0? What does this mean?
- How many nodes have out-degree of 0? What does this mean?
# Compute in-degree and out-degree
in_deg <- degree(g, mode = "in")
out_deg <- degree(g, mode = "out")
# Summary statistics
cat("In-Degree Statistics:\n")
cat(" Max:", max(in_deg), "\n")
cat(" Mean:", round(mean(in_deg), 2), "\n")
cat(" Median:", median(in_deg), "\n")
cat(" Nodes with in-degree = 0:", sum(in_deg == 0), "\n\n")
cat("Out-Degree Statistics:\n")
cat(" Max:", max(out_deg), "\n")
cat(" Mean:", round(mean(out_deg), 2), "\n")
cat(" Median:", median(out_deg), "\n")
cat(" Nodes with out-degree = 0:", sum(out_deg == 0), "\n")
# Find the most "voted-on" user
most_voted_user <- which.max(in_deg)
cat("\nMost voted-on user: Node", most_voted_user,
"with in-degree =", in_deg[most_voted_user], "\n")
# Find the most active voter
most_active_voter <- which.max(out_deg)
cat("Most active voter: Node", most_active_voter,
"with out-degree =", out_deg[most_active_voter], "\n")Task 3: Analyze the Top 10 Users
Identify and compare the top 10 users by in-degree and out-degree.
# Create a data frame with degree measures
degree_df <- data.frame(
node = V(g)$name,
in_degree = in_deg,
out_degree = out_deg,
total_degree = in_deg + out_deg
)
# Top 10 by in-degree (most voted-on)
top_in <- degree_df %>%
arrange(desc(in_degree)) %>%
head(10)
cat("Top 10 Users by In-Degree (Most Votes Received):\n")
print(top_in)
# Top 10 by out-degree (most active voters)
top_out <- degree_df %>%
arrange(desc(out_degree)) %>%
head(10)
cat("\nTop 10 Users by Out-Degree (Most Votes Cast):\n")
print(top_out)
# Are the top users in both categories the same?
cat("\nOverlap between top in-degree and top out-degree users:",
length(intersect(top_in$node, top_out$node)), "out of 10\n")Question: Do highly endorsed users (high in-degree) also tend to be active voters (high out-degree)? What does this tell you about participation patterns?
Task 4: Compute Degree Distribution
Calculate the degree distribution for both in-degree and out-degree.
# Compute degree distributions
in_degree_dist <- table(in_deg)
out_degree_dist <- table(out_deg)
# Convert to data frames for plotting
in_deg_df <- data.frame(
degree = as.numeric(names(in_degree_dist)),
frequency = as.numeric(in_degree_dist),
type = "In-Degree"
)
out_deg_df <- data.frame(
degree = as.numeric(names(out_degree_dist)),
frequency = as.numeric(out_degree_dist),
type = "Out-Degree"
)
# Combine for comparison
combined_deg_df <- rbind(in_deg_df, out_deg_df)
# View the first few rows
head(in_deg_df, 10)
head(out_deg_df, 10)Task 5: Plot Degree Distribution
Create visualizations of the degree distributions using both linear and log-log scales.
Linear Scale Plot
library(ggplot2)
# Plot on linear scale
ggplot(combined_deg_df, aes(x = degree, y = frequency, color = type)) +
geom_point(alpha = 0.6, size = 2) +
geom_line(alpha = 0.4) +
scale_color_manual(values = c("In-Degree" = "#c41c85", "Out-Degree" = "#50C878")) +
labs(
title = "Degree Distribution - Wikipedia Vote Network",
subtitle = "Linear scale",
x = "Degree (k)",
y = "Frequency (Number of Nodes)",
color = "Degree Type"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom"
)Log-Log Scale Plot
# Plot on log-log scale
ggplot(combined_deg_df, aes(x = degree, y = frequency, color = type)) +
geom_point(alpha = 0.6, size = 2) +
geom_line(alpha = 0.4) +
scale_x_log10() +
scale_y_log10() +
scale_color_manual(values = c("In-Degree" = "#c41c85", "Out-Degree" = "#50C878")) +
labs(
title = "Degree Distribution - Wikipedia Vote Network",
subtitle = "Log-log scale",
x = "Degree (k) [log scale]",
y = "Frequency [log scale]",
color = "Degree Type"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom"
)Questions:
- What shape does the degree distribution follow on the log-log scale?
- Does the distribution appear to follow a power law? If so, what does this indicate?
- How do the in-degree and out-degree distributions compare?
- What does the degree distribution tell you about the structure of Wikipedia’s voting network?
Task 6: Compute Complementary Cumulative Distribution Function (CCDF)
The CCDF is often more robust than the degree distribution for identifying power laws.
# Function to compute CCDF
compute_ccdf <- function(degrees) {
deg_table <- table(degrees)
deg_values <- as.numeric(names(deg_table))
deg_counts <- as.numeric(deg_table)
# Compute CCDF: P(X >= k)
ccdf <- sapply(deg_values, function(k) {
sum(deg_counts[deg_values >= k]) / sum(deg_counts)
})
data.frame(degree = deg_values, ccdf = ccdf)
}
# Compute CCDF for in-degree and out-degree
in_ccdf <- compute_ccdf(in_deg)
in_ccdf$type <- "In-Degree"
out_ccdf <- compute_ccdf(out_deg)
out_ccdf$type <- "Out-Degree"
ccdf_combined <- rbind(in_ccdf, out_ccdf)
# Plot CCDF on log-log scale
ggplot(ccdf_combined, aes(x = degree, y = ccdf, color = type)) +
geom_point(alpha = 0.6, size = 2) +
geom_line(alpha = 0.4) +
scale_x_log10() +
scale_y_log10() +
scale_color_manual(values = c("In-Degree" = "#c41c85", "Out-Degree" = "#50C878")) +
labs(
title = "Complementary Cumulative Distribution Function (CCDF)",
subtitle = "Wikipedia Vote Network - Log-log scale",
x = "Degree (k) [log scale]",
y = "P(X e k) [log scale]",
color = "Degree Type"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom"
)Reflection Questions
After completing all tasks, answer the following:
Inequality in participation: What fraction of users received more than 100 votes? What fraction cast more than 100 votes?
Network interpretation: Based on the degree distributions, how would you characterize the Wikipedia voting system? Is it egalitarian or hierarchical?
Power-law hypothesis: Do the log-log plots suggest a power-law distribution? What would be the next steps to formally test this hypothesis?
Directed vs undirected: How would your analysis differ if this were an undirected network?
Practical implications: What does the degree distribution suggest about which users have influence in the Wikipedia administrator election process?
Solutions and Hints
Additional Challenges
Submission Guidelines
Your submission should include:
- R script or Quarto document with all code and outputs
- Visualizations of degree distributions (linear and log-log scales)
- Written answers to all reflection questions (200-400 words total)
- Interpretation of what the degree distribution reveals about Wikipedia’s voting network
File naming: week2_perform1_[YourName].qmd or .R
Due date: Check the course schedule