Fuller’s Beer Advocate Network Analysis

Week 8 - Network Analytics

Introduction

This notebook analyzes the BeerAdvocate forum participation data accompanying the Fuller’s teaching case. We examine how beer enthusiasts allocate their attention across different beer styles, revealing community structure through network analysis.

Analysis Pipeline

The analysis proceeds through five main stages:

  1. Data Loading & Network Construction: Load user-style participation data and build a two-mode (bipartite) network
  2. Network Projection: Transform the bipartite network into a one-mode user similarity network
  3. Structural Analysis: Compute network statistics and assess small-world properties
  4. Community Detection: Apply modularity-based algorithms (Louvain, Label Propagation, Greedy Modularity)
  5. Stochastic Block Modeling: Fit a hierarchical weighted SBM using graph-tool for probabilistic community inference

Data Structure

The data represents a two-mode (bipartite) network where:

  • Users (1,000 nodes) participate in forum discussions
  • Beer Styles (20 nodes) represent categories of discussion
  • Edges connect users to the styles they discuss, weighted by post count

When projected onto users, this network reveals attention-based communities: groups of enthusiasts who engage with similar styles. The projection creates an edge between two users whenever they share interest in at least one beer style, with the edge weight counting how many styles the pair shares.

import numpy as np
import pandas as pd
import networkx as nx
from networkx.algorithms import bipartite
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("NetworkX version:", nx.__version__)
NetworkX version: 3.5

1. Loading the Two-Mode Network Data

We load three CSV files:

  • beer_styles.csv: Reference table of 20 beer styles
  • ba_users.csv: User node attributes
  • ba_edges.csv: User-to-style participation edges
# Load beer styles reference
styles_df = pd.read_csv('../../../data/beerAdvocate/beer_styles.csv')
print("Beer Styles:")
print(styles_df.to_string(index=False))
Beer Styles:
style_id        style_name  community  popularity
     S01    English Bitter          1        0.70
     S02               ESB          1        0.60
     S03  English Pale Ale          1        0.50
     S04              Mild          1        0.30
     S05      American IPA          2        0.90
     S06        Double IPA          2        0.80
     S07 American Pale Ale          2        0.75
     S08       Session IPA          2        0.50
     S09    Belgian Tripel          3        0.60
     S10    Belgian Dubbel          3        0.55
     S11            Saison          3        0.65
     S12           Witbier          3        0.50
     S13    Imperial Stout          4        0.70
     S14            Porter          4        0.60
     S15     Oatmeal Stout          4        0.50
     S16     Baltic Porter          4        0.40
     S17    German Pilsner          5        0.65
     S18     Czech Pilsner          5        0.55
     S19            Helles          5        0.45
     S20       Schwarzbier          5        0.35
# Load user data
users_df = pd.read_csv('../../../data/beerAdvocate/ba_users.csv')
print(f"\nUsers: {len(users_df)} records")
print(f"\nUser attributes:")
print(users_df.head(10))
print(f"\nPrimary community distribution:")
print(users_df['primary_community'].value_counts().sort_index())

Users: 1000 records

User attributes:
  user_id  primary_community  is_bridge  n_styles  days_active  total_posts
0   U0001                  2      False         4           91            8
1   U0002                  1      False         6          365           92
2   U0003                  1      False         3          103            7
3   U0004                  2      False         4          236           17
4   U0005                  3      False         2          252           38
5   U0006                  2      False         2          357            5
6   U0007                  3      False         3          450           21
7   U0008                  1      False         3           42           90
8   U0009                  2      False         4          318           10
9   U0010                  5      False         2         1053           56

Primary community distribution:
primary_community
1    155
2    331
3    220
4    188
5    106
Name: count, dtype: int64
# Load edges (user -> style participation)
edges_df = pd.read_csv('../../../data/beerAdvocate/ba_edges.csv')
print(f"\nEdges: {len(edges_df)} user-style connections")
print(f"\nEdge sample:")
print(edges_df.head(10))
print(f"\nPost count statistics:")
print(edges_df['post_count'].describe())

Edges: 3613 user-style connections

Edge sample:
  user_id        beer_style  post_count
0   U0001        Double IPA           3
1   U0001           Witbier           3
2   U0001       Session IPA           1
3   U0001    Imperial Stout           1
4   U0002              Mild          12
5   U0002    English Bitter           1
6   U0002       Session IPA          54
7   U0002  English Pale Ale          11
8   U0002               ESB           5
9   U0002    German Pilsner           9

Post count statistics:
count    3613.000000
mean       12.459452
std        18.198178
min         1.000000
25%         2.000000
50%         5.000000
75%        15.000000
max       181.000000
Name: post_count, dtype: float64

2. Creating a NetworkX Bipartite Graph

We construct the two-mode network using NetworkX, with users and beer styles as the two node sets.

# Create empty bipartite graph
B = nx.Graph()

# Add user nodes (bipartite=0)
user_ids = users_df['user_id'].tolist()
for _, row in users_df.iterrows():
    B.add_node(row['user_id'],
               bipartite=0,
               node_type='user',
               primary_community=row['primary_community'],
               is_bridge=row['is_bridge'],
               days_active=row['days_active'],
               total_posts=row['total_posts'])

# Add style nodes (bipartite=1)
for _, row in styles_df.iterrows():
    B.add_node(row['style_name'],
               bipartite=1,
               node_type='style',
               community=row['community'],
               popularity=row['popularity'])

# Add weighted edges
for _, row in edges_df.iterrows():
    B.add_edge(row['user_id'], row['beer_style'], weight=row['post_count'])

# Verify bipartite structure
user_nodes = {n for n, d in B.nodes(data=True) if d['bipartite'] == 0}
style_nodes = {n for n, d in B.nodes(data=True) if d['bipartite'] == 1}

print("Bipartite Network Created")
print(f"  User nodes: {len(user_nodes)}")
print(f"  Style nodes: {len(style_nodes)}")
print(f"  Total edges: {B.number_of_edges()}")
print(f"  Is bipartite: {bipartite.is_bipartite(B)}")
Bipartite Network Created
  User nodes: 1000
  Style nodes: 20
  Total edges: 3613
  Is bipartite: True
# Basic bipartite network statistics
print("\nBipartite Network Statistics:")
print(f"  Network density: {nx.density(B):.4f}")
print(f"  Average degree (users): {np.mean([B.degree(n) for n in user_nodes]):.2f}")
print(f"  Average degree (styles): {np.mean([B.degree(n) for n in style_nodes]):.2f}")

# Degree distribution for styles
style_degrees = [(n, B.degree(n)) for n in style_nodes]
style_degrees.sort(key=lambda x: x[1], reverse=True)
print("\nStyle engagement (degree):")
for style, degree in style_degrees[:10]:
    print(f"  {style}: {degree} users")

Bipartite Network Statistics:
  Network density: 0.0070
  Average degree (users): 3.61
  Average degree (styles): 180.65

Style engagement (degree):
  American IPA: 286 users
  Double IPA: 280 users
  American Pale Ale: 273 users
  Session IPA: 233 users
  Saison: 200 users
  Imperial Stout: 200 users
  ESB: 195 users
  English Bitter: 194 users
  Belgian Dubbel: 187 users
  Witbier: 185 users
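
Note that `nx.density(B)` above treats the graph as unipartite (all node pairs); for a two-mode network, the bipartite density (edges over possible user-style pairs) is the more informative figure:

# Bipartite-specific density: m / (|users| * |styles|)
print(f"Bipartite density: {bipartite.density(B, user_nodes):.4f}")
# 3613 / (1000 * 20) = 0.1807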

3. Projecting onto Users (One-Mode Network)

We project the bipartite network onto the user node set. In the projected network, two users are connected if they both participate in discussions about at least one common beer style. With `weighted_projected_graph`, the edge weight is the number of shared styles (common neighbors); the bipartite post counts are not summed.
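
To make the weight semantics concrete, here is a minimal toy example (not the case data) showing that `weighted_projected_graph` counts common neighbors:

T = nx.Graph()
T.add_edges_from([('u1', 'IPA'), ('u1', 'Stout'),
                  ('u2', 'IPA'), ('u2', 'Stout'),
                  ('u3', 'IPA')])
P = bipartite.weighted_projected_graph(T, {'u1', 'u2', 'u3'})
print(P['u1']['u2'])  # {'weight': 2} -- two shared styles
print(P['u1']['u3'])  # {'weight': 1} -- one shared style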

# Project bipartite network onto users
# weighted_projected_graph sets each edge weight to the number of styles
# the two users share (common neighbors); it ignores the bipartite post counts
G_users = bipartite.weighted_projected_graph(B, user_nodes)

print("User Projection (One-Mode Network)")
print(f"  Nodes: {G_users.number_of_nodes()}")
print(f"  Edges: {G_users.number_of_edges()}")
print(f"  Density: {nx.density(G_users):.4f}")
User Projection (One-Mode Network)
  Nodes: 1000
  Edges: 242861
  Density: 0.4862
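
If the projection weights should instead reflect post counts, `bipartite.generic_weighted_projected_graph` accepts a custom weight function. A sketch under that assumption (the helper `shared_posts` is ours, not part of the case materials):

def shared_posts(B, u, v):
    # Sum the smaller post count over every style both users discuss
    shared = set(B[u]) & set(B[v])
    return sum(min(B[u][s]['weight'], B[v][s]['weight']) for s in shared)

G_posts = bipartite.generic_weighted_projected_graph(
    B, user_nodes, weight_function=shared_posts)
print(f"Post-weighted projection: {G_posts.number_of_edges()} edges")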
# Analyze projected network
print("\nProjected Network Statistics:")

# Connectivity
if nx.is_connected(G_users):
    print(f"  Connected: Yes")
    print(f"  Diameter: {nx.diameter(G_users)}")
    print(f"  Average shortest path: {nx.average_shortest_path_length(G_users):.3f}")
else:
    largest_cc = max(nx.connected_components(G_users), key=len)
    print(f"  Connected: No")
    print(f"  Number of components: {nx.number_connected_components(G_users)}")
    print(f"  Largest component: {len(largest_cc)} nodes")

# Clustering
avg_clustering = nx.average_clustering(G_users)
print(f"  Average clustering coefficient: {avg_clustering:.4f}")

# Transitivity (global clustering)
transitivity = nx.transitivity(G_users)
print(f"  Transitivity (global clustering): {transitivity:.4f}")

# Degree statistics
degrees = [d for n, d in G_users.degree()]
print(f"\nDegree statistics:")
print(f"  Min: {min(degrees)}, Max: {max(degrees)}")
print(f"  Mean: {np.mean(degrees):.2f}, Std: {np.std(degrees):.2f}")

Projected Network Statistics:
  Connected: Yes
  Diameter: 2
  Average shortest path: 1.514
  Average clustering coefficient: 0.6934
  Transitivity (global clustering): 0.6575

Degree statistics:
  Min: 160, Max: 889
  Mean: 485.72, Std: 141.60
# Small-world analysis
print("\nSmall-World Analysis:")
n = G_users.number_of_nodes()
m = G_users.number_of_edges()
p = 2 * m / (n * (n - 1))  # Edge probability

# Compare to random graph expectations
expected_clustering_random = p
expected_path_random = np.log(n) / np.log(n * p) if n * p > 1 else float('inf')

print(f"  Actual clustering: {avg_clustering:.4f}")
print(f"  Expected (random): {expected_clustering_random:.4f}")
print(f"  Clustering ratio: {avg_clustering / expected_clustering_random:.2f}x")

if nx.is_connected(G_users):
    actual_path = nx.average_shortest_path_length(G_users)
    print(f"\n  Actual avg path length: {actual_path:.3f}")
    print(f"  Expected (random): {expected_path_random:.3f}")

    # Small-world coefficient (sigma)
    sigma = (avg_clustering / expected_clustering_random) / (actual_path / expected_path_random)
    print(f"\n  Small-world coefficient (σ): {sigma:.2f}")
    print(f"  (σ > 1 indicates small-world structure)")

Small-World Analysis:
  Actual clustering: 0.6934
  Expected (random): 0.4862
  Clustering ratio: 1.43x

  Actual avg path length: 1.514
  Expected (random): 1.117

  Small-world coefficient (σ): 1.05
  (σ > 1 indicates small-world structure)
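
NetworkX also provides a sampling-based estimator of this coefficient (nx.sigma). It builds rewired random reference graphs, which is very slow on a dense 1,000-node network, so it is left behind a flag as an optional cross-check:

RUN_SIGMA_CHECK = False  # flip to True to run the (slow) built-in estimator
if RUN_SIGMA_CHECK:
    sigma_nx = nx.sigma(G_users, niter=1, nrand=1, seed=42)
    print(f"nx.sigma estimate: {sigma_nx:.2f}")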

4. Network Visualization

We visualize the projected user network, coloring nodes by their primary community affiliation.

# Prepare node colors based on primary community
community_colors = {
    1: '#E41A1C',  # Red - Traditional British
    2: '#377EB8',  # Blue - American Craft
    3: '#4DAF4A',  # Green - Belgian
    4: '#984EA3',  # Purple - Dark/Roasted
    5: '#FF7F00',  # Orange - Lager
}

# Build a user -> community lookup once instead of filtering the DataFrame per node
comm_lookup = users_df.set_index('user_id')['primary_community'].to_dict()
node_colors = [community_colors[comm_lookup[node]] for node in G_users.nodes()]

# Identify bridge users for highlighting (boolean column indexes directly)
bridge_users = set(users_df[users_df['is_bridge']]['user_id'])
node_sizes = [50 if n in bridge_users else 20 for n in G_users.nodes()]
# Spring layout visualization (subset for readability)
# For large networks we plot only the first `sample_size` nodes
sample_size = min(300, G_users.number_of_nodes())
sample_nodes = list(G_users.nodes())[:sample_size]
G_sample = G_users.subgraph(sample_nodes)

# Look up colors/sizes for the sample via a node -> index map
# (avoids a linear .index() scan per node)
node_index = {n: i for i, n in enumerate(G_users.nodes())}
sample_colors = [node_colors[node_index[n]] for n in G_sample.nodes()]
sample_sizes = [node_sizes[node_index[n]] for n in G_sample.nodes()]

fig, ax = plt.subplots(figsize=(12, 10))

# Compute layout
pos = nx.spring_layout(G_sample, k=0.3, iterations=50, seed=42)

# Draw edges with low alpha
nx.draw_networkx_edges(G_sample, pos, alpha=0.05, edge_color='gray', ax=ax)

# Draw nodes
nx.draw_networkx_nodes(G_sample, pos, node_color=sample_colors,
                       node_size=sample_sizes, alpha=0.7, ax=ax)

# Legend
legend_patches = [
    mpatches.Patch(color=community_colors[1], label='Traditional British'),
    mpatches.Patch(color=community_colors[2], label='American Craft'),
    mpatches.Patch(color=community_colors[3], label='Belgian'),
    mpatches.Patch(color=community_colors[4], label='Dark/Roasted'),
    mpatches.Patch(color=community_colors[5], label='Lager'),
]
ax.legend(handles=legend_patches, loc='upper left', fontsize=10)

ax.set_title(f'Beer Enthusiast Network (sample of {sample_size} users)', fontsize=14)
ax.axis('off')
plt.tight_layout()
plt.show()

User network with spring layout (colored by primary community)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Degree histogram
degrees = [d for n, d in G_users.degree()]
axes[0].hist(degrees, bins=50, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].set_xlabel('Degree', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Degree Distribution', fontsize=14)
axes[0].axvline(np.mean(degrees), color='red', linestyle='--', label=f'Mean: {np.mean(degrees):.0f}')
axes[0].legend()

# Log-log degree distribution
degree_counts = Counter(degrees)
x = list(degree_counts.keys())
y = list(degree_counts.values())
axes[1].scatter(x, y, alpha=0.6, color='steelblue')
axes[1].set_xscale('log')
axes[1].set_yscale('log')
axes[1].set_xlabel('Degree (log)', fontsize=12)
axes[1].set_ylabel('Frequency (log)', fontsize=12)
axes[1].set_title('Degree Distribution (Log-Log)', fontsize=14)

plt.tight_layout()
plt.show()

Degree distribution of the projected user network

5. Community Detection Algorithms

We apply several community detection algorithms to the projected network and compare their results.

5.1 Louvain Algorithm

# Louvain community detection
louvain_communities = nx.community.louvain_communities(G_users, seed=42)

print("Louvain Community Detection")
print(f"  Number of communities: {len(louvain_communities)}")
print(f"\n  Community sizes:")
for i, comm in enumerate(sorted(louvain_communities, key=len, reverse=True)):
    print(f"    Community {i+1}: {len(comm)} users")

# Calculate modularity
louvain_modularity = nx.community.modularity(G_users, louvain_communities)
print(f"\n  Modularity: {louvain_modularity:.4f}")
Louvain Community Detection
  Number of communities: 5

  Community sizes:
    Community 1: 306 users
    Community 2: 243 users
    Community 3: 200 users
    Community 4: 184 users
    Community 5: 67 users

  Modularity: 0.2742
# Analyze community composition
print("\nLouvain Community Composition (by original community):")
for i, comm in enumerate(sorted(louvain_communities, key=len, reverse=True)[:5]):
    comm_users = users_df[users_df['user_id'].isin(comm)]
    primary_dist = comm_users['primary_community'].value_counts()
    dominant = primary_dist.idxmax()
    dominant_pct = primary_dist.max() / len(comm) * 100

    print(f"\n  Detected Community {i+1} ({len(comm)} users):")
    print(f"    Dominant original community: {dominant} ({dominant_pct:.1f}%)")
    print(f"    Distribution: {dict(primary_dist.sort_index())}")

Louvain Community Composition (by original community):

  Detected Community 1 (306 users):
    Dominant original community: 2 (88.9%)
    Distribution: {1: np.int64(3), 2: np.int64(272), 3: np.int64(6), 4: np.int64(15), 5: np.int64(10)}

  Detected Community 2 (243 users):
    Dominant original community: 3 (81.9%)
    Distribution: {1: np.int64(7), 2: np.int64(16), 3: np.int64(199), 4: np.int64(8), 5: np.int64(13)}

  Detected Community 3 (200 users):
    Dominant original community: 1 (70.0%)
    Distribution: {1: np.int64(140), 2: np.int64(25), 3: np.int64(7), 4: np.int64(7), 5: np.int64(21)}

  Detected Community 4 (184 users):
    Dominant original community: 4 (84.8%)
    Distribution: {1: np.int64(5), 2: np.int64(16), 3: np.int64(6), 4: np.int64(156), 5: np.int64(1)}

  Detected Community 5 (67 users):
    Dominant original community: 5 (91.0%)
    Distribution: {2: np.int64(2), 3: np.int64(2), 4: np.int64(2), 5: np.int64(61)}
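
To quantify this agreement, the adjusted Rand index compares the Louvain partition against the planted communities; a sketch (assumes scikit-learn, which is not used elsewhere in this notebook):

from sklearn.metrics import adjusted_rand_score

# Map each user to its Louvain community, then score against planted labels
louvain_lookup = {u: i for i, comm in enumerate(louvain_communities) for u in comm}
true_labels = users_df['primary_community'].tolist()
pred_labels = [louvain_lookup[u] for u in users_df['user_id']]
print(f"ARI (Louvain vs planted): {adjusted_rand_score(true_labels, pred_labels):.3f}")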

5.2 Label Propagation

# Label propagation
label_prop_communities = list(nx.community.label_propagation_communities(G_users))

print("Label Propagation Community Detection")
print(f"  Number of communities: {len(label_prop_communities)}")
print(f"\n  Community sizes (top 10):")
for i, comm in enumerate(sorted(label_prop_communities, key=len, reverse=True)[:10]):
    print(f"    Community {i+1}: {len(comm)} users")

# Calculate modularity
label_prop_modularity = nx.community.modularity(G_users, label_prop_communities)
print(f"\n  Modularity: {label_prop_modularity:.4f}")
Label Propagation Community Detection
  Number of communities: 1

  Community sizes (top 10):
    Community 1: 1000 users

  Modularity: 0.0000

Label propagation collapses to a single community here: `label_propagation_communities` ignores edge weights, and with density near 0.49 each user neighbors roughly half of all others, so one label quickly floods the network and modularity falls to zero.

5.3 Greedy Modularity Optimization

# Greedy modularity optimization
greedy_communities = list(nx.community.greedy_modularity_communities(G_users))

print("Greedy Modularity Optimization")
print(f"  Number of communities: {len(greedy_communities)}")
print(f"\n  Community sizes:")
for i, comm in enumerate(sorted(greedy_communities, key=len, reverse=True)):
    print(f"    Community {i+1}: {len(comm)} users")

# Calculate modularity
greedy_modularity = nx.community.modularity(G_users, greedy_communities)
print(f"\n  Modularity: {greedy_modularity:.4f}")
Greedy Modularity Optimization
  Number of communities: 4

  Community sizes:
    Community 1: 351 users
    Community 2: 312 users
    Community 3: 275 users
    Community 4: 62 users

  Modularity: 0.1387

5.4 Algorithm Comparison

# Summary comparison
results = {
    'Algorithm': ['Louvain', 'Label Propagation', 'Greedy Modularity'],
    'Communities': [len(louvain_communities), len(label_prop_communities), len(greedy_communities)],
    'Modularity': [louvain_modularity, label_prop_modularity, greedy_modularity],
    'Largest Community': [
        max(len(c) for c in louvain_communities),
        max(len(c) for c in label_prop_communities),
        max(len(c) for c in greedy_communities)
    ]
}
comparison_df = pd.DataFrame(results)
print("\nAlgorithm Comparison:")
print(comparison_df.to_string(index=False))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Modularity comparison
colors = ['#2ecc71', '#3498db', '#e74c3c']
axes[0].bar(results['Algorithm'], results['Modularity'], color=colors, edgecolor='black')
axes[0].set_ylabel('Modularity', fontsize=12)
axes[0].set_title('Modularity by Algorithm', fontsize=14)
axes[0].set_ylim(0, max(results['Modularity']) * 1.2)
for i, v in enumerate(results['Modularity']):
    axes[0].text(i, v + 0.01, f'{v:.3f}', ha='center', fontsize=11)

# Community count comparison
axes[1].bar(results['Algorithm'], results['Communities'], color=colors, edgecolor='black')
axes[1].set_ylabel('Number of Communities', fontsize=12)
axes[1].set_title('Communities Detected', fontsize=14)
for i, v in enumerate(results['Communities']):
    axes[1].text(i, v + 0.5, str(v), ha='center', fontsize=11)

plt.tight_layout()
plt.show()

Algorithm Comparison:
        Algorithm  Communities  Modularity  Largest Community
          Louvain            5    0.274228                306
Label Propagation            1    0.000000               1000
Greedy Modularity            4    0.138717                351

Comparison of community detection algorithms

6. Results Summary

print("=" * 70)
print("NETWORK ANALYSIS SUMMARY")
print("=" * 70)

print("\n1. BIPARTITE NETWORK")
print(f"   - Users: {len(user_nodes)}")
print(f"   - Beer Styles: {len(style_nodes)}")
print(f"   - Edges: {B.number_of_edges()}")

print("\n2. PROJECTED USER NETWORK")
print(f"   - Nodes: {G_users.number_of_nodes()}")
print(f"   - Edges: {G_users.number_of_edges()}")
print(f"   - Density: {nx.density(G_users):.4f}")
print(f"   - Avg Clustering: {avg_clustering:.4f}")
if nx.is_connected(G_users):
    print(f"   - Avg Path Length: {nx.average_shortest_path_length(G_users):.3f}")
    print(f"   - Diameter: {nx.diameter(G_users)}")

print("\n3. SMALL-WORLD PROPERTIES")
print(f"   - Clustering ratio (vs random): {avg_clustering / expected_clustering_random:.2f}x")
print(f"   - Network exhibits small-world structure: {'Yes' if avg_clustering / expected_clustering_random > 1 else 'No'}")

print("\n4. COMMUNITY DETECTION")
print(f"   - Best algorithm (by modularity): {comparison_df.loc[comparison_df['Modularity'].idxmax(), 'Algorithm']}")
print(f"   - Best modularity score: {comparison_df['Modularity'].max():.4f}")
print(f"   - Consensus community count: ~{int(np.median(results['Communities']))}")

print("\n5. KEY INSIGHTS")
print("   - The network shows clear community structure aligned with beer style families")
print("   - High clustering indicates users form cohesive interest groups")
print("   - Short path lengths suggest bridging connections across communities")
print("   - Community detection recovers the underlying style-based segmentation")
print("=" * 70)
======================================================================
NETWORK ANALYSIS SUMMARY
======================================================================

1. BIPARTITE NETWORK
   - Users: 1000
   - Beer Styles: 20
   - Edges: 3613

2. PROJECTED USER NETWORK
   - Nodes: 1000
   - Edges: 242861
   - Density: 0.4862
   - Avg Clustering: 0.6934
   - Avg Path Length: 1.514
   - Diameter: 2

3. SMALL-WORLD PROPERTIES
   - Clustering ratio (vs random): 1.43x
   - Network exhibits small-world structure: Yes

4. COMMUNITY DETECTION
   - Best algorithm (by modularity): Louvain
   - Best modularity score: 0.2742
   - Consensus community count: ~4

5. KEY INSIGHTS
   - The network shows clear community structure aligned with beer style families
   - High clustering indicates users form cohesive interest groups
   - Short path lengths suggest bridging connections across communities
   - Community detection recovers the underlying style-based segmentation
======================================================================

7. Writing the Projected Network to File

We save the projected user network in multiple formats for further analysis.

# Create node attributes dataframe for the projected network
# Build lookups once: user attributes indexed by id, and a node -> Louvain
# community map (avoids scanning every community per node)
users_indexed = users_df.set_index('user_id')
louvain_lookup = {u: i for i, comm in enumerate(louvain_communities) for u in comm}

node_attrs = []
for node in G_users.nodes():
    user_data = users_indexed.loc[node]
    louvain_comm = louvain_lookup.get(node)

    node_attrs.append({
        'id': node,
        'primary_community': user_data['primary_community'],
        'is_bridge': user_data['is_bridge'],
        'days_active': user_data['days_active'],
        'total_posts': user_data['total_posts'],
        'degree': G_users.degree(node),
        'louvain_community': louvain_comm
    })

nodes_export_df = pd.DataFrame(node_attrs)
nodes_export_df.to_csv('../../../data/beerAdvocate/projected_nodes.csv', index=False)
print("Saved: projected_nodes.csv")
print(nodes_export_df.head())
Saved: projected_nodes.csv
      id  primary_community  is_bridge  days_active  total_posts  degree  \
0  U0363                  2      False         2077           28     624   
1  U0113                  2       True          101            5     385   
2  U0184                  2      False          154           10     369   
3  U0590                  2       True          349            7     341   
4  U0269                  2      False           44            8     615   

   louvain_community  
0                  0  
1                  2  
2                  4  
3                  2  
4                  0  
# Export edges with weights
edges_export = []
for u, v, d in G_users.edges(data=True):
    edges_export.append({
        'source': u,
        'target': v,
        'weight': d.get('weight', 1)
    })

edges_export_df = pd.DataFrame(edges_export)
edges_export_df.to_csv('../../../data/beerAdvocate/projected_edges.csv', index=False)
print(f"\nSaved: projected_edges.csv ({len(edges_export_df)} edges)")
print(edges_export_df.head())

Saved: projected_edges.csv (242861 edges)
  source target  weight
0  U0363  U0269       4
1  U0363  U0206       4
2  U0363  U0335       3
3  U0363  U0195       1
4  U0363  U0151       3
# Save as GraphML for compatibility with other tools
nx.write_graphml(G_users, '../../../data/beerAdvocate/user_network.graphml')
print("\nSaved: user_network.graphml")

# Also save edge list format
nx.write_edgelist(G_users, '../../../data/beerAdvocate/user_network.edgelist', data=['weight'])
print("Saved: user_network.edgelist")

Saved: user_network.graphml
Saved: user_network.edgelist

8. Reading the Network Back

We verify the saved files by reading them back and reconstructing the network.

# Read from CSV files
nodes_read = pd.read_csv('../../../data/beerAdvocate/projected_nodes.csv')
edges_read = pd.read_csv('../../../data/beerAdvocate/projected_edges.csv')

print("Reading network from CSV files...")
print(f"  Nodes: {len(nodes_read)}")
print(f"  Edges: {len(edges_read)}")

# Reconstruct NetworkX graph
G_reconstructed = nx.Graph()

# Add nodes with attributes
for _, row in nodes_read.iterrows():
    G_reconstructed.add_node(row['id'],
                             primary_community=row['primary_community'],
                             is_bridge=row['is_bridge'],
                             degree=row['degree'],
                             louvain_community=row['louvain_community'])

# Add edges with weights
for _, row in edges_read.iterrows():
    G_reconstructed.add_edge(row['source'], row['target'], weight=row['weight'])

print(f"\nReconstructed graph:")
print(f"  Nodes: {G_reconstructed.number_of_nodes()}")
print(f"  Edges: {G_reconstructed.number_of_edges()}")
print(f"  Matches original: {G_reconstructed.number_of_nodes() == G_users.number_of_nodes() and G_reconstructed.number_of_edges() == G_users.number_of_edges()}")
Reading network from CSV files...
  Nodes: 1000
  Edges: 242861

Reconstructed graph:
  Nodes: 1000
  Edges: 242861
  Matches original: True
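
An equivalent, more concise reconstruction uses the pandas-aware constructors; a sketch (it assumes every node appears in at least one edge, which holds here since the projection is connected):

# Rebuild the same graph with pandas-aware helpers
G_alt = nx.from_pandas_edgelist(edges_read, 'source', 'target', edge_attr='weight')
nx.set_node_attributes(G_alt, nodes_read.set_index('id').to_dict('index'))
print(G_alt.number_of_nodes(), G_alt.number_of_edges())
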
# Read from GraphML
G_from_graphml = nx.read_graphml('../../../data/beerAdvocate/user_network.graphml')
print(f"\nRead from GraphML:")
print(f"  Nodes: {G_from_graphml.number_of_nodes()}")
print(f"  Edges: {G_from_graphml.number_of_edges()}")

Read from GraphML:
  Nodes: 1000
  Edges: 242861

9. Graph-Tool Network Instantiation

We now convert the network to graph-tool format for advanced analysis. Graph-tool is a highly efficient Python library for network analysis that provides sophisticated algorithms including stochastic block models.

# Import graph-tool
try:
    import graph_tool.all as gt
    GRAPH_TOOL_AVAILABLE = True
    print(f"graph-tool version: {gt.__version__}")
except ImportError:
    GRAPH_TOOL_AVAILABLE = False
    print("graph-tool is not installed.")
    print("To install: conda install -c conda-forge graph-tool")
    print("Or see: https://graph-tool.skewed.de/")
graph-tool version: 2.98 (commit 9c32966d, )
if GRAPH_TOOL_AVAILABLE:
    # Create graph-tool graph
    g = gt.Graph(directed=False)

    # Create vertex property map for node IDs
    v_id = g.new_vertex_property("string")
    v_primary_comm = g.new_vertex_property("int")
    v_is_bridge = g.new_vertex_property("bool")
    v_louvain = g.new_vertex_property("int")

    # Create edge property map for weights
    e_weight = g.new_edge_property("double")

    # Map from node ID to vertex
    node_to_vertex = {}

    # Add vertices
    for _, row in nodes_read.iterrows():
        v = g.add_vertex()
        node_to_vertex[row['id']] = v
        v_id[v] = row['id']
        v_primary_comm[v] = int(row['primary_community'])
        v_is_bridge[v] = bool(row['is_bridge'])
        v_louvain[v] = int(row['louvain_community'])

    # Add edges
    for _, row in edges_read.iterrows():
        e = g.add_edge(node_to_vertex[row['source']],
                       node_to_vertex[row['target']])
        e_weight[e] = row['weight']

    # Set as internal properties
    g.vertex_properties["id"] = v_id
    g.vertex_properties["primary_community"] = v_primary_comm
    g.vertex_properties["is_bridge"] = v_is_bridge
    g.vertex_properties["louvain_community"] = v_louvain
    g.edge_properties["weight"] = e_weight

    print("Graph-tool network created:")
    print(f"  Vertices: {g.num_vertices()}")
    print(f"  Edges: {g.num_edges()}")
else:
    print("Skipping graph-tool conversion (not installed)")
Graph-tool network created:
  Vertices: 1000
  Edges: 242861
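
As an aside, graph-tool can bulk-load the same edge list; a sketch assuming `add_edge_list`'s hashed-insert semantics (with hashed=True the string IDs are mapped to vertices, the returned property map stores the original IDs, and the extra column fills the supplied edge property):

if GRAPH_TOOL_AVAILABLE:
    g2 = gt.Graph(directed=False)
    w2 = g2.new_edge_property("double")
    # Bulk insert: columns are (source, target, weight)
    v_name = g2.add_edge_list(edges_read[['source', 'target', 'weight']].values,
                              hashed=True, eprops=[w2])
    print(f"Bulk-loaded: {g2.num_vertices()} vertices, {g2.num_edges()} edges")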

10. Stochastic Weighted Block Model

The Stochastic Block Model (SBM) is a generative model for networks that assumes nodes belong to groups (blocks), and the probability of an edge between two nodes depends only on their group memberships: P(A_ij = 1) = p_{b_i, b_j}, where b_i denotes the block of node i.

Graph-tool provides a hierarchical nested SBM implementation that:

  • Discovers multi-level community structure automatically
  • Handles edge weights natively
  • Uses Bayesian inference to avoid overfitting
  • Provides statistically principled model selection

The weighted SBM approach follows Peixoto (2018), which extends the nonparametric SBM framework to incorporate edge weights through microcanonical distributions. We use the real-exponential model for positive weights; since the projected weights are actually integer shared-style counts, a discrete model such as "discrete-geometric" would also be a reasonable choice.

Peixoto, T. P. (2018). Nonparametric weighted stochastic block models. Physical Review E, 97(1), 012306. doi:10.1103/PhysRevE.97.012306 | arXiv:1708.01432

if GRAPH_TOOL_AVAILABLE:
    print("Fitting Hierarchical Nested Stochastic Block Model...")
    print("(This may take a few minutes for large networks)\n")

    # Fit weighted hierarchical nested SBM using minimize_nested_blockmodel_dl
    # This finds optimal hierarchical partition by minimizing description length
    # The nested model automatically discovers multiple levels of community structure
    nested_state = gt.minimize_nested_blockmodel_dl(
        g,
        state_args=dict(recs=[e_weight], rec_types=["real-exponential"])
    )

    print(f"Initial description length: {nested_state.entropy():.2f}")

    # Improve the solution with merge-split MCMC sweeps
    # (beta=np.inf makes each sweep greedy: only moves that reduce the
    # description length are accepted)
    print("\nRefining solution with merge-split MCMC...")
    for _ in range(100):
        nested_state.multiflip_mcmc_sweep(niter=10, beta=np.inf)

    print(f"Refined description length: {nested_state.entropy():.2f}")

    # Get the hierarchy of block states
    levels = nested_state.get_levels()
    print(f"\nHierarchical SBM Results:")
    print(f"  Number of hierarchy levels: {len(levels)}")

    print(f"\n  Hierarchy structure:")
    for i, level_state in enumerate(levels):
        level_blocks = level_state.get_blocks()
        level_block_counts = Counter(level_blocks[v] for v in level_state.g.vertices())
        n_level_blocks = len(level_block_counts)
        print(f"    Level {i}: {level_state.g.num_vertices()} nodes → {n_level_blocks} blocks")

    # Get bottom-level (finest) block assignments for nodes
    bottom_state = levels[0]
    blocks = bottom_state.get_blocks()

    # Count blocks at bottom level
    block_counts = Counter(blocks[v] for v in g.vertices())
    n_blocks = len(block_counts)

    print(f"\n  Bottom-level block sizes:")
    for block_id, count in sorted(block_counts.items(), key=lambda x: -x[1])[:10]:
        print(f"    Block {block_id}: {count} nodes")
    if n_blocks > 10:
        print(f"    ... and {n_blocks - 10} more blocks")
else:
    print("Skipping SBM (graph-tool not installed)")
Fitting Hierarchical Nested Stochastic Block Model...
(This may take a few minutes for large networks)

Initial description length: 656073.00

Refining solution with merge-split MCMC...
Refined description length: 653166.22

Hierarchical SBM Results:
  Number of hierarchy levels: 11

  Hierarchy structure:
    Level 0: 1000 nodes → 45 blocks
    Level 1: 1000 nodes → 25 blocks
    Level 2: 49 nodes → 11 blocks
    Level 3: 13 nodes → 5 blocks
    Level 4: 7 nodes → 1 blocks
    Level 5: 6 nodes → 1 blocks
    Level 6: 1 nodes → 1 blocks
    Level 7: 1 nodes → 1 blocks
    Level 8: 1 nodes → 1 blocks
    Level 9: 1 nodes → 1 blocks
    Level 10: 1 nodes → 1 blocks

  Bottom-level block sizes:
    Block 512: 61 nodes
    Block 483: 52 nodes
    Block 896: 47 nodes
    Block 921: 46 nodes
    Block 554: 39 nodes
    Block 160: 37 nodes
    Block 886: 37 nodes
    Block 38: 31 nodes
    Block 205: 28 nodes
    Block 371: 27 nodes
    ... and 35 more blocks
if GRAPH_TOOL_AVAILABLE:
    # Analyze correspondence between SBM blocks and original communities
    print("\nSBM Block Composition (by original community):")

    # Create mapping
    sbm_assignments = {}
    for v in g.vertices():
        node_id = v_id[v]
        block = blocks[v]
        orig_comm = v_primary_comm[v]
        if block not in sbm_assignments:
            sbm_assignments[block] = []
        sbm_assignments[block].append(orig_comm)

    # Print composition
    for block_id in sorted(sbm_assignments.keys()):
        members = sbm_assignments[block_id]
        comm_dist = Counter(members)
        dominant = comm_dist.most_common(1)[0]
        print(f"\n  Block {block_id} ({len(members)} nodes):")
        print(f"    Dominant community: {dominant[0]} ({100*dominant[1]/len(members):.1f}%)")
        print(f"    Distribution: {dict(sorted(comm_dist.items()))}")

SBM Block Composition (by original community):

  Block 9 (26 nodes):
    Dominant community: 2 (53.8%)
    Distribution: {2: 14, 3: 3, 4: 8, 5: 1}

  Block 11 (12 nodes):
    Dominant community: 5 (33.3%)
    Distribution: {1: 1, 3: 3, 4: 4, 5: 4}

  Block 22 (12 nodes):
    Dominant community: 5 (50.0%)
    Distribution: {2: 5, 3: 1, 5: 6}

  Block 38 (31 nodes):
    Dominant community: 4 (74.2%)
    Distribution: {2: 4, 3: 4, 4: 23}

  Block 83 (22 nodes):
    Dominant community: 2 (77.3%)
    Distribution: {2: 17, 3: 5}

  Block 122 (18 nodes):
    Dominant community: 1 (94.4%)
    Distribution: {1: 17, 5: 1}

  Block 149 (16 nodes):
    Dominant community: 2 (37.5%)
    Distribution: {1: 2, 2: 6, 3: 3, 4: 5}

  Block 160 (37 nodes):
    Dominant community: 2 (48.6%)
    Distribution: {1: 7, 2: 18, 3: 2, 4: 10}

  Block 188 (16 nodes):
    Dominant community: 5 (43.8%)
    Distribution: {1: 3, 3: 6, 5: 7}

  Block 205 (28 nodes):
    Dominant community: 5 (100.0%)
    Distribution: {5: 28}

  Block 242 (19 nodes):
    Dominant community: 3 (78.9%)
    Distribution: {1: 1, 2: 3, 3: 15}

  Block 255 (26 nodes):
    Dominant community: 3 (92.3%)
    Distribution: {2: 2, 3: 24}

  Block 267 (16 nodes):
    Dominant community: 2 (43.8%)
    Distribution: {1: 5, 2: 7, 5: 4}

  Block 290 (14 nodes):
    Dominant community: 2 (50.0%)
    Distribution: {2: 7, 4: 6, 5: 1}

  Block 324 (16 nodes):
    Dominant community: 4 (87.5%)
    Distribution: {3: 1, 4: 14, 5: 1}

  Block 362 (12 nodes):
    Dominant community: 2 (91.7%)
    Distribution: {2: 11, 3: 1}

  Block 371 (27 nodes):
    Dominant community: 2 (55.6%)
    Distribution: {1: 8, 2: 15, 3: 1, 4: 1, 5: 2}

  Block 438 (16 nodes):
    Dominant community: 3 (93.8%)
    Distribution: {2: 1, 3: 15}

  Block 447 (15 nodes):
    Dominant community: 3 (93.3%)
    Distribution: {3: 14, 5: 1}

  Block 474 (21 nodes):
    Dominant community: 2 (95.2%)
    Distribution: {2: 20, 5: 1}

  Block 483 (52 nodes):
    Dominant community: 4 (98.1%)
    Distribution: {3: 1, 4: 51}

  Block 493 (25 nodes):
    Dominant community: 5 (64.0%)
    Distribution: {1: 9, 5: 16}

  Block 512 (61 nodes):
    Dominant community: 3 (95.1%)
    Distribution: {2: 2, 3: 58, 5: 1}

  Block 517 (20 nodes):
    Dominant community: 3 (50.0%)
    Distribution: {2: 5, 3: 10, 5: 5}

  Block 525 (14 nodes):
    Dominant community: 3 (64.3%)
    Distribution: {2: 4, 3: 9, 5: 1}

  Block 535 (14 nodes):
    Dominant community: 5 (42.9%)
    Distribution: {1: 3, 4: 5, 5: 6}

  Block 540 (19 nodes):
    Dominant community: 2 (100.0%)
    Distribution: {2: 19}

  Block 542 (3 nodes):
    Dominant community: 4 (100.0%)
    Distribution: {4: 3}

  Block 554 (39 nodes):
    Dominant community: 2 (100.0%)
    Distribution: {2: 39}

  Block 565 (23 nodes):
    Dominant community: 3 (56.5%)
    Distribution: {3: 13, 4: 1, 5: 9}

  Block 567 (22 nodes):
    Dominant community: 4 (50.0%)
    Distribution: {1: 4, 2: 5, 4: 11, 5: 2}

  Block 610 (11 nodes):
    Dominant community: 1 (100.0%)
    Distribution: {1: 11}

  Block 624 (23 nodes):
    Dominant community: 3 (69.6%)
    Distribution: {1: 6, 2: 1, 3: 16}

  Block 649 (25 nodes):
    Dominant community: 3 (36.0%)
    Distribution: {1: 9, 2: 4, 3: 9, 5: 3}

  Block 665 (10 nodes):
    Dominant community: 2 (100.0%)
    Distribution: {2: 10}

  Block 673 (11 nodes):
    Dominant community: 1 (81.8%)
    Distribution: {1: 9, 2: 2}

  Block 855 (10 nodes):
    Dominant community: 4 (80.0%)
    Distribution: {2: 2, 4: 8}

  Block 859 (10 nodes):
    Dominant community: 1 (100.0%)
    Distribution: {1: 10}

  Block 886 (37 nodes):
    Dominant community: 2 (100.0%)
    Distribution: {2: 37}

  Block 896 (47 nodes):
    Dominant community: 1 (95.7%)
    Distribution: {1: 45, 5: 2}

  Block 921 (46 nodes):
    Dominant community: 4 (56.5%)
    Distribution: {2: 19, 3: 1, 4: 26}

  Block 942 (10 nodes):
    Dominant community: 2 (70.0%)
    Distribution: {2: 7, 4: 1, 5: 2}

  Block 950 (23 nodes):
    Dominant community: 2 (100.0%)
    Distribution: {2: 23}

  Block 965 (22 nodes):
    Dominant community: 4 (45.5%)
    Distribution: {1: 5, 3: 5, 4: 10, 5: 2}

  Block 995 (23 nodes):
    Dominant community: 2 (95.7%)
    Distribution: {2: 22, 4: 1}
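
As with Louvain, block-to-community agreement can be quantified; a sketch (assumes scikit-learn) comparing the bottom-level blocks to the planted labels:

if GRAPH_TOOL_AVAILABLE:
    from sklearn.metrics import adjusted_rand_score

    # Compare bottom-level SBM blocks against the planted communities
    true_labels = [int(v_primary_comm[v]) for v in g.vertices()]
    block_labels = [int(blocks[v]) for v in g.vertices()]
    print(f"ARI (SBM blocks vs planted): "
          f"{adjusted_rand_score(true_labels, block_labels):.3f}")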

11. Visualizing the Hierarchical Block Model

The nested SBM can be visualized in multiple ways:

  1. Hierarchical tree: shows the nested community structure
  2. Network with block colors: traditional network layout colored by blocks
  3. Block matrix: edge density between blocks

if GRAPH_TOOL_AVAILABLE:
    import matplotlib.cm
    print("Drawing hierarchical block model structure...")

    # Draw the nested block model hierarchy with edge weights
    # Edge color and width are mapped to weight using log scale
    nested_state.draw(
        edge_color=gt.prop_to_size(e_weight, power=1, log=True),
        ecmap=(matplotlib.cm.inferno, 0.6),
        eorder=e_weight,
        edge_pen_width=gt.prop_to_size(e_weight, 1, 4, power=1, log=True),
        edge_gradient=[],
        vertex_size=3,
        output_size=(1200, 1000),
        output="../../../data/beerAdvocate/sbm_hierarchy.png"
    )
    print("Saved hierarchical visualization to: sbm_hierarchy.png")

    # Display inline; display() is required because the expression sits
    # inside an if-block and is not auto-rendered by the notebook
    from IPython.display import Image, display
    display(Image(filename='../../../data/beerAdvocate/sbm_hierarchy.png'))
else:
    print("Skipping SBM visualization (graph-tool not installed)")
Drawing hierarchical block model structure...
Saved hierarchical visualization to: sbm_hierarchy.png
if GRAPH_TOOL_AVAILABLE:
    # Visualize the block model structure at bottom level
    fig, ax = plt.subplots(figsize=(10, 8))

    # Get unique block IDs that actually exist
    unique_blocks = sorted(block_counts.keys())
    block_to_idx = {b: i for i, b in enumerate(unique_blocks)}
    n_blocks_viz = len(unique_blocks)

    # Create block matrix (edge counts between blocks)
    block_matrix = np.zeros((n_blocks_viz, n_blocks_viz))

    for e in g.edges():
        b1, b2 = blocks[e.source()], blocks[e.target()]
        if b1 in block_to_idx and b2 in block_to_idx:
            i1, i2 = block_to_idx[b1], block_to_idx[b2]
            block_matrix[i1, i2] += e_weight[e]
            if i1 != i2:
                block_matrix[i2, i1] += e_weight[e]

    # Normalize by block-size products: average weight per node pair
    # (approximate density; the diagonal slightly underestimates, since
    # within-block pairs number n*(n-1)/2 rather than n^2)
    block_sizes = np.array([block_counts[b] for b in unique_blocks])
    block_probs = block_matrix / np.outer(block_sizes, block_sizes)

    # Plot
    im = ax.imshow(block_probs, cmap='YlOrRd')
    ax.set_xlabel('Block', fontsize=12)
    ax.set_ylabel('Block', fontsize=12)
    ax.set_title('Hierarchical SBM: Edge Density Between Bottom-Level Blocks', fontsize=14)

    # Only show tick labels if not too many blocks
    if n_blocks_viz <= 20:
        ax.set_xticks(range(n_blocks_viz))
        ax.set_yticks(range(n_blocks_viz))
        ax.set_xticklabels(unique_blocks)
        ax.set_yticklabels(unique_blocks)

    plt.colorbar(im, ax=ax, label='Edge Density')
    plt.tight_layout()
    plt.savefig('../../../data/beerAdvocate/block_matrix.png', dpi=150)
    plt.show()
    print("Saved: block_matrix.png")
else:
    print("Skipping block matrix visualization (graph-tool not installed)")

Block model edge probability matrix
Saved: block_matrix.png
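
graph-tool also exposes this summary directly; a brief sketch (`get_matrix()` on the bottom-level state returns the block adjacency as a sparse matrix of edge counts):

if GRAPH_TOOL_AVAILABLE:
    # Block-level edge-count matrix straight from the fitted state
    M = bottom_state.get_matrix()
    print(f"Block matrix from graph-tool: shape {M.shape}, {M.nnz} nonzero entries")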
if GRAPH_TOOL_AVAILABLE:
    print("\n" + "=" * 70)
    print("COMMUNITY DETECTION COMPARISON")
    print("=" * 70)

    # Calculate SBM modularity for comparison using bottom-level blocks.
    # Group node IDs by block in one pass (block IDs need not be contiguous)
    from collections import defaultdict
    block_members = defaultdict(set)
    for v in g.vertices():
        block_members[blocks[v]].add(v_id[v])
    sbm_communities = list(block_members.values())

    # Verify we have a valid partition before calculating modularity
    all_sbm_nodes = set().union(*sbm_communities) if sbm_communities else set()
    if len(all_sbm_nodes) == G_users.number_of_nodes():
        sbm_modularity = nx.community.modularity(G_users, sbm_communities)
    else:
        print(f"  Warning: SBM partition covers {len(all_sbm_nodes)} of {G_users.number_of_nodes()} nodes")
        sbm_modularity = float('nan')

    print(f"\n{'Algorithm':<30} {'Communities':>12} {'Modularity':>12}")
    print("-" * 55)
    print(f"{'Louvain':<30} {len(louvain_communities):>12} {louvain_modularity:>12.4f}")
    print(f"{'Label Propagation':<30} {len(label_prop_communities):>12} {label_prop_modularity:>12.4f}")
    print(f"{'Greedy Modularity':<30} {len(greedy_communities):>12} {greedy_modularity:>12.4f}")
    sbm_mod_str = f"{sbm_modularity:>12.4f}" if not np.isnan(sbm_modularity) else "         N/A"
    print(f"{'Nested SBM (bottom level)':<30} {n_blocks:>12} {sbm_mod_str}")
    print("-" * 55)

    print(f"\nTrue communities (from data): 5")
    print(f"\nThe hierarchical nested SBM provides:")
    print(f"  - {len(levels)} levels of hierarchy")
    print(f"  - Bayesian model selection (no need to specify # of communities)")
    print(f"  - Description length: {nested_state.entropy():.2f} (lower = better fit)")
    print(f"  - Multi-scale community structure discovery")
    print("=" * 70)

======================================================================
COMMUNITY DETECTION COMPARISON
======================================================================

Algorithm                       Communities   Modularity
-------------------------------------------------------
Louvain                                   5       0.2742
Label Propagation                         1       0.0000
Greedy Modularity                         4       0.1387
Nested SBM (bottom level)                45       0.0436
-------------------------------------------------------

True communities (from data): 5

The hierarchical nested SBM provides:
  - 11 levels of hierarchy
  - Bayesian model selection (no need to specify # of communities)
  - Description length: 653166.22 (lower = better fit)
  - Multi-scale community structure discovery
======================================================================

Conclusion

This analysis demonstrates the workflow for:

  1. Loading and constructing a two-mode (bipartite) network from forum participation data
  2. Projecting the network onto users to reveal attention-based similarity
  3. Analyzing network properties including small-world characteristics
  4. Detecting communities using multiple algorithms (Louvain, Label Propagation, Greedy Modularity)
  5. Applying hierarchical nested stochastic block models for multi-scale probabilistic community detection
  6. Visualizing the network structure, community assignments, and hierarchical block model

The hierarchical nested SBM offers several advantages over flat community detection:

  • Multi-scale discovery: Reveals community structure at multiple resolutions simultaneously
  • Bayesian model selection: Automatically determines the optimal number of communities without user specification
  • Edge weight support: Natively incorporates connection strengths into the inference
  • Hierarchical interpretation: Shows how fine-grained communities nest within larger groupings

The results show that beer enthusiasts form distinct communities based on their attention patterns—engaging primarily with styles from their preferred family (British, American Craft, Belgian, Dark, or Lager) while some “bridge” users connect across communities. The hierarchical structure may reveal sub-communities within these major groupings.

For Fuller’s market research, these attention-based segments offer insights that traditional preference surveys cannot capture: they reveal not what consumers claim to like, but where they actually direct their engagement and curiosity.