Model Estimation
REM.jl estimates relational event models using case-control sampling with stratified conditional logistic regression. This approach is equivalent to Cox proportional hazards for survival data and enables efficient estimation even for large networks.
Overview
The estimation process follows three steps:
- Case-Control Sampling: For each observed event, sample non-events from the risk set
- Statistic Computation: Calculate statistics for cases and controls
- Maximum Likelihood Estimation: Fit stratified conditional logistic regression
Why Case-Control Sampling?
For a network with $n$ actors, there are $n(n-1)$ possible directed dyads at each time point. Computing statistics for all dyads at all time points is often computationally infeasible.
Case-control sampling solves this by:
- Treating observed events as "cases"
- Sampling a subset of non-events as "controls"
- Using stratified estimation to obtain consistent parameter estimates
This approach is statistically valid and dramatically reduces computation time.
Configuring the Sampler
sampler = CaseControlSampler(
n_controls = 100, # Controls per case
exclude_self_loops = true, # Exclude s→s from risk set
seed = 42 # Random seed for reproducibility
)Parameters
| Parameter | Description | Default |
|---|---|---|
n_controls | Number of controls sampled per case | Required |
exclude_self_loops | Whether to exclude self-events | true |
seed | Random seed (nothing = random each time) | nothing |
Choosing Number of Controls
The number of controls affects estimation accuracy and computation time:
| n_controls | Use Case |
|---|---|
| 20-50 | Quick exploratory analysis |
| 50-100 | Standard analysis |
| 100-200 | Final results, publication |
| 200+ | Very precise estimates needed |
More controls = more accurate standard errors, but diminishing returns beyond ~100-200.
Generating Observations
# Create observations DataFrame
obs = generate_observations(seq, stats, sampler)The resulting DataFrame contains:
| Column | Description |
|---|---|
event_index | Index of the focal event in the sequence |
sender | Sender ID |
receiver | Receiver ID |
is_event | true for cases, false for controls |
stratum | Stratum ID (groups each case with its controls) |
<stat_name> | One column per statistic |
Options
obs = generate_observations(seq, stats, sampler;
start_index = 1, # First event to include
end_index = length(seq), # Last event to include
decay = 0.0, # Exponential decay rate
at_risk = nothing # Custom set of actors at risk
)Excluding Early Events
The first few events may have unreliable statistics (no history):
# Skip first 5 events
obs = generate_observations(seq, stats, sampler; start_index=6)Custom Risk Sets
Specify a custom set of actors at risk:
# Only actors 1-100 can send/receive
at_risk = collect(1:100)
obs = generate_observations(seq, stats, sampler; at_risk=at_risk)Fitting Models
Direct Fitting (Recommended)
The simplest approach combines sampling and fitting:
result = fit_rem(seq, stats;
n_controls = 100,
seed = 42
)Two-Stage Fitting
For more control over the process:
# Stage 1: Generate observations
sampler = CaseControlSampler(n_controls=100, seed=42)
obs = generate_observations(seq, stats, sampler)
# Inspect observations if needed
println("Observations: ", nrow(obs))
println("Cases: ", sum(obs.is_event))
println("Controls: ", sum(.!obs.is_event))
# Stage 2: Fit model
stat_names = [name(s) for s in stats]
result = fit_rem(obs, stat_names)Fit Options
result = fit_rem(obs, stat_names;
maxiter = 100, # Maximum Newton-Raphson iterations
tol = 1e-8 # Convergence tolerance for log-likelihood
)Understanding Results
The REMResult object contains:
| Field | Type | Description |
|---|---|---|
coefficients | Vector{Float64} | Estimated coefficients |
std_errors | Vector{Float64} | Standard errors |
z_values | Vector{Float64} | Z-statistics (coef/se) |
p_values | Vector{Float64} | Two-sided p-values |
stat_names | Vector{String} | Names of statistics |
n_events | Int | Number of events (cases) |
n_observations | Int | Total observations |
log_likelihood | Float64 | Log-likelihood at convergence |
converged | Bool | Whether optimization converged |
Accessor Functions
coef(result) # Coefficient vector
stderror(result) # Standard errors vector
coeftable(result) # Full results as DataFrameDisplaying Results
println(result)Output:
Relational Event Model Results
==============================
Events: 100, Observations: 10100
Log-likelihood: -234.5678
Converged: true
Coefficients:
------------------------------------------------------------
Statistic Coef Std.Err z P>|z|
------------------------------------------------------------
repetition 0.4523 0.0812 5.5700 0.0000 ***
reciprocity 0.3156 0.0923 3.4200 0.0006 ***
sender_activity 0.0234 0.0156 1.5000 0.1336
receiver_popularity 0.0567 0.0189 3.0000 0.0027 **
------------------------------------------------------------
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Interpreting Coefficients
Log-Hazard Ratios
Coefficients are log-hazard ratios. A coefficient β means:
- exp(β) is the multiplicative effect on the event rate
- β > 0 increases the rate
- β < 0 decreases the rate
Example Interpretations
| Statistic | Coefficient | exp(β) | Interpretation |
|---|---|---|---|
| Repetition | 0.5 | 1.65 | Each past s→r event increases rate by 65% |
| Reciprocity | 0.8 | 2.23 | Past r→s events double the rate |
| SenderActivity | -0.1 | 0.90 | High-activity senders have 10% lower per-dyad rates |
| TransitiveClosure | 0.3 | 1.35 | Each shared partner increases rate by 35% |
| NodeMatch | 0.4 | 1.49 | Same-attribute dyads are 49% more likely |
Confidence Intervals
Approximate 95% confidence intervals:
using Distributions
alpha = 0.05
z = quantile(Normal(), 1 - alpha/2)
lower = coef(result) .- z .* stderror(result)
upper = coef(result) .+ z .* stderror(result)
# Hazard ratio confidence intervals
hr_lower = exp.(lower)
hr_upper = exp.(upper)Computing Statistics Without Sampling
To compute statistics for all events (without controls):
stats_df = compute_statistics(seq, stats; decay=0.0)Returns a DataFrame with one row per event:
| Column | Description |
|---|---|
sender | Sender ID |
receiver | Receiver ID |
time | Event time |
<stat_name> | Statistic values |
Use case: Exploratory analysis, visualization, or exporting statistics.
Model Comparison
Comparing Log-Likelihoods
# Nested models
stats1 = [Repetition(), Reciprocity()]
stats2 = [Repetition(), Reciprocity(), TransitiveClosure()]
result1 = fit_rem(seq, stats1; n_controls=100, seed=42)
result2 = fit_rem(seq, stats2; n_controls=100, seed=42)
println("Model 1 LL: ", result1.log_likelihood)
println("Model 2 LL: ", result2.log_likelihood)
# Likelihood ratio test (approximate)
using Distributions
LR = 2 * (result2.log_likelihood - result1.log_likelihood)
df = length(stats2) - length(stats1)
p_value = 1 - cdf(Chisq(df), LR)
println("LR test p-value: ", p_value)Multiple Seeds
For robustness, compare results across different random seeds:
results = [fit_rem(seq, stats; n_controls=100, seed=s) for s in 1:5]
# Check coefficient stability
for (i, r) in enumerate(results)
println("Seed $i: ", round.(coef(r), digits=3))
endConvergence Issues
Checking Convergence
if !result.converged
@warn "Model did not converge - results may be unreliable"
endCommon Causes and Solutions
| Issue | Symptom | Solution |
|---|---|---|
| Perfect separation | Very large coefficients | Remove or transform problematic statistic |
| Multicollinearity | Large standard errors | Remove correlated statistics |
| Sparse data | Non-convergence | Increase n_controls, simplify model |
| Too many parameters | Slow convergence | Reduce model complexity |
Handling Non-Convergence
# Increase iterations
result = fit_rem(seq, stats; n_controls=100, maxiter=500)
# Check for separation
for (i, name) in enumerate(result.stat_names)
if abs(result.coefficients[i]) > 10
@warn "Possible separation for $name"
end
endAdvanced Topics
Bootstrapping Standard Errors
For more robust standard errors:
function bootstrap_rem(seq, stats; n_bootstrap=100, n_controls=100)
n = length(seq)
boot_coefs = zeros(n_bootstrap, length(stats))
for b in 1:n_bootstrap
# Resample events with replacement
indices = rand(1:n, n)
boot_events = [seq[i] for i in sort(indices)]
boot_seq = EventSequence(boot_events)
result = fit_rem(boot_seq, stats; n_controls=n_controls, seed=b)
boot_coefs[b, :] = coef(result)
end
# Bootstrap standard errors
boot_se = vec(std(boot_coefs, dims=1))
return boot_se
endTime-Varying Effects
Model effects that change over time using event windows:
# Split sequence into periods
mid_point = length(seq) ÷ 2
# Fit separate models
result_early = fit_rem(seq, stats;
n_controls=100, start_index=1, end_index=mid_point)
result_late = fit_rem(seq, stats;
n_controls=100, start_index=mid_point+1)
# Compare coefficients
println("Early period: ", coef(result_early))
println("Late period: ", coef(result_late))Best Practices
- Set random seed: Always use a seed for reproducibility
- Check convergence: Verify
result.converged == true - Adequate controls: Use at least 50-100 controls
- Skip early events: Consider excluding first few events with
start_index - Avoid multicollinearity: Don't include highly correlated statistics
- Scale large counts: Use
LogDegreefor networks with high-degree hubs - Sufficient events: Aim for at least 10 events per parameter
- Compare seeds: Verify stability across different random seeds