One of the most powerful features of dplyr is the ability to perform operations on groups of data. The combination of group_by() and summarise() allows you to calculate statistics for each group in your data, implementing the split-apply-combine strategy that’s fundamental to data analysis.
The group_by() function doesn’t change the data itself - it adds metadata that tells dplyr to perform subsequent operations by group.
# Group by a single variablesales_by_region <- sales_raw %>%group_by(region)# The data looks the same, but it's now groupedcat("Grouped data has", n_groups(sales_by_region), "groups\n")
Grouped data has 4 groups
cat("Group sizes:\n")
Group sizes:
count(sales_by_region, region)
# A tibble: 4 × 2
# Groups: region [4]
region n
<chr> <int>
1 East 251
2 North 246
3 South 234
4 West 269
# Group by multiple variablessales_by_region_type <- sales_raw %>%group_by(region, customer_type)cat("\nGrouping by region and customer type creates", n_groups(sales_by_region_type), "groups\n")
Grouping by region and customer type creates 12 groups
Basic Summarization
summarise() (or summarize()) reduces grouped data to summary statistics:
# Calculate summary statistics by regionregional_summary <- sales_raw %>%group_by(region) %>%summarise(total_sales =sum(net_amount),avg_sale =mean(net_amount),median_sale =median(net_amount),total_orders =n(),unique_customers =n_distinct(customer_id),.groups ="drop"# Good practice to specify group handling )print(regional_summary)
# A tibble: 4 × 6
region total_sales avg_sale median_sale total_orders unique_customers
<chr> <dbl> <dbl> <dbl> <int> <int>
1 East 134961. 538. 402. 251 140
2 North 121848. 495. 354. 246 138
3 South 119161. 509. 387. 234 133
4 West 142025. 528. 374. 269 152
Proper handling of missing values is crucial in grouped operations:
# Compare different approaches to missing valuesmissing_comparison <- sales_raw %>%group_by(region) %>%summarise(# Different ways to handle NAstotal_orders =n(),orders_with_rep =sum(!is.na(sales_rep)),orders_missing_rep =sum(is.na(sales_rep)),# Average discount - different approachesavg_discount_na_rm =mean(discount_pct, na.rm =TRUE) *100,avg_discount_zeros =mean(coalesce(discount_pct, 0)) *100,# Count of discounted ordersdiscounted_orders =sum(discount_pct >0, na.rm =TRUE),.groups ="drop" )print(missing_comparison)
# A tibble: 4 × 7
region total_orders orders_with_rep orders_missing_rep avg_discount_na_rm
<chr> <int> <int> <int> <dbl>
1 East 251 237 14 3.08
2 North 246 236 10 3.45
3 South 234 221 13 4.13
4 West 269 256 13 4.10
# ℹ 2 more variables: avg_discount_zeros <dbl>, discounted_orders <int>
Advanced Grouping Techniques
Dynamic Grouping
# Function to analyze sales by any grouping variableanalyze_by_group <-function(data, group_var) { data %>%group_by({{ group_var }}) %>%summarise(revenue =sum(net_amount),orders =n(),avg_order =mean(net_amount),customers =n_distinct(customer_id),.groups ="drop" ) %>%mutate(revenue_pct = revenue /sum(revenue) *100 ) %>%arrange(desc(revenue))}# Apply to different grouping variablescat("Analysis by sales rep:\n")
Most efficient to summarize first, then calculate derived metrics
print(head(efficient_summary, 10))
# A tibble: 10 × 7
region customer_type n revenue avg_order revenue_share cumulative_share
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 West Individual 133 78443. 590. 15.2 92.8
2 East Individual 130 72121. 555. 14.0 18.8
3 South Individual 121 60550. 500. 11.7 64.8
4 North Individual 112 52854. 472. 10.2 41.0
5 North Small Business 81 44267. 547. 8.57 49.6
6 South Small Business 74 40350. 545. 7.81 72.6
7 East Small Business 73 37881. 519. 7.33 26.1
8 West Small Business 79 36981. 468. 7.16 100
9 West Enterprise 56 26349. 471. 5.10 77.7
10 East Enterprise 48 24959. 520. 4.83 4.83
Common Patterns and Use Cases
Top N per Group
# Find top 3 customers in each regiontop_customers <- sales_raw %>%filter(quantity >0) %>%group_by(region, customer_id) %>%summarise(total_spent =sum(net_amount),order_count =n(),.groups ="drop" ) %>%group_by(region) %>%slice_max(total_spent, n =3) %>%ungroup()cat("Top 3 customers by region:\n")
Top 3 customers by region:
print(top_customers)
# A tibble: 12 × 4
region customer_id total_spent order_count
<chr> <chr> <dbl> <int>
1 East CUST-135 3715. 4
2 East CUST-128 3493. 3
3 East CUST-100 2769. 3
4 North CUST-81 2952. 2
5 North CUST-87 2834. 4
6 North CUST-83 2754. 3
7 South CUST-161 3119. 2
8 South CUST-75 2704. 3
9 South CUST-99 2586. 4
10 West CUST-108 3905. 6
11 West CUST-94 3353. 5
12 West CUST-76 3006. 5