Module 5: Data Wrangling with dplyr

Author

IND215

Published

September 22, 2025

Welcome to Advanced Data Wrangling! 🔧📊

Now that you understand tidyverse fundamentals and can work with different data types, it’s time to master the art of data wrangling. The dplyr package is the Swiss Army knife of data manipulation, providing intuitive functions that correspond to the most common data analysis tasks.

What You’ll Learn

In this module, we’ll dive deep into the powerful world of dplyr, covering:

1. Filtering and Selecting

  • Advanced row filtering with complex conditions
  • Column selection with helper functions
  • Combining and chaining operations
  • Working with missing data in filters

2. Mutating and Arranging

  • Creating and modifying variables
  • Window functions and ranking
  • Conditional mutations with case_when()
  • Sorting data with arrange()

3. Grouping and Summarizing

  • Group-wise operations and split-apply-combine
  • Multiple summary statistics
  • Grouped mutations and filtering
  • Working with grouped data frames

4. Joining Data

  • Inner, left, right, and full joins
  • Anti-joins and semi-joins
  • Handling key conflicts and duplicates
  • Joining multiple tables

5. Advanced Techniques

  • across() for operating on multiple columns (previewed in the sketch after this outline)
  • Programming with dplyr using {{ }} and !!
  • Custom functions and non-standard evaluation
  • Performance optimization and best practices
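
To make this concrete, here is a minimal preview sketch that ties several of these verbs together. The toy tibble and its columns (group, value) are invented for illustration; every function used here is covered in detail later in the module:

library(tidyverse)

# Hypothetical toy data for the preview
scores <- tibble(
  group = c("A", "A", "B", "B", "C"),
  value = c(10, 25, 5, 40, 60)
)

scores %>%
  mutate(size = case_when(
    value < 20 ~ "small",
    value < 50 ~ "medium",
    TRUE ~ "large"               # fallback for everything else
  )) %>%
  group_by(group) %>%
  summarise(across(value, list(total = sum, avg = mean)), .groups = "drop")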

Learning Objectives

By the end of this module, you will be able to:

  • ✅ Filter and select data using complex conditions and patterns
  • ✅ Create new variables with sophisticated transformations
  • ✅ Perform group-wise operations and aggregations
  • ✅ Join multiple data sources effectively
  • ✅ Use advanced dplyr techniques for complex data manipulation
  • ✅ Write reusable data wrangling functions
  • ✅ Apply best practices for readable and efficient data pipelines

Why Master dplyr?

dplyr is essential for data analysis because:

  1. Intuitive verbs: Functions named like actions (filter, select, mutate)
  2. Consistent syntax: Every verb takes a data frame as its first argument and returns a new data frame (demonstrated in the sketch below)
  3. Pipe-friendly: Designed for chaining operations together
  4. Performance: Optimized for speed with large datasets
  5. Versatility: Covers the vast majority of everyday data manipulation tasks
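
The consistency point is easy to see in code. In this minimal sketch (toy data, invented for illustration), every verb takes the data as its first argument, which is exactly what makes piping possible:

library(tidyverse)

df <- tibble(x = c(3, 1, 2))

# These two calls are equivalent: %>% passes df as the first argument
filter(df, x > 1)
df %>% filter(x > 1)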

The dplyr Philosophy: Grammar of Data Manipulation

library(tidyverse)

# dplyr provides a consistent set of verbs for data manipulation
cat("The main dplyr verbs:\n")
The main dplyr verbs:
cat("- filter(): choose rows based on conditions\n")
- filter(): choose rows based on conditions
cat("- select(): choose columns by name or pattern\n")
- select(): choose columns by name or pattern
cat("- mutate(): create or modify columns\n")
- mutate(): create or modify columns
cat("- arrange(): reorder rows\n")
- arrange(): reorder rows
cat("- summarise(): reduce to summary statistics\n")
- summarise(): reduce to summary statistics
cat("- group_by(): perform operations by groups\n")
- group_by(): perform operations by groups
cat("\nJoining verbs:\n")

Joining verbs:
cat("- left_join(), inner_join(), full_join(): combine tables\n")
- left_join(), inner_join(), full_join(): combine tables
cat("- anti_join(), semi_join(): filtering joins\n")
- anti_join(), semi_join(): filtering joins
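
The filtering joins are the least familiar verbs on that list, so here is a minimal sketch with two invented toy tables: semi_join() keeps rows of the left table that have a match in the right table, while anti_join() keeps the rows that do not.

# Hypothetical toy tables
orders <- tibble(order_id = 1:4, cust = c("a", "a", "b", "d"))
vips   <- tibble(cust = c("a", "b"))

semi_join(orders, vips, by = "cust")   # 3 rows: orders from matched customers
anti_join(orders, vips, by = "cust")   # 1 row: the order with no match ("d")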

Real-World Data Challenges

Let’s start with a realistic, messy dataset that demonstrates why we need powerful data wrangling tools:

# Create a realistic sales dataset with common data problems
set.seed(123)
sales_raw <- tibble(
  order_id = paste0("ORD-", str_pad(1:1000, 4, pad = "0")),
  customer_id = sample(paste0("CUST-", 1:200), 1000, replace = TRUE),
  product_id = sample(paste0("PROD-", 1:50), 1000, replace = TRUE),
  order_date = sample(seq(ymd("2023-01-01"), ymd("2024-12-31"), by = "day"), 1000, replace = TRUE),
  quantity = sample(1:10, 1000, replace = TRUE),
  unit_price = round(runif(1000, 5, 200), 2),
  discount_pct = sample(c(0, 0.05, 0.10, 0.15, 0.20), 1000, replace = TRUE, prob = c(0.6, 0.2, 0.1, 0.05, 0.05)),
  sales_rep = sample(c("Alice Johnson", "Bob Smith", "Charlie Brown", "Diana Prince", "Eve Wilson"), 1000, replace = TRUE),
  region = sample(c("North", "South", "East", "West"), 1000, replace = TRUE),
  customer_type = sample(c("Individual", "Small Business", "Enterprise"), 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2))
)

# Add some realistic data quality issues
sales_raw <- sales_raw %>%
  mutate(
    # Some missing values
    sales_rep = ifelse(row_number() %in% sample(1:1000, 50), NA, sales_rep),
    discount_pct = ifelse(row_number() %in% sample(1:1000, 30), NA, discount_pct),

    # Calculate derived fields
    gross_amount = quantity * unit_price,
    discount_amount = gross_amount * coalesce(discount_pct, 0),
    net_amount = gross_amount - discount_amount,

    # Inject data entry errors last: mutate() evaluates its arguments in order,
    # so the derived amounts above were computed from the original, pre-error values
    unit_price = ifelse(row_number() %in% sample(1:1000, 5), unit_price * 100, unit_price),  # Pricing errors
    quantity = ifelse(row_number() %in% sample(1:1000, 3), 0, quantity)  # Zero quantities
  )

# Add customer information
customers <- tibble(
  customer_id = paste0("CUST-", 1:200),
  customer_name = paste("Customer", 1:200),
  industry = sample(c("Technology", "Healthcare", "Finance", "Retail", "Manufacturing"), 200, replace = TRUE),
  signup_date = sample(seq(ymd("2020-01-01"), ymd("2023-12-31"), by = "day"), 200),
  credit_limit = sample(c(1000, 5000, 10000, 25000, 50000), 200, replace = TRUE)
)

# Add product information
products <- tibble(
  product_id = paste0("PROD-", 1:50),
  product_name = paste("Product", LETTERS[1:50]),
  category = sample(c("Electronics", "Software", "Services", "Hardware"), 50, replace = TRUE),
  cost = round(runif(50, 2, 100), 2),
  launch_date = sample(seq(ymd("2020-01-01"), ymd("2024-01-01"), by = "day"), 50)
)

cat("Created realistic business dataset:\n")
Created realistic business dataset:
cat("Sales transactions:", nrow(sales_raw), "rows\n")
Sales transactions: 1000 rows
cat("Unique customers:", n_distinct(sales_raw$customer_id), "\n")
Unique customers: 199 
cat("Date range:", as.character(min(sales_raw$order_date)), "to", as.character(max(sales_raw$order_date)), "\n")
Date range: 2023-01-01 to 2024-12-31 
cat("Total revenue: $", format(sum(sales_raw$net_amount), big.mark = ","), "\n")
Total revenue: $ 517,995.5 
# Show data quality issues
cat("\nData quality issues to address:\n")

Data quality issues to address:
cat("Missing sales reps:", sum(is.na(sales_raw$sales_rep)), "\n")
Missing sales reps: 50 
cat("Missing discounts:", sum(is.na(sales_raw$discount_pct)), "\n")
Missing discounts: 30 
cat("Suspicious high prices (>$500):", sum(sales_raw$unit_price > 500), "\n")
Suspicious high prices (>$500): 5 
cat("Zero quantities:", sum(sales_raw$quantity == 0), "\n")
Zero quantities: 3 

This dataset represents the kind of messy, real-world data you’ll encounter in business. Throughout this module, we’ll learn how to clean, transform, and analyze it using dplyr’s powerful tools.
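
Before cleaning anything, it pays to look the problem rows in the eye. A quick sanity check, reusing the thresholds reported above:

# Pull out the suspect rows before deciding how to handle them
sales_raw %>%
  filter(unit_price > 500 | quantity == 0 | is.na(sales_rep) | is.na(discount_pct)) %>%
  select(order_id, quantity, unit_price, discount_pct, sales_rep)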

Module Structure

This module is organized into five comprehensive sections:

  1. Filtering and Selecting: Master row and column selection
  2. Mutating and Arranging: Create variables and sort data
  3. Grouping and Summarizing: Aggregate data by groups
  4. Joining Data: Combine multiple tables
  5. Advanced Techniques: Programming and optimization

Each section includes:

  • Core concepts with clear explanations
  • Practical examples using real business scenarios
  • Common patterns and best practices
  • Performance tips and troubleshooting
  • Hands-on exercises

The Power of Pipes

Before diving in, let’s appreciate the elegance of dplyr’s pipe-friendly design:

# Without pipes (nested and hard to read)
result_nested <- arrange(
  summarise(
    group_by(
      filter(sales_raw, !is.na(sales_rep)),
      region
    ),
    total_sales = sum(net_amount),
    avg_order = mean(net_amount),
    order_count = n()
  ),
  desc(total_sales)
)

# With pipes (readable and intuitive)
result_piped <- sales_raw %>%
  filter(!is.na(sales_rep)) %>%
  group_by(region) %>%
  summarise(
    total_sales = sum(net_amount),
    avg_order = mean(net_amount),
    order_count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(total_sales))

# Both produce the same result
identical(result_nested, result_piped)
[1] TRUE
# Show the clean, readable result
cat("Regional sales summary:\n")
Regional sales summary:
print(result_piped)
# A tibble: 4 × 4
  region total_sales avg_order order_count
  <chr>        <dbl>     <dbl>       <int>
1 West       135720.      530.         256
2 East       127331.      537.         237
3 North      116782.      495.         236
4 South      111797.      506.         221

The pipe makes data analysis feel like describing what you want to do in plain English!

A Motivating Example

Here’s a taste of what you’ll be able to do by the end of this module:

# Complex business analysis in a single, readable pipeline
quarterly_performance <- sales_raw %>%
  # Data cleaning
  filter(
    quantity > 0,                    # Remove invalid orders
    unit_price <= 500,               # Remove pricing errors
    !is.na(sales_rep)                # Only orders with assigned reps
  ) %>%

  # Join with customer and product data
  left_join(customers, by = "customer_id") %>%
  left_join(products, by = "product_id") %>%

  # Create time-based variables
  mutate(
    quarter = paste0(year(order_date), "-Q", quarter(order_date)),
    profit = net_amount - (quantity * cost),
    profit_margin = profit / net_amount
  ) %>%

  # Group by multiple dimensions
  group_by(quarter, region, customer_type) %>%

  # Calculate comprehensive metrics
  summarise(
    # Revenue metrics
    total_revenue = sum(net_amount),
    total_profit = sum(profit),
    avg_profit_margin = mean(profit_margin, na.rm = TRUE),

    # Volume metrics
    total_orders = n(),
    total_units = sum(quantity),

    # Customer metrics
    unique_customers = n_distinct(customer_id),
    avg_order_value = mean(net_amount),

    .groups = "drop"
  ) %>%

  # Add rankings and comparisons
  group_by(quarter) %>%
  mutate(
    revenue_rank = row_number(desc(total_revenue)),
    revenue_pct_of_total = total_revenue / sum(total_revenue) * 100
  ) %>%
  ungroup() %>%

  # Final sorting
  arrange(quarter, revenue_rank)

cat("Quarterly performance analysis:\n")
Quarterly performance analysis:
quarterly_performance %>%
  filter(quarter %in% c("2024-Q1", "2024-Q2")) %>%
  select(quarter, region, customer_type, total_revenue, revenue_rank, revenue_pct_of_total) %>%
  print()
# A tibble: 24 × 6
   quarter region customer_type  total_revenue revenue_rank revenue_pct_of_total
   <chr>   <chr>  <chr>                  <dbl>        <int>                <dbl>
 1 2024-Q1 West   Individual             7450.            1                16.4 
 2 2024-Q1 South  Small Business         7055.            2                15.5 
 3 2024-Q1 North  Individual             5534.            3                12.2 
 4 2024-Q1 South  Individual             4719.            4                10.4 
 5 2024-Q1 East   Small Business         4250.            5                 9.36
 6 2024-Q1 East   Individual             3448.            6                 7.60
 7 2024-Q1 East   Enterprise             2715.            7                 5.98
 8 2024-Q1 West   Small Business         2535.            8                 5.58
 9 2024-Q1 West   Enterprise             2395.            9                 5.28
10 2024-Q1 North  Enterprise             2352.           10                 5.18
# ℹ 14 more rows

This analysis combines filtering, joining, grouping, summarizing, and ranking - all in a single, readable pipeline. By the end of this module, this kind of complex analysis will feel natural and intuitive.

Prerequisites

Before starting this module, make sure you have:

  • Completed Module 3 (Tidyverse Introduction) and Module 4 (Data Types)
  • Understanding of the pipe operator (%>%)
  • Basic knowledge of tibbles and data frames
  • Familiarity with tidyverse loading and basic functions

Getting Started

Ready to become a data wrangling expert? Let’s start with the foundation: Filtering and Selecting

Summary

Data wrangling with dplyr is both an art and a science. The key principles to remember:

  • Think in verbs: Each dplyr function does one thing well
  • Chain operations: Use pipes to create readable data pipelines
  • Be explicit: Clear, descriptive code is better than clever shortcuts
  • Test incrementally: Build complex operations step by step
  • Handle missing data: Always consider how NAs affect your analysis (illustrated below)
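
That last point deserves a quick illustration, because NAs fail silently:

x <- c(1, NA, 3)

mean(x)                # NA: one missing value poisons the summary
mean(x, na.rm = TRUE)  # 2: drop NAs explicitly

# filter() silently drops rows where the condition is NA, not just FALSE
tibble(v = x) %>% filter(v > 1)   # returns only the row where v = 3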

dplyr transforms messy, real-world data into clean, analysis-ready datasets. With these skills, you’ll be able to tackle any data challenge with confidence and clarity.

Let’s begin your journey to dplyr mastery! 🚀