Data Transformation with dplyr Basics

Author

IND215

Published

September 22, 2025

Introduction to dplyr

dplyr is the tidyverse’s grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. Think of dplyr verbs as the fundamental building blocks for data transformation.

library(tidyverse)

# We'll use the built-in mtcars dataset for examples
data(mtcars)
cars <- as_tibble(mtcars, rownames = "model")
glimpse(cars)

Rows: 32
Columns: 12
$ model <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "H…
$ mpg   <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8…
$ cyl   <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8…
$ disp  <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 1…
$ hp    <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 18…
$ drat  <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92…
$ wt    <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3…
$ qsec  <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 1…
$ vs    <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0…
$ am    <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0…
$ gear  <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3…
$ carb  <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2…

The Five Key Verbs

dplyr provides five key verbs for data manipulation:

select() - Choose columns
filter() - Choose rows
mutate() - Create or modify columns
arrange() - Reorder rows
summarize() - Reduce multiple values to a single summary

select(): Choose Columns

Basic Column Selection

# Select specific columns by name
cars %>%
  select(model, mpg, cyl, hp) %>%
  head()

# A tibble: 6 × 4
  model               mpg   cyl    hp
  <chr>             <dbl> <dbl> <dbl>
1 Mazda RX4          21       6   110
2 Mazda RX4 Wag      21       6   110
3 Datsun 710         22.8     4    93
4 Hornet 4 Drive     21.4     6   110
5 Hornet Sportabout  18.7     8   175
6 Valiant            18.1     6   105

# Select columns by position
cars %>%
  select(1:4) %>%
  head()

# A tibble: 6 × 4
  model               mpg   cyl  disp
  <chr>             <dbl> <dbl> <dbl>
1 Mazda RX4          21       6   160
2 Mazda RX4 Wag      21       6   160
3 Datsun 710         22.8     4   108
4 Hornet 4 Drive     21.4     6   258
5 Hornet Sportabout  18.7     8   360
6 Valiant            18.1     6   225

# Select all columns except some
cars %>%
  select(-am, -gear, -carb) %>%
  head()

# A tibble: 6 × 9
  model               mpg   cyl  disp    hp  drat    wt  qsec    vs
  <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4          21       6   160   110  3.9   2.62  16.5     0
2 Mazda RX4 Wag      21       6   160   110  3.9   2.88  17.0     0
3 Datsun 710         22.8     4   108    93  3.85  2.32  18.6     1
4 Hornet 4 Drive     21.4     6   258   110  3.08  3.22  19.4     1
5 Hornet Sportabout  18.7     8   360   175  3.15  3.44  17.0     0
6 Valiant            18.1     6   225   105  2.76  3.46  20.2     1

Selection Helpers

dplyr provides helpful functions for column selection:

# Select columns starting with "d"
cars %>%
  select(model, starts_with("d")) %>%
  head()

# A tibble: 6 × 3
  model              disp  drat
  <chr>             <dbl> <dbl>
1 Mazda RX4           160  3.9 
2 Mazda RX4 Wag       160  3.9 
3 Datsun 710          108  3.85
4 Hornet 4 Drive      258  3.08
5 Hornet Sportabout   360  3.15
6 Valiant             225  2.76

# Select columns ending with "p"
cars %>%
  select(model, ends_with("p")) %>%
  head()

# A tibble: 6 × 3
  model              disp    hp
  <chr>             <dbl> <dbl>
1 Mazda RX4           160   110
2 Mazda RX4 Wag       160   110
3 Datsun 710          108    93
4 Hornet 4 Drive      258   110
5 Hornet Sportabout   360   175
6 Valiant             225   105

# Select columns containing "a"
cars %>%
  select(model, contains("a")) %>%
  head()

# A tibble: 6 × 5
  model              drat    am  gear  carb
  <chr>             <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4          3.9      1     4     4
2 Mazda RX4 Wag      3.9      1     4     4
3 Datsun 710         3.85     1     4     1
4 Hornet 4 Drive     3.08     0     3     1
5 Hornet Sportabout  3.15     0     3     2
6 Valiant            2.76     0     3     1

# Select columns matching a pattern
cars %>%
  select(model, matches("^[md]")) %>%
  head()

# A tibble: 6 × 4
  model               mpg  disp  drat
  <chr>             <dbl> <dbl> <dbl>
1 Mazda RX4          21     160  3.9 
2 Mazda RX4 Wag      21     160  3.9 
3 Datsun 710         22.8   108  3.85
4 Hornet 4 Drive     21.4   258  3.08
5 Hornet Sportabout  18.7   360  3.15
6 Valiant            18.1   225  2.76

Renaming While Selecting

# Rename columns while selecting
cars %>%
  select(
    car = model,
    miles_per_gallon = mpg,
    cylinders = cyl,
    horsepower = hp
  ) %>%
  head()

# A tibble: 6 × 4
  car               miles_per_gallon cylinders horsepower
  <chr>                        <dbl>     <dbl>      <dbl>
1 Mazda RX4                     21           6        110
2 Mazda RX4 Wag                 21           6        110
3 Datsun 710                    22.8         4         93
4 Hornet 4 Drive                21.4         6        110
5 Hornet Sportabout             18.7         8        175
6 Valiant                       18.1         6        105

# Or use rename() to keep all columns
cars %>%
  rename(
    miles_per_gallon = mpg,
    cylinders = cyl
  ) %>%
  select(model, miles_per_gallon, cylinders) %>%
  head()

# A tibble: 6 × 3
  model             miles_per_gallon cylinders
  <chr>                        <dbl>     <dbl>
1 Mazda RX4                     21           6
2 Mazda RX4 Wag                 21           6
3 Datsun 710                    22.8         4
4 Hornet 4 Drive                21.4         6
5 Hornet Sportabout             18.7         8
6 Valiant                       18.1         6

Reordering Columns

# Move specific columns to the front
cars %>%
  select(model, mpg, hp, everything()) %>%
  head()

# A tibble: 6 × 12
  model          mpg    hp   cyl  disp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21     110     6   160  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21     110     6   160  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8    93     4   108  3.85  2.32  18.6     1     1     4     1
4 Hornet 4 Dr…  21.4   110     6   258  3.08  3.22  19.4     1     0     3     1
5 Hornet Spor…  18.7   175     8   360  3.15  3.44  17.0     0     0     3     2
6 Valiant       18.1   105     6   225  2.76  3.46  20.2     1     0     3     1

# Use relocate() for more control
cars %>%
  relocate(hp, .before = mpg) %>%
  select(model, hp, mpg, cyl) %>%
  head()

# A tibble: 6 × 4
  model                hp   mpg   cyl
  <chr>             <dbl> <dbl> <dbl>
1 Mazda RX4           110  21       6
2 Mazda RX4 Wag       110  21       6
3 Datsun 710           93  22.8     4
4 Hornet 4 Drive      110  21.4     6
5 Hornet Sportabout   175  18.7     8
6 Valiant             105  18.1     6

filter(): Choose Rows

Basic Filtering

# Filter for cars with mpg > 20
cars %>%
  filter(mpg > 20) %>%
  select(model, mpg, cyl)

# A tibble: 14 × 3
   model            mpg   cyl
   <chr>          <dbl> <dbl>
 1 Mazda RX4       21       6
 2 Mazda RX4 Wag   21       6
 3 Datsun 710      22.8     4
 4 Hornet 4 Drive  21.4     6
 5 Merc 240D       24.4     4
 6 Merc 230        22.8     4
 7 Fiat 128        32.4     4
 8 Honda Civic     30.4     4
 9 Toyota Corolla  33.9     4
10 Toyota Corona   21.5     4
11 Fiat X1-9       27.3     4
12 Porsche 914-2   26       4
13 Lotus Europa    30.4     4
14 Volvo 142E      21.4     4

# Multiple conditions (AND)
cars %>%
  filter(mpg > 20, cyl == 4) %>%
  select(model, mpg, cyl)

# A tibble: 11 × 3
   model            mpg   cyl
   <chr>          <dbl> <dbl>
 1 Datsun 710      22.8     4
 2 Merc 240D       24.4     4
 3 Merc 230        22.8     4
 4 Fiat 128        32.4     4
 5 Honda Civic     30.4     4
 6 Toyota Corolla  33.9     4
 7 Toyota Corona   21.5     4
 8 Fiat X1-9       27.3     4
 9 Porsche 914-2   26       4
10 Lotus Europa    30.4     4
11 Volvo 142E      21.4     4

# Multiple conditions (OR)
cars %>%
  filter(mpg > 30 | hp > 200) %>%
  select(model, mpg, hp)

# A tibble: 11 × 3
   model                 mpg    hp
   <chr>               <dbl> <dbl>
 1 Duster 360           14.3   245
 2 Cadillac Fleetwood   10.4   205
 3 Lincoln Continental  10.4   215
 4 Chrysler Imperial    14.7   230
 5 Fiat 128             32.4    66
 6 Honda Civic          30.4    52
 7 Toyota Corolla       33.9    65
 8 Camaro Z28           13.3   245
 9 Lotus Europa         30.4   113
10 Ford Pantera L       15.8   264
11 Maserati Bora        15     335

Complex Filtering

# Using %in% for multiple values
cars %>%
  filter(cyl %in% c(4, 6)) %>%
  select(model, cyl, mpg) %>%
  head()

# A tibble: 6 × 3
  model            cyl   mpg
  <chr>          <dbl> <dbl>
1 Mazda RX4          6  21  
2 Mazda RX4 Wag      6  21  
3 Datsun 710         4  22.8
4 Hornet 4 Drive     6  21.4
5 Valiant            6  18.1
6 Merc 240D          4  24.4

# Combining conditions
cars %>%
  filter(
    (mpg > 20 & cyl == 4) | (hp > 200 & cyl == 8)
  ) %>%
  select(model, mpg, cyl, hp)

# A tibble: 18 × 4
   model                 mpg   cyl    hp
   <chr>               <dbl> <dbl> <dbl>
 1 Datsun 710           22.8     4    93
 2 Duster 360           14.3     8   245
 3 Merc 240D            24.4     4    62
 4 Merc 230             22.8     4    95
 5 Cadillac Fleetwood   10.4     8   205
 6 Lincoln Continental  10.4     8   215
 7 Chrysler Imperial    14.7     8   230
 8 Fiat 128             32.4     4    66
 9 Honda Civic          30.4     4    52
10 Toyota Corolla       33.9     4    65
11 Toyota Corona        21.5     4    97
12 Camaro Z28           13.3     8   245
13 Fiat X1-9            27.3     4    66
14 Porsche 914-2        26       4    91
15 Lotus Europa         30.4     4   113
16 Ford Pantera L       15.8     8   264
17 Maserati Bora        15       8   335
18 Volvo 142E           21.4     4   109

# Filter with between()
cars %>%
  filter(between(mpg, 15, 25)) %>%
  select(model, mpg) %>%
  head()

# A tibble: 6 × 2
  model               mpg
  <chr>             <dbl>
1 Mazda RX4          21  
2 Mazda RX4 Wag      21  
3 Datsun 710         22.8
4 Hornet 4 Drive     21.4
5 Hornet Sportabout  18.7
6 Valiant            18.1

Filtering with String Matching

# Filter models containing "Merc"
cars %>%
  filter(str_detect(model, "Merc")) %>%
  select(model, mpg)

# A tibble: 7 × 2
  model         mpg
  <chr>       <dbl>
1 Merc 240D    24.4
2 Merc 230     22.8
3 Merc 280     19.2
4 Merc 280C    17.8
5 Merc 450SE   16.4
6 Merc 450SL   17.3
7 Merc 450SLC  15.2

# Case-insensitive matching
cars %>%
  filter(str_detect(model, "(?i)MAZDA")) %>%
  select(model, mpg)

# A tibble: 2 × 2
  model           mpg
  <chr>         <dbl>
1 Mazda RX4        21
2 Mazda RX4 Wag    21

Filtering Missing Values

# Create data with NAs
cars_with_na <- cars %>%
  mutate(mpg = if_else(row_number() %in% c(2, 5, 8), NA_real_, mpg))

# Remove rows with NA in mpg
cars_with_na %>%
  filter(!is.na(mpg)) %>%
  select(model, mpg) %>%
  head()

# A tibble: 6 × 2
  model            mpg
  <chr>          <dbl>
1 Mazda RX4       21  
2 Datsun 710      22.8
3 Hornet 4 Drive  21.4
4 Valiant         18.1
5 Duster 360      14.3
6 Merc 230        22.8

# Keep only complete cases
cars_with_na %>%
  filter(complete.cases(.)) %>%
  nrow()

[1] 29

mutate(): Create or Modify Columns

Creating New Columns

# Create new columns based on existing ones
cars %>%
  mutate(
    mpg_per_cyl = mpg / cyl,
    hp_per_cyl = hp / cyl,
    performance_ratio = hp / wt,
    is_efficient = mpg > 20
  ) %>%
  select(model, mpg_per_cyl, hp_per_cyl, performance_ratio, is_efficient) %>%
  head()

# A tibble: 6 × 5
  model             mpg_per_cyl hp_per_cyl performance_ratio is_efficient
  <chr>                   <dbl>      <dbl>             <dbl> <lgl>       
1 Mazda RX4                3.5        18.3              42.0 TRUE        
2 Mazda RX4 Wag            3.5        18.3              38.3 TRUE        
3 Datsun 710               5.7        23.2              40.1 TRUE        
4 Hornet 4 Drive           3.57       18.3              34.2 TRUE        
5 Hornet Sportabout        2.34       21.9              50.9 FALSE       
6 Valiant                  3.02       17.5              30.3 FALSE

Modifying Existing Columns

# Modify existing columns
cars %>%
  mutate(
    mpg = round(mpg, 0),
    wt = wt * 1000,  # Convert to pounds
    model = str_to_upper(model)
  ) %>%
  select(model, mpg, wt) %>%
  head()

# A tibble: 6 × 3
  model               mpg    wt
  <chr>             <dbl> <dbl>
1 MAZDA RX4            21  2620
2 MAZDA RX4 WAG        21  2875
3 DATSUN 710           23  2320
4 HORNET 4 DRIVE       21  3215
5 HORNET SPORTABOUT    19  3440
6 VALIANT              18  3460

Conditional Mutations

# Using if_else()
cars %>%
  mutate(
    efficiency = if_else(mpg > 20, "Efficient", "Not Efficient"),
    size = if_else(wt < 3, "Light", "Heavy")
  ) %>%
  select(model, mpg, wt, efficiency, size) %>%
  head()

# A tibble: 6 × 5
  model               mpg    wt efficiency    size 
  <chr>             <dbl> <dbl> <chr>         <chr>
1 Mazda RX4          21    2.62 Efficient     Light
2 Mazda RX4 Wag      21    2.88 Efficient     Light
3 Datsun 710         22.8  2.32 Efficient     Light
4 Hornet 4 Drive     21.4  3.22 Efficient     Heavy
5 Hornet Sportabout  18.7  3.44 Not Efficient Heavy
6 Valiant            18.1  3.46 Not Efficient Heavy

# Using case_when() for multiple conditions
cars %>%
  mutate(
    performance = case_when(
      hp < 100 ~ "Low Power",
      hp < 150 ~ "Medium Power",
      hp < 200 ~ "High Power",
      TRUE ~ "Very High Power"
    ),
    efficiency = case_when(
      mpg > 30 ~ "Excellent",
      mpg > 25 ~ "Good",
      mpg > 20 ~ "Fair",
      TRUE ~ "Poor"
    )
  ) %>%
  select(model, hp, mpg, performance, efficiency) %>%
  head(10)

# A tibble: 10 × 5
   model                hp   mpg performance     efficiency
   <chr>             <dbl> <dbl> <chr>           <chr>     
 1 Mazda RX4           110  21   Medium Power    Fair      
 2 Mazda RX4 Wag       110  21   Medium Power    Fair      
 3 Datsun 710           93  22.8 Low Power       Fair      
 4 Hornet 4 Drive      110  21.4 Medium Power    Fair      
 5 Hornet Sportabout   175  18.7 High Power      Poor      
 6 Valiant             105  18.1 Medium Power    Poor      
 7 Duster 360          245  14.3 Very High Power Poor      
 8 Merc 240D            62  24.4 Low Power       Fair      
 9 Merc 230             95  22.8 Low Power       Fair      
10 Merc 280            123  19.2 Medium Power    Poor

Window Functions in mutate()

# Add ranking and cumulative statistics
cars %>%
  arrange(desc(mpg)) %>%
  mutate(
    mpg_rank = row_number(),
    mpg_dense_rank = dense_rank(mpg),
    mpg_percent_rank = percent_rank(mpg),
    cumulative_avg_mpg = cummean(mpg)
  ) %>%
  select(model, mpg, mpg_rank, cumulative_avg_mpg) %>%
  head(10)

# A tibble: 10 × 4
   model            mpg mpg_rank cumulative_avg_mpg
   <chr>          <dbl>    <int>              <dbl>
 1 Toyota Corolla  33.9        1               33.9
 2 Fiat 128        32.4        2               33.2
 3 Honda Civic     30.4        3               32.2
 4 Lotus Europa    30.4        4               31.8
 5 Fiat X1-9       27.3        5               30.9
 6 Porsche 914-2   26          6               30.1
 7 Merc 240D       24.4        7               29.3
 8 Datsun 710      22.8        8               28.4
 9 Merc 230        22.8        9               27.8
10 Toyota Corona   21.5       10               27.2

arrange(): Reorder Rows

Basic Sorting

# Sort by mpg (ascending)
cars %>%
  arrange(mpg) %>%
  select(model, mpg) %>%
  head()

# A tibble: 6 × 2
  model                 mpg
  <chr>               <dbl>
1 Cadillac Fleetwood   10.4
2 Lincoln Continental  10.4
3 Camaro Z28           13.3
4 Duster 360           14.3
5 Chrysler Imperial    14.7
6 Maserati Bora        15

# Sort by mpg (descending)
cars %>%
  arrange(desc(mpg)) %>%
  select(model, mpg) %>%
  head()

# A tibble: 6 × 2
  model            mpg
  <chr>          <dbl>
1 Toyota Corolla  33.9
2 Fiat 128        32.4
3 Honda Civic     30.4
4 Lotus Europa    30.4
5 Fiat X1-9       27.3
6 Porsche 914-2   26

Multiple Column Sorting

# Sort by multiple columns
cars %>%
  arrange(cyl, desc(mpg)) %>%
  select(model, cyl, mpg) %>%
  head(10)

# A tibble: 10 × 3
   model            cyl   mpg
   <chr>          <dbl> <dbl>
 1 Toyota Corolla     4  33.9
 2 Fiat 128           4  32.4
 3 Honda Civic        4  30.4
 4 Lotus Europa       4  30.4
 5 Fiat X1-9          4  27.3
 6 Porsche 914-2      4  26  
 7 Merc 240D          4  24.4
 8 Datsun 710         4  22.8
 9 Merc 230           4  22.8
10 Toyota Corona      4  21.5

# Complex sorting with calculated values
cars %>%
  mutate(efficiency_score = mpg / wt) %>%
  arrange(desc(efficiency_score)) %>%
  select(model, mpg, wt, efficiency_score) %>%
  head()

# A tibble: 6 × 4
  model            mpg    wt efficiency_score
  <chr>          <dbl> <dbl>            <dbl>
1 Lotus Europa    30.4  1.51             20.1
2 Honda Civic     30.4  1.62             18.8
3 Toyota Corolla  33.9  1.84             18.5
4 Fiat 128        32.4  2.2              14.7
5 Fiat X1-9       27.3  1.94             14.1
6 Porsche 914-2   26    2.14             12.1

summarize(): Reduce to Single Values

Basic Summarization

# Calculate summary statistics
cars %>%
  summarize(
    avg_mpg = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg = sd(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg),
    n_cars = n()
  )

# A tibble: 1 × 6
  avg_mpg median_mpg sd_mpg min_mpg max_mpg n_cars
    <dbl>      <dbl>  <dbl>   <dbl>   <dbl>  <int>
1    20.1       19.2   6.03    10.4    33.9     32

# Multiple summary statistics
cars %>%
  summarize(
    across(c(mpg, hp, wt),
           list(mean = mean,
                median = median,
                sd = sd),
           .names = "{.col}_{.fn}")
  )

# A tibble: 1 × 9
  mpg_mean mpg_median mpg_sd hp_mean hp_median hp_sd wt_mean wt_median wt_sd
     <dbl>      <dbl>  <dbl>   <dbl>     <dbl> <dbl>   <dbl>     <dbl> <dbl>
1     20.1       19.2   6.03    147.       123  68.6    3.22      3.32 0.978

Counting and Proportions

# Count unique values
cars %>%
  summarize(
    n_total = n(),
    n_efficient = sum(mpg > 20),
    prop_efficient = mean(mpg > 20),
    n_unique_cyl = n_distinct(cyl)
  )

# A tibble: 1 × 4
  n_total n_efficient prop_efficient n_unique_cyl
    <int>       <int>          <dbl>        <int>
1      32          14          0.438            3

# Using count() shortcut
cars %>%
  count(cyl, sort = TRUE)

# A tibble: 3 × 2
    cyl     n
  <dbl> <int>
1     8    14
2     4    11
3     6     7

# Count with weights
cars %>%
  count(cyl, wt = hp, name = "total_hp")

# A tibble: 3 × 2
    cyl total_hp
  <dbl>    <dbl>
1     4      909
2     6      856
3     8     2929

group_by(): The Power Multiplier

Grouped Operations

# Group by cylinder and calculate summaries
cars %>%
  group_by(cyl) %>%
  summarize(
    n = n(),
    avg_mpg = mean(mpg),
    avg_hp = mean(hp),
    avg_wt = mean(wt),
    .groups = "drop"
  )

# A tibble: 3 × 5
    cyl     n avg_mpg avg_hp avg_wt
  <dbl> <int>   <dbl>  <dbl>  <dbl>
1     4    11    26.7   82.6   2.29
2     6     7    19.7  122.    3.12
3     8    14    15.1  209.    4.00

# Multiple grouping variables
cars %>%
  mutate(transmission = if_else(am == 1, "Manual", "Automatic")) %>%
  group_by(cyl, transmission) %>%
  summarize(
    n = n(),
    avg_mpg = mean(mpg),
    .groups = "drop"
  ) %>%
  arrange(cyl, transmission)

# A tibble: 6 × 4
    cyl transmission     n avg_mpg
  <dbl> <chr>        <int>   <dbl>
1     4 Automatic        3    22.9
2     4 Manual           8    28.1
3     6 Automatic        4    19.1
4     6 Manual           3    20.6
5     8 Automatic       12    15.0
6     8 Manual           2    15.4

Grouped Mutations

# Add group-level statistics to each row
cars %>%
  group_by(cyl) %>%
  mutate(
    avg_mpg_for_cyl = mean(mpg),
    mpg_diff_from_group = mpg - avg_mpg_for_cyl,
    mpg_rank_in_group = rank(desc(mpg))
  ) %>%
  select(model, cyl, mpg, avg_mpg_for_cyl, mpg_diff_from_group, mpg_rank_in_group) %>%
  arrange(cyl, mpg_rank_in_group) %>%
  head(10)

# A tibble: 10 × 6
# Groups:   cyl [1]
   model         cyl   mpg avg_mpg_for_cyl mpg_diff_from_group mpg_rank_in_group
   <chr>       <dbl> <dbl>           <dbl>               <dbl>             <dbl>
 1 Toyota Cor…     4  33.9            26.7               7.24                1  
 2 Fiat 128        4  32.4            26.7               5.74                2  
 3 Honda Civic     4  30.4            26.7               3.74                3.5
 4 Lotus Euro…     4  30.4            26.7               3.74                3.5
 5 Fiat X1-9       4  27.3            26.7               0.636               5  
 6 Porsche 91…     4  26              26.7              -0.664               6  
 7 Merc 240D       4  24.4            26.7              -2.26                7  
 8 Datsun 710      4  22.8            26.7              -3.86                8.5
 9 Merc 230        4  22.8            26.7              -3.86                8.5
10 Toyota Cor…     4  21.5            26.7              -5.16               10

Grouped Filtering

# Keep only the most efficient car in each cylinder group
cars %>%
  group_by(cyl) %>%
  filter(mpg == max(mpg)) %>%
  select(model, cyl, mpg) %>%
  arrange(cyl)

# A tibble: 3 × 3
# Groups:   cyl [3]
  model              cyl   mpg
  <chr>            <dbl> <dbl>
1 Toyota Corolla       4  33.9
2 Hornet 4 Drive       6  21.4
3 Pontiac Firebird     8  19.2

# Keep groups meeting certain criteria
cars %>%
  group_by(cyl) %>%
  filter(mean(mpg) > 20) %>%
  ungroup() %>%
  count(cyl)

# A tibble: 1 × 2
    cyl     n
  <dbl> <int>
1     4    11

Combining Multiple Operations

Complex Data Pipelines

# Comprehensive analysis pipeline
analysis <- cars %>%
  # Add calculated columns
  mutate(
    efficiency = mpg / wt,
    performance = hp / wt,
    transmission = if_else(am == 1, "Manual", "Automatic")
  ) %>%
  # Filter for relevant cases
  filter(complete.cases(.)) %>%
  # Group for analysis
  group_by(cyl, transmission) %>%
  # Calculate summaries
  summarize(
    n_cars = n(),
    avg_mpg = mean(mpg),
    avg_efficiency = mean(efficiency),
    avg_performance = mean(performance),
    best_mpg = max(mpg),
    .groups = "drop"
  ) %>%
  # Add overall rankings
  mutate(
    efficiency_rank = dense_rank(desc(avg_efficiency)),
    performance_rank = dense_rank(desc(avg_performance))
  ) %>%
  # Sort by efficiency
  arrange(efficiency_rank)

analysis

# A tibble: 6 × 9
    cyl transmission n_cars avg_mpg avg_efficiency avg_performance best_mpg
  <dbl> <chr>         <int>   <dbl>          <dbl>           <dbl>    <dbl>
1     4 Manual            8    28.1          14.5             41.0     33.9
2     4 Automatic         3    22.9           7.87            29.6     24.4
3     6 Manual            3    20.6           7.48            47.8     21  
4     6 Automatic         4    19.1           5.66            34.0     21.4
5     8 Manual            2    15.4           4.59            88.6     15.8
6     8 Automatic        12    15.0           3.84            48.1     19.2
# ℹ 2 more variables: efficiency_rank <int>, performance_rank <int>

Real-World Example: Sales Analysis

# Create sample sales data
set.seed(123)
sales <- tibble(
  date = sample(seq.Date(from = as.Date("2024-01-01"),
                        to = as.Date("2024-03-31"),
                        by = "day"), 500, replace = TRUE),
  product = sample(c("Widget A", "Widget B", "Widget C", "Widget D"),
                  500, replace = TRUE),
  region = sample(c("North", "South", "East", "West"),
                 500, replace = TRUE),
  quantity = sample(1:20, 500, replace = TRUE),
  price = sample(c(9.99, 14.99, 19.99, 24.99), 500, replace = TRUE),
  discount = sample(c(0, 0.1, 0.15, 0.2), 500, replace = TRUE,
                   prob = c(0.6, 0.2, 0.15, 0.05))
)

# Complex analysis
sales_analysis <- sales %>%
  # Calculate revenue
  mutate(
    revenue = quantity * price * (1 - discount),
    month = format(date, "%Y-%m"),
    has_discount = discount > 0
  ) %>%
  # Monthly product performance by region
  group_by(month, product, region) %>%
  summarize(
    n_transactions = n(),
    total_quantity = sum(quantity),
    total_revenue = sum(revenue),
    avg_discount = mean(discount),
    pct_with_discount = mean(has_discount),
    .groups = "drop"
  ) %>%
  # Add product-level rankings within each month
  group_by(month) %>%
  mutate(
    revenue_rank = dense_rank(desc(total_revenue)),
    quantity_rank = dense_rank(desc(total_quantity))
  ) %>%
  ungroup() %>%
  # Focus on top performers
  filter(revenue_rank <= 3) %>%
  arrange(month, revenue_rank)

head(sales_analysis, 10)

# A tibble: 9 × 10
  month  product region n_transactions total_quantity total_revenue avg_discount
  <chr>  <chr>   <chr>           <int>          <int>         <dbl>        <dbl>
1 2024-… Widget… North              13            163         2945.       0.104 
2 2024-… Widget… South              16            167         2598.       0.0656
3 2024-… Widget… West               15            145         2540.       0.03  
4 2024-… Widget… East               15            155         2879.       0.0433
5 2024-… Widget… North              14            170         2682.       0.0643
6 2024-… Widget… North              14            126         2180.       0.0536
7 2024-… Widget… West               12            164         3177.       0.0458
8 2024-… Widget… East               16            180         3016.       0.0844
9 2024-… Widget… West               13            145         2694.       0.0577
# ℹ 3 more variables: pct_with_discount <dbl>, revenue_rank <int>,
#   quantity_rank <int>

Common Patterns and Best Practices

Pattern 1: Filter-Select-Arrange

# Common pattern for data exploration
cars %>%
  filter(mpg > 20) %>%
  select(model, mpg, cyl, hp) %>%
  arrange(desc(mpg))

# A tibble: 14 × 4
   model            mpg   cyl    hp
   <chr>          <dbl> <dbl> <dbl>
 1 Toyota Corolla  33.9     4    65
 2 Fiat 128        32.4     4    66
 3 Honda Civic     30.4     4    52
 4 Lotus Europa    30.4     4   113
 5 Fiat X1-9       27.3     4    66
 6 Porsche 914-2   26       4    91
 7 Merc 240D       24.4     4    62
 8 Datsun 710      22.8     4    93
 9 Merc 230        22.8     4    95
10 Toyota Corona   21.5     4    97
11 Hornet 4 Drive  21.4     6   110
12 Volvo 142E      21.4     4   109
13 Mazda RX4       21       6   110
14 Mazda RX4 Wag   21       6   110

Pattern 2: Group-Summarize-Arrange

# Common pattern for aggregation
cars %>%
  group_by(cyl) %>%
  summarize(
    n = n(),
    avg_mpg = mean(mpg),
    avg_hp = mean(hp),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_mpg))

# A tibble: 3 × 4
    cyl     n avg_mpg avg_hp
  <dbl> <int>   <dbl>  <dbl>
1     4    11    26.7   82.6
2     6     7    19.7  122. 
3     8    14    15.1  209.

Pattern 3: Mutate-Filter-Select

# Common pattern for feature engineering
cars %>%
  mutate(
    efficiency_ratio = mpg / wt,
    is_efficient = efficiency_ratio > 7
  ) %>%
  filter(is_efficient) %>%
  select(model, mpg, wt, efficiency_ratio)

# A tibble: 14 × 4
   model            mpg    wt efficiency_ratio
   <chr>          <dbl> <dbl>            <dbl>
 1 Mazda RX4       21    2.62             8.02
 2 Mazda RX4 Wag   21    2.88             7.30
 3 Datsun 710      22.8  2.32             9.83
 4 Merc 240D       24.4  3.19             7.65
 5 Merc 230        22.8  3.15             7.24
 6 Fiat 128        32.4  2.2             14.7 
 7 Honda Civic     30.4  1.62            18.8 
 8 Toyota Corolla  33.9  1.84            18.5 
 9 Toyota Corona   21.5  2.46             8.72
10 Fiat X1-9       27.3  1.94            14.1 
11 Porsche 914-2   26    2.14            12.1 
12 Lotus Europa    30.4  1.51            20.1 
13 Ferrari Dino    19.7  2.77             7.11
14 Volvo 142E      21.4  2.78             7.70

Advanced Tips

1. Using across() for Multiple Columns

# Apply same operation to multiple columns
cars %>%
  mutate(across(c(mpg, hp, wt), ~round(.x, 0))) %>%
  select(model, mpg, hp, wt) %>%
  head()

# A tibble: 6 × 4
  model               mpg    hp    wt
  <chr>             <dbl> <dbl> <dbl>
1 Mazda RX4            21   110     3
2 Mazda RX4 Wag        21   110     3
3 Datsun 710           23    93     2
4 Hornet 4 Drive       21   110     3
5 Hornet Sportabout    19   175     3
6 Valiant              18   105     3

# Summarize multiple columns
cars %>%
  group_by(cyl) %>%
  summarize(across(c(mpg, hp, wt),
                  list(mean = mean, sd = sd),
                  .names = "{.col}_{.fn}")) %>%
  round(2)

# A tibble: 3 × 7
    cyl mpg_mean mpg_sd hp_mean hp_sd wt_mean wt_sd
  <dbl>    <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
1     4     26.7   4.51    82.6  20.9    2.29  0.57
2     6     19.7   1.45   122.   24.3    3.12  0.36
3     8     15.1   2.56   209.   51.0    4     0.76

2. Using slice() Functions

# Get top N rows per group
cars %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 2) %>%
  select(model, cyl, mpg)

# A tibble: 7 × 3
# Groups:   cyl [3]
  model               cyl   mpg
  <chr>             <dbl> <dbl>
1 Toyota Corolla        4  33.9
2 Fiat 128              4  32.4
3 Hornet 4 Drive        6  21.4
4 Mazda RX4             6  21  
5 Mazda RX4 Wag         6  21  
6 Pontiac Firebird      8  19.2
7 Hornet Sportabout     8  18.7

# Get random sample per group
set.seed(123)
cars %>%
  group_by(cyl) %>%
  slice_sample(n = 2) %>%
  select(model, cyl, mpg) %>%
  arrange(cyl)

# A tibble: 6 × 3
# Groups:   cyl [3]
  model                cyl   mpg
  <chr>              <dbl> <dbl>
1 Merc 230               4  22.8
2 Volvo 142E             4  21.4
3 Mazda RX4 Wag          6  21  
4 Ferrari Dino           6  19.7
5 Cadillac Fleetwood     8  10.4
6 Camaro Z28             8  13.3

3. Scoped Variants

# Select columns by condition
cars %>%
  select(model, where(is.numeric)) %>%
  head(3)

# A tibble: 3 × 12
  model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1

# Summarize all numeric columns
cars %>%
  summarize(across(where(is.numeric), mean)) %>%
  round(2)

# A tibble: 1 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  20.1  6.19  231.  147.   3.6  3.22  17.8  0.44  0.41  3.69  2.81

Exercises

Exercise 1: Basic Verbs

Using the iris dataset: 1. Select only the Petal columns and Species 2. Filter for Petal.Length > 4 3. Create a new column for Petal.Area (Length × Width) 4. Arrange by Petal.Area descending 5. Show only the top 10 rows

Exercise 2: Grouped Operations

Using mtcars: 1. Group by number of gears 2. Calculate average mpg and hp for each group 3. Add a column showing the difference from overall average mpg 4. Keep only groups where average mpg > 20

Exercise 3: Complex Pipeline

Create a pipeline that: 1. Filters cars with 4 or 6 cylinders 2. Creates an efficiency score (mpg × 1000 / (hp × wt)) 3. Groups by cylinder count 4. Finds the most and least efficient car in each group 5. Presents the results in a clean summary table

Summary

dplyr provides a powerful, intuitive grammar for data manipulation:

select(): Choose your columns
filter(): Choose your rows
mutate(): Create new variables
arrange(): Order your data
summarize(): Calculate summaries
group_by(): Split-apply-combine operations

These verbs can be combined in endless ways to solve virtually any data manipulation challenge. The key is to think of data transformation as a series of simple steps, each accomplishing one specific task.

Next, we’ll explore data reshaping with tidyr to complement these transformation skills!