Module 6: Organizing Tabular Data with tidyr

Author

IND215

Published

September 22, 2025

Welcome to Data Reshaping and Tidying! 📊🔄

Data rarely comes in the format you need for analysis. Sometimes it’s too wide, sometimes too long, sometimes messy with multiple values in single cells. The tidyr package provides powerful tools to reshape and tidy your data, transforming it into the clean, analysis-ready format that other tidyverse packages expect.

What You’ll Learn

In this module, we’ll master the art of data reshaping and tidying with tidyr:

1. Tidy Data Principles

Understanding the three fundamental rules of tidy data
Identifying common “messy” data patterns
Diagnosing data structure problems
Planning transformation strategies

2. Pivoting Data

pivot_longer(): Converting wide data to long format
pivot_wider(): Converting long data to wide format
Handling multiple value columns and complex pivots
Real-world reshaping scenarios

3. Separating and Uniting

separate(): Splitting one column into multiple columns
unite(): Combining multiple columns into one
Working with delimiters and patterns
Handling missing values in separation

4. Nested Data and Rectangling

Working with list-columns and nested data
unnest(): Expanding nested data structures
hoist(): Extracting specific elements from lists
Rectangling JSON-like data

5. Advanced Tidying Techniques

Handling multiple types of messiness simultaneously
Custom separation and parsing functions
Performance considerations for large datasets
Integration with other tidyverse packages

Learning Objectives

By the end of this module, you will be able to:

✅ Identify and apply tidy data principles to any dataset
✅ Reshape data between wide and long formats efficiently
✅ Split and combine columns using various delimiters
✅ Work with complex nested data structures
✅ Handle real-world messy data transformation challenges
✅ Choose the right tidyr function for any data structure problem
✅ Build data cleaning pipelines that integrate with dplyr workflows

Why Master tidyr?

Data reshaping is essential because:

Analysis requirements: Different analyses need different data shapes
Visualization needs: ggplot2 works best with long-format data
Modeling prep: Machine learning models need specific data structures
Real-world messiness: Raw data is rarely in the format you need
Reproducible workflows: Systematic tidying ensures consistent results

The Tidy Data Philosophy

library(tidyverse)

cat("The three fundamental rules of tidy data:\n")

The three fundamental rules of tidy data:

cat("1. Each variable forms a column\n")

1. Each variable forms a column

cat("2. Each observation forms a row\n")

2. Each observation forms a row

cat("3. Each type of observational unit forms a table\n\n")

3. Each type of observational unit forms a table

cat("Common tidyr functions:\n")

Common tidyr functions:

cat("- pivot_longer(): Wide to long (gather)\n")

- pivot_longer(): Wide to long (gather)

cat("- pivot_wider(): Long to wide (spread)\n")

- pivot_wider(): Long to wide (spread)

cat("- separate(): Split one column into many\n")

- separate(): Split one column into many

cat("- unite(): Combine many columns into one\n")

- unite(): Combine many columns into one

cat("- unnest(): Expand list-columns\n")

- unnest(): Expand list-columns

cat("- fill(): Fill missing values\n")

- fill(): Fill missing values

Real-World Messy Data Examples

Let’s start with realistic examples of the messy data you’ll encounter and need to clean:

# Example 1: Wide sales data (common in spreadsheets)
sales_wide <- tibble(
  sales_rep = c("Alice", "Bob", "Charlie", "Diana"),
  region = c("North", "South", "East", "West"),
  Q1_2023 = c(45000, 52000, 48000, 51000),
  Q2_2023 = c(47000, 54000, 50000, 53000),
  Q3_2023 = c(49000, 56000, 52000, 55000),
  Q4_2023 = c(51000, 58000, 54000, 57000),
  Q1_2024 = c(53000, 60000, 56000, 59000),
  Q2_2024 = c(55000, 62000, 58000, 61000)
)

cat("Example 1: Wide sales data (quarters as columns)\n")

Example 1: Wide sales data (quarters as columns)

print(sales_wide)

# A tibble: 4 × 8
  sales_rep region Q1_2023 Q2_2023 Q3_2023 Q4_2023 Q1_2024 Q2_2024
  <chr>     <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 Alice     North    45000   47000   49000   51000   53000   55000
2 Bob       South    52000   54000   56000   58000   60000   62000
3 Charlie   East     48000   50000   52000   54000   56000   58000
4 Diana     West     51000   53000   55000   57000   59000   61000

# Example 2: Multiple variables in column names
survey_wide <- tibble(
  respondent_id = 1:4,
  age = c(25, 34, 28, 45),
  satisfaction_pre = c(3, 4, 2, 5),
  satisfaction_post = c(4, 5, 4, 5),
  recommendation_pre = c(2, 3, 2, 4),
  recommendation_post = c(4, 4, 3, 5)
)

cat("\nExample 2: Pre/post measurements in column names\n")


Example 2: Pre/post measurements in column names

print(survey_wide)

# A tibble: 4 × 6
  respondent_id   age satisfaction_pre satisfaction_post recommendation_pre
          <int> <dbl>            <dbl>             <dbl>              <dbl>
1             1    25                3                 4                  2
2             2    34                4                 5                  3
3             3    28                2                 4                  2
4             4    45                5                 5                  4
# ℹ 1 more variable: recommendation_post <dbl>

# Example 3: Multiple values in single cells
contact_messy <- tibble(
  id = 1:4,
  name = c("John Doe", "Jane Smith", "Bob Johnson", "Alice Brown"),
  contact = c("john@email.com; 555-1234",
              "jane@company.org; 555-5678; LinkedIn: jsmith",
              "bob.j@domain.com",
              "alice@startup.io; 555-9999; Twitter: @alice_b"),
  skills = c("R, Python, SQL", "Excel, Tableau, SQL", "Python, Java", "R, Statistics, Machine Learning")
)

cat("\nExample 3: Multiple values in single cells\n")


Example 3: Multiple values in single cells

print(contact_messy)

# A tibble: 4 × 4
     id name        contact                                       skills        
  <int> <chr>       <chr>                                         <chr>         
1     1 John Doe    john@email.com; 555-1234                      R, Python, SQL
2     2 Jane Smith  jane@company.org; 555-5678; LinkedIn: jsmith  Excel, Tablea…
3     3 Bob Johnson bob.j@domain.com                              Python, Java  
4     4 Alice Brown alice@startup.io; 555-9999; Twitter: @alice_b R, Statistics…

# Example 4: Missing structure (common in real data)
sales_messy <- tibble(
  entry = 1:8,
  data = c("Product: Widget A", "Q1: 1000", "Q2: 1200", "Q3: 1100",
           "Product: Widget B", "Q1: 800", "Q2: 950", "Q3: 900")
)

cat("\nExample 4: Unstructured data that needs parsing\n")


Example 4: Unstructured data that needs parsing

print(sales_messy)

# A tibble: 8 × 2
  entry data             
  <int> <chr>            
1     1 Product: Widget A
2     2 Q1: 1000         
3     3 Q2: 1200         
4     4 Q3: 1100         
5     5 Product: Widget B
6     6 Q1: 800          
7     7 Q2: 950          
8     8 Q3: 900

These examples represent real challenges you’ll face. By the end of this module, you’ll be able to transform all of these into clean, tidy formats!

The Power of Tidy Data

Let’s see the dramatic difference between messy and tidy data for analysis:

# Transform the wide sales data to tidy format
sales_tidy <- sales_wide %>%
  pivot_longer(
    cols = starts_with("Q"),
    names_to = "quarter_year",
    values_to = "sales"
  ) %>%
  separate(quarter_year, into = c("quarter", "year"), sep = "_") %>%
  mutate(
    year = as.numeric(year),
    date = ymd(paste(year, case_when(
      quarter == "Q1" ~ "03-31",
      quarter == "Q2" ~ "06-30",
      quarter == "Q3" ~ "09-30",
      quarter == "Q4" ~ "12-31"
    )))
  )

cat("Transformed to tidy format:\n")

Transformed to tidy format:

print(head(sales_tidy, 10))

# A tibble: 10 × 6
   sales_rep region quarter  year sales date      
   <chr>     <chr>  <chr>   <dbl> <dbl> <date>    
 1 Alice     North  Q1       2023 45000 2023-03-31
 2 Alice     North  Q2       2023 47000 2023-06-30
 3 Alice     North  Q3       2023 49000 2023-09-30
 4 Alice     North  Q4       2023 51000 2023-12-31
 5 Alice     North  Q1       2024 53000 2024-03-31
 6 Alice     North  Q2       2024 55000 2024-06-30
 7 Bob       South  Q1       2023 52000 2023-03-31
 8 Bob       South  Q2       2023 54000 2023-06-30
 9 Bob       South  Q3       2023 56000 2023-09-30
10 Bob       South  Q4       2023 58000 2023-12-31

# Now analysis becomes straightforward
cat("\nWith tidy data, analysis is simple:\n")


With tidy data, analysis is simple:

# Growth analysis
growth_analysis <- sales_tidy %>%
  arrange(sales_rep, date) %>%
  group_by(sales_rep) %>%
  mutate(
    previous_sales = lag(sales),
    growth_rate = (sales - previous_sales) / previous_sales * 100
  ) %>%
  filter(!is.na(growth_rate))

cat("Average quarterly growth by rep:\n")

Average quarterly growth by rep:

growth_analysis %>%
  group_by(sales_rep) %>%
  summarise(avg_growth = round(mean(growth_rate, na.rm = TRUE), 1), .groups = "drop") %>%
  arrange(desc(avg_growth)) %>%
  print()

# A tibble: 4 × 2
  sales_rep avg_growth
  <chr>          <dbl>
1 Alice            4.1
2 Charlie          3.9
3 Bob              3.6
4 Diana            3.6

# Regional performance
cat("\nBest performing region by year:\n")


Best performing region by year:

sales_tidy %>%
  group_by(region, year) %>%
  summarise(total_sales = sum(sales), .groups = "drop") %>%
  group_by(year) %>%
  slice_max(total_sales, n = 1) %>%
  print()

# A tibble: 2 × 3
# Groups:   year [2]
  region  year total_sales
  <chr>  <dbl>       <dbl>
1 South   2023      220000
2 South   2024      122000

This transformation from wide to long format makes complex analyses straightforward!

Module Structure

This module is organized into four comprehensive sections:

Tidy Data Principles: Master the fundamentals of tidy data
Pivoting Data: Reshape between wide and long formats
Separating and Uniting: Split and combine columns
Nested Data and Rectangling: Work with complex structures

Each section includes: - Core concepts with clear explanations - Before-and-after transformation examples - Real-world data scenarios and solutions - Integration with dplyr for complete workflows - Performance tips and best practices

Common Data Shapes and When to Use Them

Understanding when to use different data shapes is crucial:

# Wide format: Good for human reading, reporting
cat("WIDE FORMAT - Good for:\n")

WIDE FORMAT - Good for:

cat("- Spreadsheet-style reports\n")

- Spreadsheet-style reports

cat("- Side-by-side comparisons\n")

- Side-by-side comparisons

cat("- Correlation matrices\n")

- Correlation matrices

cat("- Human-readable summaries\n\n")

- Human-readable summaries

# Long format: Good for analysis, visualization
cat("LONG FORMAT - Good for:\n")

LONG FORMAT - Good for:

cat("- ggplot2 visualizations\n")

- ggplot2 visualizations

cat("- Statistical modeling\n")

- Statistical modeling

cat("- Group-by operations\n")

- Group-by operations

cat("- Time series analysis\n\n")

- Time series analysis

# Demonstrate with a simple example
demo_wide <- tibble(
  person = c("Alice", "Bob"),
  height = c(165, 180),
  weight = c(60, 75)
)

demo_long <- demo_wide %>%
  pivot_longer(cols = c(height, weight), names_to = "measurement", values_to = "value")

cat("Wide format (2 rows, 3 columns):\n")

Wide format (2 rows, 3 columns):

print(demo_wide)

# A tibble: 2 × 3
  person height weight
  <chr>   <dbl>  <dbl>
1 Alice     165     60
2 Bob       180     75

cat("\nLong format (4 rows, 3 columns):\n")


Long format (4 rows, 3 columns):

print(demo_long)

# A tibble: 4 × 3
  person measurement value
  <chr>  <chr>       <dbl>
1 Alice  height        165
2 Alice  weight         60
3 Bob    height        180
4 Bob    weight         75

cat("\nUse wide for comparison, long for analysis!\n")


Use wide for comparison, long for analysis!

Integration with the Tidyverse Ecosystem

tidyr works seamlessly with other tidyverse packages:

# Complete data cleaning pipeline
messy_sales <- tibble(
  rep_region = c("Alice_North", "Bob_South", "Charlie_East"),
  jan_feb = c("1000,1200", "800,900", "1100,1050"),
  mar_apr = c("1300,1400", "950,1000", "1200,1250"),
  contact = c("alice@co.com", "bob@co.com", "charlie@co.com")
)

cat("Starting with messy data:\n")

Starting with messy data:

print(messy_sales)

# A tibble: 3 × 4
  rep_region   jan_feb   mar_apr   contact       
  <chr>        <chr>     <chr>     <chr>         
1 Alice_North  1000,1200 1300,1400 alice@co.com  
2 Bob_South    800,900   950,1000  bob@co.com    
3 Charlie_East 1100,1050 1200,1250 charlie@co.com

# Complete tidying pipeline
clean_sales <- messy_sales %>%
  # Step 1: Separate rep and region
  separate(rep_region, into = c("sales_rep", "region"), sep = "_") %>%

  # Step 2: Pivot longer to get month pairs in rows
  pivot_longer(
    cols = c(jan_feb, mar_apr),
    names_to = "month_pair",
    values_to = "sales_pair"
  ) %>%

  # Step 3: Separate the sales pairs
  separate(sales_pair, into = c("month1_sales", "month2_sales"), sep = ",") %>%

  # Step 4: Convert to proper structure
  mutate(
    month1_sales = as.numeric(month1_sales),
    month2_sales = as.numeric(month2_sales)
  ) %>%

  # Step 5: Pivot longer again to get one row per month
  pivot_longer(
    cols = c(month1_sales, month2_sales),
    names_to = "month_type",
    values_to = "sales"
  ) %>%

  # Step 6: Clean up month names with dplyr
  mutate(
    month = case_when(
      month_pair == "jan_feb" & month_type == "month1_sales" ~ "January",
      month_pair == "jan_feb" & month_type == "month2_sales" ~ "February",
      month_pair == "mar_apr" & month_type == "month1_sales" ~ "March",
      month_pair == "mar_apr" & month_type == "month2_sales" ~ "April"
    )
  ) %>%

  # Step 7: Select final columns
  select(sales_rep, region, month, sales, contact) %>%
  arrange(sales_rep, match(month, c("January", "February", "March", "April")))

cat("\nAfter complete tidying pipeline:\n")


After complete tidying pipeline:

print(clean_sales)

# A tibble: 12 × 5
   sales_rep region month    sales contact       
   <chr>     <chr>  <chr>    <dbl> <chr>         
 1 Alice     North  January   1000 alice@co.com  
 2 Alice     North  February  1200 alice@co.com  
 3 Alice     North  March     1300 alice@co.com  
 4 Alice     North  April     1400 alice@co.com  
 5 Bob       South  January    800 bob@co.com    
 6 Bob       South  February   900 bob@co.com    
 7 Bob       South  March      950 bob@co.com    
 8 Bob       South  April     1000 bob@co.com    
 9 Charlie   East   January   1100 charlie@co.com
10 Charlie   East   February  1050 charlie@co.com
11 Charlie   East   March     1200 charlie@co.com
12 Charlie   East   April     1250 charlie@co.com

cat("\nNow ready for analysis with dplyr, ggplot2, etc.!\n")


Now ready for analysis with dplyr, ggplot2, etc.!

This shows the power of combining tidyr with dplyr for complete data transformation workflows.

Prerequisites

Before starting this module, make sure you have:

Completed Module 3 (Tidyverse Introduction) and Module 5 (Data Wrangling with dplyr)
Understanding of the pipe operator and dplyr verbs
Basic knowledge of data types and tibbles
Familiarity with the concept of “tidy data”

Getting Started

Ready to master data reshaping and tidying? Let’s begin with the fundamental principles: Tidy Data Principles

Summary

Data tidying with tidyr is both an art and a science. The key principles to remember:

Think tidy first: Always consider what the final tidy structure should look like
One step at a time: Complex transformations are easier when broken into steps
Combine tools: tidyr + dplyr creates powerful data cleaning pipelines
Document your logic: Reshaping operations can be complex - comment your code
Test incrementally: Verify each transformation step before moving to the next

tidyr transforms chaotic, real-world data into the clean, consistent format that enables powerful analysis. With these skills, no data structure challenge will be too complex to handle!

Let’s begin your journey to data tidying mastery! 🧹📊