Tidy Data Principles

Author

IND215

Published

September 22, 2025

Introduction to Tidy Data

Tidy data is a consistent way to organize data that makes analysis easier and more intuitive. The concept, formalized by Hadley Wickham, provides a standard way to structure datasets that works seamlessly with the tidyverse ecosystem.

library(tidyverse)

cat("The three fundamental rules of tidy data:\n")
The three fundamental rules of tidy data:
cat("1. Each variable forms a column\n")
1. Each variable forms a column
cat("2. Each observation forms a row\n")
2. Each observation forms a row
cat("3. Each type of observational unit forms a table\n")
3. Each type of observational unit forms a table

The Three Rules of Tidy Data

Rule 1: Each Variable Forms a Column

A variable is something you can measure, count, or describe. Each variable should have its own column.

# TIDY: Each variable (name, age, height) has its own column
tidy_people <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  height = c(165, 180, 175)
)

cat("TIDY - Each variable in its own column:\n")
TIDY - Each variable in its own column:
print(tidy_people)
# A tibble: 3 × 3
  name      age height
  <chr>   <dbl>  <dbl>
1 Alice      25    165
2 Bob        30    180
3 Charlie    35    175
# MESSY: Multiple variables squeezed into one column
messy_people <- tibble(
  person = c("Alice", "Bob", "Charlie"),
  age_height = c("25_165", "30_180", "35_175")
)

cat("\nMESSY - Multiple variables in one column:\n")

MESSY - Multiple variables in one column:
print(messy_people)
# A tibble: 3 × 2
  person  age_height
  <chr>   <chr>     
1 Alice   25_165    
2 Bob     30_180    
3 Charlie 35_175    
# MESSY: Variable names as data
messy_measurements <- tibble(
  person = c("Alice", "Alice", "Bob", "Bob"),
  measurement = c("age", "height", "age", "height"),
  value = c(25, 165, 30, 180)
)

cat("\nMESSY - Variable names stored as data:\n")

MESSY - Variable names stored as data:
print(messy_measurements)
# A tibble: 4 × 3
  person measurement value
  <chr>  <chr>       <dbl>
1 Alice  age            25
2 Alice  height        165
3 Bob    age            30
4 Bob    height        180

Rule 2: Each Observation Forms a Row

An observation is all the values measured on the same unit (person, company, country) at the same time.

# TIDY: Each person (observation) gets one row
tidy_sales <- tibble(
  sales_rep = c("Alice", "Bob", "Charlie"),
  region = c("North", "South", "East"),
  Q1_sales = c(10000, 12000, 11000),
  Q2_sales = c(11000, 13000, 12000)
)

cat("TIDY - Each sales rep (observation) in one row:\n")
TIDY - Each sales rep (observation) in one row:
print(tidy_sales)
# A tibble: 3 × 4
  sales_rep region Q1_sales Q2_sales
  <chr>     <chr>     <dbl>    <dbl>
1 Alice     North     10000    11000
2 Bob       South     12000    13000
3 Charlie   East      11000    12000
# MESSY: Observations split across multiple rows
messy_sales_split <- tibble(
  rep_quarter = c("Alice_Q1", "Alice_Q2", "Bob_Q1", "Bob_Q2"),
  region = c("North", "North", "South", "South"),
  sales = c(10000, 11000, 12000, 13000)
)

cat("\nMESSY - Observations split across rows:\n")

MESSY - Observations split across rows:
print(messy_sales_split)
# A tibble: 4 × 3
  rep_quarter region sales
  <chr>       <chr>  <dbl>
1 Alice_Q1    North  10000
2 Alice_Q2    North  11000
3 Bob_Q1      South  12000
4 Bob_Q2      South  13000
# MESSY: Multiple observations in one row
messy_sales_combined <- tibble(
  quarter = c("Q1", "Q2"),
  alice_north = c(10000, 11000),
  bob_south = c(12000, 13000),
  charlie_east = c(11000, 12000)
)

cat("\nMESSY - Multiple observations in one row:\n")

MESSY - Multiple observations in one row:
print(messy_sales_combined)
# A tibble: 2 × 4
  quarter alice_north bob_south charlie_east
  <chr>         <dbl>     <dbl>        <dbl>
1 Q1            10000     12000        11000
2 Q2            11000     13000        12000

Rule 3: Each Type of Observational Unit Forms a Table

Different types of data should be stored in separate tables. For example, customer data and order data are different types of observational units.

# TIDY: Separate tables for different observational units
customers <- tibble(
  customer_id = c("C001", "C002", "C003"),
  customer_name = c("Alice Corp", "Bob Ltd", "Charlie Inc"),
  industry = c("Tech", "Finance", "Healthcare")
)

orders <- tibble(
  order_id = c("O001", "O002", "O003", "O004"),
  customer_id = c("C001", "C002", "C001", "C003"),
  order_date = ymd(c("2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18")),
  amount = c(1000, 1500, 800, 2000)
)

cat("TIDY - Customers table:\n")
TIDY - Customers table:
print(customers)
# A tibble: 3 × 3
  customer_id customer_name industry  
  <chr>       <chr>         <chr>     
1 C001        Alice Corp    Tech      
2 C002        Bob Ltd       Finance   
3 C003        Charlie Inc   Healthcare
cat("\nTIDY - Orders table:\n")

TIDY - Orders table:
print(orders)
# A tibble: 4 × 4
  order_id customer_id order_date amount
  <chr>    <chr>       <date>      <dbl>
1 O001     C001        2024-01-15   1000
2 O002     C002        2024-01-16   1500
3 O003     C001        2024-01-17    800
4 O004     C003        2024-01-18   2000
# MESSY: Everything mashed together
messy_combined <- tibble(
  order_id = c("O001", "O002", "O003", "O004"),
  customer_id = c("C001", "C002", "C001", "C003"),
  customer_name = c("Alice Corp", "Bob Ltd", "Alice Corp", "Charlie Inc"),
  industry = c("Tech", "Finance", "Tech", "Healthcare"),
  order_date = ymd(c("2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18")),
  amount = c(1000, 1500, 800, 2000)
)

cat("\nMESSY - Everything in one table (redundant customer info):\n")

MESSY - Everything in one table (redundant customer info):
print(messy_combined)
# A tibble: 4 × 6
  order_id customer_id customer_name industry   order_date amount
  <chr>    <chr>       <chr>         <chr>      <date>      <dbl>
1 O001     C001        Alice Corp    Tech       2024-01-15   1000
2 O002     C002        Bob Ltd       Finance    2024-01-16   1500
3 O003     C001        Alice Corp    Tech       2024-01-17    800
4 O004     C003        Charlie Inc   Healthcare 2024-01-18   2000

Common Messy Data Patterns

Pattern 1: Column Headers are Values

This is one of the most common problems in real-world data.

# Common in spreadsheets: Years as column headers
messy_population <- tibble(
  country = c("USA", "Canada", "Mexico"),
  `2020` = c(331, 38, 129),
  `2021` = c(332, 38, 130),
  `2022` = c(333, 39, 131),
  `2023` = c(334, 39, 132)
)

cat("MESSY - Years as column headers:\n")
MESSY - Years as column headers:
print(messy_population)
# A tibble: 3 × 5
  country `2020` `2021` `2022` `2023`
  <chr>    <dbl>  <dbl>  <dbl>  <dbl>
1 USA        331    332    333    334
2 Canada      38     38     39     39
3 Mexico     129    130    131    132
# How to identify this pattern:
cat("\nHow to identify this pattern:\n")

How to identify this pattern:
cat("- Column names represent values of a variable (years, quarters, treatments)\n")
- Column names represent values of a variable (years, quarters, treatments)
cat("- You see repeated patterns in column names\n")
- You see repeated patterns in column names
cat("- Data is 'wider' than it should be\n")
- Data is 'wider' than it should be
# What the tidy version should look like:
tidy_population <- messy_population %>%
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "population_millions"
  ) %>%
  mutate(year = as.numeric(year))

cat("\nTIDY version:\n")

TIDY version:
print(tidy_population)
# A tibble: 12 × 3
   country  year population_millions
   <chr>   <dbl>               <dbl>
 1 USA      2020                 331
 2 USA      2021                 332
 3 USA      2022                 333
 4 USA      2023                 334
 5 Canada   2020                  38
 6 Canada   2021                  38
 7 Canada   2022                  39
 8 Canada   2023                  39
 9 Mexico   2020                 129
10 Mexico   2021                 130
11 Mexico   2022                 131
12 Mexico   2023                 132

Pattern 2: Multiple Variables in Column Names

Column names often encode multiple pieces of information.

# Treatment and measurement type encoded in column names
messy_experiment <- tibble(
  subject_id = 1:3,
  treatment_A_weight = c(70, 75, 68),
  treatment_A_height = c(170, 180, 165),
  treatment_B_weight = c(71, 76, 69),
  treatment_B_height = c(171, 181, 166)
)

cat("MESSY - Treatment and measurement in column names:\n")
MESSY - Treatment and measurement in column names:
print(messy_experiment)
# A tibble: 3 × 5
  subject_id treatment_A_weight treatment_A_height treatment_B_weight
       <int>              <dbl>              <dbl>              <dbl>
1          1                 70                170                 71
2          2                 75                180                 76
3          3                 68                165                 69
# ℹ 1 more variable: treatment_B_height <dbl>
cat("\nHow to identify this pattern:\n")

How to identify this pattern:
cat("- Column names contain underscores or separators\n")
- Column names contain underscores or separators
cat("- Multiple pieces of info encoded in names\n")
- Multiple pieces of info encoded in names
cat("- Similar patterns repeated across columns\n")
- Similar patterns repeated across columns
# Tidy approach: separate the information
tidy_experiment <- messy_experiment %>%
  pivot_longer(
    cols = -subject_id,
    names_to = "measurement_info",
    values_to = "value"
  ) %>%
  separate(measurement_info,
           into = c("treatment_label", "treatment", "measurement"),
           sep = "_") %>%
  select(-treatment_label) %>%  # Remove redundant column
  pivot_wider(names_from = measurement, values_from = value)

cat("\nTIDY version:\n")

TIDY version:
print(tidy_experiment)
# A tibble: 6 × 4
  subject_id treatment weight height
       <int> <chr>      <dbl>  <dbl>
1          1 A             70    170
2          1 B             71    171
3          2 A             75    180
4          2 B             76    181
5          3 A             68    165
6          3 B             69    166

Pattern 3: Variables Stored in Rows and Columns

Sometimes data is stored in a matrix-like format where both rows and columns contain variable information.

# Sales data with products as rows and quarters as columns
messy_sales_matrix <- tibble(
  product = c("Widget A", "Widget B", "Widget C"),
  Q1_North = c(100, 150, 120),
  Q1_South = c(90, 140, 110),
  Q2_North = c(110, 160, 130),
  Q2_South = c(95, 145, 115)
)

cat("MESSY - Products in rows, quarters and regions in columns:\n")
MESSY - Products in rows, quarters and regions in columns:
print(messy_sales_matrix)
# A tibble: 3 × 5
  product  Q1_North Q1_South Q2_North Q2_South
  <chr>       <dbl>    <dbl>    <dbl>    <dbl>
1 Widget A      100       90      110       95
2 Widget B      150      140      160      145
3 Widget C      120      110      130      115
# Tidy version needs multiple steps
tidy_sales_matrix <- messy_sales_matrix %>%
  pivot_longer(
    cols = -product,
    names_to = "quarter_region",
    values_to = "sales"
  ) %>%
  separate(quarter_region, into = c("quarter", "region"), sep = "_")

cat("\nTIDY version:\n")

TIDY version:
print(tidy_sales_matrix)
# A tibble: 12 × 4
   product  quarter region sales
   <chr>    <chr>   <chr>  <dbl>
 1 Widget A Q1      North    100
 2 Widget A Q1      South     90
 3 Widget A Q2      North    110
 4 Widget A Q2      South     95
 5 Widget B Q1      North    150
 6 Widget B Q1      South    140
 7 Widget B Q2      North    160
 8 Widget B Q2      South    145
 9 Widget C Q1      North    120
10 Widget C Q1      South    110
11 Widget C Q2      North    130
12 Widget C Q2      South    115

Pattern 4: Multiple Observational Units in One Table

Different types of information mixed together.

# Student and course information mixed
messy_grades <- tibble(
  record_id = 1:4,
  student_name = c("Alice", "Alice", "Bob", "Bob"),
  student_age = c(20, 20, 22, 22),
  student_major = c("Math", "Math", "Physics", "Physics"),
  course_name = c("Calculus", "Statistics", "Calculus", "Physics"),
  course_credits = c(3, 3, 3, 4),
  grade = c("A", "B", "A-", "B+")
)

cat("MESSY - Student and course info mixed:\n")
MESSY - Student and course info mixed:
print(messy_grades)
# A tibble: 4 × 7
  record_id student_name student_age student_major course_name course_credits
      <int> <chr>              <dbl> <chr>         <chr>                <dbl>
1         1 Alice                 20 Math          Calculus                 3
2         2 Alice                 20 Math          Statistics               3
3         3 Bob                   22 Physics       Calculus                 3
4         4 Bob                   22 Physics       Physics                  4
# ℹ 1 more variable: grade <chr>
# Should be separate tables
students <- messy_grades %>%
  distinct(student_name, student_age, student_major)

courses <- messy_grades %>%
  distinct(course_name, course_credits)

enrollments <- messy_grades %>%
  select(student_name, course_name, grade)

cat("\nTIDY - Students table:\n")

TIDY - Students table:
print(students)
# A tibble: 2 × 3
  student_name student_age student_major
  <chr>              <dbl> <chr>        
1 Alice                 20 Math         
2 Bob                   22 Physics      
cat("\nTIDY - Courses table:\n")

TIDY - Courses table:
print(courses)
# A tibble: 3 × 2
  course_name course_credits
  <chr>                <dbl>
1 Calculus                 3
2 Statistics               3
3 Physics                  4
cat("\nTIDY - Enrollments table:\n")

TIDY - Enrollments table:
print(enrollments)
# A tibble: 4 × 3
  student_name course_name grade
  <chr>        <chr>       <chr>
1 Alice        Calculus    A    
2 Alice        Statistics  B    
3 Bob          Calculus    A-   
4 Bob          Physics     B+   

Diagnosing Data Structure Problems

Quick Diagnostic Questions

diagnostic_checklist <- function(data) {
  cat("TIDY DATA DIAGNOSTIC CHECKLIST\n")
  cat("==============================\n\n")

  cat("1. COLUMNS (Variables):\n")
  cat("   □ Does each column represent exactly one variable?\n")
  cat("   □ Are column names variable names (not values)?\n")
  cat("   □ Do you see patterns like '2020', '2021' in column names?\n")
  cat("   □ Do column names contain multiple pieces of info?\n\n")

  cat("2. ROWS (Observations):\n")
  cat("   □ Does each row represent exactly one observation?\n")
  cat("   □ Are observations split across multiple rows?\n")
  cat("   □ Are multiple observations squeezed into one row?\n\n")

  cat("3. CELLS (Values):\n")
  cat("   □ Does each cell contain exactly one value?\n")
  cat("   □ Are multiple values separated by commas/semicolons?\n")
  cat("   □ Are values and variable names mixed in cells?\n\n")

  cat("4. TABLES (Observational Units):\n")
  cat("   □ Does this table mix different types of entities?\n")
  cat("   □ Is information duplicated across rows?\n")
  cat("   □ Should this be split into multiple tables?\n\n")

  # Basic analysis of the provided data
  cat("QUICK ANALYSIS OF YOUR DATA:\n")
  cat("============================\n")
  cat("Dimensions:", nrow(data), "rows ×", ncol(data), "columns\n")
  cat("Column names:", paste(names(data), collapse = ", "), "\n")

  # Check for common patterns
  col_names <- names(data)
  if (any(str_detect(col_names, "\\d{4}"))) {
    cat("⚠️  WARNING: Found years in column names - might need pivot_longer()\n")
  }
  if (any(str_detect(col_names, "_"))) {
    cat("⚠️  WARNING: Found underscores in column names - might need separate()\n")
  }
  if (length(unique(apply(data, 1, paste, collapse = ""))) < nrow(data)) {
    cat("⚠️  WARNING: Found duplicate rows - check for repeated observations\n")
  }
}

# Example usage
sample_messy <- tibble(
  id = 1:3,
  name = c("Alice", "Bob", "Charlie"),
  `2023_sales` = c(1000, 1200, 1100),
  `2024_sales` = c(1100, 1300, 1200)
)

diagnostic_checklist(sample_messy)
TIDY DATA DIAGNOSTIC CHECKLIST
==============================

1. COLUMNS (Variables):
   □ Does each column represent exactly one variable?
   □ Are column names variable names (not values)?
   □ Do you see patterns like '2020', '2021' in column names?
   □ Do column names contain multiple pieces of info?

2. ROWS (Observations):
   □ Does each row represent exactly one observation?
   □ Are observations split across multiple rows?
   □ Are multiple observations squeezed into one row?

3. CELLS (Values):
   □ Does each cell contain exactly one value?
   □ Are multiple values separated by commas/semicolons?
   □ Are values and variable names mixed in cells?

4. TABLES (Observational Units):
   □ Does this table mix different types of entities?
   □ Is information duplicated across rows?
   □ Should this be split into multiple tables?

QUICK ANALYSIS OF YOUR DATA:
============================
Dimensions: 3 rows × 4 columns
Column names: id, name, 2023_sales, 2024_sales 
⚠️  WARNING: Found years in column names - might need pivot_longer()
⚠️  WARNING: Found underscores in column names - might need separate()

Visual Pattern Recognition

# Function to visualize data structure issues
show_data_problems <- function() {
  cat("COMMON MESSY DATA PATTERNS - VISUAL GUIDE\n")
  cat("=========================================\n\n")

  # Pattern 1: Too wide
  cat("PATTERN 1: TOO WIDE (headers are values)\n")
  cat("┌─────────┬─────┬─────┬─────┬─────┐\n")
  cat("│ country │2020 │2021 │2022 │2023 │\n")
  cat("├─────────┼─────┼─────┼─────┼─────┤\n")
  cat("│ USA     │ 331 │ 332 │ 333 │ 334 │\n")
  cat("│ Canada  │  38 │  38 │  39 │  39 │\n")
  cat("└─────────┴─────┴─────┴─────┴─────┘\n")
  cat("SHOULD BE: pivot_longer() to make it tall\n\n")

  # Pattern 2: Multiple variables in names
  cat("PATTERN 2: MULTIPLE VARIABLES IN COLUMN NAMES\n")
  cat("┌─────┬──────────┬──────────┬──────────┐\n")
  cat("│ id  │treat_A_pre│treat_A_post│treat_B_pre│\n")
  cat("├─────┼──────────┼──────────┼──────────┤\n")
  cat("│  1  │    70    │    75    │    72    │\n")
  cat("└─────┴──────────┴──────────┴──────────┘\n")
  cat("SHOULD BE: pivot_longer() then separate()\n\n")

  # Pattern 3: Multiple values in cells
  cat("PATTERN 3: MULTIPLE VALUES IN CELLS\n")
  cat("┌─────────┬─────────────┬─────────────┐\n")
  cat("│ person  │   contact   │   skills    │\n")
  cat("├─────────┼─────────────┼─────────────┤\n")
  cat("│ Alice   │email;phone  │ R,Python    │\n")
  cat("│ Bob     │email;phone  │Excel,SQL    │\n")
  cat("└─────────┴─────────────┴─────────────┘\n")
  cat("SHOULD BE: separate() into multiple columns\n\n")

  # Pattern 4: Variables as rows
  cat("PATTERN 4: VARIABLES STORED AS ROWS\n")
  cat("┌─────────┬─────────┬───────┐\n")
  cat("│ person  │variable │ value │\n")
  cat("├─────────┼─────────┼───────┤\n")
  cat("│ Alice   │ height  │  165  │\n")
  cat("│ Alice   │ weight  │   60  │\n")
  cat("│ Bob     │ height  │  180  │\n")
  cat("│ Bob     │ weight  │   75  │\n")
  cat("└─────────┴─────────┴───────┘\n")
  cat("SHOULD BE: pivot_wider() to make columns\n\n")
}

show_data_problems()
COMMON MESSY DATA PATTERNS - VISUAL GUIDE
=========================================

PATTERN 1: TOO WIDE (headers are values)
┌─────────┬─────┬─────┬─────┬─────┐
│ country │2020 │2021 │2022 │2023 │
├─────────┼─────┼─────┼─────┼─────┤
│ USA     │ 331 │ 332 │ 333 │ 334 │
│ Canada  │  38 │  38 │  39 │  39 │
└─────────┴─────┴─────┴─────┴─────┘
SHOULD BE: pivot_longer() to make it tall

PATTERN 2: MULTIPLE VARIABLES IN COLUMN NAMES
┌─────┬──────────┬──────────┬──────────┐
│ id  │treat_A_pre│treat_A_post│treat_B_pre│
├─────┼──────────┼──────────┼──────────┤
│  1  │    70    │    75    │    72    │
└─────┴──────────┴──────────┴──────────┘
SHOULD BE: pivot_longer() then separate()

PATTERN 3: MULTIPLE VALUES IN CELLS
┌─────────┬─────────────┬─────────────┐
│ person  │   contact   │   skills    │
├─────────┼─────────────┼─────────────┤
│ Alice   │email;phone  │ R,Python    │
│ Bob     │email;phone  │Excel,SQL    │
└─────────┴─────────────┴─────────────┘
SHOULD BE: separate() into multiple columns

PATTERN 4: VARIABLES STORED AS ROWS
┌─────────┬─────────┬───────┐
│ person  │variable │ value │
├─────────┼─────────┼───────┤
│ Alice   │ height  │  165  │
│ Alice   │ weight  │   60  │
│ Bob     │ height  │  180  │
│ Bob     │ weight  │   75  │
└─────────┴─────────┴───────┘
SHOULD BE: pivot_wider() to make columns

Planning Your Tidying Strategy

Step-by-Step Planning Process

# Function to help plan tidying strategy
plan_tidying <- function(data, target_description = NULL) {
  cat("TIDYING STRATEGY PLANNER\n")
  cat("========================\n\n")

  cat("STEP 1: UNDERSTAND YOUR CURRENT DATA\n")
  cat("Current dimensions:", nrow(data), "rows ×", ncol(data), "columns\n")
  cat("Current columns:", paste(names(data), collapse = ", "), "\n\n")

  cat("STEP 2: ENVISION YOUR TARGET STRUCTURE\n")
  if (!is.null(target_description)) {
    cat("Target:", target_description, "\n")
  } else {
    cat("Ask yourself:\n")
    cat("- What should each row represent?\n")
    cat("- What variables do I need as columns?\n")
    cat("- What is the observational unit?\n")
  }
  cat("\n")

  cat("STEP 3: IDENTIFY REQUIRED TRANSFORMATIONS\n")

  # Check for common patterns
  col_names <- names(data)

  if (any(str_detect(col_names, "\\d{4}"))) {
    cat("📋 NEEDS: pivot_longer() - Found years/dates in column names\n")
  }

  if (any(str_detect(col_names, "_.*_"))) {
    cat("📋 NEEDS: separate() - Found multiple variables in column names\n")
  }

  # Check for values that look like they contain multiple items
  char_cols <- data %>% select(where(is.character))
  if (ncol(char_cols) > 0) {
    multi_value_check <- char_cols %>%
      summarise(across(everything(), ~ any(str_detect(.x, "[,;|]"), na.rm = TRUE)))

    if (any(multi_value_check)) {
      cat("📋 NEEDS: separate() - Found multiple values in cells\n")
    }
  }

  cat("\n")

  cat("STEP 4: PLAN THE SEQUENCE\n")
  cat("1. Start with the most fundamental transformation\n")
  cat("2. Work from general to specific\n")
  cat("3. Clean up column names and types at the end\n\n")

  cat("STEP 5: VERIFY EACH STEP\n")
  cat("- Check dimensions after each transformation\n")
  cat("- Verify that you haven't lost or duplicated data\n")
  cat("- Test with a small sample first\n")
}

# Example usage
messy_survey <- tibble(
  respondent = 1:3,
  Q1_pre = c(3, 4, 2),
  Q1_post = c(4, 5, 4),
  Q2_pre = c(2, 3, 2),
  Q2_post = c(3, 4, 3),
  demographics = c("25_Male", "34_Female", "28_Male")
)

plan_tidying(messy_survey, "One row per respondent-question-time combination")
TIDYING STRATEGY PLANNER
========================

STEP 1: UNDERSTAND YOUR CURRENT DATA
Current dimensions: 3 rows × 6 columns
Current columns: respondent, Q1_pre, Q1_post, Q2_pre, Q2_post, demographics 

STEP 2: ENVISION YOUR TARGET STRUCTURE
Target: One row per respondent-question-time combination 

STEP 3: IDENTIFY REQUIRED TRANSFORMATIONS

STEP 4: PLAN THE SEQUENCE
1. Start with the most fundamental transformation
2. Work from general to specific
3. Clean up column names and types at the end

STEP 5: VERIFY EACH STEP
- Check dimensions after each transformation
- Verify that you haven't lost or duplicated data
- Test with a small sample first

Real-World Tidying Examples

Example 1: Financial Data

# Typical financial data from spreadsheets
financial_messy <- tibble(
  company = c("Apple", "Google", "Microsoft"),
  `2021_Revenue` = c(365.8, 257.6, 168.1),
  `2021_Profit` = c(94.7, 76.0, 61.3),
  `2022_Revenue` = c(394.3, 282.8, 198.3),
  `2022_Profit` = c(99.8, 82.5, 72.7),
  `2023_Revenue` = c(383.3, 307.4, 211.9),
  `2023_Profit` = c(97.0, 73.8, 72.4)
)

cat("MESSY: Financial data with years and metrics in column names\n")
MESSY: Financial data with years and metrics in column names
print(financial_messy)
# A tibble: 3 × 7
  company   `2021_Revenue` `2021_Profit` `2022_Revenue` `2022_Profit`
  <chr>              <dbl>         <dbl>          <dbl>         <dbl>
1 Apple               366.          94.7           394.          99.8
2 Google              258.          76             283.          82.5
3 Microsoft           168.          61.3           198.          72.7
# ℹ 2 more variables: `2023_Revenue` <dbl>, `2023_Profit` <dbl>
# Step 1: Identify the structure
cat("\nANALYSIS:\n")

ANALYSIS:
cat("- Years (2021, 2022, 2023) are in column names → need pivot_longer()\n")
- Years (2021, 2022, 2023) are in column names → need pivot_longer()
cat("- Metrics (Revenue, Profit) are in column names → need separate()\n")
- Metrics (Revenue, Profit) are in column names → need separate()
cat("- Each company-year should be one observation\n")
- Each company-year should be one observation
# Step 2: Tidy transformation
financial_tidy <- financial_messy %>%
  # First, pivot longer to get year_metric combinations
  pivot_longer(
    cols = -company,
    names_to = "year_metric",
    values_to = "value"
  ) %>%
  # Then separate the year and metric
  separate(year_metric, into = c("year", "metric"), sep = "_") %>%
  # Convert year to numeric
  mutate(year = as.numeric(year)) %>%
  # Pivot wider to get metrics as columns
  pivot_wider(names_from = metric, values_from = value) %>%
  # Calculate derived metrics
  mutate(
    profit_margin = round(Profit / Revenue * 100, 1),
    profit_growth = round((Profit / lag(Profit) - 1) * 100, 1)
  ) %>%
  arrange(company, year)

cat("\nTIDY: Clean financial data ready for analysis\n")

TIDY: Clean financial data ready for analysis
print(financial_tidy)
# A tibble: 9 × 6
  company    year Revenue Profit profit_margin profit_growth
  <chr>     <dbl>   <dbl>  <dbl>         <dbl>         <dbl>
1 Apple      2021    366.   94.7          25.9          NA  
2 Apple      2022    394.   99.8          25.3           5.4
3 Apple      2023    383.   97            25.3          -2.8
4 Google     2021    258.   76            29.5         -21.6
5 Google     2022    283.   82.5          29.2           8.6
6 Google     2023    307.   73.8          24           -10.5
7 Microsoft  2021    168.   61.3          36.5         -16.9
8 Microsoft  2022    198.   72.7          36.7          18.6
9 Microsoft  2023    212.   72.4          34.2          -0.4
# Now analysis is straightforward
cat("\nAnalysis example - Average profit margin by company:\n")

Analysis example - Average profit margin by company:
financial_tidy %>%
  group_by(company) %>%
  summarise(avg_profit_margin = round(mean(profit_margin), 1), .groups = "drop") %>%
  arrange(desc(avg_profit_margin)) %>%
  print()
# A tibble: 3 × 2
  company   avg_profit_margin
  <chr>                 <dbl>
1 Microsoft              35.8
2 Google                 27.6
3 Apple                  25.5

Example 2: Survey Data

# Complex survey data with multiple issues
survey_messy <- tibble(
  id = 1:4,
  name_age = c("Alice_25", "Bob_30", "Charlie_35", "Diana_28"),
  pre_satisfaction = c(3, 4, 2, 5),
  post_satisfaction = c(4, 5, 4, 5),
  pre_recommendation = c(2, 3, 2, 4),
  post_recommendation = c(4, 4, 3, 5),
  contact_info = c("alice@email.com;555-1234", "bob@company.org;555-5678",
                   "charlie@domain.com", "diana@startup.io;555-9999")
)

cat("MESSY: Survey data with multiple structural issues\n")
MESSY: Survey data with multiple structural issues
print(survey_messy)
# A tibble: 4 × 7
     id name_age   pre_satisfaction post_satisfaction pre_recommendation
  <int> <chr>                 <dbl>             <dbl>              <dbl>
1     1 Alice_25                  3                 4                  2
2     2 Bob_30                    4                 5                  3
3     3 Charlie_35                2                 4                  2
4     4 Diana_28                  5                 5                  4
# ℹ 2 more variables: post_recommendation <dbl>, contact_info <chr>
# Multi-step tidying process
survey_tidy <- survey_messy %>%
  # Step 1: Separate name and age
  separate(name_age, into = c("name", "age"), sep = "_") %>%
  mutate(age = as.numeric(age)) %>%

  # Step 2: Pivot longer to get measurement types in rows
  pivot_longer(
    cols = matches("(satisfaction|recommendation)"),
    names_to = "measurement_info",
    values_to = "score"
  ) %>%

  # Step 3: Separate measurement info
  separate(measurement_info, into = c("time", "metric"), sep = "_") %>%

  # Step 4: Handle contact info (separate into multiple rows for multiple contacts)
  separate_rows(contact_info, sep = ";") %>%

  # Step 5: Clean up and organize
  mutate(
    contact_type = case_when(
      str_detect(contact_info, "@") ~ "email",
      str_detect(contact_info, "^\\d{3}-\\d{4}$") ~ "phone",
      TRUE ~ "other"
    )
  ) %>%
  arrange(id, time, metric)

cat("\nTIDY: Survey data in analysis-ready format\n")

TIDY: Survey data in analysis-ready format
print(head(survey_tidy, 12))
# A tibble: 12 × 8
      id name    age contact_info    time  metric         score contact_type
   <int> <chr> <dbl> <chr>           <chr> <chr>          <dbl> <chr>       
 1     1 Alice    25 alice@email.com post  recommendation     4 email       
 2     1 Alice    25 555-1234        post  recommendation     4 phone       
 3     1 Alice    25 alice@email.com post  satisfaction       4 email       
 4     1 Alice    25 555-1234        post  satisfaction       4 phone       
 5     1 Alice    25 alice@email.com pre   recommendation     2 email       
 6     1 Alice    25 555-1234        pre   recommendation     2 phone       
 7     1 Alice    25 alice@email.com pre   satisfaction       3 email       
 8     1 Alice    25 555-1234        pre   satisfaction       3 phone       
 9     2 Bob      30 bob@company.org post  recommendation     4 email       
10     2 Bob      30 555-5678        post  recommendation     4 phone       
11     2 Bob      30 bob@company.org post  satisfaction       5 email       
12     2 Bob      30 555-5678        post  satisfaction       5 phone       
# Now we can easily analyze pre/post changes
cat("\nAnalysis example - Average improvement by metric:\n")

Analysis example - Average improvement by metric:
survey_improvement <- survey_tidy %>%
  filter(contact_type == "email") %>%  # One record per person
  select(id, name, time, metric, score) %>%
  pivot_wider(names_from = time, values_from = score) %>%
  mutate(improvement = post - pre) %>%
  group_by(metric) %>%
  summarise(avg_improvement = round(mean(improvement, na.rm = TRUE), 2), .groups = "drop")

print(survey_improvement)
# A tibble: 2 × 2
  metric         avg_improvement
  <chr>                    <dbl>
1 recommendation            1.25
2 satisfaction              1   

Best Practices for Tidy Data

Documentation and Naming

# Good practices for maintaining tidy data
cat("BEST PRACTICES FOR TIDY DATA\n")
BEST PRACTICES FOR TIDY DATA
cat("============================\n\n")
============================
cat("1. CONSISTENT NAMING CONVENTIONS:\n")
1. CONSISTENT NAMING CONVENTIONS:
cat("   ✓ Use snake_case for column names\n")
   ✓ Use snake_case for column names
cat("   ✓ Use descriptive, unambiguous names\n")
   ✓ Use descriptive, unambiguous names
cat("   ✓ Avoid spaces, special characters\n")
   ✓ Avoid spaces, special characters
cat("   ✓ Be consistent with abbreviations\n\n")
   ✓ Be consistent with abbreviations
# Example of good naming
good_names_example <- tibble(
  customer_id = c("C001", "C002"),
  order_date = ymd(c("2024-01-15", "2024-01-16")),
  product_category = c("Electronics", "Clothing"),
  unit_price_usd = c(299.99, 49.99),
  quantity_ordered = c(1, 2),
  discount_percent = c(0.1, 0.0)
)

cat("GOOD NAMING EXAMPLE:\n")
GOOD NAMING EXAMPLE:
print(good_names_example)
# A tibble: 2 × 6
  customer_id order_date product_category unit_price_usd quantity_ordered
  <chr>       <date>     <chr>                     <dbl>            <dbl>
1 C001        2024-01-15 Electronics               300.                 1
2 C002        2024-01-16 Clothing                   50.0                2
# ℹ 1 more variable: discount_percent <dbl>
cat("\n2. DATA TYPES:\n")

2. DATA TYPES:
cat("   ✓ Use appropriate data types (dates as Date, numbers as numeric)\n")
   ✓ Use appropriate data types (dates as Date, numbers as numeric)
cat("   ✓ Use factors for categorical data with known levels\n")
   ✓ Use factors for categorical data with known levels
cat("   ✓ Use logical (TRUE/FALSE) for binary variables\n")
   ✓ Use logical (TRUE/FALSE) for binary variables
cat("   ✓ Document any special coding (NA, missing values)\n\n")
   ✓ Document any special coding (NA, missing values)
cat("3. DOCUMENTATION:\n")
3. DOCUMENTATION:
cat("   ✓ Document your tidying process\n")
   ✓ Document your tidying process
cat("   ✓ Keep original data unchanged\n")
   ✓ Keep original data unchanged
cat("   ✓ Comment complex transformations\n")
   ✓ Comment complex transformations
cat("   ✓ Include data dictionaries\n\n")
   ✓ Include data dictionaries
cat("4. VALIDATION:\n")
4. VALIDATION:
cat("   ✓ Check row counts before/after transformations\n")
   ✓ Check row counts before/after transformations
cat("   ✓ Verify no data was lost or duplicated\n")
   ✓ Verify no data was lost or duplicated
cat("   ✓ Test with edge cases and missing data\n")
   ✓ Test with edge cases and missing data
cat("   ✓ Compare aggregated totals before/after\n")
   ✓ Compare aggregated totals before/after

Common Pitfalls and How to Avoid Them

cat("COMMON PITFALLS IN DATA TIDYING\n")
COMMON PITFALLS IN DATA TIDYING
cat("===============================\n\n")
===============================
# Pitfall 1: Losing data during transformation
cat("PITFALL 1: LOSING DATA DURING TRANSFORMATION\n")
PITFALL 1: LOSING DATA DURING TRANSFORMATION
example_data <- tibble(
  id = 1:3,
  value1 = c(10, 20, 30),
  value2 = c(15, 25, 35)
)

# Bad: Forgetting to check row counts
before_rows <- nrow(example_data)
transformed <- example_data %>% pivot_longer(cols = -id, names_to = "variable", values_to = "value")
after_rows <- nrow(transformed)

cat("Original rows:", before_rows, "\n")
Original rows: 3 
cat("After transformation:", after_rows, "\n")
After transformation: 6 
cat("✓ GOOD: Always verify row counts make sense\n\n")
✓ GOOD: Always verify row counts make sense
# Pitfall 2: Mixing data types
cat("PITFALL 2: MIXING DATA TYPES\n")
PITFALL 2: MIXING DATA TYPES
mixed_types <- tibble(
  metric = c("revenue", "count", "rate"),
  value = c("1000", "50", "0.05")  # All stored as character!
)

cat("✗ BAD: All values stored as character\n")
✗ BAD: All values stored as character
print(mixed_types)
# A tibble: 3 × 2
  metric  value
  <chr>   <chr>
1 revenue 1000 
2 count   50   
3 rate    0.05 
# Better approach
proper_types <- tibble(
  metric = c("revenue", "count", "rate"),
  value = c(1000, 50, 0.05),
  unit = c("USD", "items", "proportion")
)

cat("✓ GOOD: Proper data types with units documented\n")
✓ GOOD: Proper data types with units documented
print(proper_types)
# A tibble: 3 × 3
  metric    value unit      
  <chr>     <dbl> <chr>     
1 revenue 1000    USD       
2 count     50    items     
3 rate       0.05 proportion
cat("\nPITFALL 3: OVER-TIDYING\n")

PITFALL 3: OVER-TIDYING
cat("✗ BAD: Making data so long it's hard to understand\n")
✗ BAD: Making data so long it's hard to understand
cat("✓ GOOD: Keep related variables together when it makes sense\n")
✓ GOOD: Keep related variables together when it makes sense
cat("✓ GOOD: Consider your analysis needs when choosing structure\n\n")
✓ GOOD: Consider your analysis needs when choosing structure
cat("PITFALL 4: NOT HANDLING MISSING VALUES\n")
PITFALL 4: NOT HANDLING MISSING VALUES
cat("✗ BAD: Ignoring how NAs will behave in transformations\n")
✗ BAD: Ignoring how NAs will behave in transformations
cat("✓ GOOD: Explicitly handle missing values with coalesce(), replace_na()\n")
✓ GOOD: Explicitly handle missing values with coalesce(), replace_na()
cat("✓ GOOD: Document what missing values mean\n")
✓ GOOD: Document what missing values mean

Exercises

Exercise 1: Diagnostic Practice

Given these messy datasets, identify what makes them untidy: 1. A table with years as column headers 2. A survey with pre_post measurements combined 3. A contact list with multiple phone numbers in one cell 4. A gradebook mixing student and course information

Exercise 2: Planning Transformations

For each messy dataset, write out: 1. What the tidy version should look like 2. Which tidyr functions you’ll need 3. The order of operations 4. How to verify the transformation worked

Exercise 3: Multi-Step Tidying

Practice with complex datasets that require: 1. Multiple pivot operations 2. Separation and recombination 3. Handling missing values 4. Type conversions

Exercise 4: Real-World Scenarios

Work with realistic messy data: 1. Financial statements from spreadsheets 2. Survey data with encoded responses 3. Log files with mixed formats 4. Scientific data with experimental conditions in names

Summary

Understanding tidy data principles is fundamental to effective data analysis:

The Three Rules:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Common Problems:

  • Headers as values: Years, treatments, conditions in column names
  • Multiple variables in names: Combined information that should be separated
  • Multiple values in cells: Lists or combined data in single cells
  • Mixed observational units: Different types of entities in one table

Diagnostic Process:

  1. Examine the structure: Look at dimensions and column names
  2. Identify patterns: Spot repeated elements and encoding
  3. Plan transformations: Map current state to desired tidy state
  4. Execute step-by-step: Transform incrementally and verify
  5. Validate results: Check row counts and data integrity

Best Practices:

  • Plan before you pivot: Understand your target structure
  • Document your process: Comment complex transformations
  • Validate each step: Check that data is preserved
  • Use consistent naming: Follow naming conventions
  • Handle missing data explicitly: Don’t let NAs surprise you

Mastering tidy data principles provides the foundation for all other data analysis work. When your data is tidy, analysis becomes straightforward and intuitive!

Next: Pivoting Data