Tidy data is a consistent way to organize data that makes analysis easier and more intuitive. The concept, formalized by Hadley Wickham, provides a standard way to structure datasets that works seamlessly with the tidyverse ecosystem.
library(tidyverse)cat("The three fundamental rules of tidy data:\n")
The three fundamental rules of tidy data:
cat("1. Each variable forms a column\n")
1. Each variable forms a column
cat("2. Each observation forms a row\n")
2. Each observation forms a row
cat("3. Each type of observational unit forms a table\n")
3. Each type of observational unit forms a table
The Three Rules of Tidy Data
Rule 1: Each Variable Forms a Column
A variable is something you can measure, count, or describe. Each variable should have its own column.
# TIDY: Each variable (name, age, height) has its own columntidy_people <-tibble(name =c("Alice", "Bob", "Charlie"),age =c(25, 30, 35),height =c(165, 180, 175))cat("TIDY - Each variable in its own column:\n")
TIDY - Each variable in its own column:
print(tidy_people)
# A tibble: 3 × 3
name age height
<chr> <dbl> <dbl>
1 Alice 25 165
2 Bob 30 180
3 Charlie 35 175
# MESSY: Multiple variables squeezed into one columnmessy_people <-tibble(person =c("Alice", "Bob", "Charlie"),age_height =c("25_165", "30_180", "35_175"))cat("\nMESSY - Multiple variables in one column:\n")
MESSY - Multiple variables in one column:
print(messy_people)
# A tibble: 3 × 2
person age_height
<chr> <chr>
1 Alice 25_165
2 Bob 30_180
3 Charlie 35_175
# A tibble: 4 × 3
person measurement value
<chr> <chr> <dbl>
1 Alice age 25
2 Alice height 165
3 Bob age 30
4 Bob height 180
Rule 2: Each Observation Forms a Row
An observation is all the values measured on the same unit (person, company, country) at the same time.
# TIDY: Each person (observation) gets one rowtidy_sales <-tibble(sales_rep =c("Alice", "Bob", "Charlie"),region =c("North", "South", "East"),Q1_sales =c(10000, 12000, 11000),Q2_sales =c(11000, 13000, 12000))cat("TIDY - Each sales rep (observation) in one row:\n")
TIDY - Each sales rep (observation) in one row:
print(tidy_sales)
# A tibble: 3 × 4
sales_rep region Q1_sales Q2_sales
<chr> <chr> <dbl> <dbl>
1 Alice North 10000 11000
2 Bob South 12000 13000
3 Charlie East 11000 12000
MESSY - Everything in one table (redundant customer info):
print(messy_combined)
# A tibble: 4 × 6
order_id customer_id customer_name industry order_date amount
<chr> <chr> <chr> <chr> <date> <dbl>
1 O001 C001 Alice Corp Tech 2024-01-15 1000
2 O002 C002 Bob Ltd Finance 2024-01-16 1500
3 O003 C001 Alice Corp Tech 2024-01-17 800
4 O004 C003 Charlie Inc Healthcare 2024-01-18 2000
Common Messy Data Patterns
Pattern 1: Column Headers are Values
This is one of the most common problems in real-world data.
# Common in spreadsheets: Years as column headersmessy_population <-tibble(country =c("USA", "Canada", "Mexico"),`2020`=c(331, 38, 129),`2021`=c(332, 38, 130),`2022`=c(333, 39, 131),`2023`=c(334, 39, 132))cat("MESSY - Years as column headers:\n")
# How to identify this pattern:cat("\nHow to identify this pattern:\n")
How to identify this pattern:
cat("- Column names represent values of a variable (years, quarters, treatments)\n")
- Column names represent values of a variable (years, quarters, treatments)
cat("- You see repeated patterns in column names\n")
- You see repeated patterns in column names
cat("- Data is 'wider' than it should be\n")
- Data is 'wider' than it should be
# What the tidy version should look like:tidy_population <- messy_population %>%pivot_longer(cols =-country,names_to ="year",values_to ="population_millions" ) %>%mutate(year =as.numeric(year))cat("\nTIDY version:\n")
TIDY version:
print(tidy_population)
# A tibble: 12 × 3
country year population_millions
<chr> <dbl> <dbl>
1 USA 2020 331
2 USA 2021 332
3 USA 2022 333
4 USA 2023 334
5 Canada 2020 38
6 Canada 2021 38
7 Canada 2022 39
8 Canada 2023 39
9 Mexico 2020 129
10 Mexico 2021 130
11 Mexico 2022 131
12 Mexico 2023 132
Pattern 2: Multiple Variables in Column Names
Column names often encode multiple pieces of information.
# Treatment and measurement type encoded in column namesmessy_experiment <-tibble(subject_id =1:3,treatment_A_weight =c(70, 75, 68),treatment_A_height =c(170, 180, 165),treatment_B_weight =c(71, 76, 69),treatment_B_height =c(171, 181, 166))cat("MESSY - Treatment and measurement in column names:\n")
MESSY - Treatment and measurement in column names:
# A tibble: 6 × 4
subject_id treatment weight height
<int> <chr> <dbl> <dbl>
1 1 A 70 170
2 1 B 71 171
3 2 A 75 180
4 2 B 76 181
5 3 A 68 165
6 3 B 69 166
Pattern 3: Variables Stored in Rows and Columns
Sometimes data is stored in a matrix-like format where both rows and columns contain variable information.
# Sales data with products as rows and quarters as columnsmessy_sales_matrix <-tibble(product =c("Widget A", "Widget B", "Widget C"),Q1_North =c(100, 150, 120),Q1_South =c(90, 140, 110),Q2_North =c(110, 160, 130),Q2_South =c(95, 145, 115))cat("MESSY - Products in rows, quarters and regions in columns:\n")
MESSY - Products in rows, quarters and regions in columns:
print(messy_sales_matrix)
# A tibble: 3 × 5
product Q1_North Q1_South Q2_North Q2_South
<chr> <dbl> <dbl> <dbl> <dbl>
1 Widget A 100 90 110 95
2 Widget B 150 140 160 145
3 Widget C 120 110 130 115
# Tidy version needs multiple stepstidy_sales_matrix <- messy_sales_matrix %>%pivot_longer(cols =-product,names_to ="quarter_region",values_to ="sales" ) %>%separate(quarter_region, into =c("quarter", "region"), sep ="_")cat("\nTIDY version:\n")
TIDY version:
print(tidy_sales_matrix)
# A tibble: 12 × 4
product quarter region sales
<chr> <chr> <chr> <dbl>
1 Widget A Q1 North 100
2 Widget A Q1 South 90
3 Widget A Q2 North 110
4 Widget A Q2 South 95
5 Widget B Q1 North 150
6 Widget B Q1 South 140
7 Widget B Q2 North 160
8 Widget B Q2 South 145
9 Widget C Q1 North 120
10 Widget C Q1 South 110
11 Widget C Q2 North 130
12 Widget C Q2 South 115
Pattern 4: Multiple Observational Units in One Table
Different types of information mixed together.
# Student and course information mixedmessy_grades <-tibble(record_id =1:4,student_name =c("Alice", "Alice", "Bob", "Bob"),student_age =c(20, 20, 22, 22),student_major =c("Math", "Math", "Physics", "Physics"),course_name =c("Calculus", "Statistics", "Calculus", "Physics"),course_credits =c(3, 3, 3, 4),grade =c("A", "B", "A-", "B+"))cat("MESSY - Student and course info mixed:\n")
MESSY - Student and course info mixed:
print(messy_grades)
# A tibble: 4 × 7
record_id student_name student_age student_major course_name course_credits
<int> <chr> <dbl> <chr> <chr> <dbl>
1 1 Alice 20 Math Calculus 3
2 2 Alice 20 Math Statistics 3
3 3 Bob 22 Physics Calculus 3
4 4 Bob 22 Physics Physics 4
# ℹ 1 more variable: grade <chr>
# Should be separate tablesstudents <- messy_grades %>%distinct(student_name, student_age, student_major)courses <- messy_grades %>%distinct(course_name, course_credits)enrollments <- messy_grades %>%select(student_name, course_name, grade)cat("\nTIDY - Students table:\n")
TIDY - Students table:
print(students)
# A tibble: 2 × 3
student_name student_age student_major
<chr> <dbl> <chr>
1 Alice 20 Math
2 Bob 22 Physics
# A tibble: 4 × 3
student_name course_name grade
<chr> <chr> <chr>
1 Alice Calculus A
2 Alice Statistics B
3 Bob Calculus A-
4 Bob Physics B+
Diagnosing Data Structure Problems
Quick Diagnostic Questions
diagnostic_checklist <-function(data) {cat("TIDY DATA DIAGNOSTIC CHECKLIST\n")cat("==============================\n\n")cat("1. COLUMNS (Variables):\n")cat(" □ Does each column represent exactly one variable?\n")cat(" □ Are column names variable names (not values)?\n")cat(" □ Do you see patterns like '2020', '2021' in column names?\n")cat(" □ Do column names contain multiple pieces of info?\n\n")cat("2. ROWS (Observations):\n")cat(" □ Does each row represent exactly one observation?\n")cat(" □ Are observations split across multiple rows?\n")cat(" □ Are multiple observations squeezed into one row?\n\n")cat("3. CELLS (Values):\n")cat(" □ Does each cell contain exactly one value?\n")cat(" □ Are multiple values separated by commas/semicolons?\n")cat(" □ Are values and variable names mixed in cells?\n\n")cat("4. TABLES (Observational Units):\n")cat(" □ Does this table mix different types of entities?\n")cat(" □ Is information duplicated across rows?\n")cat(" □ Should this be split into multiple tables?\n\n")# Basic analysis of the provided datacat("QUICK ANALYSIS OF YOUR DATA:\n")cat("============================\n")cat("Dimensions:", nrow(data), "rows ×", ncol(data), "columns\n")cat("Column names:", paste(names(data), collapse =", "), "\n")# Check for common patterns col_names <-names(data)if (any(str_detect(col_names, "\\d{4}"))) {cat("⚠️ WARNING: Found years in column names - might need pivot_longer()\n") }if (any(str_detect(col_names, "_"))) {cat("⚠️ WARNING: Found underscores in column names - might need separate()\n") }if (length(unique(apply(data, 1, paste, collapse =""))) <nrow(data)) {cat("⚠️ WARNING: Found duplicate rows - check for repeated observations\n") }}# Example usagesample_messy <-tibble(id =1:3,name =c("Alice", "Bob", "Charlie"),`2023_sales`=c(1000, 1200, 1100),`2024_sales`=c(1100, 1300, 1200))diagnostic_checklist(sample_messy)
TIDY DATA DIAGNOSTIC CHECKLIST
==============================
1. COLUMNS (Variables):
□ Does each column represent exactly one variable?
□ Are column names variable names (not values)?
□ Do you see patterns like '2020', '2021' in column names?
□ Do column names contain multiple pieces of info?
2. ROWS (Observations):
□ Does each row represent exactly one observation?
□ Are observations split across multiple rows?
□ Are multiple observations squeezed into one row?
3. CELLS (Values):
□ Does each cell contain exactly one value?
□ Are multiple values separated by commas/semicolons?
□ Are values and variable names mixed in cells?
4. TABLES (Observational Units):
□ Does this table mix different types of entities?
□ Is information duplicated across rows?
□ Should this be split into multiple tables?
QUICK ANALYSIS OF YOUR DATA:
============================
Dimensions: 3 rows × 4 columns
Column names: id, name, 2023_sales, 2024_sales
⚠️ WARNING: Found years in column names - might need pivot_longer()
⚠️ WARNING: Found underscores in column names - might need separate()
Visual Pattern Recognition
# Function to visualize data structure issuesshow_data_problems <-function() {cat("COMMON MESSY DATA PATTERNS - VISUAL GUIDE\n")cat("=========================================\n\n")# Pattern 1: Too widecat("PATTERN 1: TOO WIDE (headers are values)\n")cat("┌─────────┬─────┬─────┬─────┬─────┐\n")cat("│ country │2020 │2021 │2022 │2023 │\n")cat("├─────────┼─────┼─────┼─────┼─────┤\n")cat("│ USA │ 331 │ 332 │ 333 │ 334 │\n")cat("│ Canada │ 38 │ 38 │ 39 │ 39 │\n")cat("└─────────┴─────┴─────┴─────┴─────┘\n")cat("SHOULD BE: pivot_longer() to make it tall\n\n")# Pattern 2: Multiple variables in namescat("PATTERN 2: MULTIPLE VARIABLES IN COLUMN NAMES\n")cat("┌─────┬──────────┬──────────┬──────────┐\n")cat("│ id │treat_A_pre│treat_A_post│treat_B_pre│\n")cat("├─────┼──────────┼──────────┼──────────┤\n")cat("│ 1 │ 70 │ 75 │ 72 │\n")cat("└─────┴──────────┴──────────┴──────────┘\n")cat("SHOULD BE: pivot_longer() then separate()\n\n")# Pattern 3: Multiple values in cellscat("PATTERN 3: MULTIPLE VALUES IN CELLS\n")cat("┌─────────┬─────────────┬─────────────┐\n")cat("│ person │ contact │ skills │\n")cat("├─────────┼─────────────┼─────────────┤\n")cat("│ Alice │email;phone │ R,Python │\n")cat("│ Bob │email;phone │Excel,SQL │\n")cat("└─────────┴─────────────┴─────────────┘\n")cat("SHOULD BE: separate() into multiple columns\n\n")# Pattern 4: Variables as rowscat("PATTERN 4: VARIABLES STORED AS ROWS\n")cat("┌─────────┬─────────┬───────┐\n")cat("│ person │variable │ value │\n")cat("├─────────┼─────────┼───────┤\n")cat("│ Alice │ height │ 165 │\n")cat("│ Alice │ weight │ 60 │\n")cat("│ Bob │ height │ 180 │\n")cat("│ Bob │ weight │ 75 │\n")cat("└─────────┴─────────┴───────┘\n")cat("SHOULD BE: pivot_wider() to make columns\n\n")}show_data_problems()
COMMON MESSY DATA PATTERNS - VISUAL GUIDE
=========================================
PATTERN 1: TOO WIDE (headers are values)
┌─────────┬─────┬─────┬─────┬─────┐
│ country │2020 │2021 │2022 │2023 │
├─────────┼─────┼─────┼─────┼─────┤
│ USA │ 331 │ 332 │ 333 │ 334 │
│ Canada │ 38 │ 38 │ 39 │ 39 │
└─────────┴─────┴─────┴─────┴─────┘
SHOULD BE: pivot_longer() to make it tall
PATTERN 2: MULTIPLE VARIABLES IN COLUMN NAMES
┌─────┬──────────┬──────────┬──────────┐
│ id │treat_A_pre│treat_A_post│treat_B_pre│
├─────┼──────────┼──────────┼──────────┤
│ 1 │ 70 │ 75 │ 72 │
└─────┴──────────┴──────────┴──────────┘
SHOULD BE: pivot_longer() then separate()
PATTERN 3: MULTIPLE VALUES IN CELLS
┌─────────┬─────────────┬─────────────┐
│ person │ contact │ skills │
├─────────┼─────────────┼─────────────┤
│ Alice │email;phone │ R,Python │
│ Bob │email;phone │Excel,SQL │
└─────────┴─────────────┴─────────────┘
SHOULD BE: separate() into multiple columns
PATTERN 4: VARIABLES STORED AS ROWS
┌─────────┬─────────┬───────┐
│ person │variable │ value │
├─────────┼─────────┼───────┤
│ Alice │ height │ 165 │
│ Alice │ weight │ 60 │
│ Bob │ height │ 180 │
│ Bob │ weight │ 75 │
└─────────┴─────────┴───────┘
SHOULD BE: pivot_wider() to make columns
Planning Your Tidying Strategy
Step-by-Step Planning Process
# Function to help plan tidying strategyplan_tidying <-function(data, target_description =NULL) {cat("TIDYING STRATEGY PLANNER\n")cat("========================\n\n")cat("STEP 1: UNDERSTAND YOUR CURRENT DATA\n")cat("Current dimensions:", nrow(data), "rows ×", ncol(data), "columns\n")cat("Current columns:", paste(names(data), collapse =", "), "\n\n")cat("STEP 2: ENVISION YOUR TARGET STRUCTURE\n")if (!is.null(target_description)) {cat("Target:", target_description, "\n") } else {cat("Ask yourself:\n")cat("- What should each row represent?\n")cat("- What variables do I need as columns?\n")cat("- What is the observational unit?\n") }cat("\n")cat("STEP 3: IDENTIFY REQUIRED TRANSFORMATIONS\n")# Check for common patterns col_names <-names(data)if (any(str_detect(col_names, "\\d{4}"))) {cat("📋 NEEDS: pivot_longer() - Found years/dates in column names\n") }if (any(str_detect(col_names, "_.*_"))) {cat("📋 NEEDS: separate() - Found multiple variables in column names\n") }# Check for values that look like they contain multiple items char_cols <- data %>%select(where(is.character))if (ncol(char_cols) >0) { multi_value_check <- char_cols %>%summarise(across(everything(), ~any(str_detect(.x, "[,;|]"), na.rm =TRUE)))if (any(multi_value_check)) {cat("📋 NEEDS: separate() - Found multiple values in cells\n") } }cat("\n")cat("STEP 4: PLAN THE SEQUENCE\n")cat("1. Start with the most fundamental transformation\n")cat("2. Work from general to specific\n")cat("3. Clean up column names and types at the end\n\n")cat("STEP 5: VERIFY EACH STEP\n")cat("- Check dimensions after each transformation\n")cat("- Verify that you haven't lost or duplicated data\n")cat("- Test with a small sample first\n")}# Example usagemessy_survey <-tibble(respondent =1:3,Q1_pre =c(3, 4, 2),Q1_post =c(4, 5, 4),Q2_pre =c(2, 3, 2),Q2_post =c(3, 4, 3),demographics =c("25_Male", "34_Female", "28_Male"))plan_tidying(messy_survey, "One row per respondent-question-time combination")
TIDYING STRATEGY PLANNER
========================
STEP 1: UNDERSTAND YOUR CURRENT DATA
Current dimensions: 3 rows × 6 columns
Current columns: respondent, Q1_pre, Q1_post, Q2_pre, Q2_post, demographics
STEP 2: ENVISION YOUR TARGET STRUCTURE
Target: One row per respondent-question-time combination
STEP 3: IDENTIFY REQUIRED TRANSFORMATIONS
STEP 4: PLAN THE SEQUENCE
1. Start with the most fundamental transformation
2. Work from general to specific
3. Clean up column names and types at the end
STEP 5: VERIFY EACH STEP
- Check dimensions after each transformation
- Verify that you haven't lost or duplicated data
- Test with a small sample first
Real-World Tidying Examples
Example 1: Financial Data
# Typical financial data from spreadsheetsfinancial_messy <-tibble(company =c("Apple", "Google", "Microsoft"),`2021_Revenue`=c(365.8, 257.6, 168.1),`2021_Profit`=c(94.7, 76.0, 61.3),`2022_Revenue`=c(394.3, 282.8, 198.3),`2022_Profit`=c(99.8, 82.5, 72.7),`2023_Revenue`=c(383.3, 307.4, 211.9),`2023_Profit`=c(97.0, 73.8, 72.4))cat("MESSY: Financial data with years and metrics in column names\n")
MESSY: Financial data with years and metrics in column names
print(financial_messy)
# A tibble: 3 × 7
company `2021_Revenue` `2021_Profit` `2022_Revenue` `2022_Profit`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Apple 366. 94.7 394. 99.8
2 Google 258. 76 283. 82.5
3 Microsoft 168. 61.3 198. 72.7
# ℹ 2 more variables: `2023_Revenue` <dbl>, `2023_Profit` <dbl>
# Step 1: Identify the structurecat("\nANALYSIS:\n")
ANALYSIS:
cat("- Years (2021, 2022, 2023) are in column names → need pivot_longer()\n")
- Years (2021, 2022, 2023) are in column names → need pivot_longer()
cat("- Metrics (Revenue, Profit) are in column names → need separate()\n")
- Metrics (Revenue, Profit) are in column names → need separate()
cat("- Each company-year should be one observation\n")
- Each company-year should be one observation
# Step 2: Tidy transformationfinancial_tidy <- financial_messy %>%# First, pivot longer to get year_metric combinationspivot_longer(cols =-company,names_to ="year_metric",values_to ="value" ) %>%# Then separate the year and metricseparate(year_metric, into =c("year", "metric"), sep ="_") %>%# Convert year to numericmutate(year =as.numeric(year)) %>%# Pivot wider to get metrics as columnspivot_wider(names_from = metric, values_from = value) %>%# Calculate derived metricsmutate(profit_margin =round(Profit / Revenue *100, 1),profit_growth =round((Profit /lag(Profit) -1) *100, 1) ) %>%arrange(company, year)cat("\nTIDY: Clean financial data ready for analysis\n")
TIDY: Clean financial data ready for analysis
print(financial_tidy)
# A tibble: 9 × 6
company year Revenue Profit profit_margin profit_growth
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Apple 2021 366. 94.7 25.9 NA
2 Apple 2022 394. 99.8 25.3 5.4
3 Apple 2023 383. 97 25.3 -2.8
4 Google 2021 258. 76 29.5 -21.6
5 Google 2022 283. 82.5 29.2 8.6
6 Google 2023 307. 73.8 24 -10.5
7 Microsoft 2021 168. 61.3 36.5 -16.9
8 Microsoft 2022 198. 72.7 36.7 18.6
9 Microsoft 2023 212. 72.4 34.2 -0.4
# Now analysis is straightforwardcat("\nAnalysis example - Average profit margin by company:\n")
Analysis example - Average profit margin by company:
# Multi-step tidying processsurvey_tidy <- survey_messy %>%# Step 1: Separate name and ageseparate(name_age, into =c("name", "age"), sep ="_") %>%mutate(age =as.numeric(age)) %>%# Step 2: Pivot longer to get measurement types in rowspivot_longer(cols =matches("(satisfaction|recommendation)"),names_to ="measurement_info",values_to ="score" ) %>%# Step 3: Separate measurement infoseparate(measurement_info, into =c("time", "metric"), sep ="_") %>%# Step 4: Handle contact info (separate into multiple rows for multiple contacts)separate_rows(contact_info, sep =";") %>%# Step 5: Clean up and organizemutate(contact_type =case_when(str_detect(contact_info, "@") ~"email",str_detect(contact_info, "^\\d{3}-\\d{4}$") ~"phone",TRUE~"other" ) ) %>%arrange(id, time, metric)cat("\nTIDY: Survey data in analysis-ready format\n")
TIDY: Survey data in analysis-ready format
print(head(survey_tidy, 12))
# A tibble: 12 × 8
id name age contact_info time metric score contact_type
<int> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1 Alice 25 alice@email.com post recommendation 4 email
2 1 Alice 25 555-1234 post recommendation 4 phone
3 1 Alice 25 alice@email.com post satisfaction 4 email
4 1 Alice 25 555-1234 post satisfaction 4 phone
5 1 Alice 25 alice@email.com pre recommendation 2 email
6 1 Alice 25 555-1234 pre recommendation 2 phone
7 1 Alice 25 alice@email.com pre satisfaction 3 email
8 1 Alice 25 555-1234 pre satisfaction 3 phone
9 2 Bob 30 bob@company.org post recommendation 4 email
10 2 Bob 30 555-5678 post recommendation 4 phone
11 2 Bob 30 bob@company.org post satisfaction 5 email
12 2 Bob 30 555-5678 post satisfaction 5 phone
# Now we can easily analyze pre/post changescat("\nAnalysis example - Average improvement by metric:\n")
Analysis example - Average improvement by metric:
survey_improvement <- survey_tidy %>%filter(contact_type =="email") %>%# One record per personselect(id, name, time, metric, score) %>%pivot_wider(names_from = time, values_from = score) %>%mutate(improvement = post - pre) %>%group_by(metric) %>%summarise(avg_improvement =round(mean(improvement, na.rm =TRUE), 2), .groups ="drop")print(survey_improvement)
cat("✓ GOOD: Always verify row counts make sense\n\n")
✓ GOOD: Always verify row counts make sense
# Pitfall 2: Mixing data typescat("PITFALL 2: MIXING DATA TYPES\n")
PITFALL 2: MIXING DATA TYPES
mixed_types <-tibble(metric =c("revenue", "count", "rate"),value =c("1000", "50", "0.05") # All stored as character!)cat("✗ BAD: All values stored as character\n")
✗ BAD: All values stored as character
print(mixed_types)
# A tibble: 3 × 2
metric value
<chr> <chr>
1 revenue 1000
2 count 50
3 rate 0.05
# Better approachproper_types <-tibble(metric =c("revenue", "count", "rate"),value =c(1000, 50, 0.05),unit =c("USD", "items", "proportion"))cat("✓ GOOD: Proper data types with units documented\n")
✓ GOOD: Proper data types with units documented
print(proper_types)
# A tibble: 3 × 3
metric value unit
<chr> <dbl> <chr>
1 revenue 1000 USD
2 count 50 items
3 rate 0.05 proportion
cat("\nPITFALL 3: OVER-TIDYING\n")
PITFALL 3: OVER-TIDYING
cat("✗ BAD: Making data so long it's hard to understand\n")
✗ BAD: Making data so long it's hard to understand
cat("✓ GOOD: Keep related variables together when it makes sense\n")
✓ GOOD: Keep related variables together when it makes sense
cat("✓ GOOD: Consider your analysis needs when choosing structure\n\n")
✓ GOOD: Consider your analysis needs when choosing structure
cat("PITFALL 4: NOT HANDLING MISSING VALUES\n")
PITFALL 4: NOT HANDLING MISSING VALUES
cat("✗ BAD: Ignoring how NAs will behave in transformations\n")
✗ BAD: Ignoring how NAs will behave in transformations
cat("✓ GOOD: Explicitly handle missing values with coalesce(), replace_na()\n")
✓ GOOD: Explicitly handle missing values with coalesce(), replace_na()
cat("✓ GOOD: Document what missing values mean\n")
✓ GOOD: Document what missing values mean
Exercises
Exercise 1: Diagnostic Practice
Given these messy datasets, identify what makes them untidy: 1. A table with years as column headers 2. A survey with pre_post measurements combined 3. A contact list with multiple phone numbers in one cell 4. A gradebook mixing student and course information
Exercise 2: Planning Transformations
For each messy dataset, write out: 1. What the tidy version should look like 2. Which tidyr functions you’ll need 3. The order of operations 4. How to verify the transformation worked
Exercise 3: Multi-Step Tidying
Practice with complex datasets that require: 1. Multiple pivot operations 2. Separation and recombination 3. Handling missing values 4. Type conversions
Exercise 4: Real-World Scenarios
Work with realistic messy data: 1. Financial statements from spreadsheets 2. Survey data with encoded responses 3. Log files with mixed formats 4. Scientific data with experimental conditions in names
Summary
Understanding tidy data principles is fundamental to effective data analysis:
The Three Rules:
Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table
Common Problems:
Headers as values: Years, treatments, conditions in column names
Multiple variables in names: Combined information that should be separated
Multiple values in cells: Lists or combined data in single cells
Mixed observational units: Different types of entities in one table
Diagnostic Process:
Examine the structure: Look at dimensions and column names
Identify patterns: Spot repeated elements and encoding
Plan transformations: Map current state to desired tidy state
Execute step-by-step: Transform incrementally and verify
Validate results: Check row counts and data integrity
Best Practices:
Plan before you pivot: Understand your target structure
Document your process: Comment complex transformations
Validate each step: Check that data is preserved
Use consistent naming: Follow naming conventions
Handle missing data explicitly: Don’t let NAs surprise you
Mastering tidy data principles provides the foundation for all other data analysis work. When your data is tidy, analysis becomes straightforward and intuitive!
---title: "Tidy Data Principles"author: "IND215"date: todayformat: html: toc: true toc-depth: 3 code-fold: false code-tools: true---## Introduction to Tidy DataTidy data is a consistent way to organize data that makes analysis easier and more intuitive. The concept, formalized by Hadley Wickham, provides a standard way to structure datasets that works seamlessly with the tidyverse ecosystem.```{r}#| label: setup#| message: falselibrary(tidyverse)cat("The three fundamental rules of tidy data:\n")cat("1. Each variable forms a column\n")cat("2. Each observation forms a row\n")cat("3. Each type of observational unit forms a table\n")```## The Three Rules of Tidy Data### Rule 1: Each Variable Forms a ColumnA variable is something you can measure, count, or describe. Each variable should have its own column.```{r}#| label: rule1-examples# TIDY: Each variable (name, age, height) has its own columntidy_people <-tibble(name =c("Alice", "Bob", "Charlie"),age =c(25, 30, 35),height =c(165, 180, 175))cat("TIDY - Each variable in its own column:\n")print(tidy_people)# MESSY: Multiple variables squeezed into one columnmessy_people <-tibble(person =c("Alice", "Bob", "Charlie"),age_height =c("25_165", "30_180", "35_175"))cat("\nMESSY - Multiple variables in one column:\n")print(messy_people)# MESSY: Variable names as datamessy_measurements <-tibble(person =c("Alice", "Alice", "Bob", "Bob"),measurement =c("age", "height", "age", "height"),value =c(25, 165, 30, 180))cat("\nMESSY - Variable names stored as data:\n")print(messy_measurements)```### Rule 2: Each Observation Forms a RowAn observation is all the values measured on the same unit (person, company, country) at the same time.```{r}#| label: rule2-examples# TIDY: Each person (observation) gets one rowtidy_sales <-tibble(sales_rep =c("Alice", "Bob", "Charlie"),region =c("North", "South", "East"),Q1_sales =c(10000, 12000, 11000),Q2_sales =c(11000, 13000, 12000))cat("TIDY - Each sales rep (observation) in one row:\n")print(tidy_sales)# MESSY: Observations split across multiple rowsmessy_sales_split <-tibble(rep_quarter =c("Alice_Q1", "Alice_Q2", "Bob_Q1", "Bob_Q2"),region =c("North", "North", "South", "South"),sales =c(10000, 11000, 12000, 13000))cat("\nMESSY - Observations split across rows:\n")print(messy_sales_split)# MESSY: Multiple observations in one rowmessy_sales_combined <-tibble(quarter =c("Q1", "Q2"),alice_north =c(10000, 11000),bob_south =c(12000, 13000),charlie_east =c(11000, 12000))cat("\nMESSY - Multiple observations in one row:\n")print(messy_sales_combined)```### Rule 3: Each Type of Observational Unit Forms a TableDifferent types of data should be stored in separate tables. For example, customer data and order data are different types of observational units.```{r}#| label: rule3-examples# TIDY: Separate tables for different observational unitscustomers <-tibble(customer_id =c("C001", "C002", "C003"),customer_name =c("Alice Corp", "Bob Ltd", "Charlie Inc"),industry =c("Tech", "Finance", "Healthcare"))orders <-tibble(order_id =c("O001", "O002", "O003", "O004"),customer_id =c("C001", "C002", "C001", "C003"),order_date =ymd(c("2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18")),amount =c(1000, 1500, 800, 2000))cat("TIDY - Customers table:\n")print(customers)cat("\nTIDY - Orders table:\n")print(orders)# MESSY: Everything mashed togethermessy_combined <-tibble(order_id =c("O001", "O002", "O003", "O004"),customer_id =c("C001", "C002", "C001", "C003"),customer_name =c("Alice Corp", "Bob Ltd", "Alice Corp", "Charlie Inc"),industry =c("Tech", "Finance", "Tech", "Healthcare"),order_date =ymd(c("2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18")),amount =c(1000, 1500, 800, 2000))cat("\nMESSY - Everything in one table (redundant customer info):\n")print(messy_combined)```## Common Messy Data Patterns### Pattern 1: Column Headers are ValuesThis is one of the most common problems in real-world data.```{r}#| label: pattern1-headers-as-values# Common in spreadsheets: Years as column headersmessy_population <-tibble(country =c("USA", "Canada", "Mexico"),`2020`=c(331, 38, 129),`2021`=c(332, 38, 130),`2022`=c(333, 39, 131),`2023`=c(334, 39, 132))cat("MESSY - Years as column headers:\n")print(messy_population)# How to identify this pattern:cat("\nHow to identify this pattern:\n")cat("- Column names represent values of a variable (years, quarters, treatments)\n")cat("- You see repeated patterns in column names\n")cat("- Data is 'wider' than it should be\n")# What the tidy version should look like:tidy_population <- messy_population %>%pivot_longer(cols =-country,names_to ="year",values_to ="population_millions" ) %>%mutate(year =as.numeric(year))cat("\nTIDY version:\n")print(tidy_population)```### Pattern 2: Multiple Variables in Column NamesColumn names often encode multiple pieces of information.```{r}#| label: pattern2-multiple-variables# Treatment and measurement type encoded in column namesmessy_experiment <-tibble(subject_id =1:3,treatment_A_weight =c(70, 75, 68),treatment_A_height =c(170, 180, 165),treatment_B_weight =c(71, 76, 69),treatment_B_height =c(171, 181, 166))cat("MESSY - Treatment and measurement in column names:\n")print(messy_experiment)cat("\nHow to identify this pattern:\n")cat("- Column names contain underscores or separators\n")cat("- Multiple pieces of info encoded in names\n")cat("- Similar patterns repeated across columns\n")# Tidy approach: separate the informationtidy_experiment <- messy_experiment %>%pivot_longer(cols =-subject_id,names_to ="measurement_info",values_to ="value" ) %>%separate(measurement_info,into =c("treatment_label", "treatment", "measurement"),sep ="_") %>%select(-treatment_label) %>%# Remove redundant columnpivot_wider(names_from = measurement, values_from = value)cat("\nTIDY version:\n")print(tidy_experiment)```### Pattern 3: Variables Stored in Rows and ColumnsSometimes data is stored in a matrix-like format where both rows and columns contain variable information.```{r}#| label: pattern3-matrix-format# Sales data with products as rows and quarters as columnsmessy_sales_matrix <-tibble(product =c("Widget A", "Widget B", "Widget C"),Q1_North =c(100, 150, 120),Q1_South =c(90, 140, 110),Q2_North =c(110, 160, 130),Q2_South =c(95, 145, 115))cat("MESSY - Products in rows, quarters and regions in columns:\n")print(messy_sales_matrix)# Tidy version needs multiple stepstidy_sales_matrix <- messy_sales_matrix %>%pivot_longer(cols =-product,names_to ="quarter_region",values_to ="sales" ) %>%separate(quarter_region, into =c("quarter", "region"), sep ="_")cat("\nTIDY version:\n")print(tidy_sales_matrix)```### Pattern 4: Multiple Observational Units in One TableDifferent types of information mixed together.```{r}#| label: pattern4-mixed-units# Student and course information mixedmessy_grades <-tibble(record_id =1:4,student_name =c("Alice", "Alice", "Bob", "Bob"),student_age =c(20, 20, 22, 22),student_major =c("Math", "Math", "Physics", "Physics"),course_name =c("Calculus", "Statistics", "Calculus", "Physics"),course_credits =c(3, 3, 3, 4),grade =c("A", "B", "A-", "B+"))cat("MESSY - Student and course info mixed:\n")print(messy_grades)# Should be separate tablesstudents <- messy_grades %>%distinct(student_name, student_age, student_major)courses <- messy_grades %>%distinct(course_name, course_credits)enrollments <- messy_grades %>%select(student_name, course_name, grade)cat("\nTIDY - Students table:\n")print(students)cat("\nTIDY - Courses table:\n")print(courses)cat("\nTIDY - Enrollments table:\n")print(enrollments)```## Diagnosing Data Structure Problems### Quick Diagnostic Questions```{r}#| label: diagnostic-questionsdiagnostic_checklist <-function(data) {cat("TIDY DATA DIAGNOSTIC CHECKLIST\n")cat("==============================\n\n")cat("1. COLUMNS (Variables):\n")cat(" □ Does each column represent exactly one variable?\n")cat(" □ Are column names variable names (not values)?\n")cat(" □ Do you see patterns like '2020', '2021' in column names?\n")cat(" □ Do column names contain multiple pieces of info?\n\n")cat("2. ROWS (Observations):\n")cat(" □ Does each row represent exactly one observation?\n")cat(" □ Are observations split across multiple rows?\n")cat(" □ Are multiple observations squeezed into one row?\n\n")cat("3. CELLS (Values):\n")cat(" □ Does each cell contain exactly one value?\n")cat(" □ Are multiple values separated by commas/semicolons?\n")cat(" □ Are values and variable names mixed in cells?\n\n")cat("4. TABLES (Observational Units):\n")cat(" □ Does this table mix different types of entities?\n")cat(" □ Is information duplicated across rows?\n")cat(" □ Should this be split into multiple tables?\n\n")# Basic analysis of the provided datacat("QUICK ANALYSIS OF YOUR DATA:\n")cat("============================\n")cat("Dimensions:", nrow(data), "rows ×", ncol(data), "columns\n")cat("Column names:", paste(names(data), collapse =", "), "\n")# Check for common patterns col_names <-names(data)if (any(str_detect(col_names, "\\d{4}"))) {cat("⚠️ WARNING: Found years in column names - might need pivot_longer()\n") }if (any(str_detect(col_names, "_"))) {cat("⚠️ WARNING: Found underscores in column names - might need separate()\n") }if (length(unique(apply(data, 1, paste, collapse =""))) <nrow(data)) {cat("⚠️ WARNING: Found duplicate rows - check for repeated observations\n") }}# Example usagesample_messy <-tibble(id =1:3,name =c("Alice", "Bob", "Charlie"),`2023_sales`=c(1000, 1200, 1100),`2024_sales`=c(1100, 1300, 1200))diagnostic_checklist(sample_messy)```### Visual Pattern Recognition```{r}#| label: visual-patterns# Function to visualize data structure issuesshow_data_problems <-function() {cat("COMMON MESSY DATA PATTERNS - VISUAL GUIDE\n")cat("=========================================\n\n")# Pattern 1: Too widecat("PATTERN 1: TOO WIDE (headers are values)\n")cat("┌─────────┬─────┬─────┬─────┬─────┐\n")cat("│ country │2020 │2021 │2022 │2023 │\n")cat("├─────────┼─────┼─────┼─────┼─────┤\n")cat("│ USA │ 331 │ 332 │ 333 │ 334 │\n")cat("│ Canada │ 38 │ 38 │ 39 │ 39 │\n")cat("└─────────┴─────┴─────┴─────┴─────┘\n")cat("SHOULD BE: pivot_longer() to make it tall\n\n")# Pattern 2: Multiple variables in namescat("PATTERN 2: MULTIPLE VARIABLES IN COLUMN NAMES\n")cat("┌─────┬──────────┬──────────┬──────────┐\n")cat("│ id │treat_A_pre│treat_A_post│treat_B_pre│\n")cat("├─────┼──────────┼──────────┼──────────┤\n")cat("│ 1 │ 70 │ 75 │ 72 │\n")cat("└─────┴──────────┴──────────┴──────────┘\n")cat("SHOULD BE: pivot_longer() then separate()\n\n")# Pattern 3: Multiple values in cellscat("PATTERN 3: MULTIPLE VALUES IN CELLS\n")cat("┌─────────┬─────────────┬─────────────┐\n")cat("│ person │ contact │ skills │\n")cat("├─────────┼─────────────┼─────────────┤\n")cat("│ Alice │email;phone │ R,Python │\n")cat("│ Bob │email;phone │Excel,SQL │\n")cat("└─────────┴─────────────┴─────────────┘\n")cat("SHOULD BE: separate() into multiple columns\n\n")# Pattern 4: Variables as rowscat("PATTERN 4: VARIABLES STORED AS ROWS\n")cat("┌─────────┬─────────┬───────┐\n")cat("│ person │variable │ value │\n")cat("├─────────┼─────────┼───────┤\n")cat("│ Alice │ height │ 165 │\n")cat("│ Alice │ weight │ 60 │\n")cat("│ Bob │ height │ 180 │\n")cat("│ Bob │ weight │ 75 │\n")cat("└─────────┴─────────┴───────┘\n")cat("SHOULD BE: pivot_wider() to make columns\n\n")}show_data_problems()```## Planning Your Tidying Strategy### Step-by-Step Planning Process```{r}#| label: planning-strategy# Function to help plan tidying strategyplan_tidying <-function(data, target_description =NULL) {cat("TIDYING STRATEGY PLANNER\n")cat("========================\n\n")cat("STEP 1: UNDERSTAND YOUR CURRENT DATA\n")cat("Current dimensions:", nrow(data), "rows ×", ncol(data), "columns\n")cat("Current columns:", paste(names(data), collapse =", "), "\n\n")cat("STEP 2: ENVISION YOUR TARGET STRUCTURE\n")if (!is.null(target_description)) {cat("Target:", target_description, "\n") } else {cat("Ask yourself:\n")cat("- What should each row represent?\n")cat("- What variables do I need as columns?\n")cat("- What is the observational unit?\n") }cat("\n")cat("STEP 3: IDENTIFY REQUIRED TRANSFORMATIONS\n")# Check for common patterns col_names <-names(data)if (any(str_detect(col_names, "\\d{4}"))) {cat("📋 NEEDS: pivot_longer() - Found years/dates in column names\n") }if (any(str_detect(col_names, "_.*_"))) {cat("📋 NEEDS: separate() - Found multiple variables in column names\n") }# Check for values that look like they contain multiple items char_cols <- data %>%select(where(is.character))if (ncol(char_cols) >0) { multi_value_check <- char_cols %>%summarise(across(everything(), ~any(str_detect(.x, "[,;|]"), na.rm =TRUE)))if (any(multi_value_check)) {cat("📋 NEEDS: separate() - Found multiple values in cells\n") } }cat("\n")cat("STEP 4: PLAN THE SEQUENCE\n")cat("1. Start with the most fundamental transformation\n")cat("2. Work from general to specific\n")cat("3. Clean up column names and types at the end\n\n")cat("STEP 5: VERIFY EACH STEP\n")cat("- Check dimensions after each transformation\n")cat("- Verify that you haven't lost or duplicated data\n")cat("- Test with a small sample first\n")}# Example usagemessy_survey <-tibble(respondent =1:3,Q1_pre =c(3, 4, 2),Q1_post =c(4, 5, 4),Q2_pre =c(2, 3, 2),Q2_post =c(3, 4, 3),demographics =c("25_Male", "34_Female", "28_Male"))plan_tidying(messy_survey, "One row per respondent-question-time combination")```## Real-World Tidying Examples### Example 1: Financial Data```{r}#| label: financial-example# Typical financial data from spreadsheetsfinancial_messy <-tibble(company =c("Apple", "Google", "Microsoft"),`2021_Revenue`=c(365.8, 257.6, 168.1),`2021_Profit`=c(94.7, 76.0, 61.3),`2022_Revenue`=c(394.3, 282.8, 198.3),`2022_Profit`=c(99.8, 82.5, 72.7),`2023_Revenue`=c(383.3, 307.4, 211.9),`2023_Profit`=c(97.0, 73.8, 72.4))cat("MESSY: Financial data with years and metrics in column names\n")print(financial_messy)# Step 1: Identify the structurecat("\nANALYSIS:\n")cat("- Years (2021, 2022, 2023) are in column names → need pivot_longer()\n")cat("- Metrics (Revenue, Profit) are in column names → need separate()\n")cat("- Each company-year should be one observation\n")# Step 2: Tidy transformationfinancial_tidy <- financial_messy %>%# First, pivot longer to get year_metric combinationspivot_longer(cols =-company,names_to ="year_metric",values_to ="value" ) %>%# Then separate the year and metricseparate(year_metric, into =c("year", "metric"), sep ="_") %>%# Convert year to numericmutate(year =as.numeric(year)) %>%# Pivot wider to get metrics as columnspivot_wider(names_from = metric, values_from = value) %>%# Calculate derived metricsmutate(profit_margin =round(Profit / Revenue *100, 1),profit_growth =round((Profit /lag(Profit) -1) *100, 1) ) %>%arrange(company, year)cat("\nTIDY: Clean financial data ready for analysis\n")print(financial_tidy)# Now analysis is straightforwardcat("\nAnalysis example - Average profit margin by company:\n")financial_tidy %>%group_by(company) %>%summarise(avg_profit_margin =round(mean(profit_margin), 1), .groups ="drop") %>%arrange(desc(avg_profit_margin)) %>%print()```### Example 2: Survey Data```{r}#| label: survey-example# Complex survey data with multiple issuessurvey_messy <-tibble(id =1:4,name_age =c("Alice_25", "Bob_30", "Charlie_35", "Diana_28"),pre_satisfaction =c(3, 4, 2, 5),post_satisfaction =c(4, 5, 4, 5),pre_recommendation =c(2, 3, 2, 4),post_recommendation =c(4, 4, 3, 5),contact_info =c("alice@email.com;555-1234", "bob@company.org;555-5678","charlie@domain.com", "diana@startup.io;555-9999"))cat("MESSY: Survey data with multiple structural issues\n")print(survey_messy)# Multi-step tidying processsurvey_tidy <- survey_messy %>%# Step 1: Separate name and ageseparate(name_age, into =c("name", "age"), sep ="_") %>%mutate(age =as.numeric(age)) %>%# Step 2: Pivot longer to get measurement types in rowspivot_longer(cols =matches("(satisfaction|recommendation)"),names_to ="measurement_info",values_to ="score" ) %>%# Step 3: Separate measurement infoseparate(measurement_info, into =c("time", "metric"), sep ="_") %>%# Step 4: Handle contact info (separate into multiple rows for multiple contacts)separate_rows(contact_info, sep =";") %>%# Step 5: Clean up and organizemutate(contact_type =case_when(str_detect(contact_info, "@") ~"email",str_detect(contact_info, "^\\d{3}-\\d{4}$") ~"phone",TRUE~"other" ) ) %>%arrange(id, time, metric)cat("\nTIDY: Survey data in analysis-ready format\n")print(head(survey_tidy, 12))# Now we can easily analyze pre/post changescat("\nAnalysis example - Average improvement by metric:\n")survey_improvement <- survey_tidy %>%filter(contact_type =="email") %>%# One record per personselect(id, name, time, metric, score) %>%pivot_wider(names_from = time, values_from = score) %>%mutate(improvement = post - pre) %>%group_by(metric) %>%summarise(avg_improvement =round(mean(improvement, na.rm =TRUE), 2), .groups ="drop")print(survey_improvement)```## Best Practices for Tidy Data### Documentation and Naming```{r}#| label: best-practices# Good practices for maintaining tidy datacat("BEST PRACTICES FOR TIDY DATA\n")cat("============================\n\n")cat("1. CONSISTENT NAMING CONVENTIONS:\n")cat(" ✓ Use snake_case for column names\n")cat(" ✓ Use descriptive, unambiguous names\n")cat(" ✓ Avoid spaces, special characters\n")cat(" ✓ Be consistent with abbreviations\n\n")# Example of good naminggood_names_example <-tibble(customer_id =c("C001", "C002"),order_date =ymd(c("2024-01-15", "2024-01-16")),product_category =c("Electronics", "Clothing"),unit_price_usd =c(299.99, 49.99),quantity_ordered =c(1, 2),discount_percent =c(0.1, 0.0))cat("GOOD NAMING EXAMPLE:\n")print(good_names_example)cat("\n2. DATA TYPES:\n")cat(" ✓ Use appropriate data types (dates as Date, numbers as numeric)\n")cat(" ✓ Use factors for categorical data with known levels\n")cat(" ✓ Use logical (TRUE/FALSE) for binary variables\n")cat(" ✓ Document any special coding (NA, missing values)\n\n")cat("3. DOCUMENTATION:\n")cat(" ✓ Document your tidying process\n")cat(" ✓ Keep original data unchanged\n")cat(" ✓ Comment complex transformations\n")cat(" ✓ Include data dictionaries\n\n")cat("4. VALIDATION:\n")cat(" ✓ Check row counts before/after transformations\n")cat(" ✓ Verify no data was lost or duplicated\n")cat(" ✓ Test with edge cases and missing data\n")cat(" ✓ Compare aggregated totals before/after\n")```### Common Pitfalls and How to Avoid Them```{r}#| label: common-pitfallscat("COMMON PITFALLS IN DATA TIDYING\n")cat("===============================\n\n")# Pitfall 1: Losing data during transformationcat("PITFALL 1: LOSING DATA DURING TRANSFORMATION\n")example_data <-tibble(id =1:3,value1 =c(10, 20, 30),value2 =c(15, 25, 35))# Bad: Forgetting to check row countsbefore_rows <-nrow(example_data)transformed <- example_data %>%pivot_longer(cols =-id, names_to ="variable", values_to ="value")after_rows <-nrow(transformed)cat("Original rows:", before_rows, "\n")cat("After transformation:", after_rows, "\n")cat("✓ GOOD: Always verify row counts make sense\n\n")# Pitfall 2: Mixing data typescat("PITFALL 2: MIXING DATA TYPES\n")mixed_types <-tibble(metric =c("revenue", "count", "rate"),value =c("1000", "50", "0.05") # All stored as character!)cat("✗ BAD: All values stored as character\n")print(mixed_types)# Better approachproper_types <-tibble(metric =c("revenue", "count", "rate"),value =c(1000, 50, 0.05),unit =c("USD", "items", "proportion"))cat("✓ GOOD: Proper data types with units documented\n")print(proper_types)cat("\nPITFALL 3: OVER-TIDYING\n")cat("✗ BAD: Making data so long it's hard to understand\n")cat("✓ GOOD: Keep related variables together when it makes sense\n")cat("✓ GOOD: Consider your analysis needs when choosing structure\n\n")cat("PITFALL 4: NOT HANDLING MISSING VALUES\n")cat("✗ BAD: Ignoring how NAs will behave in transformations\n")cat("✓ GOOD: Explicitly handle missing values with coalesce(), replace_na()\n")cat("✓ GOOD: Document what missing values mean\n")```## Exercises### Exercise 1: Diagnostic PracticeGiven these messy datasets, identify what makes them untidy:1. A table with years as column headers2. A survey with pre_post measurements combined3. A contact list with multiple phone numbers in one cell4. A gradebook mixing student and course information### Exercise 2: Planning TransformationsFor each messy dataset, write out:1. What the tidy version should look like2. Which tidyr functions you'll need3. The order of operations4. How to verify the transformation worked### Exercise 3: Multi-Step TidyingPractice with complex datasets that require:1. Multiple pivot operations2. Separation and recombination3. Handling missing values4. Type conversions### Exercise 4: Real-World ScenariosWork with realistic messy data:1. Financial statements from spreadsheets2. Survey data with encoded responses3. Log files with mixed formats4. Scientific data with experimental conditions in names## SummaryUnderstanding tidy data principles is fundamental to effective data analysis:### The Three Rules:1. **Each variable forms a column**2. **Each observation forms a row**3. **Each type of observational unit forms a table**### Common Problems:- **Headers as values**: Years, treatments, conditions in column names- **Multiple variables in names**: Combined information that should be separated- **Multiple values in cells**: Lists or combined data in single cells- **Mixed observational units**: Different types of entities in one table### Diagnostic Process:1. **Examine the structure**: Look at dimensions and column names2. **Identify patterns**: Spot repeated elements and encoding3. **Plan transformations**: Map current state to desired tidy state4. **Execute step-by-step**: Transform incrementally and verify5. **Validate results**: Check row counts and data integrity### Best Practices:- **Plan before you pivot**: Understand your target structure- **Document your process**: Comment complex transformations- **Validate each step**: Check that data is preserved- **Use consistent naming**: Follow naming conventions- **Handle missing data explicitly**: Don't let NAs surprise youMastering tidy data principles provides the foundation for all other data analysis work. When your data is tidy, analysis becomes straightforward and intuitive!Next: **[Pivoting Data](pivoting-data.qmd)**