Nested Data and Rectangling: Advanced Data Structures

Author: IND215

Published: September 22, 2025

Understanding Nested Data Structures 🗂️

In the real world, data often arrives in complex, nested formats: JSON from APIs, lists within data frames, or hierarchical structures from databases. tidyr’s “rectangling” functions transform these structures into the rectangular, tidy format that most R analysis tools expect.
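As a minimal sketch of what rectangling means in practice, the snippet below parses a small JSON string with `jsonlite::fromJSON()` and flattens it with tidyr. The payload and the names `json_text`, `users_wide`, and `users_long` are invented for illustration only.

```r
library(jsonlite)
library(tibble)
library(tidyr)

# Hypothetical JSON payload, invented for this sketch
json_text <- '[
  {"id": 1, "name": "John", "tags": ["a", "b"]},
  {"id": 2, "name": "Jane", "tags": ["c"]}
]'

# simplifyVector = FALSE keeps the raw list-of-lists shape of the JSON
parsed <- fromJSON(json_text, simplifyVector = FALSE)

# One row per record, with the whole record stored in a list-column
users <- tibble(record = parsed)

# Spread the named fields of each record into columns
users_wide <- unnest_wider(users, record)

# Stack the remaining list-column so that each tag gets its own row
users_long <- unnest_longer(users_wide, tags)
```

Keeping `simplifyVector = FALSE` is a deliberate choice here: it leaves the nesting intact so that each flattening step is explicit rather than guessed by the parser.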

What is Nested Data?

library(tidyverse)
library(jsonlite)

cat("NESTED DATA STRUCTURES - Common Forms:\n\n")
NESTED DATA STRUCTURES - Common Forms:
# Example 1: List-columns in data frames
survey_nested <- tibble(
  survey_id = 1:4,
  participant = c("Alice", "Bob", "Carol", "David"),
  responses = list(
    c(5, 4, 3, 5, 4),
    c(3, 3, 4, 2, 5),
    c(4, 5, 5, 4, 4),
    c(2, 3, 3, 3, 2)
  ),
  demographics = list(
    list(age = 25, education = "Bachelor", income = 50000),
    list(age = 34, education = "Master", income = 75000),
    list(age = 28, education = "PhD", income = 85000),
    list(age = 42, education = "Bachelor", income = 60000)
  )
)

cat("Example 1: Survey data with list-columns\n")
Example 1: Survey data with list-columns
print(survey_nested)
# A tibble: 4 × 4
  survey_id participant responses demographics    
      <int> <chr>       <list>    <list>          
1         1 Alice       <dbl [5]> <named list [3]>
2         2 Bob         <dbl [5]> <named list [3]>
3         3 Carol       <dbl [5]> <named list [3]>
4         4 David       <dbl [5]> <named list [3]>
cat("Column types:", sapply(survey_nested, class), "\n")
Column types: integer character list list 
# Example 2: JSON-like nested structure
api_response <- tibble(
  endpoint = c("/users", "/products", "/orders"),
  data = list(
    list(
      users = list(
        list(id = 1, name = "John", email = "john@example.com", purchases = c(100, 250, 150)),
        list(id = 2, name = "Jane", email = "jane@example.com", purchases = c(300, 180))
      )
    ),
    list(
      products = list(
        list(id = "A1", name = "Widget", price = 25.99, categories = c("electronics", "gadgets")),
        list(id = "B2", name = "Gadget", price = 45.50, categories = c("electronics"))
      )
    ),
    list(
      orders = list(
        list(order_id = 1001, customer_id = 1, items = c("A1", "B2"), total = 71.49),
        list(order_id = 1002, customer_id = 2, items = c("A1"), total = 25.99)
      )
    )
  )
)

cat("\nExample 2: API response with deeply nested structure\n")

Example 2: API response with deeply nested structure
str(api_response, max.level = 2)
tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
 $ endpoint: chr [1:3] "/users" "/products" "/orders"
 $ data    :List of 3

unnest(): Expanding List-Columns

Basic Unnesting Operations

# Unnest simple vector lists
responses_unnested <- survey_nested %>%
  select(survey_id, participant, responses) %>%
  unnest(responses) %>%
  group_by(survey_id, participant) %>%
  mutate(question = paste0("Q", row_number())) %>%
  ungroup()

cat("Unnested survey responses:\n")
Unnested survey responses:
print(responses_unnested)
# A tibble: 20 × 4
   survey_id participant responses question
       <int> <chr>           <dbl> <chr>   
 1         1 Alice               5 Q1      
 2         1 Alice               4 Q2      
 3         1 Alice               3 Q3      
 4         1 Alice               5 Q4      
 5         1 Alice               4 Q5      
 6         2 Bob                 3 Q1      
 7         2 Bob                 3 Q2      
 8         2 Bob                 4 Q3      
 9         2 Bob                 2 Q4      
10         2 Bob                 5 Q5      
11         3 Carol               4 Q1      
12         3 Carol               5 Q2      
13         3 Carol               5 Q3      
14         3 Carol               4 Q4      
15         3 Carol               4 Q5      
16         4 David               2 Q1      
17         4 David               3 Q2      
18         4 David               3 Q3      
19         4 David               3 Q4      
20         4 David               2 Q5      
# Create analysis-ready format
responses_wide <- responses_unnested %>%
  pivot_wider(names_from = question, values_from = responses)

cat("\nAnalysis-ready wide format:\n")

Analysis-ready wide format:
print(responses_wide)
# A tibble: 4 × 7
  survey_id participant    Q1    Q2    Q3    Q4    Q5
      <int> <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
1         1 Alice           5     4     3     5     4
2         2 Bob             3     3     4     2     5
3         3 Carol           4     5     5     4     4
4         4 David           2     3     3     3     2
# Unnest named lists (demographics)
demographics_unnested <- survey_nested %>%
  select(survey_id, participant, demographics) %>%
  unnest_wider(demographics)

cat("\nUnnested demographics (wide format):\n")

Unnested demographics (wide format):
print(demographics_unnested)
# A tibble: 4 × 5
  survey_id participant   age education income
      <int> <chr>       <dbl> <chr>      <dbl>
1         1 Alice          25 Bachelor   50000
2         2 Bob            34 Master     75000
3         3 Carol          28 PhD        85000
4         4 David          42 Bachelor   60000
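The responses were widened above in two steps (`unnest()` followed by `pivot_wider()`). When every vector has the same length, `unnest_wider()` with `names_sep` may be a shorter route. The sketch below uses a small hypothetical tibble, `toy_survey`, rather than the survey data itself.

```r
library(tibble)
library(tidyr)

# Hypothetical data: unnamed numeric vectors in a list-column
toy_survey <- tibble(
  id = 1:2,
  responses = list(c(5, 4, 3), c(3, 3, 4))
)

# names_sep is needed because the vector elements have no names;
# the new columns are named from the column name plus the position
toy_wide <- unnest_wider(toy_survey, responses, names_sep = "_")
```

The resulting column names (`responses_1`, `responses_2`, ...) carry positions rather than question labels, so a rename step would still be needed to get `Q1`, `Q2`, and so on.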

Complex Unnesting Scenarios

# E-commerce order data with multiple levels
ecommerce_data <- tibble(
  order_id = 1:3,
  customer = c("Alice Johnson", "Bob Smith", "Carol Davis"),
  order_date = as.Date(c("2023-01-15", "2023-01-18", "2023-01-22")),
  items = list(
    list(
      list(product_id = "WIDGET_A", name = "Premium Widget", quantity = 2, unit_price = 25.99),
      list(product_id = "GADGET_B", name = "Smart Gadget", quantity = 1, unit_price = 45.50)
    ),
    list(
      list(product_id = "WIDGET_A", name = "Premium Widget", quantity = 3, unit_price = 25.99),
      list(product_id = "DEVICE_C", name = "IoT Device", quantity = 1, unit_price = 89.99)
    ),
    list(
      list(product_id = "GADGET_B", name = "Smart Gadget", quantity = 2, unit_price = 45.50)
    )
  ),
  shipping = list(
    list(method = "Standard", cost = 5.99, estimated_days = 5),
    list(method = "Express", cost = 12.99, estimated_days = 2),
    list(method = "Standard", cost = 5.99, estimated_days = 5)
  )
)

cat("Complex e-commerce nested data:\n")
Complex e-commerce nested data:
str(ecommerce_data, max.level = 2)
tibble [3 × 5] (S3: tbl_df/tbl/data.frame)
 $ order_id  : int [1:3] 1 2 3
 $ customer  : chr [1:3] "Alice Johnson" "Bob Smith" "Carol Davis"
 $ order_date: Date[1:3], format: "2023-01-15" "2023-01-18" ...
 $ items     :List of 3
 $ shipping  :List of 3
# Unnest items to get order details
order_items <- ecommerce_data %>%
  select(order_id, customer, order_date, items) %>%
  unnest(items) %>%
  unnest_wider(items) %>%
  mutate(
    line_total = quantity * unit_price,
    # Running line number across all orders (not restarted per order)
    item_number = row_number()
  )

cat("\nOrder items unnested:\n")

Order items unnested:
print(order_items)
# A tibble: 5 × 9
  order_id customer   order_date product_id name  quantity unit_price line_total
     <int> <chr>      <date>     <chr>      <chr>    <dbl>      <dbl>      <dbl>
1        1 Alice Joh… 2023-01-15 WIDGET_A   Prem…        2       26.0       52.0
2        1 Alice Joh… 2023-01-15 GADGET_B   Smar…        1       45.5       45.5
3        2 Bob Smith  2023-01-18 WIDGET_A   Prem…        3       26.0       78.0
4        2 Bob Smith  2023-01-18 DEVICE_C   IoT …        1       90.0       90.0
5        3 Carol Dav… 2023-01-22 GADGET_B   Smar…        2       45.5       91  
# ℹ 1 more variable: item_number <int>
# Unnest shipping information
shipping_info <- ecommerce_data %>%
  select(order_id, shipping) %>%
  unnest_wider(shipping)

cat("\nShipping information unnested:\n")

Shipping information unnested:
print(shipping_info)
# A tibble: 3 × 4
  order_id method    cost estimated_days
     <int> <chr>    <dbl>          <dbl>
1        1 Standard  5.99              5
2        2 Express  13.0               2
3        3 Standard  5.99              5
# Combine for complete order analysis
complete_orders <- order_items %>%
  group_by(order_id, customer, order_date) %>%
  summarise(
    total_items = sum(quantity),
    order_total = sum(line_total),
    unique_products = n_distinct(product_id),
    .groups = "drop"
  ) %>%
  left_join(shipping_info, by = "order_id") %>%
  mutate(
    grand_total = order_total + cost,
    total_delivery_days = estimated_days
  )

cat("\nComplete order analysis:\n")

Complete order analysis:
print(complete_orders)
# A tibble: 3 × 11
  order_id customer    order_date total_items order_total unique_products method
     <int> <chr>       <date>           <dbl>       <dbl>           <int> <chr> 
1        1 Alice John… 2023-01-15           3        97.5               2 Stand…
2        2 Bob Smith   2023-01-18           4       168.                2 Expre…
3        3 Carol Davis 2023-01-22           2        91                 1 Stand…
# ℹ 4 more variables: cost <dbl>, estimated_days <dbl>, grand_total <dbl>,
#   total_delivery_days <dbl>
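Row counts and distinct counts agree in the order summary only because no product appears twice within an order. A small sketch with hypothetical data (`order_lines` is not part of the tutorial's dataset) shows how the two diverge once a product repeats:

```r
library(dplyr)
library(tibble)

# Hypothetical order lines: product "A" appears twice in order 1
order_lines <- tibble(
  order_id = c(1, 1, 1, 2),
  product_id = c("A", "A", "B", "A")
)

line_counts <- order_lines %>%
  group_by(order_id) %>%
  summarise(
    line_count = n(),                          # rows per order
    unique_products = n_distinct(product_id),  # distinct products per order
    .groups = "drop"
  )
```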

Working with Varied List Structures

# Customer data with inconsistent nested structures
customer_profiles <- tibble(
  customer_id = 1:5,
  profile = list(
    list(name = "Alice", age = 25, preferences = c("electronics", "books"),
         contact = list(email = "alice@email.com", phone = "555-0123")),
    list(name = "Bob", age = 34, preferences = c("sports", "electronics"),
         contact = list(email = "bob@email.com")),  # No phone
    list(name = "Carol", age = 28, preferences = c("fashion", "beauty", "books"),
         contact = list(email = "carol@email.com", phone = "555-0156", address = "123 Main St")),
    list(name = "David", preferences = c("music"),  # No age
         contact = list(phone = "555-0189")),       # No email
    list(name = "Emma", age = 31, preferences = c("travel", "photography"))  # No contact
  )
)

cat("Customer profiles with inconsistent structures:\n")
Customer profiles with inconsistent structures:
str(customer_profiles, max.level = 2)
tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
 $ customer_id: int [1:5] 1 2 3 4 5
 $ profile    :List of 5
# Unnest profiles; fields missing from a record are filled with NA
basic_info <- customer_profiles %>%
  unnest_wider(profile) %>%
  select(customer_id, name, age) %>%
  mutate(age = as.numeric(age))  # age is already NA for David; ensure a double column

cat("\nBasic customer information:\n")

Basic customer information:
print(basic_info)
# A tibble: 5 × 3
  customer_id name    age
        <int> <chr> <dbl>
1           1 Alice    25
2           2 Bob      34
3           3 Carol    28
4           4 David    NA
5           5 Emma     31
# Handle preferences (vector lists)
preferences_unnested <- customer_profiles %>%
  unnest_wider(profile) %>%
  select(customer_id, name, preferences) %>%
  unnest(preferences)

cat("\nCustomer preferences (long format):\n")

Customer preferences (long format):
print(preferences_unnested)
# A tibble: 10 × 3
   customer_id name  preferences
         <int> <chr> <chr>      
 1           1 Alice electronics
 2           1 Alice books      
 3           2 Bob   sports     
 4           2 Bob   electronics
 5           3 Carol fashion    
 6           3 Carol beauty     
 7           3 Carol books      
 8           4 David music      
 9           5 Emma  travel     
10           5 Emma  photography
# Handle contact information (nested lists with varying structures)
contact_info <- customer_profiles %>%
  unnest_wider(profile) %>%
  select(customer_id, name, contact) %>%
  # Emma has no contact list: unnest_wider() keeps her row and fills NA.
  # (is.null(contact) would test the whole column, not each row, so it filters nothing.)
  unnest_wider(contact)

cat("\nContact information:\n")

Contact information:
print(contact_info)
# A tibble: 5 × 5
  customer_id name  email           phone    address    
        <int> <chr> <chr>           <chr>    <chr>      
1           1 Alice alice@email.com 555-0123 <NA>       
2           2 Bob   bob@email.com   <NA>     <NA>       
3           3 Carol carol@email.com 555-0156 123 Main St
4           4 David <NA>            555-0189 <NA>       
5           5 Emma  <NA>            <NA>     <NA>       
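Note that `is.null()` is not vectorised: applied to a list-column it tests the column object as a whole and returns a single `FALSE`, which is why Emma's row is still present above. If rows without contact details should actually be dropped, an element-wise check is needed. The sketch below assumes `purrr::map_lgl()` for that check and uses a hypothetical tibble, `toy_contacts`.

```r
library(tibble)
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical data: the second row has no contact list at all
toy_contacts <- tibble(
  id = 1:3,
  contact = list(
    list(email = "a@example.com", phone = "555-0100"),
    NULL,
    list(phone = "555-0101")
  )
)

# is.null() sees the whole column (a list), so this is a single FALSE
whole_column_check <- is.null(toy_contacts$contact)

# map_lgl() applies is.null() to each element: one value per row
element_check <- map_lgl(toy_contacts$contact, is.null)

# Drop rows without contact details, then widen the rest
with_contact <- toy_contacts %>%
  filter(!map_lgl(contact, is.null)) %>%
  unnest_wider(contact)
```

Whether to drop such rows or keep them as `NA` depends on the analysis; keeping them preserves one row per customer for later joins.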
# Comprehensive customer analysis
customer_summary <- basic_info %>%
  left_join(
    preferences_unnested %>%
      group_by(customer_id) %>%
      summarise(
        preference_count = n(),
        top_preferences = paste(preferences, collapse = ", "),
        .groups = "drop"
      ),
    by = "customer_id"
  ) %>%
  left_join(
    contact_info %>%
      select(customer_id, email, phone),
    by = "customer_id"
  ) %>%
  mutate(
    age_group = case_when(
      is.na(age) ~ NA_character_,
      age < 25 ~ "Under 25",
      age < 35 ~ "25-34",
      age < 45 ~ "35-44",
      TRUE ~ "45+"
    ),
    contact_completeness = case_when(
      !is.na(email) & !is.na(phone) ~ "Complete",
      !is.na(email) | !is.na(phone) ~ "Partial",
      TRUE ~ "Missing"
    )
  )

cat("\nComprehensive customer summary:\n")

Comprehensive customer summary:
print(customer_summary)
# A tibble: 5 × 9
  customer_id name    age preference_count top_preferences email phone age_group
        <int> <chr> <dbl>            <int> <chr>           <chr> <chr> <chr>    
1           1 Alice    25                2 electronics, b… alic… 555-… 25-34    
2           2 Bob      34                2 sports, electr… bob@… <NA>  25-34    
3           3 Carol    28                3 fashion, beaut… caro… 555-… 25-34    
4           4 David    NA                1 music           <NA>  555-… <NA>     
5           5 Emma     31                2 travel, photog… <NA>  <NA>  25-34    
# ℹ 1 more variable: contact_completeness <chr>

hoist(): Extracting Specific Elements

Selective Extraction from Nested Lists

# Product catalog with nested specifications
product_catalog <- tibble(
  product_id = c("LAPTOP_001", "PHONE_002", "TABLET_003"),
  category = c("Electronics", "Electronics", "Electronics"),
  specifications = list(
    list(
      brand = "TechCorp",
      model = "UltraBook Pro",
      specs = list(cpu = "Intel i7", ram = "16GB", storage = "512GB SSD"),
      features = c("Touchscreen", "Backlit Keyboard", "Fingerprint Reader"),
      price = list(base = 1299.99, sale = 1199.99, currency = "USD")
    ),
    list(
      brand = "MobileTech",
      model = "SmartPhone X",
      specs = list(cpu = "Snapdragon 888", ram = "8GB", storage = "128GB"),
      features = c("5G", "Wireless Charging", "Water Resistant"),
      price = list(base = 899.99, sale = 799.99, currency = "USD")
    ),
    list(
      brand = "TabletCorp",
      model = "Tablet Pro",
      specs = list(cpu = "A14 Bionic", ram = "6GB", storage = "256GB"),
      features = c("Apple Pencil Support", "Face ID", "Retina Display"),
      price = list(base = 649.99, currency = "USD")  # No sale price
    )
  )
)

cat("Product catalog with complex nested specifications:\n")
Product catalog with complex nested specifications:
str(product_catalog, max.level = 2)
tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
 $ product_id    : chr [1:3] "LAPTOP_001" "PHONE_002" "TABLET_003"
 $ category      : chr [1:3] "Electronics" "Electronics" "Electronics"
 $ specifications:List of 3
# Use hoist() to extract specific elements
products_hoisted <- product_catalog %>%
  hoist(specifications,
    brand = "brand",
    model = "model",
    base_price = c("price", "base"),
    sale_price = c("price", "sale"),
    cpu = c("specs", "cpu"),
    ram = c("specs", "ram")
  ) %>%
  mutate(
    has_sale = !is.na(sale_price),
    discount_percent = round((base_price - sale_price) / base_price * 100, 1)
  )

cat("\nProducts with hoisted key information:\n")

Products with hoisted key information:
print(products_hoisted)
# A tibble: 3 × 11
  product_id category    brand      model      base_price sale_price cpu   ram  
  <chr>      <chr>       <chr>      <chr>           <dbl>      <dbl> <chr> <chr>
1 LAPTOP_001 Electronics TechCorp   UltraBook…      1300.      1200. Inte… 16GB 
2 PHONE_002  Electronics MobileTech SmartPhon…       900.       800. Snap… 8GB  
3 TABLET_003 Electronics TabletCorp Tablet Pro       650.        NA  A14 … 6GB  
# ℹ 3 more variables: specifications <list>, has_sale <lgl>,
#   discount_percent <dbl>
# Extract features using unnest after hoist
product_features <- product_catalog %>%
  hoist(specifications, features = "features") %>%
  select(product_id, category, features) %>%
  unnest(features)

cat("\nProduct features extracted:\n")

Product features extracted:
print(product_features)
# A tibble: 9 × 3
  product_id category    features            
  <chr>      <chr>       <chr>               
1 LAPTOP_001 Electronics Touchscreen         
2 LAPTOP_001 Electronics Backlit Keyboard    
3 LAPTOP_001 Electronics Fingerprint Reader  
4 PHONE_002  Electronics 5G                  
5 PHONE_002  Electronics Wireless Charging   
6 PHONE_002  Electronics Water Resistant     
7 TABLET_003 Electronics Apple Pencil Support
8 TABLET_003 Electronics Face ID             
9 TABLET_003 Electronics Retina Display      
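Two behaviours of `hoist()` are worth seeing in isolation: a path that does not exist in a record yields `NA` rather than an error, and by default the extracted components are removed from the list-column. The sketch below uses hypothetical data (`toy_products`) and sets `.remove = FALSE` to keep the original list intact.

```r
library(tibble)
library(tidyr)

# Hypothetical catalog: the second product has no sale price
toy_products <- tibble(
  product_id = c("P1", "P2"),
  spec = list(
    list(brand = "A", price = list(base = 10, sale = 8)),
    list(brand = "B", price = list(base = 20))
  )
)

# Character vectors describe a path into the nested list;
# .remove = FALSE leaves the hoisted components in `spec` as well
toy_hoisted <- hoist(
  toy_products, spec,
  brand = "brand",
  sale_price = c("price", "sale"),
  .remove = FALSE
)
```

Keeping the list-column untouched can be convenient when the same structure is hoisted several times in separate pipelines, at the cost of carrying duplicate data.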

Advanced Hoisting Techniques

# Financial data with nested time series and metadata
financial_data <- tibble(
  symbol = c("AAPL", "GOOGL", "MSFT"),
  data = list(
    list(
      company = "Apple Inc.",
      sector = "Technology",
      prices = list(
        "2023-01-01" = 150.25,
        "2023-01-02" = 152.10,
        "2023-01-03" = 149.80
      ),
      metrics = list(
        pe_ratio = 25.4,
        market_cap = "2.8T",
        dividend_yield = 0.52
      ),
      analysts = list(
        rating = "Buy",
        target_price = 180.00,
        consensus = list(buy = 15, hold = 3, sell = 1)
      )
    ),
    list(
      company = "Alphabet Inc.",
      sector = "Technology",
      prices = list(
        "2023-01-01" = 88.73,
        "2023-01-02" = 89.45,
        "2023-01-03" = 87.92
      ),
      metrics = list(
        pe_ratio = 18.7,
        market_cap = "1.1T",
        dividend_yield = 0.0
      ),
      analysts = list(
        rating = "Buy",
        target_price = 120.00,
        consensus = list(buy = 18, hold = 2, sell = 0)
      )
    ),
    list(
      company = "Microsoft Corporation",
      sector = "Technology",
      prices = list(
        "2023-01-01" = 240.22,
        "2023-01-02" = 242.15,
        "2023-01-03" = 238.90
      ),
      metrics = list(
        pe_ratio = 28.1,
        market_cap = "1.8T",
        dividend_yield = 0.73
      ),
      analysts = list(
        rating = "Strong Buy",
        target_price = 280.00,
        consensus = list(buy = 20, hold = 1, sell = 0)
      )
    )
  )
)

cat("Financial data with deeply nested structure:\n")
Financial data with deeply nested structure:
str(financial_data, max.level = 2)
tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
 $ symbol: chr [1:3] "AAPL" "GOOGL" "MSFT"
 $ data  :List of 3
# Extract key financial metrics
financial_metrics <- financial_data %>%
  hoist(data,
    company = "company",
    sector = "sector",
    pe_ratio = c("metrics", "pe_ratio"),
    market_cap = c("metrics", "market_cap"),
    dividend_yield = c("metrics", "dividend_yield"),
    analyst_rating = c("analysts", "rating"),
    target_price = c("analysts", "target_price"),
    buy_ratings = c("analysts", "consensus", "buy"),
    hold_ratings = c("analysts", "consensus", "hold")
  ) %>%
  mutate(
    # Sell ratings are not hoisted, so this total covers buy and hold only
    total_ratings = buy_ratings + hold_ratings,
    buy_percentage = round(buy_ratings / total_ratings * 100, 1)
  )

cat("\nExtracted financial metrics:\n")

Extracted financial metrics:
print(financial_metrics)
# A tibble: 3 × 13
  symbol company        sector pe_ratio market_cap dividend_yield analyst_rating
  <chr>  <chr>          <chr>     <dbl> <chr>               <dbl> <chr>         
1 AAPL   Apple Inc.     Techn…     25.4 2.8T                 0.52 Buy           
2 GOOGL  Alphabet Inc.  Techn…     18.7 1.1T                 0    Buy           
3 MSFT   Microsoft Cor… Techn…     28.1 1.8T                 0.73 Strong Buy    
# ℹ 6 more variables: target_price <dbl>, buy_ratings <dbl>,
#   hold_ratings <dbl>, data <list>, total_ratings <dbl>, buy_percentage <dbl>
# Extract and process price data
price_data <- financial_data %>%
  hoist(data, prices = "prices") %>%
  select(symbol, prices) %>%
  unnest_longer(prices, indices_to = "date") %>%
  mutate(
    date = as.Date(date),
    price = as.numeric(prices)
  ) %>%
  select(-prices) %>%
  group_by(symbol) %>%
  mutate(
    price_change = price - lag(price),
    price_change_pct = round((price / lag(price) - 1) * 100, 2)
  ) %>%
  ungroup()

cat("\nProcessed price data:\n")

Processed price data:
print(price_data)
# A tibble: 9 × 5
  symbol date       price price_change price_change_pct
  <chr>  <date>     <dbl>        <dbl>            <dbl>
1 AAPL   2023-01-01 150.        NA                NA   
2 AAPL   2023-01-02 152.         1.85              1.23
3 AAPL   2023-01-03 150.        -2.30             -1.51
4 GOOGL  2023-01-01  88.7       NA                NA   
5 GOOGL  2023-01-02  89.4        0.720             0.81
6 GOOGL  2023-01-03  87.9       -1.53             -1.71
7 MSFT   2023-01-01 240.        NA                NA   
8 MSFT   2023-01-02 242.         1.93              0.8 
9 MSFT   2023-01-03 239.        -3.25             -1.34
# Combine for investment analysis
investment_analysis <- financial_metrics %>%
  left_join(
    price_data %>%
      group_by(symbol) %>%
      summarise(
        avg_price = round(mean(price), 2),
        price_volatility = round(sd(price_change, na.rm = TRUE), 2),
        latest_price = last(price),
        .groups = "drop"
      ),
    by = "symbol"
  ) %>%
  mutate(
    upside_potential = round((target_price - latest_price) / latest_price * 100, 1),
    investment_score = round((buy_percentage / 100) * (upside_potential / 10) + (dividend_yield * 10), 1)
  ) %>%
  arrange(desc(investment_score))

cat("\nInvestment analysis summary:\n")

Investment analysis summary:
print(investment_analysis)
# A tibble: 3 × 18
  symbol company        sector pe_ratio market_cap dividend_yield analyst_rating
  <chr>  <chr>          <chr>     <dbl> <chr>               <dbl> <chr>         
1 MSFT   Microsoft Cor… Techn…     28.1 1.8T                 0.73 Strong Buy    
2 AAPL   Apple Inc.     Techn…     25.4 2.8T                 0.52 Buy           
3 GOOGL  Alphabet Inc.  Techn…     18.7 1.1T                 0    Buy           
# ℹ 11 more variables: target_price <dbl>, buy_ratings <dbl>,
#   hold_ratings <dbl>, data <list>, total_ratings <dbl>, buy_percentage <dbl>,
#   avg_price <dbl>, price_volatility <dbl>, latest_price <dbl>,
#   upside_potential <dbl>, investment_score <dbl>

unnest_longer() and unnest_wider(): Specialized Unnesting

Understanding the Difference

# Sample data for comparison
sample_nested <- tibble(
  id = 1:3,
  measurements = list(
    list(height = 175, weight = 70, bmi = 22.9),
    list(height = 180, weight = 75, bmi = 23.1),
    list(height = 165, weight = 60, bmi = 22.0)
  ),
  tags = list(
    c("healthy", "active", "young"),
    c("athletic", "healthy"),
    c("young", "active", "healthy", "student")
  )
)

cat("Sample nested data:\n")
Sample nested data:
print(sample_nested)
# A tibble: 3 × 3
     id measurements     tags     
  <int> <list>           <list>   
1     1 <named list [3]> <chr [3]>
2     2 <named list [3]> <chr [2]>
3     3 <named list [3]> <chr [4]>
# unnest_wider: Spreads list elements into columns
measurements_wide <- sample_nested %>%
  select(id, measurements) %>%
  unnest_wider(measurements)

cat("\nunnest_wider result (measurements as columns):\n")

unnest_wider result (measurements as columns):
print(measurements_wide)
# A tibble: 3 × 4
     id height weight   bmi
  <int>  <dbl>  <dbl> <dbl>
1     1    175     70  22.9
2     2    180     75  23.1
3     3    165     60  22  
# unnest_longer: Stacks list elements into rows
tags_long <- sample_nested %>%
  select(id, tags) %>%
  unnest_longer(tags)

cat("\nunnest_longer result (tags as rows):\n")

unnest_longer result (tags as rows):
print(tags_long)
# A tibble: 9 × 2
     id tags    
  <int> <chr>   
1     1 healthy 
2     1 active  
3     1 young   
4     2 athletic
5     2 healthy 
6     3 young   
7     3 active  
8     3 healthy 
9     3 student 
# Combined approach for comprehensive analysis
complete_analysis <- sample_nested %>%
  unnest_wider(measurements) %>%
  left_join(
    tags_long %>%
      group_by(id) %>%
      summarise(tag_list = paste(tags, collapse = ", "), .groups = "drop"),
    by = "id"
  )

cat("\nCombined analysis:\n")

Combined analysis:
print(complete_analysis)
# A tibble: 3 × 6
     id height weight   bmi tags      tag_list                       
  <int>  <dbl>  <dbl> <dbl> <list>    <chr>                          
1     1    175     70  22.9 <chr [3]> healthy, active, young         
2     2    180     75  23.1 <chr [2]> athletic, healthy              
3     3    165     60  22   <chr [4]> young, active, healthy, student
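One detail the comparison does not show is what happens to empty elements. By default `unnest_longer()` drops a row whose list element has length zero; the `keep_empty` argument retains it as `NA`. A minimal sketch with a hypothetical tibble, `toy_tags`:

```r
library(tibble)
library(tidyr)

# Hypothetical data: the second id has no tags
toy_tags <- tibble(
  id = 1:3,
  tags = list(c("a", "b"), character(0), "c")
)

# Default behaviour: the row for id 2 disappears
tags_dropped <- unnest_longer(toy_tags, tags)

# keep_empty = TRUE keeps id 2 with a missing tag
tags_kept <- unnest_longer(toy_tags, tags, keep_empty = TRUE)
```

Keeping empty rows matters whenever the result is later counted or joined by `id`, since silently dropped ids would otherwise vanish from the analysis.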

Real-World JSON Processing

# Simulate API response data (like from a social media platform)
social_media_data <- tibble(
  user_id = 1:4,
  profile = list(
    list(
      username = "tech_alice",
      followers = 1250,
      following = 340,
      posts = list(
        list(id = 1, text = "Great day at the conference!", likes = 45, comments = 8, hashtags = c("tech", "networking")),
        list(id = 2, text = "New project launching soon", likes = 32, comments = 12, hashtags = c("startup", "innovation"))
      ),
      interests = c("technology", "startups", "design")
    ),
    list(
      username = "fitness_bob",
      followers = 890,
      following = 180,
      posts = list(
        list(id = 3, text = "Morning workout complete!", likes = 67, comments = 4, hashtags = c("fitness", "morning")),
        list(id = 4, text = "Healthy meal prep tips", likes = 89, comments = 15, hashtags = c("nutrition", "health"))
      ),
      interests = c("fitness", "nutrition", "wellness")
    ),
    list(
      username = "travel_carol",
      followers = 2100,
      following = 450,
      posts = list(
        list(id = 5, text = "Amazing sunset in Bali", likes = 156, comments = 23, hashtags = c("travel", "sunset", "bali")),
        list(id = 6, text = "Travel tips for budget backpacking", likes = 98, comments = 31, hashtags = c("travel", "budget", "backpacking"))
      ),
      interests = c("travel", "photography", "culture")
    ),
    list(
      username = "foodie_david",
      followers = 670,
      following = 290,
      posts = list(
        list(id = 7, text = "Homemade pasta recipe", likes = 78, comments = 19, hashtags = c("cooking", "pasta", "homemade"))
      ),
      interests = c("cooking", "food", "recipes")
    )
  )
)

cat("Social media nested data structure:\n")
Social media nested data structure:
str(social_media_data, max.level = 2)
tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
 $ user_id: int [1:4] 1 2 3 4
 $ profile:List of 4
# Extract user profile information
user_profiles <- social_media_data %>%
  hoist(profile,
    username = "username",
    followers = "followers",
    following = "following",
    interests = "interests"
  ) %>%
  select(-profile) %>%
  unnest_longer(interests) %>%
  group_by(user_id, username, followers, following) %>%
  summarise(
    interest_count = n(),
    interest_list = paste(interests, collapse = ", "),
    .groups = "drop"
  ) %>%
  mutate(
    follower_ratio = round(followers / following, 2),
    popularity_tier = case_when(
      followers >= 2000 ~ "High",
      followers >= 1000 ~ "Medium",
      TRUE ~ "Low"
    )
  )

cat("\nUser profile analysis:\n")

User profile analysis:
print(user_profiles)
# A tibble: 4 × 8
  user_id username     followers following interest_count interest_list         
    <int> <chr>            <dbl>     <dbl>          <int> <chr>                 
1       1 tech_alice        1250       340              3 technology, startups,…
2       2 fitness_bob        890       180              3 fitness, nutrition, w…
3       3 travel_carol      2100       450              3 travel, photography, …
4       4 foodie_david       670       290              3 cooking, food, recipes
# ℹ 2 more variables: follower_ratio <dbl>, popularity_tier <chr>
# Extract and analyze posts
posts_analysis <- social_media_data %>%
  hoist(profile, username = "username", posts = "posts") %>%
  select(user_id, username, posts) %>%
  unnest_longer(posts) %>%
  unnest_wider(posts) %>%
  unnest_longer(hashtags) %>%
  group_by(user_id, username, id, text, likes, comments) %>%
  summarise(
    hashtag_count = n(),
    hashtag_list = paste(hashtags, collapse = ", "),
    .groups = "drop"
  ) %>%
  mutate(
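    # Interactions per 100 likes (not a per-follower rate), so always >= 100
    engagement_rate = round((likes + comments) / likes * 100, 1),
    # str_detect() is case-sensitive: "Healthy ..." and "Travel ..." miss the
    # lowercase patterns below and fall through to "Other"
    content_category = case_when(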
      str_detect(text, "workout|fitness|health") ~ "Fitness",
      str_detect(text, "travel|trip|vacation") ~ "Travel",
      str_detect(text, "food|recipe|cooking") ~ "Food",
      str_detect(text, "tech|project|conference") ~ "Technology",
      TRUE ~ "Other"
    )
  )

cat("\nPost analysis:\n")

Post analysis:
print(posts_analysis)
# A tibble: 7 × 10
  user_id username        id text      likes comments hashtag_count hashtag_list
    <int> <chr>        <dbl> <chr>     <dbl>    <dbl>         <int> <chr>       
1       1 tech_alice       1 Great da…    45        8             2 tech, netwo…
2       1 tech_alice       2 New proj…    32       12             2 startup, in…
3       2 fitness_bob      3 Morning …    67        4             2 fitness, mo…
4       2 fitness_bob      4 Healthy …    89       15             2 nutrition, …
5       3 travel_carol     5 Amazing …   156       23             3 travel, sun…
6       3 travel_carol     6 Travel t…    98       31             3 travel, bud…
7       4 foodie_david     7 Homemade…    78       19             3 cooking, pa…
# ℹ 2 more variables: engagement_rate <dbl>, content_category <chr>
# Content performance by category
category_performance <- posts_analysis %>%
  group_by(content_category) %>%
  summarise(
    post_count = n(),
    avg_likes = round(mean(likes), 1),
    avg_comments = round(mean(comments), 1),
    avg_hashtags = round(mean(hashtag_count), 1),
    avg_engagement = round(mean(engagement_rate), 1),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_engagement))

cat("\nContent performance by category:\n")

Content performance by category:
print(category_performance)
# A tibble: 4 × 6
  content_category post_count avg_likes avg_comments avg_hashtags avg_engagement
  <chr>                 <int>     <dbl>        <dbl>        <dbl>          <dbl>
1 Technology                2      38.5           10          2             128.
2 Food                      1      78             19          3             124.
3 Other                     3     114.            23          2.7           121.
4 Fitness                   1      67              4          2             106 
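Three posts land in "Other" partly because the category patterns are lowercase while `str_detect()` matches case-sensitively by default. A sketch of case-insensitive matching with `stringr::regex(ignore_case = TRUE)`, using a hypothetical vector `sample_texts` copied from the post texts:

```r
library(stringr)

# Post texts taken from the example data above
sample_texts <- c(
  "Healthy meal prep tips",
  "Travel tips for budget backpacking",
  "Morning workout complete!"
)

fitness_pattern <- "workout|fitness|health"

# Default matching is case-sensitive: "Healthy" does not match "health"
case_sensitive <- str_detect(sample_texts, fitness_pattern)

# regex(ignore_case = TRUE) makes the same pattern match "Healthy"
case_insensitive <- str_detect(
  sample_texts,
  regex(fitness_pattern, ignore_case = TRUE)
)
```

Applying the same wrapper to each pattern in the `case_when()` would change the category counts and averages shown above, so the printed table would need to be regenerated.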

Advanced Rectangling Patterns

Handling Deeply Nested JSON-like Structures

# Complex enterprise data structure
enterprise_data <- tibble(
  department_id = 1:3,
  department_info = list(
    list(
      name = "Engineering",
      budget = 2500000,
      teams = list(
        list(
          team_name = "Backend",
          members = list(
            list(name = "Alice", role = "Senior", salary = 120000, skills = c("Python", "AWS", "Docker")),
            list(name = "Bob", role = "Junior", salary = 80000, skills = c("Python", "MySQL"))
          ),
          projects = list(
            list(name = "API Redesign", status = "Active", budget = 150000),
            list(name = "Database Migration", status = "Planning", budget = 200000)
          )
        ),
        list(
          team_name = "Frontend",
          members = list(
            list(name = "Carol", role = "Senior", salary = 115000, skills = c("React", "TypeScript", "CSS")),
            list(name = "David", role = "Mid", salary = 95000, skills = c("Vue", "JavaScript"))
          ),
          projects = list(
            list(name = "UI Refresh", status = "Active", budget = 100000)
          )
        )
      )
    ),
    list(
      name = "Marketing",
      budget = 800000,
      teams = list(
        list(
          team_name = "Digital",
          members = list(
            list(name = "Emma", role = "Manager", salary = 105000, skills = c("SEO", "Analytics", "AdWords")),
            list(name = "Frank", role = "Specialist", salary = 70000, skills = c("Social Media", "Content"))
          ),
          projects = list(
            list(name = "Brand Campaign", status = "Active", budget = 250000),
            list(name = "Website Optimization", status = "Completed", budget = 75000)
          )
        )
      )
    ),
    list(
      name = "Sales",
      budget = 1200000,
      teams = list(
        list(
          team_name = "Enterprise",
          members = list(
            list(name = "Grace", role = "Director", salary = 140000, skills = c("B2B Sales", "Negotiation")),
            list(name = "Henry", role = "Rep", salary = 85000, skills = c("Cold Calling", "CRM"))
          ),
          projects = list(
            list(name = "Q1 Targets", status = "Active", budget = 50000)
          )
        )
      )
    )
  )
)

cat("Complex enterprise data structure:\n")
Complex enterprise data structure:
str(enterprise_data, max.level = 3)
tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
 $ department_id  : int [1:3] 1 2 3
 $ department_info:List of 3
  ..$ :List of 3
  ..$ :List of 3
  ..$ :List of 3
# Extract department overview
departments <- enterprise_data %>%
  hoist(department_info,
    dept_name = "name",
    budget = "budget",
    teams = "teams"
  )

cat("\nDepartment overview:\n")

Department overview:
print(departments)
# A tibble: 3 × 4
  department_id dept_name    budget teams     
          <int> <chr>         <dbl> <list>    
1             1 Engineering 2500000 <list [2]>
2             2 Marketing    800000 <list [1]>
3             3 Sales       1200000 <list [1]>
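hoist() paths are not limited to top-level names. As a minimal sketch on a trimmed-down copy of the structure (the `depts` tibble and the `first_team` column are illustrative and not part of the example above), a list path can mix names and positions to reach the first team's name directly:

```r
library(dplyr)
library(tidyr)

# Trimmed-down, illustrative copy of the department structure
depts <- tibble(
  department_id = 1:2,
  department_info = list(
    list(name = "Engineering",
         teams = list(list(team_name = "Backend"), list(team_name = "Frontend"))),
    list(name = "Marketing",
         teams = list(list(team_name = "Digital")))
  )
)

overview <- depts %>%
  hoist(department_info,
    dept_name = "name",
    # list() path: element "teams", then position 1, then "team_name"
    first_team = list("teams", 1L, "team_name"),
    .remove = FALSE  # keep the original list-column for later steps
  )

print(overview)
```

Keeping `.remove = FALSE` leaves `department_info` intact, which is convenient when later steps still need the full nested structure.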
# Extract team and member information
team_members <- departments %>%
  unnest_longer(teams) %>%
  hoist(teams,
    team_name = "team_name",
    members = "members"
  ) %>%
  unnest_longer(members) %>%
  unnest_wider(members) %>%
  unnest_longer(skills) %>%
  group_by(department_id, dept_name, budget, team_name, name, role, salary) %>%
  summarise(
    skill_count = n(),
    skill_list = paste(skills, collapse = ", "),
    .groups = "drop"
  )

cat("\nTeam members with skills:\n")

Team members with skills:
print(team_members)
# A tibble: 8 × 9
  department_id dept_name    budget team_name  name  role     salary skill_count
          <int> <chr>         <dbl> <chr>      <chr> <chr>     <dbl>       <int>
1             1 Engineering 2500000 Backend    Alice Senior   120000           3
2             1 Engineering 2500000 Backend    Bob   Junior    80000           2
3             1 Engineering 2500000 Frontend   Carol Senior   115000           3
4             1 Engineering 2500000 Frontend   David Mid       95000           2
5             2 Marketing    800000 Digital    Emma  Manager  105000           3
6             2 Marketing    800000 Digital    Frank Special…  70000           2
7             3 Sales       1200000 Enterprise Grace Director 140000           2
8             3 Sales       1200000 Enterprise Henry Rep       85000           2
# ℹ 1 more variable: skill_list <chr>
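The pipeline above flattens `members` but leaves the `projects` lists untouched. The same unnest_longer() then unnest_wider() pattern should apply to them. The sketch below uses a small illustrative tibble (`org`, not the full `enterprise_data`) and `names_sep` to prefix the project fields so they cannot clash with other columns:

```r
library(dplyr)
library(tidyr)

# Illustrative subset of the department/team/project hierarchy
org <- tibble(
  dept = c("Engineering", "Marketing"),
  info = list(
    list(teams = list(
      list(team_name = "Backend",
           projects = list(
             list(name = "API Redesign", status = "Active", budget = 150000),
             list(name = "Database Migration", status = "Planning", budget = 200000)
           )),
      list(team_name = "Frontend",
           projects = list(
             list(name = "UI Refresh", status = "Active", budget = 100000)
           ))
    )),
    list(teams = list(
      list(team_name = "Digital",
           projects = list(
             list(name = "Brand Campaign", status = "Active", budget = 250000),
             list(name = "Website Optimization", status = "Completed", budget = 75000)
           ))
    ))
  )
)

projects <- org %>%
  hoist(info, teams = "teams") %>%
  unnest_longer(teams) %>%                       # one row per team
  hoist(teams, team_name = "team_name", projects = "projects") %>%
  unnest_longer(projects) %>%                    # one row per project
  unnest_wider(projects, names_sep = "_") %>%    # projects_name, projects_status, projects_budget
  select(dept, team_name, starts_with("projects_"))

# Total project budget per department
project_budget <- projects %>%
  group_by(dept) %>%
  summarise(total_project_budget = sum(projects_budget), .groups = "drop")

print(projects)
print(project_budget)
```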
# Department analytics
dept_analytics <- team_members %>%
  group_by(department_id, dept_name, budget) %>%
  summarise(
    total_employees = n(),
    avg_salary = round(mean(salary), 0),
    total_payroll = sum(salary),
    skill_diversity = length(unique(unlist(str_split(skill_list, ", ")))),
    .groups = "drop"
  ) %>%
  mutate(
    payroll_ratio = round(total_payroll / budget * 100, 1),
    budget_per_employee = round(budget / total_employees, 0)
  ) %>%
  arrange(desc(budget))

cat("\nDepartment analytics:\n")

Department analytics:
print(dept_analytics)
# A tibble: 3 × 9
  department_id dept_name    budget total_employees avg_salary total_payroll
          <int> <chr>         <dbl>           <int>      <dbl>         <dbl>
1             1 Engineering 2500000               4     102500        410000
2             3 Sales       1200000               2     112500        225000
3             2 Marketing    800000               2      87500        175000
# ℹ 3 more variables: skill_diversity <int>, payroll_ratio <dbl>,
#   budget_per_employee <dbl>
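The printed tibble truncates the derived columns, so it can help to check the arithmetic by hand. For Engineering, payroll_ratio is 410000 / 2500000 × 100 = 16.4 and budget_per_employee is 2500000 / 4 = 625000. The sketch below recomputes both columns from the values shown in the output; the `dept` tibble is typed in by hand for illustration:

```r
library(dplyr)

# Values copied from the printed dept_analytics output above
dept <- tibble(
  dept_name = c("Engineering", "Sales", "Marketing"),
  budget = c(2500000, 1200000, 800000),
  total_employees = c(4L, 2L, 2L),
  total_payroll = c(410000, 225000, 175000)
)

derived <- dept %>%
  mutate(
    payroll_ratio = round(total_payroll / budget * 100, 1),      # payroll as % of budget
    budget_per_employee = round(budget / total_employees, 0)
  )

# Widen the console so no columns are truncated
print(derived, width = Inf)
```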
# Skills analysis across organization
skills_analysis <- team_members %>%
  select(dept_name, team_name, name, skill_list) %>%
  separate_rows(skill_list, sep = ", ") %>%
  rename(skill = skill_list) %>%
  group_by(skill) %>%
  summarise(
    employee_count = n(),
    departments = length(unique(dept_name)),
    teams = length(unique(paste(dept_name, team_name))),
    .groups = "drop"
  ) %>%
  arrange(desc(employee_count))

cat("\nSkills analysis across organization:\n")

Skills analysis across organization:
print(skills_analysis)
# A tibble: 18 × 4
   skill        employee_count departments teams
   <chr>                 <int>       <int> <int>
 1 Python                    2           1     1
 2 AWS                       1           1     1
 3 AdWords                   1           1     1
 4 Analytics                 1           1     1
 5 B2B Sales                 1           1     1
 6 CRM                       1           1     1
 7 CSS                       1           1     1
 8 Cold Calling              1           1     1
 9 Content                   1           1     1
10 Docker                    1           1     1
11 JavaScript                1           1     1
12 MySQL                     1           1     1
13 Negotiation               1           1     1
14 React                     1           1     1
15 SEO                       1           1     1
16 Social Media              1           1     1
17 TypeScript                1           1     1
18 Vue                       1           1     1
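The skills analysis above splits `skill_list` back apart after it was pasted together. When the original list-column is still available, the same counts can be taken from it directly. A minimal sketch, using a small illustrative `members` tibble rather than the full pipeline output:

```r
library(dplyr)
library(tidyr)

# Illustrative member table with skills kept as a list-column
members <- tibble(
  dept_name = c("Engineering", "Engineering", "Marketing"),
  name = c("Alice", "Bob", "Emma"),
  skills = list(
    c("Python", "AWS", "Docker"),
    c("Python", "MySQL"),
    c("SEO", "Analytics", "AdWords")
  )
)

skill_counts <- members %>%
  unnest_longer(skills, values_to = "skill") %>%   # one row per person-skill pair
  count(skill, name = "employee_count", sort = TRUE)

print(skill_counts)
```

Counting from the list-column avoids the paste-and-split round trip, which would break if a skill name ever contained the separator.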

Performance Considerations and Best Practices

cat("🚀 PERFORMANCE OPTIMIZATION FOR NESTED DATA:\n\n")
🚀 PERFORMANCE OPTIMIZATION FOR NESTED DATA:
cat("1. MEMORY MANAGEMENT:\n")
1. MEMORY MANAGEMENT:
cat("   - Process nested data in chunks for large datasets\n")
   - Process nested data in chunks for large datasets
cat("   - Remove unnecessary nesting levels early\n")
   - Remove unnecessary nesting levels early
cat("   - Use specific hoist() extractions instead of full unnesting\n")
   - Use specific hoist() extractions instead of full unnesting
cat("   - Consider data.table::rbindlist() for very large list processing\n\n")
   - Consider data.table::rbindlist() for very large list processing
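The first and third memory tips can be sketched together as follows. The `events` tibble, the chunk size, and the `flatten_chunk()` helper are all hypothetical and chosen only for illustration; real chunk sizes depend on the data and available memory:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Illustrative data: each payload carries a bulky field we do not need
events <- tibble(
  id = 1:6,
  payload = map(1:6, function(i) {
    list(user = paste0("u", i), score = i * 10, debug = rep("x", 5))
  })
)

# Hypothetical helper: pull out only the needed fields from one chunk
flatten_chunk <- function(chunk) {
  chunk %>%
    hoist(payload, user = "user", score = "score") %>%
    select(id, user, score)   # drops the leftover payload (including debug)
}

chunk_size <- 2
chunk_id <- ceiling(seq_len(nrow(events)) / chunk_size)

flat <- events %>%
  split(chunk_id) %>%     # list of small tibbles
  map(flatten_chunk) %>%  # rectangle each chunk on its own
  bind_rows()

print(flat)
```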
cat("2. STRATEGIC APPROACH:\n")
2. STRATEGIC APPROACH:
cat("   - Plan your rectangling strategy before starting\n")
   - Plan your rectangling strategy before starting
cat("   - Extract what you need, not everything\n")
   - Extract what you need, not everything
cat("   - Combine unnest operations efficiently\n")
   - Combine unnest operations efficiently
cat("   - Validate structure at each step\n\n")
   - Validate structure at each step
# Demonstrate efficient vs inefficient approaches
cat("EFFICIENT APPROACH:\n")
EFFICIENT APPROACH:
cat("data %>%\n")
data %>%
cat("  hoist(column, key1 = 'key1', key2 = c('nested', 'key2')) %>%\n")
  hoist(column, key1 = 'key1', key2 = c('nested', 'key2')) %>%
cat("  select(needed_columns) %>%\n")
  select(needed_columns) %>%
cat("  process_further()\n\n")
  process_further()
cat("LESS EFFICIENT APPROACH:\n")
LESS EFFICIENT APPROACH:
cat("data %>%\n")
data %>%
cat("  unnest_wider(column) %>%          # Expands everything\n")
  unnest_wider(column) %>%          # Expands everything
cat("  unnest_wider(nested_column) %>%   # More expansion\n")
  unnest_wider(nested_column) %>%   # More expansion
cat("  select(needed_columns) %>%        # Selection after expansion\n")
  select(needed_columns) %>%        # Selection after expansion
cat("  process_further()\n\n")
  process_further()
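The two approaches printed above can be made runnable on a small example. This is a sketch only: the `records` tibble and its field names are invented to match the placeholder names, and it shows that both routes give the same columns, not how much faster one is:

```r
library(dplyr)
library(tidyr)

# Illustrative data matching the placeholder names used above
records <- tibble(
  id = 1:2,
  payload = list(
    list(key1 = "a", nested = list(key2 = 10, other = "x"), extra = 1:3),
    list(key1 = "b", nested = list(key2 = 20, other = "y"), extra = 4:6)
  )
)

# Selective: extract only the two fields that are needed
selective <- records %>%
  hoist(payload, key1 = "key1", key2 = c("nested", "key2")) %>%
  select(id, key1, key2)

# Full expansion: every field becomes a column before selection
expanded <- records %>%
  unnest_wider(payload) %>%
  unnest_wider(nested) %>%
  select(id, key1, key2)

print(selective)
print(expanded)
```

With hoist(), fields such as `extra` and `other` are never turned into columns, which is where any savings on wide or deeply nested payloads would come from.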
cat("3. ERROR HANDLING:\n")
3. ERROR HANDLING:
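error_safe_unnest <- function(data, col) {
  # `col` is a column name given as a string; sym() and !! turn it into a column reference
  tryCatch({
    data %>% unnest_wider(!!sym(col))
  }, error = function(e) {
    # On failure, warn and return the input unchanged so the pipeline can continue
    warning("Unnesting failed for column ", col, ": ", conditionMessage(e), call. = FALSE)
    data
  })
}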

cat("   - Use tryCatch() for robust unnesting\n")
   - Use tryCatch() for robust unnesting
cat("   - Check for NULL values before unnesting\n")
   - Check for NULL values before unnesting
cat("   - Validate list structures before processing\n")
   - Validate list structures before processing
cat("   - Provide default values for missing elements\n\n")
   - Provide default values for missing elements
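The points above can be sketched on a small example with a missing field and a missing record. The `users` tibble and the `safe_unnest()` name are illustrative; `safe_unnest()` follows the same tryCatch() pattern as `error_safe_unnest()`, and purrr's pluck() supplies the default value:

```r
library(dplyr)
library(tidyr)
library(purrr)

users <- tibble(
  id = 1:3,
  profile = list(
    list(name = "Ana", email = "ana@example.com"),
    list(name = "Ben"),   # email is missing
    NULL                  # the whole record is missing
  )
)

cleaned <- users %>%
  mutate(
    # Check for NULL records before doing anything else
    has_profile = !map_lgl(profile, is.null),
    # pluck() returns .default when the element is absent or the record is NULL
    email = map_chr(profile, function(p) pluck(p, "email", .default = NA_character_))
  )

# Same tryCatch() pattern as error_safe_unnest(): warn, then return the input
safe_unnest <- function(data, col) {
  tryCatch(
    data %>% unnest_wider(!!rlang::sym(col)),
    error = function(e) {
      warning("Unnesting failed for column ", col, ": ", conditionMessage(e), call. = FALSE)
      data
    }
  )
}

# Asking for a column that does not exist falls back to the original data
unchanged <- suppressWarnings(safe_unnest(users, "no_such_column"))

print(cleaned)
```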
cat("4. VALIDATION HELPERS:\n")
4. VALIDATION HELPERS:
check_list_structure <- function(data, col) {
  col_data <- data[[col]]
  if (!is.list(col_data)) {
    return(list(is_list_column = FALSE))
  }
  # Guard against zero-length columns: col_data[[1]] would error on an empty list
  first <- if (length(col_data) > 0) col_data[[1]] else NULL
  list(
    is_list_column = TRUE,
    length = length(col_data),
    first_element_type = class(first),
    has_names = !is.null(names(first))
  )
}

cat("   - Create helper functions to inspect list structures\n")
   - Create helper functions to inspect list structures
cat("   - Use str() and glimpse() liberally during development\n")
   - Use str() and glimpse() liberally during development
cat("   - Test with small samples before processing large datasets\n")
   - Test with small samples before processing large datasets
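Inspecting only the first element can hide inconsistencies further down the column. As a complementary sketch, the hypothetical helper `element_shapes()` below summarises every element so that rows with missing fields stand out; the `sample_data` tibble is illustrative:

```r
library(tibble)
library(purrr)

sample_data <- tibble(
  id = 1:3,
  info = list(
    list(a = 1, b = "x"),
    list(a = 2),            # "b" is missing here
    list(a = 3, b = "z")
  )
)

# Hypothetical helper: one row per element with its field count and names
element_shapes <- function(col) {
  tibble(
    position = seq_along(col),
    n_fields = map_int(col, length),
    fields = map_chr(col, function(x) paste(names(x), collapse = ", "))
  )
}

shapes <- element_shapes(sample_data$info)

# Elements with fewer fields than the fullest element deserve a closer look
inconsistent <- shapes[shapes$n_fields != max(shapes$n_fields), ]

print(shapes)
print(inconsistent)
```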

Integration with Other tidyr Functions

# Complete data processing pipeline combining all tidyr functions
complex_workflow_data <- tibble(
  survey_id = 1:3,
  metadata = c("Study_A|2023|Online", "Study_B|2023|Phone", "Study_C|2024|Online"),
  responses = list(
    list(
      demographics = list(age = "25-34", income = "50-75K", education = "Bachelor"),
      ratings = list(q1 = 4, q2 = 5, q3 = 3, q4 = 4),
      comments = list(positive = c("Great service", "Easy to use"),
                     negative = c("Slow response"))
    ),
    list(
      demographics = list(age = "35-44", income = "75-100K", education = "Master"),
      ratings = list(q1 = 3, q2 = 4, q3 = 4, q4 = 5),
      comments = list(positive = c("Professional staff"),
                     negative = c("Website issues", "Long wait"))
    ),
    list(
      demographics = list(age = "45-54", income = "100K+", education = "PhD"),
      ratings = list(q1 = 5, q2 = 5, q3 = 4, q4 = 4),
      comments = list(positive = c("Excellent quality", "Quick delivery", "Good value"))
    )
  )
)

cat("Complex survey data requiring multiple tidyr operations:\n")
Complex survey data requiring multiple tidyr operations:
str(complex_workflow_data, max.level = 2)
tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
 $ survey_id: int [1:3] 1 2 3
 $ metadata : chr [1:3] "Study_A|2023|Online" "Study_B|2023|Phone" "Study_C|2024|Online"
 $ responses:List of 3
# Complete processing pipeline
processed_survey <- complex_workflow_data %>%
  # Step 1: Separate metadata
  separate(metadata, into = c("study", "year", "method"), sep = "\\|") %>%

  # Step 2: Extract demographics using hoist
  hoist(responses,
    age_group = c("demographics", "age"),
    income_bracket = c("demographics", "income"),
    education_level = c("demographics", "education"),
    ratings = "ratings",
    comments = "comments"
  ) %>%

  # Step 3: Process ratings (unnest_wider)
  unnest_wider(ratings) %>%
  rename_with(~ paste0("rating_", .x), .cols = c(q1, q2, q3, q4)) %>%

  # Step 4: Calculate rating metrics
  rowwise() %>%
  mutate(
    avg_rating = mean(c(rating_q1, rating_q2, rating_q3, rating_q4)),
    rating_variance = var(c(rating_q1, rating_q2, rating_q3, rating_q4))
  ) %>%
  ungroup() %>%

  # Step 5: Summarise the comments hoisted in Step 2
  unnest_wider(comments) %>%
  mutate(
    # length(NULL) is 0, so a respondent with no negative comments counts as 0
    positive_count = map_dbl(positive, length),
    negative_count = map_dbl(negative, length),
    sentiment_ratio = positive_count / (positive_count + negative_count)
  ) %>%
  select(-positive, -negative) %>%

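  # Step 6: Create final analysis metrics
  mutate(
    # Composite of rating (x25) and sentiment ratio (0-1, x25). Assuming a 1-5
    # rating scale it ranges from 25 to 150, so it is not a 0-100 percentage
    satisfaction_score = round((avg_rating * 25) + (sentiment_ratio * 25), 1),
    # var() of four non-missing ratings is never NA, so NA flags a missing rating
    response_completeness = ifelse(is.na(rating_variance), "Incomplete", "Complete")
  )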

cat("\nFinal processed survey data:\n")

Final processed survey data:
print(processed_survey)
# A tibble: 3 × 19
  survey_id study   year  method age_group income_bracket education_level
      <int> <chr>   <chr> <chr>  <chr>     <chr>          <chr>          
1         1 Study_A 2023  Online 25-34     50-75K         Bachelor       
2         2 Study_B 2023  Phone  35-44     75-100K        Master         
3         3 Study_C 2024  Online 45-54     100K+          PhD            
# ℹ 12 more variables: rating_q1 <dbl>, rating_q2 <dbl>, rating_q3 <dbl>,
#   rating_q4 <dbl>, responses <list>, avg_rating <dbl>, rating_variance <dbl>,
#   positive_count <dbl>, negative_count <dbl>, sentiment_ratio <dbl>,
#   satisfaction_score <dbl>, response_completeness <chr>
# Summary analysis
cat("\nSurvey analysis summary:\n")

Survey analysis summary:
processed_survey %>%
  group_by(study, method) %>%
  summarise(
    responses = n(),
    avg_satisfaction = round(mean(satisfaction_score, na.rm = TRUE), 1),
    avg_rating = round(mean(avg_rating, na.rm = TRUE), 2),
    avg_sentiment = round(mean(sentiment_ratio, na.rm = TRUE), 2),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_satisfaction)) %>%
  print()
# A tibble: 3 × 6
  study   method responses avg_satisfaction avg_rating avg_sentiment
  <chr>   <chr>      <int>            <dbl>      <dbl>         <dbl>
1 Study_C Online         1             138.        4.5          1   
2 Study_A Online         1             117.        4            0.67
3 Study_B Phone          1             108.        4            0.33
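Step 1 of the pipeline uses separate(), which still works but is marked as superseded in recent tidyr releases. Assuming tidyr 1.3.0 or later is installed, the same split can be sketched with separate_wider_delim(), which treats the delimiter as a fixed string so the pipe character needs no escaping. The `meta` tibble below repeats only the metadata column from the example:

```r
library(dplyr)
library(tidyr)

meta <- tibble(
  survey_id = 1:3,
  metadata = c("Study_A|2023|Online", "Study_B|2023|Phone", "Study_C|2024|Online")
)

meta_split <- meta %>%
  # Assumes tidyr >= 1.3.0; delim is matched literally, not as a regex
  separate_wider_delim(
    metadata,
    delim = "|",
    names = c("study", "year", "method")
  ) %>%
  mutate(year = as.integer(year))  # the split pieces start out as character

print(meta_split)
```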

Summary

Mastering nested data and rectangling enables you to:

  • 🗂️ Handle complex data structures: Transform JSON, API responses, and hierarchical data into analysis-ready formats
  • 🎯 Extract precisely what you need: Use hoist() for selective extraction instead of full unnesting
  • 📊 Process real-world data: Handle inconsistent structures and missing elements gracefully
  • 🔄 Integrate workflows: Combine rectangling with other tidyr functions for complete data pipelines
  • ⚡ Optimize performance: Choose the right unnesting strategy for your data size and structure

Key principles to remember:

  • Plan your approach: Understand the nested structure before choosing your rectangling strategy
  • Extract strategically: Use hoist() for specific elements, unnesting for comprehensive expansion
  • Handle variation: Real-world nested data is often inconsistent, so prepare for missing elements
  • Validate continuously: Check structure and results at each transformation step
  • Combine techniques: Rectangling works best when integrated with pivoting, separating, and filtering

These advanced data structure skills let you work with the modern, complex data sources that are increasingly common in data science! 🎯