Joining Data

Author

IND215

Published

September 22, 2025

Combining Data from Multiple Sources

In real-world data analysis, information is rarely contained in a single table. The ability to combine data from multiple sources is essential. dplyr provides a comprehensive set of join functions that mirror SQL operations while maintaining the tidyverse philosophy.

library(tidyverse)

# Create the sales dataset
set.seed(123)
sales <- tibble(
  order_id = paste0("ORD-", str_pad(1:500, 4, pad = "0")),
  customer_id = sample(paste0("CUST-", 1:100), 500, replace = TRUE),
  product_id = sample(paste0("PROD-", 1:30), 500, replace = TRUE),
  order_date = sample(seq(ymd("2024-01-01"), ymd("2024-12-31"), by = "day"), 500, replace = TRUE),
  quantity = sample(1:10, 500, replace = TRUE),
  unit_price = round(runif(500, 10, 200), 2)
) %>%
  mutate(
    total_amount = quantity * unit_price
  )

# Create customer information
customers <- tibble(
  customer_id = paste0("CUST-", 1:100),
  customer_name = paste("Customer", 1:100),
  customer_type = sample(c("Individual", "Small Business", "Enterprise"), 100, replace = TRUE, prob = c(0.6, 0.3, 0.1)),
  join_date = sample(seq(ymd("2020-01-01"), ymd("2023-12-31"), by = "day"), 100),
  credit_limit = sample(c(1000, 5000, 10000, 25000, 50000), 100, replace = TRUE),
  city = sample(c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix"), 100, replace = TRUE),
  state = c(rep("NY", 20), rep("CA", 20), rep("IL", 20), rep("TX", 20), rep("AZ", 20))
)

# Create product information
products <- tibble(
  product_id = paste0("PROD-", 1:30),
  product_name = paste("Product", LETTERS[1:30]),
  category = rep(c("Electronics", "Clothing", "Food", "Home", "Sports"), 6),
  cost = round(runif(30, 5, 100), 2),
  supplier_id = sample(paste0("SUPP-", 1:10), 30, replace = TRUE)
)

# Create some additional data for demonstrating different join scenarios
# Some products that don't have sales
new_products <- tibble(
  product_id = paste0("PROD-", 31:35),
  product_name = paste("New Product", 1:5),
  category = "New Items",
  cost = round(runif(5, 20, 150), 2),
  supplier_id = "SUPP-11"
)

# Some customers who haven't made purchases
inactive_customers <- tibble(
  customer_id = paste0("CUST-", 101:110),
  customer_name = paste("Inactive Customer", 1:10),
  customer_type = "Individual",
  join_date = ymd("2024-01-01"),
  credit_limit = 1000,
  city = "Boston",
  state = "MA"
)

cat("Created sample datasets:\n")

Created sample datasets:

cat("- Sales:", nrow(sales), "transactions\n")

- Sales: 500 transactions

cat("- Customers:", nrow(customers), "records\n")

- Customers: 100 records

cat("- Products:", nrow(products), "items\n")

- Products: 30 items

Types of Joins

dplyr provides several types of joins, each serving different purposes:

Inner Join

Keeps only rows that have matching keys in both tables:

# Join sales with customer information
sales_with_customers <- sales %>%
  inner_join(customers, by = "customer_id")

cat("Inner join result:\n")

Inner join result:

cat("- Original sales:", nrow(sales), "rows\n")

- Original sales: 500 rows

cat("- After inner join:", nrow(sales_with_customers), "rows\n")

- After inner join: 500 rows

cat("- Columns added:", setdiff(names(sales_with_customers), names(sales)), "\n\n")

- Columns added: customer_name customer_type join_date credit_limit city state

# Show a sample
sales_with_customers %>%
  select(order_id, customer_name, customer_type, order_date, total_amount) %>%
  slice_head(n = 5)

# A tibble: 5 × 5
  order_id customer_name customer_type  order_date total_amount
  <chr>    <chr>         <chr>          <date>            <dbl>
1 ORD-0001 Customer 31   Individual     2024-10-04        316. 
2 ORD-0002 Customer 79   Small Business 2024-04-08        724. 
3 ORD-0003 Customer 51   Individual     2024-05-21         72.0
4 ORD-0004 Customer 14   Individual     2024-08-16        809. 
5 ORD-0005 Customer 67   Small Business 2024-03-04        538.

Left Join

Keeps all rows from the left table, adding NA for non-matching right table rows:

# Combine all customers and products
all_products <- bind_rows(products, new_products)

# Left join to keep all sales, even if product info is missing
sales_complete <- sales %>%
  left_join(customers, by = "customer_id") %>%
  left_join(all_products, by = "product_id")

cat("Left join preserves all sales records:\n")

Left join preserves all sales records:

cat("- Sales records:", nrow(sales_complete), "\n")

- Sales records: 500

cat("- Products without info:", sum(is.na(sales_complete$product_name)), "\n\n")

- Products without info: 0

# Check for missing product information
sales_complete %>%
  filter(is.na(product_name)) %>%
  select(order_id, product_id, product_name, category) %>%
  slice_head(n = 5)

# A tibble: 0 × 4
# ℹ 4 variables: order_id <chr>, product_id <chr>, product_name <chr>,
#   category <chr>

Right Join

Keeps all rows from the right table:

# Find all products and their sales (if any)
product_sales <- sales %>%
  group_by(product_id) %>%
  summarise(
    total_sold = sum(quantity),
    revenue = sum(total_amount),
    .groups = "drop"
  ) %>%
  right_join(all_products, by = "product_id")

cat("Right join shows all products, including those without sales:\n")

Right join shows all products, including those without sales:

product_sales %>%
  filter(is.na(total_sold)) %>%
  select(product_id, product_name, category, total_sold, revenue)

# A tibble: 5 × 5
  product_id product_name  category  total_sold revenue
  <chr>      <chr>         <chr>          <int>   <dbl>
1 PROD-31    New Product 1 New Items         NA      NA
2 PROD-32    New Product 2 New Items         NA      NA
3 PROD-33    New Product 3 New Items         NA      NA
4 PROD-34    New Product 4 New Items         NA      NA
5 PROD-35    New Product 5 New Items         NA      NA

Full Join

Keeps all rows from both tables:

# Combine all customers (including inactive ones)
all_customers <- bind_rows(customers, inactive_customers)

# Full join to see all customers and all orders
customer_order_summary <- sales %>%
  group_by(customer_id) %>%
  summarise(
    order_count = n(),
    total_spent = sum(total_amount),
    .groups = "drop"
  ) %>%
  full_join(all_customers, by = "customer_id")

cat("Full join shows both customers with orders and those without:\n")

Full join shows both customers with orders and those without:

cat("- Total customers:", nrow(customer_order_summary), "\n")

- Total customers: 110

cat("- Customers with orders:", sum(!is.na(customer_order_summary$order_count)), "\n")

- Customers with orders: 99

cat("- Customers without orders:", sum(is.na(customer_order_summary$order_count)), "\n\n")

- Customers without orders: 11

# Show customers without orders
customer_order_summary %>%
  filter(is.na(order_count)) %>%
  select(customer_id, customer_name, join_date, order_count) %>%
  slice_head(n = 5)

# A tibble: 5 × 4
  customer_id customer_name       join_date  order_count
  <chr>       <chr>               <date>           <int>
1 CUST-65     Customer 65         2023-04-28          NA
2 CUST-101    Inactive Customer 1 2024-01-01          NA
3 CUST-102    Inactive Customer 2 2024-01-01          NA
4 CUST-103    Inactive Customer 3 2024-01-01          NA
5 CUST-104    Inactive Customer 4 2024-01-01          NA

Filtering Joins

These joins filter rows based on matches, without adding columns:

Semi Join

Keeps rows in the first table that have a match in the second table:

# Find customers who bought electronics
electronics_products <- products %>%
  filter(category == "Electronics")

electronics_buyers <- sales %>%
  semi_join(electronics_products, by = "product_id") %>%
  distinct(customer_id) %>%
  semi_join(customers, by = "customer_id")

cat("Semi join to find electronics buyers:\n")

Semi join to find electronics buyers:

cat("- Total unique customers:", n_distinct(sales$customer_id), "\n")

- Total unique customers: 99

cat("- Electronics buyers:", nrow(electronics_buyers), "\n")

- Electronics buyers: 57

# Get their details
electronics_buyers %>%
  left_join(customers, by = "customer_id") %>%
  select(customer_id, customer_name, customer_type) %>%
  slice_head(n = 5)

# A tibble: 5 × 3
  customer_id customer_name customer_type 
  <chr>       <chr>         <chr>         
1 CUST-14     Customer 14   Individual    
2 CUST-91     Customer 91   Enterprise    
3 CUST-72     Customer 72   Small Business
4 CUST-7      Customer 7    Small Business
5 CUST-78     Customer 78   Individual

Anti Join

Keeps rows in the first table that do NOT have a match in the second table:

# Find customers who have never bought electronics
non_electronics_buyers <- customers %>%
  anti_join(electronics_buyers, by = "customer_id")

cat("Anti join to find non-electronics buyers:\n")

Anti join to find non-electronics buyers:

cat("- Customers who never bought electronics:", nrow(non_electronics_buyers), "\n\n")

- Customers who never bought electronics: 43

# Find products that have never been sold
unsold_products <- all_products %>%
  anti_join(sales, by = "product_id")

cat("Products with no sales:\n")

Products with no sales:

unsold_products %>%
  select(product_id, product_name, category)

# A tibble: 5 × 3
  product_id product_name  category 
  <chr>      <chr>         <chr>    
1 PROD-31    New Product 1 New Items
2 PROD-32    New Product 2 New Items
3 PROD-33    New Product 3 New Items
4 PROD-34    New Product 4 New Items
5 PROD-35    New Product 5 New Items

Joining on Multiple Keys

Sometimes you need to join on multiple columns:

# Create order details with multiple keys
order_details <- tibble(
  order_id = sample(sales$order_id, 300),
  product_id = sample(products$product_id, 300, replace = TRUE),
  discount_pct = sample(c(0, 5, 10, 15, 20), 300, replace = TRUE, prob = c(0.5, 0.2, 0.15, 0.1, 0.05))
)

# Join using both order_id and product_id
sales_with_discounts <- sales %>%
  left_join(order_details, by = c("order_id", "product_id")) %>%
  mutate(
    discount_amount = if_else(is.na(discount_pct), 0, total_amount * discount_pct / 100),
    final_amount = total_amount - discount_amount
  )

cat("Joining on multiple keys:\n")

Joining on multiple keys:

cat("- Orders with discounts:", sum(!is.na(sales_with_discounts$discount_pct)), "\n")

- Orders with discounts: 13

cat("- Total discount amount: $", round(sum(sales_with_discounts$discount_amount), 2), "\n")

- Total discount amount: $ 834.29

Handling Key Conflicts

When column names don’t match, specify the relationship:

# Create supplier data with different column name
suppliers <- tibble(
  supplier_code = paste0("SUPP-", 1:11),  # Different name than supplier_id
  supplier_name = paste("Supplier", LETTERS[1:11]),
  country = sample(c("USA", "Canada", "Mexico"), 11, replace = TRUE),
  rating = round(runif(11, 3, 5), 1)
)

# Join with name mapping
products_with_suppliers <- all_products %>%
  left_join(suppliers, by = c("supplier_id" = "supplier_code"))

cat("Joining with different column names:\n")

Joining with different column names:

products_with_suppliers %>%
  select(product_name, category, supplier_id, supplier_name, country, rating) %>%
  slice_head(n = 5)

# A tibble: 5 × 6
  product_name category    supplier_id supplier_name country rating
  <chr>        <chr>       <chr>       <chr>         <chr>    <dbl>
1 Product A    Electronics SUPP-4      Supplier D    Mexico     3.2
2 Product B    Clothing    SUPP-10     Supplier J    USA        4.7
3 Product C    Food        SUPP-6      Supplier F    Canada     4.3
4 Product D    Home        SUPP-5      Supplier E    Mexico     3.2
5 Product E    Sports      SUPP-2      Supplier B    Canada     3.3

Dealing with Duplicate Keys

Understanding how joins handle duplicates is crucial:

# Create a scenario with duplicate keys
# Multiple promotions for the same product
promotions <- tibble(
  product_id = rep(paste0("PROD-", 1:5), each = 2),
  promo_name = c("Summer Sale", "Back to School", "Black Friday", "Cyber Monday", 
                 "Holiday Special", "New Year", "Spring Sale", "Flash Deal", 
                 "Weekend Special", "Member Exclusive"),
  discount = c(10, 15, 20, 25, 30, 10, 15, 20, 25, 30)
)

# Join will create multiple rows for each match
sales_with_promos <- sales %>%
  filter(product_id %in% paste0("PROD-", 1:5)) %>%
  inner_join(promotions, by = "product_id")

cat("Effect of duplicate keys in joins:\n")

Effect of duplicate keys in joins:

cat("- Sales records for PROD-1 to PROD-5:", 
    sum(sales$product_id %in% paste0("PROD-", 1:5)), "\n")

- Sales records for PROD-1 to PROD-5: 88

cat("- Records after join with promotions:", nrow(sales_with_promos), "\n")

- Records after join with promotions: 176

cat("- Each sale is duplicated for each matching promotion\n\n")

- Each sale is duplicated for each matching promotion

# Show example
sales_with_promos %>%
  filter(order_id == first(order_id)) %>%
  select(order_id, product_id, promo_name, discount)

# A tibble: 2 × 4
  order_id product_id promo_name       discount
  <chr>    <chr>      <chr>               <dbl>
1 ORD-0006 PROD-5     Weekend Special        25
2 ORD-0006 PROD-5     Member Exclusive       30

Complex Multi-Table Joins

Real-world analysis often requires joining multiple tables:

# Create a complete sales analysis dataset
sales_analysis <- sales %>%
  # Add customer information
  left_join(customers, by = "customer_id") %>%
  # Add product information
  left_join(products, by = "product_id") %>%
  # Add supplier information
  left_join(suppliers, by = c("supplier_id" = "supplier_code")) %>%
  # Calculate additional metrics
  mutate(
    profit_margin = (unit_price - cost) / unit_price,
    profit = total_amount * profit_margin,
    order_month = floor_date(order_date, "month")
  )

cat("Complete sales analysis dataset:\n")

Complete sales analysis dataset:

cat("- Total columns:", ncol(sales_analysis), "\n")

- Total columns: 23

cat("- Column categories:\n")

- Column categories:

cat("  - Order info:", sum(str_detect(names(sales_analysis), "order")), "columns\n")

  - Order info: 3 columns

cat("  - Customer info:", sum(str_detect(names(sales_analysis), "customer")), "columns\n")

  - Customer info: 3 columns

cat("  - Product info:", sum(str_detect(names(sales_analysis), "product")), "columns\n")

  - Product info: 2 columns

cat("  - Financial metrics:", sum(names(sales_analysis) %in% c("profit", "profit_margin", "total_amount")), "columns\n")

  - Financial metrics: 3 columns

# Sample the complete dataset
sales_analysis %>%
  select(order_id, customer_name, product_name, supplier_name, quantity, profit) %>%
  slice_sample(n = 5)

# A tibble: 5 × 6
  order_id customer_name product_name supplier_name quantity profit
  <chr>    <chr>         <chr>        <chr>            <int>  <dbl>
1 ORD-0251 Customer 17   Product Y    Supplier A           5   482.
2 ORD-0369 Customer 74   Product Q    Supplier H           9   560.
3 ORD-0480 Customer 43   Product NA   Supplier B           4   133.
4 ORD-0039 Customer 60   Product D    Supplier E           1   114.
5 ORD-0290 Customer 2    Product T    Supplier H           8   922.

Performance Considerations

When working with large datasets, join order matters:

# Example of efficient join ordering
# Start with the smallest dataset and join progressively

# Inefficient: Starting with large table
start_time1 <- Sys.time()
result1 <- sales %>%
  left_join(sales_analysis, by = "order_id") %>%
  nrow()
time1 <- Sys.time() - start_time1

# Efficient: Filter first, then join
start_time2 <- Sys.time()
result2 <- sales %>%
  filter(order_date >= "2024-06-01") %>%
  left_join(customers, by = "customer_id") %>%
  left_join(products, by = "product_id") %>%
  nrow()
time2 <- Sys.time() - start_time2

cat("Join performance tips:\n")

Join performance tips:

cat("1. Filter data before joining when possible\n")

1. Filter data before joining when possible

cat("2. Join smaller tables first\n")

2. Join smaller tables first

cat("3. Use appropriate join types (inner join is fastest)\n")

3. Use appropriate join types (inner join is fastest)

cat("4. Index key columns in databases\n")

4. Index key columns in databases

Common Join Patterns

Customer Lifetime Value Analysis

# Calculate customer lifetime value
customer_ltv <- sales %>%
  group_by(customer_id) %>%
  summarise(
    first_order = min(order_date),
    last_order = max(order_date),
    total_orders = n(),
    total_spent = sum(total_amount),
    avg_order_value = mean(total_amount),
    .groups = "drop"
  ) %>%
  inner_join(customers, by = "customer_id") %>%
  mutate(
    customer_age_days = as.numeric(today() - join_date),
    days_active = as.numeric(last_order - first_order) + 1,
    order_frequency = total_orders / days_active * 30  # Orders per 30 days
  ) %>%
  arrange(desc(total_spent))

cat("Top 5 customers by lifetime value:\n")

Top 5 customers by lifetime value:

customer_ltv %>%
  select(customer_name, customer_type, total_orders, total_spent, avg_order_value, order_frequency) %>%
  slice_head(n = 5)

# A tibble: 5 × 6
  customer_name customer_type  total_orders total_spent avg_order_value
  <chr>         <chr>                 <int>       <dbl>           <dbl>
1 Customer 89   Small Business            8       7766.            971.
2 Customer 81   Small Business            7       7522.           1075.
3 Customer 63   Small Business            8       6889.            861.
4 Customer 94   Small Business           11       6114.            556.
5 Customer 31   Individual                8       6045.            756.
# ℹ 1 more variable: order_frequency <dbl>

Product Affinity Analysis

# Find products frequently bought together
product_pairs <- sales %>%
  inner_join(sales, by = "customer_id", suffix = c("_1", "_2")) %>%
  filter(product_id_1 < product_id_2) %>%  # Avoid duplicates and self-joins
  count(product_id_1, product_id_2, name = "times_bought_together") %>%
  filter(times_bought_together >= 5) %>%
  left_join(products, by = c("product_id_1" = "product_id")) %>%
  rename(product_name_1 = product_name) %>%
  left_join(products, by = c("product_id_2" = "product_id")) %>%
  rename(product_name_2 = product_name) %>%
  arrange(desc(times_bought_together))

cat("Products frequently bought together:\n")

Products frequently bought together:

product_pairs %>%
  select(product_name_1, product_name_2, times_bought_together) %>%
  slice_head(n = 10)

# A tibble: 10 × 3
   product_name_1 product_name_2 times_bought_together
   <chr>          <chr>                          <int>
 1 Product W      Product NA                        10
 2 Product W      Product C                         10
 3 Product O      Product Z                          9
 4 Product C      Product H                          9
 5 Product J      Product H                          8
 6 Product K      Product W                          8
 7 Product K      Product H                          8
 8 Product W      Product H                          8
 9 Product H      Product I                          8
10 Product A      Product J                          7

Practice Exercises

Sales Dashboard: Create a comprehensive sales report joining all tables
Customer Segmentation: Use joins to categorize customers based on purchase behavior
Inventory Analysis: Join sales and product data to identify slow-moving items
Geographic Analysis: Join customer location data with sales to analyze regional patterns

Summary

Mastering joins is essential for real-world data analysis. Key points:

Choose the right join type for your analysis needs
Understand how joins handle missing values and duplicates
Pay attention to column names and key relationships
Consider performance when joining large datasets
Filter before joining when possible
Use filtering joins (semi/anti) for efficient subsetting
Always verify join results match expectations

Next, explore Advanced Techniques for more sophisticated dplyr operations.