In real-world data analysis, information is rarely contained in a single table. The ability to combine data from multiple sources is essential. dplyr provides a comprehensive set of join functions that mirror SQL operations while maintaining the tidyverse philosophy.
Keeps all rows from the left table, adding NA for non-matching right table rows:
# Combine all customers and productsall_products <-bind_rows(products, new_products)# Left join to keep all sales, even if product info is missingsales_complete <- sales %>%left_join(customers, by ="customer_id") %>%left_join(all_products, by ="product_id")cat("Left join preserves all sales records:\n")
# Find all products and their sales (if any)product_sales <- sales %>%group_by(product_id) %>%summarise(total_sold =sum(quantity),revenue =sum(total_amount),.groups ="drop" ) %>%right_join(all_products, by ="product_id")cat("Right join shows all products, including those without sales:\n")
Right join shows all products, including those without sales:
# A tibble: 5 × 5
product_id product_name category total_sold revenue
<chr> <chr> <chr> <int> <dbl>
1 PROD-31 New Product 1 New Items NA NA
2 PROD-32 New Product 2 New Items NA NA
3 PROD-33 New Product 3 New Items NA NA
4 PROD-34 New Product 4 New Items NA NA
5 PROD-35 New Product 5 New Items NA NA
Full Join
Keeps all rows from both tables:
# Combine all customers (including inactive ones)all_customers <-bind_rows(customers, inactive_customers)# Full join to see all customers and all orderscustomer_order_summary <- sales %>%group_by(customer_id) %>%summarise(order_count =n(),total_spent =sum(total_amount),.groups ="drop" ) %>%full_join(all_customers, by ="customer_id")cat("Full join shows both customers with orders and those without:\n")
Full join shows both customers with orders and those without:
cat("- Total customers:", nrow(customer_order_summary), "\n")
- Total customers: 110
cat("- Customers with orders:", sum(!is.na(customer_order_summary$order_count)), "\n")
- Customers with orders: 99
cat("- Customers without orders:", sum(is.na(customer_order_summary$order_count)), "\n\n")
- Customers without orders: 11
# Show customers without orderscustomer_order_summary %>%filter(is.na(order_count)) %>%select(customer_id, customer_name, join_date, order_count) %>%slice_head(n =5)
# A tibble: 5 × 4
customer_id customer_name join_date order_count
<chr> <chr> <date> <int>
1 CUST-65 Customer 65 2023-04-28 NA
2 CUST-101 Inactive Customer 1 2024-01-01 NA
3 CUST-102 Inactive Customer 2 2024-01-01 NA
4 CUST-103 Inactive Customer 3 2024-01-01 NA
5 CUST-104 Inactive Customer 4 2024-01-01 NA
Filtering Joins
These joins filter rows based on matches, without adding columns:
Semi Join
Keeps rows in the first table that have a match in the second table:
# Find customers who bought electronicselectronics_products <- products %>%filter(category =="Electronics")electronics_buyers <- sales %>%semi_join(electronics_products, by ="product_id") %>%distinct(customer_id) %>%semi_join(customers, by ="customer_id")cat("Semi join to find electronics buyers:\n")
Semi join to find electronics buyers:
cat("- Total unique customers:", n_distinct(sales$customer_id), "\n")
# Get their detailselectronics_buyers %>%left_join(customers, by ="customer_id") %>%select(customer_id, customer_name, customer_type) %>%slice_head(n =5)
# A tibble: 5 × 3
customer_id customer_name customer_type
<chr> <chr> <chr>
1 CUST-14 Customer 14 Individual
2 CUST-91 Customer 91 Enterprise
3 CUST-72 Customer 72 Small Business
4 CUST-7 Customer 7 Small Business
5 CUST-78 Customer 78 Individual
Anti Join
Keeps rows in the first table that do NOT have a match in the second table:
# Find customers who have never bought electronicsnon_electronics_buyers <- customers %>%anti_join(electronics_buyers, by ="customer_id")cat("Anti join to find non-electronics buyers:\n")
Anti join to find non-electronics buyers:
cat("- Customers who never bought electronics:", nrow(non_electronics_buyers), "\n\n")
- Customers who never bought electronics: 43
# Find products that have never been soldunsold_products <- all_products %>%anti_join(sales, by ="product_id")cat("Products with no sales:\n")
# A tibble: 5 × 3
product_id product_name category
<chr> <chr> <chr>
1 PROD-31 New Product 1 New Items
2 PROD-32 New Product 2 New Items
3 PROD-33 New Product 3 New Items
4 PROD-34 New Product 4 New Items
5 PROD-35 New Product 5 New Items
Joining on Multiple Keys
Sometimes you need to join on multiple columns:
# Create order details with multiple keysorder_details <-tibble(order_id =sample(sales$order_id, 300),product_id =sample(products$product_id, 300, replace =TRUE),discount_pct =sample(c(0, 5, 10, 15, 20), 300, replace =TRUE, prob =c(0.5, 0.2, 0.15, 0.1, 0.05)))# Join using both order_id and product_idsales_with_discounts <- sales %>%left_join(order_details, by =c("order_id", "product_id")) %>%mutate(discount_amount =if_else(is.na(discount_pct), 0, total_amount * discount_pct /100),final_amount = total_amount - discount_amount )cat("Joining on multiple keys:\n")
Joining on multiple keys:
cat("- Orders with discounts:", sum(!is.na(sales_with_discounts$discount_pct)), "\n")
- Orders with discounts: 13
cat("- Total discount amount: $", round(sum(sales_with_discounts$discount_amount), 2), "\n")
- Total discount amount: $ 834.29
Handling Key Conflicts
When column names don’t match, specify the relationship:
# Create supplier data with different column namesuppliers <-tibble(supplier_code =paste0("SUPP-", 1:11), # Different name than supplier_idsupplier_name =paste("Supplier", LETTERS[1:11]),country =sample(c("USA", "Canada", "Mexico"), 11, replace =TRUE),rating =round(runif(11, 3, 5), 1))# Join with name mappingproducts_with_suppliers <- all_products %>%left_join(suppliers, by =c("supplier_id"="supplier_code"))cat("Joining with different column names:\n")
# A tibble: 5 × 6
product_name category supplier_id supplier_name country rating
<chr> <chr> <chr> <chr> <chr> <dbl>
1 Product A Electronics SUPP-4 Supplier D Mexico 3.2
2 Product B Clothing SUPP-10 Supplier J USA 4.7
3 Product C Food SUPP-6 Supplier F Canada 4.3
4 Product D Home SUPP-5 Supplier E Mexico 3.2
5 Product E Sports SUPP-2 Supplier B Canada 3.3
Dealing with Duplicate Keys
Understanding how joins handle duplicates is crucial:
# Create a scenario with duplicate keys# Multiple promotions for the same productpromotions <-tibble(product_id =rep(paste0("PROD-", 1:5), each =2),promo_name =c("Summer Sale", "Back to School", "Black Friday", "Cyber Monday", "Holiday Special", "New Year", "Spring Sale", "Flash Deal", "Weekend Special", "Member Exclusive"),discount =c(10, 15, 20, 25, 30, 10, 15, 20, 25, 30))# Join will create multiple rows for each matchsales_with_promos <- sales %>%filter(product_id %in%paste0("PROD-", 1:5)) %>%inner_join(promotions, by ="product_id")cat("Effect of duplicate keys in joins:\n")
Effect of duplicate keys in joins:
cat("- Sales records for PROD-1 to PROD-5:", sum(sales$product_id %in%paste0("PROD-", 1:5)), "\n")
- Sales records for PROD-1 to PROD-5: 88
cat("- Records after join with promotions:", nrow(sales_with_promos), "\n")
- Records after join with promotions: 176
cat("- Each sale is duplicated for each matching promotion\n\n")
- Each sale is duplicated for each matching promotion
# Show examplesales_with_promos %>%filter(order_id ==first(order_id)) %>%select(order_id, product_id, promo_name, discount)
# A tibble: 2 × 4
order_id product_id promo_name discount
<chr> <chr> <chr> <dbl>
1 ORD-0006 PROD-5 Weekend Special 25
2 ORD-0006 PROD-5 Member Exclusive 30
Complex Multi-Table Joins
Real-world analysis often requires joining multiple tables:
# A tibble: 5 × 6
order_id customer_name product_name supplier_name quantity profit
<chr> <chr> <chr> <chr> <int> <dbl>
1 ORD-0251 Customer 17 Product Y Supplier A 5 482.
2 ORD-0369 Customer 74 Product Q Supplier H 9 560.
3 ORD-0480 Customer 43 Product NA Supplier B 4 133.
4 ORD-0039 Customer 60 Product D Supplier E 1 114.
5 ORD-0290 Customer 2 Product T Supplier H 8 922.
Performance Considerations
When working with large datasets, join order matters:
# Example of efficient join ordering# Start with the smallest dataset and join progressively# Inefficient: Starting with large tablestart_time1 <-Sys.time()result1 <- sales %>%left_join(sales_analysis, by ="order_id") %>%nrow()time1 <-Sys.time() - start_time1# Efficient: Filter first, then joinstart_time2 <-Sys.time()result2 <- sales %>%filter(order_date >="2024-06-01") %>%left_join(customers, by ="customer_id") %>%left_join(products, by ="product_id") %>%nrow()time2 <-Sys.time() - start_time2cat("Join performance tips:\n")
Join performance tips:
cat("1. Filter data before joining when possible\n")
1. Filter data before joining when possible
cat("2. Join smaller tables first\n")
2. Join smaller tables first
cat("3. Use appropriate join types (inner join is fastest)\n")
3. Use appropriate join types (inner join is fastest)
# A tibble: 10 × 3
product_name_1 product_name_2 times_bought_together
<chr> <chr> <int>
1 Product W Product NA 10
2 Product W Product C 10
3 Product O Product Z 9
4 Product C Product H 9
5 Product J Product H 8
6 Product K Product W 8
7 Product K Product H 8
8 Product W Product H 8
9 Product H Product I 8
10 Product A Product J 7
Practice Exercises
Sales Dashboard: Create a comprehensive sales report joining all tables
Customer Segmentation: Use joins to categorize customers based on purchase behavior
Inventory Analysis: Join sales and product data to identify slow-moving items
Geographic Analysis: Join customer location data with sales to analyze regional patterns
Summary
Mastering joins is essential for real-world data analysis. Key points:
Choose the right join type for your analysis needs
Understand how joins handle missing values and duplicates
Pay attention to column names and key relationships
Consider performance when joining large datasets
Filter before joining when possible
Use filtering joins (semi/anti) for efficient subsetting
---title: "Joining Data"author: "IND215"date: todayformat: html: toc: true toc-depth: 3 code-fold: false code-tools: true---## Combining Data from Multiple SourcesIn real-world data analysis, information is rarely contained in a single table. The ability to combine data from multiple sources is essential. dplyr provides a comprehensive set of join functions that mirror SQL operations while maintaining the tidyverse philosophy.```{r}#| label: setup#| message: falselibrary(tidyverse)# Create the sales datasetset.seed(123)sales <-tibble(order_id =paste0("ORD-", str_pad(1:500, 4, pad ="0")),customer_id =sample(paste0("CUST-", 1:100), 500, replace =TRUE),product_id =sample(paste0("PROD-", 1:30), 500, replace =TRUE),order_date =sample(seq(ymd("2024-01-01"), ymd("2024-12-31"), by ="day"), 500, replace =TRUE),quantity =sample(1:10, 500, replace =TRUE),unit_price =round(runif(500, 10, 200), 2)) %>%mutate(total_amount = quantity * unit_price )# Create customer informationcustomers <-tibble(customer_id =paste0("CUST-", 1:100),customer_name =paste("Customer", 1:100),customer_type =sample(c("Individual", "Small Business", "Enterprise"), 100, replace =TRUE, prob =c(0.6, 0.3, 0.1)),join_date =sample(seq(ymd("2020-01-01"), ymd("2023-12-31"), by ="day"), 100),credit_limit =sample(c(1000, 5000, 10000, 25000, 50000), 100, replace =TRUE),city =sample(c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix"), 100, replace =TRUE),state =c(rep("NY", 20), rep("CA", 20), rep("IL", 20), rep("TX", 20), rep("AZ", 20)))# Create product informationproducts <-tibble(product_id =paste0("PROD-", 1:30),product_name =paste("Product", LETTERS[1:30]),category =rep(c("Electronics", "Clothing", "Food", "Home", "Sports"), 6),cost =round(runif(30, 5, 100), 2),supplier_id =sample(paste0("SUPP-", 1:10), 30, replace =TRUE))# Create some additional data for demonstrating different join scenarios# Some products that don't have salesnew_products <-tibble(product_id =paste0("PROD-", 31:35),product_name =paste("New Product", 1:5),category ="New Items",cost =round(runif(5, 20, 150), 2),supplier_id ="SUPP-11")# Some customers who haven't made purchasesinactive_customers <-tibble(customer_id =paste0("CUST-", 101:110),customer_name =paste("Inactive Customer", 1:10),customer_type ="Individual",join_date =ymd("2024-01-01"),credit_limit =1000,city ="Boston",state ="MA")cat("Created sample datasets:\n")cat("- Sales:", nrow(sales), "transactions\n")cat("- Customers:", nrow(customers), "records\n")cat("- Products:", nrow(products), "items\n")```## Types of Joinsdplyr provides several types of joins, each serving different purposes:### Inner JoinKeeps only rows that have matching keys in both tables:```{r}#| label: inner-join# Join sales with customer informationsales_with_customers <- sales %>%inner_join(customers, by ="customer_id")cat("Inner join result:\n")cat("- Original sales:", nrow(sales), "rows\n")cat("- After inner join:", nrow(sales_with_customers), "rows\n")cat("- Columns added:", setdiff(names(sales_with_customers), names(sales)), "\n\n")# Show a samplesales_with_customers %>%select(order_id, customer_name, customer_type, order_date, total_amount) %>%slice_head(n =5)```### Left JoinKeeps all rows from the left table, adding NA for non-matching right table rows:```{r}#| label: left-join# Combine all customers and productsall_products <-bind_rows(products, new_products)# Left join to keep all sales, even if product info is missingsales_complete <- sales %>%left_join(customers, by ="customer_id") %>%left_join(all_products, by ="product_id")cat("Left join preserves all sales records:\n")cat("- Sales records:", nrow(sales_complete), "\n")cat("- Products without info:", sum(is.na(sales_complete$product_name)), "\n\n")# Check for missing product informationsales_complete %>%filter(is.na(product_name)) %>%select(order_id, product_id, product_name, category) %>%slice_head(n =5)```### Right JoinKeeps all rows from the right table:```{r}#| label: right-join# Find all products and their sales (if any)product_sales <- sales %>%group_by(product_id) %>%summarise(total_sold =sum(quantity),revenue =sum(total_amount),.groups ="drop" ) %>%right_join(all_products, by ="product_id")cat("Right join shows all products, including those without sales:\n")product_sales %>%filter(is.na(total_sold)) %>%select(product_id, product_name, category, total_sold, revenue)```### Full JoinKeeps all rows from both tables:```{r}#| label: full-join# Combine all customers (including inactive ones)all_customers <-bind_rows(customers, inactive_customers)# Full join to see all customers and all orderscustomer_order_summary <- sales %>%group_by(customer_id) %>%summarise(order_count =n(),total_spent =sum(total_amount),.groups ="drop" ) %>%full_join(all_customers, by ="customer_id")cat("Full join shows both customers with orders and those without:\n")cat("- Total customers:", nrow(customer_order_summary), "\n")cat("- Customers with orders:", sum(!is.na(customer_order_summary$order_count)), "\n")cat("- Customers without orders:", sum(is.na(customer_order_summary$order_count)), "\n\n")# Show customers without orderscustomer_order_summary %>%filter(is.na(order_count)) %>%select(customer_id, customer_name, join_date, order_count) %>%slice_head(n =5)```## Filtering JoinsThese joins filter rows based on matches, without adding columns:### Semi JoinKeeps rows in the first table that have a match in the second table:```{r}#| label: semi-join# Find customers who bought electronicselectronics_products <- products %>%filter(category =="Electronics")electronics_buyers <- sales %>%semi_join(electronics_products, by ="product_id") %>%distinct(customer_id) %>%semi_join(customers, by ="customer_id")cat("Semi join to find electronics buyers:\n")cat("- Total unique customers:", n_distinct(sales$customer_id), "\n")cat("- Electronics buyers:", nrow(electronics_buyers), "\n")# Get their detailselectronics_buyers %>%left_join(customers, by ="customer_id") %>%select(customer_id, customer_name, customer_type) %>%slice_head(n =5)```### Anti JoinKeeps rows in the first table that do NOT have a match in the second table:```{r}#| label: anti-join# Find customers who have never bought electronicsnon_electronics_buyers <- customers %>%anti_join(electronics_buyers, by ="customer_id")cat("Anti join to find non-electronics buyers:\n")cat("- Customers who never bought electronics:", nrow(non_electronics_buyers), "\n\n")# Find products that have never been soldunsold_products <- all_products %>%anti_join(sales, by ="product_id")cat("Products with no sales:\n")unsold_products %>%select(product_id, product_name, category)```## Joining on Multiple KeysSometimes you need to join on multiple columns:```{r}#| label: multiple-keys# Create order details with multiple keysorder_details <-tibble(order_id =sample(sales$order_id, 300),product_id =sample(products$product_id, 300, replace =TRUE),discount_pct =sample(c(0, 5, 10, 15, 20), 300, replace =TRUE, prob =c(0.5, 0.2, 0.15, 0.1, 0.05)))# Join using both order_id and product_idsales_with_discounts <- sales %>%left_join(order_details, by =c("order_id", "product_id")) %>%mutate(discount_amount =if_else(is.na(discount_pct), 0, total_amount * discount_pct /100),final_amount = total_amount - discount_amount )cat("Joining on multiple keys:\n")cat("- Orders with discounts:", sum(!is.na(sales_with_discounts$discount_pct)), "\n")cat("- Total discount amount: $", round(sum(sales_with_discounts$discount_amount), 2), "\n")```## Handling Key ConflictsWhen column names don't match, specify the relationship:```{r}#| label: key-conflicts# Create supplier data with different column namesuppliers <-tibble(supplier_code =paste0("SUPP-", 1:11), # Different name than supplier_idsupplier_name =paste("Supplier", LETTERS[1:11]),country =sample(c("USA", "Canada", "Mexico"), 11, replace =TRUE),rating =round(runif(11, 3, 5), 1))# Join with name mappingproducts_with_suppliers <- all_products %>%left_join(suppliers, by =c("supplier_id"="supplier_code"))cat("Joining with different column names:\n")products_with_suppliers %>%select(product_name, category, supplier_id, supplier_name, country, rating) %>%slice_head(n =5)```## Dealing with Duplicate KeysUnderstanding how joins handle duplicates is crucial:```{r}#| label: duplicate-keys# Create a scenario with duplicate keys# Multiple promotions for the same productpromotions <-tibble(product_id =rep(paste0("PROD-", 1:5), each =2),promo_name =c("Summer Sale", "Back to School", "Black Friday", "Cyber Monday", "Holiday Special", "New Year", "Spring Sale", "Flash Deal", "Weekend Special", "Member Exclusive"),discount =c(10, 15, 20, 25, 30, 10, 15, 20, 25, 30))# Join will create multiple rows for each matchsales_with_promos <- sales %>%filter(product_id %in%paste0("PROD-", 1:5)) %>%inner_join(promotions, by ="product_id")cat("Effect of duplicate keys in joins:\n")cat("- Sales records for PROD-1 to PROD-5:", sum(sales$product_id %in%paste0("PROD-", 1:5)), "\n")cat("- Records after join with promotions:", nrow(sales_with_promos), "\n")cat("- Each sale is duplicated for each matching promotion\n\n")# Show examplesales_with_promos %>%filter(order_id ==first(order_id)) %>%select(order_id, product_id, promo_name, discount)```## Complex Multi-Table JoinsReal-world analysis often requires joining multiple tables:```{r}#| label: complex-joins# Create a complete sales analysis datasetsales_analysis <- sales %>%# Add customer informationleft_join(customers, by ="customer_id") %>%# Add product informationleft_join(products, by ="product_id") %>%# Add supplier informationleft_join(suppliers, by =c("supplier_id"="supplier_code")) %>%# Calculate additional metricsmutate(profit_margin = (unit_price - cost) / unit_price,profit = total_amount * profit_margin,order_month =floor_date(order_date, "month") )cat("Complete sales analysis dataset:\n")cat("- Total columns:", ncol(sales_analysis), "\n")cat("- Column categories:\n")cat(" - Order info:", sum(str_detect(names(sales_analysis), "order")), "columns\n")cat(" - Customer info:", sum(str_detect(names(sales_analysis), "customer")), "columns\n")cat(" - Product info:", sum(str_detect(names(sales_analysis), "product")), "columns\n")cat(" - Financial metrics:", sum(names(sales_analysis) %in%c("profit", "profit_margin", "total_amount")), "columns\n")# Sample the complete datasetsales_analysis %>%select(order_id, customer_name, product_name, supplier_name, quantity, profit) %>%slice_sample(n =5)```## Performance ConsiderationsWhen working with large datasets, join order matters:```{r}#| label: join-performance# Example of efficient join ordering# Start with the smallest dataset and join progressively# Inefficient: Starting with large tablestart_time1 <-Sys.time()result1 <- sales %>%left_join(sales_analysis, by ="order_id") %>%nrow()time1 <-Sys.time() - start_time1# Efficient: Filter first, then joinstart_time2 <-Sys.time()result2 <- sales %>%filter(order_date >="2024-06-01") %>%left_join(customers, by ="customer_id") %>%left_join(products, by ="product_id") %>%nrow()time2 <-Sys.time() - start_time2cat("Join performance tips:\n")cat("1. Filter data before joining when possible\n")cat("2. Join smaller tables first\n")cat("3. Use appropriate join types (inner join is fastest)\n")cat("4. Index key columns in databases\n")```## Common Join Patterns### Customer Lifetime Value Analysis```{r}#| label: customer-ltv# Calculate customer lifetime valuecustomer_ltv <- sales %>%group_by(customer_id) %>%summarise(first_order =min(order_date),last_order =max(order_date),total_orders =n(),total_spent =sum(total_amount),avg_order_value =mean(total_amount),.groups ="drop" ) %>%inner_join(customers, by ="customer_id") %>%mutate(customer_age_days =as.numeric(today() - join_date),days_active =as.numeric(last_order - first_order) +1,order_frequency = total_orders / days_active *30# Orders per 30 days ) %>%arrange(desc(total_spent))cat("Top 5 customers by lifetime value:\n")customer_ltv %>%select(customer_name, customer_type, total_orders, total_spent, avg_order_value, order_frequency) %>%slice_head(n =5)```### Product Affinity Analysis```{r}#| label: product-affinity# Find products frequently bought togetherproduct_pairs <- sales %>%inner_join(sales, by ="customer_id", suffix =c("_1", "_2")) %>%filter(product_id_1 < product_id_2) %>%# Avoid duplicates and self-joinscount(product_id_1, product_id_2, name ="times_bought_together") %>%filter(times_bought_together >=5) %>%left_join(products, by =c("product_id_1"="product_id")) %>%rename(product_name_1 = product_name) %>%left_join(products, by =c("product_id_2"="product_id")) %>%rename(product_name_2 = product_name) %>%arrange(desc(times_bought_together))cat("Products frequently bought together:\n")product_pairs %>%select(product_name_1, product_name_2, times_bought_together) %>%slice_head(n =10)```## Practice Exercises1. **Sales Dashboard**: Create a comprehensive sales report joining all tables2. **Customer Segmentation**: Use joins to categorize customers based on purchase behavior3. **Inventory Analysis**: Join sales and product data to identify slow-moving items4. **Geographic Analysis**: Join customer location data with sales to analyze regional patterns## SummaryMastering joins is essential for real-world data analysis. Key points:- Choose the right join type for your analysis needs- Understand how joins handle missing values and duplicates- Pay attention to column names and key relationships- Consider performance when joining large datasets- Filter before joining when possible- Use filtering joins (semi/anti) for efficient subsetting- Always verify join results match expectationsNext, explore [Advanced Techniques](advanced-techniques.qmd) for more sophisticated dplyr operations.