Filtering and Selecting Data

Author

IND215

Published

September 22, 2025

Introduction to Data Subsetting

Filtering and selecting are fundamental operations in data analysis. The filter() function chooses which rows to keep based on conditions, while select() chooses which columns to keep. These are often the first steps in any data analysis pipeline.
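
For a quick feel of the two verbs before we build the sales dataset, here is a minimal sketch using the built-in mtcars data (object names are illustrative):

library(dplyr)  # dplyr is also attached by library(tidyverse) below

# Keep only the 6-cylinder cars (rows), then keep just three columns
six_cyl <- mtcars %>%
  filter(cyl == 6) %>%
  select(mpg, cyl, hp)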

library(tidyverse)

# Recreate the sample dataset from the Module 5 index
set.seed(123)
sales_raw <- tibble(
  order_id = paste0("ORD-", str_pad(1:1000, 4, pad = "0")),
  customer_id = sample(paste0("CUST-", 1:200), 1000, replace = TRUE),
  product_id = sample(paste0("PROD-", 1:50), 1000, replace = TRUE),
  order_date = sample(seq(ymd("2023-01-01"), ymd("2024-12-31"), by = "day"), 1000, replace = TRUE),
  quantity = sample(1:10, 1000, replace = TRUE),
  unit_price = round(runif(1000, 5, 200), 2),
  discount_pct = sample(c(0, 0.05, 0.10, 0.15, 0.20), 1000, replace = TRUE, prob = c(0.6, 0.2, 0.1, 0.05, 0.05)),
  sales_rep = sample(c("Alice Johnson", "Bob Smith", "Charlie Brown", "Diana Prince", "Eve Wilson"), 1000, replace = TRUE),
  region = sample(c("North", "South", "East", "West"), 1000, replace = TRUE),
  customer_type = sample(c("Individual", "Small Business", "Enterprise"), 1000, replace = TRUE, prob = c(0.5, 0.3, 0.2))
) %>%
  mutate(
    sales_rep = ifelse(row_number() %in% sample(1:1000, 50), NA, sales_rep),
    discount_pct = ifelse(row_number() %in% sample(1:1000, 30), NA, discount_pct),
    gross_amount = quantity * unit_price,
    discount_amount = gross_amount * coalesce(discount_pct, 0),
    net_amount = gross_amount - discount_amount,
    unit_price = ifelse(row_number() %in% sample(1:1000, 5), unit_price * 100, unit_price),
    quantity = ifelse(row_number() %in% sample(1:1000, 3), 0, quantity)
  )

cat("Working with", nrow(sales_raw), "sales transactions\n")
Working with 1000 sales transactions
cat("Columns:", ncol(sales_raw), "variables\n")
Columns: 13 variables

Filtering Rows with filter()

Basic Filtering

# Single condition
large_orders <- sales_raw %>%
  filter(net_amount > 1000)

cat("Large orders (>$1000):", nrow(large_orders), "transactions\n")
Large orders (>$1000): 146 transactions
# Multiple conditions with AND (comma-separated or &)
premium_orders <- sales_raw %>%
  filter(
    net_amount > 500,
    customer_type == "Enterprise",
    quantity >= 5
  )

cat("Premium enterprise orders:", nrow(premium_orders), "transactions\n")
Premium enterprise orders: 66 transactions
# OR conditions
discount_or_large <- sales_raw %>%
  filter(discount_pct > 0.1 | net_amount > 800)

cat("High discount or large orders:", nrow(discount_or_large), "transactions\n")
High discount or large orders: 296 transactions
# Orders without a discount (zero or missing; the is.na() check keeps NA rows)
no_discount <- sales_raw %>%
  filter(discount_pct == 0 | is.na(discount_pct))

cat("Orders without discount:", nrow(no_discount), "transactions\n")
Orders without discount: 613 transactions

Working with Dates

# Recent orders (since June 2024)
recent_cutoff <- ymd("2024-06-01")
recent_orders <- sales_raw %>%
  filter(order_date >= recent_cutoff)

cat("Recent orders (since June 2024):", nrow(recent_orders), "transactions\n")
Recent orders (since June 2024): 298 transactions
# Specific date ranges
q4_2024 <- sales_raw %>%
  filter(
    order_date >= ymd("2024-10-01"),
    order_date <= ymd("2024-12-31")
  )

cat("Q4 2024 orders:", nrow(q4_2024), "transactions\n")
Q4 2024 orders: 133 transactions
# Using between() for ranges
summer_2024 <- sales_raw %>%
  filter(between(order_date, ymd("2024-06-01"), ymd("2024-08-31")))

cat("Summer 2024 orders:", nrow(summer_2024), "transactions\n")
Summer 2024 orders: 122 transactions
# Filtering by day of week or month
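# Note: wday(label = TRUE) follows the current locale, so "Sat"/"Sun" assumes English day labels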
weekend_orders <- sales_raw %>%
  filter(wday(order_date, label = TRUE) %in% c("Sat", "Sun"))

december_orders <- sales_raw %>%
  filter(month(order_date) == 12)

cat("Weekend orders:", nrow(weekend_orders), "\n")
Weekend orders: 283 
cat("December orders:", nrow(december_orders), "\n")
December orders: 94 

String Filtering

# Exact matches
alice_orders <- sales_raw %>%
  filter(sales_rep == "Alice Johnson")

# Pattern matching with str_detect()
johnson_reps <- sales_raw %>%
  filter(str_detect(sales_rep, "Johnson"))

# Case-insensitive patterns
alice_any_case <- sales_raw %>%
  filter(str_detect(str_to_lower(sales_rep), "alice"))

# Multiple string patterns
senior_reps <- sales_raw %>%
  filter(sales_rep %in% c("Alice Johnson", "Diana Prince"))

# Strings starting with a given prefix (str_ends() works the same way for suffixes)
prod_1_series <- sales_raw %>%
  filter(str_starts(product_id, "PROD-1"))  # matches PROD-1 and PROD-10 through PROD-19

cat("Alice's orders:", nrow(alice_orders), "\n")
Alice's orders: 198 
cat("Johnson family reps:", nrow(johnson_reps), "\n")
Johnson family reps: 198 
cat("Senior reps orders:", nrow(senior_reps), "\n")
Senior reps orders: 398 

Handling Missing Values

# Remove rows with missing sales rep
complete_sales_rep <- sales_raw %>%
  filter(!is.na(sales_rep))

# Only rows with missing values
missing_sales_rep <- sales_raw %>%
  filter(is.na(sales_rep))

# Complete cases only (no missing values in any column)
complete_cases <- sales_raw %>%
  filter(complete.cases(.))

# Rows with any missing values
any_missing <- sales_raw %>%
  filter(!complete.cases(.))

cat("Orders with sales rep assigned:", nrow(complete_sales_rep), "\n")
Orders with sales rep assigned: 950 
cat("Orders missing sales rep:", nrow(missing_sales_rep), "\n")
Orders missing sales rep: 50 
cat("Complete cases (no missing data):", nrow(complete_cases), "\n")
Complete cases (no missing data): 921 
cat("Cases with any missing data:", nrow(any_missing), "\n")
Cases with any missing data: 79 
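
If you prefer to stay within the tidyverse, tidyr::drop_na() and dplyr's if_any() cover the same ground as complete.cases(); a minimal sketch on the same data (object names are illustrative):

# Complete rows only (equivalent to filter(complete.cases(.)))
complete_tidy <- sales_raw %>%
  drop_na()

# Rows with at least one missing value
any_missing_tidy <- sales_raw %>%
  filter(if_any(everything(), is.na))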

Filtering with Value Sets, Complex Conditions, and Outliers

# Using %in% for multiple values
target_regions <- c("North", "East")
target_region_orders <- sales_raw %>%
  filter(region %in% target_regions)

# Complex logical conditions
complex_filter <- sales_raw %>%
  filter(
    (customer_type == "Enterprise" & net_amount > 500) |
    (customer_type == "Small Business" & net_amount > 200) |
    (customer_type == "Individual" & net_amount > 100)
  )

# Filtering based on ranking
top_10_percent <- sales_raw %>%
  filter(net_amount >= quantile(net_amount, 0.9, na.rm = TRUE))

# Outlier detection and removal
q1 <- quantile(sales_raw$net_amount, 0.25, na.rm = TRUE)
q3 <- quantile(sales_raw$net_amount, 0.75, na.rm = TRUE)
iqr <- q3 - q1
outlier_bounds <- c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

no_outliers <- sales_raw %>%
  filter(between(net_amount, outlier_bounds[1], outlier_bounds[2]))

cat("Target regions:", nrow(target_region_orders), "orders\n")
Target regions: 497 orders
cat("Complex filter:", nrow(complex_filter), "orders\n")
Complex filter: 731 orders
cat("Top 10% by value:", nrow(top_10_percent), "orders\n")
Top 10% by value: 100 orders
cat("After removing outliers:", nrow(no_outliers), "orders\n")
After removing outliers: 978 orders

Selecting Columns with select()

Basic Column Selection

# Select specific columns
basic_info <- sales_raw %>%
  select(order_id, customer_id, order_date, net_amount)

# Select columns by position
first_five <- sales_raw %>%
  select(1:5)

# Select all except certain columns
no_internals <- sales_raw %>%
  select(-gross_amount, -discount_amount)

# Reorder columns
reordered <- sales_raw %>%
  select(order_id, order_date, customer_id, region, everything())

cat("Basic info columns:", ncol(basic_info), "\n")
Basic info columns: 4 
cat("First five columns:", ncol(first_five), "\n")
First five columns: 5 
cat("Without internal calculations:", ncol(no_internals), "\n")
Without internal calculations: 11 
cat("Reordered columns:", paste(names(reordered)[1:6], collapse = ", "), "...\n")
Reordered columns: order_id, order_date, customer_id, region, product_id, quantity ...

Helper Functions for Selection

# Select columns by name pattern
id_columns <- sales_raw %>%
  select(ends_with("_id"))

# Select by data type
numeric_columns <- sales_raw %>%
  select(where(is.numeric))

character_columns <- sales_raw %>%
  select(where(is.character))

# Select columns containing certain text
amount_columns <- sales_raw %>%
  select(contains("amount"))

# Select columns matching regex
date_and_id <- sales_raw %>%
  select(matches("(date|_id)"))

# Multiple helper functions
key_metrics <- sales_raw %>%
  select(starts_with("order"), contains("amount"), region, customer_type)

cat("ID columns:", ncol(id_columns), "variables\n")
ID columns: 3 variables
cat("Numeric columns:", ncol(numeric_columns), "variables\n")
Numeric columns: 6 variables
cat("Character columns:", ncol(character_columns), "variables\n")
Character columns: 6 variables
cat("Amount columns:", ncol(amount_columns), "variables\n")
Amount columns: 3 variables
cat("Key metrics:", ncol(key_metrics), "variables\n")
Key metrics: 7 variables
# Show the helper function results
cat("\nID columns:", paste(names(id_columns), collapse = ", "), "\n")

ID columns: order_id, customer_id, product_id 
cat("Amount columns:", paste(names(amount_columns), collapse = ", "), "\n")
Amount columns: gross_amount, discount_amount, net_amount 
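
These helpers can also be combined with & and | or negated with ! inside select(). For example, a small sketch that keeps the numeric input columns but drops the derived amounts (the object name is illustrative):

# Numeric columns that are not calculated amounts
numeric_inputs <- sales_raw %>%
  select(where(is.numeric) & !contains("amount"))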

Renaming During Selection

# Rename while selecting
renamed_selection <- sales_raw %>%
  select(
    order = order_id,
    customer = customer_id,
    date = order_date,
    value = net_amount,
    rep = sales_rep
  )

# Rename with everything()
cleaner_names <- sales_raw %>%
  select(
    order_number = order_id,
    customer_number = customer_id,
    everything()
  )

cat("Renamed selection columns:\n")
Renamed selection columns:
print(names(renamed_selection))
[1] "order"    "customer" "date"     "value"    "rep"     
cat("\nCleaner names (first 8 columns):\n")

Cleaner names (first 8 columns):
print(names(cleaner_names)[1:8])
[1] "order_number"    "customer_number" "product_id"      "order_date"     
[5] "quantity"        "unit_price"      "discount_pct"    "sales_rep"      

Combining filter() and select()

Common Patterns

# Filter then select (most common)
recent_summary <- sales_raw %>%
  filter(
    order_date >= ymd("2024-01-01"),
    !is.na(sales_rep)
  ) %>%
  select(order_id, order_date, sales_rep, region, net_amount, customer_type)

# Select then filter (less common, but sometimes useful)
high_value_basics <- sales_raw %>%
  select(order_id, customer_id, net_amount, customer_type) %>%
  filter(net_amount > 500)

# Complex combination with multiple steps
analysis_ready <- sales_raw %>%
  # Remove problematic data
  filter(
    quantity > 0,                    # No zero quantities
    unit_price <= 300,               # Remove pricing errors
    !is.na(sales_rep)                # Must have assigned rep
  ) %>%
  # Select relevant columns
  select(
    order_id,
    customer_id,
    order_date,
    quantity,
    unit_price,
    discount_pct,
    net_amount,
    sales_rep,
    region,
    customer_type
  ) %>%
  # Add derived columns for analysis
  mutate(
    order_month = floor_date(order_date, "month"),
    is_discounted = !is.na(discount_pct) & discount_pct > 0,
    order_size = case_when(
      net_amount >= 1000 ~ "Large",
      net_amount >= 300 ~ "Medium",
      TRUE ~ "Small"
    )
  )

cat("Recent summary:", nrow(recent_summary), "rows,", ncol(recent_summary), "columns\n")
Recent summary: 452 rows, 6 columns
cat("Analysis ready:", nrow(analysis_ready), "rows,", ncol(analysis_ready), "columns\n")
Analysis ready: 942 rows, 13 columns
# Show the first few rows of analysis-ready data
print(head(analysis_ready, 5))
# A tibble: 5 × 13
  order_id customer_id order_date quantity unit_price discount_pct net_amount
  <chr>    <chr>       <date>        <dbl>      <dbl>        <dbl>      <dbl>
1 ORD-0001 CUST-159    2023-04-27        1      159.           0         159.
2 ORD-0002 CUST-179    2023-09-10        4      190.           0.1       684.
3 ORD-0003 CUST-14     2023-08-15        7      134.           0         938.
4 ORD-0004 CUST-195    2024-05-07        9       93.9          0.1       760.
5 ORD-0005 CUST-170    2023-09-11        6      189.           0        1135.
# ℹ 6 more variables: sales_rep <chr>, region <chr>, customer_type <chr>,
#   order_month <date>, is_discounted <lgl>, order_size <chr>

Working with Grouped Data

# Filter within groups
top_orders_by_region <- sales_raw %>%
  group_by(region) %>%
  filter(net_amount >= quantile(net_amount, 0.9, na.rm = TRUE)) %>%
  ungroup()

# Select and filter with grouping
regional_summaries <- sales_raw %>%
  filter(!is.na(sales_rep)) %>%
  group_by(region, sales_rep) %>%
  select(region, sales_rep, net_amount, quantity) %>%
  summarise(
    total_sales = sum(net_amount),
    total_orders = n(),
    avg_order_value = mean(net_amount),
    .groups = "drop"
  )

cat("Top orders by region:", nrow(top_orders_by_region), "orders\n")
Top orders by region: 102 orders
cat("Regional summaries:", nrow(regional_summaries), "rep-region combinations\n")
Regional summaries: 20 rep-region combinations
print(head(regional_summaries))
# A tibble: 6 × 5
  region sales_rep     total_sales total_orders avg_order_value
  <chr>  <chr>               <dbl>        <int>           <dbl>
1 East   Alice Johnson      24219.           47            515.
2 East   Bob Smith          28238.           49            576.
3 East   Charlie Brown      21353.           44            485.
4 East   Diana Prince       27990.           52            538.
5 East   Eve Wilson         25531.           45            567.
6 North  Alice Johnson      29621.           51            581.

Advanced Filtering Techniques

Using filter() with Window Functions

# Top N orders per customer
top_orders_per_customer <- sales_raw %>%
  group_by(customer_id) %>%
  filter(rank(desc(net_amount)) <= 3) %>%
  arrange(customer_id, desc(net_amount)) %>%
  ungroup()

# Orders above regional average
above_regional_avg <- sales_raw %>%
  group_by(region) %>%
  filter(net_amount > mean(net_amount, na.rm = TRUE)) %>%
  ungroup()

# Recent orders within each customer
recent_per_customer <- sales_raw %>%
  group_by(customer_id) %>%
  filter(order_date >= max(order_date) - days(30)) %>%
  ungroup()

cat("Top 3 orders per customer:", nrow(top_orders_per_customer), "orders\n")
Top 3 orders per customer: 568 orders
cat("Above regional average:", nrow(above_regional_avg), "orders\n")
Above regional average: 395 orders
cat("Recent per customer:", nrow(recent_per_customer), "orders\n")
Recent per customer: 237 orders
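
For "top N per group" specifically, dplyr's slice_max() (and slice_min()) expresses the intent more directly than rank(); a sketch on the same data (the object name is illustrative):

# Top 3 orders per customer, breaking ties arbitrarily
top3_per_customer <- sales_raw %>%
  group_by(customer_id) %>%
  slice_max(net_amount, n = 3, with_ties = FALSE) %>%
  ungroup()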

Conditional Filtering

# Filter with policy rules that vary by customer type
# (case_when() returns NA when discount_pct is missing, so filter() drops those rows)
consistent_discounts <- sales_raw %>%
  filter(
    case_when(
      customer_type == "Enterprise" ~ discount_pct <= 0.15,
      customer_type == "Small Business" ~ discount_pct <= 0.10,
      customer_type == "Individual" ~ discount_pct <= 0.05,
      TRUE ~ TRUE
    )
  )

# Filter using if_else logic
weekend_enterprise <- sales_raw %>%
  filter(
    if_else(
      wday(order_date) %in% c(1, 7),  # Weekend (1 = Sunday, 7 = Saturday with the default week_start)
      customer_type == "Enterprise",   # Only enterprise on weekends
      TRUE                            # Any customer type on weekdays
    )
  )

cat("Consistent discount policies:", nrow(inconsistent_discounts), "orders\n")
Consistent discount policies: 836 orders
cat("Weekend enterprise filter:", nrow(weekend_enterprise), "orders\n")
Weekend enterprise filter: 766 orders

Performance and Best Practices

Efficient Filtering

# Early filtering reduces computation
# Good: Filter early
efficient_pipeline <- sales_raw %>%
  filter(
    order_date >= ymd("2024-01-01"),
    !is.na(sales_rep),
    net_amount > 100
  ) %>%
  select(order_id, customer_id, order_date, net_amount, region) %>%
  mutate(
    quarter = quarter(order_date),
    high_value = net_amount > 500
  )

# Less efficient: Filter after expensive operations
# (We won't run this, but it shows the pattern to avoid)
# inefficient_pipeline <- sales_raw %>%
#   mutate(complex_calculation = net_amount * some_expensive_function()) %>%
#   left_join(large_table) %>%
#   filter(order_date >= ymd("2024-01-01"))

# Use specific conditions instead of complex ones
# Good: Specific numeric ranges
specific_range <- sales_raw %>%
  filter(between(net_amount, 100, 1000))

# Less good: Complex calculations in filter
# avoid_complex <- sales_raw %>%
#   filter(log(net_amount) > some_threshold)

cat("Efficient pipeline result:", nrow(efficient_pipeline), "rows\n")
Efficient pipeline result: 401 rows
cat("Specific range filter:", nrow(specific_range), "rows\n")
Specific range filter: 736 rows

Readable Filter Conditions

# Create intermediate variables for complex conditions
high_value_threshold <- 1000
recent_date_cutoff <- ymd("2024-06-01")
target_customer_types <- c("Enterprise", "Small Business")

readable_filter <- sales_raw %>%
  filter(
    net_amount >= high_value_threshold,
    order_date >= recent_date_cutoff,
    customer_type %in% target_customer_types,
    !is.na(sales_rep)
  )

# Use meaningful helper functions
filter_valid_orders <- function(data) {
  data %>%
    filter(
      quantity > 0,
      unit_price > 0,
      unit_price <= 500,  # Remove pricing errors
      !is.na(sales_rep)
    )
}

valid_orders <- sales_raw %>%
  filter_valid_orders()

cat("Readable filter:", nrow(readable_filter), "orders\n")
Readable filter: 27 orders
cat("Valid orders:", nrow(valid_orders), "orders\n")
Valid orders: 942 orders

Real-World Examples

Example 1: Customer Segmentation Analysis

# Identify different customer segments for targeted analysis
customer_segments <- sales_raw %>%
  filter(
    !is.na(sales_rep),
    quantity > 0,
    order_date >= ymd("2024-01-01")
  ) %>%
  group_by(customer_id) %>%
  summarise(
    total_orders = n(),
    total_spent = sum(net_amount),
    avg_order_value = mean(net_amount),
    first_order = min(order_date),
    last_order = max(order_date),
    customer_type = first(customer_type),
    .groups = "drop"
  ) %>%
  mutate(
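    # Note: recency is measured from today's date (Sys.Date()), so with this historical
    # sample data most customers look inactive and the "Active" segments can be empty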
    recency_days = as.numeric(Sys.Date() - last_order),
    customer_lifetime = as.numeric(last_order - first_order),
    segment = case_when(
      total_spent >= 5000 & recency_days <= 30 ~ "VIP Active",
      total_spent >= 5000 & recency_days > 30 ~ "VIP At Risk",
      total_spent >= 1000 & recency_days <= 60 ~ "Regular Active",
      total_spent >= 1000 & recency_days > 60 ~ "Regular At Risk",
      recency_days <= 30 ~ "New Active",
      TRUE ~ "Low Value"
    )
  )

# Filter for each segment
vip_customers <- customer_segments %>%
  filter(str_starts(segment, "VIP"))

at_risk_customers <- customer_segments %>%
  filter(str_ends(segment, "At Risk"))

# Segment summary
segment_summary <- customer_segments %>%
  count(segment, sort = TRUE) %>%
  mutate(percentage = round(n / sum(n) * 100, 1))

cat("Customer segmentation results:\n")
Customer segmentation results:
print(segment_summary)
# A tibble: 2 × 3
  segment             n percentage
  <chr>           <int>      <dbl>
1 Regular At Risk   103       55.7
2 Low Value          82       44.3
cat("\nVIP customers:", nrow(vip_customers), "\n")

VIP customers: 0 
cat("At-risk customers:", nrow(at_risk_customers), "\n")
At-risk customers: 103 

Example 2: Sales Performance Analysis

# Analyze sales rep performance with filters
rep_performance <- sales_raw %>%
  filter(
    !is.na(sales_rep),
    order_date >= ymd("2024-01-01"),
    quantity > 0,
    unit_price <= 300  # Remove pricing errors
  ) %>%
  group_by(sales_rep, region) %>%
  summarise(
    total_sales = sum(net_amount),
    order_count = n(),
    avg_order_value = mean(net_amount),
    unique_customers = n_distinct(customer_id),
    .groups = "drop"
  ) %>%
  mutate(
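    # scale() returns a one-column matrix of z-scores; [, 1] extracts it as a plain vector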
    performance_score = scale(total_sales)[,1] + scale(unique_customers)[,1],
    performance_tier = case_when(
      performance_score >= 1 ~ "Top Performer",
      performance_score >= 0 ~ "Good Performer",
      performance_score >= -1 ~ "Average Performer",
      TRUE ~ "Needs Improvement"
    )
  )

# Filter for different performance tiers
top_performers <- rep_performance %>%
  filter(performance_tier == "Top Performer")

underperformers <- rep_performance %>%
  filter(performance_tier == "Needs Improvement")

cat("Sales rep performance analysis:\n")
Sales rep performance analysis:
print(rep_performance %>% arrange(desc(performance_score)))
# A tibble: 20 × 8
   sales_rep     region total_sales order_count avg_order_value unique_customers
   <chr>         <chr>        <dbl>       <int>           <dbl>            <int>
 1 Bob Smith     West        15656.          29            540.               26
 2 Eve Wilson    West        14708.          26            566.               25
 3 Diana Prince  South       13045.          28            466.               27
 4 Diana Prince  East        14435.          26            555.               23
 5 Alice Johnson West        14320.          25            573.               23
 6 Diana Prince  North       14280.          28            510.               23
 7 Eve Wilson    East        13719.          24            572.               23
 8 Charlie Brown North       11918.          26            458.               24
 9 Charlie Brown West        12222.          24            509.               23
10 Alice Johnson North       13003.          23            565.               21
11 Eve Wilson    North       11593.          23            504.               22
12 Charlie Brown East        10386.          23            452.               21
13 Bob Smith     South       11820.          19            622.               18
14 Diana Prince  West         9331.          21            444.               20
15 Charlie Brown South       10822.          17            637.               17
16 Bob Smith     East        11734.          16            733.               15
17 Eve Wilson    South        9807.          20            490.               18
18 Alice Johnson South        7869.          21            375.               20
19 Alice Johnson East         9434.          18            524.               17
20 Bob Smith     North        6793.          11            618.               11
# ℹ 2 more variables: performance_score <dbl>, performance_tier <chr>
cat("\nTop performers:", nrow(top_performers), "\n")

Top performers: 7 
cat("Need improvement:", nrow(underperformers), "\n")
Need improvement: 7 

Example 3: Product Analysis

# Analyze product performance across different dimensions
product_analysis <- sales_raw %>%
  filter(
    !is.na(sales_rep),
    quantity > 0,
    order_date >= ymd("2024-01-01")
  ) %>%
  group_by(product_id, region, customer_type) %>%
  summarise(
    units_sold = sum(quantity),
    revenue = sum(net_amount),
    orders = n(),
    avg_price = mean(unit_price),
    .groups = "drop"
  ) %>%
  group_by(product_id) %>%
  mutate(
    product_total_revenue = sum(revenue),
    regional_share = revenue / product_total_revenue
  ) %>%
  ungroup()

# Filter for top products
top_products <- product_analysis %>%
  group_by(product_id) %>%
  summarise(total_revenue = sum(revenue), .groups = "drop") %>%
  filter(total_revenue >= quantile(total_revenue, 0.8)) %>%
  pull(product_id)

top_product_details <- product_analysis %>%
  filter(product_id %in% top_products) %>%
  arrange(desc(revenue))

# Products that perform well in specific segments
enterprise_favorites <- product_analysis %>%
  filter(customer_type == "Enterprise") %>%
  group_by(product_id) %>%
  summarise(enterprise_revenue = sum(revenue), .groups = "drop") %>%
  filter(enterprise_revenue >= quantile(enterprise_revenue, 0.9))

cat("Product analysis results:\n")
Product analysis results:
cat("Top products (top 20%):", length(top_products), "products\n")
Top products (top 20%): 10 products
cat("Enterprise favorites (top 10%):", nrow(enterprise_favorites), "products\n")
Enterprise favorites (top 10%): 5 products
print(head(top_product_details))
# A tibble: 6 × 9
  product_id region customer_type  units_sold revenue orders avg_price
  <chr>      <chr>  <chr>               <dbl>   <dbl>  <int>     <dbl>
1 PROD-46    West   Individual             24   3610.      4      158.
2 PROD-34    East   Individual             19   2563.      2      137.
3 PROD-10    East   Small Business         14   2515.      3      182.
4 PROD-1     South  Individual             16   2465.      2      154.
5 PROD-34    South  Individual             17   2121.      2      133.
6 PROD-46    South  Enterprise             18   2025.      2      127.
# ℹ 2 more variables: product_total_revenue <dbl>, regional_share <dbl>

Common Pitfalls and Solutions

Missing Data Handling

# Pitfall: Forgetting about NA values in conditions
# This might not work as expected
# problematic_filter <- sales_raw %>%
#   filter(discount_pct > 0.1)  # Excludes NA values!

# Better: Explicitly handle NA values
explicit_na_handling <- sales_raw %>%
  filter(!is.na(discount_pct) & discount_pct > 0.1)

# Or include NA values if appropriate
include_na <- sales_raw %>%
  filter(is.na(discount_pct) | discount_pct > 0.1)

cat("Explicit NA handling:", nrow(explicit_na_handling), "rows\n")
Explicit NA handling: 91 rows
cat("Including NA values:", nrow(include_na), "rows\n")
Including NA values: 121 rows

Logical Operator Precedence

# Pitfall: Operator precedence can cause unexpected results
# This might not work as intended:
# ambiguous <- sales_raw %>%
#   filter(customer_type == "Enterprise" | customer_type == "Small Business" & net_amount > 500)

# Better: Use parentheses for clarity
clear_logic <- sales_raw %>%
  filter(
    (customer_type == "Enterprise" | customer_type == "Small Business") &
    net_amount > 500
  )

# Or use %in% for multiple values
cleaner_multiple <- sales_raw %>%
  filter(
    customer_type %in% c("Enterprise", "Small Business"),
    net_amount > 500
  )

cat("Clear logic with parentheses:", nrow(clear_logic), "rows\n")
Clear logic with parentheses: 195 rows
cat("Cleaner multiple value filter:", nrow(cleaner_multiple), "rows\n")
Cleaner multiple value filter: 195 rows

Exercises

Exercise 1: Data Quality Assessment

Using the sales dataset:

  1. Filter for orders with potential data quality issues
  2. Identify orders with suspicious pricing (very high or very low)
  3. Find orders with missing critical information
  4. Create a “clean” dataset by removing problematic records

Exercise 2: Time-Based Analysis

Create filters for:

  1. Orders from the last quarter
  2. Weekend vs weekday patterns
  3. Month-end effects (last 3 days of each month; a starter sketch follows this list)
  4. Seasonal patterns (summer vs winter orders)
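
A possible starting point for the month-end filter (item 3), using lubridate's days_in_month(); this is a hint rather than a full solution, and the object name is illustrative:

# Orders placed in the last 3 days of their month
month_end_orders <- sales_raw %>%
  filter(days_in_month(order_date) - day(order_date) <= 2)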

Exercise 3: Customer Analysis

Filter and analyze:

  1. First-time customers vs repeat customers
  2. High-value customers (top 10% by total spending)
  3. Customers with recent activity
  4. Customers who haven’t ordered in 90+ days

Exercise 4: Advanced Selection

Practice column selection:

  1. Create different views of the data for different stakeholders
  2. Select columns based on data types and patterns
  3. Rename columns for better readability
  4. Create summary datasets with only key metrics

Summary

Filtering and selecting are the foundation of data manipulation in dplyr:

Key Functions:

  • filter(): Choose rows based on conditions
  • select(): Choose columns by name, pattern, or type
  • Helper functions: starts_with(), ends_with(), contains(), where()

Best Practices:

  • Filter early in your pipeline to improve performance
  • Handle missing values explicitly in filter conditions
  • Use meaningful variable names for complex conditions
  • Combine multiple simple filters rather than one complex filter
  • Test filters incrementally to ensure they work as expected

Common Patterns:

  • Filter then select (most common workflow)
  • Use %in% for multiple value matching
  • Combine with date functions for time-based analysis
  • Use window functions for relative filtering

Remember:

  • Logical operators: & (AND), | (OR), ! (NOT)
  • Missing values require special handling
  • Use parentheses to clarify complex logical conditions
  • Performance matters: filter early and select what you need

Mastering filtering and selecting sets you up for success with all other dplyr operations. These skills form the foundation for every data analysis pipeline!

Next: Mutating and Arranging Data