Filtering and selecting are fundamental operations in data analysis. The filter() function chooses which rows to keep based on conditions, while select() chooses which columns to keep. These are often the first steps in any data analysis pipeline.
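As a minimal sketch of the two verbs working together (using a small toy tibble rather than the chapter's `sales_raw` data):

```r
library(dplyr)

# Toy data standing in for the chapter's sales_raw dataset
orders <- tibble(
  order_id   = 1:4,
  net_amount = c(120, 650, 900, 300),
  region     = c("East", "West", "East", "North")
)

# filter() keeps rows that satisfy the conditions;
# select() keeps only the named columns
east_orders <- orders %>%
  filter(region == "East", net_amount > 100) %>%
  select(order_id, net_amount)

east_orders
```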
```r
# Multiple conditions with AND (comma-separated or &)
premium_orders <- sales_raw %>%
  filter(
    net_amount > 500,
    customer_type == "Enterprise",
    quantity >= 5
  )

cat("Premium enterprise orders:", nrow(premium_orders), "transactions\n")
```
Premium enterprise orders: 66 transactions
```r
# OR conditions
discount_or_large <- sales_raw %>%
  filter(discount_pct > 0.1 | net_amount > 800)

cat("High discount or large orders:", nrow(discount_or_large), "transactions\n")
```
High discount or large orders: 296 transactions
```r
# NOT conditions: orders without a discount (zero or missing)
no_discount <- sales_raw %>%
  filter(discount_pct == 0 | is.na(discount_pct))

cat("Orders without discount:", nrow(no_discount), "transactions\n")
```
```r
# Remove rows with missing sales rep
complete_sales_rep <- sales_raw %>%
  filter(!is.na(sales_rep))

# Only rows with missing values
missing_sales_rep <- sales_raw %>%
  filter(is.na(sales_rep))

# Complete cases only (no missing values in any column)
complete_cases <- sales_raw %>%
  filter(complete.cases(.))

# Rows with any missing values
any_missing <- sales_raw %>%
  filter(!complete.cases(.))

cat("Orders with sales rep assigned:", nrow(complete_sales_rep), "\n")
```
Advanced Filtering Techniques
Using filter() with Window Functions
```r
# Top N orders per customer
top_orders_per_customer <- sales_raw %>%
  group_by(customer_id) %>%
  filter(rank(desc(net_amount)) <= 3) %>%
  arrange(customer_id, desc(net_amount)) %>%
  ungroup()

# Orders above regional average
above_regional_avg <- sales_raw %>%
  group_by(region) %>%
  filter(net_amount > mean(net_amount, na.rm = TRUE)) %>%
  ungroup()

# Recent orders within each customer
recent_per_customer <- sales_raw %>%
  group_by(customer_id) %>%
  filter(order_date >= max(order_date) - days(30)) %>%
  ungroup()

cat("Top 3 orders per customer:", nrow(top_orders_per_customer), "orders\n")
```
Common Pitfalls and Solutions
Missing Data Handling
```r
# Pitfall: Forgetting about NA values in conditions
# This might not work as expected:
# problematic_filter <- sales_raw %>%
#   filter(discount_pct > 0.1)  # Excludes NA values!

# Better: Explicitly handle NA values
explicit_na_handling <- sales_raw %>%
  filter(!is.na(discount_pct) & discount_pct > 0.1)

# Or include NA values if appropriate
include_na <- sales_raw %>%
  filter(is.na(discount_pct) | discount_pct > 0.1)

cat("Explicit NA handling:", nrow(explicit_na_handling), "rows\n")
```
Explicit NA handling: 91 rows
```r
cat("Including NA values:", nrow(include_na), "rows\n")
```
Including NA values: 121 rows
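The reason this pitfall bites is that comparisons involving `NA` return `NA`, and `filter()` keeps only rows where the condition evaluates to `TRUE`. A quick demonstration with toy data (not the chapter's `sales_raw`):

```r
library(dplyr)

NA > 0.1  # evaluates to NA, not FALSE

df <- tibble(discount_pct = c(0.05, 0.2, NA))

# The NA row is silently dropped: only the 0.2 row survives
dropped <- df %>% filter(discount_pct > 0.1)

# Explicitly keeping NA rows retains both the 0.2 row and the NA row
kept <- df %>% filter(is.na(discount_pct) | discount_pct > 0.1)
```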
Logical Operator Precedence
```r
# Pitfall: Operator precedence can cause unexpected results
# This might not work as intended (& binds more tightly than |):
# ambiguous <- sales_raw %>%
#   filter(customer_type == "Enterprise" | customer_type == "Small Business" & net_amount > 500)

# Better: Use parentheses for clarity
clear_logic <- sales_raw %>%
  filter(
    (customer_type == "Enterprise" | customer_type == "Small Business") &
      net_amount > 500
  )

# Or use %in% for multiple values
cleaner_multiple <- sales_raw %>%
  filter(
    customer_type %in% c("Enterprise", "Small Business"),
    net_amount > 500
  )

cat("Clear logic with parentheses:", nrow(clear_logic), "rows\n")
```
Clear logic with parentheses: 195 rows
```r
cat("Cleaner multiple value filter:", nrow(cleaner_multiple), "rows\n")
```
Cleaner multiple value filter: 195 rows
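The precedence rule can be checked directly in base R: `&` is evaluated before `|`, so the ambiguous filter above parses as `customer_type == "Enterprise" | (customer_type == "Small Business" & net_amount > 500)`.

```r
# & is evaluated before |, so these two expressions differ
x <- TRUE | FALSE & FALSE    # parsed as TRUE | (FALSE & FALSE) -> TRUE
y <- (TRUE | FALSE) & FALSE  # explicit grouping                -> FALSE
```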
Exercises
Exercise 1: Data Quality Assessment
Using the sales dataset:
1. Filter for orders with potential data quality issues
2. Identify orders with suspicious pricing (very high or very low)
3. Find orders with missing critical information
4. Create a “clean” dataset by removing problematic records
Exercise 2: Time-Based Analysis
Create filters for:
1. Orders from the last quarter
2. Weekend vs weekday patterns
3. Month-end effects (last 3 days of each month)
4. Seasonal patterns (summer vs winter orders)
Exercise 3: Customer Analysis
Filter and analyze:
1. First-time customers vs repeat customers
2. High-value customers (top 10% by total spending)
3. Customers with recent activity
4. Customers who haven’t ordered in 90+ days
Exercise 4: Advanced Selection
Practice column selection:
1. Create different views of the data for different stakeholders
2. Select columns based on data types and patterns
3. Rename columns for better readability
4. Create summary datasets with only key metrics
Summary
Filtering and selecting are the foundation of data manipulation in dplyr:
Key Functions:
filter(): Choose rows based on conditions
select(): Choose columns by name, pattern, or type
Best Practices:
Filter early in your pipeline to improve performance
Handle missing values explicitly in filter conditions
Use meaningful variable names for complex conditions
Combine multiple simple filters rather than one complex filter
Test filters incrementally to ensure they work as expected
Common Patterns:
Filter then select (most common workflow)
Use %in% for multiple value matching
Combine with date functions for time-based analysis
Use window functions for relative filtering
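The first two patterns can be sketched together in a few lines (toy data, not the chapter's `sales_raw`):

```r
library(dplyr)

# Toy data for illustrating the common patterns
sales <- tibble(
  order_id      = 1:4,
  customer_type = c("Enterprise", "Individual", "Small Business", "Enterprise"),
  net_amount    = c(700, 120, 560, 300)
)

# Filter first (using %in% for multiple value matching),
# then select only the columns you need
result <- sales %>%
  filter(customer_type %in% c("Enterprise", "Small Business"),
         net_amount > 500) %>%
  select(customer_type, net_amount)
```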
Remember:
Logical operators: & (AND), | (OR), ! (NOT)
Missing values require special handling
Use parentheses to clarify complex logical conditions
Performance matters: filter early and select what you need
Mastering filtering and selecting sets you up for success with all other dplyr operations. These skills form the foundation for every data analysis pipeline!