Factors are R’s way of representing categorical data: data that can take only a limited set of values. Examples include survey responses (Poor/Fair/Good/Excellent), education levels (High School/Bachelor’s/Master’s/PhD), or product categories. The forcats package provides a consistent set of functions for counting, reordering, recoding, and combining factor levels.
library(tidyverse)
library(forcats)

# All forcats functions start with fct_ for easy discovery
cat("Key forcats functions we'll explore:\n")
Key forcats functions we'll explore:
cat("- fct_count(): count factor levels\n")
- fct_count(): count factor levels
cat("- fct_reorder(): reorder by another variable\n")
- fct_reorder(): reorder by another variable
cat("- fct_relevel(): manually reorder levels\n")
- fct_relevel(): manually reorder levels
cat("- fct_recode(): change level names\n")
- fct_recode(): change level names
cat("- fct_collapse(): combine levels\n")
- fct_collapse(): combine levels
cat("- fct_lump(): combine rare levels\n")
- fct_lump(): combine rare levels
cat("- fct_drop(): remove unused levels\n")
- fct_drop(): remove unused levels
Understanding Factors vs Characters
Creating and Examining Factors
# Character vector vs factor
satisfaction_char <- c("Good", "Excellent", "Poor", "Good", "Fair", "Excellent")
satisfaction_factor <- factor(satisfaction_char,
                              levels = c("Poor", "Fair", "Good", "Excellent"))

# Compare the two
cat("Character vector:\n")
print(table(satisfaction_char))
satisfaction_char
Excellent Fair Good Poor
2 1 2 1
cat("\nFactor with ordered levels:\n")
Factor with ordered levels:
print(satisfaction_factor)
[1] Good Excellent Poor Good Fair Excellent
Levels: Poor Fair Good Excellent
print(table(satisfaction_factor))
satisfaction_factor
Poor Fair Good Excellent
1 1 2 2
# Notice how factors maintain level order even for missing categories
satisfaction_subset <- satisfaction_factor[1:3]  # Only has "Good", "Excellent", "Poor"
print(table(satisfaction_subset))  # Still shows all 4 levels
satisfaction_subset
Poor Fair Good Excellent
1 0 1 1
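The difference between the two representations also shows up when sorting. The following is a minimal sketch that redefines the same six responses so it runs on its own; it only uses base R.

```r
# Same six responses as above, redefined so the snippet is self-contained
satisfaction_char <- c("Good", "Excellent", "Poor", "Good", "Fair", "Excellent")
satisfaction_factor <- factor(
  satisfaction_char,
  levels = c("Poor", "Fair", "Good", "Excellent")
)

# Character vectors sort alphabetically
sorted_char <- sort(satisfaction_char)

# Factors sort by level order (Poor, Fair, Good, Excellent)
sorted_factor <- sort(satisfaction_factor)

# Internally, a factor stores integer positions into its levels
codes <- as.integer(satisfaction_factor)

print(sorted_char)
print(as.character(sorted_factor))
print(codes)
```

Because the factor carries its own level order, plots and tables built from it follow the Poor-to-Excellent scale rather than the alphabet.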
# A tibble: 5 × 2
f n
<ord> <int>
1 Very Dissatisfied 3
2 Dissatisfied 5
3 Neutral 4
4 Satisfied 3
5 Very Satisfied 5
fct_count(survey_data$department)
# A tibble: 5 × 2
f n
<fct> <int>
1 Finance 2
2 HR 5
3 IT 6
4 Marketing 2
5 Sales 5
fct_count(survey_data$experience)
# A tibble: 4 × 2
f n
<ord> <int>
1 0-2 years 5
2 3-5 years 7
3 6-10 years 3
4 10+ years 5
# Count with sorting
fct_count(survey_data$department, sort = TRUE)
# A tibble: 5 × 2
f n
<fct> <int>
1 IT 6
2 HR 5
3 Sales 5
4 Finance 2
5 Marketing 2
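fct_count() can also report each level's share of the total. The sketch below uses a small hypothetical department vector (not the tutorial's survey_data) and the prop argument, which adds a column named p.

```r
library(forcats)

# Hypothetical department vector, used only for illustration
department <- factor(c("IT", "IT", "IT", "HR", "HR", "Sales"))

# prop = TRUE adds a column `p` holding each level's proportion
dept_counts <- fct_count(department, sort = TRUE, prop = TRUE)
print(dept_counts)
```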
Reordering Factor Levels
# Create sample sales data by region
sales_data <- tibble(
  region = c("North", "South", "East", "West", "North", "South", "East", "West"),
  quarter = rep(c("Q1", "Q2"), 4),
  sales = c(120, 95, 110, 88, 135, 102, 118, 92)
)

# Convert region to factor
sales_data$region <- factor(sales_data$region)

# Default factor order (alphabetical)
levels(sales_data$region)
[1] "East" "North" "South" "West"
# Reorder by total sales (descending)
sales_by_region <- sales_data %>%
  group_by(region) %>%
  summarise(total_sales = sum(sales), .groups = "drop") %>%
  mutate(region = fct_reorder(region, total_sales, .desc = TRUE))

# Row order is unchanged; only the level order changes
print(sales_by_region)
# A tibble: 4 × 2
region total_sales
<fct> <dbl>
1 East 228
2 North 255
3 South 197
4 West 180
levels(sales_by_region$region)
[1] "North" "East" "South" "West"
# Reorder region levels by mean sales per region (descending)
sales_data_ordered <- sales_data %>%
  mutate(region = fct_reorder(region, sales, .fun = mean, .desc = TRUE))
print(sales_data_ordered)
# A tibble: 8 × 3
region quarter sales
<fct> <chr> <dbl>
1 North Q1 120
2 South Q2 95
3 East Q1 110
4 West Q2 88
5 North Q1 135
6 South Q2 102
7 East Q1 118
8 West Q2 92
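When the desired order does not come from another variable, fct_relevel() sets it by hand. This is a minimal sketch on a hypothetical four-region factor; the variable names are illustrative only.

```r
library(forcats)

# Hypothetical region factor; default levels are alphabetical
region <- factor(c("North", "South", "East", "West"))

# Move one level to the front
west_first <- fct_relevel(region, "West")

# Move a level to the end with after = Inf
east_last <- fct_relevel(region, "East", after = Inf)

# Spell out a complete custom order
compass <- fct_relevel(region, "North", "East", "South", "West")

print(levels(west_first))
print(levels(east_last))
print(levels(compass))
```

Levels that are not named keep their existing relative order, so only the levels that need to move have to be listed.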
# Sample product categories
products <- tibble(
  item = 1:15,
  category = sample(c("Electronics", "Clothing", "Home", "Sports", "Books"),
                    15, replace = TRUE),
  subcategory = sample(c("Phone", "Laptop", "Shirt", "Pants", "Furniture",
                         "Basketball", "Soccer", "Fiction", "Non-fiction"),
                       15, replace = TRUE)
)

# Convert to factor
products$category <- factor(products$category)

# Recode factor levels
products$category_recoded <- fct_recode(products$category,
  "Tech" = "Electronics",
  "Apparel" = "Clothing",
  "Home & Garden" = "Home",
  "Sporting Goods" = "Sports"
  # "Books" stays the same (not mentioned)
)

# Compare original and recoded
comparison <- products %>%
  count(category, category_recoded)
print(comparison)
# A tibble: 5 × 3
category category_recoded n
<fct> <fct> <int>
1 Books Books 5
2 Clothing Apparel 3
3 Electronics Tech 2
4 Home Home & Garden 2
5 Sports Sporting Goods 3
# Recode with error checking (forcats will warn about typos)
# This would give a warning:
# products$category_bad <- fct_recode(products$category, "Tech" = "Electronic")  # Note the typo
# A tibble: 25 × 2
f n
<fct> <int>
1 CA 10
2 FL 10
3 IL 8
4 NY 8
5 NC 7
6 TX 6
7 GA 5
8 AL 4
9 MA 4
10 OH 4
# ℹ 15 more rows
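# Lump least frequent states into "Other"
# n = 10 keeps the top 10; states tied at the cutoff are all kept,
# which is why more than 10 states can remain
customer_data$state_lumped <- fct_lump_n(customer_data$state, n = 10)
cat("\nAfter lumping (keep top 10):\n")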
After lumping (keep top 10):
fct_count(customer_data$state_lumped, sort =TRUE)
# A tibble: 13 × 2
f n
<fct> <int>
1 Other 26
2 CA 10
3 FL 10
4 IL 8
5 NY 8
6 NC 7
7 TX 6
8 GA 5
9 AL 4
10 MA 4
11 OH 4
12 PA 4
13 VA 4
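The result above keeps 12 states rather than 10 because several states tie at four customers. A small sketch of that tie behaviour follows, using hypothetical data and the ties.method argument of fct_lump_n().

```r
library(forcats)

# Hypothetical data: counts are a = 4, b = 3, c = 2, d = 2, e = 1, f = 1
state <- factor(rep(c("a", "b", "c", "d", "e", "f"), times = c(4, 3, 2, 2, 1, 1)))

# Default ties.method = "min": c and d tie for third place, so both are kept
keep_ties <- fct_lump_n(state, n = 3)

# ties.method = "first" breaks the tie by level order and keeps exactly three
exact_three <- fct_lump_n(state, n = 3, ties.method = "first")

print(levels(keep_ties))
print(levels(exact_three))
```

Which tie rule is appropriate depends on the analysis; keeping ties avoids an arbitrary cut, while "first" gives a fixed number of levels.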
# Lump states with fewer than 3 customers
customer_data$state_min <- fct_lump_min(customer_data$state, min = 3)
cat("\nAfter lumping (minimum 3 customers):\n")
After lumping (minimum 3 customers):
fct_count(customer_data$state_min, sort =TRUE)
# A tibble: 17 × 2
f n
<fct> <int>
1 Other 14
2 CA 10
3 FL 10
4 IL 8
5 NY 8
6 NC 7
7 TX 6
8 GA 5
9 AL 4
10 MA 4
11 OH 4
12 PA 4
13 VA 4
14 AZ 3
15 CT 3
16 MI 3
17 OK 3
# Lump by proportion (keep states with more than 5% of customers;
# a state at exactly 5%, such as GA here, is lumped)
customer_data$state_prop <- fct_lump_prop(customer_data$state, prop = 0.05)
cat("\nAfter lumping (more than 5% proportion):\n")
After lumping (more than 5% proportion):
fct_count(customer_data$state_prop, sort =TRUE)
# A tibble: 7 × 2
f n
<fct> <int>
1 Other 51
2 CA 10
3 FL 10
4 IL 8
5 NY 8
6 NC 7
7 TX 6
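The threshold in fct_lump_prop() is strict: a level is kept only when its share is greater than prop. The sketch below illustrates this on hypothetical data where one level sits exactly on the threshold.

```r
library(forcats)

# Hypothetical data: shares are a = 50%, b = 25%, c = 15%, d = 10%
x <- factor(rep(c("a", "b", "c", "d"), times = c(10, 5, 3, 2)))

# b sits exactly at the threshold, so it is lumped along with c and d
strict <- fct_lump_prop(x, prop = 0.25)

# A slightly lower threshold keeps b
looser <- fct_lump_prop(x, prop = 0.2)

print(levels(strict))
print(levels(looser))
```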
Handling Missing Levels and NA Values
# Create data with unused levels and NA values
rating_data <- tibble(
  product_id = 1:20,
  rating = sample(c("1 Star", "2 Stars", "3 Stars", "4 Stars", "5 Stars", NA),
                  20, replace = TRUE)
)

# Create factor with extra levels
rating_data$rating <- factor(rating_data$rating,
  levels = c("1 Star", "2 Stars", "3 Stars", "4 Stars", "5 Stars", "6 Stars"))
cat("Original factor with unused level:\n")
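# Handle NA values explicitly
# (fct_explicit_na() is deprecated as of forcats 1.0.0; fct_na_value_to_level() replaces it)
rating_data$rating_with_na <- fct_na_value_to_level(rating_data$rating_clean, level = "No Rating")
cat("\nAfter making NA explicit:\n")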
After making NA explicit:
print(summary(rating_data$rating_with_na))
1 Star 2 Stars 3 Stars 4 Stars 5 Stars No Rating
3 2 5 5 4 1
# Demonstrate NA handling workflow
# First, convert NA values to explicit level
rating_data$rating_with_unknown <- fct_na_value_to_level(rating_data$rating_clean, level = "Unknown")
cat("\nAfter converting NA to 'Unknown' level:\n")
# Convert explicit NA level back to actual NA values
# Note: fct_na_level_to_value() converts NA levels; other levels that stand
# for missing data, such as "(Missing)", are named in extra_levels
rating_data$rating_with_missing <- fct_na_value_to_level(rating_data$rating_clean, level = "(Missing)")
rating_data$rating_back_to_na <- fct_na_level_to_value(rating_data$rating_with_missing, extra_levels = "(Missing)")
cat("\nAfter converting '(Missing)' level back to NA:\n")
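The round trip between NA values and an explicit level can be checked on a tiny vector. This is a sketch on hypothetical ratings; the level name "(Missing)" is chosen to match the comment above and is passed explicitly rather than assumed as a default.

```r
library(forcats)

# Hypothetical ratings with two missing values
rating <- factor(c("1 Star", NA, "5 Stars", "5 Stars", NA))

# NA values become an explicit "(Missing)" level
with_level <- fct_na_value_to_level(rating, level = "(Missing)")

# The explicit level becomes NA values again
back_to_na <- fct_na_level_to_value(with_level, extra_levels = "(Missing)")

print(fct_count(with_level))
print(sum(is.na(back_to_na)))
```

Making the missing category explicit is useful for tables and plots, while converting back is useful before modelling functions that expect true NA values.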
# Collapse education levels for analysis
survey_responses$education_simple <- fct_collapse(survey_responses$education,
  "High School or Less" = c("High School"),
  "Some College" = c("Some College"),
  "College Graduate" = c("Bachelor's"),
  "Advanced Degree" = c("Master's", "PhD")
)
cat("\nSatisfaction by Education Level:\n")
# A tibble: 20 × 4
# Groups: education_simple [4]
education_simple satisfaction n prop
<ord> <ord> <int> <dbl>
1 High School or Less Very Dissatisfied 3 0.045
2 High School or Less Dissatisfied 6 0.09
3 High School or Less Neutral 17 0.254
4 High School or Less Satisfied 23 0.343
5 High School or Less Very Satisfied 18 0.269
6 Some College Very Dissatisfied 1 0.026
7 Some College Dissatisfied 6 0.154
8 Some College Neutral 7 0.179
9 Some College Satisfied 19 0.487
10 Some College Very Satisfied 6 0.154
11 College Graduate Very Dissatisfied 3 0.053
12 College Graduate Dissatisfied 10 0.175
13 College Graduate Neutral 16 0.281
14 College Graduate Satisfied 20 0.351
15 College Graduate Very Satisfied 8 0.14
16 Advanced Degree Very Dissatisfied 1 0.027
17 Advanced Degree Dissatisfied 5 0.135
18 Advanced Degree Neutral 7 0.189
19 Advanced Degree Satisfied 13 0.351
20 Advanced Degree Very Satisfied 11 0.297
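A table of this shape (counts plus within-group proportions) can be produced with count() followed by a grouped mutate(). The sketch below is one way to do it on a small hypothetical data frame; it is not the code that generated the output above, and the data and object names are made up for illustration.

```r
library(dplyr)
library(forcats)

# Hypothetical responses; the tutorial's survey_responses is not reproduced here
responses <- tibble(
  education = factor(c("High School", "Bachelor's", "Master's", "PhD", "PhD", "Master's")),
  satisfaction = factor(c("Satisfied", "Neutral", "Satisfied", "Satisfied", "Neutral", "Satisfied"))
)

summary_tbl <- responses %>%
  # Combine the two postgraduate levels into one
  mutate(education_simple = fct_collapse(education,
    "Advanced Degree" = c("Master's", "PhD")
  )) %>%
  # Count each education/satisfaction combination
  count(education_simple, satisfaction) %>%
  # Proportions are computed within each education group
  group_by(education_simple) %>%
  mutate(prop = round(n / sum(n), 3)) %>%
  ungroup()

print(summary_tbl)
```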
# A tibble: 5 × 2
f n
<fct> <int>
1 Electronics 45
2 Home 38
3 Clothing 37
4 Books 20
5 Sports 10
# Reorder categories by average price
products_detailed$main_category_by_price <- fct_reorder(
  products_detailed$main_category, products_detailed$price,
  .fun = mean, .desc = TRUE
)
cat("\nCategories ordered by average price:\n")
# A tibble: 3 × 2
f n
<ord> <int>
1 Low Performer 40
2 Medium Performer 72
3 High Performer 38
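An ordered three-tier factor like the one counted above can be built from a numeric column with base R's cut(). The sales figures and cut-offs in this sketch are hypothetical and chosen only for illustration; the tutorial's own thresholds are not shown in this section.

```r
# Hypothetical sales figures and cut-offs
sales <- c(45, 120, 310, 95, 180, 260)

# ordered_result = TRUE returns an ordered factor (Low < Medium < High)
tier <- cut(
  sales,
  breaks = c(-Inf, 100, 200, Inf),
  labels = c("Low Performer", "Medium Performer", "High Performer"),
  ordered_result = TRUE
)

print(table(tier))
```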
Example 3: Geographic Data Processing
# Create customer location data
customer_locations <- tibble(
  customer_id = 1:300,
  country = sample(
    c("United States", "Canada", "United Kingdom", "Germany", "France",
      "Australia", "Japan", "Brazil", "Mexico", "India", "China", "Other"),
    300, replace = TRUE,
    prob = c(0.4, 0.15, 0.1, 0.08, 0.06, 0.05, 0.04, 0.03, 0.03, 0.02, 0.02, 0.02)
  ),
  region = case_when(
    country %in% c("United States", "Canada", "Mexico") ~ "North America",
    country %in% c("United Kingdom", "Germany", "France") ~ "Europe",
    country %in% c("Australia", "Japan") ~ "Asia-Pacific",
    country %in% c("Brazil") ~ "South America",
    country %in% c("India", "China") ~ "Asia",
    TRUE ~ "Other"
  ),
  order_value = round(runif(300, 20, 500), 2)
)

# Convert to factors (levels default to alphabetical order)
customer_locations$country <- factor(customer_locations$country)
customer_locations$region <- factor(customer_locations$region)

# Reorder countries by total order value
country_summary <- customer_locations %>%
  group_by(country) %>%
  summarise(
    total_orders = n(),
    total_value = sum(order_value),
    avg_value = mean(order_value),
    .groups = "drop"
  ) %>%
  mutate(country = fct_reorder(country, total_value, .desc = TRUE))
cat("Countries by total order value:\n")
Countries by total order value:
print(country_summary)
# A tibble: 12 × 4
country total_orders total_value avg_value
<fct> <int> <dbl> <dbl>
1 Australia 15 4144. 276.
2 Brazil 8 1935. 242.
3 Canada 45 10704. 238.
4 China 6 1697. 283.
5 France 17 3581. 211.
6 Germany 20 6013. 301.
7 India 8 1968. 246.
8 Japan 14 3874. 277.
9 Mexico 10 3400. 340.
10 Other 7 1722. 246.
11 United Kingdom 35 9156. 262.
12 United States 115 29209. 254.
# Lump smaller countries, ranking by total order value
# (w weights each row by its order_value instead of counting rows)
customer_locations$country_lumped <- fct_lump_n(customer_locations$country, n = 6,
                                                w = customer_locations$order_value)
cat("\nAfter lumping smaller countries:\n")
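The w argument changes what "largest" means: levels are ranked by summed weight rather than by row count. The sketch below contrasts the two on hypothetical data where the most frequent countries are not the most valuable ones.

```r
library(forcats)

# Hypothetical data: A and B have the most orders, C and D the most value
country <- factor(c("A", "A", "A", "B", "B", "C", "D"))
order_value <- c(1, 1, 1, 1, 1, 50, 40)

# Ranked by number of rows: A and B are kept
by_count <- fct_lump_n(country, n = 2)

# Ranked by total order_value: C and D are kept
by_value <- fct_lump_n(country, n = 2, w = order_value)

print(levels(by_count))
print(levels(by_value))
```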
# Best Practice 1: Always specify levels explicitly for ordered data
# Good
satisfaction_good <- factor(c("Poor", "Good", "Excellent"),
                            levels = c("Poor", "Fair", "Good", "Excellent"))

# Avoid: Letting R determine order alphabetically
satisfaction_bad <- factor(c("Poor", "Good", "Excellent"))
cat("Good approach - explicit levels:\n")
Good approach - explicit levels:
print(levels(satisfaction_good))
[1] "Poor" "Fair" "Good" "Excellent"
cat("\nBad approach - alphabetical:\n")
Bad approach - alphabetical:
print(levels(satisfaction_bad))
[1] "Excellent" "Good" "Poor"
# Best Practice 2: Use ordered factors for ordinal data
education_ordered <- factor(c("High School", "Bachelor's", "Master's"),
                            levels = c("High School", "Bachelor's", "Master's", "PhD"),
                            ordered = TRUE)

# Best Practice 3: Handle missing data explicitly
survey_with_na <- c("Yes", "No", "Yes", NA, "No")
survey_factor <- factor(survey_with_na)
# fct_na_value_to_level() replaces the deprecated fct_explicit_na()
survey_explicit <- fct_na_value_to_level(survey_factor, level = "No Response")
cat("\nHandling missing data:\n")
Handling missing data:
print(summary(survey_factor))
No Yes NA's
2 2 1
print(summary(survey_explicit))
No Yes No Response
2 2 1
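Best Practice 2 matters because ordered factors support comparisons that plain factors do not. A minimal base R sketch, redefining the same education factor so it runs on its own:

```r
# Same ordered factor as in Best Practice 2
education_ordered <- factor(
  c("High School", "Bachelor's", "Master's"),
  levels = c("High School", "Bachelor's", "Master's", "PhD"),
  ordered = TRUE
)

# Comparison operators follow the level order
below_masters <- education_ordered < "Master's"

# max() and min() also respect the ordering
highest <- max(education_ordered)

print(below_masters)
print(as.character(highest))
```

With an unordered factor, the same comparison returns NA with a warning, so the ordered flag is what makes filters such as "at least a Bachelor's" possible.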
Common Mistakes
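# Mistake 1: Converting factors to numeric incorrectly
rating_factor <- factor(c("1", "2", "3", "4", "5"))

# Wrong way (gets the underlying codes, not the values)
# Here the codes happen to match the labels, so the bug is hidden;
# with labels such as "10", "20", "30" the result would be 1 2 3
wrong_numeric <- as.numeric(rating_factor)
cat("Wrong conversion to numeric:\n")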
Wrong conversion to numeric:
print(wrong_numeric)
[1] 1 2 3 4 5
# Right way
right_numeric <- as.numeric(as.character(rating_factor))
cat("Correct conversion to numeric:\n")
Correct conversion to numeric:
print(right_numeric)
[1] 1 2 3 4 5
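In the example above both conversions print the same numbers, because the labels "1" to "5" coincide with their integer codes. The sketch below uses hypothetical labels where the two results differ, which is the usual situation.

```r
# Hypothetical labels whose text order differs from their numeric order
rating_factor <- factor(c("10", "5", "20"))

# Levels are sorted as text: "10", "20", "5"
print(levels(rating_factor))

# Wrong: returns the positions of each value within the levels
codes <- as.numeric(rating_factor)

# Right: go through the character labels
values <- as.numeric(as.character(rating_factor))

# Equivalent alternative: convert the levels once, then index by the factor
values_alt <- as.numeric(levels(rating_factor))[rating_factor]

print(codes)
print(values)
```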
# Mistake 2: Not dropping unused levels
original_data <- factor(c("A", "B", "C", "A", "B"))
subset_data <- original_data[1:2]  # Only A and B
cat("\nSubset still has unused level C:\n")
Subset still has unused level C:
print(levels(subset_data))
[1] "A" "B" "C"
print(table(subset_data))
subset_data
A B C
1 1 0
# Fix by dropping unused levels
subset_clean <- fct_drop(subset_data)
cat("\nAfter dropping unused levels:\n")
[1] Category A category_a CATEGORY A Category A
Levels: Category A CATEGORY A category_a
# Better: Clean first, then convert to factor
clean_categories <- str_to_title(str_replace_all(messy_categories, "[_\\s]+", " "))
factor(clean_categories)
[1] Category A Category A Category A Category A
Levels: Category A
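Cleaning is easier to trust when unexpected values are listed before the conversion, because factor() silently turns anything outside its levels into NA. The following is a sketch with hypothetical raw responses and an assumed set of expected categories.

```r
library(stringr)

# Assumed set of valid categories and some hypothetical raw input
expected <- c("Poor", "Fair", "Good", "Excellent")
raw <- c("Good", "good", "Excellent", "Fair ", "Poor")

# Converting directly hides the problem: unmatched values become NA
naive <- factor(raw, levels = expected)

# Listing values outside the expected set makes the problem visible
unexpected <- setdiff(unique(raw), expected)

# Light cleaning (trim whitespace, title case), then check again
cleaned <- str_to_title(str_trim(raw))
still_unexpected <- setdiff(unique(cleaned), expected)

clean_factor <- factor(cleaned, levels = expected)

print(unexpected)
print(sum(is.na(naive)))
print(sum(is.na(clean_factor)))
```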
Exercises
Exercise 1: Survey Data Processing
Given survey responses with inconsistent category names:
1. Clean and standardize the response categories
2. Create proper ordered factors for Likert scales
3. Collapse detailed categories into broader groups
4. Handle missing responses appropriately
Exercise 2: Sales Data Analysis
You have sales data with product categories:
1. Reorder categories by sales performance
2. Lump low-performing categories into “Other”
3. Create high/medium/low performance tiers
4. Prepare the data for visualization
Exercise 3: Geographic Analysis
Working with customer location data:
1. Standardize country and region names
2. Group countries into major markets
3. Order regions by customer value
4. Create market size categories
Exercise 4: Factor Validation Pipeline
Create a data validation system for categorical data:
1. Detect and fix inconsistent category names
2. Identify and handle unexpected categories
3. Ensure proper factor ordering
4. Generate validation reports
Summary
The forcats package supports factor manipulation across common tasks:
Survey analysis: Proper ordering of response scales
Data visualization: Reordering for better plots
Reporting: Grouping categories for summaries
Machine learning: Preparing categorical variables
Remember:
Factors keep all their levels, in order, even when subsetting (use fct_drop() to remove unused ones)
Proper factor handling improves visualization and analysis
forcats integrates seamlessly with dplyr and ggplot2
Always validate factor levels in real-world data
Factors are essential for working with categorical data effectively. With forcats, you can handle even complex categorical data scenarios with confidence!