Managing Factors with forcats

Author: IND215

Published: September 22, 2025

Introduction to Factors

Factors are R’s way of representing categorical data: variables that can take only a limited, known set of values. Examples include survey responses (Poor/Fair/Good/Excellent), education levels (High School/Bachelor’s/Master’s/PhD), or product categories. The forcats package, part of the tidyverse, provides a consistent set of helpers for inspecting, reordering, and relabelling factor levels.

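library(tidyverse)  # attaches the core tidyverse packages, including forcats
library(forcats)    # redundant after library(tidyverse); kept to make the dependency explicit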

# All forcats functions start with fct_ for easy discovery
cat("Key forcats functions we'll explore:\n")

Key forcats functions we'll explore:

cat("- fct_count(): count factor levels\n")

- fct_count(): count factor levels

cat("- fct_reorder(): reorder by another variable\n")

- fct_reorder(): reorder by another variable

cat("- fct_relevel(): manually reorder levels\n")

- fct_relevel(): manually reorder levels

cat("- fct_recode(): change level names\n")

- fct_recode(): change level names

cat("- fct_collapse(): combine levels\n")

- fct_collapse(): combine levels

cat("- fct_lump_n(), fct_lump_min(), fct_lump_prop(): combine rare levels\n")

- fct_lump_n(), fct_lump_min(), fct_lump_prop(): combine rare levels

cat("- fct_drop(): remove unused levels\n")

- fct_drop(): remove unused levels

Understanding Factors vs Characters

Creating and Examining Factors

# Character vector vs factor
satisfaction_char <- c("Good", "Excellent", "Poor", "Good", "Fair", "Excellent")
satisfaction_factor <- factor(satisfaction_char,
                             levels = c("Poor", "Fair", "Good", "Excellent"))

# Compare the two
cat("Character vector:\n")

Character vector:

print(satisfaction_char)

[1] "Good"      "Excellent" "Poor"      "Good"      "Fair"      "Excellent"

print(table(satisfaction_char))

satisfaction_char
Excellent      Fair      Good      Poor 
        2         1         2         1

cat("\nFactor with ordered levels:\n")


Factor with ordered levels:

print(satisfaction_factor)

[1] Good      Excellent Poor      Good      Fair      Excellent
Levels: Poor Fair Good Excellent

print(table(satisfaction_factor))

satisfaction_factor
     Poor      Fair      Good Excellent 
        1         1         2         2

# Factors keep every declared level, in order, even when a subset has no observations for it
satisfaction_subset <- satisfaction_factor[1:3]  # Only has "Good", "Excellent", "Poor"
print(table(satisfaction_subset))  # Still shows all 4 levels

satisfaction_subset
     Poor      Fair      Good Excellent 
        1         0         1         1

Factor Properties

# Examine factor structure
education <- factor(c("High School", "Bachelor's", "Master's", "Bachelor's", "PhD"),
                   levels = c("High School", "Bachelor's", "Master's", "PhD"))

# Key factor properties
levels(education)        # The possible values

[1] "High School" "Bachelor's"  "Master's"    "PhD"

nlevels(education)       # Number of levels

[1] 4

is.factor(education)     # Check if it's a factor

[1] TRUE

as.character(education)  # Convert back to character

[1] "High School" "Bachelor's"  "Master's"    "Bachelor's"  "PhD"

as.numeric(education)    # See underlying numeric codes

[1] 1 2 3 2 4

# Factor summary
summary(education)

High School  Bachelor's    Master's         PhD 
          1           2           1           1

fct_count(education)     # forcats way to count

# A tibble: 4 × 2
  f               n
  <fct>       <int>
1 High School     1
2 Bachelor's      2
3 Master's        1
4 PhD             1
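
The relationship between the numeric codes and the level labels can be made concrete with a minimal sketch. The vector below is hypothetical (it is not the education data above); the point is only that indexing the level table with the integer codes gives back the original values.

```r
# Hypothetical example vector; any factor behaves the same way.
sizes <- factor(c("M", "S", "L", "M"), levels = c("S", "M", "L"))

codes <- as.integer(sizes)   # position of each value in levels(sizes)
labels <- levels(sizes)      # lookup table of level names

# Indexing the lookup table with the codes reproduces the original values
rebuilt <- labels[codes]
identical(rebuilt, as.character(sizes))
```

This is also why converting a factor straight to a number returns positions in the level table rather than the labels themselves.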

Basic Factor Operations with forcats

Counting and Inspecting Factors

# Sample survey data (no seed is set here, so your counts will differ from those shown)
survey_data <- tibble(
  id = 1:20,
  satisfaction = sample(c("Very Dissatisfied", "Dissatisfied", "Neutral",
                         "Satisfied", "Very Satisfied"), 20, replace = TRUE),
  department = sample(c("Sales", "Marketing", "IT", "HR", "Finance"), 20, replace = TRUE),
  experience = sample(c("0-2 years", "3-5 years", "6-10 years", "10+ years"), 20, replace = TRUE)
)

# Convert to factors with proper ordering
survey_data <- survey_data %>%
  mutate(
    satisfaction = factor(satisfaction,
                         levels = c("Very Dissatisfied", "Dissatisfied", "Neutral",
                                   "Satisfied", "Very Satisfied"),
                         ordered = TRUE),
    department = factor(department),
    experience = factor(experience,
                       levels = c("0-2 years", "3-5 years", "6-10 years", "10+ years"),
                       ordered = TRUE)
  )

# Count factor levels
fct_count(survey_data$satisfaction)

# A tibble: 5 × 2
  f                     n
  <ord>             <int>
1 Very Dissatisfied     3
2 Dissatisfied          5
3 Neutral               4
4 Satisfied             3
5 Very Satisfied        5

fct_count(survey_data$department)

# A tibble: 5 × 2
  f             n
  <fct>     <int>
1 Finance       2
2 HR            5
3 IT            6
4 Marketing     2
5 Sales         5

fct_count(survey_data$experience)

# A tibble: 4 × 2
  f              n
  <ord>      <int>
1 0-2 years      5
2 3-5 years      7
3 6-10 years     3
4 10+ years      5

# Count with sorting
fct_count(survey_data$department, sort = TRUE)

# A tibble: 5 × 2
  f             n
  <fct>     <int>
1 IT            6
2 HR            5
3 Sales         5
4 Finance       2
5 Marketing     2
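
Counts are often easier to read as shares. As a small sketch on a hypothetical, fixed department vector (so the result does not depend on random sampling), fct_count() can add a proportion column through its prop argument:

```r
library(forcats)

# Hypothetical department vector, fixed so the result is reproducible
dept <- factor(c("IT", "IT", "IT", "HR", "HR", "Sales"))

# prop = TRUE adds a column p holding each level's share of the total
dept_counts <- fct_count(dept, sort = TRUE, prop = TRUE)
dept_counts
```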

Reordering Factor Levels

# Create sample sales data by region
sales_data <- tibble(
  region = c("North", "South", "East", "West", "North", "South", "East", "West"),
  quarter = rep(c("Q1", "Q2"), 4),
  sales = c(120, 95, 110, 88, 135, 102, 118, 92)
)

# Convert region to factor
sales_data$region <- factor(sales_data$region)

# Default factor order (alphabetical)
levels(sales_data$region)

[1] "East"  "North" "South" "West"

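# Reorder levels by total sales (descending).
# fct_reorder() changes the level order, not the row order, so the printed
# tibble below still lists regions in their original (alphabetical) order.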
sales_by_region <- sales_data %>%
  group_by(region) %>%
  summarise(total_sales = sum(sales), .groups = "drop") %>%
  mutate(region = fct_reorder(region, total_sales, .desc = TRUE))

print(sales_by_region)

# A tibble: 4 × 2
  region total_sales
  <fct>        <dbl>
1 East           228
2 North          255
3 South          197
4 West           180

levels(sales_by_region$region)

[1] "North" "East"  "South" "West"

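# Reorder levels by mean sales per region: fct_reorder() summarises sales for
# each level with .fun (the default is median). Row order is unchanged.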
sales_data_ordered <- sales_data %>%
  mutate(region = fct_reorder(region, sales, .fun = mean, .desc = TRUE))

print(sales_data_ordered)

# A tibble: 8 × 3
  region quarter sales
  <fct>  <chr>   <dbl>
1 North  Q1        120
2 South  Q2         95
3 East   Q1        110
4 West   Q2         88
5 North  Q1        135
6 South  Q2        102
7 East   Q1        118
8 West   Q2         92

# Manual reordering
sales_data$region <- fct_relevel(sales_data$region, "West", "North", "East", "South")
levels(sales_data$region)

[1] "West"  "North" "East"  "South"
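
Two further reordering helpers are worth knowing alongside fct_reorder() and fct_relevel(): fct_infreq() orders levels by how often they occur, and fct_rev() reverses the current order. The sketch below uses a hypothetical region vector rather than sales_data.

```r
library(forcats)

# Hypothetical vector: North appears 3 times, East twice, the rest once
region <- factor(c("North", "South", "East", "West", "North", "North", "East"))

levels(region)               # default alphabetical order
levels(fct_infreq(region))   # most frequent level first
levels(fct_rev(region))      # current order reversed
```

Reversing is mostly useful for plots with flipped axes, where the first level would otherwise end up at the bottom.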

Recoding Factor Levels

# Sample product categories
products <- tibble(
  item = 1:15,
  category = sample(c("Electronics", "Clothing", "Home", "Sports", "Books"), 15, replace = TRUE),
  subcategory = sample(c("Phone", "Laptop", "Shirt", "Pants", "Furniture",
                        "Basketball", "Soccer", "Fiction", "Non-fiction"), 15, replace = TRUE)
)

# Convert to factor
products$category <- factor(products$category)

# Recode factor levels
products$category_recoded <- fct_recode(products$category,
  "Tech" = "Electronics",
  "Apparel" = "Clothing",
  "Home & Garden" = "Home",
  "Sporting Goods" = "Sports"
  # "Books" stays the same (not mentioned)
)

# Compare original and recoded
comparison <- products %>%
  count(category, category_recoded)
print(comparison)

# A tibble: 5 × 3
  category    category_recoded     n
  <fct>       <fct>            <int>
1 Books       Books                5
2 Clothing    Apparel              3
3 Electronics Tech                 2
4 Home        Home & Garden        2
5 Sports      Sporting Goods       3

# Recode with error checking (forcats will warn about typos)
# This would give a warning:
# products$category_bad <- fct_recode(products$category, "Tech" = "Electronic")  # Note the typo
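
To see the typo check in action without cluttering the console, the warning can be captured. This is a minimal sketch on a hypothetical three-level factor; the exact wording of the message may differ between forcats versions, so only its presence is inspected here.

```r
library(forcats)

category <- factor(c("Electronics", "Clothing", "Books"))

# "Electronic" (no trailing s) is not an existing level, so nothing is renamed;
# capture the warning message instead of letting it print.
recode_msg <- tryCatch(
  {
    fct_recode(category, "Tech" = "Electronic")
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
recode_msg
```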

Advanced Factor Manipulation

Collapsing Factor Levels

# Create detailed job title data
job_data <- tibble(
  employee_id = 1:25,
  job_title = sample(c("Software Engineer I", "Software Engineer II", "Senior Software Engineer",
                      "Marketing Specialist", "Marketing Manager", "Senior Marketing Manager",
                      "Sales Rep", "Senior Sales Rep", "Sales Manager",
                      "HR Specialist", "HR Manager", "Data Analyst", "Senior Data Analyst"),
                    25, replace = TRUE),
  salary = runif(25, 50000, 120000)
)

job_data$job_title <- factor(job_data$job_title)

# Original levels
cat("Original job titles:\n")

Original job titles:

fct_count(job_data$job_title, sort = TRUE)

# A tibble: 10 × 2
   f                            n
   <fct>                    <int>
 1 Senior Software Engineer     5
 2 Data Analyst                 3
 3 HR Specialist                3
 4 Sales Manager                3
 5 Software Engineer II         3
 6 HR Manager                   2
 7 Senior Marketing Manager     2
 8 Software Engineer I          2
 9 Marketing Specialist         1
10 Senior Sales Rep             1

# Collapse into broader categories
job_data$department <- fct_collapse(job_data$job_title,
  "Engineering" = c("Software Engineer I", "Software Engineer II", "Senior Software Engineer"),
  "Marketing" = c("Marketing Specialist", "Marketing Manager", "Senior Marketing Manager"),
  "Sales" = c("Sales Rep", "Senior Sales Rep", "Sales Manager"),
  "HR" = c("HR Specialist", "HR Manager"),
  "Analytics" = c("Data Analyst", "Senior Data Analyst")
)

cat("\nCollapsed into departments:\n")


Collapsed into departments:

fct_count(job_data$department, sort = TRUE)

# A tibble: 5 × 2
  f               n
  <fct>       <int>
1 Engineering    10
2 HR              5
3 Sales           4
4 Analytics       3
5 Marketing       3

# Collapse the same titles along a different dimension: seniority
job_data$seniority <- fct_collapse(job_data$job_title,
  "Senior" = c("Senior Software Engineer", "Senior Marketing Manager",
               "Senior Sales Rep", "Senior Data Analyst"),
  "Manager" = c("Marketing Manager", "Sales Manager", "HR Manager"),
  "Individual Contributor" = c("Software Engineer I", "Software Engineer II",
                              "Marketing Specialist", "Sales Rep", "HR Specialist", "Data Analyst")
)

cat("\nCollapsed by seniority:\n")


Collapsed by seniority:

fct_count(job_data$seniority, sort = TRUE)

# A tibble: 3 × 2
  f                          n
  <fct>                  <int>
1 Individual Contributor    12
2 Senior                     8
3 Manager                    5
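
When only a few levels matter and everything else can share one bucket, listing every level for fct_collapse() is more work than needed. A possible shortcut is fct_other(), sketched here on a hypothetical vector of job titles; the label "Everyone else" is an arbitrary choice for illustration.

```r
library(forcats)

# Hypothetical job titles
titles <- factor(c("Data Analyst", "HR Manager", "Sales Rep",
                   "Data Analyst", "Sales Manager"))

# Keep one level of interest and fold every other level into a single bucket
analyst_vs_rest <- fct_other(titles, keep = "Data Analyst",
                             other_level = "Everyone else")
fct_count(analyst_vs_rest)
```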

Lumping Rare Categories

# Create data with many categories, some rare
customer_data <- tibble(
  customer_id = 1:100,
  state = sample(c("CA", "TX", "NY", "FL", "IL", "PA", "OH", "MI", "GA", "NC",
                  "NJ", "VA", "WA", "AZ", "MA", "TN", "IN", "MO", "MD", "WI",
                  "MN", "CO", "AL", "SC", "LA", "OR", "OK", "CT", "AR", "MS"),
                100, replace = TRUE,
                prob = c(rep(0.08, 5), rep(0.04, 5), rep(0.02, 10), rep(0.01, 10))),
  purchase_amount = runif(100, 10, 500)
)

customer_data$state <- factor(customer_data$state)

# Original state distribution
cat("Original state distribution:\n")

Original state distribution:

state_counts <- fct_count(customer_data$state, sort = TRUE)
print(state_counts)

# A tibble: 25 × 2
   f         n
   <fct> <int>
 1 CA       10
 2 FL       10
 3 IL        8
 4 NY        8
 5 NC        7
 6 TX        6
 7 GA        5
 8 AL        4
 9 MA        4
10 OH        4
# ℹ 15 more rows

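# Lump least frequent states into "Other".
# Ties at the cutoff are all kept (ties.method = "min" by default), so more
# than 10 states can remain: here five states tie at 4 customers.
customer_data$state_lumped <- fct_lump_n(customer_data$state, n = 10)  # Keep top 10 (plus ties)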

cat("\nAfter lumping (keep top 10):\n")


After lumping (keep top 10):

fct_count(customer_data$state_lumped, sort = TRUE)

# A tibble: 13 × 2
   f         n
   <fct> <int>
 1 Other    26
 2 CA       10
 3 FL       10
 4 IL        8
 5 NY        8
 6 NC        7
 7 TX        6
 8 GA        5
 9 AL        4
10 MA        4
11 OH        4
12 PA        4
13 VA        4

# Lump states with fewer than 3 customers
customer_data$state_min <- fct_lump_min(customer_data$state, min = 3)

cat("\nAfter lumping (minimum 3 customers):\n")


After lumping (minimum 3 customers):

fct_count(customer_data$state_min, sort = TRUE)

# A tibble: 17 × 2
   f         n
   <fct> <int>
 1 Other    14
 2 CA       10
 3 FL       10
 4 IL        8
 5 NY        8
 6 NC        7
 7 TX        6
 8 GA        5
 9 AL        4
10 MA        4
11 OH        4
12 PA        4
13 VA        4
14 AZ        3
15 CT        3
16 MI        3
17 OK        3

# Lump by proportion: fct_lump_prop() keeps only states with MORE than 5% of
# customers, so GA at exactly 5% (5 of 100) is lumped into "Other"
customer_data$state_prop <- fct_lump_prop(customer_data$state, prop = 0.05)

cat("\nAfter lumping (more than 5% proportion):\n")


After lumping (more than 5% proportion):

fct_count(customer_data$state_prop, sort = TRUE)

# A tibble: 7 × 2
  f         n
  <fct> <int>
1 Other    51
2 CA       10
3 FL       10
4 IL        8
5 NY        8
6 NC        7
7 TX        6
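
Two details in the output above are easy to miss: ties at the cutoff are kept by fct_lump_n(), and a level sitting exactly on the threshold is lumped by fct_lump_prop(). The sketch below reproduces both on small hypothetical vectors with fixed counts, so the behaviour does not depend on random sampling.

```r
library(forcats)

# Ties: b and c both have 3 observations, so asking for the top 2 keeps 3 levels
tied <- factor(rep(c("a", "b", "c", "d", "e"), times = c(5, 3, 3, 1, 1)))
levels(fct_lump_n(tied, n = 2))

# ties.method = "first" breaks the tie by position and keeps exactly 2 levels
levels(fct_lump_n(tied, n = 2, ties.method = "first"))

# Threshold: c makes up exactly 25% of the data and is still lumped,
# because only levels above prop are kept
shares <- factor(rep(c("a", "b", "c", "d"), times = c(6, 5, 4, 1)))
levels(fct_lump_prop(shares, prop = 0.25))
```

If an exact number of levels matters (for example, a fixed colour palette), setting ties.method explicitly is the safer choice.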

Handling Missing Levels and NA Values

# Create data with unused levels and NA values
rating_data <- tibble(
  product_id = 1:20,
  rating = sample(c("1 Star", "2 Stars", "3 Stars", "4 Stars", "5 Stars", NA),
                 20, replace = TRUE)
)

# Create factor with extra levels
rating_data$rating <- factor(rating_data$rating,
                           levels = c("1 Star", "2 Stars", "3 Stars", "4 Stars", "5 Stars", "6 Stars"))

cat("Original factor with unused level:\n")

Original factor with unused level:

print(levels(rating_data$rating))

[1] "1 Star"  "2 Stars" "3 Stars" "4 Stars" "5 Stars" "6 Stars"

print(summary(rating_data$rating))

 1 Star 2 Stars 3 Stars 4 Stars 5 Stars 6 Stars    NA's 
      3       2       5       5       4       0       1

# Drop unused levels
rating_data$rating_clean <- fct_drop(rating_data$rating)

cat("\nAfter dropping unused levels:\n")


After dropping unused levels:

print(levels(rating_data$rating_clean))

[1] "1 Star"  "2 Stars" "3 Stars" "4 Stars" "5 Stars"

print(summary(rating_data$rating_clean))

 1 Star 2 Stars 3 Stars 4 Stars 5 Stars    NA's 
      3       2       5       5       4       1

# Handle NA values explicitly
# (fct_explicit_na() is deprecated as of forcats 1.0.0; use fct_na_value_to_level())
rating_data$rating_with_na <- fct_na_value_to_level(rating_data$rating_clean, level = "No Rating")

cat("\nAfter making NA explicit:\n")


After making NA explicit:

print(summary(rating_data$rating_with_na))

   1 Star   2 Stars   3 Stars   4 Stars   5 Stars No Rating 
        3         2         5         5         4         1

# Demonstrate NA handling workflow
# First, convert NA values to explicit level
rating_data$rating_with_unknown <- fct_na_value_to_level(rating_data$rating_clean, level = "Unknown")

cat("\nAfter converting NA to 'Unknown' level:\n")


After converting NA to 'Unknown' level:

print(summary(rating_data$rating_with_unknown))

 1 Star 2 Stars 3 Stars 4 Stars 5 Stars Unknown 
      3       2       5       5       4       1

# Convert an explicit missing-value level back to actual NA values.
# fct_na_level_to_value() only converts levels that are themselves NA unless
# the level name is passed via extra_levels.
rating_data$rating_with_missing <- fct_na_value_to_level(rating_data$rating_clean, level = "(Missing)")
rating_data$rating_back_to_na <- fct_na_level_to_value(rating_data$rating_with_missing,
                                                       extra_levels = "(Missing)")

cat("\nAfter converting '(Missing)' level back to NA:\n")


After converting '(Missing)' level back to NA:

print(summary(rating_data$rating_back_to_na))

 1 Star 2 Stars 3 Stars 4 Stars 5 Stars    NA's 
      3       2       5       5       4       1
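
The practical difference between a real NA and an explicit missing-value level shows up in is.na(). The following sketch assumes forcats 1.0.0 or later and uses a hypothetical three-element vector; "(Missing)" is just the label chosen for illustration.

```r
library(forcats)

rating <- factor(c("1 Star", NA, "5 Stars"))

# Make the missing value an explicit level: is.na() no longer flags it
explicit <- fct_na_value_to_level(rating, level = "(Missing)")
is.na(explicit)

# Turn that level back into a real NA by naming it in extra_levels
restored <- fct_na_level_to_value(explicit, extra_levels = "(Missing)")
is.na(restored)
```

An explicit level is convenient for tables and plots, while a real NA is what most modelling and summary functions expect, so it helps to be able to move in both directions.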

Real-World Factor Applications

Example 1: Survey Data Analysis

# Create realistic survey data
set.seed(123)
survey_responses <- tibble(
  respondent_id = 1:200,
  age_group = sample(c("18-25", "26-35", "36-45", "46-55", "56-65", "65+"), 200, replace = TRUE),
  education = sample(c("High School", "Some College", "Bachelor's", "Master's", "PhD"),
                    200, replace = TRUE, prob = c(0.3, 0.2, 0.3, 0.15, 0.05)),
  income = sample(c("Under $30k", "$30k-$50k", "$50k-$75k", "$75k-$100k", "Over $100k"),
                 200, replace = TRUE),
  satisfaction = sample(c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"),
                       200, replace = TRUE, prob = c(0.05, 0.15, 0.2, 0.4, 0.2))
)

# Convert to properly ordered factors
survey_responses <- survey_responses %>%
  mutate(
    age_group = factor(age_group, levels = c("18-25", "26-35", "36-45", "46-55", "56-65", "65+")),
    education = factor(education, levels = c("High School", "Some College", "Bachelor's", "Master's", "PhD"),
                      ordered = TRUE),
    income = factor(income, levels = c("Under $30k", "$30k-$50k", "$50k-$75k", "$75k-$100k", "Over $100k"),
                   ordered = TRUE),
    satisfaction = factor(satisfaction, levels = c("Very Dissatisfied", "Dissatisfied", "Neutral",
                                                  "Satisfied", "Very Satisfied"), ordered = TRUE)
  )

# Analyze satisfaction by demographics
cat("Satisfaction by Age Group:\n")

Satisfaction by Age Group:

satisfaction_by_age <- survey_responses %>%
  count(age_group, satisfaction) %>%
  group_by(age_group) %>%
  mutate(prop = n / sum(n)) %>%
  filter(satisfaction %in% c("Satisfied", "Very Satisfied")) %>%
  summarise(satisfaction_rate = sum(prop), .groups = "drop") %>%
  mutate(age_group = fct_reorder(age_group, satisfaction_rate))

print(satisfaction_by_age)

# A tibble: 6 × 2
  age_group satisfaction_rate
  <fct>                 <dbl>
1 18-25                 0.697
2 26-35                 0.645
3 36-45                 0.719
4 46-55                 0.464
5 56-65                 0.441
6 65+                   0.571

# Collapse education levels for analysis
survey_responses$education_simple <- fct_collapse(survey_responses$education,
  "High School or Less" = c("High School"),
  "Some College" = c("Some College"),
  "College Graduate" = c("Bachelor's"),
  "Advanced Degree" = c("Master's", "PhD")
)

cat("\nSatisfaction by Education Level:\n")


Satisfaction by Education Level:

satisfaction_by_education <- survey_responses %>%
  count(education_simple, satisfaction) %>%
  group_by(education_simple) %>%
  mutate(prop = round(n / sum(n), 3)) %>%
  print()

# A tibble: 20 × 4
# Groups:   education_simple [4]
   education_simple    satisfaction          n  prop
   <ord>               <ord>             <int> <dbl>
 1 High School or Less Very Dissatisfied     3 0.045
 2 High School or Less Dissatisfied          6 0.09 
 3 High School or Less Neutral              17 0.254
 4 High School or Less Satisfied            23 0.343
 5 High School or Less Very Satisfied       18 0.269
 6 Some College        Very Dissatisfied     1 0.026
 7 Some College        Dissatisfied          6 0.154
 8 Some College        Neutral               7 0.179
 9 Some College        Satisfied            19 0.487
10 Some College        Very Satisfied        6 0.154
11 College Graduate    Very Dissatisfied     3 0.053
12 College Graduate    Dissatisfied         10 0.175
13 College Graduate    Neutral              16 0.281
14 College Graduate    Satisfied            20 0.351
15 College Graduate    Very Satisfied        8 0.14 
16 Advanced Degree     Very Dissatisfied     1 0.027
17 Advanced Degree     Dissatisfied          5 0.135
18 Advanced Degree     Neutral               7 0.189
19 Advanced Degree     Satisfied            13 0.351
20 Advanced Degree     Very Satisfied       11 0.297
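
Because satisfaction is an ordered factor, a "satisfied or better" filter does not have to list labels one by one. As a minimal sketch on a hypothetical set of four responses, comparison operators can express the same idea as a threshold:

```r
# Hypothetical responses; the level order defines what ">=" means
scale_levels <- c("Very Dissatisfied", "Dissatisfied", "Neutral",
                  "Satisfied", "Very Satisfied")
responses <- factor(c("Satisfied", "Neutral", "Very Satisfied", "Dissatisfied"),
                    levels = scale_levels, ordered = TRUE)

# Comparison operators work on ordered factors, so the top two categories
# can be selected with a threshold instead of a list of labels
is_satisfied <- responses >= "Satisfied"
mean(is_satisfied)   # share of satisfied respondents
```

This only works when the factor was created with ordered = TRUE; on an unordered factor the comparison returns NA with a warning.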

Example 2: Product Category Management

# Create e-commerce product data
products_detailed <- tibble(
  product_id = 1:150,
  category = sample(c("Electronics > Smartphones", "Electronics > Laptops", "Electronics > Tablets",
                     "Clothing > Men's Shirts", "Clothing > Women's Dresses", "Clothing > Children's",
                     "Home > Kitchen", "Home > Bedroom", "Home > Living Room",
                     "Books > Fiction", "Books > Non-Fiction", "Books > Textbooks",
                     "Sports > Basketball", "Sports > Soccer", "Sports > Tennis"),
                   150, replace = TRUE, prob = c(rep(0.1, 3), rep(0.08, 3), rep(0.07, 3),
                                                rep(0.05, 3), rep(0.03, 3))),
  price = round(runif(150, 10, 1000), 2),
  sales_volume = sample(1:100, 150, replace = TRUE)
)

products_detailed$category <- factor(products_detailed$category)

cat("Original detailed categories:\n")

Original detailed categories:

fct_count(products_detailed$category, sort = TRUE)

# A tibble: 15 × 2
   f                              n
   <fct>                      <int>
 1 Electronics > Tablets         20
 2 Clothing > Women's Dresses    15
 3 Home > Kitchen                15
 4 Electronics > Laptops         14
 5 Home > Bedroom                14
 6 Clothing > Men's Shirts       12
 7 Electronics > Smartphones     11
 8 Clothing > Children's         10
 9 Home > Living Room             9
10 Books > Non-Fiction            8
11 Books > Textbooks              8
12 Sports > Tennis                5
13 Books > Fiction                4
14 Sports > Basketball            3
15 Sports > Soccer                2

# Extract main categories
products_detailed$main_category <- products_detailed$category %>%
  str_extract("^[^>]+") %>%
  str_trim() %>%
  factor()

cat("\nMain categories:\n")


Main categories:

fct_count(products_detailed$main_category, sort = TRUE)

# A tibble: 5 × 2
  f               n
  <fct>       <int>
1 Electronics    45
2 Home           38
3 Clothing       37
4 Books          20
5 Sports         10

# Reorder categories by average price
products_detailed$main_category_by_price <- fct_reorder(products_detailed$main_category,
                                                       products_detailed$price,
                                                       .fun = mean, .desc = TRUE)

cat("\nCategories ordered by average price:\n")


Categories ordered by average price:

products_detailed %>%
  group_by(main_category_by_price) %>%
  summarise(avg_price = round(mean(price), 2), .groups = "drop") %>%
  print()

# A tibble: 5 × 2
  main_category_by_price avg_price
  <fct>                      <dbl>
1 Sports                      581.
2 Home                        537.
3 Clothing                    502.
4 Books                       489.
5 Electronics                 408.

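# Lump low-volume categories: w weights each row by sales_volume, so the top 8
# are chosen by total units sold; fct_count() below still reports row counts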
products_detailed$category_lumped <- fct_lump_n(products_detailed$category, n = 8,
                                               w = products_detailed$sales_volume)

cat("\nAfter lumping low-volume categories:\n")


After lumping low-volume categories:

fct_count(products_detailed$category_lumped, sort = TRUE)

# A tibble: 9 × 2
  f                              n
  <fct>                      <int>
1 Other                         39
2 Electronics > Tablets         20
3 Clothing > Women's Dresses    15
4 Home > Kitchen                15
5 Electronics > Laptops         14
6 Home > Bedroom                14
7 Clothing > Men's Shirts       12
8 Electronics > Smartphones     11
9 Clothing > Children's         10

# Create performance categories
products_detailed <- products_detailed %>%
  group_by(main_category) %>%
  mutate(
    revenue = price * sales_volume,
    category_performance = case_when(
      revenue > quantile(revenue, 0.75) ~ "High Performer",
      revenue > quantile(revenue, 0.25) ~ "Medium Performer",
      TRUE ~ "Low Performer"
    )
  ) %>%
  ungroup()

products_detailed$category_performance <- factor(products_detailed$category_performance,
                                                levels = c("Low Performer", "Medium Performer", "High Performer"),
                                                ordered = TRUE)

cat("\nPerformance distribution:\n")


Performance distribution:

fct_count(products_detailed$category_performance)

# A tibble: 3 × 2
  f                    n
  <ord>            <int>
1 Low Performer       40
2 Medium Performer    72
3 High Performer      38

Example 3: Geographic Data Processing

# Create customer location data
customer_locations <- tibble(
  customer_id = 1:300,
  country = sample(c("United States", "Canada", "United Kingdom", "Germany", "France",
                    "Australia", "Japan", "Brazil", "Mexico", "India", "China", "Other"),
                  300, replace = TRUE,
                  prob = c(0.4, 0.15, 0.1, 0.08, 0.06, 0.05, 0.04, 0.03, 0.03, 0.02, 0.02, 0.02)),
  region = case_when(
    country %in% c("United States", "Canada", "Mexico") ~ "North America",
    country %in% c("United Kingdom", "Germany", "France") ~ "Europe",
    country %in% c("Australia", "Japan") ~ "Asia-Pacific",
    country %in% c("Brazil") ~ "South America",
    country %in% c("India", "China") ~ "Asia",
    TRUE ~ "Other"
  ),
  order_value = round(runif(300, 20, 500), 2)
)

# Convert to factors (levels default to alphabetical order)
customer_locations$country <- factor(customer_locations$country)
customer_locations$region <- factor(customer_locations$region)

# Reorder countries by total order value
country_summary <- customer_locations %>%
  group_by(country) %>%
  summarise(
    total_orders = n(),
    total_value = sum(order_value),
    avg_value = mean(order_value),
    .groups = "drop"
  ) %>%
  mutate(country = fct_reorder(country, total_value, .desc = TRUE))

cat("Countries by total order value:\n")

Countries by total order value:

print(country_summary)

# A tibble: 12 × 4
   country        total_orders total_value avg_value
   <fct>                 <int>       <dbl>     <dbl>
 1 Australia                15       4144.      276.
 2 Brazil                    8       1935.      242.
 3 Canada                   45      10704.      238.
 4 China                     6       1697.      283.
 5 France                   17       3581.      211.
 6 Germany                  20       6013.      301.
 7 India                     8       1968.      246.
 8 Japan                    14       3874.      277.
 9 Mexico                   10       3400.      340.
10 Other                     7       1722.      246.
11 United Kingdom           35       9156.      262.
12 United States           115      29209.      254.

# Lump smaller countries by total order value (w weights each row by order_value)
customer_locations$country_lumped <- fct_lump_n(customer_locations$country, n = 6,
                                               w = customer_locations$order_value)

cat("\nAfter lumping smaller countries:\n")


After lumping smaller countries:

fct_count(customer_locations$country_lumped, sort = TRUE)

# A tibble: 7 × 2
  f                  n
  <fct>          <int>
1 United States    115
2 Other             56
3 Canada            45
4 United Kingdom    35
5 Germany           20
6 Australia         15
7 Japan             14

# Create market size categories
market_sizes <- customer_locations %>%
  group_by(country) %>%
  summarise(market_size = sum(order_value), .groups = "drop") %>%
  mutate(
    size_category = case_when(
      market_size > 5000 ~ "Large Market",
      market_size > 2000 ~ "Medium Market",
      TRUE ~ "Small Market"
    )
  )

customer_locations <- customer_locations %>%
  left_join(market_sizes %>% select(country, size_category), by = "country")

customer_locations$size_category <- factor(customer_locations$size_category,
                                          levels = c("Small Market", "Medium Market", "Large Market"),
                                          ordered = TRUE)

cat("\nMarket size distribution:\n")


Market size distribution:

fct_count(customer_locations$size_category)

# A tibble: 3 × 2
  f                 n
  <ord>         <int>
1 Small Market     29
2 Medium Market    56
3 Large Market    215
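
The case_when() followed by factor() pattern used for the size categories can also be written in one step with base R's cut(). This is an alternative sketch, not the approach used above; the totals are hypothetical and deliberately include the boundary values 2000 and 5000.

```r
# Hypothetical per-country totals, including the boundary values
market_size <- c(1500, 2000, 2000.01, 5000, 6000)

# right = TRUE (the default) makes each interval (lower, upper],
# which matches the "> 2000" and "> 5000" rules used with case_when()
size_category <- cut(
  market_size,
  breaks = c(-Inf, 2000, 5000, Inf),
  labels = c("Small Market", "Medium Market", "Large Market"),
  ordered_result = TRUE
)
size_category
```

cut() returns an ordered factor directly, which avoids having to repeat the level names in a second call.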

Factor Visualization and Analysis

Preparing Factors for Visualization

# Create sample data for visualization
employee_data <- tibble(
  department = sample(c("Engineering", "Sales", "Marketing", "HR", "Finance", "Operations"),
                     100, replace = TRUE),
  performance_rating = sample(c("Needs Improvement", "Meets Expectations", "Exceeds Expectations", "Outstanding"),
                             100, replace = TRUE, prob = c(0.1, 0.4, 0.4, 0.1)),
  salary = round(rnorm(100, 75000, 15000)),
  years_experience = sample(1:20, 100, replace = TRUE)
)

# Convert to factors with proper ordering
employee_data <- employee_data %>%
  mutate(
    department = factor(department),
    performance_rating = factor(performance_rating,
                               levels = c("Needs Improvement", "Meets Expectations",
                                        "Exceeds Expectations", "Outstanding"),
                               ordered = TRUE)
  )

# Reorder departments by median salary for better visualization
employee_data$department_ordered <- fct_reorder(employee_data$department,
                                               employee_data$salary,
                                               .fun = median)

# Create summary for visualization
dept_summary <- employee_data %>%
  group_by(department_ordered, performance_rating) %>%
  summarise(
    count = n(),
    avg_salary = round(mean(salary)),
    .groups = "drop"
  )

cat("Department performance summary (ordered by median salary):\n")

Department performance summary (ordered by median salary):

print(dept_summary)

# A tibble: 21 × 4
   department_ordered performance_rating   count avg_salary
   <fct>              <ord>                <int>      <dbl>
 1 Engineering        Needs Improvement        3      64980
 2 Engineering        Meets Expectations       3      53476
 3 Engineering        Exceeds Expectations     5      87745
 4 Marketing          Needs Improvement        4      65940
 5 Marketing          Meets Expectations       9      71310
 6 Marketing          Exceeds Expectations     9      68022
 7 Marketing          Outstanding              2      82324
 8 Sales              Needs Improvement        3      83663
 9 Sales              Meets Expectations       8      74992
10 Sales              Exceeds Expectations     2      72360
# ℹ 11 more rows

# Show factor levels in order
cat("\nDepartments ordered by median salary:\n")


Departments ordered by median salary:

print(levels(employee_data$department_ordered))

[1] "Engineering" "Marketing"   "Sales"       "Operations"  "HR"         
[6] "Finance"
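
With the levels ordered, the plot itself is short. The sketch below is one possible way to draw it, using a small hypothetical data frame (pay) instead of employee_data so that the ordering is easy to verify by eye.

```r
library(ggplot2)
library(forcats)

# Hypothetical salaries; three departments with clearly different medians
pay <- data.frame(
  department = rep(c("Sales", "HR", "Engineering"), each = 4),
  salary = c(60, 62, 64, 66, 50, 52, 54, 56, 80, 82, 84, 86) * 1000
)

# Reordering inside aes() leaves the data frame itself unchanged
salary_plot <- ggplot(pay, aes(x = fct_reorder(department, salary, .fun = median),
                               y = salary)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Department", y = "Salary")

salary_plot
```

Reordering inside aes() keeps the ordering decision next to the plot that needs it; storing a reordered column, as done above with department_ordered, is preferable when several plots or tables share the same order.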

Best Practices and Common Pitfalls

Factor Best Practices

# Best Practice 1: Always specify levels explicitly for ordered data
# Good
satisfaction_good <- factor(c("Poor", "Good", "Excellent"),
                           levels = c("Poor", "Fair", "Good", "Excellent"))

# Avoid: Letting R determine order alphabetically
satisfaction_bad <- factor(c("Poor", "Good", "Excellent"))

cat("Good approach - explicit levels:\n")

Good approach - explicit levels:

print(levels(satisfaction_good))

[1] "Poor"      "Fair"      "Good"      "Excellent"

cat("\nBad approach - alphabetical:\n")


Bad approach - alphabetical:

print(levels(satisfaction_bad))

[1] "Excellent" "Good"      "Poor"

# Best Practice 2: Use ordered factors for ordinal data
education_ordered <- factor(c("High School", "Bachelor's", "Master's"),
                           levels = c("High School", "Bachelor's", "Master's", "PhD"),
                           ordered = TRUE)

# Best Practice 3: Handle missing data explicitly
survey_with_na <- c("Yes", "No", "Yes", NA, "No")
survey_factor <- factor(survey_with_na)
survey_explicit <- fct_na_value_to_level(survey_factor, level = "No Response")  # replaces the deprecated fct_explicit_na()

cat("\nHandling missing data:\n")


Handling missing data:

print(summary(survey_factor))

  No  Yes NA's 
   2    2    1

print(summary(survey_explicit))

         No         Yes No Response 
          2           2           1

Common Mistakes

# Mistake 1: Converting factors to numeric incorrectly
rating_factor <- factor(c("10", "5", "20", "5", "10"))

# Wrong way (gets the underlying codes, not the values).
# Levels sort as text ("10", "20", "5"), so the codes bear no relation to the numbers.
wrong_numeric <- as.numeric(rating_factor)
cat("Wrong conversion to numeric:\n")

Wrong conversion to numeric:

print(wrong_numeric)

[1] 1 3 2 3 1

# Right way
right_numeric <- as.numeric(as.character(rating_factor))
cat("Correct conversion to numeric:\n")

Correct conversion to numeric:

print(right_numeric)

[1] 10  5 20  5 10

# Mistake 2: Not dropping unused levels
original_data <- factor(c("A", "B", "C", "A", "B"))
subset_data <- original_data[1:2]  # Only A and B

cat("\nSubset still has unused level C:\n")


Subset still has unused level C:

print(levels(subset_data))

[1] "A" "B" "C"

print(table(subset_data))

subset_data
A B C 
1 1 0

# Fix by dropping unused levels
subset_clean <- fct_drop(subset_data)
cat("\nAfter dropping unused levels:\n")


After dropping unused levels:

print(levels(subset_clean))

[1] "A" "B"

print(table(subset_clean))

subset_clean
A B 
1 1

# Mistake 3: Inconsistent level names
messy_categories <- c("Category A", "category_a", "CATEGORY A", "Category A")
factor(messy_categories)  # Creates 3 different levels!

[1] Category A category_a CATEGORY A Category A
Levels: Category A CATEGORY A category_a

# Better: Clean first, then convert to factor
clean_categories <- str_to_title(str_replace_all(messy_categories, "[_\\s]+", " "))
factor(clean_categories)

[1] Category A Category A Category A Category A
Levels: Category A
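
A related pitfall is that factor() does not complain about values missing from the levels argument. The sketch below, on hypothetical yes/no responses, shows how such values silently become NA and how a quick check against the expected set makes them visible before conversion.

```r
expected <- c("Yes", "No")
raw <- c("Yes", "no", "No", "Maybe", "Yes")

# Values that are not in `levels` are converted to NA without any warning
silent <- factor(raw, levels = expected)
sum(is.na(silent))

# Checking against the expected set first makes the problem visible
unexpected <- setdiff(unique(raw), expected)
unexpected
```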

Exercises

Exercise 1: Survey Data Processing

Given survey responses with inconsistent category names:

1. Clean and standardize the response categories
2. Create proper ordered factors for Likert scales
3. Collapse detailed categories into broader groups
4. Handle missing responses appropriately

Exercise 2: Sales Data Analysis

You have sales data with product categories:

1. Reorder categories by sales performance
2. Lump low-performing categories into “Other”
3. Create high/medium/low performance tiers
4. Prepare the data for visualization

Exercise 3: Geographic Analysis

Working with customer location data:

1. Standardize country and region names
2. Group countries into major markets
3. Order regions by customer value
4. Create market size categories

Exercise 4: Factor Validation Pipeline

Create a data validation system for categorical data:

1. Detect and fix inconsistent category names
2. Identify and handle unexpected categories
3. Ensure proper factor ordering
4. Generate validation reports

Summary

The forcats package makes factor manipulation intuitive and powerful:

Key Functions:

Inspection: fct_count(), levels(), nlevels()
Reordering: fct_reorder(), fct_relevel(), fct_rev()
Changing levels: fct_recode(), fct_collapse(), fct_lump_n(), fct_lump_min(), fct_lump_prop()
Missing data and unused levels: fct_na_value_to_level(), fct_na_level_to_value(), fct_drop()

Best Practices:

Specify levels explicitly for ordered data
Use ordered factors for ordinal variables
Handle missing data consciously
Drop unused levels after subsetting
Clean data before converting to factors

Common Applications:

Survey analysis: Proper ordering of response scales
Data visualization: Reordering for better plots
Reporting: Grouping categories for summaries
Machine learning: Preparing categorical variables

Remember:

Factors keep all declared levels, in order, even after subsetting
Proper factor handling improves visualization and analysis
forcats integrates seamlessly with dplyr and ggplot2
Always validate factor levels in real-world data

Factors are essential for working with categorical data effectively. With forcats, you can handle even complex categorical data scenarios with confidence!
