Working with Strings using stringr

Author

IND215

Published

September 22, 2025

Introduction to String Manipulation

Text data is everywhere in real-world datasets: names, addresses, comments, categories, and more. The stringr package provides a consistent, intuitive set of functions for working with strings that integrates seamlessly with the tidyverse.

library(tidyverse)
library(stringr)

# All stringr functions start with str_ for easy discovery
cat("Key stringr functions we'll explore:\n")
Key stringr functions we'll explore:
cat("- str_length(): get string length\n")
- str_length(): get string length
cat("- str_sub(): extract substrings\n")
- str_sub(): extract substrings
cat("- str_detect(): find patterns\n")
- str_detect(): find patterns
cat("- str_replace(): replace patterns\n")
- str_replace(): replace patterns
cat("- str_split(): split strings\n")
- str_split(): split strings
cat("- str_trim(): remove whitespace\n")
- str_trim(): remove whitespace
cat("- str_to_*(): change case\n")
- str_to_*(): change case
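One practical benefit of this consistent design is argument order: every stringr function takes the string as its first argument and the pattern second, which makes pipelines read naturally (base R equivalents often reverse this). A quick sketch comparing the two:

```r
library(stringr)

fruits <- c("apple", "banana")

# stringr: string first, pattern second -- pipe-friendly
str_detect(fruits, "an")
# [1] FALSE  TRUE

# base R equivalent: pattern comes first
grepl("an", fruits)
# [1] FALSE  TRUE
```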

Basic String Operations

String Length and Subsetting

# Sample text data
customer_names <- c("John Smith", "Mary Johnson", "Bob O'Connor", "李小明", "José García")
product_codes <- c("ABC-123", "XYZ-456", "DEF-789", "GHI-012")

# Get string length
str_length(customer_names)
[1] 10 12 12  3 11
str_length(product_codes)
[1] 7 7 7 7
# Extract substrings
str_sub(customer_names, 1, 4)          # First 4 characters
[1] "John"   "Mary"   "Bob "   "李小明" "José"  
str_sub(customer_names, -5, -1)        # Last 5 characters
[1] "Smith"  "hnson"  "onnor"  "李小明" "arcía" 
str_sub(product_codes, 1, 3)           # Extract prefix
[1] "ABC" "XYZ" "DEF" "GHI"
str_sub(product_codes, 5, 7)           # Extract suffix
[1] "123" "456" "789" "012"
# Modify substrings
modified_codes <- product_codes
str_sub(modified_codes, 1, 3) <- "NEW"  # Replace first 3 characters
print(modified_codes)
[1] "NEW-123" "NEW-456" "NEW-789" "NEW-012"

Case Conversion

messy_names <- c("JOHN SMITH", "mary johnson", "Bob O'connor", "MARÍA garcía")

# Case conversion functions
str_to_lower(messy_names)      # All lowercase
[1] "john smith"   "mary johnson" "bob o'connor" "maría garcía"
str_to_upper(messy_names)      # All uppercase
[1] "JOHN SMITH"   "MARY JOHNSON" "BOB O'CONNOR" "MARÍA GARCÍA"
str_to_title(messy_names)      # Title Case
[1] "John Smith"   "Mary Johnson" "Bob O'connor" "María García"
str_to_sentence(messy_names)   # Sentence case
[1] "John smith"   "Mary johnson" "Bob o'connor" "María garcía"
# Locale-aware conversion (important for international names)
str_to_title(messy_names, locale = "en")
[1] "John Smith"   "Mary Johnson" "Bob O'connor" "María García"
str_to_title("maría garcía", locale = "es")
[1] "María García"

Whitespace and Padding

messy_text <- c("  John Smith  ", "\tMary Johnson\n", "Bob    Wilson", "   ")

# Remove whitespace
str_trim(messy_text)                    # Remove leading/trailing
[1] "John Smith"    "Mary Johnson"  "Bob    Wilson" ""             
str_trim(messy_text, side = "left")     # Remove only leading
[1] "John Smith  "   "Mary Johnson\n" "Bob    Wilson"  ""              
str_trim(messy_text, side = "right")    # Remove only trailing
[1] "  John Smith"   "\tMary Johnson" "Bob    Wilson"  ""              
str_squish(messy_text)                  # Remove all extra whitespace
[1] "John Smith"   "Mary Johnson" "Bob Wilson"   ""            
# Add padding
# Add padding (avoid calling a variable `names`, which masks base::names())
short_names <- c("John", "Mary", "Bob")
str_pad(short_names, width = 10, side = "left", pad = " ")
[1] "      John" "      Mary" "       Bob"
str_pad(short_names, width = 10, side = "both", pad = "-")
[1] "---John---" "---Mary---" "---Bob----"
str_pad(short_names, width = 8, side = "right", pad = ".")
[1] "John...." "Mary...." "Bob....."
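A common practical use of str_pad() is building fixed-width identifiers, for example zero-padding numeric IDs (the IDs below are made up for illustration):

```r
library(stringr)

# str_pad() coerces numbers to character before padding
ids <- c(1, 42, 7301)
str_pad(ids, width = 5, side = "left", pad = "0")
# [1] "00001" "00042" "07301"
```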

Pattern Detection and Matching

Basic Pattern Detection

email_list <- c("john@email.com", "mary.wilson@company.org", "invalid-email",
                "bob@test.co.uk", "sarah@domain", "alice@example.com")

# Detect patterns
str_detect(email_list, "@")                    # Contains @
[1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
str_detect(email_list, "\\.com")               # Contains ".com" (not anchored to the end)
[1]  TRUE FALSE FALSE FALSE FALSE  TRUE
str_detect(email_list, "^[a-z]+@")             # Starts with lowercase letters before @
[1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE
# Count matches
str_count(email_list, "\\.")                   # Count dots
[1] 1 2 0 2 0 1
str_count(email_list, "[aeiou]")               # Count vowels
[1] 5 6 6 4 5 7
# Find pattern locations
str_locate(email_list, "@")                    # First @ position
     start end
[1,]     5   5
[2,]    12  12
[3,]    NA  NA
[4,]     4   4
[5,]     6   6
[6,]     6   6
str_locate_all(email_list, "[aeiou]")          # All vowel positions
[[1]]
     start end
[1,]     2   2
[2,]     6   6
[3,]     8   8
[4,]     9   9
[5,]    13  13

[[2]]
     start end
[1,]     2   2
[2,]     7   7
[3,]    10  10
[4,]    14  14
[5,]    17  17
[6,]    21  21

[[3]]
     start end
[1,]     1   1
[2,]     4   4
[3,]     6   6
[4,]     9   9
[5,]    11  11
[6,]    12  12

[[4]]
     start end
[1,]     2   2
[2,]     6   6
[3,]    11  11
[4,]    13  13

[[5]]
     start end
[1,]     2   2
[2,]     4   4
[3,]     8   8
[4,]    10  10
[5,]    11  11

[[6]]
     start end
[1,]     1   1
[2,]     3   3
[3,]     5   5
[4,]     7   7
[5,]     9   9
[6,]    13  13
[7,]    16  16

Working with Regular Expressions

# Sample data for regex practice
phone_numbers <- c("123-456-7890", "(555) 123-4567", "555.123.4567",
                   "1234567890", "+1-555-123-4567", "invalid")

# Basic regex patterns
str_detect(phone_numbers, "\\d{3}-\\d{3}-\\d{4}")     # xxx-xxx-xxxx format
[1]  TRUE FALSE FALSE FALSE  TRUE FALSE
str_detect(phone_numbers, "\\(\\d{3}\\)")             # Area code in parentheses
[1] FALSE  TRUE FALSE FALSE FALSE FALSE
str_detect(phone_numbers, "^\\+1")                    # Starts with +1
[1] FALSE FALSE FALSE FALSE  TRUE FALSE
# Extract parts using regex groups
phone_pattern <- "(\\d{3})[.-](\\d{3})[.-](\\d{4})"
str_extract(phone_numbers, phone_pattern)
[1] "123-456-7890" NA             "555.123.4567" NA             "555-123-4567"
[6] NA            
# More complex patterns
email_pattern <- "([a-zA-Z0-9._%-]+)@([a-zA-Z0-9.-]+\\.[a-zA-Z]{2,})"
sample_emails <- c("user@domain.com", "test.email@company.co.uk", "invalid@", "@invalid.com")

str_detect(sample_emails, email_pattern)
[1]  TRUE  TRUE FALSE FALSE
str_extract(sample_emails, email_pattern)
[1] "user@domain.com"          "test.email@company.co.uk"
[3] NA                         NA                        

Common Regex Patterns

# Useful regex patterns for data cleaning
sample_text <- c("Call 555-123-4567 today!", "Email: user@test.com",
                 "Price: $29.99", "Date: 2024-01-15", "ID: ABC123XYZ")

# Phone numbers
phone_regex <- "\\b\\d{3}[.-]?\\d{3}[.-]?\\d{4}\\b"
str_extract(sample_text, phone_regex)
[1] "555-123-4567" NA             NA             NA             NA            
# Email addresses
email_regex <- "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b"
str_extract(sample_text, email_regex)
[1] NA              "user@test.com" NA              NA             
[5] NA             
# Currency amounts
currency_regex <- "\\$\\d+\\.\\d{2}"
str_extract(sample_text, currency_regex)
[1] NA       NA       "$29.99" NA       NA      
# Dates (YYYY-MM-DD format)
date_regex <- "\\b\\d{4}-\\d{2}-\\d{2}\\b"
str_extract(sample_text, date_regex)
[1] NA           NA           NA           "2024-01-15" NA          
# Alphanumeric IDs
id_regex <- "\\b[A-Z]{3}\\d{3}[A-Z]{3}\\b"
str_extract(sample_text, id_regex)
[1] NA          NA          NA          NA          "ABC123XYZ"

String Replacement and Transformation

Basic Replacement

messy_addresses <- c("123 Main St.", "456 Oak Ave", "789 Pine Street", "101 Elm St")

# Simple replacement
str_replace(messy_addresses, "St\\.", "Street")        # Replace first match
[1] "123 Main Street" "456 Oak Ave"     "789 Pine Street" "101 Elm St"     
str_replace_all(messy_addresses, "\\.", "")            # Remove all periods
[1] "123 Main St"     "456 Oak Ave"     "789 Pine Street" "101 Elm St"     
str_replace_all(messy_addresses, "St$", "Street")      # Replace St at end
[1] "123 Main St."    "456 Oak Ave"     "789 Pine Street" "101 Elm Street" 
# Multiple replacements
cleanup_patterns <- c("St\\." = "Street", "Ave" = "Avenue", "\\b(\\d+)\\b" = "Number \\1")
str_replace_all(messy_addresses, cleanup_patterns)
[1] "Number 123 Main Street" "Number 456 Oak Avenue"  "Number 789 Pine Street"
[4] "Number 101 Elm St"     
# Case-insensitive replacement
# Caution: without word boundaries, the "St" inside "Street" also matches,
# producing "Streetreet" below
str_replace_all(messy_addresses, regex("ST", ignore_case = TRUE), "Street")
[1] "123 Main Street."    "456 Oak Ave"         "789 Pine Streetreet"
[4] "101 Elm Street"     
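Combining regex() with word boundaries avoids the partial-match problem above; a sketch on the same addresses:

```r
library(stringr)

messy_addresses <- c("123 Main St.", "456 Oak Ave", "789 Pine Street", "101 Elm St")

# \b restricts the match to the standalone word "St" (optionally followed by ".")
str_replace_all(messy_addresses, regex("\\bSt\\b\\.?", ignore_case = TRUE), "Street")
# [1] "123 Main Street" "456 Oak Ave"     "789 Pine Street" "101 Elm Street"
```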

Advanced Replacement with Functions

# Sample product descriptions
products <- c("LAPTOP-15INCH-8GB", "PHONE-ANDROID-64GB", "TABLET-IPAD-128GB")

# Replace with a custom function: the replacement function may receive all
# matches as one character vector, so use vectorized logic (ifelse), not if/else
str_replace_all(products, "\\b(\\d+)GB\\b", function(x) {
  gb <- as.numeric(str_extract(x, "\\d+"))
  ifelse(gb >= 128, "High Storage",
         ifelse(gb >= 64, "Medium Storage", "Low Storage"))
})
[1] "LAPTOP-15INCH-Low Storage"    "PHONE-ANDROID-Medium Storage"
[3] "TABLET-IPAD-High Storage"    
# Clean and format product names
# (use ignore_case: after str_to_title() the suffix may be "8gb", not "8GB")
clean_products <- products %>%
  str_replace_all("-", " ") %>%
  str_to_title() %>%
  str_replace_all(regex("(\\d+)\\s*gb", ignore_case = TRUE), "\\1 GB")

print(clean_products)
[1] "Laptop 15inch 8 GB"  "Phone Android 64 GB" "Tablet Ipad 128 GB" 

String Splitting and Joining

Splitting Strings

# Sample data with delimiters
csv_data <- c("John,25,Engineer", "Mary,30,Manager", "Bob,28,Analyst")
pipe_data <- c("Apple|Red|Sweet", "Banana|Yellow|Sweet", "Lemon|Yellow|Sour")

# Split by delimiter
str_split(csv_data, ",")                    # Returns list
[[1]]
[1] "John"     "25"       "Engineer"

[[2]]
[1] "Mary"    "30"      "Manager"

[[3]]
[1] "Bob"     "28"      "Analyst"
str_split(csv_data, ",", simplify = TRUE)   # Returns matrix
     [,1]   [,2] [,3]      
[1,] "John" "25" "Engineer"
[2,] "Mary" "30" "Manager" 
[3,] "Bob"  "28" "Analyst" 
str_split(pipe_data, "\\|", n = 2)          # Limit to 2 pieces
[[1]]
[1] "Apple"     "Red|Sweet"

[[2]]
[1] "Banana"       "Yellow|Sweet"

[[3]]
[1] "Lemon"       "Yellow|Sour"
# Split into tibble (very useful!)
str_split(csv_data, ",", simplify = TRUE) %>%
  as_tibble(.name_repair = "minimal") %>%
  set_names(c("name", "age", "job"))
# A tibble: 3 × 3
  name  age   job     
  <chr> <chr> <chr>   
1 John  25    Engineer
2 Mary  30    Manager 
3 Bob   28    Analyst 
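As an alternative to splitting and renaming by hand, recent tidyr versions (1.3+) provide separate_wider_delim() for exactly this split-into-columns pattern:

```r
library(tidyverse)  # tidyr >= 1.3 for separate_wider_delim()

tibble(raw = c("John,25,Engineer", "Mary,30,Manager", "Bob,28,Analyst")) %>%
  separate_wider_delim(raw, delim = ",", names = c("name", "age", "job"))
# A 3 x 3 tibble with character columns name, age, job
```

Note the resulting columns are all character; convert age with as.integer() if needed.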
# More complex splitting
full_names <- c("John A. Smith", "Mary Elizabeth Johnson", "Bob Wilson")
str_split(full_names, "\\s+")               # Split on any whitespace
[[1]]
[1] "John"  "A."    "Smith"

[[2]]
[1] "Mary"      "Elizabeth" "Johnson"  

[[3]]
[1] "Bob"    "Wilson"
str_split(full_names, "\\s+", n = 2)        # Split into first and last parts
[[1]]
[1] "John"     "A. Smith"

[[2]]
[1] "Mary"              "Elizabeth Johnson"

[[3]]
[1] "Bob"    "Wilson"

Joining Strings

# Sample data to join
first_names <- c("John", "Mary", "Bob")
last_names <- c("Smith", "Johnson", "Wilson")
titles <- c("Dr.", "Ms.", "Mr.")

# Basic joining
str_c(first_names, last_names, sep = " ")           # Simple concatenation
[1] "John Smith"   "Mary Johnson" "Bob Wilson"  
str_c(titles, first_names, last_names, sep = " ")   # Multiple parts
[1] "Dr. John Smith"   "Ms. Mary Johnson" "Mr. Bob Wilson"  
# Join with collapse (combine all into one string)
str_c(first_names, collapse = ", ")                 # "John, Mary, Bob"
[1] "John, Mary, Bob"
str_c(first_names, collapse = " and ")              # "John and Mary and Bob"
[1] "John and Mary and Bob"
# Conditional joining (skip missing values)
mixed_data <- c("John", NA, "Mary", "", "Bob")
str_c("Name: ", mixed_data)                         # Creates NA
[1] "Name: John" NA           "Name: Mary" "Name: "     "Name: Bob" 
str_c("Name: ", mixed_data, sep = "")               # Still creates NA
[1] "Name: John" NA           "Name: Mary" "Name: "     "Name: Bob" 
# Better approach with coalesce
str_c("Name: ", coalesce(mixed_data, "Unknown"))
[1] "Name: John"    "Name: Unknown" "Name: Mary"    "Name: "       
[5] "Name: Bob"    
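For building labeled strings, str_glue() offers a readable template alternative to str_c() (note that glue interpolates NA as the literal text "NA" rather than propagating it):

```r
library(stringr)

first <- c("John", "Mary", "Bob")
n_orders <- c(3, 1, 7)

# {} interpolates each vector element; the result is vectorized
str_glue("{first} placed {n_orders} orders")
# John placed 3 orders
# Mary placed 1 orders
# Bob placed 7 orders
```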

Real-World String Cleaning Examples

Example 1: Cleaning Customer Data

# Messy customer data
raw_customers <- tibble(
  id = 1:6,
  name = c("  JOHN SMITH  ", "mary-johnson", "Bob  O'Connor",
           "SARAH    WILSON", "mike.davis", "ANNA-MARIA GARCIA"),
  email = c("JOHN@EMAIL.COM", "mary@company.org", "bob@COMPANY.com",
            "sarah.wilson@test.CO.UK", "mike.davis@DOMAIN.com", "anna@example.COM"),
  phone = c("(555) 123-4567", "555.123.4567", "5551234567",
            "555-123-4567", "(555)123-4567", "555 123 4567"),
  address = c("123 main st.", "456 OAK AVE", "789 pine street apt 2",
              "101 ELM ST UNIT B", "202 maple ave.", "303 BIRCH ST")
)

print("Raw customer data:")
[1] "Raw customer data:"
print(raw_customers)
# A tibble: 6 × 5
     id name                email                   phone          address      
  <int> <chr>               <chr>                   <chr>          <chr>        
1     1 "  JOHN SMITH  "    JOHN@EMAIL.COM          (555) 123-4567 123 main st. 
2     2 "mary-johnson"      mary@company.org        555.123.4567   456 OAK AVE  
3     3 "Bob  O'Connor"     bob@COMPANY.com         5551234567     789 pine str…
4     4 "SARAH    WILSON"   sarah.wilson@test.CO.UK 555-123-4567   101 ELM ST U…
5     5 "mike.davis"        mike.davis@DOMAIN.com   (555)123-4567  202 maple av…
6     6 "ANNA-MARIA GARCIA" anna@example.COM        555 123 4567   303 BIRCH ST 
# Clean the data step by step
cleaned_customers <- raw_customers %>%
  mutate(
    # Clean names
    name_clean = name %>%
      str_trim() %>%                              # Remove leading/trailing spaces
      str_squish() %>%                           # Remove extra internal spaces
      str_replace_all("[.-]", " ") %>%           # Replace hyphens and dots with spaces
      str_to_title(),                            # Convert to title case

    # Clean emails
    email_clean = email %>%
      str_to_lower() %>%                         # Convert to lowercase
      str_trim(),                                # Remove any spaces

    # Clean phone numbers
    phone_clean = phone %>%
      str_replace_all("[^0-9]", "") %>%          # Remove all non-digits
      str_replace("(\\d{3})(\\d{3})(\\d{4})", "(\\1) \\2-\\3"),  # Format as (xxx) xxx-xxxx

    # Clean addresses
    address_clean = address %>%
      str_to_title() %>%                         # Convert to title case
      str_replace_all("\\bSt\\b", "Street") %>%  # Expand abbreviations
      str_replace_all("\\bAve\\b", "Avenue") %>%
      str_replace_all("\\bApt\\b", "Apartment") %>%
      str_squish()                               # Clean up spaces
  ) %>%
  select(id, name_clean, email_clean, phone_clean, address_clean)

print("Cleaned customer data:")
[1] "Cleaned customer data:"
print(cleaned_customers)
# A tibble: 6 × 5
     id name_clean        email_clean             phone_clean    address_clean  
  <int> <chr>             <chr>                   <chr>          <chr>          
1     1 John Smith        john@email.com          (555) 123-4567 123 Main Stree…
2     2 Mary Johnson      mary@company.org        (555) 123-4567 456 Oak Avenue 
3     3 Bob O'connor      bob@company.com         (555) 123-4567 789 Pine Stree…
4     4 Sarah Wilson      sarah.wilson@test.co.uk (555) 123-4567 101 Elm Street…
5     5 Mike Davis        mike.davis@domain.com   (555) 123-4567 202 Maple Aven…
6     6 Anna Maria Garcia anna@example.com        (555) 123-4567 303 Birch Stre…

Example 2: Parsing Product Information

# Product codes with embedded information
product_data <- tibble(
  sku = c("LAPTOP-DELL-15INCH-8GB-256SSD-WIN11",
          "PHONE-APPLE-IPHONE14-128GB-BLUE",
          "TABLET-SAMSUNG-10INCH-64GB-ANDROID",
          "LAPTOP-HP-14INCH-16GB-512SSD-WIN11",
          "PHONE-SAMSUNG-GALAXY-256GB-BLACK"),
  price = c("$1299.99", "$999.00", "$449.99", "$1599.99", "$799.99")
)

print("Raw product data:")
[1] "Raw product data:"
print(product_data)
# A tibble: 5 × 2
  sku                                 price   
  <chr>                               <chr>   
1 LAPTOP-DELL-15INCH-8GB-256SSD-WIN11 $1299.99
2 PHONE-APPLE-IPHONE14-128GB-BLUE     $999.00 
3 TABLET-SAMSUNG-10INCH-64GB-ANDROID  $449.99 
4 LAPTOP-HP-14INCH-16GB-512SSD-WIN11  $1599.99
5 PHONE-SAMSUNG-GALAXY-256GB-BLACK    $799.99 
# Parse information from SKU codes
parsed_products <- product_data %>%
  mutate(
    # Extract basic product type
    product_type = str_extract(sku, "^[A-Z]+"),

    # Extract brand
    brand = str_extract(sku, "(?<=-)([A-Z]+)(?=-)"),

    # Extract storage information (note: the "\\d+GB" alternative is tried
    # first, so the GB capacity wins when a SKU also contains an SSD size)
    storage_info = str_extract(sku, "\\d+GB|\\d+SSD"),
    storage_gb = as.numeric(str_extract(storage_info, "\\d+")),
    storage_type = case_when(
      str_detect(storage_info, "SSD") ~ "SSD",
      str_detect(storage_info, "GB") ~ "Flash",
      TRUE ~ "Unknown"
    ),

    # Extract screen size for relevant products
    screen_size = str_extract(sku, "\\d+INCH"),
    screen_inches = as.numeric(str_extract(screen_size, "\\d+")),

    # Clean price
    price_numeric = as.numeric(str_replace_all(price, "[$,]", "")),

    # Create clean product name (ignore_case handles whatever casing
    # str_to_title() produces for "8gb", "256ssd", "15inch")
    product_name = str_replace_all(sku, "-", " ") %>%
      str_to_title() %>%
      str_replace_all(regex("\\b(\\d+)gb\\b", ignore_case = TRUE), "\\1 GB") %>%
      str_replace_all(regex("\\b(\\d+)ssd\\b", ignore_case = TRUE), "\\1 SSD") %>%
      str_replace_all(regex("\\b(\\d+)inch\\b", ignore_case = TRUE), "\\1 Inch")
  ) %>%
  select(sku, product_name, product_type, brand, storage_gb, storage_type,
         screen_inches, price_numeric)

print("Parsed product data:")
[1] "Parsed product data:"
print(parsed_products)
# A tibble: 5 × 8
  sku      product_name product_type brand storage_gb storage_type screen_inches
  <chr>    <chr>        <chr>        <chr>      <dbl> <chr>                <dbl>
1 LAPTOP-… Laptop Dell… LAPTOP       DELL           8 Flash                   15
2 PHONE-A… Phone Apple… PHONE        APPLE        128 Flash                   NA
3 TABLET-… Tablet Sams… TABLET       SAMS…         64 Flash                   10
4 LAPTOP-… Laptop Hp 1… LAPTOP       HP            16 Flash                   14
5 PHONE-S… Phone Samsu… PHONE        SAMS…        256 Flash                   NA
# ℹ 1 more variable: price_numeric <dbl>
# Summary analysis
cat("Product analysis:\n")
Product analysis:
parsed_products %>%
  group_by(product_type, brand) %>%
  summarise(
    count = n(),
    avg_price = round(mean(price_numeric), 2),
    avg_storage = round(mean(storage_gb, na.rm = TRUE), 0),
    .groups = "drop"
  ) %>%
  print()
# A tibble: 5 × 5
  product_type brand   count avg_price avg_storage
  <chr>        <chr>   <int>     <dbl>       <dbl>
1 LAPTOP       DELL        1     1300.           8
2 LAPTOP       HP          1     1600.          16
3 PHONE        APPLE       1      999          128
4 PHONE        SAMSUNG     1      800.         256
5 TABLET       SAMSUNG     1      450.          64

Example 3: Text Data Mining

# Customer feedback comments
feedback_data <- tibble(
  customer_id = 1:8,
  comment = c(
    "Great product! Very satisfied with the quality and fast shipping.",
    "Poor quality, arrived damaged. Customer service was unhelpful.",
    "Good value for money. Would recommend to others.",
    "Excellent! Fast delivery and great customer support.",
    "Average product. Nothing special but does the job.",
    "Terrible experience. Product broke after 2 days.",
    "Love it! Best purchase I've made this year.",
    "Okay product but overpriced. Customer service was good though."
  ),
  rating = c(5, 1, 4, 5, 3, 1, 5, 3)
)

print("Customer feedback:")
[1] "Customer feedback:"
print(feedback_data)
# A tibble: 8 × 3
  customer_id comment                                                     rating
        <int> <chr>                                                        <dbl>
1           1 Great product! Very satisfied with the quality and fast sh…      5
2           2 Poor quality, arrived damaged. Customer service was unhelp…      1
3           3 Good value for money. Would recommend to others.                 4
4           4 Excellent! Fast delivery and great customer support.             5
5           5 Average product. Nothing special but does the job.               3
6           6 Terrible experience. Product broke after 2 days.                 1
7           7 Love it! Best purchase I've made this year.                      5
8           8 Okay product but overpriced. Customer service was good tho…      3
# Analyze sentiment and extract insights
feedback_analysis <- feedback_data %>%
  mutate(
    # Convert to lowercase for analysis
    comment_clean = str_to_lower(comment),

    # Extract sentiment indicators
    positive_words = str_count(comment_clean, "\\b(great|excellent|good|love|best|satisfied|recommend)\\b"),
    negative_words = str_count(comment_clean, "\\b(poor|terrible|bad|awful|hate|worst|disappointed|unhelpful)\\b"),

    # Check for specific topics
    mentions_quality = str_detect(comment_clean, "\\bquality\\b"),
    mentions_shipping = str_detect(comment_clean, "\\b(shipping|delivery)\\b"),
    mentions_service = str_detect(comment_clean, "\\b(service|support)\\b"),
    mentions_price = str_detect(comment_clean, "\\b(price|value|money|expensive|cheap|overpriced)\\b"),

    # Calculate sentiment score
    sentiment_score = positive_words - negative_words,

    # Classify sentiment
    sentiment = case_when(
      sentiment_score > 0 ~ "Positive",
      sentiment_score < 0 ~ "Negative",
      TRUE ~ "Neutral"
    ),

    # Extract key phrases (simple approach)
    key_phrases = str_extract_all(comment_clean, "\\b(very|extremely|really)\\s+\\w+"),

    # Word count
    word_count = str_count(comment_clean, "\\S+")
  )

print("Feedback analysis:")
[1] "Feedback analysis:"
feedback_analysis %>%
  select(customer_id, rating, sentiment, sentiment_score, mentions_quality,
         mentions_shipping, mentions_service, mentions_price, word_count) %>%
  print()
# A tibble: 8 × 9
  customer_id rating sentiment sentiment_score mentions_quality
        <int>  <dbl> <chr>               <int> <lgl>           
1           1      5 Positive                2 TRUE            
2           2      1 Negative               -2 TRUE            
3           3      4 Positive                2 FALSE           
4           4      5 Positive                2 FALSE           
5           5      3 Neutral                 0 FALSE           
6           6      1 Negative               -1 FALSE           
7           7      5 Positive                2 FALSE           
8           8      3 Positive                1 FALSE           
# ℹ 4 more variables: mentions_shipping <lgl>, mentions_service <lgl>,
#   mentions_price <lgl>, word_count <int>
# Summary insights
cat("\nSentiment Analysis Summary:\n")

Sentiment Analysis Summary:
feedback_analysis %>%
  group_by(sentiment) %>%
  summarise(
    count = n(),
    avg_rating = round(mean(rating), 1),
    avg_word_count = round(mean(word_count), 1)
  ) %>%
  print()
# A tibble: 3 × 4
  sentiment count avg_rating avg_word_count
  <chr>     <int>      <dbl>          <dbl>
1 Negative      2        1              7.5
2 Neutral       1        3              8  
3 Positive      5        4.4            8.4
cat("\nTopic Mentions:\n")

Topic Mentions:
feedback_analysis %>%
  summarise(
    quality_mentions = sum(mentions_quality),
    shipping_mentions = sum(mentions_shipping),
    service_mentions = sum(mentions_service),
    price_mentions = sum(mentions_price)
  ) %>%
  print()
# A tibble: 1 × 4
  quality_mentions shipping_mentions service_mentions price_mentions
             <int>             <int>            <int>          <int>
1                2                 2                3              2

Advanced String Techniques

Working with Encoding and Special Characters

# Text with special characters and different encodings
international_text <- c("Café", "naïve", "résumé", "piñata", "Zürich", "北京", "Москва")

# String length (counts characters/code points, not bytes)
str_length(international_text)
[1] 4 5 6 6 6 2 6
nchar(international_text)
[1] 4 5 6 6 6 2 6
# Detect and handle different character types
str_detect(international_text, "[^\\x00-\\x7F]")  # Non-ASCII characters
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
str_detect(international_text, "\\p{L}")          # Unicode letters
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
str_detect(international_text, "\\p{Han}")        # Chinese characters
[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
# Normalize text (remove accents)
# Note: This is a simplified example; real normalization needs specialized packages
simplified_text <- international_text %>%
  str_replace_all("[àáâãäå]", "a") %>%
  str_replace_all("[èéêë]", "e") %>%
  str_replace_all("[ìíîï]", "i") %>%
  str_replace_all("[òóôõö]", "o") %>%
  str_replace_all("[ùúûü]", "u") %>%
  str_replace_all("[ñ]", "n")

print(data.frame(original = international_text, simplified = simplified_text))
  original simplified
1     Café       Cafe
2    naïve      naive
3   résumé     resume
4   piñata     pinata
5   Zürich     Zurich
6     北京       北京
7   Москва     Москва
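For robust accent removal, the stringi package (which stringr is built on) provides Unicode transliteration, which handles uppercase letters and many more diacritics than a hand-written character class; a sketch:

```r
library(stringi)

# "Latin-ASCII" transliterates accented Latin characters to plain ASCII
stri_trans_general(c("Café", "naïve", "résumé", "piñata", "Zürich"), "Latin-ASCII")
# [1] "Cafe"   "naive"  "resume" "pinata" "Zurich"
```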

String Validation

# Create validation functions using stringr
validate_email <- function(email) {
  pattern <- "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
  str_detect(email, pattern)
}

validate_phone <- function(phone) {
  # Remove all non-digits and check if 10 digits remain
  digits_only <- str_replace_all(phone, "[^0-9]", "")
  str_length(digits_only) == 10
}

validate_postal_code <- function(postal_code, country = "US") {
  if (country == "US") {
    # US ZIP code: 12345 or 12345-6789
    str_detect(postal_code, "^\\d{5}(-\\d{4})?$")
  } else if (country == "CA") {
    # Canadian postal code: A1A 1A1
    str_detect(postal_code, "^[A-Z]\\d[A-Z]\\s?\\d[A-Z]\\d$")
  } else if (country == "UK") {
    # UK postal code (simplified): AA1 1AA or A1 1AA
    str_detect(postal_code, "^[A-Z]{1,2}\\d{1,2}\\s?\\d[A-Z]{2}$")
  } else {
    # Unknown country: return NA rather than NULL
    rep(NA, length(postal_code))
  }
}

# Test validation functions
test_emails <- c("user@domain.com", "invalid.email", "test@company.co.uk", "bad@")
test_phones <- c("(555) 123-4567", "555.123.4567", "12345", "555-123-4567")
test_postal <- c("12345", "12345-6789", "ABC", "90210-1234")

validation_results <- tibble(
  email = test_emails,
  email_valid = validate_email(test_emails),
  phone = test_phones,
  phone_valid = validate_phone(test_phones),
  postal = test_postal,
  postal_valid = validate_postal_code(test_postal, "US")
)

print(validation_results)
# A tibble: 4 × 6
  email              email_valid phone          phone_valid postal  postal_valid
  <chr>              <lgl>       <chr>          <lgl>       <chr>   <lgl>       
1 user@domain.com    TRUE        (555) 123-4567 TRUE        12345   TRUE        
2 invalid.email      FALSE       555.123.4567   TRUE        12345-… TRUE        
3 test@company.co.uk TRUE        12345          FALSE       ABC     FALSE       
4 bad@               FALSE       555-123-4567   TRUE        90210-… TRUE        

Best Practices and Common Pitfalls

Performance Tips

# For large datasets, vectorized operations are much faster
large_text <- rep(c("Hello World", "Data Science", "R Programming"), 1000)

# Good: Use vectorized stringr functions
system.time({
  result1 <- str_to_upper(large_text)
})
   user  system elapsed 
      0       0       0 
# Avoid: Loops for simple operations
system.time({
  result2 <- character(length(large_text))
  for (i in seq_along(large_text)) {
    result2[i] <- toupper(large_text[i])
  }
})
   user  system elapsed 
  0.003   0.000   0.002 
# Define complex patterns once (with regex()) and reuse them
email_pattern <- regex("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$")
test_emails <- rep(c("user@domain.com", "invalid"), 500)

system.time({
  result3 <- str_detect(test_emails, email_pattern)
})
   user  system elapsed 
      0       0       0 

Common Mistakes and Solutions

# Mistake 1: Forgetting to escape special regex characters
test_text <- c("price: $29.99", "cost: $15.50", "value: 10.00")

# Wrong: "$" anchors the end of the string and "." matches any character
str_detect(test_text, "$29.99")  # This won't work as expected
[1] FALSE FALSE FALSE
# Right: Escape special characters
str_detect(test_text, "\\$29\\.99")
[1]  TRUE FALSE FALSE
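When the pattern is a literal string rather than a regex, wrapping it in fixed() sidesteps escaping entirely (and literal matching is also faster):

```r
library(stringr)

test_text <- c("price: $29.99", "cost: $15.50", "value: 10.00")

# fixed() matches the characters literally; no escaping needed
str_detect(test_text, fixed("$29.99"))
# [1]  TRUE FALSE FALSE
```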
# Mistake 2: Not handling missing values
messy_data <- c("John Smith", NA, "Mary Johnson", "")

# This will create NA for missing values
str_to_upper(messy_data)
[1] "JOHN SMITH"   NA             "MARY JOHNSON" ""            
# Better: Handle missing values explicitly (note: "" is not NA, so it passes through)
coalesce(str_to_upper(messy_data), "UNKNOWN")
[1] "JOHN SMITH"   "UNKNOWN"      "MARY JOHNSON" ""            
# Mistake 3: Assuming consistent formats
mixed_dates <- c("2024-01-15", "01/15/2024", "Jan 15, 2024")

# Wrong: Assuming one format
str_extract(mixed_dates, "\\d{4}-\\d{2}-\\d{2}")
[1] "2024-01-15" NA           NA          
# Better: Handle multiple formats
case_when(
  str_detect(mixed_dates, "\\d{4}-\\d{2}-\\d{2}") ~ "ISO format",
  str_detect(mixed_dates, "\\d{2}/\\d{2}/\\d{4}") ~ "US format",
  str_detect(mixed_dates, "[A-Za-z]+ \\d{1,2}, \\d{4}") ~ "Written format",
  TRUE ~ "Unknown format"
)
[1] "ISO format"     "US format"      "Written format"
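To actually parse such mixed formats (rather than just classifying them), lubridate's parse_date_time() tries a list of candidate orders; a sketch:

```r
library(lubridate)

mixed_dates <- c("2024-01-15", "01/15/2024", "Jan 15, 2024")

# "ymd" handles the ISO string; "mdy" handles both the US and written formats
parse_date_time(mixed_dates, orders = c("ymd", "mdy"))
# [1] "2024-01-15 UTC" "2024-01-15 UTC" "2024-01-15 UTC"
```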

Exercises

Exercise 1: Email Validation and Cleaning

Given a list of messy email addresses, write code to:

  1. Clean and standardize the format
  2. Identify valid vs. invalid emails
  3. Extract domain names and categorize them

Exercise 2: Product Name Standardization

You have product names in various formats. Create functions to:

  1. Extract product categories, brands, and specifications
  2. Standardize naming conventions
  3. Identify products with missing information

Exercise 3: Text Analysis

Analyze customer reviews to:

  1. Calculate sentiment scores based on positive/negative words
  2. Extract key features mentioned (price, quality, service, etc.)
  3. Identify the most common complaint topics

Exercise 4: Data Validation Pipeline

Create a comprehensive data validation system that:

  1. Validates phone numbers, emails, and postal codes
  2. Standardizes name formats
  3. Flags potentially problematic entries for manual review

Summary

The stringr package is essential for working with text data in R:

Key Functions:

  • Basic operations: str_length(), str_sub(), str_trim(), str_pad()
  • Case conversion: str_to_lower(), str_to_upper(), str_to_title()
  • Pattern matching: str_detect(), str_count(), str_locate()
  • Replacement: str_replace(), str_replace_all()
  • Splitting/joining: str_split(), str_c()
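These functions compose naturally in a pipeline; a minimal end-to-end sketch (made-up input):

```r
library(stringr)

raw <- c("  jOHN   smith ", " MARY-JOHNSON ")

cleaned <- raw |>
  str_squish() |>                  # collapse internal/leading/trailing whitespace
  str_replace_all("-", " ") |>     # normalize separators
  str_to_title()                   # consistent casing

cleaned
# [1] "John Smith"   "Mary Johnson"
```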

Best Practices:

  • Use vectorized operations for performance
  • Escape special regex characters properly
  • Handle missing values explicitly
  • Test regex patterns thoroughly
  • Consider internationalization for global data

Regular Expressions:

  • Learn common patterns for emails, phones, dates
  • Use regex() function for case-insensitive matching
  • Practice with simple patterns before complex ones
  • Test patterns with edge cases

String manipulation is a crucial skill for data cleaning and preparation. With stringr, you can handle even the messiest text data efficiently and elegantly!

Next: Managing Factors with forcats