Text data is everywhere in real-world datasets - names, addresses, comments, categories, and more. The stringr package provides a consistent, intuitive set of functions for working with strings that integrates seamlessly with the tidyverse.
library(tidyverse)
library(stringr)

# All stringr functions start with str_ for easy discovery
cat("Key stringr functions we'll explore:\n")
Key stringr functions we'll explore:
cat("- str_length(): get string length\n")
- str_length(): get string length
cat("- str_sub(): extract substrings\n")
- str_sub(): extract substrings
cat("- str_detect(): find patterns\n")
- str_detect(): find patterns
cat("- str_replace(): replace patterns\n")
- str_replace(): replace patterns
cat("- str_split(): split strings\n")
- str_split(): split strings
cat("- str_trim(): remove whitespace\n")
- str_trim(): remove whitespace
cat("- str_to_*(): change case\n")
- str_to_*(): change case
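As a quick orientation, the functions listed above can each be tried on a small vector. This is a minimal sketch; the sample values in `x` are hypothetical and chosen only to exercise each call.

```r
library(stringr)

# Hypothetical sample values, not taken from the datasets used later
x <- c("  apple pie ", "Banana split", "cherry-tart")

str_length(x)              # number of characters, spaces included
str_sub(x, 1, 3)           # first three characters of each string
str_detect(x, "an")        # TRUE where the pattern "an" occurs
str_replace(x, "-", " ")   # replace the first hyphen with a space
str_split("a,b,c", ",")    # list holding c("a", "b", "c")
str_trim(x)                # strip leading and trailing whitespace
str_to_upper(x)            # one member of the str_to_*() family
```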
Basic String Operations
String Length and Subsetting
# Sample text data
customer_names <- c("John Smith", "Mary Johnson", "Bob O'Connor", "李小明", "José García")
product_codes <- c("ABC-123", "XYZ-456", "DEF-789", "GHI-012")

# Get string length
str_length(customer_names)
[1] 10 12 12 3 11
str_length(product_codes)
[1] 7 7 7 7
# Extract substrings
str_sub(customer_names, 1, 4) # First 4 characters
[1] "John" "Mary" "Bob " "李小明" "José"
str_sub(customer_names, -5, -1) # Last 5 characters
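Negative positions in str_sub() count back from the end of the string. The sketch below applies this to the product codes defined earlier; the variable names `prefix` and `suffix` are only illustrative.

```r
library(stringr)

product_codes <- c("ABC-123", "XYZ-456", "DEF-789", "GHI-012")

# Positive positions count from the start, negative ones from the end
prefix <- str_sub(product_codes, 1, 3)     # letters before the hyphen
suffix <- str_sub(product_codes, -3, -1)   # digits after the hyphen

# Positions that fall outside the string are truncated, not an error
short <- str_sub("abc", -5, -1)
```

Here `prefix` holds "ABC", "XYZ", "DEF", "GHI", `suffix` holds "123", "456", "789", "012", and `short` is simply "abc".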
messy_addresses <- c("123 Main St.", "456 Oak Ave", "789 Pine Street", "101 Elm St")

# Simple replacement
str_replace(messy_addresses, "St\\.", "Street") # Replace first match
[1] "123 Main Street" "456 Oak Ave" "789 Pine Street" "101 Elm St"
str_replace_all(messy_addresses, "\\.", "") # Remove all periods
[1] "123 Main St" "456 Oak Ave" "789 Pine Street" "101 Elm St"
str_replace_all(messy_addresses, "St$", "Street") # Replace St at end
[1] "123 Main St." "456 Oak Ave" "789 Pine Street" "101 Elm Street"
# Sample data with delimiters
csv_data <- c("John,25,Engineer", "Mary,30,Manager", "Bob,28,Analyst")
pipe_data <- c("Apple|Red|Sweet", "Banana|Yellow|Sweet", "Lemon|Yellow|Sour")

# Split by delimiter
str_split(csv_data, ",") # Returns list
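Because str_split() returns a list, it is often more convenient to ask for a matrix when every string has the same number of fields. A minimal sketch, reusing the `csv_data` vector from above:

```r
library(stringr)

csv_data <- c("John,25,Engineer", "Mary,30,Manager", "Bob,28,Analyst")

# Default: a list with one character vector per input string
parts <- str_split(csv_data, ",")
parts[[1]]

# simplify = TRUE returns a character matrix, one row per string
fields <- str_split(csv_data, ",", simplify = TRUE)
fields[, 1]   # the name column

# str_split_fixed() always returns exactly n columns;
# with n = 2, everything after the first comma stays together
first_split <- str_split_fixed(csv_data, ",", n = 2)
```

The matrix form suits regular, delimited records; the list form is the safer choice when the number of pieces varies between strings.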
# Messy customer data
raw_customers <- tibble(
  id = 1:6,
  name = c(" JOHN SMITH ", "mary-johnson", "Bob O'Connor",
           "SARAH WILSON", "mike.davis", "ANNA-MARIA GARCIA"),
  email = c("JOHN@EMAIL.COM", "mary@company.org", "bob@COMPANY.com",
            "sarah.wilson@test.CO.UK", "mike.davis@DOMAIN.com", "anna@example.COM"),
  phone = c("(555) 123-4567", "555.123.4567", "5551234567",
            "555-123-4567", "(555)123-4567", "555 123 4567"),
  address = c("123 main st.", "456 OAK AVE", "789 pine street apt 2",
              "101 ELM ST UNIT B", "202 maple ave.", "303 BIRCH ST")
)

print("Raw customer data:")
[1] "Raw customer data:"
print(raw_customers)
# A tibble: 6 × 5
id name email phone address
<int> <chr> <chr> <chr> <chr>
1 1 " JOHN SMITH " JOHN@EMAIL.COM (555) 123-4567 123 main st.
2 2 "mary-johnson" mary@company.org 555.123.4567 456 OAK AVE
3 3 "Bob O'Connor" bob@COMPANY.com 5551234567 789 pine str…
4 4 "SARAH WILSON" sarah.wilson@test.CO.UK 555-123-4567 101 ELM ST U…
5 5 "mike.davis" mike.davis@DOMAIN.com (555)123-4567 202 maple av…
6 6 "ANNA-MARIA GARCIA" anna@example.COM 555 123 4567 303 BIRCH ST
# Clean the data step by step
cleaned_customers <- raw_customers %>%
  mutate(
    # Clean names
    name_clean = name %>%
      str_trim() %>%                       # Remove leading/trailing spaces
      str_squish() %>%                     # Remove extra internal spaces
      str_replace_all("[.-]", " ") %>%     # Replace hyphens and dots with spaces
      str_to_title(),                      # Convert to title case

    # Clean emails
    email_clean = email %>%
      str_to_lower() %>%                   # Convert to lowercase
      str_trim(),                          # Remove any spaces

    # Clean phone numbers
    phone_clean = phone %>%
      str_replace_all("[^0-9]", "") %>%    # Remove all non-digits
      str_replace("(\\d{3})(\\d{3})(\\d{4})", "(\\1) \\2-\\3"), # Format as (xxx) xxx-xxxx

    # Clean addresses
    address_clean = address %>%
      str_to_title() %>%                   # Convert to title case
      str_replace_all("\\bSt\\b", "Street") %>% # Expand abbreviations
      str_replace_all("\\bAve\\b", "Avenue") %>%
      str_replace_all("\\bApt\\b", "Apartment") %>%
      str_squish()                         # Clean up spaces
  ) %>%
  select(id, name_clean, email_clean, phone_clean, address_clean)

print("Cleaned customer data:")
[1] "Cleaned customer data:"
print(cleaned_customers)
# A tibble: 6 × 5
id name_clean email_clean phone_clean address_clean
<int> <chr> <chr> <chr> <chr>
1 1 John Smith john@email.com (555) 123-4567 123 Main Stree…
2 2 Mary Johnson mary@company.org (555) 123-4567 456 Oak Avenue
3 3 Bob O'connor bob@company.com (555) 123-4567 789 Pine Stree…
4 4 Sarah Wilson sarah.wilson@test.co.uk (555) 123-4567 101 Elm Street…
5 5 Mike Davis mike.davis@domain.com (555) 123-4567 202 Maple Aven…
6 6 Anna Maria Garcia anna@example.com (555) 123-4567 303 Birch Stre…
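Note that the cleaned output contains "Bob O'connor": str_to_title() lowercases the letter after the apostrophe. One possible patch is sketched below. The helper `fix_o_prefix()` is hypothetical, handles only the "O'" prefix, and assumes a stringr version in which str_replace_all() accepts a function as the replacement.

```r
library(stringr)

# Hypothetical helper: re-capitalise the letter that follows an "O'" prefix.
# The replacement function receives the matched text (e.g. "O'c") and
# returns its upper-case form ("O'C").
fix_o_prefix <- function(x) {
  str_replace_all(x, "\\bO'[a-z]", str_to_upper)
}

titled <- str_to_title(c("Bob O'Connor", "SARAH WILSON"))
fixed_names <- fix_o_prefix(titled)
fixed_names
```

Other name patterns (for example "Mc" or "van der") would need their own rules, so a helper like this is best treated as a starting point rather than a complete solution.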
# Customer feedback comments
feedback_data <- tibble(
  customer_id = 1:8,
  comment = c(
    "Great product! Very satisfied with the quality and fast shipping.",
    "Poor quality, arrived damaged. Customer service was unhelpful.",
    "Good value for money. Would recommend to others.",
    "Excellent! Fast delivery and great customer support.",
    "Average product. Nothing special but does the job.",
    "Terrible experience. Product broke after 2 days.",
    "Love it! Best purchase I've made this year.",
    "Okay product but overpriced. Customer service was good though."
  ),
  rating = c(5, 1, 4, 5, 3, 1, 5, 3)
)

print("Customer feedback:")
[1] "Customer feedback:"
print(feedback_data)
# A tibble: 8 × 3
customer_id comment rating
<int> <chr> <dbl>
1 1 Great product! Very satisfied with the quality and fast sh… 5
2 2 Poor quality, arrived damaged. Customer service was unhelp… 1
3 3 Good value for money. Would recommend to others. 4
4 4 Excellent! Fast delivery and great customer support. 5
5 5 Average product. Nothing special but does the job. 3
6 6 Terrible experience. Product broke after 2 days. 1
7 7 Love it! Best purchase I've made this year. 5
8 8 Okay product but overpriced. Customer service was good tho… 3
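Free-text comments like these can be turned into simple features with str_count() and str_detect(). The sketch below uses two of the comments from the table above; the column names `word_count`, `mentions_service`, and `has_exclamation` are illustrative choices, not part of the original dataset.

```r
library(dplyr)
library(stringr)

# Two comments copied from the feedback data, to keep the sketch self-contained
feedback_sample <- tibble(
  customer_id = c(2, 4),
  comment = c(
    "Poor quality, arrived damaged. Customer service was unhelpful.",
    "Excellent! Fast delivery and great customer support."
  )
)

feedback_flags <- feedback_sample %>%
  mutate(
    # Count whitespace-separated tokens
    word_count = str_count(comment, "\\S+"),
    # Case-insensitive check for a service or support mention
    mentions_service = str_detect(
      comment, regex("customer (service|support)", ignore_case = TRUE)
    ),
    # fixed() matches the exclamation mark literally
    has_exclamation = str_detect(comment, fixed("!"))
  )

feedback_flags
```

Flags like these are crude, but they are cheap to compute and are often enough to route comments for closer reading.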
# Text with special characters and different encodings
international_text <- c("Café", "naïve", "résumé", "piñata", "Zürich", "北京", "Москва")

# String length (note: may differ from character count for some encodings)
str_length(international_text)
[1] 4 5 6 6 6 2 6
nchar(international_text)
[1] 4 5 6 6 6 2 6
# Detect and handle different character types
str_detect(international_text, "[^\\x00-\\x7F]") # Non-ASCII characters
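Once non-ASCII strings have been detected, one option is to transliterate accented Latin letters to plain ASCII. The sketch below assumes the stringi package is available (stringr is built on top of it) and uses its "Latin-ASCII" transform; the sample vector `places` is hypothetical.

```r
library(stringr)

# Hypothetical sample mixing accented and plain strings
places <- c("Café", "Zürich", "Paris")

# TRUE where at least one character lies outside the ASCII range
has_non_ascii <- str_detect(places, "[^\\x00-\\x7F]")

# Strip accents from Latin letters; plain ASCII input is left unchanged
ascii_version <- stringi::stri_trans_general(places, "Latin-ASCII")
ascii_version
```

Transliteration loses information, so it is usually better suited to building matching keys than to overwriting the original values.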
# For large datasets, vectorized operations are much faster
large_text <- rep(c("Hello World", "Data Science", "R Programming"), 1000)

# Good: Use vectorized stringr functions
system.time({
  result1 <- str_to_upper(large_text)
})
user system elapsed
0 0 0
# Avoid: Loops for simple operations
system.time({
  result2 <- character(length(large_text))
  for (i in seq_along(large_text)) {
    result2[i] <- toupper(large_text[i])
  }
})
# Mistake 1: Forgetting to escape special regex characters
test_text <- c("price: $29.99", "cost: $15.50", "value: 10.00")

# Wrong: in a regex, $ anchors the end of the string and . matches any character
str_detect(test_text, "$29.99") # This won't work as expected
[1] FALSE FALSE FALSE
# Right: Escape special characters
str_detect(test_text, "\\$29\\.99")
[1] TRUE FALSE FALSE
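When the pattern is meant as literal text rather than a regular expression, wrapping it in fixed() avoids escaping altogether. A minimal sketch using the same `test_text` vector; the extra string "price: $29x99" is a made-up edge case:

```r
library(stringr)

test_text <- c("price: $29.99", "cost: $15.50", "value: 10.00")

# fixed() compares the pattern literally, so $ and . need no escaping
literal_match <- str_detect(test_text, fixed("$29.99"))

# An unescaped dot still matches any character, which can give false positives
loose_match  <- str_detect("price: $29x99", "\\$29.99")         # TRUE
strict_match <- str_detect("price: $29x99", fixed("$29.99"))    # FALSE
```

Here `literal_match` is TRUE only for the first element, matching the escaped-regex result above.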
# Mistake 2: Not handling missing values
messy_data <- c("John Smith", NA, "Mary Johnson", "")

# NA inputs stay NA in the result; the empty string stays empty
str_to_upper(messy_data)
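Missing values propagate through stringr functions, so it helps to decide up front how to treat them. The sketch below shows two possible approaches; the placeholder "Unknown" and the name `is_missing` are arbitrary choices.

```r
library(stringr)

messy_data <- c("John Smith", NA, "Mary Johnson", "")

# NA in, NA out: detection returns NA rather than FALSE
detected <- str_detect(messy_data, "John")

# Option 1: make missing values explicit with a placeholder
filled <- str_replace_na(messy_data, replacement = "Unknown")

# Option 2: flag both NA and empty strings as missing
is_missing <- is.na(messy_data) | messy_data == ""
```

Here `detected` is TRUE, NA, TRUE, FALSE, `filled` replaces only the NA (the empty string is left as is), and `is_missing` is FALSE, TRUE, FALSE, TRUE.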
Given a list of messy email addresses, write code to:
1. Clean and standardize the format
2. Identify valid vs invalid emails
3. Extract domain names and categorize them
Exercise 2: Product Name Standardization
You have product names in various formats. Create functions to:
1. Extract product categories, brands, and specifications
2. Standardize naming conventions
3. Identify products with missing information
Exercise 3: Text Analysis
Analyze customer reviews to:
1. Calculate sentiment scores based on positive/negative words
2. Extract key features mentioned (price, quality, service, etc.)
3. Identify the most common complaint topics
Exercise 4: Data Validation Pipeline
Create a comprehensive data validation system that:
1. Validates phone numbers, emails, and postal codes
2. Standardizes name formats
3. Flags potentially problematic entries for manual review
Summary
The stringr package is essential for working with text data in R:
Use the regex() function with ignore_case = TRUE for case-insensitive matching
Practice with simple patterns before complex ones
Test patterns with edge cases
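The case-insensitive matching tip can be sketched as follows, using three of the raw addresses from the customer example. The pattern `\\bst\\b` is just one possible way to match the abbreviation as a whole word.

```r
library(stringr)

addresses <- c("123 main st.", "456 OAK AVE", "303 BIRCH ST")

# Case-sensitive by default: only the lower-case "st" is found
sensitive <- str_detect(addresses, "st")

# regex(ignore_case = TRUE) matches "st", "St", and "ST" alike
insensitive <- str_detect(addresses, regex("\\bst\\b", ignore_case = TRUE))
```

Here `sensitive` is TRUE, FALSE, FALSE, while `insensitive` is TRUE, FALSE, TRUE.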
String manipulation is a crucial skill for data cleaning and preparation. With stringr, you can handle even the messiest text data efficiently and elegantly!