library(tidyverse)
Data Import with readr
Introduction to readr
The readr package provides a fast and friendly way to read rectangular data (like csv, tsv, and fixed width files) into R. It’s part of the core tidyverse and automatically returns tibbles instead of data frames.
Why Use readr?
Advantages Over Base R
- Speed: typically around 10x faster than base R's read.csv() on large files
- Tibbles: Returns tibbles by default
- Consistency: Consistent naming and behavior
- Progress bars: Shows progress for large files
- Better defaults: More sensible type guessing
- Encoding support: Better handling of character encodings
# Base R
df_base <- read.csv("data.csv") # Returns data.frame
# Strings become factors (pre R 4.0)
# Slower for large files

# readr
df_readr <- read_csv("data.csv") # Returns tibble
# Strings stay as characters
# Much faster
# Shows column specifications
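You can check the speed claim on your own machine. A minimal benchmark sketch (timings vary by hardware; the file is generated on the fly and deleted afterwards):

# Compare base R and readr on a synthetic 1-million-row file
tmp <- tempfile(fileext = ".csv")
write_csv(tibble(x = rnorm(1e6), y = sample(letters, 1e6, replace = TRUE)), tmp)
system.time(read.csv(tmp))  # base R
system.time(read_csv(tmp))  # readr, typically several times faster
unlink(tmp)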
Reading CSV Files
Basic CSV Import
Let’s create a sample CSV file to work with:
# Create a sample CSV file
sample_data <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  age = c(25, 30, 35, 28, 32),
  salary = c(50000, 60000, 75000, 55000, 65000),
  hire_date = c("2020-01-15", "2019-06-01", "2018-03-20",
                "2021-02-10", "2019-11-30")
)
# Write to CSV
write_csv(sample_data, "employees.csv")
# Read it back
<- read_csv("employees.csv")
employees employees
# A tibble: 5 × 5
id name age salary hire_date
<dbl> <chr> <dbl> <dbl> <date>
1 1 Alice 25 50000 2020-01-15
2 2 Bob 30 60000 2019-06-01
3 3 Charlie 35 75000 2018-03-20
4 4 Diana 28 55000 2021-02-10
5 5 Eve 32 65000 2019-11-30
Understanding Column Specifications
readr automatically detects column types:
# readr shows what it detected
<- read_csv("employees.csv", show_col_types = TRUE)
employees
# Get the specification
spec(employees)
cols(
  id = col_double(),
  name = col_character(),
  age = col_double(),
  salary = col_double(),
  hire_date = col_date(format = "")
)
# You can also explicitly set column types
employees_typed <- read_csv(
  "employees.csv",
  col_types = cols(
    id = col_integer(),
    name = col_character(),
    age = col_integer(),
    salary = col_double(),
    hire_date = col_date(format = "%Y-%m-%d")
  )
)
employees_typed
# A tibble: 5 × 5
id name age salary hire_date
<int> <chr> <int> <dbl> <date>
1 1 Alice 25 50000 2020-01-15
2 2 Bob 30 60000 2019-06-01
3 3 Charlie 35 75000 2018-03-20
4 4 Diana 28 55000 2021-02-10
5 5 Eve 32 65000 2019-11-30
Column Type Specifications
Available Column Types
# Create a file with various data types
diverse_data <- tibble(
  integers = c(1, 2, 3),
  doubles = c(1.5, 2.7, 3.9),
  logicals = c(TRUE, FALSE, TRUE),
  characters = c("apple", "banana", "cherry"),
  dates = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  times = hms::hms(hours = c(9, 14, 18), minutes = c(30, 15, 45)),
  datetimes = as.POSIXct(c("2024-01-01 09:30:00",
                           "2024-02-01 14:15:00",
                           "2024-03-01 18:45:00"))
)
write_csv(diverse_data, "diverse_data.csv")
# Read with automatic detection
<- read_csv("diverse_data.csv")
auto_read glimpse(auto_read)
Rows: 3
Columns: 7
$ integers <dbl> 1, 2, 3
$ doubles <dbl> 1.5, 2.7, 3.9
$ logicals <lgl> TRUE, FALSE, TRUE
$ characters <chr> "apple", "banana", "cherry"
$ dates <date> 2024-01-01, 2024-02-01, 2024-03-01
$ times <time> 09:30:00, 14:15:00, 18:45:00
$ datetimes <dttm> 2024-01-01 09:30:00, 2024-02-01 14:15:00, 2024-03-01 18:45:…
Compact Column Specification
Use a compact string specification:
# Use single letters for common types
# i = integer, d = double, c = character, l = logical
# D = date, T = datetime, t = time
data_compact <- read_csv(
  "diverse_data.csv",
  col_types = "idlcDtT" # One letter per column
)
glimpse(data_compact)
Rows: 3
Columns: 7
$ integers <int> 1, 2, 3
$ doubles <dbl> 1.5, 2.7, 3.9
$ logicals <lgl> TRUE, FALSE, TRUE
$ characters <chr> "apple", "banana", "cherry"
$ dates <date> 2024-01-01, 2024-02-01, 2024-03-01
$ times <time> 09:30:00, 14:15:00, 18:45:00
$ datetimes <dttm> 2024-01-01 09:30:00, 2024-02-01 14:15:00, 2024-03-01 18:45:…
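When a file has many columns, you can also set a default type and override only the exceptions. A small sketch against the same file (which column to override is just an illustrative choice):

# Default everything to character, overriding a single column
data_default <- read_csv(
  "diverse_data.csv",
  col_types = cols(
    .default = col_character(),
    doubles = col_double()
  )
)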
Handling Common Data Issues
Missing Values
# Create data with missing values
<- "id,value,category
missing_data 1,10,A
2,NA,B
3,20,NA
4,,C
5,30,D"
# Default: NA and empty strings are treated as missing
# I() marks the string as literal data (required in readr >= 2.0)
df_default <- read_csv(I(missing_data))
df_default
# A tibble: 5 × 3
id value category
<dbl> <dbl> <chr>
1 1 10 A
2 2 NA B
3 3 20 <NA>
4 4 NA C
5 5 30 D
# Specify custom NA values
df_custom <- read_csv(
  I(missing_data),
  na = c("", "NA", "N/A", "missing")
)
df_custom
# A tibble: 5 × 3
id value category
<dbl> <dbl> <chr>
1 1 10 A
2 2 NA B
3 3 20 <NA>
4 4 NA C
5 5 30 D
Parsing Problems
# Data with parsing issues
<- "number,text
problematic_data 1,apple
two,banana
3,cherry
4.5,date"
# Read and check for problems
df_problems <- read_csv(I(problematic_data))

# List any parsing problems (empty here: readr saw the mixed values
# while guessing and fell back to character for the number column)
problems(df_problems)
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>
# Fix by specifying correct types
df_fixed <- read_csv(
  I(problematic_data),
  col_types = cols(
    number = col_character(), # Read as character to preserve all data
    text = col_character()
  )
)
df_fixed
# A tibble: 4 × 2
number text
<chr> <chr>
1 1 apple
2 two banana
3 3 cherry
4 4.5 date
Skip Rows and Comments
# Data with header information
<- "# Company Sales Report
messy_file # Generated: 2024-01-01
# Department: Analytics
id,product,sales
1,Widget A,1000
2,Widget B,1500
# Mid-year update
3,Widget C,2000"
# Skip lines and ignore comments
clean_data <- read_csv(
  I(messy_file),
  skip = 3,     # Skip first 3 lines
  comment = "#" # Ignore lines starting with #
)
clean_data
# A tibble: 3 × 3
id product sales
<dbl> <chr> <dbl>
1 1 Widget A 1000
2 2 Widget B 1500
3 3 Widget C 2000
Reading Other File Formats
TSV Files (Tab-Separated Values)
# Create a TSV file
tsv_data <- tibble(
  gene = c("BRCA1", "TP53", "EGFR"),
  chromosome = c(17, 17, 7),
  gene_function = c("DNA repair", "Tumor suppressor", "Growth factor receptor")
)
write_tsv(tsv_data, "genes.tsv")
# Read TSV
<- read_tsv("genes.tsv")
genes genes
# A tibble: 3 × 3
gene chromosome gene_function
<chr> <dbl> <chr>
1 BRCA1 17 DNA repair
2 TP53 17 Tumor suppressor
3 EGFR 7 Growth factor receptor
Delimited Files
# Create a pipe-delimited file
<- "name|age|city
pipe_data Alice|25|New York
Bob|30|Los Angeles
Charlie|35|Chicago"
write_file(pipe_data, "pipe_data.txt")
# Read with custom delimiter
<- read_delim("pipe_data.txt", delim = "|")
df_pipe df_pipe
# A tibble: 3 × 3
name age city
<chr> <dbl> <chr>
1 Alice 25 New York
2 Bob 30 Los Angeles
3 Charlie 35 Chicago
# Semicolon-delimited (common in European data)
<- "product;price;quantity
semi_data Apple;1,50;100
Banana;0,75;150
Orange;2,00;80"
write_file(semi_data, "semi_data.csv")
# Read with semicolon delimiter and European decimal mark
df_semi <- read_delim(
  "semi_data.csv",
  delim = ";",
  locale = locale(decimal_mark = ",")
)
df_semi
# A tibble: 3 × 3
product price quantity
<chr> <dbl> <dbl>
1 Apple 1.5 100
2 Banana 0.75 150
3 Orange 2 80
Fixed Width Files
# Create a fixed width file
<- "001John Smith 25 Engineer
fwf_data 002Jane Doe 30 Manager
003Bob Johnson 28 Analyst"
write_file(fwf_data, "fixed_width.txt")
# Read fixed width file with column positions
df_fwf <- read_fwf(
  "fixed_width.txt",
  col_positions = fwf_widths(
    widths = c(3, 8, 8, 4, 10),
    col_names = c("id", "first", "last", "age", "job")
  )
)
df_fwf
# A tibble: 3 × 5
id first last age job
<chr> <chr> <chr> <dbl> <chr>
1 001 John Smith 25 Engineer
2 002 Jane Doe 30 Manager
3 003 Bob Johnson 28 Analyst
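The same layout can also be described with explicit start and end positions via fwf_positions(), which is handy when you only need a few fields. An equivalent sketch (positions derived from the widths above):

# Start/end positions equivalent to widths c(3, 8, 8, 4, 10)
df_fwf2 <- read_fwf(
  "fixed_width.txt",
  col_positions = fwf_positions(
    start = c(1, 4, 12, 20, 24),
    end = c(3, 11, 19, 23, 33),
    col_names = c("id", "first", "last", "age", "job")
  )
)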
Working with Large Files
Reading in Chunks
# Create a larger file
large_data <- tibble(
  id = 1:10000,
  value = rnorm(10000),
  category = sample(LETTERS[1:5], 10000, replace = TRUE)
)
write_csv(large_data, "large_file.csv")
# Read in chunks for processing
process_chunk <- function(chunk, pos) {
  # Process each chunk; pos is the row number where the chunk starts
  summarize(chunk,
    chunk = pos,
    n = n(),
    mean_value = mean(value)
  )
}
# Process file in chunks of 2000 rows
chunk_results <- read_csv_chunked(
  "large_file.csv",
  chunk_size = 2000,
  callback = DataFrameCallback$new(process_chunk)
)
chunk_results
# A tibble: 5 × 3
chunk n mean_value
<int> <int> <dbl>
1 1 2000 0.0145
2 2001 2000 0.00626
3 4001 2000 -0.0105
4 6001 2000 -0.0333
5 8001 2000 0.00498
Lazy Reading
# For very large files, use lazy reading
# This doesn't load the entire file into memory
library(vroom) # Alternative package for very large files
# Lazy read - data is only read when needed
<- vroom("very_large_file.csv")
lazy_df
# Operations are performed lazily
<- lazy_df %>%
result filter(category == "A") %>%
group_by(year) %>%
summarize(total = sum(value))
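Since readr 2.0, which uses vroom internally, you can get the same lazy behavior without loading an extra package. A sketch (same hypothetical file as above):

# readr >= 2.0: opt into lazy reading directly
lazy_df2 <- read_csv("very_large_file.csv", lazy = TRUE)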
Advanced Import Options
Locale Settings
# European format data
<- "date;amount;percentage
euro_data 01/02/2024;1.234,56;45,5%
15/03/2024;2.345,67;32,1%
30/04/2024;3.456,78;28,9%"
write_file(euro_data, "euro_data.csv")
# Read with European locale
euro_df <- read_delim(
  "euro_data.csv",
  delim = ";",
  locale = locale(
    date_format = "%d/%m/%Y",
    decimal_mark = ",",
    grouping_mark = "."
  )
)
euro_df
# A tibble: 3 × 3
date amount percentage
<date> <dbl> <chr>
1 2024-02-01 1235. 45,5%
2 2024-03-15 2346. 32,1%
3 2024-04-30 3457. 28,9%
Column Selection
# Create data with many columns
wide_data <- tibble(
  id = 1:5,
  name = c("A", "B", "C", "D", "E"),
  col1 = rnorm(5),
  col2 = rnorm(5),
  col3 = rnorm(5),
  col4 = rnorm(5),
  col5 = rnorm(5)
)
write_csv(wide_data, "wide_data.csv")
# Read only specific columns
selected <- read_csv(
  "wide_data.csv",
  col_select = c(id, name, col1, col3)
)
selected
# A tibble: 5 × 4
id name col1 col3
<dbl> <chr> <dbl> <dbl>
1 1 A 1.41 -0.267
2 2 B -0.984 -0.198
3 3 C -1.23 -0.277
4 4 D 1.07 -0.159
5 5 E -0.172 0.135
# Use tidyselect helpers
selected2 <- read_csv(
  "wide_data.csv",
  col_select = c(id, name, starts_with("col"))
)
head(selected2, 2)
# A tibble: 2 × 7
id name col1 col2 col3 col4 col5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A 1.41 -0.746 -0.267 0.495 0.727
2 2 B -0.984 -0.0287 -0.198 0.428 0.785
Writing Data
Writing CSV Files
# Create sample data
output_data <- tibble(
  id = 1:5,
  value = rnorm(5),
  date = Sys.Date() + 0:4,
  timestamp = Sys.time() + 0:4 * 3600
)
# Write CSV (no row names, readable format)
write_csv(output_data, "output.csv")
# Write with European format
write_csv2(output_data, "output_euro.csv") # Uses semicolon delimiter
# Write TSV
write_tsv(output_data, "output.tsv")
# Write with custom NA representation
data_with_na <- tibble(
  x = c(1, NA, 3),
  y = c("a", "b", NA)
)
write_csv(data_with_na, "with_na.csv", na = "missing")
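If a CSV is destined for Excel, consider write_excel_csv(), which adds a UTF-8 byte-order mark so Excel detects the encoding correctly (the output file name here is arbitrary):

# CSV with a BOM for reliable opening in Excel
write_excel_csv(output_data, "output_excel.csv")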
Appending to Files
# Initial data
initial <- tibble(id = 1:3, value = c(10, 20, 30))
write_csv(initial, "append_test.csv")
# New data to append
new_data <- tibble(id = 4:6, value = c(40, 50, 60))
# Append to existing file
write_csv(new_data, "append_test.csv", append = TRUE)
# Read the combined file
read_csv("append_test.csv")
# A tibble: 6 × 2
id value
<dbl> <dbl>
1 1 10
2 2 20
3 3 30
4 4 40
5 5 50
6 6 60
Best Practices
1. Always Inspect Your Data
# After reading, always check:
<- read_csv("employees.csv")
data
# Structure
glimpse(data)
Rows: 5
Columns: 5
$ id <dbl> 1, 2, 3, 4, 5
$ name <chr> "Alice", "Bob", "Charlie", "Diana", "Eve"
$ age <dbl> 25, 30, 35, 28, 32
$ salary <dbl> 50000, 60000, 75000, 55000, 65000
$ hire_date <date> 2020-01-15, 2019-06-01, 2018-03-20, 2021-02-10, 2019-11-30
# First few rows
head(data)
# A tibble: 5 × 5
id name age salary hire_date
<dbl> <chr> <dbl> <dbl> <date>
1 1 Alice 25 50000 2020-01-15
2 2 Bob 30 60000 2019-06-01
3 3 Charlie 35 75000 2018-03-20
4 4 Diana 28 55000 2021-02-10
5 5 Eve 32 65000 2019-11-30
# Summary statistics
summary(data)
id name age salary
Min. :1 Length:5 Min. :25 Min. :50000
1st Qu.:2 Class :character 1st Qu.:28 1st Qu.:55000
Median :3 Mode :character Median :30 Median :60000
Mean :3 Mean :30 Mean :61000
3rd Qu.:4 3rd Qu.:32 3rd Qu.:65000
Max. :5 Max. :35 Max. :75000
hire_date
Min. :2018-03-20
1st Qu.:2019-06-01
Median :2019-11-30
Mean :2019-09-27
3rd Qu.:2020-01-15
Max. :2021-02-10
# Check for parsing problems
problems(data)
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>
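In non-interactive scripts you may prefer to fail fast rather than inspect manually; stop_for_problems() raises an error if any parsing problems were recorded:

# Abort immediately if anything failed to parse
stop_for_problems(data)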
2. Be Explicit About Types
# Good: Explicit column types
data <- read_csv(
  "file.csv",
  col_types = cols(
    id = col_integer(),
    amount = col_double(),
    date = col_date(format = "%Y-%m-%d"),
    category = col_factor(levels = c("A", "B", "C"))
  )
)
# This prevents surprises and makes code more robust
3. Handle Encodings Properly
# For files with special characters
data <- read_csv(
  "international_data.csv",
  locale = locale(encoding = "UTF-8")
)
# Common encodings: UTF-8, Latin1, Windows-1252
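If you don't know a file's encoding, readr can guess it from a sample of the raw bytes (same hypothetical file name as above):

# Returns candidate encodings with confidence scores
guess_encoding("international_data.csv")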
Common Patterns
Pattern 1: Import and Clean
# Common workflow: import, clean, transform
<- read_csv("employees.csv") %>%
clean_data # Remove any rows with missing values
drop_na() %>%
# Parse dates properly
mutate(
hire_date = as.Date(hire_date),
years_employed = as.numeric(Sys.Date() - hire_date) / 365.25
%>%
) # Filter and arrange
filter(age >= 25) %>%
arrange(desc(salary))
clean_data
# A tibble: 5 × 6
id name age salary hire_date years_employed
<dbl> <chr> <dbl> <dbl> <date> <dbl>
1 3 Charlie 35 75000 2018-03-20 7.51
2 5 Eve 32 65000 2019-11-30 5.81
3 2 Bob 30 60000 2019-06-01 6.31
4 4 Diana 28 55000 2021-02-10 4.61
5 1 Alice 25 50000 2020-01-15 5.69
Pattern 2: Multiple File Import
# Create multiple files
for (i in 1:3) {
  tibble(
    file_id = i,
    value = rnorm(5),
    category = sample(c("A", "B"), 5, replace = TRUE)
  ) %>%
    write_csv(paste0("data_", i, ".csv"))
}
# Read all files at once
all_files <- list.files(pattern = "data_\\d+\\.csv", full.names = TRUE)

combined_data <- all_files %>%
  map_dfr(read_csv, show_col_types = FALSE) %>%
  arrange(file_id)
combined_data
# A tibble: 15 × 3
file_id value category
<dbl> <dbl> <chr>
1 1 0.410 A
2 1 0.653 B
3 1 0.187 B
4 1 1.80 B
5 1 1.30 A
6 2 0.907 A
7 2 0.492 B
8 2 -1.11 B
9 2 -0.786 B
10 2 0.418 B
11 3 0.284 A
12 3 1.28 B
13 3 -1.34 A
14 3 -0.0405 B
15 3 0.368 A
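Since readr 2.0, read_csv() also accepts a vector of paths directly; the id argument records each row's source file (the column name "source" is our choice):

# One call replaces the map_dfr() step; 'source' holds the file path
combined_direct <- read_csv(all_files, id = "source", show_col_types = FALSE)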
Exercises
Exercise 1: Basic Import
- Create a CSV file with student grades
- Read it with appropriate column types
- Check for any parsing problems
Exercise 2: Messy Data
Create a CSV file with:
- Comment lines starting with #
- Empty rows
- Mixed date formats
- Custom NA values
Then read it correctly using readr options.
Exercise 3: Large File Simulation
- Generate a CSV with 100,000 rows
- Read only specific columns
- Process it in chunks to calculate group summaries
Exercise 4: International Data
Create a CSV file with European formatting (semicolon delimiter, comma decimal) and read it properly.
Cleanup
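To leave the workspace tidy, delete the files created in this chapter (a sketch; adjust the list if you skipped any examples):

# Remove the temporary files written above
file.remove(c(
  "employees.csv", "diverse_data.csv", "genes.tsv", "pipe_data.txt",
  "semi_data.csv", "fixed_width.txt", "large_file.csv", "euro_data.csv",
  "wide_data.csv", "output.csv", "output_euro.csv", "output.tsv",
  "output_excel.csv", "with_na.csv", "append_test.csv",
  paste0("data_", 1:3, ".csv")
))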
Summary
The readr package provides a modern, fast, and user-friendly approach to data import:
- Consistent API: All read functions work similarly
- Automatic type detection: Smart guessing with explicit overrides
- Better defaults: Returns tibbles, preserves strings
- Helpful feedback: Shows column specs and parsing problems
- Performance: Much faster than base R for large files
- Flexibility: Handles various formats and encodings
Mastering readr is essential for efficient data import in the tidyverse ecosystem. Combined with dplyr and tidyr, it forms the foundation of most data analysis workflows.
Next, we’ll explore data transformation with dplyr!