Data Import with readr
library(tidyverse)
Introduction to readr
The readr package provides a fast and friendly way to read rectangular data (like CSV, TSV, and fixed-width files) into R. It’s part of the core tidyverse and automatically returns tibbles instead of data frames.
Why Use readr?
Advantages Over Base R
- Speed: ~10x faster than base R functions
- Tibbles: Returns tibbles by default
- Consistency: Consistent naming and behavior
- Progress bars: Shows progress for large files
- Better defaults: More sensible type guessing
- Encoding support: Better handling of character encodings
# Base R
df_base <- read.csv("data.csv") # Returns data.frame
# Strings become factors (pre R 4.0)
# Slower for large files
# readr
df_readr <- read_csv("data.csv") # Returns tibble
# Strings stay as characters
# Much faster
# Shows column specifications
Reading CSV Files
Basic CSV Import
Let’s create a sample CSV file to work with:
# Create a sample CSV file
sample_data <- tibble(
  id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  age = c(25, 30, 35, 28, 32),
  salary = c(50000, 60000, 75000, 55000, 65000),
  hire_date = c("2020-01-15", "2019-06-01", "2018-03-20",
                "2021-02-10", "2019-11-30")
)
# Write to CSV
write_csv(sample_data, "employees.csv")
# Read it back
employees <- read_csv("employees.csv")
employees
# A tibble: 5 × 5
id name age salary hire_date
<dbl> <chr> <dbl> <dbl> <date>
1 1 Alice 25 50000 2020-01-15
2 2 Bob 30 60000 2019-06-01
3 3 Charlie 35 75000 2018-03-20
4 4 Diana 28 55000 2021-02-10
5 5 Eve 32 65000 2019-11-30
Understanding Column Specifications
readr automatically detects column types:
# readr shows what it detected
employees <- read_csv("employees.csv", show_col_types = TRUE)
# Get the specification
spec(employees)
cols(
  id = col_double(),
  name = col_character(),
  age = col_double(),
  salary = col_double(),
  hire_date = col_date(format = "")
)
# You can also explicitly set column types
employees_typed <- read_csv(
  "employees.csv",
  col_types = cols(
    id = col_integer(),
    name = col_character(),
    age = col_integer(),
    salary = col_double(),
    hire_date = col_date(format = "%Y-%m-%d")
  )
)
employees_typed
# A tibble: 5 × 5
id name age salary hire_date
<int> <chr> <int> <dbl> <date>
1 1 Alice 25 50000 2020-01-15
2 2 Bob 30 60000 2019-06-01
3 3 Charlie 35 75000 2018-03-20
4 4 Diana 28 55000 2021-02-10
5 5 Eve 32 65000 2019-11-30
Column Type Specifications
Available Column Types
# Create a file with various data types
diverse_data <- tibble(
  integers = c(1, 2, 3),
  doubles = c(1.5, 2.7, 3.9),
  logicals = c(TRUE, FALSE, TRUE),
  characters = c("apple", "banana", "cherry"),
  dates = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  times = hms::hms(hours = c(9, 14, 18), minutes = c(30, 15, 45)),
  datetimes = as.POSIXct(c("2024-01-01 09:30:00",
                           "2024-02-01 14:15:00",
                           "2024-03-01 18:45:00"))
)
write_csv(diverse_data, "diverse_data.csv")
# Read with automatic detection
auto_read <- read_csv("diverse_data.csv")
glimpse(auto_read)
Rows: 3
Columns: 7
$ integers <dbl> 1, 2, 3
$ doubles <dbl> 1.5, 2.7, 3.9
$ logicals <lgl> TRUE, FALSE, TRUE
$ characters <chr> "apple", "banana", "cherry"
$ dates <date> 2024-01-01, 2024-02-01, 2024-03-01
$ times <time> 09:30:00, 14:15:00, 18:45:00
$ datetimes <dttm> 2024-01-01 09:30:00, 2024-02-01 14:15:00, 2024-03-01 18:45:…
Compact Column Specification
Use a compact string specification:
# Use single letters for common types
# i = integer, d = double, c = character, l = logical
# D = date, T = datetime, t = time
data_compact <- read_csv(
  "diverse_data.csv",
  col_types = "idlcDtT" # One letter per column
)
glimpse(data_compact)
Rows: 3
Columns: 7
$ integers <int> 1, 2, 3
$ doubles <dbl> 1.5, 2.7, 3.9
$ logicals <lgl> TRUE, FALSE, TRUE
$ characters <chr> "apple", "banana", "cherry"
$ dates <date> 2024-01-01, 2024-02-01, 2024-03-01
$ times <time> 09:30:00, 14:15:00, 18:45:00
$ datetimes <dttm> 2024-01-01 09:30:00, 2024-02-01 14:15:00, 2024-03-01 18:45:…
Handling Common Data Issues
Missing Values
# Create data with missing values
missing_data <- "id,value,category
1,10,A
2,NA,B
3,20,NA
4,,C
5,30,D"
# Default: NA and empty strings are treated as missing
df_default <- read_csv(I(missing_data)) # I() marks the string as literal data
df_default
# A tibble: 5 × 3
id value category
<dbl> <dbl> <chr>
1 1 10 A
2 2 NA B
3 3 20 <NA>
4 4 NA C
5 5 30 D
# Specify custom NA values
df_custom <- read_csv(
  I(missing_data),
  na = c("", "NA", "N/A", "missing")
)
df_custom
# A tibble: 5 × 3
id value category
<dbl> <dbl> <chr>
1 1 10 A
2 2 NA B
3 3 20 <NA>
4 4 NA C
5 5 30 D
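Note the result is identical to the default read, because this sample only uses "NA" and empty fields. Here is a small sketch, with hypothetical inline data, where the extra NA strings actually matter:
# "N/A" and "missing" would otherwise be read as literal text
messy_na <- "id,score
1,N/A
2,85
3,missing"
read_csv(I(messy_na), na = c("", "NA", "N/A", "missing"))
# score parses as double, with rows 1 and 3 becoming NA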
Parsing Problems
# Data with parsing issues
problematic_data <- "number,text
1,apple
two,banana
3,cherry
4.5,date"
# Read and check for problems
df_problems <- read_csv(I(problematic_data))
# Check for problems: none are reported here, because readr guessed the
# mixed column as character instead of failing to parse it
problems(df_problems)
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>
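To see problems() report something, force the mixed column to a numeric type. This is a sketch; the exact problems() output depends on your readr version:
# Forcing col_double() makes the non-numeric value "two" fail to parse
df_forced <- read_csv(
  I(problematic_data),
  col_types = cols(number = col_double(), text = col_character())
)
problems(df_forced) # should list the row where "two" could not be parsed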
# Fix by specifying correct types
df_fixed <- read_csv(
  I(problematic_data),
  col_types = cols(
    number = col_character(), # Read as character to preserve all data
    text = col_character()
  )
)
df_fixed
# A tibble: 4 × 2
number text
<chr> <chr>
1 1 apple
2 two banana
3 3 cherry
4 4.5 date
Skip Rows and Comments
# Data with header information
messy_file <- "# Company Sales Report
# Generated: 2024-01-01
# Department: Analytics
id,product,sales
1,Widget A,1000
2,Widget B,1500
# Mid-year update
3,Widget C,2000"
# Skip lines and ignore comments
clean_data <- read_csv(
  I(messy_file),
  skip = 3,      # Skip the first 3 header lines
  comment = "#"  # Ignore lines starting with #
)
clean_data
# A tibble: 3 × 3
id product sales
<dbl> <chr> <dbl>
1 1 Widget A 1000
2 2 Widget B 1500
3 3 Widget C 2000
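If the skipped header lines contain metadata worth keeping, you can pull them out separately. A minimal sketch using read_lines(), again wrapping the literal data in I():
# Grab the three metadata lines before the table starts
header_lines <- read_lines(I(messy_file), n_max = 3)
header_lines
# e.g. "# Company Sales Report" "# Generated: 2024-01-01" "# Department: Analytics"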
Reading Other File Formats
TSV Files (Tab-Separated Values)
# Create a TSV file
tsv_data <- tibble(
  gene = c("BRCA1", "TP53", "EGFR"),
  chromosome = c(17, 17, 7),
  gene_function = c("DNA repair", "Tumor suppressor", "Growth factor receptor")
)
write_tsv(tsv_data, "genes.tsv")
# Read TSV
genes <- read_tsv("genes.tsv")
genes
# A tibble: 3 × 3
gene chromosome gene_function
<chr> <dbl> <chr>
1 BRCA1 17 DNA repair
2 TP53 17 Tumor suppressor
3 EGFR 7 Growth factor receptor
Delimited Files
# Create a pipe-delimited file
pipe_data <- "name|age|city
Alice|25|New York
Bob|30|Los Angeles
Charlie|35|Chicago"
write_file(pipe_data, "pipe_data.txt")
# Read with custom delimiter
df_pipe <- read_delim("pipe_data.txt", delim = "|")
df_pipe
# A tibble: 3 × 3
name age city
<chr> <dbl> <chr>
1 Alice 25 New York
2 Bob 30 Los Angeles
3 Charlie 35 Chicago
# Semicolon-delimited (common in European data)
semi_data <- "product;price;quantity
Apple;1,50;100
Banana;0,75;150
Orange;2,00;80"
write_file(semi_data, "semi_data.csv")
# Read with semicolon delimiter and European decimal mark
df_semi <- read_delim(
  "semi_data.csv",
  delim = ";",
  locale = locale(decimal_mark = ",")
)
df_semi
# A tibble: 3 × 3
product price quantity
<chr> <dbl> <dbl>
1 Apple 1.5 100
2 Banana 0.75 150
3 Orange 2 80
Fixed Width Files
# Create a fixed width file
fwf_data <- "001John Smith 25 Engineer
002Jane Doe 30 Manager
003Bob Johnson 28 Analyst"
write_file(fwf_data, "fixed_width.txt")
# Read fixed width file with column positions
df_fwf <- read_fwf(
  "fixed_width.txt",
  col_positions = fwf_widths(
    widths = c(3, 8, 8, 4, 10),
    col_names = c("id", "first", "last", "age", "job")
  )
)
df_fwf
# A tibble: 3 × 5
id first last age job
<chr> <chr> <chr> <dbl> <chr>
1 001 John Smith 25 Engineer
2 002 Jane Doe 30 Manager
3 003 Bob Johnson 28 Analyst
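When widths are awkward to count, fwf_positions() lets you give explicit start and end positions instead. A sketch against the same file (positions assume the padded layout written above):
# Alternative: start/end positions; NA means "read to end of line"
df_fwf2 <- read_fwf(
  "fixed_width.txt",
  col_positions = fwf_positions(
    start = c(1, 4, 12, 20, 24),
    end = c(3, 11, 19, 23, NA),
    col_names = c("id", "first", "last", "age", "job")
  )
)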
Working with Large Files
Reading in Chunks
# Create a larger file
large_data <- tibble(
  id = 1:10000,
  value = rnorm(10000),
  category = sample(LETTERS[1:5], 10000, replace = TRUE)
)
write_csv(large_data, "large_file.csv")
# Read in chunks for processing
process_chunk <- function(chunk, pos) {
  # Summarize each chunk; pos is the row position where the chunk starts
  summarize(chunk,
            chunk = pos,
            n = n(),
            mean_value = mean(value))
}
# Process file in chunks of 2000 rows
chunk_results <- read_csv_chunked(
  "large_file.csv",
  chunk_size = 2000,
  callback = DataFrameCallback$new(process_chunk)
)
chunk_results
# A tibble: 5 × 3
chunk n mean_value
<int> <int> <dbl>
1 1 2000 0.0145
2 2001 2000 0.00626
3 4001 2000 -0.0105
4 6001 2000 -0.0333
5 8001 2000 0.00498
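For a quick look at a large file without loading all of it, you can also cap the number of rows with n_max; a small sketch:
# Peek at the first 100 rows only
preview <- read_csv("large_file.csv", n_max = 100, show_col_types = FALSE)
dim(preview)
# [1] 100   3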
Lazy Reading
# For very large files, use lazy reading
# This doesn't load the entire file into memory
library(vroom) # Alternative package for very large files
# Lazy read - data is only read when needed
lazy_df <- vroom("very_large_file.csv")
# Operations are performed lazily
result <- lazy_df %>%
  filter(category == "A") %>%
  group_by(year) %>%
  summarize(total = sum(value))
Advanced Import Options
Locale Settings
# European format data
euro_data <- "date;amount;percentage
01/02/2024;1.234,56;45,5%
15/03/2024;2.345,67;32,1%
30/04/2024;3.456,78;28,9%"
write_file(euro_data, "euro_data.csv")
# Read with European locale
euro_df <- read_delim(
  "euro_data.csv",
  delim = ";",
  locale = locale(
    date_format = "%d/%m/%Y",
    decimal_mark = ",",
    grouping_mark = "."
  )
)
euro_df
# A tibble: 3 × 3
date amount percentage
<date> <dbl> <chr>
1 2024-02-01 1235. 45,5%
2 2024-03-15 2346. 32,1%
3 2024-04-30 3457. 28,9%
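The percentage column stayed character because of the trailing % sign. readr’s parse_number(), which ignores non-numeric prefixes and suffixes, can finish the job; a sketch:
# Convert "45,5%"-style strings to numbers
euro_df %>%
  mutate(percentage = parse_number(percentage,
                                   locale = locale(decimal_mark = ",")))
# percentage becomes 45.5, 32.1, 28.9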
Column Selection
# Create data with many columns
wide_data <- tibble(
  id = 1:5,
  name = c("A", "B", "C", "D", "E"),
  col1 = rnorm(5),
  col2 = rnorm(5),
  col3 = rnorm(5),
  col4 = rnorm(5),
  col5 = rnorm(5)
)
write_csv(wide_data, "wide_data.csv")
# Read only specific columns
selected <- read_csv(
  "wide_data.csv",
  col_select = c(id, name, col1, col3)
)
selected
# A tibble: 5 × 4
id name col1 col3
<dbl> <chr> <dbl> <dbl>
1 1 A 1.41 -0.267
2 2 B -0.984 -0.198
3 3 C -1.23 -0.277
4 4 D 1.07 -0.159
5 5 E -0.172 0.135
# Use tidyselect helpers
selected2 <- read_csv(
  "wide_data.csv",
  col_select = c(id, name, starts_with("col"))
)
head(selected2, 2)
# A tibble: 2 × 7
id name col1 col2 col3 col4 col5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A 1.41 -0.746 -0.267 0.495 0.727
2 2 B -0.984 -0.0287 -0.198 0.428 0.785
Writing Data
Writing CSV Files
# Create sample data
output_data <- tibble(
  id = 1:5,
  value = rnorm(5),
  date = Sys.Date() + 0:4,
  timestamp = Sys.time() + 0:4 * 3600
)
# Write CSV (no row names, readable format)
write_csv(output_data, "output.csv")
# Write with European format
write_csv2(output_data, "output_euro.csv") # Uses semicolon delimiter
# Write TSV
write_tsv(output_data, "output.tsv")
# Write with custom NA representation
data_with_na <- tibble(
  x = c(1, NA, 3),
  y = c("a", "b", NA)
)
write_csv(data_with_na, "with_na.csv", na = "missing")
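If the file is destined for Excel, readr also provides write_excel_csv(), which writes a UTF-8 byte order mark so Excel detects the encoding correctly (the output file name here is just illustrative):
# CSV that opens cleanly in Excel (adds a UTF-8 BOM)
write_excel_csv(output_data, "output_excel.csv")
Appending to Files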
# Initial data
initial <- tibble(id = 1:3, value = c(10, 20, 30))
write_csv(initial, "append_test.csv")
# New data to append
new_data <- tibble(id = 4:6, value = c(40, 50, 60))
# Append to existing file
write_csv(new_data, "append_test.csv", append = TRUE)
# Read the combined file
read_csv("append_test.csv")# A tibble: 6 × 2
id value
<dbl> <dbl>
1 1 10
2 2 20
3 3 30
4 4 40
5 5 50
6 6 60
Best Practices
1. Always Inspect Your Data
# After reading, always check:
data <- read_csv("employees.csv")
# Structure
glimpse(data)
Rows: 5
Columns: 5
$ id <dbl> 1, 2, 3, 4, 5
$ name <chr> "Alice", "Bob", "Charlie", "Diana", "Eve"
$ age <dbl> 25, 30, 35, 28, 32
$ salary <dbl> 50000, 60000, 75000, 55000, 65000
$ hire_date <date> 2020-01-15, 2019-06-01, 2018-03-20, 2021-02-10, 2019-11-30
# First few rows
head(data)
# A tibble: 5 × 5
id name age salary hire_date
<dbl> <chr> <dbl> <dbl> <date>
1 1 Alice 25 50000 2020-01-15
2 2 Bob 30 60000 2019-06-01
3 3 Charlie 35 75000 2018-03-20
4 4 Diana 28 55000 2021-02-10
5 5 Eve 32 65000 2019-11-30
# Summary statistics
summary(data)
       id    name                age         salary     
 Min.   :1   Length:5           Min.   :25   Min.   :50000  
 1st Qu.:2   Class :character   1st Qu.:28   1st Qu.:55000  
 Median :3   Mode  :character   Median :30   Median :60000  
 Mean   :3                      Mean   :30   Mean   :61000  
 3rd Qu.:4                      3rd Qu.:32   3rd Qu.:65000  
 Max.   :5                      Max.   :35   Max.   :75000  
   hire_date         
 Min.   :2018-03-20  
 1st Qu.:2019-06-01  
 Median :2019-11-30  
 Mean   :2019-09-27  
 3rd Qu.:2020-01-15  
 Max.   :2021-02-10  
# Check for parsing problems
problems(data)
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>
2. Be Explicit About Types
# Good: Explicit column types
data <- read_csv(
  "file.csv",
  col_types = cols(
    id = col_integer(),
    amount = col_double(),
    date = col_date(format = "%Y-%m-%d"),
    category = col_factor(levels = c("A", "B", "C"))
  )
)
# This prevents surprises and makes code more robust
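For wide files, cols() also accepts a .default type so you only spell out the exceptions; a short sketch (file name hypothetical):
# Everything defaults to character; only id is overridden
data <- read_csv(
  "file.csv",
  col_types = cols(
    .default = col_character(),
    id = col_integer()
  )
)
3. Handle Encodings Properly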
# For files with special characters
data <- read_csv(
  "international_data.csv",
  locale = locale(encoding = "UTF-8")
)
# Common encodings: UTF-8, Latin1, Windows-1252
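If you don’t know a file’s encoding, readr can make an educated guess; a quick sketch:
# Returns a tibble of likely encodings ranked by confidence
guess_encoding("international_data.csv")
Common Patterns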
Pattern 1: Import and Clean
# Common workflow: import, clean, transform
clean_data <- read_csv("employees.csv") %>%
# Remove any rows with missing values
drop_na() %>%
# Parse dates properly
mutate(
hire_date = as.Date(hire_date),
years_employed = as.numeric(Sys.Date() - hire_date) / 365.25
) %>%
# Filter and arrange
filter(age >= 25) %>%
arrange(desc(salary))
clean_data
# A tibble: 5 × 6
id name age salary hire_date years_employed
<dbl> <chr> <dbl> <dbl> <date> <dbl>
1 3 Charlie 35 75000 2018-03-20 7.51
2 5 Eve 32 65000 2019-11-30 5.81
3 2 Bob 30 60000 2019-06-01 6.31
4 4 Diana 28 55000 2021-02-10 4.61
5 1 Alice 25 50000 2020-01-15 5.69
Pattern 2: Multiple File Import
# Create multiple files
for (i in 1:3) {
  tibble(
    file_id = i,
    value = rnorm(5),
    category = sample(c("A", "B"), 5, replace = TRUE)
  ) %>%
    write_csv(paste0("data_", i, ".csv"))
}
# Read all files at once
all_files <- list.files(pattern = "data_\\d+\\.csv", full.names = TRUE)
combined_data <- all_files %>%
  map_dfr(read_csv, show_col_types = FALSE) %>%
  arrange(file_id)
combined_data
# A tibble: 15 × 3
file_id value category
<dbl> <dbl> <chr>
1 1 0.410 A
2 1 0.653 B
3 1 0.187 B
4 1 1.80 B
5 1 1.30 A
6 2 0.907 A
7 2 0.492 B
8 2 -1.11 B
9 2 -0.786 B
10 2 0.418 B
11 3 0.284 A
12 3 1.28 B
13 3 -1.34 A
14 3 -0.0405 B
15 3 0.368 A
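Since readr 2.0, read_csv() itself accepts a vector of paths, with an id argument that records each row’s source file; a sketch:
# One call reads all files; `id` adds the source path as a column
combined2 <- read_csv(all_files, id = "source_file", show_col_types = FALSE)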
Exercises
Exercise 1: Basic Import
- Create a CSV file with student grades
- Read it with appropriate column types
- Check for any parsing problems
Exercise 2: Messy Data
Create a CSV file with:
- Comment lines starting with #
- Empty rows
- Mixed date formats
- Custom NA values
Then read it correctly using readr options.
Exercise 3: Large File Simulation
- Generate a CSV with 100,000 rows
- Read only specific columns
- Process it in chunks to calculate group summaries
Exercise 4: International Data
Create a CSV file with European formatting (semicolon delimiter, comma decimal) and read it properly.
Cleanup
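A sketch that removes the files generated in this tutorial; adjust the list to whatever you actually wrote to disk:
# Delete the example files created above
file.remove(c(
  "employees.csv", "diverse_data.csv", "genes.tsv",
  "pipe_data.txt", "semi_data.csv", "fixed_width.txt",
  "large_file.csv", "euro_data.csv", "wide_data.csv",
  "output.csv", "output_euro.csv", "output.tsv",
  "output_excel.csv", "with_na.csv", "append_test.csv",
  paste0("data_", 1:3, ".csv")
))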
Summary
The readr package provides a modern, fast, and user-friendly approach to data import:
- Consistent API: All read functions work similarly
- Automatic type detection: Smart guessing with explicit overrides
- Better defaults: Returns tibbles, preserves strings
- Helpful feedback: Shows column specs and parsing problems
- Performance: Much faster than base R for large files
- Flexibility: Handles various formats and encodings
Mastering readr is essential for efficient data import in the tidyverse ecosystem. Combined with dplyr and tidyr, it forms the foundation of most data analysis workflows.
Next, we’ll explore data transformation with dplyr!