Lists and Data Frames

Author

IND215

Published

September 22, 2025

Introduction to Complex Data Structures

While vectors are fundamental, real-world data analysis often requires more complex structures. In this section, we’ll explore two essential data structures that build upon vectors:

Lists: Flexible containers that can hold different data types
Data frames: The backbone of data analysis, organizing data in rows and columns

Lists: Flexible Data Containers

What are Lists?

Lists are collections that can contain elements of different types. Unlike atomic vectors, whose elements must all share one type, lists can mix numbers, character strings, logicals, and even other lists.
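
As a minimal sketch of that difference, the snippet below builds the same three values with c() and with list(). The variable names are illustrative only.

```r
# c() coerces everything to one common type (character here)
as_vector <- c(1, "a", TRUE)

# list() keeps each element's own type
as_list <- list(1, "a", TRUE)

class(as_vector)          # one type for the whole vector
sapply(as_list, class)    # one type per element
```

The vector silently turns the number and the logical into text, while the list leaves each value untouched.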

Creating Lists

# Basic list with different types
my_list <- list(
  numbers = c(1, 2, 3, 4),
  text = "Hello, world!",
  logical_value = TRUE,
  single_number = 42
)

print(my_list)

$numbers
[1] 1 2 3 4

$text
[1] "Hello, world!"

$logical_value
[1] TRUE

$single_number
[1] 42

# Lists can contain vectors of different lengths
mixed_list <- list(
  short_vector = c(1, 2),
  long_vector = c("a", "b", "c", "d", "e"),
  single_value = 3.14
)

print(mixed_list)

$short_vector
[1] 1 2

$long_vector
[1] "a" "b" "c" "d" "e"

$single_value
[1] 3.14

List Structure and Properties

# Examine list structure
str(my_list)

List of 4
 $ numbers      : num [1:4] 1 2 3 4
 $ text         : chr "Hello, world!"
 $ logical_value: logi TRUE
 $ single_number: num 42

# Check if it's a list
is.list(my_list)

[1] TRUE

# Get list length (number of elements)
length(my_list)

[1] 4

# Get element names
names(my_list)

[1] "numbers"       "text"          "logical_value" "single_number"

Complex Lists

Lists can contain other lists and complex objects:

# List containing other lists
nested_list <- list(
  personal_info = list(
    name = "Alice Johnson",
    age = 28,
    city = "Boston"
  ),
  scores = c(85, 92, 78, 96),
  metadata = list(
    date_created = Sys.Date(),
    version = 1.0
  )
)

str(nested_list)

List of 3
 $ personal_info:List of 3
  ..$ name: chr "Alice Johnson"
  ..$ age : num 28
  ..$ city: chr "Boston"
 $ scores       : num [1:4] 85 92 78 96
 $ metadata     :List of 2
  ..$ date_created: Date[1:1], format: "2025-09-22"
  ..$ version     : num 1

Accessing List Elements

There are three main ways to access list elements:

Using `[[]]` (Double Brackets)

# Extract single elements (returns the actual object)
my_list[[1]]           # First element

[1] 1 2 3 4

my_list[["numbers"]]   # Element named "numbers"

[1] 1 2 3 4

my_list$numbers        # Same as above, using $ notation

[1] 1 2 3 4

# Check types
typeof(my_list[[1]])   # Returns "double" (the vector)

[1] "double"

typeof(my_list[1])     # Returns "list" (a list with one element)

[1] "list"

Using `[]` (Single Brackets)

# Returns a list containing the selected elements
my_list[1]             # Returns a list with the first element

$numbers
[1] 1 2 3 4

my_list[c(1, 3)]       # Returns a list with elements 1 and 3

$numbers
[1] 1 2 3 4

$logical_value
[1] TRUE

my_list[c("numbers", "text")]  # Returns a list with named elements

$numbers
[1] 1 2 3 4

$text
[1] "Hello, world!"

# The difference is important!
class(my_list[[1]])    # "numeric" - the actual vector

[1] "numeric"

class(my_list[1])      # "list" - a list containing the vector

[1] "list"
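
One way to see why the distinction matters is to pass both forms to a function that expects a numeric vector. The sketch below rebuilds a shortened my_list so it runs on its own, and uses tryCatch() only to capture the error instead of stopping the script.

```r
my_list <- list(numbers = c(1, 2, 3, 4), text = "Hello, world!")

# [[ ]] hands sum() the numeric vector itself
total_ok <- sum(my_list[[1]])

# [ ] hands sum() a one-element list, which it cannot add up;
# tryCatch() stores the error message instead of halting
total_bad <- tryCatch(
  sum(my_list[1]),
  error = function(e) conditionMessage(e)
)

total_ok
total_bad
```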

Using `$` (Dollar Sign)

# Access by name (most common for named lists)
my_list$numbers

[1] 1 2 3 4

my_list$text

[1] "Hello, world!"

my_list$logical_value

[1] TRUE

# $ needs a name; it cannot take a position
# my_list$1  # Syntax error; use my_list[[1]] instead
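
Two further differences between `$` and `[[]]` are worth a short sketch: `$` takes the name literally (so it cannot use a name stored in a variable), and on lists it accepts unambiguous partial names. The variable `field` is a hypothetical helper introduced for this example.

```r
my_list <- list(numbers = c(1, 2, 3, 4), text = "Hello, world!")

field <- "numbers"                  # element name stored in a variable
by_variable <- my_list[[field]]     # looks up the element named "numbers"
by_dollar   <- my_list$field        # looks for an element literally named "field"

partial_dollar  <- my_list$num      # $ partially matches "numbers"
partial_bracket <- my_list[["num"]] # [[ ]] requires the exact name

is.null(by_dollar)
is.null(partial_bracket)
```

Because partial matching can hide typos, `[[]]` is often the safer choice inside functions and loops.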

Modifying Lists

# Create a list to modify
student_data <- list(
  name = "Bob Smith",
  grades = c(85, 90, 78)
)

# Add new elements
student_data$age <- 20
student_data$major <- "Statistics"

# Modify existing elements
student_data$grades <- c(student_data$grades, 92)  # Add new grade

# Remove elements (set to NULL)
student_data$age <- NULL

print(student_data)

$name
[1] "Bob Smith"

$grades
[1] 85 90 78 92

$major
[1] "Statistics"
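
Since assigning NULL deletes an element, keeping an element whose value is NULL needs a different form. A small sketch, assuming a cut-down version of student_data:

```r
student_data <- list(name = "Bob Smith", age = 20, major = "Statistics")

student_data$age <- NULL              # removes the element entirely
student_data["major"] <- list(NULL)   # keeps the element, with NULL as its value

names(student_data)
length(student_data)
```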

List Functions

# Create sample list
data_list <- list(
  group_a = c(23, 25, 27, 29),
  group_b = c(18, 22, 24, 26, 28),
  group_c = c(30, 32, 34)
)

# Apply function to each element
lapply(data_list, mean)     # Returns a list

$group_a
[1] 26

$group_b
[1] 23.6

$group_c
[1] 32

sapply(data_list, mean)     # Returns a vector when possible

group_a group_b group_c 
   26.0    23.6    32.0

vapply(data_list, mean, numeric(1))  # Like sapply(), but requires each result to be one numeric value

group_a group_b group_c 
   26.0    23.6    32.0

# Get lengths of each element
lapply(data_list, length)

$group_a
[1] 4

$group_b
[1] 5

$group_c
[1] 3

sapply(data_list, length)

group_a group_b group_c 
      4       5       3
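
The phrase "when possible" hides some variation in what sapply() returns. The sketch below, using a two-group version of data_list, shows three cases where sapply() and vapply() behave differently.

```r
data_list <- list(
  group_a = c(23, 25, 27, 29),
  group_b = c(18, 22, 24, 26, 28)
)

# range() returns two values per group, so sapply() builds a matrix
ranges <- sapply(data_list, range)

# On empty input sapply() returns an empty list, vapply() an empty numeric vector
empty_sapply <- sapply(list(), mean)
empty_vapply <- vapply(list(), mean, numeric(1))

# vapply() refuses results that do not match the declared template
mismatch <- tryCatch(
  vapply(data_list, range, numeric(1)),
  error = function(e) "vapply stopped with an error"
)

dim(ranges)
mismatch
```

For interactive exploration sapply() is convenient; in scripts, the predictable return type of vapply() tends to be easier to rely on.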

# Combine lists
list1 <- list(a = 1, b = 2)
list2 <- list(c = 3, d = 4)
combined <- c(list1, list2)
print(combined)

$a
[1] 1

$b
[1] 2

$c
[1] 3

$d
[1] 4
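
Combining with c() flattens the two lists into one, which is not the same as wrapping them in list(). The sketch below contrasts the two and also shows modifyList() from the utils package (attached by default), which updates elements by name.

```r
list1 <- list(a = 1, b = 2)
list2 <- list(c = 3, d = 4)

flat   <- c(list1, list2)      # one list with four elements
nested <- list(list1, list2)   # a list holding two lists

length(flat)
length(nested)

# modifyList() replaces elements with matching names and adds new ones
updated <- modifyList(list1, list(b = 20, e = 5))
str(updated)
```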

Data Frames: The Heart of Data Analysis

What are Data Frames?

Data frames are the most important data structure for data analysis. Think of them as spreadsheets or database tables: they organize data in rows and columns where:

Each column represents a variable (like age, income, name)
Each row represents an observation (like a person, transaction, measurement)
All columns have the same length
Different columns can contain different data types
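
Under the hood, a data frame is a list of equal-length columns, which is why the list tools from the previous section also work on it. A minimal sketch with a hypothetical two-column data frame:

```r
df <- data.frame(
  id = 1:3,
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

is.list(df)                     # a data frame is a list of columns
length(df) == ncol(df)          # list length equals the number of columns
col_types <- sapply(df, class)  # each column keeps its own type

# Columns whose lengths cannot be reconciled are rejected
bad <- tryCatch(
  data.frame(id = 1:3, label = c("a", "b")),
  error = function(e) "lengths differ"
)

col_types
bad
```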

Creating Data Frames

# Basic data frame
students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  age = c(20, 22, 19, 21),
  major = c("Psychology", "Math", "Biology", "English"),
  gpa = c(3.8, 3.6, 3.9, 3.7),
  stringsAsFactors = FALSE  # Keep strings as characters
)

print(students)

     name age      major gpa
1   Alice  20 Psychology 3.8
2     Bob  22       Math 3.6
3 Charlie  19    Biology 3.9
4   Diana  21    English 3.7

# Check structure
str(students)

'data.frame':   4 obs. of  4 variables:
 $ name : chr  "Alice" "Bob" "Charlie" "Diana"
 $ age  : num  20 22 19 21
 $ major: chr  "Psychology" "Math" "Biology" "English"
 $ gpa  : num  3.8 3.6 3.9 3.7

Data Frame Properties

# Dimensions
nrow(students)    # Number of rows

[1] 4

ncol(students)    # Number of columns

[1] 4

dim(students)     # Both dimensions

[1] 4 4

# Column and row names
colnames(students)

[1] "name"  "age"   "major" "gpa"

rownames(students)

[1] "1" "2" "3" "4"

names(students)   # Same as colnames for data frames

[1] "name"  "age"   "major" "gpa"

# Quick overview
head(students)    # First 6 rows by default (all 4 here)

     name age      major gpa
1   Alice  20 Psychology 3.8
2     Bob  22       Math 3.6
3 Charlie  19    Biology 3.9
4   Diana  21    English 3.7

tail(students)    # Last 6 rows by default (all 4 here)

     name age      major gpa
1   Alice  20 Psychology 3.8
2     Bob  22       Math 3.6
3 Charlie  19    Biology 3.9
4   Diana  21    English 3.7

summary(students) # Summary statistics

     name                age           major                gpa       
 Length:4           Min.   :19.00   Length:4           Min.   :3.600  
 Class :character   1st Qu.:19.75   Class :character   1st Qu.:3.675  
 Mode  :character   Median :20.50   Mode  :character   Median :3.750  
                    Mean   :20.50                      Mean   :3.750  
                    3rd Qu.:21.25                      3rd Qu.:3.825  
                    Max.   :22.00                      Max.   :3.900

Accessing Data Frame Elements

Column Access

# Access columns (multiple ways)
students$name           # Using $

[1] "Alice"   "Bob"     "Charlie" "Diana"

students[["name"]]      # Using [[]]

[1] "Alice"   "Bob"     "Charlie" "Diana"

students["name"]        # Returns data frame with one column

     name
1   Alice
2     Bob
3 Charlie
4   Diana

students[, "name"]      # Row, column notation; a single column comes back as a vector

[1] "Alice"   "Bob"     "Charlie" "Diana"

# Multiple columns
students[c("name", "gpa")]

     name gpa
1   Alice 3.8
2     Bob 3.6
3 Charlie 3.9
4   Diana 3.7

students[, c("name", "gpa")]

     name gpa
1   Alice 3.8
2     Bob 3.6
3 Charlie 3.9
4   Diana 3.7

# All columns
students[, ]  # All rows and columns

     name age      major gpa
1   Alice  20 Psychology 3.8
2     Bob  22       Math 3.6
3 Charlie  19    Biology 3.9
4   Diana  21    English 3.7
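
Selecting a single column with row, column notation returns a plain vector because the drop argument defaults to TRUE. The sketch below, on a two-column copy of students, shows how drop = FALSE keeps the data frame structure.

```r
students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  gpa = c(3.8, 3.6, 3.9, 3.7),
  stringsAsFactors = FALSE
)

as_vector <- students[, "name"]                # simplified to a character vector
as_frame  <- students[, "name", drop = FALSE]  # stays a one-column data frame

class(as_vector)
class(as_frame)
```

This matters in code where the set of selected columns is computed at run time and may happen to contain only one name.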

Row Access

# Access rows
students[1, ]           # First row, all columns

   name age      major gpa
1 Alice  20 Psychology 3.8

students[c(1, 3), ]     # Rows 1 and 3, all columns

     name age      major gpa
1   Alice  20 Psychology 3.8
3 Charlie  19    Biology 3.9

students[1:2, ]         # Rows 1 through 2

   name age      major gpa
1 Alice  20 Psychology 3.8
2   Bob  22       Math 3.6

# Specific cells
students[1, "name"]     # Row 1, column "name"

[1] "Alice"

students[1, 2]          # Row 1, column 2

[1] 20

students[c(1, 3), c("name", "gpa")]  # Multiple rows and columns

     name gpa
1   Alice 3.8
3 Charlie 3.9

Conditional Subsetting

# Filter rows based on conditions
high_gpa <- students[students$gpa > 3.7, ]
print(high_gpa)

     name age      major gpa
1   Alice  20 Psychology 3.8
3 Charlie  19    Biology 3.9

# Multiple conditions
young_high_achievers <- students[students$age < 21 & students$gpa > 3.8, ]
print(young_high_achievers)

     name age   major gpa
3 Charlie  19 Biology 3.9

# Using subset() function (alternative)
subset(students, gpa > 3.7)

     name age      major gpa
1   Alice  20 Psychology 3.8
3 Charlie  19    Biology 3.9

subset(students, age < 21 & gpa > 3.8)

     name age   major gpa
3 Charlie  19 Biology 3.9

# Filter and select specific columns
subset(students, gpa > 3.7, select = c("name", "gpa"))

     name gpa
1   Alice 3.8
3 Charlie 3.9
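
The two filtering styles differ when the condition evaluates to NA. The following sketch uses a hypothetical scores data frame with one missing GPA to show the difference.

```r
scores <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  gpa = c(3.8, NA, 3.9),
  stringsAsFactors = FALSE
)

with_brackets <- scores[scores$gpa > 3.7, ]         # keeps an all-NA row for Bob
with_which    <- scores[which(scores$gpa > 3.7), ]  # which() drops the NA
with_subset   <- subset(scores, gpa > 3.7)          # subset() also drops it

nrow(with_brackets)
nrow(with_which)
nrow(with_subset)
```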

Modifying Data Frames

Adding Columns

# Add new columns
students$year <- c("Sophomore", "Senior", "Freshman", "Junior")
students$credits <- c(45, 120, 15, 90)

# Calculate new columns from existing ones
students$gpa_category <- ifelse(students$gpa >= 3.8, "High",
                               ifelse(students$gpa >= 3.5, "Medium", "Low"))

print(students)

     name age      major gpa      year credits gpa_category
1   Alice  20 Psychology 3.8 Sophomore      45         High
2     Bob  22       Math 3.6    Senior     120       Medium
3 Charlie  19    Biology 3.9  Freshman      15         High
4   Diana  21    English 3.7    Junior      90       Medium
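
Nested ifelse() calls become hard to read beyond two or three levels. As an alternative sketch (not the approach used above), cut() can express the same thresholds in one call; the gpa vector here is made up for illustration.

```r
gpa <- c(3.8, 3.6, 3.9, 3.7, 3.2)

gpa_category <- cut(
  gpa,
  breaks = c(-Inf, 3.5, 3.8, Inf),
  labels = c("Low", "Medium", "High"),
  right = FALSE   # intervals are [lower, upper), so 3.8 counts as High
)

as.character(gpa_category)
```

Note that cut() returns a factor, so wrap it in as.character() if a character column is needed.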

Adding Rows

# Create new student data
new_student <- data.frame(
  name = "Eve",
  age = 20,
  major = "Computer Science",
  gpa = 3.95,
  year = "Sophomore",
  credits = 50,
  gpa_category = "High",
  stringsAsFactors = FALSE
)

# Add to existing data frame
students_expanded <- rbind(students, new_student)
print(students_expanded)

     name age            major  gpa      year credits gpa_category
1   Alice  20       Psychology 3.80 Sophomore      45         High
2     Bob  22             Math 3.60    Senior     120       Medium
3 Charlie  19          Biology 3.90  Freshman      15         High
4   Diana  21          English 3.70    Junior      90       Medium
5     Eve  20 Computer Science 3.95 Sophomore      50         High
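
rbind() on data frames matches columns by name, so the new row must carry exactly the same column names, though not necessarily in the same order. A small sketch with hypothetical two-column data frames:

```r
first  <- data.frame(name = "Alice", age = 20, stringsAsFactors = FALSE)
second <- data.frame(age = 22, name = "Bob", stringsAsFactors = FALSE)  # columns reversed

both <- rbind(first, second)   # columns are matched by name, not position

# A data frame with a different column name cannot be appended
mismatch <- tryCatch(
  rbind(first, data.frame(name = "Eve", years = 20)),
  error = function(e) "column names do not match"
)

# do.call() binds a whole list of data frames in one step
all_rows <- do.call(rbind, list(first, second, first))

both
mismatch
nrow(all_rows)
```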

Modifying Values

# Modify specific values
students$age[1] <- 21  # Change Alice's age

# Modify based on conditions
students$gpa[students$name == "Bob"] <- 3.65  # Update Bob's GPA

# Modify multiple values
students$credits <- students$credits + 3  # Add 3 credits to everyone

print(students)

     name age      major  gpa      year credits gpa_category
1   Alice  21 Psychology 3.80 Sophomore      48         High
2     Bob  22       Math 3.65    Senior     123       Medium
3 Charlie  19    Biology 3.90  Freshman      18         High
4   Diana  21    English 3.70    Junior      93       Medium

Data Frame Operations

Sorting

# Sort by GPA (ascending)
students_by_gpa <- students[order(students$gpa), ]
print(students_by_gpa)

     name age      major  gpa      year credits gpa_category
2     Bob  22       Math 3.65    Senior     123       Medium
4   Diana  21    English 3.70    Junior      93       Medium
1   Alice  21 Psychology 3.80 Sophomore      48         High
3 Charlie  19    Biology 3.90  Freshman      18         High

# Sort by GPA (descending)
students_by_gpa_desc <- students[order(-students$gpa), ]
print(students_by_gpa_desc)

     name age      major  gpa      year credits gpa_category
3 Charlie  19    Biology 3.90  Freshman      18         High
1   Alice  21 Psychology 3.80 Sophomore      48         High
4   Diana  21    English 3.70    Junior      93       Medium
2     Bob  22       Math 3.65    Senior     123       Medium

# Sort by multiple columns
students_sorted <- students[order(students$major, -students$gpa), ]
print(students_sorted)

     name age      major  gpa      year credits gpa_category
3 Charlie  19    Biology 3.90  Freshman      18         High
4   Diana  21    English 3.70    Junior      93       Medium
2     Bob  22       Math 3.65    Senior     123       Medium
1   Alice  21 Psychology 3.80 Sophomore      48         High
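
The minus sign works only for numeric columns. To sort a character column in descending order, two options are sketched below on a hypothetical staff data frame: negating xtfrm(), or using the radix method with one decreasing flag per key.

```r
staff <- data.frame(
  department = c("IT", "Sales", "IT", "HR"),
  salary = c(75000, 50000, 80000, 60000),
  stringsAsFactors = FALSE
)

# -staff$department would fail: unary minus is not defined for text.
# xtfrm() converts the column to sortable numbers that can be negated.
by_xtfrm <- staff[order(-xtfrm(staff$department), staff$salary), ]

# The radix method accepts one decreasing flag per sort key
by_radix <- staff[order(staff$department, staff$salary,
                        decreasing = c(TRUE, FALSE), method = "radix"), ]

by_xtfrm
```

Both calls sort departments from Z to A and, within a department, salaries from low to high.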

Aggregation

# Basic statistics
mean(students$gpa)

[1] 3.7625

max(students$age)

[1] 22

min(students$credits)

[1] 18

# Statistics by group
aggregate(gpa ~ major, data = students, FUN = mean)

       major  gpa
1    Biology 3.90
2    English 3.70
3       Math 3.65
4 Psychology 3.80

aggregate(age ~ gpa_category, data = students, FUN = mean)

  gpa_category  age
1         High 20.0
2       Medium 21.5

# Count by category
table(students$major)


   Biology    English       Math Psychology 
         1          1          1          1

table(students$gpa_category)


  High Medium 
     2      2
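
Two related base R tools are sketched below on a small made-up data frame: tapply(), which returns a named vector rather than a data frame, and aggregate() with cbind() on the left-hand side to summarize several columns at once.

```r
students <- data.frame(
  major = c("Math", "Biology", "Math", "Biology"),
  age = c(20, 22, 24, 19),
  gpa = c(3.6, 3.9, 3.8, 3.7),
  stringsAsFactors = FALSE
)

# One value per group, returned as a named vector
mean_gpa <- tapply(students$gpa, students$major, mean)

# Several columns summarized in one call
by_major <- aggregate(cbind(age, gpa) ~ major, data = students, FUN = mean)

mean_gpa
by_major
```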

Working with Real Data

Example 1: Sales Analysis

# Create sales data
sales_data <- data.frame(
  date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05")),
  product = c("Widget A", "Widget B", "Widget A", "Widget C", "Widget B"),
  quantity = c(10, 15, 8, 12, 20),
  price = c(25.99, 35.50, 25.99, 45.00, 35.50),
  sales_rep = c("Alice", "Bob", "Alice", "Charlie", "Bob")
)

# Calculate revenue
sales_data$revenue <- sales_data$quantity * sales_data$price

# Add day of week
sales_data$day_of_week <- weekdays(sales_data$date)

print(sales_data)

        date  product quantity price sales_rep revenue day_of_week
1 2024-01-01 Widget A       10 25.99     Alice  259.90      Monday
2 2024-01-02 Widget B       15 35.50       Bob  532.50     Tuesday
3 2024-01-03 Widget A        8 25.99     Alice  207.92   Wednesday
4 2024-01-04 Widget C       12 45.00   Charlie  540.00    Thursday
5 2024-01-05 Widget B       20 35.50       Bob  710.00      Friday

# Analysis
total_revenue <- sum(sales_data$revenue)
best_day <- sales_data[which.max(sales_data$revenue), ]
revenue_by_rep <- aggregate(revenue ~ sales_rep, data = sales_data, FUN = sum)

cat("Total revenue: $", round(total_revenue, 2), "\n")

Total revenue: $ 2250.32

print("Best performing day:")

[1] "Best performing day:"

print(best_day)

        date  product quantity price sales_rep revenue day_of_week
5 2024-01-05 Widget B       20  35.5       Bob     710      Friday

print("Revenue by sales rep:")

[1] "Revenue by sales rep:"

print(revenue_by_rep)

  sales_rep revenue
1     Alice  467.82
2       Bob 1242.50
3   Charlie  540.00

Example 2: Student Performance Analysis

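# Create comprehensive student data
# Scores are simulated and no seed is set, so your values will differ from the output shown;
# call set.seed() first if you need reproducible numbers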
student_performance <- data.frame(
  student_id = 1:20,
  name = paste("Student", LETTERS[1:20]),
  math_score = round(rnorm(20, mean = 85, sd = 10)),
  science_score = round(rnorm(20, mean = 82, sd = 12)),
  english_score = round(rnorm(20, mean = 88, sd = 8)),
  attendance_rate = round(runif(20, min = 0.7, max = 1.0), 2),
  study_hours = round(runif(20, min = 5, max = 25))
)

# Calculate derived variables
student_performance$average_score <- rowMeans(student_performance[, c("math_score", "science_score", "english_score")])
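# cut() uses right-closed intervals by default, so an average of exactly 90 is a "B".
# The top break is Inf because simulated scores can push an average above 100,
# which a finite break of 100 would turn into NA.
student_performance$grade <- cut(student_performance$average_score,
                                breaks = c(0, 60, 70, 80, 90, Inf),
                                labels = c("F", "D", "C", "B", "A"))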

# Identify high performers
high_performers <- subset(student_performance,
                         average_score >= 90 & attendance_rate >= 0.9)

# Correlation between study hours and average score
cor(student_performance$study_hours, student_performance$average_score)

[1] 0.1508882

print("High performers:")

[1] "High performers:"

print(high_performers[, c("name", "average_score", "attendance_rate", "study_hours")])

        name average_score attendance_rate study_hours
8  Student H      92.33333            0.99          17
16 Student P      92.00000            0.93           8
17 Student Q      92.00000            0.96           5

print("Grade distribution:")

[1] "Grade distribution:"

print(table(student_performance$grade))


 F  D  C  B  A 
 0  0  1 11  8
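
Two details of this example are easy to miss: the simulated scores change on every run unless a seed is set, and cut() treats interval boundaries in a specific way. The sketch below illustrates both; the seed value 215 and the averages vector are arbitrary choices for this illustration, not values from the analysis above.

```r
# Same seed, same random draws
set.seed(215)   # arbitrary seed chosen for this sketch
first_draw <- rnorm(3, mean = 85, sd = 10)
set.seed(215)
second_draw <- rnorm(3, mean = 85, sd = 10)
identical(first_draw, second_draw)

grade_labels <- c("F", "D", "C", "B", "A")
averages <- c(90, 101)

# Right-closed intervals with a finite top break:
# 90 falls in (80, 90] and 101 falls outside every interval
default_grades <- cut(averages, breaks = c(0, 60, 70, 80, 90, 100),
                      labels = grade_labels)

# Left-closed intervals with an open-ended top break:
# 90 falls in [90, Inf) and so does 101
adjusted_grades <- cut(averages, breaks = c(0, 60, 70, 80, 90, Inf),
                       labels = grade_labels, right = FALSE)

as.character(default_grades)
as.character(adjusted_grades)
```

Which boundary convention is right depends on the grading policy; the point is to choose it deliberately and keep it consistent with filters such as average_score >= 90.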

Combining Lists and Data Frames

Sometimes you need to store data frames within lists or create complex nested structures:

# List containing multiple data frames
analysis_results <- list(
  raw_data = student_performance,
  summary_stats = data.frame(
    metric = c("Mean Score", "Median Score", "Std Dev", "Max Score"),
    value = c(mean(student_performance$average_score),
              median(student_performance$average_score),
              sd(student_performance$average_score),
              max(student_performance$average_score))
  ),
  high_performers = high_performers,
  metadata = list(
    analysis_date = Sys.Date(),
    total_students = nrow(student_performance),
    subjects_analyzed = c("math", "science", "english")
  )
)

# Access different components
str(analysis_results, max.level = 2)

List of 4
 $ raw_data       :'data.frame':    20 obs. of  9 variables:
  ..$ student_id     : int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ name           : chr [1:20] "Student A" "Student B" "Student C" "Student D" ...
  ..$ math_score     : num [1:20] 89 83 82 102 89 95 86 82 99 87 ...
  ..$ science_score  : num [1:20] 63 78 99 103 87 88 74 88 96 66 ...
  ..$ english_score  : num [1:20] 85 85 80 86 82 79 96 107 84 97 ...
  ..$ attendance_rate: num [1:20] 0.98 0.94 0.75 0.87 0.74 0.75 0.98 0.99 0.74 0.93 ...
  ..$ study_hours    : num [1:20] 12 21 16 24 15 12 6 17 9 10 ...
  ..$ average_score  : num [1:20] 79 82 87 97 86 ...
  ..$ grade          : Factor w/ 5 levels "F","D","C","B",..: 3 4 4 5 4 4 4 5 5 4 ...
 $ summary_stats  :'data.frame':    4 obs. of  2 variables:
  ..$ metric: chr [1:4] "Mean Score" "Median Score" "Std Dev" "Max Score"
  ..$ value : num [1:4] 88.37 87.17 5.25 99.67
 $ high_performers:'data.frame':    3 obs. of  9 variables:
  ..$ student_id     : int [1:3] 8 16 17
  ..$ name           : chr [1:3] "Student H" "Student P" "Student Q"
  ..$ math_score     : num [1:3] 82 95 91
  ..$ science_score  : num [1:3] 88 96 93
  ..$ english_score  : num [1:3] 107 85 92
  ..$ attendance_rate: num [1:3] 0.99 0.93 0.96
  ..$ study_hours    : num [1:3] 17 8 5
  ..$ average_score  : num [1:3] 92.3 92 92
  ..$ grade          : Factor w/ 5 levels "F","D","C","B",..: 5 5 5
 $ metadata       :List of 3
  ..$ analysis_date    : Date[1:1], format: "2025-09-22"
  ..$ total_students   : int 20
  ..$ subjects_analyzed: chr [1:3] "math" "science" "english"

print(analysis_results$summary_stats)

        metric     value
1   Mean Score 88.366667
2 Median Score 87.166667
3      Std Dev  5.245884
4    Max Score 99.666667

print(analysis_results$metadata$total_students)

[1] 20
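
A list of data frames also arises naturally when a data frame is split by group. As a sketch on a hypothetical orders data frame, split() produces one data frame per region, and the apply functions from the list section then work on each piece.

```r
orders <- data.frame(
  region = c("North", "South", "North", "West", "South"),
  amount = c(120, 80, 200, 50, 40),
  stringsAsFactors = FALSE
)

# Named list with one data frame per region
by_region <- split(orders, orders$region)

rows_per_region  <- sapply(by_region, nrow)
total_per_region <- sapply(by_region, function(part) sum(part$amount))

names(by_region)
rows_per_region
total_per_region
```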

Best Practices and Common Pitfalls

1. Data Frame vs List Choice

# Use data frames when:
# - All columns have the same length
# - You're working with rectangular data
# - You need to analyze relationships between variables

# Use lists when:
# - Elements have different lengths
# - You need to store heterogeneous objects
# - You're collecting results from multiple analyses

2. Factor Handling

# Before R 4.0.0, data.frame() converted strings to factors by default
# In those versions, pass stringsAsFactors = FALSE (it is the default from 4.0.0 on)
# Or convert factors to characters when needed

students_safe <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  major = c("Math", "Science", "Art"),
  stringsAsFactors = FALSE
)

# Check if columns are factors
sapply(students_safe, class)

       name       major 
"character" "character"
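
The classic factor pitfall is converting a factor of number-like labels straight to numeric, which returns the internal level codes rather than the labels. A minimal sketch with made-up values:

```r
codes <- factor(c("10", "20", "10"))

wrong <- as.numeric(codes)                 # internal level codes
right <- as.numeric(as.character(codes))   # the labels converted to numbers

wrong
right

# Factors are still useful for categories; request them explicitly
df <- data.frame(major = c("Math", "Art", "Math"), stringsAsFactors = TRUE)
levels(df$major)
```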

3. Consistent Data Types

# Ensure consistent data types within columns
mixed_ages <- c(20, "twenty-one", 22)  # This becomes all character!
print(mixed_ages)

[1] "20"         "twenty-one" "22"

# Better approach
clean_ages <- c(20, 21, 22)
special_case <- "unknown"  # Handle special cases separately
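
Another common way to handle a special case, sketched here as an option rather than a rule, is to record it as NA so the column stays numeric. The raw_ages vector is a made-up example.

```r
raw_ages <- c("20", "twenty-one", "22")

# Non-numeric text becomes NA (R also emits a coercion warning, silenced here)
ages <- suppressWarnings(as.numeric(raw_ages))

is.na(ages)
mean(ages, na.rm = TRUE)   # ignore the missing value when averaging
```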

Exercises

Exercise 1: List Practice

Create a list containing:
- A vector of your favorite colors
- Your age
- A logical value indicating if you like pizza
- A nested list with your address information
Practice accessing elements using different methods ($, [[]], [])

Exercise 2: Data Frame Creation

Create a data frame with information about 10 movies including:
- Title, year, genre, rating (1-10), duration (minutes)
- Add calculated columns for decade and rating category
- Find movies from a specific decade with high ratings

Exercise 3: Real-world Analysis

Given this employee data structure:

employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  department = c("Sales", "IT", "Sales", "HR", "IT"),
  salary = c(50000, 75000, 55000, 60000, 80000),
  years_experience = c(3, 7, 4, 5, 9),
  performance_rating = c(4.2, 4.8, 3.9, 4.5, 4.9)
)

Calculate average salary by department
Find the highest paid employee in each department
Identify employees with above-average performance ratings
Create a bonus column (10% of salary for ratings > 4.5)

Summary

Lists and data frames are essential data structures in R:

Lists

Flexible: Can contain different data types and structures
Nested: Can contain other lists for complex hierarchies
Access: Use $, [[]] for elements, [] for sub-lists
Use case: Storing analysis results, configuration data, complex objects

Data Frames

Rectangular: Rows and columns like a spreadsheet
Mixed types: Different columns can have different types
Consistent length: All columns must have the same number of rows
Use case: Primary structure for data analysis

Key Operations

Creation: list() and data.frame()
Access: Multiple indexing methods for flexibility
Modification: Add, remove, or change elements and columns
Analysis: Built-in functions for summaries and aggregation

These structures form the foundation for data manipulation and analysis. Next, we’ll learn about control structures to add logic and iteration to our data processing workflows!