While vectors are fundamental, real-world data analysis often requires more complex structures. In this section, we’ll explore two essential data structures that build upon vectors:
Lists: Flexible containers that can hold different data types
Data frames: The backbone of data analysis, organizing data in rows and columns
Lists: Flexible Data Containers
What are Lists?
Lists are collections that can contain elements of different types. Unlike vectors, which must contain elements of the same type, lists can mix numbers, characters, logicals, and even other lists!
Creating Lists
# Basic list with different types
my_list <- list(
  numbers = c(1, 2, 3, 4),
  text = "Hello, world!",
  logical_value = TRUE,
  single_number = 42
)
print(my_list)
List Structure and Properties
# Examine list structure
str(my_list)

# Check if it's a list
is.list(my_list)

# Get list length (number of elements)
length(my_list)

# Get element names
names(my_list)
Complex Lists
Lists can contain other lists and complex objects:
# List containing other lists
nested_list <- list(
  personal_info = list(
    name = "Alice Johnson",
    age = 28,
    city = "Boston"
  ),
  scores = c(85, 92, 78, 96),
  metadata = list(
    date_created = Sys.Date(),
    version = 1.0
  )
)
str(nested_list)
List of 3
$ personal_info:List of 3
..$ name: chr "Alice Johnson"
..$ age : num 28
..$ city: chr "Boston"
$ scores : num [1:4] 85 92 78 96
$ metadata :List of 2
..$ date_created: Date[1:1], format: "2025-09-22"
..$ version : num 1
Accessing List Elements
There are three main ways to access list elements:
Using [[]] (Double Brackets)
# Extract single elements (returns the actual object)
my_list[[1]] # First element
[1] 1 2 3 4
my_list[["numbers"]] # Element named "numbers"
[1] 1 2 3 4
my_list$numbers # Same as above, using $ notation
[1] 1 2 3 4
# Check types
typeof(my_list[[1]]) # Returns "double" (the vector)
[1] "double"
typeof(my_list[1]) # Returns "list" (a list with one element)
[1] "list"
Using [] (Single Brackets)
# Returns a list containing the selected elements
my_list[1] # Returns a list with the first element
$numbers
[1] 1 2 3 4
my_list[c(1, 3)] # Returns a list with elements 1 and 3
$numbers
[1] 1 2 3 4
$logical_value
[1] TRUE
my_list[c("numbers", "text")] # Returns a list with named elements
$numbers
[1] 1 2 3 4
$text
[1] "Hello, world!"
# The difference is important!
class(my_list[[1]]) # "numeric" - the actual vector
[1] "numeric"
class(my_list[1]) # "list" - a list containing the vector
[1] "list"
Using $ (Dollar Sign)
# Access by name (most common for named lists)
my_list$numbers
[1] 1 2 3 4
my_list$text
[1] "Hello, world!"
my_list$logical_value
[1] TRUE
# This only works with named elements
# my_list$1 # This is a syntax error; use my_list[[1]] for positional access
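The same operators can be chained to reach into nested lists. The following is a minimal sketch; `profile` is a hypothetical list that mirrors the shape of `nested_list` above, not an object from the tutorial itself.

```r
# Hypothetical nested list, mirroring the structure of nested_list above
profile <- list(
  personal_info = list(name = "Alice Johnson", age = 28),
  scores = c(85, 92, 78, 96)
)

# Chain $ to walk down by name
person_name <- profile$personal_info$name

# Chain [[ ]] to walk down by name (positions work too)
person_age <- profile[["personal_info"]][["age"]]

# Extract the vector first, then index into it with single brackets
second_score <- profile$scores[2]

print(person_name)
print(person_age)
print(second_score)
```

Each step extracts the inner object before the next operator is applied, which is why `[[ ]]` or `$` is needed at every level except the last.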
Modifying Lists
# Create a list to modify
student_data <- list(
  name = "Bob Smith",
  grades = c(85, 90, 78)
)

# Add new elements
student_data$age <- 20
student_data$major <- "Statistics"

# Modify existing elements
student_data$grades <- c(student_data$grades, 92) # Add new grade

# Remove elements (set to NULL)
student_data$age <- NULL
print(student_data)
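Because assigning `NULL` removes an element, storing an actual `NULL` value in a list needs a different form. A small sketch of the distinction, using a hypothetical `settings` list:

```r
# Hypothetical list used only for this illustration
settings <- list(a = 1, b = 2, c = 3)

# Assigning NULL with $ (or [[ ]]) deletes the element
settings$a <- NULL
names_after_delete <- names(settings)

# Assigning list(NULL) through single brackets keeps the slot and stores NULL in it
settings["b"] <- list(NULL)
names_after_null <- names(settings)
b_is_null <- is.null(settings$b)

print(names_after_delete)
print(names_after_null)
print(b_is_null)
```

After both assignments the list still has the elements `b` and `c`; only `a` was removed.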
List Functions
# Create sample list
data_list <- list(
  group_a = c(23, 25, 27, 29),
  group_b = c(18, 22, 24, 26, 28),
  group_c = c(30, 32, 34)
)

# Apply function to each element
lapply(data_list, mean) # Returns a list
$group_a
[1] 26
$group_b
[1] 23.6
$group_c
[1] 32
sapply(data_list, mean) # Returns a vector when possible
group_a group_b group_c
26.0 23.6 32.0
vapply(data_list, mean, numeric(1)) # Specify output type
group_a group_b group_c
26.0 23.6 32.0
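One reason to prefer `vapply()` in scripts is that its return type does not depend on the input. A short sketch with an empty list (the variable names are illustrative):

```r
empty_groups <- list()

# sapply() has nothing to simplify for empty input and returns an empty list
sapply_result <- sapply(empty_groups, mean)

# vapply() returns the declared type, here an empty numeric vector
vapply_result <- vapply(empty_groups, mean, numeric(1))

print(class(sapply_result))
print(class(vapply_result))
```

Code that later does arithmetic on the result behaves consistently with `vapply()`, whereas the `sapply()` result changes type when the input happens to be empty.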
# Get lengths of each element
lapply(data_list, length)
$group_a
[1] 4
$group_b
[1] 5
$group_c
[1] 3
sapply(data_list, length)
group_a group_b group_c
4 5 3
# Combine lists
list1 <- list(a = 1, b = 2)
list2 <- list(c = 3, d = 4)
combined <- c(list1, list2)
print(combined)
$a
[1] 1
$b
[1] 2
$c
[1] 3
$d
[1] 4
Data Frames: The Heart of Data Analysis
What are Data Frames?
Data frames are the workhorse data structure for data analysis in R. Think of them as spreadsheets or database tables: they organize data in rows and columns where:
Each column represents a variable (like age, income, name)
Each row represents an observation (like a person, transaction, measurement)
All columns have the same length
Different columns can contain different data types
Creating Data Frames
# Basic data frame
students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  age = c(20, 22, 19, 21),
  major = c("Psychology", "Math", "Biology", "English"),
  gpa = c(3.8, 3.6, 3.9, 3.7),
  stringsAsFactors = FALSE # Keep strings as characters
)
print(students)
name age major gpa
1 Alice 20 Psychology 3.8
2 Bob 22 Math 3.6
3 Charlie 19 Biology 3.9
4 Diana 21 English 3.7
# Check structure
str(students)
'data.frame': 4 obs. of 4 variables:
$ name : chr "Alice" "Bob" "Charlie" "Diana"
$ age : num 20 22 19 21
$ major: chr "Psychology" "Math" "Biology" "English"
$ gpa : num 3.8 3.6 3.9 3.7
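Under the hood, a data frame is a list of equal-length columns, which is why the list tools from the previous section also work on it. A minimal sketch with a hypothetical `pets` data frame:

```r
# Hypothetical data frame used only for this illustration
pets <- data.frame(
  name = c("Rex", "Tom"),
  weight_kg = c(30.5, 4.2),
  stringsAsFactors = FALSE
)

is_list <- is.list(pets)               # a data frame is a list of columns
n_elements <- length(pets)             # so length() counts columns, not rows
column_classes <- sapply(pets, class)  # and sapply() iterates over columns

print(is_list)
print(n_elements)
print(column_classes)
```

This also explains why `$` and `[[ ]]` extract a column from a data frame in the same way they extract an element from a list.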
Data Frame Properties
# Dimensions
nrow(students) # Number of rows
[1] 4
ncol(students) # Number of columns
[1] 4
dim(students) # Both dimensions
[1] 4 4
# Column and row names
colnames(students)
[1] "name" "age" "major" "gpa"
rownames(students)
[1] "1" "2" "3" "4"
names(students) # Same as colnames for data frames
[1] "name" "age" "major" "gpa"
# Quick overview
head(students) # First 6 rows (default)
name age major gpa
1 Alice 20 Psychology 3.8
2 Bob 22 Math 3.6
3 Charlie 19 Biology 3.9
4 Diana 21 English 3.7
tail(students) # Last 6 rows
name age major gpa
1 Alice 20 Psychology 3.8
2 Bob 22 Math 3.6
3 Charlie 19 Biology 3.9
4 Diana 21 English 3.7
summary(students) # Summary statistics
name age major gpa
Length:4 Min. :19.00 Length:4 Min. :3.600
Class :character 1st Qu.:19.75 Class :character 1st Qu.:3.675
Mode :character Median :20.50 Mode :character Median :3.750
Mean :20.50 Mean :3.750
3rd Qu.:21.25 3rd Qu.:3.825
Max. :22.00 Max. :3.900
Accessing Data Frame Elements
Column Access
# Access columns (multiple ways)
students$name # Using $
[1] "Alice" "Bob" "Charlie" "Diana"
students[["name"]] # Using [[]]
[1] "Alice" "Bob" "Charlie" "Diana"
students["name"] # Returns data frame with one column
name
1 Alice
2 Bob
3 Charlie
4 Diana
students[, "name"] # Using row, column notation
[1] "Alice" "Bob" "Charlie" "Diana"
# Multiple columns
students[c("name", "gpa")]
name gpa
1 Alice 3.8
2 Bob 3.6
3 Charlie 3.9
4 Diana 3.7
students[, c("name", "gpa")]
name gpa
1 Alice 3.8
2 Bob 3.6
3 Charlie 3.9
4 Diana 3.7
# All columns
students[, ] # All rows and columns
name age major gpa
1 Alice 20 Psychology 3.8
2 Bob 22 Math 3.6
3 Charlie 19 Biology 3.9
4 Diana 21 English 3.7
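Note that `students[, "name"]` returned a plain vector rather than a one-column data frame. When a single column is selected with row, column notation, the result is simplified unless `drop = FALSE` is passed. A small sketch with a hypothetical `students_demo` data frame:

```r
# Hypothetical data frame used only for this illustration
students_demo <- data.frame(
  name = c("Alice", "Bob"),
  gpa = c(3.8, 3.6),
  stringsAsFactors = FALSE
)

# Default: a single selected column is simplified to a vector
as_vector <- students_demo[, "name"]

# drop = FALSE keeps the data frame structure
as_frame <- students_demo[, "name", drop = FALSE]

print(class(as_vector))
print(class(as_frame))
```

Passing `drop = FALSE` is useful in functions where the number of selected columns is not known in advance and the caller expects a data frame back.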
Row Access
# Access rows
students[1, ] # First row, all columns
name age major gpa
1 Alice 20 Psychology 3.8
students[c(1, 3), ] # Rows 1 and 3, all columns
name age major gpa
1 Alice 20 Psychology 3.8
3 Charlie 19 Biology 3.9
students[1:2, ] # Rows 1 through 2
name age major gpa
1 Alice 20 Psychology 3.8
2 Bob 22 Math 3.6
# Specific cells
students[1, "name"] # Row 1, column "name"
[1] "Alice"
students[1, 2] # Row 1, column 2
[1] 20
students[c(1, 3), c("name", "gpa")] # Multiple rows and columns
name gpa
1 Alice 3.8
3 Charlie 3.9
Conditional Subsetting
# Filter rows based on conditions
high_gpa <- students[students$gpa > 3.7, ]
print(high_gpa)
name age major gpa
1 Alice 20 Psychology 3.8
3 Charlie 19 Biology 3.9
# Using subset() function (alternative)
subset(students, gpa > 3.7)
name age major gpa
1 Alice 20 Psychology 3.8
3 Charlie 19 Biology 3.9
subset(students, age < 21 & gpa > 3.8)
name age major gpa
3 Charlie 19 Biology 3.9
# Filter and select specific columns
subset(students, gpa > 3.7, select = c("name", "gpa"))
name gpa
1 Alice 3.8
3 Charlie 3.9
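Bracket filtering and `subset()` differ when the condition involves missing values. The sketch below uses a hypothetical `scores` data frame with one `NA` to show the difference:

```r
# Hypothetical data with a missing GPA
scores <- data.frame(
  name = c("A", "B", "C"),
  gpa = c(3.9, NA, 3.5),
  stringsAsFactors = FALSE
)

# A logical index containing NA produces an extra row filled with NA
with_na_row <- scores[scores$gpa > 3.7, ]

# which() keeps only the positions where the condition is TRUE
without_na <- scores[which(scores$gpa > 3.7), ]

# subset() also drops rows where the condition is NA
via_subset <- subset(scores, gpa > 3.7)

print(nrow(with_na_row))
print(nrow(without_na))
print(nrow(via_subset))
```

If a column may contain missing values, wrapping the condition in `which()` or using `subset()` avoids the spurious all-`NA` rows.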
Modifying Data Frames
Adding Columns
# Add new columns
students$year <- c("Sophomore", "Senior", "Freshman", "Junior")
students$credits <- c(45, 120, 15, 90)

# Calculate new columns from existing ones
students$gpa_category <- ifelse(students$gpa >= 3.8, "High",
  ifelse(students$gpa >= 3.5, "Medium", "Low"))
print(students)
name age major gpa year credits gpa_category
1 Alice 20 Psychology 3.8 Sophomore 45 High
2 Bob 22 Math 3.6 Senior 120 Medium
3 Charlie 19 Biology 3.9 Freshman 15 High
4 Diana 21 English 3.7 Junior 90 Medium
Adding Rows
# Create new student data
new_student <- data.frame(
  name = "Eve",
  age = 20,
  major = "Computer Science",
  gpa = 3.95,
  year = "Sophomore",
  credits = 50,
  gpa_category = "High",
  stringsAsFactors = FALSE
)

# Add to existing data frame
students_expanded <- rbind(students, new_student)
print(students_expanded)
name age major gpa year credits gpa_category
1 Alice 20 Psychology 3.80 Sophomore 45 High
2 Bob 22 Math 3.60 Senior 120 Medium
3 Charlie 19 Biology 3.90 Freshman 15 High
4 Diana 21 English 3.70 Junior 90 Medium
5 Eve 20 Computer Science 3.95 Sophomore 50 High
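`rbind()` on data frames matches columns by name, so the new row needs the same column names as the existing data frame, though not necessarily in the same order. A minimal sketch with hypothetical data frames:

```r
# Hypothetical data frames used only for this illustration
base_rows <- data.frame(name = "Alice", gpa = 3.8, stringsAsFactors = FALSE)

# Same column names in a different order: rbind() lines them up by name
reordered <- data.frame(gpa = 3.6, name = "Bob", stringsAsFactors = FALSE)
stacked <- rbind(base_rows, reordered)

# A differing column name ("score" instead of "gpa") makes rbind() fail
mismatched <- data.frame(name = "Eve", score = 3.95, stringsAsFactors = FALSE)
result <- tryCatch(
  rbind(base_rows, mismatched),
  error = function(e) "rbind failed"
)

print(stacked)
print(result)
```

Checking `names()` of both data frames before binding is a cheap way to catch this kind of mismatch early.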
Modifying Values
# Modify specific values
students$age[1] <- 21 # Change Alice's age

# Modify based on conditions
students$gpa[students$name == "Bob"] <- 3.65 # Update Bob's GPA

# Modify multiple values
students$credits <- students$credits + 3 # Add 3 credits to everyone
print(students)
name age major gpa year credits gpa_category
1 Alice 21 Psychology 3.80 Sophomore 48 High
2 Bob 22 Math 3.65 Senior 123 Medium
3 Charlie 19 Biology 3.90 Freshman 18 High
4 Diana 21 English 3.70 Junior 93 Medium
Data Frame Operations
Sorting
# Sort by GPA (ascending)
students_by_gpa <- students[order(students$gpa), ]
print(students_by_gpa)
name age major gpa year credits gpa_category
2 Bob 22 Math 3.65 Senior 123 Medium
4 Diana 21 English 3.70 Junior 93 Medium
1 Alice 21 Psychology 3.80 Sophomore 48 High
3 Charlie 19 Biology 3.90 Freshman 18 High
# Sort by GPA (descending)
students_by_gpa_desc <- students[order(-students$gpa), ]
print(students_by_gpa_desc)
name age major gpa year credits gpa_category
3 Charlie 19 Biology 3.90 Freshman 18 High
1 Alice 21 Psychology 3.80 Sophomore 48 High
4 Diana 21 English 3.70 Junior 93 Medium
2 Bob 22 Math 3.65 Senior 123 Medium
# Sort by multiple columns
students_sorted <- students[order(students$major, -students$gpa), ]
print(students_sorted)
name age major gpa year credits gpa_category
3 Charlie 19 Biology 3.90 Freshman 18 High
4 Diana 21 English 3.70 Junior 93 Medium
2 Bob 22 Math 3.65 Senior 123 Medium
1 Alice 21 Psychology 3.80 Sophomore 48 High
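The minus-sign trick only works for numeric columns, because unary minus is not defined for character vectors. One option for mixed sort directions is to pass a vector to `decreasing` together with `method = "radix"`. A sketch with a hypothetical `roster` data frame:

```r
# Hypothetical data frame used only for this illustration
roster <- data.frame(
  major = c("Math", "Biology", "Math", "Biology"),
  name = c("Bob", "Charlie", "Alice", "Diana"),
  stringsAsFactors = FALSE
)

# major ascending, name descending; a per-column decreasing vector
# requires method = "radix"
sort_index <- order(
  roster$major, roster$name,
  decreasing = c(FALSE, TRUE),
  method = "radix"
)
roster_sorted <- roster[sort_index, ]

print(roster_sorted)
```

Within each major the names now run in reverse alphabetical order, while the majors themselves stay in ascending order.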
Aggregation
# Basic statistics
mean(students$gpa)
[1] 3.7625
max(students$age)
[1] 22
min(students$credits)
[1] 18
# Statistics by group
aggregate(gpa ~ major, data = students, FUN = mean)
major gpa
1 Biology 3.90
2 English 3.70
3 Math 3.65
4 Psychology 3.80
aggregate(age ~ gpa_category, data = students, FUN = mean)
Working with Real Data
Example 1: Sales Analysis
# Create sales data
sales_data <- data.frame(
  date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05")),
  product = c("Widget A", "Widget B", "Widget A", "Widget C", "Widget B"),
  quantity = c(10, 15, 8, 12, 20),
  price = c(25.99, 35.50, 25.99, 45.00, 35.50),
  sales_rep = c("Alice", "Bob", "Alice", "Charlie", "Bob")
)

# Calculate revenue
sales_data$revenue <- sales_data$quantity * sales_data$price

# Add day of week
sales_data$day_of_week <- weekdays(sales_data$date)
print(sales_data)
date product quantity price sales_rep revenue day_of_week
1 2024-01-01 Widget A 10 25.99 Alice 259.90 Monday
2 2024-01-02 Widget B 15 35.50 Bob 532.50 Tuesday
3 2024-01-03 Widget A 8 25.99 Alice 207.92 Wednesday
4 2024-01-04 Widget C 12 45.00 Charlie 540.00 Thursday
5 2024-01-05 Widget B 20 35.50 Bob 710.00 Friday
# Analysis
total_revenue <- sum(sales_data$revenue)
best_day <- sales_data[which.max(sales_data$revenue), ]
revenue_by_rep <- aggregate(revenue ~ sales_rep, data = sales_data, FUN = sum)
cat("Total revenue: $", round(total_revenue, 2), "\n")
Total revenue: $ 2250.32
print("Best performing day:")
[1] "Best performing day:"
print(best_day)
date product quantity price sales_rep revenue day_of_week
5 2024-01-05 Widget B 20 35.5 Bob 710 Friday
print("Revenue by sales rep:")
[1] "Revenue by sales rep:"
print(revenue_by_rep)
sales_rep revenue
1 Alice 467.82
2 Bob 1242.50
3 Charlie 540.00
Example 2: Student Performance Analysis
# Create comprehensive student data
student_performance <- data.frame(
  student_id = 1:20,
  name = paste("Student", LETTERS[1:20]),
  math_score = round(rnorm(20, mean = 85, sd = 10)),
  science_score = round(rnorm(20, mean = 82, sd = 12)),
  english_score = round(rnorm(20, mean = 88, sd = 8)),
  attendance_rate = round(runif(20, min = 0.7, max = 1.0), 2),
  study_hours = round(runif(20, min = 5, max = 25))
)

# Calculate derived variables
student_performance$average_score <- rowMeans(
  student_performance[, c("math_score", "science_score", "english_score")]
)
student_performance$grade <- cut(
  student_performance$average_score,
  breaks = c(0, 60, 70, 80, 90, 100),
  labels = c("F", "D", "C", "B", "A")
)

# Identify high performers
high_performers <- subset(student_performance, average_score >= 90 & attendance_rate >= 0.9)

# Performance by study hours
cor(student_performance$study_hours, student_performance$average_score)
Combining Lists and Data Frames
Sometimes you need to store data frames within lists or create complex nested structures:
# List containing multiple data frames
analysis_results <- list(
  raw_data = student_performance,
  summary_stats = data.frame(
    metric = c("Mean Score", "Median Score", "Std Dev", "Max Score"),
    value = c(
      mean(student_performance$average_score),
      median(student_performance$average_score),
      sd(student_performance$average_score),
      max(student_performance$average_score)
    )
  ),
  high_performers = high_performers,
  metadata = list(
    analysis_date = Sys.Date(),
    total_students = nrow(student_performance),
    subjects_analyzed = c("math", "science", "english")
  )
)

# Access different components
str(analysis_results, max.level = 2)
print(analysis_results$summary_stats)
metric value
1 Mean Score 88.366667
2 Median Score 87.166667
3 Std Dev 5.245884
4 Max Score 99.666667
print(analysis_results$metadata$total_students)
[1] 20
Best Practices and Common Pitfalls
1. Data Frame vs List Choice
# Use data frames when:
# - All columns have the same length
# - You're working with rectangular data
# - You need to analyze relationships between variables

# Use lists when:
# - Elements have different lengths
# - You need to store heterogeneous objects
# - You're collecting results from multiple analyses
2. Factor Handling
# Before R 4.0.0, data.frame() converted strings to factors by default.
# Setting stringsAsFactors = FALSE keeps strings as characters in any R version
# (it is the default from R 4.0.0 onward).
# Or convert factors to characters when needed
students_safe <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  major = c("Math", "Science", "Art"),
  stringsAsFactors = FALSE
)

# Check if columns are factors
sapply(students_safe, class)
name major
"character" "character"
3. Consistent Data Types
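# Ensure consistent data types within columns
mixed_ages <- c(20, "twenty-one", 22) # This becomes all character!
print(mixed_ages)

# Better approach
clean_ages <- c(20, 21, 22)
special_case <- "unknown" # Handle special cases separately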
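When the mixed values arrive from an external source and cannot be retyped by hand, one possible approach is to convert with `as.numeric()` and then inspect the entries that became `NA`. A minimal sketch, with hypothetical variable names:

```r
# Hypothetical raw input, already coerced to character
raw_ages <- c("20", "twenty-one", "22")

# Entries that cannot be parsed become NA (with a warning, silenced here)
numeric_ages <- suppressWarnings(as.numeric(raw_ages))

# Keep track of which original values need manual review
needs_review <- raw_ages[is.na(numeric_ages)]

print(numeric_ages)
print(needs_review)
```

Suppressing the warning is only reasonable here because the `NA` values are checked explicitly on the next line.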
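Exercises
Exercise 1: List Practice
1. Create a list containing:
   - A vector of your favorite colors
   - Your age
   - A logical value indicating if you like pizza
   - A nested list with your address information
2. Practice accessing elements using different methods ($, [[]], [])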
Exercise 2: Data Frame Creation
Create a data frame with information about 10 movies including:
- Title, year, genre, rating (1-10), duration (minutes)
- Add calculated columns for decade and rating category
- Find movies from a specific decade with high ratings
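Exercise 3: Real-world Analysis
Given this employee data structure:
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  department = c("Sales", "IT", "Sales", "HR", "IT"),
  salary = c(50000, 75000, 55000, 60000, 80000),
  years_experience = c(3, 7, 4, 5, 9),
  performance_rating = c(4.2, 4.8, 3.9, 4.5, 4.9)
)
1. Calculate average salary by department
2. Find the highest paid employee in each department
3. Identify employees with above-average performance ratings
4. Create a bonus column (10% of salary for ratings > 4.5)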
Summary
Lists and data frames are essential data structures in R:
Lists
Flexible: Can contain different data types and structures
Nested: Can contain other lists for complex hierarchies
Access: Use $, [[]] for elements, [] for sub-lists
Use case: Storing analysis results, configuration data, complex objects
Data Frames
Rectangular: Rows and columns like a spreadsheet
Mixed types: Different columns can have different types
Consistent length: All columns must have the same number of rows
Use case: Primary structure for data analysis
Key Operations
Creation: list() and data.frame()
Access: Multiple indexing methods for flexibility
Modification: Add, remove, or change elements and columns
Analysis: Built-in functions for summaries and aggregation
These structures form the foundation for data manipulation and analysis. Next, we’ll learn about control structures to add logic and iteration to our data processing workflows!