# ============================================================================
# Project: Introduction to R Analysis
# Script: my_first_analysis.R
# Author: Your Name
# Date: 2024-09-21
# Last Modified: 2024-09-21
#
# Purpose: Demonstrate best practices for R script organization
#
# Data: Built-in R datasets
# Output: Summary statistics and basic plots
# ============================================================================
# SETUP ======================================================================
# Clear workspace (optional, be careful!)
# rm(list = ls())
# Load required libraries
library(tidyverse) # Data manipulation and visualization
library(here) # File path management
# Set working directory (if not using projects)
# setwd("~/Documents/R-Analysis")
# Source additional scripts if needed
# source("helper_functions.R")
# CONSTANTS AND CONFIGURATION ================================================
# Define constants
SIGNIFICANCE_LEVEL <- 0.05
DEFAULT_PLOT_WIDTH <- 8
DEFAULT_PLOT_HEIGHT <- 6
# Set plot theme
theme_set(theme_minimal())
# DATA IMPORT ================================================================
# Load built-in dataset for this example
data("mtcars")
# In real projects, you might load data like this:
# my_data <- read_csv(here("data", "raw", "dataset.csv"))
# DATA EXPLORATION ===========================================================
# Quick overview
glimpse(mtcars)
summary(mtcars)
# Check for missing values
sum(is.na(mtcars))
# DATA ANALYSIS ==============================================================
# Calculate summary statistics
mpg_stats <- mtcars %>%
  summarise(
mean_mpg = mean(mpg),
median_mpg = median(mpg),
sd_mpg = sd(mpg),
min_mpg = min(mpg),
max_mpg = max(mpg)
)
print(mpg_stats)
# Create visualizations
mpg_histogram <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "steelblue", alpha = 0.7) +
labs(
title = "Distribution of Miles Per Gallon",
x = "Miles Per Gallon",
y = "Frequency"
)
print(mpg_histogram)
# RESULTS EXPORT =============================================================
# Save plot
ggsave(
filename = here("output", "mpg_distribution.png"),
plot = mpg_histogram,
width = DEFAULT_PLOT_WIDTH,
height = DEFAULT_PLOT_HEIGHT,
dpi = 300
)
# Save results
write_csv(mpg_stats, here("output", "mpg_summary_stats.csv"))
# SESSION INFO ===============================================================
# Document session for reproducibility
sessionInfo()
# End of script
R Scripts and Projects
Introduction
As you progress from simple calculations to complex analyses, organizing your R work becomes crucial. R scripts and RStudio Projects are the foundation of reproducible, organized data analysis. In this lesson, you’ll learn how to structure your R code professionally and manage your analytical projects effectively.
Think of R scripts as recipes and Projects as well-organized kitchens – both are essential for creating consistent, reproducible results.
Understanding R Scripts
What is an R Script?
An R script is a plain text file with a .R extension that contains R commands. Unlike typing commands directly in the console, scripts allow you to:
- Save your work for future use
- Document your analysis with comments
- Share your methods with others
- Reproduce results exactly
- Debug and modify code systematically
Creating Your First Script
1. File > New File > R Script (or Ctrl+Shift+N)
2. Save immediately: File > Save (or Ctrl+S)
3. Give it a meaningful name: my_first_analysis.R
Script Structure Best Practices
The annotated my_first_analysis.R script at the top of this page is a well-structured template; its key elements are described below.
Key Script Elements
1. Header Comments
Always start with a comprehensive header:
- Project name and purpose
- Author and date information
- Brief description of what the script does
- Data sources and outputs
2. Setup Section
- Load required libraries
- Set global options
- Define constants
- Source helper functions
3. Organized Sections
Use clear section headers:
# SECTION NAME ================================================================
# or
# Section Name ----
4. Meaningful Comments
# Good comments explain WHY, not just WHAT
mpg_threshold <- 20  # Fuel efficiency benchmark for classification

# Bad comment (obvious)
mpg_threshold <- 20  # Set mpg_threshold to 20
Working with Scripts
Running Script Code
Option 1: Line by Line
- Place cursor on line
- Press Ctrl+Enter (Windows/Linux) or Cmd+Enter (Mac)

Option 2: Selected Code
- Highlight code block
- Press Ctrl+Enter

Option 3: Entire Script
- Press Ctrl+Shift+Enter or click “Source”
Option 4: From Console
source("my_script.R")
Introduction to RStudio Projects
Why Use Projects?
RStudio Projects solve common organizational problems:
Without Projects:
# Absolute paths (bad!)
setwd("/Users/john/Documents/my_analysis")
data <- read.csv("/Users/john/Documents/my_analysis/data/file.csv")
# Problems:
# - Not portable between computers
# - Difficult to share
# - Hard to organize multiple analyses
With Projects:
# Relative paths (good!)
data <- read.csv("data/file.csv")
# or even better:
data <- read_csv(here("data", "file.csv"))
Project Benefits
- Automatic working directory - no more setwd()
- Organized file structure - everything in one place
- Portable - works on any computer
- Version control ready - easy Git integration
- Workspace isolation - separate environments for different projects
Creating a New Project
Method 1: From RStudio
1. File > New Project
2. Choose project type:
   - New Directory: Start fresh
   - Existing Directory: Use existing folder
   - Version Control: Clone from Git

Method 2: New Directory Walkthrough
1. Select “New Directory”
2. Choose “New Project”
3. Directory name: my-r-analysis
4. Choose parent directory
5. Optional: Initialize Git repository
6. Click “Create Project”
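If you would rather script project creation than click through the wizard, the usethis package provides an equivalent helper. This is a sketch, assuming usethis is installed; the path is only an example:

```r
# install.packages("usethis")   # one-time install
library(usethis)

# Creates the folder, adds an .Rproj file, and opens the new project
create_project("~/projects/my-r-analysis")
```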
Recommended Project Structure
my-r-analysis/
├── my-r-analysis.Rproj    # Project file
├── README.md              # Project description
├── data/                  # Data folder
│   ├── raw/               # Original, unmodified data
│   └── processed/         # Cleaned, processed data
├── scripts/               # R scripts
│   ├── 01_data_import.R
│   ├── 02_data_cleaning.R
│   └── 03_analysis.R
├── output/                # Results folder
│   ├── figures/           # Plots and visualizations
│   └── tables/            # Summary tables
├── docs/                  # Documentation
└── renv/                  # Package management (optional)
Setting Up Project Structure
Create this structure using R:
# Create project folders
dir.create("data")
dir.create("data/raw")
dir.create("data/processed")
dir.create("scripts")
dir.create("output")
dir.create("output/figures")
dir.create("output/tables")
dir.create("docs")
# Create README file
writeLines(
c("# My R Analysis Project",
"",
"## Description",
"Brief description of your project",
"",
"## Structure",
"- `data/`: Data files",
"- `scripts/`: R scripts",
"- `output/`: Results and figures",
"- `docs/`: Documentation"),
"README.md"
)
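The repeated dir.create() calls above can also be written as a loop. Passing recursive = TRUE creates any missing parent folders, and showWarnings = FALSE makes the script safe to re-run:

```r
# Create the whole folder tree in one pass
folders <- c("data/raw", "data/processed", "scripts",
             "output/figures", "output/tables", "docs")
for (f in folders) {
  dir.create(f, recursive = TRUE, showWarnings = FALSE)
}
```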
File Management Best Practices
Naming Conventions
Files and Folders:
- Use lowercase with hyphens: data-cleaning.R
- Or use underscores: data_cleaning.R
- Be descriptive: monthly_sales_analysis.R
- Use prefixes for ordering: 01_import.R, 02_clean.R
Variables and Functions:
# Good naming
customer_ages <- c(25, 30, 45, 22)
calculate_total_revenue <- function(price, quantity) { ... }

# Avoid
x <- c(25, 30, 45, 22)
fun1 <- function(a, b) { ... }
Working with Paths
Use the here package:
# Install if needed
install.packages("here")
library(here)
# Benefits of here()
data_path <- here("data", "raw", "sales.csv")
# Works on Windows: data/raw/sales.csv
# Works on Mac/Linux: data/raw/sales.csv

# Read data
sales_data <- read_csv(here("data", "raw", "sales.csv"))
# Save output
write_csv(results, here("output", "tables", "summary.csv"))
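If you cannot install here, base R’s file.path() at least builds the separator portably, though the result is still interpreted relative to the current working directory rather than the project root:

```r
# Base-R alternative for building portable paths
data_path <- file.path("data", "raw", "sales.csv")
data_path
# "data/raw/sales.csv" on every platform
```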
Data Import Best Practices
# Good data import workflow
import_sales_data <- function() {
  # Define file path
  file_path <- here("data", "raw", "sales_2024.csv")

  # Check if file exists
  if (!file.exists(file_path)) {
    stop("Sales data file not found: ", file_path)
  }

  # Import with explicit column types
  sales_data <- read_csv(
    file_path,
    col_types = cols(
      date = col_date(format = "%Y-%m-%d"),
      customer_id = col_character(),
      amount = col_double(),
      product = col_character()
    )
  )

  # Basic validation
  if (nrow(sales_data) == 0) {
    warning("Sales data is empty")
  }

  return(sales_data)
}

# Use the function
sales_data <- import_sales_data()
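You could extend the validation step to confirm that the columns downstream code relies on are actually present. A base-R sketch (check_columns is a hypothetical helper; the column names come from the col_types specification above):

```r
# Stop early if any expected column is missing
check_columns <- function(data, required) {
  missing <- setdiff(required, names(data))
  if (length(missing) > 0) {
    stop("Missing columns: ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy example
toy <- data.frame(date = Sys.Date(), customer_id = "C1",
                  amount = 9.99, product = "widget")
check_columns(toy, c("date", "customer_id", "amount", "product"))
```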
Reproducible Analysis Workflow
Script Dependencies
Master Script Approach:
# main_analysis.R
# ============================================================================
# Master Script: Sales Analysis Pipeline
# ============================================================================
# Setup
source(here("scripts", "00_setup.R"))
# Data pipeline
source(here("scripts", "01_data_import.R"))
source(here("scripts", "02_data_cleaning.R"))
source(here("scripts", "03_data_analysis.R"))
source(here("scripts", "04_generate_report.R"))
# Session info
sessionInfo()
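When one stage of the pipeline fails, it helps to know which one and how long the earlier stages took. A small wrapper around source() can report progress and timing (run_step is a hypothetical helper, not part of base R):

```r
library(here)

# Announce, run, and time one pipeline step
run_step <- function(path) {
  message("Running ", path, " ...")
  elapsed <- system.time(source(path))["elapsed"]
  message("  finished in ", round(elapsed, 1), " seconds")
  invisible(NULL)
}

run_step(here("scripts", "01_data_import.R"))
```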
Package Management
Using renv for Reproducibility:
# Install renv
install.packages("renv")
# Initialize project environment
renv::init()

# Install packages
install.packages(c("tidyverse", "here", "lubridate"))

# Take snapshot of current packages
renv::snapshot()

# Restore environment (on another computer)
renv::restore()
Common Project Pitfalls
Pitfall 1: Hardcoded Paths
# Bad
setwd("C:/Users/John/Documents/analysis")
data <- read.csv("C:/Users/John/Documents/analysis/data.csv")

# Good
data <- read_csv(here("data", "data.csv"))
Pitfall 2: Not Using Version Control
- Initialize Git repository when creating project
- Use meaningful commit messages
- Push to GitHub/GitLab for backup
Pitfall 3: Mixing Data and Scripts
# Bad structure
project/
├── analysis.R
├── data.csv
├── plot.R
└── results.csv

# Good structure
project/
├── scripts/
│   ├── analysis.R
│   └── plot.R
├── data/
│   └── data.csv
└── output/
    └── results.csv
Pitfall 4: No Documentation
Always include:
- README.md with project description
- Comments in scripts explaining complex logic
- Data dictionaries for datasets
Practical Exercise
Let’s create a complete project:
Create New Project
- Name: “iris-analysis”
- Choose appropriate location
Set Up Structure
# Run this in your new project
dir.create("data")
dir.create("scripts")
dir.create("output")
Create Analysis Script
# scripts/iris_analysis.R
# ========================================================================
# Iris Dataset Analysis
# Author: Your Name
# Date: Today's Date
# ========================================================================

# Setup
library(tidyverse)
library(here)

# Load data (built-in dataset)
data("iris")

# Quick exploration
glimpse(iris)
summary(iris)

# Analysis
species_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    mean_sepal_length = mean(Sepal.Length),
    mean_petal_length = mean(Petal.Length),
    count = n()
  )

print(species_summary)

# Visualization
scatter_plot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point(size = 3) +
  labs(
    title = "Iris: Sepal Length vs Petal Length",
    x = "Sepal Length (cm)",
    y = "Petal Length (cm)"
  ) +
  theme_minimal()

print(scatter_plot)

# Save results
write_csv(species_summary, here("output", "species_summary.csv"))
ggsave(here("output", "iris_scatter.png"), scatter_plot, width = 8, height = 6)
Run and Test
- Run script section by section
- Check that files are created in output folder
- Verify paths work correctly
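The “check that files are created” step can itself be scripted — a quick sanity check, assuming the two output file names used in the iris script above:

```r
# Confirm the analysis produced its expected outputs
expected <- c("output/species_summary.csv", "output/iris_scatter.png")
missing <- expected[!file.exists(expected)]
if (length(missing) > 0) {
  warning("Missing outputs: ", paste(missing, collapse = ", "))
} else {
  message("All expected output files are present.")
}
```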
Summary
You now understand how to organize R work professionally:
R Scripts
- Structure scripts with clear headers and sections
- Comment code to explain your reasoning
- Use meaningful variable and file names
- Organize code logically from setup to results
RStudio Projects
- Create projects for each analysis
- Use consistent folder structures
- Leverage relative paths with here()
- Document your work with README files
Best Practices
- Plan your analysis structure before coding
- Test scripts section by section
- Save intermediate results
- Document your session info for reproducibility
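“Save intermediate results” and “document your session info” each come down to a line or two of base R. A sketch (the file names are examples, not fixed conventions):

```r
# Save an intermediate object so a later script can pick up where this one left off
processed <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))  # stand-in for real data
saveRDS(processed, "processed_data.rds")
processed_again <- readRDS("processed_data.rds")  # reload in a later script

# Record the exact package versions used for this run
writeLines(capture.output(sessionInfo()), "session_info.txt")
```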
Next Steps
With Module 1 complete, you’re ready to dive deeper into R:
- Module 2: R Language Fundamentals - Objects, data types, and control structures
- Module 3: Introduction to Tidyverse - Modern R data manipulation
Congratulations! You’ve completed Module 1 and have a solid foundation for R programming. The habits you develop now will serve you well throughout your data science journey.