library(readr)
library(ggplot2)
library(dplyr)
# Load data
data_source <- "https://raw.githubusercontent.com/simoneSantoni/data-viz-smm635/refs/heads/master/data/googleplaystore.csv"
apps <- read_csv(data_source)
Google Play Store Apps: Data Exploration with R’s ggplot2
R’s ggplot2
ggplot2 is R’s premier visualization library and the canonical implementation of the Grammar of Graphics. Created by Hadley Wickham, it revolutionized data visualization by providing a coherent system for declaratively building plots through layered components. The library enables users to construct complex visualizations by combining data, aesthetic mappings, geometric objects (geoms), statistical transformations, scales, coordinate systems, facets, and themes. This layered grammar promotes principled, reproducible graphics with exceptional flexibility, making ggplot2 the de facto standard for publication-quality visualizations in R and a model for visualization libraries across programming languages.
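As a rough illustration of the layered grammar described above, the sketch below builds one plot from R's built-in mtcars data set (used here only as a stand-in, it is not part of this analysis). Each line is annotated with the grammar component it contributes.

```r
library(ggplot2)

# One plot assembled layer by layer; mtcars is a built-in example data set
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +                 # data + aesthetic mappings
  geom_point(alpha = 0.6) +                                  # geometric object
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # statistical transformation
  scale_y_continuous(name = "Miles per gallon") +            # scale
  coord_cartesian(xlim = c(1, 6)) +                          # coordinate system
  facet_wrap(~cyl) +                                         # facets
  theme_minimal()                                            # theme

p
```

Removing or swapping any single line changes one component of the plot while leaving the others intact, which is the practical benefit of the layered design.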
Setup
Data Overview
# Column specifications
spec(apps)
# Data preview
apps
Key Variables
Variable | Type | Description | Notes |
---|---|---|---|
App | String | Application name | May contain special characters |
Category | Categorical | Primary app category (33 categories) | Contains some miscoded entries |
Rating | Float | Average user rating (1.0-5.0) | Missing for unrated apps |
Reviews | Integer | Total number of user reviews | Indicator of popularity |
Size | String | App size with units (e.g., “19M”) | Requires parsing; “Varies with device” for some |
Installs | Categorical | Install count ranges (e.g., “10,000+”) | Ordinal categories, not exact counts |
Type | Binary | “Free” or “Paid” | Contains data quality issues |
Price | String | Price in USD (e.g., “$2.99”) | “0” for free apps |
Content Rating | Categorical | Age appropriateness | “Everyone”, “Teen”, “Mature 17+”, etc. |
Genres | String | Detailed genre (may be multiple) | Semicolon-separated |
Last Updated | Date | Last update date | Format: “Month Day, Year” |
Current Ver | String | App version | Non-standardized format |
Android Ver | String | Minimum Android version required | Format: “X.X.X and up” |
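Several variables in the table are stored as strings and need parsing before they can be plotted as numbers. The following is a minimal sketch on a few hypothetical rows (`toy_apps`) that mimic the formats listed above; the derived column names and the choice of 1024 kilobytes per megabyte are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)
library(readr)

# Hypothetical rows that mimic the raw string formats in the table
toy_apps <- tibble(
  Size     = c("19M", "201k", "Varies with device"),
  Installs = c("10,000+", "500+", "1,000,000+"),
  Price    = c("0", "$2.99", "0")
)

parsed <- toy_apps %>%
  mutate(
    # Numeric part of Size; "Varies with device" is treated as missing
    size_num = parse_number(Size, na = c("", "NA", "Varies with device")),
    # Convert to megabytes (assumes 1 MB = 1024 kB)
    size_mb = case_when(
      endsWith(Size, "M") ~ size_num,
      endsWith(Size, "k") ~ size_num / 1024,
      TRUE ~ NA_real_
    ),
    # Lower bound of the install range, e.g. "10,000+" becomes 10000
    installs_min = parse_number(Installs),
    # Price in USD with the currency symbol stripped
    price_usd = parse_number(Price)
  )

parsed
```

Because Installs is a range, `installs_min` is only a lower bound and is better treated as an ordered category than as an exact count.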
Data Visualization
Univariate Distributions
Categorical Variables
Bar charts are the standard approach for visualizing categorical distributions. The bar chart of Type below reveals data quality issues: some apps have Type = 0, and others have a missing value (NaN).
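# Bar chart of app counts by Type
ggplot(apps, aes(x = Type)) +
  geom_bar()

# Pie chart: a single stacked bar drawn in polar coordinates
ggplot(apps, aes(x = "", fill = Type)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

# Bar chart restricted to Type values other than "Free" and "Paid"
apps %>%
  filter(Type != "Free" & Type != "Paid") %>%
  ggplot(aes(x = Type)) +
  geom_bar()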
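The same data quality check can be done as a table. The sketch below uses a hypothetical `toy_apps` tibble and assumes a missing Type is stored as NA; whether the real data holds NA or the literal string "NaN" depends on how the file was read. It also shows that a `!=` comparison inside `filter()` silently drops NA rows, while `!Type %in% ...` keeps them.

```r
library(dplyr)

# Hypothetical Type column with one miscoded value and one missing value
toy_apps <- tibble(Type = c("Free", "Free", "Paid", "0", NA, "Free"))

# Frequency table, most common level first
type_counts <- toy_apps %>% count(Type, sort = TRUE)

# Comparisons with NA evaluate to NA, and filter() drops those rows
flagged_strict <- toy_apps %>% filter(Type != "Free" & Type != "Paid")

# %in% returns FALSE for NA, so negating it keeps the missing row
flagged_all <- toy_apps %>% filter(!Type %in% c("Free", "Paid"))

type_counts
```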
Continuous Variables
The Reviews variable exhibits strong right skew. Multiple visualization approaches reveal different aspects of the distribution. Note: geom_histogram() defaults to 30 bins; adjust with the bins argument.
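As a small sketch of the bins argument mentioned in the note, the example below sets the number of bins explicitly. The data are simulated (`toy_reviews` is a hypothetical stand-in for the Reviews column), and the value 50 is arbitrary.

```r
library(ggplot2)

# Simulated right-skewed counts standing in for Reviews (hypothetical data)
set.seed(1)
toy_reviews <- data.frame(
  Reviews = round(rlnorm(1000, meanlog = 7, sdlog = 2)) + 1
)

# Override the 30-bin default with an explicit number of bins
p <- ggplot(toy_reviews, aes(x = Reviews)) +
  geom_histogram(bins = 50) +
  scale_x_log10()

p
```

Setting bins explicitly also silences the message ggplot2 prints when it falls back to the default.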
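# Histogram on the raw scale
ggplot(apps, aes(x = Reviews)) +
  geom_histogram()

# Histogram on a log10 x-axis
ggplot(apps, aes(x = Reviews)) +
  geom_histogram() +
  scale_x_log10()

# Horizontal boxplot on a log10 scale
ggplot(apps, aes(x = Reviews)) +
  geom_boxplot() +
  scale_x_log10()

# Vertical boxplot on a log10 scale
ggplot(apps, aes(y = Reviews)) +
  geom_boxplot() +
  scale_y_log10()

# Density estimate on a log10 scale
ggplot(apps, aes(x = Reviews)) +
  geom_density() +
  scale_x_log10()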
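The right skew seen in these plots can also be checked numerically: in a right-skewed variable the mean sits well above the median, and the gap largely disappears after a log transform. The sketch below uses simulated values (`toy_reviews` is hypothetical, not the real Reviews column).

```r
# Simulated right-skewed values standing in for Reviews (hypothetical data)
set.seed(42)
toy_reviews <- rlnorm(1000, meanlog = 7, sdlog = 2)

# On the raw scale the mean is pulled upward by a few very large values
skew_check <- c(
  mean   = mean(toy_reviews),
  median = median(toy_reviews),
  p99    = unname(quantile(toy_reviews, 0.99))
)

# On the log10 scale mean and median are close together
log_check <- c(
  mean   = mean(log10(toy_reviews)),
  median = median(log10(toy_reviews))
)

print(skew_check)
print(log_check)
```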
Bivariate Relationships
Categorical vs Continuous
Grouped visualizations reveal how continuous variables differ across categories. The plots below compare rating distributions between free and paid apps.
# Boxplot of Rating by Type (valid Type values only)
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = Type, y = Rating, fill = Type)) +
  geom_boxplot() +
  theme_minimal()

# Violin plot of Rating by Type
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = Type, y = Rating, fill = Type)) +
  geom_violin() +
  theme_minimal()

# Histograms of Rating, one panel per Type
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = Rating, fill = Type)) +
  geom_histogram(bins = 30) +
  facet_wrap(~Type, ncol = 1) +
  theme_minimal()
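A grouped numeric summary can complement these plots. The sketch below computes the number of rated apps and the median rating per Type on a hypothetical `toy_apps` tibble; the values are made up for illustration and say nothing about the real data.

```r
library(dplyr)

# Hypothetical ratings, including unrated apps and a miscoded Type
toy_apps <- tibble(
  Type   = c("Free", "Free", "Free", "Paid", "Paid", "0"),
  Rating = c(4.1, 3.8, NA, 4.5, 4.3, NA)
)

rating_summary <- toy_apps %>%
  filter(Type %in% c("Free", "Paid")) %>%          # keep valid Type values
  group_by(Type) %>%
  summarise(
    n_rated       = sum(!is.na(Rating)),           # apps with a rating
    median_rating = median(Rating, na.rm = TRUE),  # ignore unrated apps
    .groups = "drop"
  )

rating_summary
```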
Continuous vs Continuous
Scatterplots visualize relationships between two continuous variables. The plots below explore whether highly reviewed apps have better ratings.
# Scatterplot of Rating against Reviews (log10 x-axis)
apps %>%
  filter(!is.na(Rating)) %>%
  ggplot(aes(x = Reviews, y = Rating)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() +
  theme_minimal()

# Same scatterplot with a LOESS trend line
apps %>%
  filter(!is.na(Rating)) %>%
  ggplot(aes(x = Reviews, y = Rating)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", color = "red") +
  scale_x_log10() +
  theme_minimal()

# Scatterplot colored by Type (valid Type values only)
apps %>%
  filter(!is.na(Rating), Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = Reviews, y = Rating, color = Type)) +
  geom_point(alpha = 0.4) +
  scale_x_log10() +
  theme_minimal()
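Two details are worth making explicit when a count variable goes on a log axis: zero counts have no finite log10 value, and a rank correlation gives a single-number summary of the trend. The sketch below filters out zeros before plotting and computes a Spearman correlation; `toy_apps` and its values are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical apps, including one with zero reviews and no rating
toy_apps <- tibble(
  Reviews = c(0, 12, 350, 4800, 91000),
  Rating  = c(NA, 3.9, 4.2, 4.4, 4.6)
)

# Drop unrated apps and zero counts, since log10(0) is -Inf
plot_data <- toy_apps %>%
  filter(!is.na(Rating), Reviews > 0)

# Rank correlation between log-scaled reviews and rating
rho <- cor(log10(plot_data$Reviews), plot_data$Rating, method = "spearman")

p <- ggplot(plot_data, aes(x = Reviews, y = Rating)) +
  geom_point() +
  scale_x_log10() +
  theme_minimal()

p
```

Spearman correlation depends only on ranks, so it gives the same value with or without the log transform; the transform matters for the plot, not for this statistic.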
Categorical vs Categorical
Stacked and grouped bar charts, along with count heatmaps, show relationships between categorical variables. The plots below examine how the split between free and paid apps varies across content ratings.
# Stacked bar chart: Type within each content rating
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = `Content Rating`, fill = Type)) +
  geom_bar() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Grouped (dodged) bar chart
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = `Content Rating`, fill = Type)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Heatmap of counts for each Type and content rating combination
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  count(Type, `Content Rating`) %>%
  ggplot(aes(x = `Content Rating`, y = Type, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), color = "white", size = 4) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(fill = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
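When content ratings differ greatly in size, raw counts make it hard to compare the free/paid split across them. One possible variant, sketched below on a hypothetical `toy_apps` tibble, normalizes each bar to 100% with position = "fill" and computes the same shares as a table.

```r
library(dplyr)
library(ggplot2)

# Hypothetical apps with a Type and a content rating
toy_apps <- tibble(
  Type             = c("Free", "Free", "Paid", "Free", "Paid", "Free"),
  `Content Rating` = c("Everyone", "Everyone", "Everyone", "Teen", "Teen", "Teen")
)

# Share of each Type within each content rating
shares <- toy_apps %>%
  count(`Content Rating`, Type) %>%
  group_by(`Content Rating`) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# Proportional stacked bars: every bar has height 1
p <- ggplot(toy_apps, aes(x = `Content Rating`, fill = Type)) +
  geom_bar(position = "fill") +
  labs(y = "Share of apps") +
  theme_minimal()

shares
```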