Google Play Store Apps: Data Exploration with R’s ggplot2

R’s ggplot2

ggplot2 is R’s most widely used visualization library and an implementation of the Grammar of Graphics. Created by Hadley Wickham, it provides a coherent system for building plots declaratively from layered components: data, aesthetic mappings, geometric objects (geoms), statistical transformations, scales, coordinate systems, facets, and themes. This layered grammar supports principled, reproducible graphics, which has made ggplot2 the de facto standard for publication-quality visualization in R and a model for visualization libraries in other programming languages.
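As a minimal sketch of how these components combine, the following builds a single plot with one grammar component per line. It uses the `mpg` dataset that ships with ggplot2, which is an illustrative choice and not the dataset analyzed in this document.

```r
library(ggplot2)

# Illustrative data: `mpg` is bundled with ggplot2 (not the Play Store data)
p <- ggplot(mpg, aes(x = displ, y = hwy)) +               # data + aesthetic mappings
  geom_point(alpha = 0.5) +                               # geometric object
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # statistical transformation
  scale_x_log10() +                                       # scale
  coord_cartesian(ylim = c(10, 45)) +                     # coordinate system
  facet_wrap(~drv) +                                      # facets
  theme_minimal()                                         # theme

p  # printing the object renders the plot
```

Each `+` adds a component to the plot object, so any line can be removed or swapped without rewriting the rest.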

Setup

library(readr)
library(ggplot2)
library(dplyr)

# Load data
data_source <- "https://raw.githubusercontent.com/simoneSantoni/data-viz-smm635/refs/heads/master/data/googleplaystore.csv"
apps <- read_csv(data_source)

This dataset, from Kaggle’s Google Play Store Apps project, contains information on roughly 10,000 Android apps, including ratings, reviews, pricing, and categories.

Data Overview

spec(apps)  # Column specifications
apps        # Data preview

Key Variables

Variable	Type	Description	Notes
App	String	Application name	May contain special characters
Category	Categorical	Primary app category (33 categories)	Contains some miscoded entries
Rating	Float	Average user rating (1.0-5.0)	Missing for unrated apps
Reviews	Integer	Total number of user reviews	Indicator of popularity
Size	String	App size with units (e.g., “19M”)	Requires parsing; “Varies with device” for some
Installs	Categorical	Install count ranges (e.g., “10,000+”)	Ordinal categories, not exact counts
Type	Binary	“Free” or “Paid”	Contains data quality issues
Price	String	Price in USD (e.g., “$2.99”)	“0” for free apps
Content Rating	Categorical	Age appropriateness	“Everyone”, “Teen”, “Mature 17+”, etc.
Genres	String	Detailed genre (may be multiple)	Semicolon-separated
Last Updated	Date	Last update date	Format: “Month Day, Year”
Current Ver	String	App version	Non-standardized format
Android Ver	String	Minimum Android version required	Format: “X.X.X and up”

Data Visualization

Univariate Distributions

Categorical Variables

Bar charts are the standard approach for visualizing categorical distributions. The bar chart of Type reveals data quality issues: some apps have Type = 0, and others have missing values (NaN).

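# Bar chart: number of apps for each Type value
ggplot(apps, aes(x = Type)) +
  geom_bar()

# Pie chart: the same counts drawn as a single stacked bar in polar coordinates
ggplot(apps, aes(x = "", fill = Type)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

# Bar chart restricted to invalid Type values (neither "Free" nor "Paid")
apps %>%
  filter(Type != "Free" & Type != "Paid") %>%
  ggplot(aes(x = Type)) +
  geom_bar()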
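
A frequency table is a quick cross-check on what the bar chart shows. The sketch below uses a small hypothetical tibble (`toy_apps`, invented for illustration) that mimics the invalid Type values described above; with the real data, the same `count()` call would be applied to `apps`.

```r
library(dplyr)

# Hypothetical stand-in for `apps`, including the invalid Type values noted above
toy_apps <- tibble(
  App  = c("a", "b", "c", "d", "e", "f"),
  Type = c("Free", "Free", "Paid", "0", "NaN", "Free")
)

# Tabulate every distinct Type value, most frequent first
type_counts <- count(toy_apps, Type, sort = TRUE)
type_counts

# Keep only the two valid levels for downstream plots
valid_apps <- filter(toy_apps, Type %in% c("Free", "Paid"))
nrow(valid_apps)
```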

Continuous Variables

The Reviews variable is strongly right-skewed, so most of the plots below use a log scale; each geom reveals a different aspect of the distribution. Note: geom_histogram() defaults to 30 bins; adjust this with the bins argument.

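# Histogram on the raw scale: the skew compresses most apps into the first bin
ggplot(apps, aes(x = Reviews)) +
  geom_histogram()

# Histogram on a log10 x-axis
ggplot(apps, aes(x = Reviews)) +
  geom_histogram() +
  scale_x_log10()

# Horizontal boxplot on a log10 scale
ggplot(apps, aes(x = Reviews)) +
  geom_boxplot() +
  scale_x_log10()

# Vertical boxplot: the same data mapped to y
ggplot(apps, aes(y = Reviews)) +
  geom_boxplot() +
  scale_y_log10()

# Kernel density estimate on a log10 x-axis
ggplot(apps, aes(x = Reviews)) +
  geom_density() +
  scale_x_log10()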
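
Two details are worth checking on a log scale: the number of bins, and the handling of zeros, since log10(0) is not finite and ggplot2 drops such rows with a warning. A minimal sketch on simulated data (the `toy` tibble and its values are invented for illustration) sets `bins` explicitly and filters out zero counts first:

```r
library(ggplot2)
library(dplyr)

set.seed(1)
# Simulated right-skewed counts plus a few zeros (hypothetical data)
toy <- tibble(Reviews = c(round(rlnorm(500, meanlog = 6, sdlog = 2)), 0, 0, 0))

# Drop zeros explicitly so the log scale does not discard them silently
toy_pos <- filter(toy, Reviews > 0)

p <- ggplot(toy_pos, aes(x = Reviews)) +
  geom_histogram(bins = 50) +   # override the default of 30 bins
  scale_x_log10()
p
```

Filtering first makes the number of excluded rows explicit (`nrow(toy) - nrow(toy_pos)`) instead of leaving it to a plotting warning.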

Bivariate Relationships

Categorical vs Continuous

Grouped visualizations reveal how continuous variables differ across categories. The plots below compare rating distributions between free and paid apps.

# Side-by-side boxplots of Rating for valid Type values
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = Type, y = Rating, fill = Type)) +
  geom_boxplot() +
  theme_minimal()

# Violin plots: show the shape of each distribution
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = Type, y = Rating, fill = Type)) +
  geom_violin() +
  theme_minimal()

# Faceted histograms: one panel per Type, stacked vertically
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = Rating, fill = Type)) +
  geom_histogram(bins = 30) +
  facet_wrap(~Type, ncol = 1) +
  theme_minimal()
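
Numeric summaries by group complement the plots above. The following is a sketch on a hypothetical tibble (`toy_apps`, with values invented for illustration); on the real data the same pipeline would start from `apps`.

```r
library(dplyr)

# Hypothetical stand-in for `apps`
toy_apps <- tibble(
  Type   = c("Free", "Free", "Free", "Paid", "Paid", "0"),
  Rating = c(4.0, 4.5, NA, 4.2, 4.8, NA)
)

rating_summary <- toy_apps %>%
  filter(Type %in% c("Free", "Paid")) %>%     # drop invalid Type values
  group_by(Type) %>%
  summarise(
    n_rated       = sum(!is.na(Rating)),      # apps with a rating
    median_rating = median(Rating, na.rm = TRUE),
    iqr_rating    = IQR(Rating, na.rm = TRUE)
  )
rating_summary
```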

Continuous vs Continuous

Scatterplots visualize relationships between two continuous variables. The scatterplots below explore whether highly reviewed apps have better ratings.

# Scatterplot with transparency to reduce overplotting; Reviews on a log10 axis
apps %>%
  filter(!is.na(Rating)) %>%
  ggplot(aes(x = Reviews, y = Rating)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() +
  theme_minimal()

# Add a LOESS smoother to show the overall trend
apps %>%
  filter(!is.na(Rating)) %>%
  ggplot(aes(x = Reviews, y = Rating)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", color = "red") +
  scale_x_log10() +
  theme_minimal()

# Color points by Type (valid values only)
apps %>%
  filter(!is.na(Rating), Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = Reviews, y = Rating, color = Type)) +
  geom_point(alpha = 0.4) +
  scale_x_log10() +
  theme_minimal()
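
A rank correlation puts a number on the pattern in the scatterplots. The sketch below uses Spearman's coefficient, which depends only on ranks and is therefore unchanged by the log transformation of Reviews. The `toy_apps` values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for `apps`
toy_apps <- tibble(
  Reviews = c(10, 250, 1200, 56000, 980000, 15),
  Rating  = c(3.9, 4.1, 4.0, 4.4, 4.5, NA)
)

# Keep rated apps only, as in the plots above
rated <- filter(toy_apps, !is.na(Rating))

# Spearman's rho on the raw and on the log10 scale (identical by construction)
rho     <- cor(rated$Reviews, rated$Rating, method = "spearman")
rho_log <- cor(log10(rated$Reviews), rated$Rating, method = "spearman")
c(rho = rho, rho_log = rho_log)
```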

Categorical vs Categorical

Stacked and grouped bar charts show relationships between categorical variables. The charts below examine how the split between free and paid apps varies across content ratings.

# Stacked bars: total apps per content rating, split by Type
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = `Content Rating`, fill = Type)) +
  geom_bar() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Grouped (dodged) bars: free and paid counts side by side
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = `Content Rating`, fill = Type)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Heatmap: count for each Type / Content Rating combination, labeled in each tile
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  count(Type, `Content Rating`) %>%
  ggplot(aes(x = `Content Rating`, y = Type, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), color = "white", size = 4) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(fill = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
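
When the content-rating groups differ in size, raw counts can make it hard to compare the paid share between them. One option, sketched here on a hypothetical tibble (`toy_apps`, values invented for illustration), is to compute within-group shares and draw them with `position = "fill"`:

```r
library(ggplot2)
library(dplyr)

# Hypothetical stand-in for `apps`
toy_apps <- tibble(
  `Content Rating` = c("Everyone", "Everyone", "Everyone", "Everyone", "Teen", "Teen"),
  Type             = c("Free", "Free", "Free", "Paid", "Free", "Paid")
)

# Share of each Type within each content rating
type_share <- toy_apps %>%
  count(`Content Rating`, Type) %>%
  group_by(`Content Rating`) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
type_share

# position = "fill" rescales each bar to sum to 1, showing the same shares
p <- ggplot(toy_apps, aes(x = `Content Rating`, fill = Type)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion") +
  theme_minimal()
p
```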

Next Steps

Data Cleaning: Address missing values and miscoded entries in Type, Category, and other fields
Feature Engineering: Parse Size and Installs into numeric formats for quantitative analysis
Advanced Analysis: Examine temporal patterns using Last Updated, analyze price elasticity, investigate category-specific trends
Multivariate Analysis: Build models to predict ratings or success metrics based on app characteristics
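
The feature engineering step can be sketched as follows, using readr's `parse_number()` for Installs and Price. The helper `parse_size_mb()`, the new column names, and the sample strings are hypothetical. The sketch assumes sizes carry an "M" suffix for megabytes, as in the "19M" example in the variable table; any other unit would need its own handling and is returned as NA here.

```r
library(readr)
library(dplyr)

# Hypothetical helper: "19M" -> 19; anything without an "M" suffix -> NA
parse_size_mb <- function(size) {
  is_mb <- grepl("^[0-9.]+M$", size)
  out <- rep(NA_real_, length(size))
  out[is_mb] <- as.numeric(sub("M$", "", size[is_mb]))
  out
}

# Sample strings in the formats described in the variable table
toy_apps <- tibble(
  Size     = c("19M", "Varies with device", "2.5M"),
  Installs = c("10,000+", "1,000,000+", "500+"),
  Price    = c("0", "$2.99", "0")
)

cleaned <- toy_apps %>%
  mutate(
    size_mb      = parse_size_mb(Size),
    installs_min = parse_number(Installs),  # lower bound of the install range
    price_usd    = parse_number(Price)      # strips the "$" prefix
  )
cleaned
```

Because Installs records ranges, `installs_min` is only a lower bound and is better treated as an ordinal measure than as an exact count.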