Google Play Store Apps: Data Exploration with R’s ggplot2

R’s ggplot2

ggplot2 is R’s most widely used visualization library and an implementation of the Grammar of Graphics. Created by Hadley Wickham, it provides a coherent system for building plots declaratively from layered components: data, aesthetic mappings, geometric objects (geoms), statistical transformations, scales, coordinate systems, facets, and themes. Because each component is specified explicitly and independently, plots are reproducible and easy to modify, which has made ggplot2 the de facto standard for publication-quality graphics in R and a model for visualization libraries in other languages.
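To make the layering concrete, here is a minimal sketch that adds one component per line. It uses the mpg dataset bundled with ggplot2 rather than the Play Store data, purely so that it runs without a download; the variable choices are illustrative.

```r
library(ggplot2)

# mpg ships with ggplot2, so no external data is needed here
p <- ggplot(mpg, aes(x = displ, y = hwy)) +               # data + aesthetic mappings
  geom_point(alpha = 0.5) +                                # geometric object
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # statistical transformation
  scale_y_continuous(name = "Highway mpg") +               # scale
  facet_wrap(~drv) +                                       # facets
  theme_minimal()                                          # theme

p
```

Removing or swapping any single line changes one aspect of the plot and leaves the rest untouched.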

Setup

library(readr)
library(ggplot2)
library(dplyr)

# Load data
data_source <- "https://raw.githubusercontent.com/simoneSantoni/data-viz-smm635/refs/heads/master/data/googleplaystore.csv"
apps <- read_csv(data_source)

This dataset from Kaggle’s Google Play Store Apps project contains information on ~10K Android apps including ratings, reviews, pricing, and categories.

Data Overview

spec(apps)  # Column specifications
apps        # Data preview

Key Variables

| Variable | Type | Description | Notes |
|---|---|---|---|
| App | String | Application name | May contain special characters |
| Category | Categorical | Primary app category (33 categories) | Contains some miscoded entries |
| Rating | Float | Average user rating (1.0-5.0) | Missing for unrated apps |
| Reviews | Integer | Total number of user reviews | Indicator of popularity |
| Size | String | App size with units (e.g., “19M”) | Requires parsing; “Varies with device” for some |
| Installs | Categorical | Install count ranges (e.g., “10,000+”) | Ordinal categories, not exact counts |
| Type | Binary | “Free” or “Paid” | Contains data quality issues |
| Price | String | Price in USD (e.g., “$2.99”) | “0” for free apps |
| Content Rating | Categorical | Age appropriateness | “Everyone”, “Teen”, “Mature 17+”, etc. |
| Genres | String | Detailed genre (may be multiple) | Semicolon-separated |
| Last Updated | Date | Last update date | Format: “Month Day, Year” |
| Current Ver | String | App version | Non-standardized format |
| Android Ver | String | Minimum Android version required | Format: “X.X.X and up” |
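Several of these columns are stored as strings and need parsing before they can be plotted as numbers. The sketch below shows one possible approach on a handful of hypothetical values that mimic the formats listed above. The “k” size suffix, the 1024 kB per MB conversion, and all new column names are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)
library(readr)

# Hypothetical sample values in the formats described in the table
sample_apps <- tibble(
  Size     = c("19M", "201k", "Varies with device"),
  Installs = c("10,000+", "500,000+", "1,000+"),
  Price    = c("0", "$2.99", "$0.99")
)

parsed <- sample_apps %>%
  mutate(
    # Strip everything except digits and the decimal point; "" becomes NA
    size_value = as.numeric(gsub("[^0-9.]", "", Size)),
    # Convert to megabytes (assumes a trailing "k" means kilobytes)
    size_mb = case_when(
      grepl("M$", Size) ~ size_value,
      grepl("k$", Size) ~ size_value / 1024,
      TRUE              ~ NA_real_
    ),
    # parse_number() drops the "+" and the thousands separator
    installs_min = parse_number(Installs),
    # parse_number() drops the leading "$"
    price_usd = parse_number(Price)
  )

parsed
```

Note that installs_min is only the lower bound of each install range, so it remains an ordinal measure rather than an exact count.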

Data Visualization

Univariate Distributions

Categorical Variables

Bar charts are the standard approach for visualizing categorical distributions. The bar chart of Type below reveals data quality issues: some apps have Type = 0, and others have missing values (NaN).

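# Bar chart of app type
ggplot(apps, aes(x = Type)) +
  geom_bar()

# Pie chart: a single stacked bar drawn in polar coordinates
ggplot(apps, aes(x = "", fill = Type)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

# Isolate the miscoded entries (anything other than "Free" or "Paid")
apps %>%
  filter(Type != "Free" & Type != "Paid") %>%
  ggplot(aes(x = Type)) +
  geom_bar()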
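A frequency table is a useful companion to the bar chart, since rare levels can be hard to see as bars. The sketch below uses a small hypothetical stand-in for the Type column; with the real data, the same pipeline would start from apps.

```r
library(dplyr)

# Hypothetical stand-in for the Type column, including miscoded values
type_sample <- tibble(Type = c("Free", "Free", "Paid", "0", "NaN", "Free"))

type_counts <- type_sample %>%
  count(Type, sort = TRUE) %>%      # one row per level, most frequent first
  mutate(share = n / sum(n))        # proportion of all rows

type_counts
```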

Continuous Variables

The Reviews variable exhibits strong right skew, so a logarithmic axis is needed to see the shape of the distribution. Histograms, boxplots, and density plots each reveal different aspects of it. Note: geom_histogram() defaults to 30 bins; adjust this with the bins or binwidth argument.

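# Histogram on the raw scale: dominated by the long right tail
ggplot(apps, aes(x = Reviews)) +
  geom_histogram()

# Histogram on a log10 scale
ggplot(apps, aes(x = Reviews)) +
  geom_histogram() +
  scale_x_log10()

# Horizontal boxplot on a log10 scale
ggplot(apps, aes(x = Reviews)) +
  geom_boxplot() +
  scale_x_log10()

# Vertical boxplot on a log10 scale
ggplot(apps, aes(y = Reviews)) +
  geom_boxplot() +
  scale_y_log10()

# Kernel density estimate on a log10 scale
ggplot(apps, aes(x = Reviews)) +
  geom_density() +
  scale_x_log10()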
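One caveat with log scales is that log10(0) is -Inf, so any app with zero reviews cannot be placed on the axis. A minimal sketch of one way to handle this, using hypothetical review counts, is to filter those rows out explicitly and set the bin count by hand:

```r
library(dplyr)
library(ggplot2)

# Hypothetical review counts, including two apps with no reviews
reviews_sample <- tibble(
  Reviews = c(0, 0, 3, 12, 150, 980, 4200, 87000, 1500000)
)

# Keep only apps with at least one review before taking logs
positive_reviews <- reviews_sample %>%
  filter(Reviews > 0)

p <- ggplot(positive_reviews, aes(x = Reviews)) +
  geom_histogram(bins = 10) +   # explicit bin count instead of the default 30
  scale_x_log10()

p
```

Filtering explicitly makes the exclusion visible in the code, so the number of dropped apps can be reported alongside the plot.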

Bivariate Relationships

Categorical vs Continuous

Grouped visualizations reveal how a continuous variable differs across categories. The three plots below compare rating distributions between free and paid apps, after excluding rows with a miscoded Type.

# Boxplots of rating by type
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = Type, y = Rating, fill = Type)) +
  geom_boxplot() +
  theme_minimal()

# Violin plots of rating by type
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = Type, y = Rating, fill = Type)) +
  geom_violin() +
  theme_minimal()

# Histograms of rating, one panel per type
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = Rating, fill = Type)) +
  geom_histogram(bins = 30) +
  facet_wrap(~Type, ncol = 1) +
  theme_minimal()
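The visual comparison can be backed by a grouped numerical summary. The following sketch computes the statistics that a boxplot displays (median and interquartile range) on a small hypothetical sample; the column names n_rated, median_rating, and iqr_rating are illustrative.

```r
library(dplyr)

# Hypothetical ratings; NA stands for an unrated app
ratings_sample <- tibble(
  Type   = c("Free", "Free", "Free", "Paid", "Paid", "Paid"),
  Rating = c(4.1, 4.5, NA, 4.3, 4.7, 3.9)
)

rating_summary <- ratings_sample %>%
  group_by(Type) %>%
  summarise(
    n_rated       = sum(!is.na(Rating)),          # apps with a rating
    median_rating = median(Rating, na.rm = TRUE), # centre line of the boxplot
    iqr_rating    = IQR(Rating, na.rm = TRUE)     # height of the box
  )

rating_summary
```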

Continuous vs Continuous

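Scatterplots visualize relationships between two continuous variables. The plots below explore whether apps with more reviews tend to have better ratings, first as a plain scatterplot, then with a smoothed trend line, and finally colored by app type. Reviews is shown on a log10 scale because of its skew.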

# Scatterplot with transparency to reduce overplotting
apps %>%
  filter(!is.na(Rating)) %>%
  ggplot(aes(x = Reviews, y = Rating)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() +
  theme_minimal()

# Add a loess trend line (loess can be slow on several thousand points)
apps %>%
  filter(!is.na(Rating)) %>%
  ggplot(aes(x = Reviews, y = Rating)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", color = "red") +
  scale_x_log10() +
  theme_minimal()

# Color points by app type
apps %>%
  filter(!is.na(Rating), Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = Reviews, y = Rating, color = Type)) +
  geom_point(alpha = 0.4) +
  scale_x_log10() +
  theme_minimal()
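A rank correlation offers a single-number summary of the trend in the scatterplots. The sketch below uses hypothetical values and Spearman's coefficient, which is chosen here as an assumption because it depends only on ranks and is therefore unaffected by the log transformation of Reviews.

```r
# Hypothetical review counts and ratings for six apps
reviews <- c(5, 40, 320, 2500, 18000, 150000)
rating  <- c(3.8, 4.4, 4.0, 4.2, 4.5, 4.3)

# Spearman correlation on the raw and on the log10 scale
rho_raw <- cor(reviews, rating, method = "spearman")
rho_log <- cor(log10(reviews), rating, method = "spearman")

# The two values agree because log10 preserves the ordering of the data
c(raw = rho_raw, log10 = rho_log)
```

With the real data, rows with a missing Rating would need to be dropped first, for example with use = "complete.obs".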

Categorical vs Categorical

Stacked and grouped bar charts show relationships between two categorical variables, and a heatmap of counts shows the same cross-tabulation as a grid. The plots below examine how the mix of free and paid apps varies across content ratings.

# Stacked bar chart: counts of free and paid apps per content rating
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = `Content Rating`, fill = Type)) +
  geom_bar() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Grouped (dodged) bar chart: free and paid bars side by side
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  ggplot(aes(x = `Content Rating`, fill = Type)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Heatmap of counts with the count printed in each cell
apps %>%
  filter(Type %in% c("Free", "Paid")) %>%
  count(Type, `Content Rating`) %>%
  ggplot(aes(x = `Content Rating`, y = Type, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), color = "white", size = 4) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(fill = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
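When group sizes differ widely, raw counts make it hard to compare the mix of free and paid apps across content ratings. One option, sketched here on hypothetical data, is to compute within-group shares and draw a proportional stacked bar chart with position = "fill". The share column name is illustrative.

```r
library(dplyr)
library(ggplot2)

# Hypothetical cross-classification of eight apps
content_sample <- tibble(
  `Content Rating` = c("Everyone", "Everyone", "Everyone", "Everyone",
                       "Teen", "Teen", "Mature 17+", "Mature 17+"),
  Type = c("Free", "Free", "Free", "Paid", "Free", "Paid", "Free", "Free")
)

# Share of each type within each content rating
type_share <- content_sample %>%
  count(`Content Rating`, Type) %>%
  group_by(`Content Rating`) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# Every bar is rescaled to height 1, so only the proportions are compared
p <- ggplot(content_sample, aes(x = `Content Rating`, fill = Type)) +
  geom_bar(position = "fill") +
  labs(y = "Share of apps") +
  theme_minimal()

p
```

The trade-off is that a proportional chart hides how many apps each bar represents, so it works best next to the count-based versions.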

Next Steps

  • Data Cleaning: Address missing values and miscoded entries in Type, Category, and other fields
  • Feature Engineering: Parse Size and Installs into numeric formats for quantitative analysis
  • Advanced Analysis: Examine temporal patterns using Last Updated, analyze price elasticity, investigate category-specific trends
  • Multivariate Analysis: Build models to predict ratings or success metrics based on app characteristics