Google Play Store Apps: Data Exploration with Python’s plotnine

Python’s plotnine

plotnine is a Python implementation of the Grammar of Graphics, modeled on R’s ggplot2. Plots are built declaratively by layering components: aesthetic mappings, geometric objects (geoms), statistical transformations, scales, coordinate systems, and themes. Because the syntax closely follows ggplot2, plotnine suits data scientists who move between R and Python or who want ggplot2’s expressiveness in a Python workflow.

Setup

import pandas as pd
from plotnine import *
import warnings
warnings.filterwarnings('ignore')

# Load data
data_source = "https://raw.githubusercontent.com/simoneSantoni/data-viz-smm635/refs/heads/master/data/googleplaystore.csv"
apps = pd.read_csv(data_source)

# Convert numeric columns
apps['Reviews'] = pd.to_numeric(apps['Reviews'], errors='coerce')
apps['Rating'] = pd.to_numeric(apps['Rating'], errors='coerce')

This dataset, from the Google Play Store Apps project on Kaggle, contains information on roughly 10,000 Android apps, including ratings, reviews, pricing, and categories.

Data Overview

apps.dtypes  # Column types
apps         # Data preview

Key Variables

| Variable | Type | Description | Notes |
|---|---|---|---|
| App | String | Application name | May contain special characters |
| Category | Categorical | Primary app category (33 categories) | Contains some miscoded entries |
| Rating | Float | Average user rating (1.0-5.0) | Missing for unrated apps |
| Reviews | Integer | Total number of user reviews | Indicator of popularity |
| Size | String | App size with units (e.g., “19M”) | Requires parsing; “Varies with device” for some |
| Installs | Categorical | Install count ranges (e.g., “10,000+”) | Ordinal categories, not exact counts |
| Type | Binary | “Free” or “Paid” | Contains data quality issues |
| Price | String | Price in USD (e.g., “$2.99”) | “0” for free apps |
| Content Rating | Categorical | Age appropriateness | “Everyone”, “Teen”, “Mature 17+”, etc. |
| Genres | String | Detailed genre (may be multiple) | Semicolon-separated |
| Last Updated | Date | Last update date | Format: “Month Day, Year” |
| Current Ver | String | App version | Non-standardized format |
| Android Ver | String | Minimum Android version required | Format: “X.X.X and up” |
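
The Notes column flags Size, Installs, and Price as strings that need parsing before they can be used numerically. The following is a minimal sketch of one way to do that. The helper names (parse_size_mb, parse_installs, parse_price) are hypothetical, and the assumption that sizes carry only an "M" or "k" suffix is an illustration rather than the result of a full audit of the data.

```python
import re

def parse_size_mb(size):
    """Return the app size in megabytes, or None when it cannot be parsed."""
    match = re.fullmatch(r'(\d+(?:\.\d+)?)([Mk])', str(size).strip())
    if match is None:              # e.g. "Varies with device"
        return None
    value, unit = float(match.group(1)), match.group(2)
    # Assumption: "k" means kilobytes, converted here with 1024 k per M
    return value if unit == 'M' else value / 1024

def parse_installs(installs):
    """Return the lower bound of an install range such as "10,000+"."""
    digits = re.sub(r'[,+]', '', str(installs).strip())
    return int(digits) if digits.isdigit() else None

def parse_price(price):
    """Return the price in USD as a float; "0" and "$2.99" both parse."""
    try:
        return float(str(price).strip().lstrip('$'))
    except ValueError:             # non-numeric text from miscoded rows
        return None
```

Each helper works on a single value, so it could be applied column-wise, for example with apps['Size'].map(parse_size_mb). Returning None for unparseable entries keeps miscoded rows visible as missing values instead of raising an error.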

Data Visualization

Univariate Distributions

Categorical Variables

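Bar charts are the standard approach for visualizing categorical distributions. Figure 1 reveals data quality issues: some apps have a Type of “0”, and others have a missing value (NaN). Figure 2 shows the same split as proportions, and Figure 3 isolates the entries that are neither “Free” nor “Paid”.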

(ggplot(apps, aes(x='Type')) +
  geom_bar())
Figure 1
(apps
  .assign(category='All Apps')
  .pipe(lambda d: ggplot(d, aes(x='category', fill='Type')) +
    geom_bar(position='fill') +
    coord_flip() +
    labs(x='', y='Proportion') +
    theme_minimal()))
Figure 2
(apps
  .query('Type != "Free" & Type != "Paid"')
  .pipe(lambda d: ggplot(d, aes(x='Type')) + geom_bar()))
Figure 3
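
The same data quality issues can be confirmed numerically rather than visually. The sketch below uses a small made-up Series in place of apps['Type'] (the values are illustrative assumptions, not taken from the dataset) and tallies every level, including missing values.

```python
import pandas as pd

# Illustrative stand-in for apps['Type']; these values are made up
type_col = pd.Series(['Free', 'Paid', 'Free', '0', None, 'Free'], name='Type')

# dropna=False keeps missing values as their own row in the tally
counts = type_col.value_counts(dropna=False)

# Rows whose Type is neither "Free" nor "Paid"; missing values fail
# the isin() check, so they are kept here as well
invalid = type_col[~type_col.isin(['Free', 'Paid'])]

print(counts)
print(invalid)
```

On the real data, replacing type_col with apps['Type'] should list the same unexpected levels that appear as extra bars in the figures.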

Continuous Variables

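The Reviews variable exhibits strong right skew. Multiple visualization approaches reveal different aspects of the distribution. Note: ggplot2’s geom_histogram() defaults to 30 bins, whereas plotnine chooses a bin count automatically (using the Freedman-Diaconis rule) when none is given; set it explicitly with the bins or binwidth argument.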

(ggplot(apps, aes(x='Reviews')) +
  geom_histogram())
Figure 4
(apps
  .query('Reviews > 0')
  .pipe(lambda d: ggplot(d, aes(x='Reviews')) +
    geom_histogram() +
    scale_x_log10()))
Figure 5
(apps
  .query('Reviews > 0')
  .assign(category='Reviews')
  .pipe(lambda d: ggplot(d, aes(x='category', y='Reviews')) +
    geom_boxplot() +
    scale_y_log10() +
    coord_flip() +
    labs(x='') +
    theme_minimal()))
Figure 6
(apps
  .query('Reviews > 0')
  .assign(category='Reviews')
  .pipe(lambda d: ggplot(d, aes(x='category', y='Reviews')) +
    geom_boxplot() +
    scale_y_log10() +
    labs(x='') +
    theme_minimal()))
Figure 7
(apps
  .query('Reviews > 0')
  .pipe(lambda d: ggplot(d, aes(x='Reviews')) +
    geom_density() +
    scale_x_log10()))
Figure 8
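
The log-scaled figures filter to Reviews > 0 because log10(0) is undefined. The following sketch, on made-up review counts (an assumption for illustration, not values from the dataset), shows how a log10 transform narrows the gap between mean and median in a right-skewed variable.

```python
import math
import statistics

# Made-up, right-skewed review counts (not taken from the dataset)
reviews = [0, 3, 12, 40, 150, 900, 25_000, 1_200_000]

# log10 is undefined at zero, so drop non-positive values first
positive = [r for r in reviews if r > 0]
log_reviews = [math.log10(r) for r in positive]

raw_mean = statistics.mean(positive)
raw_median = statistics.median(positive)
log_mean = statistics.mean(log_reviews)
log_median = statistics.median(log_reviews)

# On the raw scale a few huge values drag the mean far above the median;
# on the log scale the two statistics are much closer together
print(f"raw:   mean={raw_mean:.0f}, median={raw_median}")
print(f"log10: mean={log_mean:.2f}, median={log_median:.2f}")
```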

Bivariate Relationships

Categorical vs Continuous

Grouped visualizations reveal how continuous variables differ across categories. Figure 9 compares rating distributions between free and paid apps.

(apps
  .query('Type in ["Free", "Paid"]')
  .pipe(lambda d: ggplot(d, aes(x='Type', y='Rating', fill='Type')) +
    geom_boxplot() +
    theme_minimal()))
Figure 9
(apps
  .query('Type in ["Free", "Paid"]')
  .pipe(lambda d: ggplot(d, aes(x='Type', y='Rating', fill='Type')) +
    geom_violin() +
    theme_minimal()))
Figure 10
(apps
  .query('Type in ["Free", "Paid"]')
  .pipe(lambda d: ggplot(d, aes(x='Rating', fill='Type')) +
    geom_histogram(bins=30) +
    facet_wrap('Type', ncol=1) +
    theme_minimal()))
Figure 11
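
The quantities a boxplot draws can also be tabulated directly. This minimal sketch uses a tiny made-up frame in place of apps (an assumption for illustration) and computes the per-group count, median, and quartiles of Rating.

```python
import pandas as pd

# Made-up sample standing in for the apps DataFrame
sample = pd.DataFrame({
    'Type':   ['Free', 'Free', 'Free', 'Paid', 'Paid', '0'],
    'Rating': [4.0, 4.5, None, 4.2, 4.8, 3.0],
})

summary = (sample
  .query('Type in ["Free", "Paid"]')     # same filter as the figures
  .groupby('Type')['Rating']
  .agg(n='count',                         # count ignores missing ratings
       median='median',
       q1=lambda s: s.quantile(0.25),
       q3=lambda s: s.quantile(0.75)))

print(summary)
```

A table like this complements the figures: the plots show the shape of each distribution, while the summary gives exact values for comparison.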

Continuous vs Continuous

Scatterplots visualize relationships between two continuous variables. Figure 12 explores whether highly reviewed apps have better ratings.

(apps
  .dropna(subset=['Rating'])
  .query('Reviews > 0')
  .pipe(lambda d: ggplot(d, aes(x='Reviews', y='Rating')) +
    geom_point(alpha=0.3) +
    scale_x_log10() +
    theme_minimal()))
Figure 12
(apps
  .dropna(subset=['Rating'])
  .query('Reviews > 0')
  .pipe(lambda d: ggplot(d, aes(x='Reviews', y='Rating')) +
    geom_point(alpha=0.2) +
    geom_smooth(method='lm', color='red') +
    scale_x_log10() +
    theme_minimal()))
Figure 13
(apps
  .dropna(subset=['Rating'])
  .query('Type in ["Free", "Paid"] and Reviews > 0')
  .pipe(lambda d: ggplot(d, aes(x='Reviews', y='Rating', color='Type')) +
    geom_point(alpha=0.4) +
    scale_x_log10() +
    theme_minimal()))
Figure 14
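
As in ggplot2, scale transformations are applied before statistics are computed, so the line in Figure 13 is a linear fit of Rating on log10(Reviews) rather than on raw review counts. The sketch below reproduces that kind of fit with NumPy on made-up points (an assumption for illustration, not values from the dataset).

```python
import numpy as np

# Made-up (Reviews, Rating) pairs; not taken from the dataset
reviews = np.array([10, 100, 1_000, 10_000, 100_000], dtype=float)
rating = np.array([3.9, 4.0, 4.1, 4.2, 4.3])

# Fit Rating = intercept + slope * log10(Reviews) by ordinary least squares
slope, intercept = np.polyfit(np.log10(reviews), rating, deg=1)

# The slope is the change in rating per tenfold increase in reviews
print(f"slope per decade: {slope:.3f}, intercept: {intercept:.3f}")
```

Reading the slope as "rating change per tenfold increase in reviews" is the main practical consequence of fitting on the log scale.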

Categorical vs Categorical

Stacked and grouped bar charts, along with count heatmaps, show relationships between categorical variables. Figure 15 examines how the split between free and paid apps varies across content ratings.

(apps
  .query('Type in ["Free", "Paid"]')
  .pipe(lambda d: ggplot(d, aes(x='Content Rating', fill='Type')) +
    geom_bar() +
    theme_minimal() +
    theme(axis_text_x=element_text(angle=45, hjust=1))))
Figure 15
(apps
  .query('Type in ["Free", "Paid"]')
  .pipe(lambda d: ggplot(d, aes(x='Content Rating', fill='Type')) +
    geom_bar(position='dodge') +
    theme_minimal() +
    theme(axis_text_x=element_text(angle=45, hjust=1))))
Figure 16
(apps
  .query('Type in ["Free", "Paid"]')
  .groupby(['Type', 'Content Rating'], as_index=False)
  .size()
  .rename(columns={'size': 'n'})
  .pipe(lambda d: ggplot(d, aes(x='Content Rating', y='Type', fill='n')) +
    geom_tile() +
    geom_text(aes(label='n'), color='white', size=4) +
    scale_fill_gradient(low='lightblue', high='darkblue') +
    labs(fill='Count') +
    theme_minimal() +
    theme(axis_text_x=element_text(angle=45, hjust=1))))
Figure 17
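
The counts behind the heatmap can be produced as a contingency table, and row-normalizing them makes categories of different sizes comparable. This is a sketch on a small made-up frame (an assumption for illustration) using pd.crosstab.

```python
import pandas as pd

# Made-up sample standing in for the filtered apps DataFrame
sample = pd.DataFrame({
    'Type': ['Free', 'Free', 'Paid', 'Free', 'Paid', 'Free'],
    'Content Rating': ['Everyone', 'Teen', 'Everyone',
                       'Everyone', 'Teen', 'Teen'],
})

# Raw counts: one cell per (Content Rating, Type) pair
counts = pd.crosstab(sample['Content Rating'], sample['Type'])

# Row-normalized shares: proportion of Free vs Paid within each rating
shares = pd.crosstab(sample['Content Rating'], sample['Type'],
                     normalize='index')

print(counts)
print(shares.round(2))
```

Raw counts are dominated by the largest categories, so the normalized shares are often the more useful view when comparing the free and paid mix across content ratings.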

Next Steps

  • Data Cleaning: Address missing values and miscoded entries in Type, Category, and other fields
  • Feature Engineering: Parse Size and Installs into numeric formats for quantitative analysis
  • Advanced Analysis: Examine temporal patterns using Last Updated, analyze price elasticity, investigate category-specific trends
  • Multivariate Analysis: Build models to predict ratings or success metrics based on app characteristics
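
As a starting point for the temporal analysis, Last Updated has to be converted from text to a date type. The sketch below does this on made-up strings in the "Month Day, Year" format (the values are illustrative assumptions, and the last one is deliberately invalid to stand in for a miscoded row).

```python
import pandas as pd

# Made-up examples; the last entry is deliberately not a date
last_updated = pd.Series(['January 7, 2018', 'August 1, 2018', '1.0.19'])

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(last_updated, format='%B %d, %Y', errors='coerce')

# Count updates per calendar year (NaT rows are dropped by value_counts)
per_year = parsed.dt.year.value_counts()

print(parsed)
print(per_year)
```

Coercing failures to NaT, then inspecting how many rows were affected, is one way to handle miscoded entries without silently discarding them.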