An Extremely Short Introduction to ggplot2

Introduction to ggplot2

The package ggplot2 implements the ideas introduced by Leland Wilkinson in the book The Grammar of Graphics. The idea is that every graph is built from the same basic components:

  • Data
  • Variable transformations (e.g., counting)
  • Scale transformations (e.g., a linear or a log scale, color scales)
  • Coordinate system (default is Cartesian coordinates)
  • Elements: graphs (e.g., points) and their aesthetic attributes (e.g., color)
  • Guides (e.g., axes and legends)

In ggplot2, the components are combined using the + operator.

ggplot(data, mapping = aes(x = ..., y = ..., color = ...)) + geom_point()

Here, each ... stands for a column name in the data.frame or tibble data. Each geom_X defines an element and uses a stat_Y function (variable transformation) to calculate what is visualized. For example, geom_bar uses stat_count to create a bar chart by counting how often each value appears in the data (see ?geom_bar). geom_point uses the stat "identity" to display the points at their coordinates as they are. Scales, the coordinate system, and guides are added automatically and can be changed by adding them as new components to the end of the call.
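
As a minimal sketch of how components are combined with +, the following builds a plot step by step. It assumes R's built-in mtcars data set, which is used here only for illustration. The particular scale, coordinate limits, and legend title are arbitrary choices, not recommendations.

```r
library(ggplot2)

# Base component: data and aesthetic mappings (mtcars ships with R)
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl)))

# Add an element (geom); geom_point uses the stat "identity"
p <- p + geom_point()

# Replace the automatically added scale, coordinate system, and legend title
# by adding them as further components at the end
p <- p +
  scale_y_log10() +
  coord_cartesian(xlim = c(1, 6)) +
  labs(color = "Cylinders")

p
```

Because every component is added with +, a partially built plot can be stored in a variable and extended later.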

RStudio’s Data Visualization Cheat Sheet offers a comprehensive overview of available components. A good introduction can be found in the Chapter on Data Visualization of the free book R for Data Science.

Plots

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data(iris)
iris <- as_tibble(iris)
iris
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

Scatterplot

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) + geom_point()

Color by species

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point()

Scatterplot with Regression Lines

For small datasets (fewer than 1,000 observations), geom_smooth() defaults to LOESS (locally estimated scatterplot smoothing).

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) + geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Linear Model

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) + geom_point() + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Linear Model by species

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point() + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Histogram

ggplot(iris, aes(Petal.Width)) + geom_histogram(bins = 20)

Color by species

ggplot(iris, aes(Petal.Width, fill = Species)) + geom_histogram(bins = 20)

Density instead of counts

ggplot(iris, aes(Petal.Width, fill = Species)) + geom_density(alpha = .8)

Barplot

Barplots count! Each geom_ has an associated stat_; geom_bar uses stat_count (see ?geom_bar).

ggplot(iris, aes(Species)) + geom_bar()
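
To make the counting step visible, here is a small sketch that computes the counts by hand and compares them with what geom_bar calculates. The variable names species_counts, p_col, and p_bar are made up for this example. geom_col is used for the precomputed counts because it draws values as they are.

```r
library(ggplot2)

data(iris)

# Count the rows per species by hand; this mirrors what stat_count computes
species_counts <- as.data.frame(table(Species = iris$Species))
species_counts

# geom_col uses the stat "identity", so it draws the precomputed counts as they are
p_col <- ggplot(species_counts, aes(x = Species, y = Freq)) + geom_col()

# geom_bar computes the counts internally from the raw data
p_bar <- ggplot(iris, aes(Species)) + geom_bar()

# layer_data() exposes the values computed by the stat for the first layer
layer_data(p_bar)$count
```

Both plots should show the same bar heights. Use geom_bar for raw data and geom_col when the heights are already in the data.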

Boxplot

Boxplots compare the distribution of Sepal.Length across Species.

ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()

Facets

Split a plot into panels by a discrete variable.

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) + geom_point() + 
  facet_grid(cols = vars(Species)) 
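
As a hedged variation on the call above, the panels can also be arranged in rows, or laid out with facet_wrap. The names p_rows and p_wrap and the choice of ncol = 2 are illustrative only.

```r
library(ggplot2)

# One panel per species, stacked in rows instead of columns
p_rows <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) +
  geom_point() +
  facet_grid(rows = vars(Species))

# facet_wrap arranges the panels in a wrapped sequence; ncol sets the number of columns
p_wrap <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) +
  geom_point() +
  facet_wrap(vars(Species), ncol = 2)

p_wrap
```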

Colors and Themes

Everything that changes with the data needs to go inside aes(), here in the call to ggplot(). For example, do not set color as a fixed argument in geom_point() unless you want all points to have the same color.

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point()
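
The difference between a mapped and a fixed color can be sketched as follows. The color "steelblue" and the names p_mapped and p_fixed are arbitrary choices for this example.

```r
library(ggplot2)

# Mapped: color varies with the data, so it goes inside aes() and gets a legend
p_mapped <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) +
  geom_point()

# Fixed: a constant color is set outside aes(), so all points look the same
p_fixed <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) +
  geom_point(color = "steelblue")

# Number of distinct colors actually drawn in each plot
length(unique(layer_data(p_mapped)$colour))
length(unique(layer_data(p_fixed)$colour))
```

The first plot should use one color per species and show a legend; the second should use a single color and no legend.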

Use a different color scheme

library(viridis)
## Loading required package: viridisLite
ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point() +
  scale_color_viridis(discrete = TRUE)

You need to distinguish between discrete scales (for factors) and continuous scales (for numeric variables). Scales can be applied to both color and fill (i.e., scale_color_* and scale_fill_*).
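
As a rough illustration of these distinctions, the sketch below uses the viridis scales that ship with ggplot2 itself (scale_color_viridis_d, scale_color_viridis_c, and scale_fill_viridis_d) rather than the viridis package loaded above, so that it runs with ggplot2 alone. Mapping color to Petal.Length is an assumption made only to get a numeric variable.

```r
library(ggplot2)

# Discrete scale: Species is a factor
p_discrete <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) +
  geom_point() +
  scale_color_viridis_d()

# Continuous scale: Petal.Length is numeric
p_continuous <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Petal.Length)) +
  geom_point() +
  scale_color_viridis_c()

# Fill scale: histogram bars are colored through fill, not color
p_fill <- ggplot(iris, aes(Petal.Width, fill = Species)) +
  geom_histogram(bins = 20) +
  scale_fill_viridis_d()

p_continuous
```

Applying a discrete scale to a numeric variable, or a continuous scale to a factor, results in an error, which is a common source of confusion.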

Themes let you change the look of your plots.

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point() +
  theme_minimal()

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point() +
  theme_linedraw() + scale_color_grey()