Introduction to ggplot2

The package ggplot2 implements the ideas introduced by Leland Wilkinson in the book The Grammar of Graphics. The idea is that every graph is built from the same basic components: a data set, a coordinate system, and geoms, the visual marks that represent the data points.

In ggplot2, the components are combined using the + operator.

ggplot(data, mapping = aes(x = ..., y = ..., color = ...)) + geom_point()

The ... placeholders stand for column names in the data.frame or tibble data. Each geom_X defines a visual element and uses a stat_Y function (a variable transformation) to calculate what is visualized. For example, geom_bar uses stat_count to create a bar chart by counting how often each value appears in the data (see ?geom_bar). geom_point uses the stat "identity" to display the points at their coordinates as they are. Scales, the coordinate system, and guides are added automatically and can be changed by adding them as new components to the end of the call.
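
As a minimal sketch of how a geom and its stat work together, the following uses a small made-up data frame (df and its column grade are hypothetical and not part of this tutorial). It shows that geom_bar() counts rows by default and that further components can be appended with +.

```r
library(ggplot2)

# Hypothetical toy data with one discrete column (for illustration only).
df <- data.frame(grade = c("a", "b", "a", "c", "a", "b"))

# geom_bar() uses stat_count() by default, so the bar heights are row counts.
p_count <- ggplot(df, aes(x = grade)) + geom_bar()

# Naming the stat explicitly produces the same plot.
p_explicit <- ggplot(df, aes(x = grade)) + geom_bar(stat = "count")

# Scales, the coordinate system, and labels are added as further components.
p_custom <- p_count +
  scale_y_continuous(breaks = 0:3) +
  coord_flip() +
  labs(x = "Grade", y = "Number of rows")

# layer_data() returns the values the stat computed for a layer.
layer_data(p_count)$count  # counts per grade: 3 2 1
```

Printing p_count, p_explicit, or p_custom draws the corresponding plot.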

RStudio’s Data Visualization Cheat Sheet offers a comprehensive overview of available components. A good introduction can be found in the Chapter on Data Visualization of the free book R for Data Science.

Plots

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data(iris)
iris <- as_tibble(iris)
iris
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

Scatterplot

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) + geom_point()

Color by species

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point()

Histogram

ggplot(iris, aes(Petal.Width)) + geom_histogram(bins = 20)

Color by species

ggplot(iris, aes(Petal.Width, fill = Species)) + geom_histogram(bins = 20)

Density instead of counts

ggplot(iris, aes(Petal.Width, fill = Species)) + geom_density(alpha = .8)

Barplot

Barplots count! Each geom_ has an associated stat_. geom_bar uses stat_count (see ?geom_bar).

ggplot(iris, aes(Species)) + geom_bar()
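
If the counts are already available, the counting step can be skipped. The sketch below assumes the counts are computed beforehand with dplyr's count(); the name species_counts is chosen here only for illustration. geom_col() uses the stat "identity" and plots the values as they are.

```r
library(ggplot2)
library(dplyr)

data(iris)

# Count the rows per species ourselves; count() stores the result in column n.
species_counts <- iris %>% count(Species)

# geom_col() does not count: the bar heights are taken directly from n.
p_col <- ggplot(species_counts, aes(x = Species, y = n)) + geom_col()
p_col
```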

Boxplot

To compare the distributions of the four measurements side by side, we have to convert the data into long format (only one value per row).

iris_long <- iris %>% mutate(id = row_number()) %>% pivot_longer(1:4)

iris_long
## # A tibble: 600 x 4
##    Species    id name         value
##    <fct>   <int> <chr>        <dbl>
##  1 setosa      1 Sepal.Length   5.1
##  2 setosa      1 Sepal.Width    3.5
##  3 setosa      1 Petal.Length   1.4
##  4 setosa      1 Petal.Width    0.2
##  5 setosa      2 Sepal.Length   4.9
##  6 setosa      2 Sepal.Width    3  
##  7 setosa      2 Petal.Length   1.4
##  8 setosa      2 Petal.Width    0.2
##  9 setosa      3 Sepal.Length   4.7
## 10 setosa      3 Sepal.Width    3.2
## # … with 590 more rows
ggplot(iris_long, aes(name, value)) + geom_boxplot()
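
The long format also makes it possible to split each measurement by species. This is a sketch under the assumption that fill = Species is the desired grouping; it rebuilds iris_long so that the block runs on its own and loads the individual packages instead of the whole tidyverse.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

data(iris)

# Same long-format conversion as above: one measured value per row.
iris_long <- as_tibble(iris) %>%
  mutate(id = row_number()) %>%
  pivot_longer(1:4)

# One group of boxes per measurement and one box per species within each group.
p_box <- ggplot(iris_long, aes(x = name, y = value, fill = Species)) +
  geom_boxplot()
p_box
```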

Colors and Themes

Everything that changes with the data needs to go inside aes(), typically in the ggplot() call. For example, do not set color as a fixed argument of geom_point() unless you want all points to have the same color.

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point()
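
For contrast, the following sketch puts a mapped and a fixed color side by side. The color "steelblue" is an arbitrary choice made for illustration.

```r
library(ggplot2)

data(iris)

# Mapped: color is inside aes(), so it varies with Species and gets a legend.
p_mapped <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) +
  geom_point()

# Fixed: color is an argument of geom_point() outside aes(), so every point
# is drawn in the same color and no legend is created.
p_fixed <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) +
  geom_point(color = "steelblue")
```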

Use a different color scheme

library(viridis)
## Loading required package: viridisLite
ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point() +
  scale_color_viridis(discrete = TRUE)

You need to distinguish between discrete scales (for factors) and continuous scales (for numeric variables). You can apply scales to color and fill (i.e., scale_color_* and scale_fill_*).
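
The difference can be sketched with the viridis scales that ship with ggplot2 itself (version 3.0.0 or later is assumed), where the suffix _d marks a discrete scale and _c a continuous one. Using these instead of the viridis package is a substitution made for this example.

```r
library(ggplot2)

data(iris)

# Species is a factor, so the color scale must be discrete (suffix _d).
p_discrete <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) +
  geom_point() +
  scale_color_viridis_d()

# Petal.Length is numeric, so the color scale must be continuous (suffix _c).
p_continuous <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Petal.Length)) +
  geom_point() +
  scale_color_viridis_c()

# The fill aesthetic has its own scales, used here for the histogram bars.
p_fill <- ggplot(iris, aes(Petal.Width, fill = Species)) +
  geom_histogram(bins = 20) +
  scale_fill_viridis_d()
```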

Themes let you change the look of your plots.

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point() +
  theme_minimal()

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point() +
  theme_linedraw() + scale_color_grey()

Facets

Split the plot into one panel per level of a discrete variable

ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) + geom_point() + 
  facet_grid(cols = vars(Species))
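
Two variations on the same idea, given as a sketch: facet_grid() can also arrange the panels in rows, and facet_wrap() lays them out in a wrapped sequence. The value ncol = 2 is chosen only for illustration.

```r
library(ggplot2)

data(iris)

p <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) + geom_point()

# One panel per species, stacked in rows instead of columns.
p_rows <- p + facet_grid(rows = vars(Species))

# facet_wrap() fills the panels row by row, here with two columns.
p_wrap <- p + facet_wrap(vars(Species), ncol = 2)
```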