An Extremely Short Introduction to ggplot
Introduction to ggplot2
The package ggplot2 implements the ideas introduced by Leland Wilkinson in the book The Grammar of Graphics. The idea is that every graph is built from the same basic components:
- Data
- Variable transformations (e.g., counting)
- Scale transformations (e.g., a linear or a log scale, color scales)
- Coordinate system (default is Cartesian coordinates)
- Elements: graphs (e.g., points) and their aesthetic attributes (e.g., color)
- Guides (e.g., axes and legends)
In ggplot2, the components are combined using the + operator.
ggplot(data, mapping = aes(x = ..., y = ..., color = ...)) +
geom_point()
Here, ... stands for column names in the data.frame or tibble data. Each geom_X defines an element and uses a stat_Y function (variable transformation) to calculate what is visualized. For example, geom_bar uses stat_count to create a bar chart by counting how often each value appears in the data (see ? geom_bar). geom_point just uses the stat "identity" to display the points using the coordinates as they are. Scales, the coordinate system, and guides are added automatically and can be changed by adding them as a new component to the end of the call.
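As a minimal sketch of how components are combined and overridden, the following builds a scatterplot from the built-in iris data and then appends a scale, a coordinate system, and a guide. The object names p and p_custom and the legend title are illustrative choices, not part of the original text.

```r
library(ggplot2)

# Data, aesthetic mapping, and one element (points) combined with +
p <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) +
  geom_point()

# Override the automatically added scale, coordinate system, and guide
# by appending them as further components
p_custom <- p +
  scale_y_log10() +
  coord_flip() +
  guides(color = guide_legend(title = "Iris species"))

p_custom
```

Because each component is just added to the call, p can be reused as a base and extended in different ways.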
RStudio’s Data Visualization Cheat Sheet offers a comprehensive overview of available components. A good introduction can be found in the Chapter on Data Visualization of the free book R for Data Science.
Plots
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
data(iris)
iris <- as_tibble(iris)
iris
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with 140 more rows
Scatterplot
ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) + geom_point()
Color by species
ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point()
Scatterplot with Regression Lines
By default, geom_smooth() uses LOESS (locally estimated scatterplot smoothing) for small datasets (fewer than 1,000 observations).
ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) + geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Linear Model
ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) + geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Linear Model by species
ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
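Because color = Species is mapped in ggplot(), geom_smooth() fits one line per species. As a rough cross-check, the same kind of per-species fit can be computed outside ggplot2 with lm(). The name fit_by_species is a hypothetical helper variable introduced here for illustration.

```r
# Fit Sepal.Width ~ Petal.Width separately for each species,
# mirroring the per-group fits drawn by geom_smooth(method = "lm")
fit_by_species <- lapply(
  split(iris, iris$Species),
  function(d) coef(lm(Sepal.Width ~ Petal.Width, data = d))
)

# One row per species with the intercept and the slope
do.call(rbind, fit_by_species)
```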
Histogram
ggplot(iris, aes(Petal.Width)) + geom_histogram(bins = 20)
Color by species
ggplot(iris, aes(Petal.Width, fill = Species)) + geom_histogram(bins = 20)
Density instead of counts
ggplot(iris, aes(Petal.Width, fill = Species)) + geom_density(alpha = .8)
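When fill is mapped to a variable, geom_histogram() stacks the bars of the groups by default. The sketch below contrasts this with overlaid, semi-transparent histograms. The alpha value and the object names are arbitrary choices for illustration.

```r
library(ggplot2)

# Default: bars of the three species are stacked on top of each other
p_stacked <- ggplot(iris, aes(Petal.Width, fill = Species)) +
  geom_histogram(bins = 20)

# Overlaid: every bar starts at zero; transparency keeps overlaps visible
p_overlay <- ggplot(iris, aes(Petal.Width, fill = Species)) +
  geom_histogram(bins = 20, position = "identity", alpha = 0.5)

p_overlay
```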
Barplot
Barplots count! Each geom_* has an associated stat_*. geom_bar uses stat_count (see ? geom_bar).
ggplot(iris, aes(Species)) + geom_bar()
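If the counts are already available, the counting step can be skipped. The following sketch counts with base R first and then draws the values as they are using geom_col(), which uses the stat "identity". The name species_counts is an illustrative choice.

```r
library(ggplot2)

# Count first: one row per species with its frequency in column Freq
species_counts <- as.data.frame(table(Species = iris$Species))

# geom_col() does not count, so the bar height y must be supplied explicitly
p_col <- ggplot(species_counts, aes(x = Species, y = Freq)) + geom_col()

p_col
```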
Boxplot
A boxplot compares the distribution of Sepal.Length by Species.
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()
Facets
Facets split a plot into panels by a discrete variable.
ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) + geom_point() +
facet_grid(cols = vars(Species))
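Two common variations, sketched here with illustrative object names: placing the panels in rows instead of columns, and using facet_wrap() to wrap the panels into a grid with free axis scales.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) + geom_point()

# One panel per species, arranged as rows
p_rows <- p + facet_grid(rows = vars(Species))

# Panels wrapped into two columns; each panel gets its own axis ranges
p_wrap <- p + facet_wrap(vars(Species), ncol = 2, scales = "free")

p_wrap
```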
Colors and Themes
Everything that changes with the data needs to go in the aes() in ggplot(). For example, do not put color into geom_point() unless you want all points to have the same color.
ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point()
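The difference between a mapped and a fixed color can be sketched as follows. The color "steelblue" and the object names are arbitrary choices for illustration.

```r
library(ggplot2)

# Mapped: color varies with the data, so it goes inside aes()
p_mapped <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) +
  geom_point()

# Fixed: a constant set outside aes() applies to all points
p_fixed <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width)) +
  geom_point(color = "steelblue")

p_fixed
```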
Use a different color scheme
library(viridis)
## Loading required package: viridisLite
ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point() +
scale_color_viridis(discrete = TRUE)
You need to distinguish between discrete (for factors) and continuous scales. You can apply scales to color and fill (i.e., scale_color_* and scale_fill_*).
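As a sketch of the discrete/continuous distinction, the following uses the viridis scales that ship with ggplot2 itself (scale_color_viridis_d(), scale_color_viridis_c(), and scale_fill_viridis_d()) rather than the viridis package loaded above. Choosing Petal.Length as the continuous variable is an assumption made for illustration.

```r
library(ggplot2)

# Discrete color scale: Species is a factor
p_discrete <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) +
  geom_point() +
  scale_color_viridis_d()

# Continuous color scale: Petal.Length is numeric
p_continuous <- ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Petal.Length)) +
  geom_point() +
  scale_color_viridis_c()

# Fill scale: histograms use the fill aesthetic, so scale_fill_* applies
p_fill <- ggplot(iris, aes(Petal.Width, fill = Species)) +
  geom_histogram(bins = 20) +
  scale_fill_viridis_d()
```

Using a discrete scale on a numeric variable (or the reverse) results in an error, which is why the scale has to match the type of the mapped column.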
Themes let you change the look of your plots.
ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point() +
theme_minimal()
ggplot(iris, aes(x = Petal.Width, y = Sepal.Width, color = Species)) + geom_point() +
theme_linedraw() + scale_color_grey()