An Extremely Short Introduction to Tidyverse

Introduction

Tidyverse is a collection of packages that makes working with data simpler and more consistent.

A good overview is given in the RStudio Data Transformation Cheatsheet, and an introduction can be found in the section on data wrangling of the free book R for Data Science.

To use tidyverse, you first have to install it with install.packages("tidyverse"). Then load it:

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Tibbles

A tibble is a replacement for a data.frame that is simpler and often faster.

The iris data comes as a data.frame.

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Convert the data into a tibble.

iris <- as_tibble(iris)
iris
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

Tibbles can be used (almost) exactly like data.frames.
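One place where the "almost" matters is subsetting. The following is a minimal sketch using a small made-up table (the names df and tb and their contents are illustrative assumptions, not part of the example data). It shows that selecting a single column with [ turns a data.frame into a plain vector, while a tibble stays a tibble:

```r
library(tibble)

# Small made-up table, used only for illustration.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Single-column subsetting with [ , ]:
class(df[, "x"])      # a plain integer vector for a data.frame
is_tibble(tb[, "x"])  # still a tibble

# To extract a column of a tibble as a vector, use [[ ]] or $.
tb[["x"]]
tb$x
```

Code written for data.frames that relies on the automatic drop to a vector may therefore need [[ ]] or $ when it is given a tibble.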

Pipes %>%

Pipes let you compose a sequence of function calls in a more readable way. The following two forms do the same thing.

Standard functional form in R using nested functions:

print(head(iris))
## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa

Using pipes makes this more readable as a sequence of operations:

iris %>% head() %>% print()
## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa

The pipe supplies the result of the previous function as the first argument to the next function. More information on pipes can be found in the package magrittr.
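The first-argument rule can be checked on a small numeric example. This is only a sketch: the input vector and the variable names nested and piped are made up for illustration. Additional arguments, such as the number of digits for round(), are written as usual after the piped-in first argument.

```r
library(magrittr)

# Nested form: read from the inside out.
nested <- round(sqrt(sum(c(1, 4, 9))), 1)

# Piped form: read from left to right. Each result becomes the
# first argument of the next call, so round(1) means round(<result>, 1).
piped <- c(1, 4, 9) %>% sum() %>% sqrt() %>% round(1)

identical(nested, piped)
```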

Data Import with readr

Use read_csv() (note the _). It guesses data types and is faster and more versatile than the base R function read.csv().

mlb <- read_csv("https://michael.hahsler.net/SMU/EMIS3309/R/examples/MLB_cleaned.csv",
                col_types = "ccffddd")
mlb
## # A tibble: 1,034 × 7
##    `First Name` `Last Name` Team  Position     `Height(inches)` `Weight(pounds)`
##    <chr>        <chr>       <fct> <fct>                   <dbl>            <dbl>
##  1 Jeff         Mathis      ANA   Catcher                    72              180
##  2 Mike         Napoli      ANA   Catcher                    72              205
##  3 Jose         Molina      ANA   Catcher                    74              220
##  4 Howie        Kendrick    ANA   First Basem…               70              180
##  5 Kendry       Morales     ANA   First Basem…               73              220
##  6 Casey        Kotchman    ANA   First Basem…               75              210
##  7 Robb         Quinlan     ANA   First Basem…               73              200
##  8 Shea         Hillenbrand ANA   First Basem…               73              211
##  9 Terry        Evans       ANA   Outfielder                 75              200
## 10 Reggie       Willits     ANA   Outfielder                 71              185
## # ℹ 1,024 more rows
## # ℹ 1 more variable: Age <dbl>

col_types can be used to specify the data type for each column (c for character, f for factor, d for double).
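To see the compact col_types string at work without downloading a file, the sketch below reads a tiny inline CSV. The column names and values are made up for illustration, and wrapping the text in I() to mark it as literal data assumes readr 2.0 or later.

```r
library(readr)

# Made-up inline data; I() tells readr this is literal CSV text, not a path.
csv_text <- I("name,team,height\nJeff,ANA,72\nMike,ANA,72\nTerry,BOS,75\n")

# One letter per column: c = character, f = factor, d = double.
players <- read_csv(csv_text, col_types = "cfd")

players
sapply(players, class)
```

If col_types is omitted, read_csv() guesses the types and would read the team column as character rather than factor.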

Data Transformation with dplyr

dplyr uses pipes to apply a series of functions to data. The functions are similar to SQL’s SELECT ... FROM ... WHERE ... GROUP BY ... ORDER BY ... syntax:

  • Start with a table (FROM in SQL).
  • Pick variables by their names with select() (SELECT in SQL).
  • Pick observations by their values using filter() (WHERE in SQL).
  • Reorder the rows using arrange() (ORDER BY in SQL).
  • Calculate group summaries using group_by() and summarize() (GROUP BY and aggregation functions in the SELECT clause of SQL).
  • Create new variables with functions of existing variables with mutate() (a new column in SELECT in SQL).

Variable names (columns) from the data can be directly used in the functions.

Examples:

iris %>% select(starts_with("Sepal"))
## # A tibble: 150 × 2
##    Sepal.Length Sepal.Width
##           <dbl>       <dbl>
##  1          5.1         3.5
##  2          4.9         3  
##  3          4.7         3.2
##  4          4.6         3.1
##  5          5           3.6
##  6          5.4         3.9
##  7          4.6         3.4
##  8          5           3.4
##  9          4.4         2.9
## 10          4.9         3.1
## # ℹ 140 more rows

iris %>% filter(Species == "setosa")
## # A tibble: 50 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 40 more rows

iris %>% arrange(desc(Sepal.Length))
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>    
##  1          7.9         3.8          6.4         2   virginica
##  2          7.7         3.8          6.7         2.2 virginica
##  3          7.7         2.6          6.9         2.3 virginica
##  4          7.7         2.8          6.7         2   virginica
##  5          7.7         3            6.1         2.3 virginica
##  6          7.6         3            6.6         2.1 virginica
##  7          7.4         2.8          6.1         1.9 virginica
##  8          7.3         2.9          6.3         1.8 virginica
##  9          7.2         3.6          6.1         2.5 virginica
## 10          7.2         3.2          6           1.8 virginica
## # ℹ 140 more rows

iris %>% group_by(Species) %>% summarize_all(mean)
## # A tibble: 3 × 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33 
## 3 virginica          6.59        2.97         5.55       2.03

iris %>% mutate(Sepal.Length_scaled = drop(scale(Sepal.Length)))
## # A tibble: 150 × 6
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_scaled
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>                 <dbl>
##  1          5.1         3.5          1.4         0.2 setosa               -0.898
##  2          4.9         3            1.4         0.2 setosa               -1.14 
##  3          4.7         3.2          1.3         0.2 setosa               -1.38 
##  4          4.6         3.1          1.5         0.2 setosa               -1.50 
##  5          5           3.6          1.4         0.2 setosa               -1.02 
##  6          5.4         3.9          1.7         0.4 setosa               -0.535
##  7          4.6         3.4          1.4         0.3 setosa               -1.50 
##  8          5           3.4          1.5         0.2 setosa               -1.02 
##  9          4.4         2.9          1.4         0.2 setosa               -1.74 
## 10          4.9         3.1          1.5         0.1 setosa               -1.14 
## # ℹ 140 more rows
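Group summaries can also be written with summarize() directly. The sketch below is one possible variant, not the only way to do it: the names iris_tbl, by_species, all_means, n, and mean_sepal_length are made up for illustration, and across() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

iris_tbl <- as_tibble(iris)

# Named summaries: one row per species with a count and one mean.
by_species <- iris_tbl %>%
  group_by(Species) %>%
  summarize(n = n(), mean_sepal_length = mean(Sepal.Length))

# Apply mean() to every non-grouping column with across().
all_means <- iris_tbl %>%
  group_by(Species) %>%
  summarize(across(everything(), mean))

by_species
all_means
```

The across() form should give the same table as summarize_all(mean), which recent dplyr versions mark as superseded.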

Complete example:

iris %>% filter(Species == "setosa") %>% 
  select(starts_with("Sepal")) %>% 
  arrange(desc(Sepal.Length)) %>% 
  head(3)
## # A tibble: 3 × 2
##   Sepal.Length Sepal.Width
##          <dbl>       <dbl>
## 1          5.8         4  
## 2          5.7         4.4
## 3          5.7         3.8

dplyr also provides join functions (e.g., inner_join()) to merge multiple tables.
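As a minimal sketch of a join, the two small tables below are made up for illustration (the table names, the color column, and all values are assumptions). inner_join() keeps only the rows whose key appears in both tables.

```r
library(dplyr)
library(tibble)

# Made-up measurements, one row per species.
measurements <- tibble(
  Species      = c("setosa", "versicolor", "virginica"),
  Sepal.Length = c(5.0, 5.9, 6.6)
)

# Made-up lookup table that covers only two of the three species.
species_info <- tibble(
  Species = c("setosa", "versicolor"),
  color   = c("blue", "purple")
)

# Match rows on the shared key column Species.
joined <- inner_join(measurements, species_info, by = "Species")
joined
```

Because virginica has no entry in species_info, it is dropped from the result. left_join() would keep it and fill color with NA.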

Tidy data with tidyr

Data we get is typically organized in “wide format,” where each observation has its own row and each variable has its own column.

The iris dataset is in the wide format where each row is a flower and each column contains the values of a measurement.

head(iris)
## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa

An alternative format is the “long format.” In this format, each value has its own row, identified by the observation it belongs to and the name of the variable.

To convert the data into a long format we can use pivot_longer(). I first add a column with the row ID to identify a flower.

iris_long <- iris %>% 
  rowid_to_column("ID") %>% 
  pivot_longer(cols = c("Sepal.Length", "Sepal.Width", 
                        "Petal.Length", "Petal.Width")) 

iris_long
## # A tibble: 600 × 4
##       ID Species name         value
##    <int> <fct>   <chr>        <dbl>
##  1     1 setosa  Sepal.Length   5.1
##  2     1 setosa  Sepal.Width    3.5
##  3     1 setosa  Petal.Length   1.4
##  4     1 setosa  Petal.Width    0.2
##  5     2 setosa  Sepal.Length   4.9
##  6     2 setosa  Sepal.Width    3  
##  7     2 setosa  Petal.Length   1.4
##  8     2 setosa  Petal.Width    0.2
##  9     3 setosa  Sepal.Length   4.7
## 10     3 setosa  Sepal.Width    3.2
## # ℹ 590 more rows

Note that each flower now has multiple rows, one for each variable.

An application is to use a long format with ggplot2 facets to produce a histogram for each variable organized by species.

ggplot(iris_long, mapping = aes(value)) + 
  geom_histogram() + 
  facet_grid(cols = vars(name), rows = vars(Species))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Long format can be converted to a wide format using pivot_wider().
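The round trip can be sketched as follows. The block rebuilds the long table so that it runs on its own. The name iris_wide is made up for illustration, and names_from and values_from are spelled out even though name and value are the defaults.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Rebuild the long table: one row per flower and measurement.
iris_long <- as_tibble(iris) %>%
  rowid_to_column("ID") %>%
  pivot_longer(cols = c("Sepal.Length", "Sepal.Width",
                        "Petal.Length", "Petal.Width"))

# Spread the name/value pairs back into one column per measurement.
iris_wide <- iris_long %>%
  pivot_wider(names_from = name, values_from = value)

iris_wide
```

The ID column is what lets pivot_wider() tell the flowers apart. Without it, several flowers of the same species would collapse into one row.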

Strings, Dates, Factors, etc.

Tidyverse provides many packages to work with specific types of data:

  • tidyr: Data cleaning and shaping.
  • forcats: Tools for working with categorical variables (factors).
  • lubridate: Make dealing with dates a little easier.
  • stringr: A cohesive set of functions designed to make working with strings as easy as possible.
  • purrr: Apply functions with tidyverse (functional programming).
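To give a flavor of these packages, here is a minimal sketch with one or two calls from each. All input values and variable names are made up for illustration.

```r
library(stringr)
library(forcats)
library(lubridate)
library(purrr)

# stringr: change case and test for a pattern.
upper <- str_to_upper("setosa")
is_sepal <- str_detect(c("Sepal.Length", "Petal.Width"), "^Sepal")

# forcats: reorder factor levels by how often they occur.
f <- factor(c("b", "a", "b", "c", "b"))
f_by_freq <- fct_infreq(f)
levels(f_by_freq)

# lubridate: parse a date, extract a component, and do date arithmetic.
d <- ymd("2023-04-01")
d_year <- year(d)
d_later <- d + days(30)

# purrr: apply a function to each list element and return a numeric vector.
means <- map_dbl(list(1:3, 4:6), mean)
```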

The best starting points are the RStudio Cheatsheets.