An Extremely Short Introduction to Tidyverse
Introduction
Tidyverse is a collection of packages that makes working with data simpler and more consistent.
A good overview is given in the RStudio Data Transformation Cheatsheet, and an introduction can be found in the section on data wrangling in the free book R for Data Science.
To use tidyverse, you have to install it first:
install.packages("tidyverse")
Then load tidyverse:
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Tibbles
A tibble is a replacement for a data.frame that is simpler and often faster. The iris data comes as a data.frame.
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Convert the data into a tibble.
iris <- as_tibble(iris)
iris
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
Tibbles can be used (almost) exactly like data.frames.
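The "almost" matters in a few places. The following is a small sketch of one commonly encountered difference, using a made-up two-column table (the names `df` and `tb` are only for illustration): selecting a single column with `[` returns a plain vector for a data.frame, but stays a tibble for a tibble.

```r
library(tibble)

# Hypothetical toy data, not part of the tutorial's datasets.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# A data.frame drops to a vector when one column is selected ...
df_col <- df[, "x"]
# ... while a tibble always returns another tibble.
tb_col <- tb[, "x"]

class(df_col)
class(tb_col)
```

Use `tb$x` or `tb[["x"]]` when a plain vector is needed.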
Pipes %>%
Pipes let you compose a sequence of function calls in a more readable way. The following two forms do the same thing.
Standard functional form in R using nested functions:
print(head(iris))
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Using pipes makes this more readable as a sequence of operations:
iris %>% head() %>% print()
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
The pipe supplies the result of the previous function as the first argument to the next function. More information on pipes can be found in the package magrittr.
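To make the first-argument rule concrete, here is a minimal sketch with a made-up numeric vector `x`. Additional arguments given in the piped call (here the `1` passed to `round()`) are placed after the piped value.

```r
library(magrittr)

# Hypothetical example values.
x <- c(4, 9, 16.5)

# Nested form: read from the inside out.
nested <- round(sqrt(x), 1)

# Piped form: x becomes the first argument of sqrt(), and the result
# becomes the first argument of round(); 1 is the second argument (digits).
piped <- x %>% sqrt() %>% round(1)

identical(nested, piped)
```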
Data Import with readr
Use read_csv() (note the _). It will guess data types and is faster and more versatile compared to the base R function read.csv().
mlb <- read_csv("https://michael.hahsler.net/SMU/EMIS3309/R/examples/MLB_cleaned.csv",
  col_types = "ccffddd")
mlb
## # A tibble: 1,034 × 7
## `First Name` `Last Name` Team Position `Height(inches)` `Weight(pounds)`
## <chr> <chr> <fct> <fct> <dbl> <dbl>
## 1 Jeff Mathis ANA Catcher 72 180
## 2 Mike Napoli ANA Catcher 72 205
## 3 Jose Molina ANA Catcher 74 220
## 4 Howie Kendrick ANA First Basem… 70 180
## 5 Kendry Morales ANA First Basem… 73 220
## 6 Casey Kotchman ANA First Basem… 75 210
## 7 Robb Quinlan ANA First Basem… 73 200
## 8 Shea Hillenbrand ANA First Basem… 73 211
## 9 Terry Evans ANA Outfielder 75 200
## 10 Reggie Willits ANA Outfielder 71 185
## # ℹ 1,024 more rows
## # ℹ 1 more variable: Age <dbl>
col_types can be used to specify the data type for each column (c for character, f for factor, d for double).
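The compact col_types string can be tried without a network connection. The sketch below assumes readr 2.0 or later, where literal CSV text can be passed by wrapping it in I(); the three-column data and the name `players` are made up for illustration.

```r
library(readr)

# Hypothetical inline CSV data; I() marks the string as literal data
# rather than a file path (readr >= 2.0).
csv_text <- I("name,team,height\nAnn,ANA,72\nBob,BOS,70\n")

# One letter per column: c = character, f = factor, d = double.
players <- read_csv(csv_text, col_types = "cfd")

players
```

The string must contain exactly one letter per column, in column order.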
Data Transformation with dplyr
dplyr uses pipes to apply a series of functions to data. Functions are similar to SQL’s SELECT ... FROM ... WHERE ... GROUP BY ... ORDER BY ... syntax:
- Start with a table (FROM in SQL).
- Pick variables by their names with select() (SELECT in SQL).
- Pick observations by their values using filter() (WHERE in SQL).
- Reorder the rows using arrange() (ORDER BY in SQL).
- Calculate group summaries using group_by() and summarize() (GROUP BY and aggregation function in SELECT of SQL).
- Create new variables with functions of existing variables with mutate() (a new column in SELECT in SQL).
Variable names (columns) from the data can be directly used in the functions.
Examples:
iris %>% select(starts_with("Sepal"))
## # A tibble: 150 × 2
## Sepal.Length Sepal.Width
## <dbl> <dbl>
## 1 5.1 3.5
## 2 4.9 3
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5 3.6
## 6 5.4 3.9
## 7 4.6 3.4
## 8 5 3.4
## 9 4.4 2.9
## 10 4.9 3.1
## # ℹ 140 more rows
iris %>% filter(Species == "setosa")
## # A tibble: 50 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 40 more rows
iris %>% arrange(desc(Sepal.Length))
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 7.9 3.8 6.4 2 virginica
## 2 7.7 3.8 6.7 2.2 virginica
## 3 7.7 2.6 6.9 2.3 virginica
## 4 7.7 2.8 6.7 2 virginica
## 5 7.7 3 6.1 2.3 virginica
## 6 7.6 3 6.6 2.1 virginica
## 7 7.4 2.8 6.1 1.9 virginica
## 8 7.3 2.9 6.3 1.8 virginica
## 9 7.2 3.6 6.1 2.5 virginica
## 10 7.2 3.2 6 1.8 virginica
## # ℹ 140 more rows
iris %>% group_by(Species) %>% summarize_all(mean)
## # A tibble: 3 × 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versicolor 5.94 2.77 4.26 1.33
## 3 virginica 6.59 2.97 5.55 2.03
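In recent dplyr versions, summarize_all() is documented as superseded in favor of across(). The following sketch shows what should be an equivalent formulation of the grouped mean; the variable names `by_all` and `by_across` are only for illustration.

```r
library(dplyr)

# Grouped means with the older scoped verb ...
by_all <- iris %>%
  group_by(Species) %>%
  summarize_all(mean)

# ... and with across(); everything() here covers all non-grouping columns.
by_across <- iris %>%
  group_by(Species) %>%
  summarize(across(everything(), mean))

by_across
```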
iris %>% mutate(Sepal.Length_scaled = drop(scale(Sepal.Length)))
## # A tibble: 150 × 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_scaled
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 5.1 3.5 1.4 0.2 setosa -0.898
## 2 4.9 3 1.4 0.2 setosa -1.14
## 3 4.7 3.2 1.3 0.2 setosa -1.38
## 4 4.6 3.1 1.5 0.2 setosa -1.50
## 5 5 3.6 1.4 0.2 setosa -1.02
## 6 5.4 3.9 1.7 0.4 setosa -0.535
## 7 4.6 3.4 1.4 0.3 setosa -1.50
## 8 5 3.4 1.5 0.2 setosa -1.02
## 9 4.4 2.9 1.4 0.2 setosa -1.74
## 10 4.9 3.1 1.5 0.1 setosa -1.14
## # ℹ 140 more rows
Complete example:
iris %>% filter(Species == "setosa") %>%
  select(starts_with("Sepal")) %>%
  arrange(desc(Sepal.Length)) %>%
  head(3)
## # A tibble: 3 × 2
## Sepal.Length Sepal.Width
## <dbl> <dbl>
## 1 5.8 4
## 2 5.7 4.4
## 3 5.7 3.8
dplyr also provides join functions (e.g., inner_join()) to merge multiple tables.
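As a small illustration of a join, the sketch below merges two made-up tables on a shared key column. The tables `players` and `teams` and their contents are hypothetical; inner_join() keeps only rows whose key appears in both tables.

```r
library(dplyr)

# Hypothetical lookup tables sharing the key column "team".
players <- tibble(
  name = c("Ann", "Bob", "Cy"),
  team = c("ANA", "BOS", "XXX")
)
teams <- tibble(
  team = c("ANA", "BOS"),
  city = c("Anaheim", "Boston")
)

# "Cy" is dropped because team "XXX" has no match in teams.
joined <- inner_join(players, teams, by = "team")

joined
```

left_join() would instead keep all rows of `players` and fill the missing city with NA.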
Tidy data with tidyr
Data we get is typically organized in “wide format,” where each observation has its own row and each variable has its own column.
The iris dataset is in the wide format where each row is a flower and each column contains the values of a measurement.
head(iris)
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
An alternative format is the “long format.” In this format, each measured value has its own row, identified by the observation it belongs to and the name of the variable.
To convert the data into a long format, we can use pivot_longer(). I first add a column with the row ID to identify a flower.
iris_long <- iris %>%
  rowid_to_column("ID") %>%
  pivot_longer(cols = c("Sepal.Length", "Sepal.Width",
                        "Petal.Length", "Petal.Width"))
iris_long
## # A tibble: 600 × 4
## ID Species name value
## <int> <fct> <chr> <dbl>
## 1 1 setosa Sepal.Length 5.1
## 2 1 setosa Sepal.Width 3.5
## 3 1 setosa Petal.Length 1.4
## 4 1 setosa Petal.Width 0.2
## 5 2 setosa Sepal.Length 4.9
## 6 2 setosa Sepal.Width 3
## 7 2 setosa Petal.Length 1.4
## 8 2 setosa Petal.Width 0.2
## 9 3 setosa Sepal.Length 4.7
## 10 3 setosa Sepal.Width 3.2
## # ℹ 590 more rows
Note that each flower now has multiple rows, one for each variable.
One application of the long format is to use ggplot2 facets to produce a histogram for each variable, organized by species.
ggplot(iris_long, mapping = aes(value)) +
geom_histogram() +
facet_grid(cols = vars(name), rows = vars(Species))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Long format can be converted to a wide format using pivot_wider().
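A minimal sketch of the reverse conversion, using a small hand-made long table rather than the full iris data (the names `long` and `wide` are only for illustration). The column names `name` and `value` match the defaults produced by pivot_longer().

```r
library(tidyr)
library(tibble)

# Hypothetical long-format data: two flowers, two measurements each.
long <- tibble(
  ID    = c(1L, 1L, 2L, 2L),
  name  = c("Sepal.Length", "Sepal.Width", "Sepal.Length", "Sepal.Width"),
  value = c(5.1, 3.5, 4.9, 3.0)
)

# Each distinct entry in `name` becomes a column filled from `value`.
wide <- pivot_wider(long, names_from = name, values_from = value)

wide
```

The ID column is what lets pivot_wider() know which values belong on the same row.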
Strings, Dates, Factors, etc.
Tidyverse provides many packages to work with specific kinds of data:
- tidyr: Data cleaning and shaping.
- forcats: Tools for working with categorical variables (factors).
- lubridate: Make dealing with dates a little easier.
- stringr: A cohesive set of functions designed to make working with strings as easy as possible.
- purrr: Apply functions with tidyverse (functional programming).
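To give a flavor of these packages, here is a brief sketch with one call from each of stringr, lubridate, forcats, and purrr. The input values are made up, and each package offers far more than the single function shown.

```r
library(stringr)
library(lubridate)
library(forcats)
library(purrr)

# stringr: convert a string to upper case.
upper <- str_to_upper("setosa")

# lubridate: parse a year-month-day string and extract the year.
yr <- year(ymd("2023-04-15"))

# forcats: reverse the order of factor levels.
rev_levels <- levels(fct_rev(factor(c("a", "b"))))

# purrr: apply a function to each list element, returning a double vector.
means <- map_dbl(list(1:3, 4:6), mean)
```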
The best starting point is the RStudio Cheatsheets.