7 The tidyverse
7.1 Overview
The tidyverse is a collection of R packages that aim to augment and replace much of base R's functionality, providing a more consistent user interface and focusing on data science rather than classical statistics or data analysis. The name is based on the concept of tidy data[^24], which essentially requires that the data be arranged in a rectangular table where every column represents a variable, and every row an observational unit (subject). Table 7.1 lists some important tidyverse packages.
Table 7.1: Important tidyverse packages and their base R counterparts.

| Tidyverse package | Related R concept(s) | Base R functions / packages |
|---|---|---|
| ggplot2 | Graphics | `plot`, `boxplot`, `par` |
| dplyr | Data handling | `[]`, `subset`, `transform`, `aggregate` |
| tidyr | Data cleaning | `reshape` |
| readr | Text data import | `read.table` |
| purrr | Functional programming | `lapply`, `sapply`, `Map` |
| tibble | Data frames | `data.frame` and related |
| stringr | String processing | `nchar`, `substr`, `paste` etc. |
| forcats | Grouping variables | `factor` |
| haven | Data import | e.g. package `foreign` |
| readxl | Excel data import | e.g. package `openxlsx` |
| lubridate | Dates & times | `as.Date`, `strptime` |
| magrittr | Pipelines | `\|>` |
Now, in terms of data, tidiness is only a re-formulation of the concept of “rectangular statistical data” (though the examples in the original publication of how this concept can be violated or achieved are still interesting). However, there are a number of other common design features shared by many tidyverse packages:
- Functions tend to also generate tidy rectangular output, e.g. for statistical tests or regression models, which in base R typically return named lists (Section FIXME): this can be easier to read, and allows elegant processing of analysis results.[^25]
- A focus on arranging workflows as pipelines, where the output from one function call is “pumped” directly into another function, rather than creating temporary objects for intermediate results or using nested function calls: consequently, many tidy functions have a data frame or similar object as their first argument, rather than as a later argument (as in base R functions like `t.test` or `lm`).
- Functions often accept variable names in data-frame-like objects in a stripped-down form, without making use of the `$`-notation or turning variables into quoted strings, similar to the interface for `subset` or `transform` in base R (see the short sketch right after this list).[^26]
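To make the last point concrete with base R functions we already know, here is a minimal sketch, using the salt_ht data set introduced below (output omitted); note how the variable names age, sbp and dbp appear directly, without quotes or `$`:
> subset(salt_ht, age > 60, select = c(sbp, dbp))
> transform(salt_ht, log_sbp = log(sbp))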
Below, we will use `dplyr`, a popular package for data processing, to demonstrate these design features, and to show what typical interactions with a tidyverse package can look like.
7.2 Example: Using dplyr for data processing
7.2.1 Overview
`dplyr`[^27] can be used for general processing of data frames and data frame-like objects, and implements the same basic functionality for sorting, filtering, extracting and transforming data described for base R in Section 5; `dplyr` also supports merging and per-group processing of data, as described for base R in Section 6. We will use the same example (data on blood pressure and salt intake) and replicate the operations in these sections.
Table 7.2: dplyr functions and their base R counterparts.

| dplyr functions | Purpose | Base R functions |
|---|---|---|
| `slice` | Extract rows (by position) | `subset`, `[` (positional) |
| `filter` | Extract rows (conditionally) | `subset`, `[` (conditional) |
| `select` | Extract columns | `subset`, `[` |
| `arrange` | Sort rows | `[` with `order` |
| `mutate` | Transform/create columns | `transform`, `[` |
| `group_by` + `summarise` | Groupwise processing | `split` + `lapply` |
| `left_join` etc. | Combine datasets | `merge` |
As seen in Table 7.2, `dplyr` functionality is systematically implemented using functions, in contrast to base R, where we also use the `[`- and `$`-operators. These functions have a similar interface to the base functions `subset` and `transform`: the data object is the first argument, and variable names can be used directly, without having to quote them or use the `$`-notation.[^28]
7.2.2 Basic data operations
Extracting rows of data (i.e. units of observation/subjects) by their position has its own command, namely `slice`. This code extracts rows 1, 3, 4 and 89:
> library(dplyr)
> slice(salt_ht, c(1, 3, 4, 89))
ID sex sbp dbp saltadd age
1 4305 male 110 80 yes 58
2 5758 female 196 128 <NA> 53
3 2265 male 167 112 yes 68
4 8627 female 80 62 yes 31
Extracting rows of data based on a logical condition, however, is done via the command `filter`. This extracts all subjects older than 60 years:
> filter(salt_ht, age > 60)
ID sex sbp dbp saltadd age
1 2265 male 167 112 yes 68
2 9605 male 198 119 yes 63
3 4767 female 149 72 yes 63
4 1024 female 178 128 yes 63
5 9962 male 140 90 no 67
6 5034 female 128 84 no 64
7 1842 female 184 148 yes 61
8 7146 male 160 98 no 64
9 1457 male 170 98 no 66
10 7276 female 201 124 no 71
11 5534 male 200 110 no 66
12 9899 female 223 102 yes 63
You can use multiple logical expressions separated by a comma, in which case the expressions are connected via logical AND; so to extract all subjects who are both female and over 60, you can write
> filter(salt_ht, sex == "female", age > 60)
ID sex sbp dbp saltadd age
1 4767 female 149 72 yes 63
2 1024 female 178 128 yes 63
3 5034 female 128 84 no 64
4 1842 female 184 148 yes 61
5 7276 female 201 124 no 71
6 9899 female 223 102 yes 63
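Note that the comma notation only expresses logical AND; other conditions can be built with R's usual logical operators within a single expression, e.g. `|` for OR. A minimal sketch (output omitted):
> filter(salt_ht, age < 30 | age > 70)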
Extracting columns (instead of rows) from a data frame has yet another command, namely `select`.[^29] However, in contrast to `slice`/`filter`, which are very specific about how you can extract rows, `select` is flexible in how you can select columns:[^30] both position (integers) and names work, e.g.:
> select(salt_ht, 1:3)
ID sex sbp
1 4305 male 110
2 6606 female 85
3 5758 female 196
4 2265 male 167
5 7408 female 145
[Skipped 96 rows of output]
> select(salt_ht, sbp, dbp)
sbp dbp
1 110 80
2 85 55
3 196 128
4 167 112
5 145 110
[Skipped 96 rows of output]
`select` also supports special helper functions that allow pattern matching on variable names, like `starts_with` or `ends_with`, as well as selection by variable type, e.g.:
> select(salt_ht, where(is.numeric))
ID sbp dbp age
1 4305 110 80 58
2 6606 85 55 32
3 5758 196 128 53
4 2265 167 112 68
5 7408 145 110 55
[Skipped 96 rows of output]
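A name-based selection with one of these helpers might look like this (a minimal sketch; output omitted, but it would contain exactly the two blood pressure columns):
> select(salt_ht, ends_with("bp"))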
Selecting a rectangular subset of a data frame does not have a separate function in `dplyr`, in contrast to base R, where simultaneous selection of rows and columns via the `[,]`-notation is routine. Instead, this is done in two steps, by first selecting rows from the original data frame, and then selecting columns from the selected rows (or the other way around, of course). In classic R notation, this can be done as a nested function call:
> select(slice(salt_ht, 1:4), 4:5)
dbp saltadd
1 80 yes
2 55 no
3 128 <NA>
4 112 yes
> select(filter(salt_ht, age > 65), dbp, sbp)
dbp sbp
1 112 167
2 90 140
3 98 170
4 124 201
5 110 200
Here, the first (inner) function call selects the subset of rows, including all columns; this smaller data set is then passed as argument to select
, which extracts the specified columns.
**Piping** Nested function calls can take time to get used to, and will get harder to write and read the more processing steps are nested within each other. This is where the tidyverse's love for pipeline operators comes in: typically, the rectangular selections above would be written as
> slice(salt_ht, 1:5) %>%
+ select(4:5)
dbp saltadd
1 80 yes
2 55 no
3 128 <NA>
4 112 yes
5 110 yes
> filter(salt_ht, age > 60) %>%
+ select(dbp, sbp)
dbp sbp
1 112 167
2 119 198
3 72 149
4 128 178
5 90 140
6 84 128
7 148 184
8 98 160
9 98 170
10 124 201
11 110 200
12 102 223
Note that these pipelines are just a different way of specifying the same nested function calls (select rows, then select columns) - but now we process the data in the same direction as we read and write the code (i.e. from left to right), which many find more intuitive.[^31]
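As the footnote mentions, the same pipeline can also be written with base R's native pipe operator `|>` (available since R 4.1.0), with identical results and no add-on package:
> filter(salt_ht, age > 60) |>
+ select(dbp, sbp)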
**Sorting** Let's use the pipe notation from this point forward. `dplyr` also has a special function for sorting data frames, called `arrange`. If we want to sort subjects by increasing age, but only display the top five rows (to save space), we can do this:
> arrange(salt_ht, age) %>%
+ slice(1:5)
ID sex sbp dbp saltadd age
1 5514 female 132 80 no 26
2 5618 male 116 75 no 29
3 7993 female 108 72 <NA> 30
4 4204 male 125 84 no 30
5 7663 female 105 78 <NA> 31
And we see five rows of participants, starting with age 26 and increasing from there.
This can easily be extended to sorting rows by multiple criteria:
> arrange(salt_ht, age, sex, dbp) %>%
+ slice(1:12)
ID sex sbp dbp saltadd age
1 5514 female 132 80 no 26
2 5618 male 116 75 no 29
3 7993 female 108 72 <NA> 30
4 4204 male 125 84 no 30
5 8627 female 80 62 yes 31
6 7663 female 105 78 <NA> 31
7 5988 female 118 82 yes 31
8 8550 male 120 68 yes 31
9 5345 male 110 72 <NA> 31
10 2215 male 120 82 no 31
11 6606 female 85 55 no 32
12 8202 female 110 70 no 32
Here, rows are first sorted by age; subjects with the same age are then sorted by sex[^32], and individuals with the same age and sex by diastolic blood pressure.
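To sort in decreasing rather than increasing order, we can wrap a variable in the `dplyr` helper function `desc`; a minimal sketch (output omitted):
> arrange(salt_ht, desc(age)) %>%
+ slice(1:5)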
**Modifying data** The function `mutate` can both modify existing variables and generate new variables in a data frame. If we want to add the (natural) logarithms of the blood pressure variables to the data, we do this:
> mutate(salt_ht, log_dbp = log(dbp), log_sbp = log(sbp)) %>%
+ slice(1:4)
ID sex sbp dbp saltadd age log_dbp log_sbp
1 4305 male 110 80 yes 58 4.382027 4.700480
2 6606 female 85 55 no 32 4.007333 4.442651
3 5758 female 196 128 <NA> 53 4.852030 5.278115
4 2265 male 167 112 yes 68 4.718499 5.117994
`mutate` works broadly the same as base R's `transform` (you could just exchange `mutate` for `transform` in the statement above, and it would still work).[^33]
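So far we have only created new variables; overwriting an existing variable works in exactly the same way, by assigning to an existing name. A minimal sketch (output omitted):
> mutate(salt_ht, age = age * 12) %>%   # age now in months
+ slice(1:4)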
**Merging data** As in Section @ref{merge-data-base}, we can add the sampling location (health centers) to the basic data set, here using the function `left_join`:
> salt_ht_centers2 <- left_join(salt_ht, centers, by = "ID")
> summary(salt_ht_centers2)
ID sex sbp dbp saltadd age Center
Min. :1006 female:55 Min. : 80.0 Min. : 55.00 no :37 Min. :26.00 A:33
1st Qu.:2879 male :45 1st Qu.:121.0 1st Qu.: 80.00 yes :43 1st Qu.:39.75 B:36
Median :5237 Median :148.5 Median : 96.00 NA's:20 Median :50.00 C:31
Mean :5227 Mean :154.3 Mean : 98.51 Mean :48.71
3rd Qu.:7309 3rd Qu.:184.0 3rd Qu.:116.25 3rd Qu.:58.00
Max. :9962 Max. :238.0 Max. :158.00 Max. :71.00
Again, `left_join` is more specialized than the corresponding base R function `merge`: it will keep all observations (rows) in the first (“left”) data object, and only add information from the second data object where the key variable (`ID`) matches; other types of data merges (e.g. only keeping rows where the key variable(s) appear in both data objects) have their own functions in `dplyr` (see `?inner_join`). In contrast, `merge` implements these different merging operations by setting different arguments (like `all.x`).[^34]
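For example, the inner-join variant, which keeps only rows whose ID appears in both data objects, would look like this (a minimal sketch; output omitted):
> inner_join(salt_ht, centers, by = "ID")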
Exercises: Use the appropriate `dplyr` commands to extract the following subsets of `salt_ht`:
- the first row only;
- all female participants;
- all participants with systolic blood pressure over 160, or diastolic blood pressure over 100;
- the first three columns of data;
- all variables whose name ends in “bp”;
- as for 3, but now sorted by decreasing age.
7.2.3 Groupwise data operations
**Grouped data** As seen in Section @ref{group-stats}, we can use the base R function `aggregate` to calculate per-group summaries, by specifying the data, the group memberships, and a suitable summary function. `dplyr` implements this functionality somewhat differently, by adding the grouping information directly to the data, via the function `group_by`, and carrying this grouped data set forward for processing and analysis:
> group_by(salt_ht, sex)
# A tibble: 100 × 6
# Groups: sex [2]
ID sex sbp dbp saltadd age
<int> <fct> <int> <int> <fct> <int>
1 4305 male 110 80 yes 58
2 6606 female 85 55 no 32
3 5758 female 196 128 <NA> 53
4 2265 male 167 112 yes 68
5 7408 female 145 110 yes 55
6 2160 female 179 120 no 60
7 8846 male 111 78 no 59
8 8202 female 110 70 no 32
9 9605 male 198 119 yes 63
10 4137 male 171 102 yes 58
# ℹ 90 more rows
However, because ordinary data frames do not support the inclusion of grouping information, the output from `group_by` is a generalization of the data frame object called a tibble: looking at the output above, we see it stated, right at the top, that this is a tibble with 100 rows and six columns; this is directly followed by the information that this is a grouped tibble, with the grouping variable `sex`, which has two distinct levels. Only then is the actual content of the tibble listed, which is the same as for the underlying data frame `salt_ht`.
Note that the display offers both more and less information than the display of data frames: for each column, the type of the variable is listed under the name, as integer or factor (more), but only the top ten rows are shown (less). The second part (fewer rows) is a feature, as this means that the relevant tibble and column information will not scroll out of the console, as it often does when listing data frames.[^35]
An important point about the tibble is that it is a generalized data frame, in the sense that underneath, it is still a data frame. This means we can still use all the techniques we have seen for data frames (`[`- and `$`-notation, `subset` etc.) on tibbles, and if worst comes to worst, use the function `as.data.frame` to drop all the tibble parts and revert to a simple (non-generalized) data frame. In other words, this is not a radically new concept around which we have to wrap our heads, but rather a logical extension of what we have been using all along.
For now, let's store this grouped tibble as a separate object, for re-use further down the line:
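> salt_ht_bysex <- group_by(salt_ht, sex)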
**Groupwise summaries** We can calculate per-group summaries using the `dplyr` function `summarise`:
> salt_ht_bysex %>%
+ summarise(Count = n(), Mean_syst = mean(sbp), StdDev_syst = sd(sbp))
# A tibble: 2 × 4
sex Count Mean_syst StdDev_syst
<fct> <int> <dbl> <dbl>
1 female 55 158. 43.2
2 male 45 150. 33.8
Here, we drop the grouped data as the first argument into the `summarise` function via the pipe operator `%>%`, and we define three summaries that we want to have calculated for each sex:

- the number of samples, calculated via the helper function `n()` and stored as variable `Count`,
- the mean systolic blood pressure, calculated via base `mean` and stored as variable `Mean_syst`,
- the standard deviation of the systolic blood pressure, calculated via base `sd` and stored as variable `StdDev_syst`.
Note that we get to choose the names of the new variables (on the left-hand side) ourselves, but we refer to existing variables on the right-hand side. The output is again a tibble, but no longer grouped, with only two rows, one for each sex in the original grouped data, and four columns: the three new variables and the grouping variable.
We can do the same thing for more than one grouping variable, e.g. `sex` and `saltadd`:
> group_by(salt_ht, sex, saltadd) %>%
+ summarise(Count = n(), Mean_syst = mean(sbp), StdDev_syst = sd(sbp))
# A tibble: 6 × 5
# Groups: sex [2]
sex saltadd Count Mean_syst StdDev_syst
<fct> <fct> <int> <dbl> <dbl>
1 female no 13 134. 35.1
2 female yes 28 164. 41.4
3 female <NA> 14 167. 48.1
4 male no 24 139. 26.8
5 male yes 15 160. 36.4
6 male <NA> 6 167 42.8
The output here is a tibble with six rows, as `dplyr` will by default include missing values in a grouping variable as an extra level, which is arguably preferable to silently dropping these rows.
(As it happens, here the individuals with missing salt-added information have rather high mean systolic blood pressures, so it would be interesting to look more closely at why the data is actually missing here.)
**Groupwise filtering** Extracting rows from a grouped tibble via `filter` and `slice` also respects the grouping, i.e. rows will be extracted per group: while the condition `age == max(age)` will extract all participants with the highest age from the original (ungrouped) data (one or more), it will extract the oldest participants from each group of the grouped tibble (so at least two):
> filter(salt_ht, age == max(age))
ID sex sbp dbp saltadd age
1 7276 female 201 124 no 71
> filter(salt_ht_bysex, age == max(age))
# A tibble: 2 × 6
# Groups: sex [2]
ID sex sbp dbp saltadd age
<int> <fct> <int> <int> <fct> <int>
1 2265 male 167 112 yes 68
2 7276 female 201 124 no 71
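Similarly, `slice` on a grouped tibble extracts rows per group; a minimal sketch that returns the first row of each sex group (output omitted):
> slice(salt_ht_bysex, 1)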
**Complex groupwise operations** Grouped tibbles can also be used for more complex operations, like fitting separate regression models to different parts of the data. The most direct way is via the `do` function:[^36]
> split_lm2 <- salt_ht_bysex %>%
+ do(model = lm(sbp ~ dbp, data = .))
> split_lm2
# A tibble: 2 × 2
# Rowwise:
sex model
<fct> <list>
1 female <lm>
2 male <lm>
Note how we specify the per-group data here, namely via a single dot `.`: when running the `do` function, the dot is replaced by the corresponding subset of the data, one per grouping level. The result is a tibble with as many rows as groups, and a new variable with our (freely chosen) name `model`, which internally is a list of linear models, even if this is not obvious from the tibble output.
This means we can use the double-bracket `[[`-notation for list elements (Section @ref{basic_lists}), or the function `lapply` (Section @ref{split-apply-combine}), to process all models:
> split_lm2$model[[1]]
Call:
lm(formula = sbp ~ dbp, data = .)
Coefficients:
(Intercept) dbp
11.63 1.45
> lapply(split_lm2$model, summary)
[[1]]
Call:
lm(formula = sbp ~ dbp, data = .)
Residuals:
Min 1Q Median 3Q Max
-42.271 -12.851 -6.396 12.591 63.441
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.6308 12.5337 0.928 0.358
dbp 1.4503 0.1206 12.025 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.61 on 53 degrees of freedom
Multiple R-squared: 0.7318, Adjusted R-squared: 0.7267
F-statistic: 144.6 on 1 and 53 DF, p-value: < 2.2e-16
[[2]]
Call:
lm(formula = sbp ~ dbp, data = .)
Residuals:
Min 1Q Median 3Q Max
-34.091 -10.973 -1.218 10.609 35.968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.6551 12.8944 -0.128 0.898
dbp 1.5850 0.1323 11.983 2.71e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.42 on 43 degrees of freedom
Multiple R-squared: 0.7695, Adjusted R-squared: 0.7642
F-statistic: 143.6 on 1 and 43 DF, p-value: 2.707e-15
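As discussed in footnote 36, `do` has been superseded; a minimal sketch of the currently preferred alternative via `nest_by`, which stores each group's rows in a list column called `data`, could look like this (output omitted):
> salt_ht %>%
+ nest_by(sex) %>%
+ mutate(model = list(lm(sbp ~ dbp, data = data)))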
Exercise: Group the `salt_ht` data frame by `sex` and `saltadd`, and calculate the mean and median difference between systolic and diastolic blood pressure for each group.
7.3 tidyverse vs base R?
If the question in the heading does not make sense to you: fine! I agree - R is a general-purpose computer language, and the tidyverse just represents one specific, if rather opinionated and coherent, way of using that language to express things - a dialect, if you will, and so far a mutually intelligible dialect of R (in the sense that e.g. a tibble is still a data frame). By its nature, R can be extended and modified to serve different uses, and an extra set of well-designed add-on packages (which is what the tidyverse is, after all) just adds to the available choices: we can and should mix and match pragmatically according to our needs and preferences.
If the question in the heading is obvious and important to you: fine! I can see where you are coming from, and FWIW, large (or at least noisy) parts of the internet agree with you (try googling “tidyverse vs base R”). When starting with R as a new language and analysis tool, its very flexibility can make it hard to get a grip on how to do things - even simple things, at first; encountering two very different approaches is not exactly helpful in this setting.
But consider these points:
- If you are using R as your primary analysis tool longer term, the choice of how to do things, whether to implement your own solution or to find and adapt existing code, will come upon you; base R vs tidyverse vs e.g. `data.table` is just the beginning, so you may as well embrace diversity right from the start.
- If you are using R only as a tool for a very specific project or study: you are missing out… but more seriously, that's fine; suit yourself and go with whatever gets the job done better for you.
- Which brings us to the next point, which is sadly absent in many discussions about base R and the tidyverse: there is no general answer independent of the specific use case you are considering. A data scientist implementing a processing pipeline for massive amounts of continuously generated data, a researcher implementing a study protocol for an essentially fixed study cohort, and someone writing R code to implement new methods for other people to use on their data all have very different needs and requirements.
- Much online discussion is worthless: you can quite safely ignore everything technical older than 2-3 years, anything that argues merely based on elegance, and anything that comes to strong general conclusions without regard to the use case (see the previous point). There is material for at least one fascinating ethnographic PhD in how the base R vs tidyverse dichotomy reflects a generational and social shift from the original R Core Team to, say, the developers currently employed at Posit Software… I'm not going to write it; but be aware that much online (and some offline) discussion is driven more by tribal affiliation than by sober consideration of facts.
- In research at least, it is quite likely that you will spend orders of magnitude more time writing and debugging code than actually running it. Any discussion of performance and efficiency has to value the time you spend on these activities, not just the time spent on code execution (and this includes the time you spend on learning any new tools to do your coding and debugging).
While this may seem like a momentous decision to be taken early on and followed through to the bitter end, I really don't think it is. In the light of the rapid prototyping and incremental improvement approach that R so excels at, and much in line with the previous points, it is important to

a. get something off the ground that is demonstrably correct, with whatever tool you are most familiar with, and
b. modify and adapt your approach whenever you need to (e.g. because run times become unacceptable).

Of course, to do b. well, you have to at least be aware of what alternatives there are, so maybe it was for the best that we had this little talk about the tidyverse…
[^24]: Wickham, H (2014): Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10

[^25]: Though not all - e.g. `ggplot2` predates the tidyverse concept and neither requires nor generates tidy data.

[^26]: This is known as non-standard evaluation; the Advanced R book has a good introduction.

[^27]: According to the author, “The d is for dataframes, the plyr is to evoke pliers. Pronounce however you like.”

[^28]: Note that the names and the organisation of the `dplyr` functions are clearly inspired by SQL, so if you have experience with relational databases, much may look familiar.

[^29]: FIXME - should we talk about name collisions, dplyr / MASS, dplyr / plyr?

[^30]: As a matter of fact, `select` supports a whole mini-language for specifying a subset of variables, which has been implemented as a separate package `tidyselect`; see e.g. `?starts_with`. This is IMO somewhat excessive.

[^31]: Note that “piping” is in no way limited to the tidyverse - you can use the operator `%>%` implemented in package `magrittr` with any R code; indeed, since version 4.1.0, base R has its own pipe operator `|>`, which works in a very similar manner (and does not require an add-on package).

[^32]: In the example alphabetically, but really according to the order of the levels of the factor variable on which we sort.

[^33]: FIXME - explain differences (rolling definitions)?

[^34]: Of course, join is again a relational database term, and the join functions in `dplyr` correspond directly to the SQL commands of the same name.

[^35]: The printing method for tibbles will try to show as much of the data as it can fit into the current console window, but will list the number of skipped rows and, if applicable, also the names and number of skipped variables that did not fit.

[^36]: Actually, if you look at the help page `?do`, you will find that this approach has been marked as “superseded”: while it still works, it is no longer developed or maintained; the preferred approach (for now) is to use `nest_by` instead of `group_by`. So why bring this up at all? The short answer is that this is just a very short overview of `dplyr`, and I don't want to go into too much detail. The longer answer is that this is an example of how things can change in tidyverse packages: the explanation for superseding `do` is given as “because its syntax never really felt like it belong with the rest of dplyr”, which seems somewhat arbitrary; also note that the replacement requires two new functions, `nest_by` and `across`, and the decidedly non-tidy new concept of a “rowwise” tibble (`?rowwise`). So what started as a purist commitment to tidy principles and an elegant extension of the data frame concept becomes increasingly complicated, and increasingly divorced from the underlying principles and concepts; and development is ongoing…