7 The tidyverse

7.1 Overview

The tidyverse is a collection of R packages which aim to augment and replace much base R functionality, providing a more consistent user interface and focusing on data science rather than classical statistics or data analysis. The name is based on the concept of tidy data24, which essentially requires that the data is arranged in a rectangular table where every column represents a variable, and every row an observational unit (subject). Table 7.1 lists some important tidyverse packages.

Table 7.1: Core tidyverse packages and corresponding functionality in base R (and add-on packages)
Tidyverse Related R concept(s) Base R functions / packages
ggplot2 Graphics plot, boxplot, par
dplyr Data handling [], subset, transform, aggregate
tidyr Data cleaning reshape
readr Text data import read.table
purrr Iteration lapply, sapply, Map
tibble Data frame data.frame and related
stringr String processing nchar, substr, paste etc.
forcats Grouping variables factor
haven Data import e.g. package foreign
readxl Excel data import e.g. package openxlsx
lubridate Dates & times as.Date, strptime
magrittr Pipelines |>
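
To make the tidy data concept concrete, here is a minimal, hypothetical sketch (plain base R, not data used elsewhere in this book) of the same blood pressure measurements in an untidy and a tidy layout:

# Untidy ("wide") layout: the measurement time is hidden in the
# column names rather than being a variable of its own
untidy <- data.frame(subject   = c("A", "B"),
                     bp_before = c(120, 140),
                     bp_after  = c(115, 132))

# Tidy layout: one column per variable (subject, time, bp),
# one row per observation
tidy <- data.frame(subject = rep(c("A", "B"), each = 2),
                   time    = c("before", "after", "before", "after"),
                   bp      = c(120, 115, 140, 132))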

Now in terms of data, tidiness is only a re-formulation of the concept of “rectangular statistical data” (though the examples of how this concept can be violated or achieved in the original publication are still interesting). However, there are a number of other common design features that are shared by many tidyverse packages:

  • Functions tend to also generate tidy rectangular output, e.g. for statistical tests or regression models, which in base R are typically named lists (Section FIXME): this can be easier to read, and allows elegant processing of analysis results (see the sketch after this list).25
  • A focus on arranging workflows as pipelines, where output from a function call is “pumped” directly into another function, rather than creating temporary objects for intermediate results, or using nested function calls: consequently, many tidy functions have a data frame or similar object as their first argument, rather than as their second argument (as in base R functions like t.test or lm).
  • Functions often accept variable names in data-frame-like objects in a stripped-down form, without making use of the $-notation or turning variable names into quoted strings, similar to the interface of subset or transform in base R.26
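
As a sketch of the first point, the add-on package broom (closely associated with the tidyverse, and assumed to be installed here) turns the named-list output of a base R model fit into a tidy data frame:

library(broom)

fit <- lm(mpg ~ wt, data = mtcars)  # built-in example data, not this book's
str(fit, max.level = 1)             # base R: a named list of components
tidy(fit)                           # broom: one tidy row per coefficient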

Below, we will use dplyr, a popular package for data processing, to demonstrate these design features, and to show what typical interactions with a tidyverse package look like.

7.2 Example: Using dplyr for data processing

7.2.1 Overview

dplyr27 can be used for general processing of data frames and data frame-like objects, and implements the same basic functionality for sorting, filtering, extracting and transforming data described for base R in Section 5; dplyr also supports merging and per-group processing of data as described for base R in Section 6. We will use the same example (data on blood pressure and salt intake) and replicate the operations in these sections.

Table 7.2: Correspondence between dplyr and base R functionality
dplyr functions Purpose Base R functions
slice Extract rows (by position) subset, [ (positional)
filter Extract rows (conditionally) subset, [ (conditional)
select Extract columns subset, [
arrange Sort rows [ with order
mutate Transform/create columns transform, [
group_by + summarise Groupwise processing split + lapply
left_join etc. Combine datasets merge

As seen in Table 7.2, dplyr functionality is systematically implemented using functions, in contrast to base R, where we also use the [- and $-operators. These functions have a similar interface to the base functions subset and transform, where the data object is the first argument, and where variable names can be used directly, without having to quote them or use the $-notation.28

7.2.2 Basic data operations

Extracting rows of data (i.e. units of observation/subjects) by their position has its own command, namely slice. This code extracts rows 1, 3, 4 and 89:

> library(dplyr)
> slice(salt_ht, c(1, 3, 4, 89))
    ID    sex sbp dbp saltadd age
1 4305   male 110  80     yes  58
2 5758 female 196 128    <NA>  53
3 2265   male 167 112     yes  68
4 8627 female  80  62     yes  31

Extracting rows of data based on a logical condition, however, is done via command filter. This extracts all subjects older than 60 years:

> filter(salt_ht, age > 60)
     ID    sex sbp dbp saltadd age
1  2265   male 167 112     yes  68
2  9605   male 198 119     yes  63
3  4767 female 149  72     yes  63
4  1024 female 178 128     yes  63
5  9962   male 140  90      no  67
6  5034 female 128  84      no  64
7  1842 female 184 148     yes  61
8  7146   male 160  98      no  64
9  1457   male 170  98      no  66
10 7276 female 201 124      no  71
11 5534   male 200 110      no  66
12 9899 female 223 102     yes  63

You can use multiple logical expressions separated by a comma, in which case the expressions are connected via logical AND; so to extract all subjects who are both female and over 60, you can write

> filter(salt_ht, sex == "female", age > 60)
    ID    sex sbp dbp saltadd age
1 4767 female 149  72     yes  63
2 1024 female 178 128     yes  63
3 5034 female 128  84      no  64
4 1842 female 184 148     yes  61
5 7276 female 201 124      no  71
6 9899 female 223 102     yes  63

Extracting columns (instead of rows) from a data frame has yet another command, namely select.29 However, in contrast to slice/filter, which are very specific about how you can extract rows, select is flexible in how you can select columns:30 both positions (integers) and names work, e.g.:

> select(salt_ht, 1:3)
      ID    sex sbp
1   4305   male 110
2   6606 female  85
3   5758 female 196
4   2265   male 167
5   7408 female 145
[Skipped 96 rows of output]
> select(salt_ht, sbp, dbp)
    sbp dbp
1   110  80
2    85  55
3   196 128
4   167 112
5   145 110
[Skipped 96 rows of output]

select also supports special functions that allow pattern matching on variable names, like starts_with or ends_with, as well as selection by variable type, e.g.:

> select(salt_ht, where(is.numeric))
      ID sbp dbp age
1   4305 110  80  58
2   6606  85  55  32
3   5758 196 128  53
4   2265 167 112  68
5   7408 145 110  55
[Skipped 96 rows of output]
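
The name-based helpers mentioned above work along the same lines; two untested sketches (output omitted):

select(salt_ht, starts_with("s"))   # selects sex, sbp, saltadd
select(salt_ht, ends_with("bp"))    # selects sbp, dbp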

Selecting a rectangular subset of a data frame object does not have a separate function in dplyr, in contrast to base R, where simultaneous selection of rows and columns via the [,]-notation is routine. Instead, this is done in two steps, by first selecting rows from the original data frame, and then selecting columns from the selected rows (or the other way around, of course). In classic R notation, this can be done as a nested function call:

> select(slice(salt_ht, 1:4), 4:5)
  dbp saltadd
1  80     yes
2  55      no
3 128    <NA>
4 112     yes
> select(filter(salt_ht, age > 65), dbp, sbp)
  dbp sbp
1 112 167
2  90 140
3  98 170
4 124 201
5 110 200

Here, the first (inner) function call selects the subset of rows, including all columns; this smaller data set is then passed as argument to select, which extracts the specified columns.

Piping Nested function calls can take time to get used to, and they get harder to write and read the more processing steps are nested within each other. This is where the tidyverse’s love for pipeline operators comes in: typically, the rectangular selections above would be written as

> slice(salt_ht, 1:5) %>%
+     select(4:5)
  dbp saltadd
1  80     yes
2  55      no
3 128    <NA>
4 112     yes
5 110     yes
> filter(salt_ht, age > 60) %>%
+     select(dbp, sbp)
   dbp sbp
1  112 167
2  119 198
3   72 149
4  128 178
5   90 140
6   84 128
7  148 184
8   98 160
9   98 170
10 124 201
11 110 200
12 102 223

Note that these pipelines are just a different way of specifying the same nested function calls (select rows, then select columns) - but now, we process the data in the same direction as we read & write the code (i.e. from left to right), which many find more intuitive.31
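
As mentioned in the footnote, since version 4.1.0 the same pipelines can also be written with the base R pipe operator |>, with no add-on package required; an equivalent sketch (output as above):

filter(salt_ht, age > 60) |>
    select(dbp, sbp)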

Sorting Let’s use the pipe notation from this point forward. dplyr also has a special function for sorting data frames, called arrange. If we want to sort subjects by increasing age, but only display the top five rows (to save space), we can do this:

> arrange(salt_ht, age) %>%
+     slice(1:5)
    ID    sex sbp dbp saltadd age
1 5514 female 132  80      no  26
2 5618   male 116  75      no  29
3 7993 female 108  72    <NA>  30
4 4204   male 125  84      no  30
5 7663 female 105  78    <NA>  31

And we see five rows of participants, starting with age 26 and increasing from there.

This can easily be extended to sorting rows by multiple criteria:

> arrange(salt_ht, age, sex, dbp) %>%
+     slice(1:12)
     ID    sex sbp dbp saltadd age
1  5514 female 132  80      no  26
2  5618   male 116  75      no  29
3  7993 female 108  72    <NA>  30
4  4204   male 125  84      no  30
5  8627 female  80  62     yes  31
6  7663 female 105  78    <NA>  31
7  5988 female 118  82     yes  31
8  8550   male 120  68     yes  31
9  5345   male 110  72    <NA>  31
10 2215   male 120  82      no  31
11 6606 female  85  55      no  32
12 8202 female 110  70      no  32

Here, rows are first sorted by age; subjects with the same ages are then sorted by sex32, and individuals with the same age/sex by diastolic blood pressure.
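
By default, arrange sorts in increasing order; to sort in decreasing order, wrap a variable in the dplyr helper desc(). A sketch (output omitted):

arrange(salt_ht, desc(age)) %>%
    slice(1:5)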

Modifying data The function mutate can both modify existing variables and generate new variables in a data frame. If we want to add the (natural) logarithms of the blood pressure variables to the data, we do this:

> mutate(salt_ht, log_dbp = log(dbp), log_sbp = log(sbp)) %>%
+     slice(1:4)
    ID    sex sbp dbp saltadd age  log_dbp  log_sbp
1 4305   male 110  80     yes  58 4.382027 4.700480
2 6606 female  85  55      no  32 4.007333 4.442651
3 5758 female 196 128    <NA>  53 4.852030 5.278115
4 2265   male 167 112     yes  68 4.718499 5.117994

mutate works broadly the same as base R transform (you could just exchange mutate for transform in the statement above, and it would still work).33
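
To modify an existing variable instead, simply assign to its current name; a purely illustrative sketch (output omitted):

mutate(salt_ht, age = age / 10) %>%   # age now in decades rather than years
    slice(1:4)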

Merging data As in Section @ref{merge-data-base}, we can add the sampling location (health centers) to the basic data set, here using the function left_join:

> salt_ht_centers2 <- left_join(salt_ht, centers, by = "ID")
> summary(salt_ht_centers2)
       ID           sex          sbp             dbp         saltadd        age        Center
 Min.   :1006   female:55   Min.   : 80.0   Min.   : 55.00   no  :37   Min.   :26.00   A:33  
 1st Qu.:2879   male  :45   1st Qu.:121.0   1st Qu.: 80.00   yes :43   1st Qu.:39.75   B:36  
 Median :5237               Median :148.5   Median : 96.00   NA's:20   Median :50.00   C:31  
 Mean   :5227               Mean   :154.3   Mean   : 98.51             Mean   :48.71         
 3rd Qu.:7309               3rd Qu.:184.0   3rd Qu.:116.25             3rd Qu.:58.00         
 Max.   :9962               Max.   :238.0   Max.   :158.00             Max.   :71.00         

Again, left_join is more specialized than the corresponding base R function merge: it will keep all observations (rows) in the first (“left”) data object, and only add information from the second data object where the key variable (ID) matches; other types of data merges (e.g. only keeping rows where the key variable(s) appear in both data objects) have their own functions in dplyr (see ?inner_join). In contrast, merge implements these different merging operations by setting different arguments (like all.x).34
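
For comparison, a sketch of the corresponding base R call: with all.x = TRUE, merge performs the same left join (output omitted):

salt_ht_centers_base <- merge(salt_ht, centers, by = "ID", all.x = TRUE)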

Exercises: Use the appropriate dplyr-commands to extract the following subsets of salt_ht:

  1. the first row only;
  2. all female participants;
  3. all participants with systolic blood pressure over 160, or diastolic blood pressure over 100;
  4. the first three columns of data;
  5. all variables whose name ends in “bp”;
  6. as for 3, but now sorted by decreasing age.

7.2.3 Groupwise data operations

Grouped data As seen in Section @ref{group-stats}, we can use the base R function aggregate to calculate per-group summaries, by specifying the data, the group memberships, and a suitable summary function. dplyr implements this functionality somewhat differently, by adding the grouping information directly to the data, via function group_by, and carrying this grouped data set forward for processing and analysis:

> group_by(salt_ht, sex)
# A tibble: 100 × 6
# Groups:   sex [2]
      ID sex      sbp   dbp saltadd   age
   <int> <fct>  <int> <int> <fct>   <int>
 1  4305 male     110    80 yes        58
 2  6606 female    85    55 no         32
 3  5758 female   196   128 <NA>       53
 4  2265 male     167   112 yes        68
 5  7408 female   145   110 yes        55
 6  2160 female   179   120 no         60
 7  8846 male     111    78 no         59
 8  8202 female   110    70 no         32
 9  9605 male     198   119 yes        63
10  4137 male     171   102 yes        58
# ℹ 90 more rows

However, because ordinary data frames do not support the inclusion of grouping information, the output from group_by is a generalization of the data frame object called a tibble: looking at the output above, we see it stated, right at the top, that this is a tibble with 100 rows and six columns; this is directly followed by the information that this is a grouped tibble, with the grouping variable sex, which has two distinct levels. Only then is the actual content of the tibble listed, which is the same as for the underlying data frame salt_ht.

Note that the display offers both more and less information than the display of data frames: for each column, the type of the variable is listed under its name, as integer or factor (more), but only the top ten rows are shown (less). The second part (fewer rows) is a feature, as it means that the relevant tibble/column information will not scroll out of the console, as it often does when listing data frames.35

An important point about the tibble is that it is a generalized data frame, in the sense that underneath, it is still a data frame. This means we can still use all the techniques we have seen for data frames ([- and $-notation, subset etc.) on tibbles, and if worst comes to worst, use the function as.data.frame to drop all the tibble parts and revert to a simple (non-generalized) data frame. In other words, this is not a radically new concept around which we have to wrap our heads, but rather a logical extension of what we have been using all along.
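
A quick sketch of this point (untested, output omitted):

tbl <- group_by(salt_ht, sex)
is.data.frame(tbl)      # TRUE: still a data frame underneath
tbl$age[1:3]            # $-notation works as usual
as.data.frame(tbl)      # drops grouping and all other tibble extras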

For now, let’s store this grouped tibble as a separate object for re-use further down the line:

> salt_ht_bysex <- group_by(salt_ht, sex)

Groupwise summaries We can calculate per-group summaries using the dplyr-function summarise:

> salt_ht_bysex %>%
+     summarise(Count = n(), Mean_syst = mean(sbp), StdDev_syst = sd(sbp))
# A tibble: 2 × 4
  sex    Count Mean_syst StdDev_syst
  <fct>  <int>     <dbl>       <dbl>
1 female    55      158.        43.2
2 male      45      150.        33.8

Here, we drop the grouped data as first argument into the summarise-function via the pipe operator %>%, and we define three summaries that we want to have calculated for each sex:

  • the number of samples, calculated via the helper function n(), and stored as variable Count,
  • the mean systolic blood pressure, calculated via base mean and stored as variable Mean_syst,
  • the standard deviation of the systolic blood pressure, calculated via base sd and stored as variable StdDev_syst.

Note that we get to choose the names of the new variables (on the left hand side) ourselves, but we refer to existing variables on the right hand side. The output is again a tibble, but no longer grouped, and with only two rows, one for each sex in the original grouped data, and four columns: the three new variables plus the grouping variable.
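
For comparison, a rough base R counterpart using aggregate, which handles one summary function at a time and returns a differently shaped result (output omitted):

aggregate(sbp ~ sex, data = salt_ht, FUN = mean)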

We can do the same thing for more than one grouping variable, e.g. sex and saltadd:

> group_by(salt_ht, sex, saltadd) %>%
+     summarise(Count = n(), Mean_syst = mean(sbp), StdDev_syst = sd(sbp))
# A tibble: 6 × 5
# Groups:   sex [2]
  sex    saltadd Count Mean_syst StdDev_syst
  <fct>  <fct>   <int>     <dbl>       <dbl>
1 female no         13      134.        35.1
2 female yes        28      164.        41.4
3 female <NA>       14      167.        48.1
4 male   no         24      139.        26.8
5 male   yes        15      160.        36.4
6 male   <NA>        6      167         42.8

The output here is a tibble with six rows, as dplyr will by default include missing values in a grouping variable as an extra level, which is arguably preferable to silently dropping these rows. (Note also that the result is still grouped by sex, as shown in the # Groups: header: with multiple grouping variables, summarise removes only the innermost grouping level.)

(As it happens, here the individuals with missing salt-added information have rather high mean systolic blood pressures, so it would be interesting to look more closely at why the data is actually missing here.)

Groupwise filtering Extracting rows from a grouped tibble via filter and slice also respects the grouping, i.e. rows will be extracted per group: while the condition age == max(age) extracts all participants with the highest overall age from the original data frame (one or more rows), applied to the grouped tibble it extracts the oldest participants from each group (so at least two rows):

> filter(salt_ht, age == max(age))
    ID    sex sbp dbp saltadd age
1 7276 female 201 124      no  71
> filter(salt_ht_bysex, age == max(age))
# A tibble: 2 × 6
# Groups:   sex [2]
     ID sex      sbp   dbp saltadd   age
  <int> <fct>  <int> <int> <fct>   <int>
1  2265 male     167   112 yes        68
2  7276 female   201   124 no         71

Complex groupwise operations Grouped tibbles can also be used for more complex operations, like fitting separate regression models to different parts of the data. The most direct way is via the do-function:36

> split_lm2 <- salt_ht_bysex %>%
+     do(model = lm(sbp ~ dbp, data = .))
> split_lm2
# A tibble: 2 × 2
# Rowwise: 
  sex    model 
  <fct>  <list>
1 female <lm>  
2 male   <lm>  

Note how we specify the per-group data here, namely via a single dot .: when running the do-function, the dot is replaced by the corresponding subsets of the data, one per grouping level. The result is a tibble with as many rows as groups, and a new variable with our (freely chosen) name model, which internally is a list of linear models, even if this is not obvious from the tibble output:

> is.list(split_lm2$model)
[1] TRUE

So we can use the double bracket [[-notation for list elements (Section @ref{basic_lists}), or the function lapply (Section @ref{split-apply-combine}), to process all models:

> split_lm2$model[[1]]

Call:
lm(formula = sbp ~ dbp, data = .)

Coefficients:
(Intercept)          dbp  
      11.63         1.45  
> lapply(split_lm2$model, summary)
[[1]]

Call:
lm(formula = sbp ~ dbp, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.271 -12.851  -6.396  12.591  63.441 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.6308    12.5337   0.928    0.358    
dbp           1.4503     0.1206  12.025   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.61 on 53 degrees of freedom
Multiple R-squared:  0.7318,    Adjusted R-squared:  0.7267 
F-statistic: 144.6 on 1 and 53 DF,  p-value: < 2.2e-16


[[2]]

Call:
lm(formula = sbp ~ dbp, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.091 -10.973  -1.218  10.609  35.968 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.6551    12.8944  -0.128    0.898    
dbp           1.5850     0.1323  11.983 2.71e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.42 on 43 degrees of freedom
Multiple R-squared:  0.7695,    Adjusted R-squared:  0.7642 
F-statistic: 143.6 on 1 and 43 DF,  p-value: 2.707e-15
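
For completeness, here is a sketch of the currently preferred alternative mentioned in the footnote: nest_by creates one row per group, with the per-group data in a list-column called data (untested; see ?nest_by):

salt_ht %>%
    nest_by(sex) %>%                                   # one row per sex
    mutate(model = list(lm(sbp ~ dbp, data = data)))   # one model per row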

Exercise: Group the salt_ht data frame by sex and saltadd, and calculate the mean and median difference between systolic and diastolic blood pressure for each group.

7.3 tidyverse vs base R?

If the question in the heading does not make sense to you: fine! I agree - R is a general purpose computer language, and the tidyverse just represents one specific, if rather opinionated and coherent, way of using that language to express things - a dialect, if you will, and so far a mutually intelligible dialect of R (in the sense that e.g. a tibble is still a data frame). By its nature, R can be extended and modified to serve different uses, and an extra set of well-designed add-on packages (which is what the tidyverse is after all) just adds to the available choices: we can and should mix and match pragmatically according to our needs and preferences.

If the question in the heading is obvious and important to you: fine! I can see where you are coming from, and FWIW, large (or at least noisy) parts of the internet agree with you (try googling “tidyverse vs base R”). When starting with R as a new language and analysis tool, its very flexibility can make it hard to get a grip on how to do things - even simple things, at first; encountering two very different approaches is not exactly helpful in this setting.

But consider these points:

  1. If you are using R as your primary analysis tool longer term, the choice of how to do things - whether to implement your own solution or to find and adapt existing code - will come up again and again; base R vs tidyverse vs e.g. data.table is just the beginning, so you may as well embrace diversity right from the start.

  2. If you are using R only as a tool for a very specific project or study: you are missing out… but more seriously, that’s fine; suit yourself and go with whatever gets the job done better for you.

  3. Which brings us to the next point, which is sadly absent in many discussions about base R and tidyverse: there is no general answer independent of the specific use case you are considering. A data scientist implementing a processing pipeline for massive amounts of continuously generated data, a researcher implementing a study protocol for an essentially fixed study cohort, and someone writing R code to implement new methods for other people to use on their data have very different needs and requirements.

  4. Much online discussion is worthless: you can quite safely ignore everything technical older than 2-3 years, anything that argues merely based on elegance, and anything that comes to strong general conclusions without regard to the use case (see previous point). There is material for at least one fascinating ethnographic PhD in how the base R vs tidyverse dichotomy reflects a generational and social shift from the original R Core Team to, say, the developers currently employed at Posit Software… I’m not going to write it; but be aware that much online (and some offline) discussion is driven more by tribal affiliation than by sober consideration of facts.

  5. In research at least, it is quite likely that you will spend orders of magnitude more time writing and debugging code than actually running it. Any discussion of performance and efficiency has to value the time you spend on these activities, not just the time spent on code execution (and this includes the time you spend on learning any new tools to do your coding & debugging).

  6. While this may seem like a momentous decision to be taken early on and followed through to the bitter end, I really don’t think it is. In the light of the rapid prototyping and incremental improvement approach that R so excels at, and much in line with the previous points, it is important to

    1. get something off the ground that is demonstrably correct, with whatever tool you are most familiar with,
    2. modify and adapt your approach whenever you need to (e.g. because run times become unacceptable).

    Of course, to do step 2 well, you have to at least be aware of what alternatives there are, so maybe it was for the best that we had this little talk about the tidyverse.


  24. Wickham, H (2014): Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10

  25. Though not all - e.g. ggplot2 predates the tidyverse concept and neither requires nor generates tidy data.

  26. This is known as non-standard evaluation; the Advanced R book has a good introduction.

  27. According to the author, “The d is for dataframes, the plyr is to evoke pliers. Pronounce however you like.”

  28. Note that the names and the organisation of the dplyr functions are clearly inspired by SQL, so if you have experience with relational databases, much may look familiar.

  29. FIXME - should we talk about name collisions, dplyr / MASS, dplyr / plyr?

  30. As a matter of fact, select supports a whole mini-language for specifying a subset of variables, which has been implemented as a separate package tidyselect, see e.g. ?starts_with. This is IMO somewhat excessive.

  31. Note that “piping” is in no way limited to the tidyverse - you can use the operator %>% implemented in package magrittr with any R code; indeed, since version 4.1.0, base R has its own pipe operator |> which works in a very similar manner (and does not require an add-on package).

  32. In the example alphabetically, but really according to the order of the levels of the factor variable on which we sort.

  33. FIXME - explain differences (rolling definitions)?

  34. Of course, join is again a relational database term, and the join-functions in dplyr correspond directly to the SQL commands of the same name.

  35. The printing method for tibbles will try to show as much of the data as fits into the current console window, but will list the number of skipped rows and, if applicable, also the names and number of skipped variables that did not fit.

  36. Actually, if you look at the help page ?do, you find that this approach has been marked as “superseded”: while it still works, it is no longer developed or maintained; the preferred approach (for now) is to use nest_by instead of group_by. So why bring this up at all?

    The short answer is that this is just a very short overview of dplyr, and I don’t want to go into too much detail. The longer answer is that this is an example of how things can change in tidyverse packages: the explanation for superseding do is given as “because its syntax never really felt like it belonged with the rest of dplyr”, which seems somewhat arbitrary; also note that this now requires two new functions, nest_by and across, and the decidedly non-tidy new concept of a “rowwise” tibble (?rowwise). So what started as a purist commitment to tidy principles and an elegant extension of the data frame concept becomes increasingly complicated, and increasingly divorced from the underlying principles and concepts; and development is ongoing…

    The short answer is that this is just a very short overview of dplyr, and I don’t want to go into too much detail. The longer answer is that this is an example for how things can change in tidyverse packages: the explanation for superseeding do is given as “because its syntax never really felt like it belong with the rest of dplyr”, which seems somewhat arbitrary; also note that this now requires two new functions, nest_by and across, and the decidely non-tidy new concept of a “rowwise” tibble (?rowwise). So what started as a purist commitment to tidy principles and an elegant extension of the data frame concept, becomes increasingly complicated, and increasingly divorced from the underlying principles and concepts; and development is ongoing…↩︎