6 Processing data

6.1 Overview

The previous section has already covered central aspects of data handling in base R: how to extract parts of a data frame, and how to change parts of a data frame, using either the bracket- or $-notation, or helper functions subset and transform. However, we have not yet discussed a good solution for working with groups of data, or how to combine data from multiple sources, which is what will be covered in this section. This will also involve a short excursion into how to write your own function in R, which is actually not hard and often very useful.

As example data, we will use the same data on blood pressure and salt consumption in 100 subjects already seen in Section 4:

> head(salt_ht)
    ID    sex sbp dbp saltadd age
1 4305   male 110  80     yes  58
2 6606 female  85  55      no  32
3 5758 female 196 128    <NA>  53
4 2265   male 167 112     yes  68
5 7408 female 145 110     yes  55
6 2160 female 179 120      no  60

As before, this section will only deal with methods from base R. For a demonstration of how to achieve the same tasks using the tidyverse, see Section 7.

6.2 Groupwise statistics

For groupwise descriptive statistics, base R has the function aggregate, which offers a convenient formula interface. If we want to calculate e.g. the average systolic blood pressure in our example data for males and for females separately, we can do this:

> aggregate(sbp ~ sex, data = salt_ht, mean)
     sex      sbp
1 female 157.8182
2   male 150.0444

The variable to be aggregated (systolic blood pressure) is specified on the left hand side of the formula, and the grouping variable (sex) on the right hand side; the second argument is the data frame from which we draw these variables, i.e. salt_ht; and the third and final argument is the function we want to use to calculate a groupwise summary, here just the usual mean-function.

The return value from the aggregate-function is a data frame with as many rows as groups in the data (so here two), which shows the grouping levels (male and female) and the summary statistic (an average of ca. 158 for women and 150 for men). Note that this is hugely more compact and elegant than our standard approach so far, which was defining two separate subsets for males and females and calculating the means on these.

If we want to calculate instead the standard deviation in each group, we just swap out the summary function in the call to aggregate, and replace mean with sd:

> aggregate(sbp ~ sex, data = salt_ht, sd)
     sex      sbp
1 female 43.24482
2   male 33.81632

We see that women also have greater variability in their blood pressures than men.

This basic approach can be extended in several ways. To start with, we can use a more complicated summary function like MeanCI from package DescTools to calculate separate means and confidence intervals:

> library(DescTools)
> aggregate(sbp ~ sex, data = salt_ht, MeanCI)
     sex sbp.mean sbp.lwr.ci sbp.upr.ci
1 female 157.8182   146.1275   169.5089
2   male 150.0444   139.8849   160.2040

MeanCI returns not just a single number like mean or sd, but a vector of length three, and consequently, the return value from aggregate has three separate columns for the summary statistics.

In addition, we can also specify multiple grouping variables at the same time, simply by listing them on the right hand side of the formula, separated by a plus sign. We then get summaries for all combinations of the specified grouping variables in the data set. Here, we get four groups for all combinations of male/female and salt-adder/non-salt-adder:

> aggregate(sbp ~ sex + saltadd, data = salt_ht, mean)
     sex saltadd      sbp
1 female      no 133.7692
2   male      no 139.4167
3 female     yes 164.4643
4   male     yes 160.2667

A point to remember is that aggregate will drop all rows of the data frame where any of the grouping variables are missing - you may remember from the initial example script that there are ca. 20% missing values for the saltadd variable, but this is not reflected in the results here. You can get around this by making the NAs a separate category of the saltadd variable in a pre-processing step, using the factor-function, but that is entirely your responsibility.
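To make this concrete, here is a minimal sketch on a small made-up data frame (the values are invented, not taken from salt_ht): re-coding the variable with factor(..., exclude = NULL), or equivalently the convenience function addNA, turns the missing values into an explicit factor level, so that aggregate reports them as their own group.

```r
# Toy stand-in for salt_ht (hypothetical values)
d <- data.frame(sbp = c(110, 120, 130, 140),
                saltadd = factor(c("yes", "no", NA, "yes")))
# exclude = NULL keeps NA as an explicit level of the factor
d$saltadd <- factor(d$saltadd, exclude = NULL)
# the <NA> group now shows up as its own row in the output
aggregate(sbp ~ saltadd, data = d, mean)
```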

On top of this, we can also do on-the-fly transformations of the outcome variable on the left hand side as part of the formula. If we wanted for some reason to calculate the groupwise averages of the logarithmised blood pressures, we could do this:

> aggregate(log(sbp) ~ sex + saltadd, data = salt_ht, mean)
     sex saltadd log(sbp)
1 female      no 4.862900
2   male      no 4.921043
3 female     yes 5.067426
4   male     yes 5.052278

Finally, we can process multiple variables at the same time by specifying them within a call to the function cbind, which stands for column-bind, and combines the columns we are interested in into one single object suitable for the left hand side of a formula.

> aggregate(cbind(sbp, dbp) ~ sex + saltadd, data = salt_ht, mean)
     sex saltadd      sbp       dbp
1 female      no 133.7692  86.30769
2   male      no 139.4167  90.33333
3 female     yes 164.4643 106.17857
4   male     yes 160.2667 102.40000

6.3 Using your own functions

Defining a function What if we want to calculate the groupwise means and standard deviations at the same time, instead of running aggregate twice, as above? Then we need a function that takes as argument a numerical vector, and returns a vector of length two, with the mean as first element and the standard deviation as second element. Among the 19,000+ add-on packages on CRAN, we are virtually certain to find such a function, but finding it will not be easy - indeed, it will be much simpler to write our own function for this.

Functions work a bit like recipes: ingredients go in, some processing takes place, and a delicious meal is returned. Let’s start with a trivial example, where we write a wrapper for the mean function. Instead of calling mean directly, we define a function with one argument x, which performs one action, namely calculating the mean value of x, and which returns exactly one value, namely that mean value, as result:

> aggregate(sbp ~ sex, data = salt_ht, function(x) mean(x))
     sex      sbp
1 female 157.8182
2   male 150.0444

Here, function tells R that we are about to define our own function, the x in parentheses tells R that we expect exactly one argument, which we decide to call x, and the rest are the actual instructions for what to do with the argument. Here, we just pass it on to the usual mean-function. By default, the function will return the value of the last command executed - as we only have one command, the call to mean, this is what we get back. And indeed, this works as intended.

Now that we have something simple that works, we can extend it to do something more useful: we have the same call to aggregate, starting with the same function definition, and the same argument x, but now we calculate both the mean and the standard deviation of x; and we know already how to combine two unrelated numbers into a vector, namely by using the function c():

> aggregate(sbp ~ sex, data = salt_ht, function(x) c(mean(x), sd(x)))
     sex     sbp.1     sbp.2
1 female 157.81818  43.24482
2   male 150.04444  33.81632

And compared to the results above, this works exactly as intended. The only thing that is less than nice is the names of the summary columns.

We can take this a step further and try to fix the name issue, simply by adding name definitions in our call to c(): we call the first element Mean and the second StdDev. Even though we have not discussed this yet, this is a valid way of calling the c-function, and the result is still a vector of length two, but now with names19.
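As a quick standalone illustration of such a named vector (toy numbers, unrelated to the salt data):

```r
v <- c(Mean = 10, StdDev = 2)  # c() with named arguments yields a named vector
names(v)    # "Mean" "StdDev"
v["Mean"]   # elements can now be picked out by name
```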

> aggregate(sbp ~ sex, data = salt_ht, function(x) c(Mean = mean(x), StdDev = sd(x)))
     sex  sbp.Mean sbp.StdDev
1 female 157.81818   43.24482
2   male 150.04444   33.81632

Note that this stepwise refinement of our initial, very simple and not very useful function is a nice example of the rapid prototyping / fast iteration approach that works well with R: build something simple that works, and refine it step by step until you reach your goal.

Creating a function object This new function that we have defined is actually not completely useless, and we may want to use it in other settings. We could of course re-type the same function definition as above whenever we need it, but that is hardly useful. Fortunately, we can save this function definition as an object (because R is not just functional, but also object-oriented):

> MeanSd <- function(x) c(Mean = mean(x), StdDev = sd(x))

We have now defined our first own function object. If we just type the name, R will display the definition (value) of the function:

> MeanSd
function(x) c(Mean = mean(x), StdDev = sd(x))

This works exactly as before:

> aggregate(sbp ~ sex, data = salt_ht, MeanSd)
     sex  sbp.Mean sbp.StdDev
1 female 157.81818   43.24482
2   male 150.04444   33.81632

However, this function now works everywhere20, even outside of aggregate:

> MeanSd(salt_ht$sbp)
     Mean    StdDev 
154.32000  39.28628 
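Since a data frame is internally a list of its columns, the same function can, for instance, be applied to several columns at once via sapply; a small sketch on made-up numbers:

```r
MeanSd <- function(x) c(Mean = mean(x), StdDev = sd(x))
# toy data frame; sapply runs MeanSd once per column and binds the results
d <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
sapply(d, MeanSd)  # a 2 x 2 matrix: rows Mean/StdDev, columns a/b
```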

6.4 Split - Apply - Combine

Background aggregate works well for simple summary statistics like a mean or a median, and reasonably well for slightly more complex summary functions like MeanCI or our own MeanSd. However, for more complicated groupwise operations, this becomes awkward. Therefore, we will consider a more general mechanism that allows groupwise operations of any complexity, and which is well supported in a number of data processing languages. This is a three-step process:

  1. Split the data into the groups you are interested in.
  2. Apply whatever processing or modelling function you want to run to the separate data chunks you have created in the first step.
  3. Combine the results from applying the function to the different data chunks into an informative ensemble.

In base R, this can be implemented via a pair of functions, split and lapply, where

  • split takes as arguments a data frame and some grouping information, and returns a named list21 of data frames: each list element is a smaller data frame, with the same columns as the full data set, but only the rows corresponding to one group at a time, so that the list has as many elements (as many sub-data frames) as there are grouping levels.

  • lapply takes this list as first argument, together with the function we want to run on each sub-data frame as the second argument. lapply will run this for us, and collect the return values as a list of the same length as the input list.

You will notice that there is no extra function for the combination step here: as shown below, this can either be skipped, or implemented by a second call to lapply. With this approach, we can run any kind of per-group calculation, even if the output is complicated, like for regression models.
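The three steps can be sketched in a few lines on a made-up data frame; here the combine step is done explicitly with do.call(rbind, ...), one common base-R idiom for stacking per-group results (not the only option):

```r
# Toy data frame with a grouping variable g and a value v
d <- data.frame(g = c("a", "a", "b"), v = c(1, 3, 10))
parts <- split(d, d$g)                          # 1. split into sub-data frames
res   <- lapply(parts, function(x) mean(x$v))   # 2. apply a function per group
do.call(rbind, res)                             # 3. combine into one matrix
```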

There is also a variant of lapply called sapply, for simplified apply, which under some circumstances will be able to return not a list, but some nice rectangular object, as demonstrated in the example below.
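The difference between the two is easiest to see on a minimal example:

```r
x <- list(a = 1:3, b = 4:6)
lapply(x, sum)   # returns a list of length two
sapply(x, sum)   # simplifies to a named vector: a = 6, b = 15
```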

Example I: descriptives We start by splitting the salt data by levels of sex. The resulting object split_salt is a list with two elements, with names female and male, and each element is a data frame with same six columns as the original data:

> split_salt <- split(salt_ht, salt_ht$sex)
> str(split_salt)
List of 2
 $ female:'data.frame': 55 obs. of  6 variables:
  ..$ ID     : int [1:55] 6606 5758 7408 2160 8202 9571 4767 1024 2627 7707 ...
  ..$ sex    : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
  ..$ sbp    : int [1:55] 85 196 145 179 110 133 149 178 126 182 ...
  ..$ dbp    : int [1:55] 55 128 110 120 70 75 72 128 78 96 ...
  ..$ saltadd: Factor w/ 2 levels "no","yes": 1 NA 2 1 1 2 2 2 1 2 ...
  ..$ age    : int [1:55] 32 53 55 60 32 46 63 63 46 60 ...
 $ male  :'data.frame': 45 obs. of  6 variables:
  ..$ ID     : int [1:45] 4305 2265 8846 9605 4137 1598 9962 1006 7120 6888 ...
  ..$ sex    : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
  ..$ sbp    : int [1:45] 110 167 111 198 171 118 140 192 118 121 ...
  ..$ dbp    : int [1:45] 80 112 78 119 102 72 90 118 72 86 ...
  ..$ saltadd: Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 1 1 1 ...
  ..$ age    : int [1:45] 58 68 59 63 58 52 67 42 40 36 ...

We start with a simple summary:

> lapply(split_salt, summary)
$female
       ID           sex          sbp             dbp        saltadd        age       
 Min.   :1024   female:55   Min.   : 80.0   Min.   : 55.0   no  :13   Min.   :26.00  
 1st Qu.:2632   male  : 0   1st Qu.:121.5   1st Qu.: 80.0   yes :28   1st Qu.:37.50  
 Median :5758               Median :152.0   Median :102.0   NA's:14   Median :48.00  
 Mean   :5256               Mean   :157.8   Mean   :100.8             Mean   :47.69  
 3rd Qu.:7536               3rd Qu.:194.5   3rd Qu.:119.0             3rd Qu.:58.00  
 Max.   :9899               Max.   :238.0   Max.   :158.0             Max.   :71.00  

$male
       ID           sex          sbp           dbp         saltadd        age       
 Min.   :1006   female: 0   Min.   :110   Min.   : 68.00   no  :24   Min.   :29.00  
 1st Qu.:3210   male  :45   1st Qu.:121   1st Qu.: 82.00   yes :15   1st Qu.:41.00  
 Median :4592               Median :137   Median : 90.00   NA's: 6   Median :53.00  
 Mean   :5192               Mean   :150   Mean   : 95.71             Mean   :49.96  
 3rd Qu.:7120               3rd Qu.:171   3rd Qu.:110.00             3rd Qu.:58.00  
 Max.   :9962               Max.   :225   Max.   :150.00             Max.   :68.00  

As expected, this returns a list with two elements, each of them a summary output.

At first glance, this may look very similar to the manual splitting we have been doing previously using the subset-function. However, this is not the case: there, we had to keep track of each level of the grouping variable, and create each sub-data frame manually as a separate object. Here, this works the same whether the grouping variable has two or 18 levels, and the result is always one list object instead of two or 18 data frame objects, which can be processed with a single call to lapply.

This also works for a more complicated descriptive function like descr from package summarytools. We can even add extra arguments to lapply, like stats here, which will be passed on to descr:

> library(summarytools)
> lapply(split_salt, descr, stats = "common")

This creates a list of two summary tables, using the somewhat shorter output format corresponding to stats="common" (see ?descr for details).

We can even use this mechanism to implement something similar to aggregate by defining our own processing function: here, the argument x for the function is a data frame, the part of the original data corresponding to a single grouping level; we can therefore extract the variable sbp with the usual $-notation and pass this vector as argument to MeanCI. This returns a list with two confidence intervals, one for females, one for males:

> lapply(split_salt, function(x) MeanCI(x$sbp))
$female
    mean   lwr.ci   upr.ci 
157.8182 146.1275 169.5089 

$male
    mean   lwr.ci   upr.ci 
150.0444 139.8849 160.2040 

This is actually a situation where we can use the sapply function to improve the output: sapply understands that the outputs from the separate function calls are all vectors of the same length and combines them for us.

> sapply(split_salt, function(x) MeanCI(x$sbp))
         female     male
mean   157.8182 150.0444
lwr.ci 146.1275 139.8849
upr.ci 169.5089 160.2040

This is almost useful; if we combine this with the function t() (for transpose), which flips rows and columns of a matrix, we get something very similar to the original aggregate output:

> t(sapply(split_salt, function(x) MeanCI(x$sbp)))
           mean   lwr.ci   upr.ci
female 157.8182 146.1275 169.5089
male   150.0444 139.8849 160.2040

In this sense, split/sapply is a more general approach than aggregate, but for simpler use cases like the example here, aggregate is easier to use.

Example II: tests & models The split/sapply approach works especially well with functions that have a formula interface, as these functions generally have a data-argument to specify the data frame from which to take the variables in the formula. Consequently, it is very easy to run e.g. groupwise t-tests:

> lapply(split_salt, function(x) t.test(sbp ~ saltadd, data = x))
$female

    Welch Two Sample t-test

data:  sbp by saltadd
t = -2.4579, df = 27.399, p-value = 0.02058
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
 -56.301333  -5.088777
sample estimates:
 mean in group no mean in group yes 
         133.7692          164.4643 


$male

    Welch Two Sample t-test

data:  sbp by saltadd
t = -1.9155, df = 23.455, p-value = 0.0677
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
 -43.343346   1.643346
sample estimates:
 mean in group no mean in group yes 
         139.4167          160.2667 

We get a list with two elements, female and male, and each element is a t-test result (here with a conventionally statistically significant difference for females, but not for males). Note that test results like these are relatively complicated R objects, so sapply is not going to do us much good in this situation.
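While the full htest objects do not simplify, a single numeric component such as the p-value does. A sketch on simulated data (the variable names and values are made up, not the salt data):

```r
set.seed(1)                       # reproducible fake data
d <- data.frame(g    = rep(c("f", "m"), each = 20),
                salt = rep(c("no", "yes"), times = 20),
                sbp  = rnorm(40, mean = 150, sd = 30))
# one t-test per level of g, then pull out just the p-value component
tests <- lapply(split(d, d$g), function(x) t.test(sbp ~ salt, data = x))
sapply(tests, function(tt) tt$p.value)   # a named vector, one p-value per group
```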

This works exactly the same for regression: if we want to model systolic blood pressure as function of diastolic blood pressure for females and males separately, we use the same type of wrapping function where we pass in the data argument to function lm:

> split_lm <- lapply(split_salt, function(x) lm(sbp ~ dbp, data = x))
> split_lm
$female

Call:
lm(formula = sbp ~ dbp, data = x)

Coefficients:
(Intercept)          dbp  
      11.63         1.45  


$male

Call:
lm(formula = sbp ~ dbp, data = x)

Coefficients:
(Intercept)          dbp  
     -1.655        1.585  

The result is again a list of length two, and each element is the kind of very short summary we get for a linear regression object at the console, showing only the original function call and the estimated coefficients - note that the call looks the same for both elements, but the estimates are different, because x refers to different data for the two models.

Here it makes sense to run an extra post-processing step after the model fit(s). If we want to see a more informative summary of the model, including standard errors and p-values for the regression parameters and measures of model fit like \(R^2\), we can run summary on each model:

> lapply(split_lm, summary)
$female

Call:
lm(formula = sbp ~ dbp, data = x)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.271 -12.851  -6.396  12.591  63.441 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.6308    12.5337   0.928    0.358    
dbp           1.4503     0.1206  12.025   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.61 on 53 degrees of freedom
Multiple R-squared:  0.7318,    Adjusted R-squared:  0.7267 
F-statistic: 144.6 on 1 and 53 DF,  p-value: < 2.2e-16


$male

Call:
lm(formula = sbp ~ dbp, data = x)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.091 -10.973  -1.218  10.609  35.968 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.6551    12.8944  -0.128    0.898    
dbp           1.5850     0.1323  11.983 2.71e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.42 on 43 degrees of freedom
Multiple R-squared:  0.7695,    Adjusted R-squared:  0.7642 
F-statistic: 143.6 on 1 and 43 DF,  p-value: 2.707e-15

We see rather similar slopes for both models, both highly statistically significant, but rather different intercepts, which however are not statistically significantly different from zero.

Alternatively, if we are just interested in the slopes of the regression models, we can just extract these and use sapply for a more compact output:

> sapply(split_lm, function(x) coef(x)[2])
female.dbp   male.dbp 
  1.450272   1.584973 

6.5 Merging data sets

Background So far, we have had our data conveniently available as one large data frame, with all variables and subjects of interest collected in one object. However, this is rarely how data starts out: a Swedish register study will commonly have multiple different sources of data, combining information from e.g. the cancer register, patient register, and drug prescription register. Consequently, one of the first steps of data preparation is to combine all required information from the available sources.

Merging data will generally require a unique subject identifier that is shared across the different data sources: for Swedish registers, this will usually be the personal identification number (PID, Swedish personnummer), or some pseudonymized version thereof (usually a sequential identification number, or running number, Swedish löpnummer).22 From there, the process is straightforward: you start with your canonical list of participants (aka your study cohort), and you keep linking in information from other data sources one at a time until you have collected all required information in one table.

So this reduces the problem to merging two tables at a time. In base R, this can be done via the merge-function: this function is not especially fast, and has a somewhat special interface, but it works out of the box.

Example: adding columns The data for the salt-intake example was actually collected at a number of different health centers. This information is not part of the original data, but we happen to have an extra text file that contains the health centers, and we want to add this information to the existing data frame salt_ht.

We first read in the file as usual:

> centers <- read.table("Data/center_saltadd.txt", stringsAsFactors = TRUE, header = TRUE)
> summary(centers)
       ID       Center
 Min.   :1006   A:56  
 1st Qu.:2787   B:57  
 Median :5527   C:37  
 Mean   :5367         
 3rd Qu.:7754         
 Max.   :9962         

The new data frame centers has two variables, one which seems to be a unique identifier (ID) and one with the center information (Center, with levels A-C). However, it is also clear that the new data set has information on more subjects (150) than the 100 we have in the original data. A direct comparison also shows that these data frames are definitely not in the same order:

> centers[1:4, ]
    ID Center
1 1006      A
2 1086      A
3 1219      A
4 1265      A
> salt_ht[1:4, ]
    ID    sex sbp dbp saltadd age
1 4305   male 110  80     yes  58
2 6606 female  85  55      no  32
3 5758 female 196 128    <NA>  53
4 2265   male 167 112     yes  68

In a first step, we want to see how many individuals in the original salt data also have information on the health center. We can do this via the function setdiff, which takes as arguments two vectors, and returns the vector of all elements which are part of the first, but not the second vector:23

> setdiff(salt_ht$ID, centers$ID)
integer(0)
> setdiff(centers$ID, salt_ht$ID)
 [1] 1265 1458 1514 1678 2769 3388 3678 3966 5723 6358 6911 7076 7228 7689 8129 8396 8990 9113 9469
[20] 9470 9490 9943 9949 1014 1162 1533 1670 1700 1938 2621 3117 3597 3762 4346 5520 5764 5791 7915
[39] 8069 8179 8993 9226 9580 9754 2304 2523 5070 6210 8023 9358

We see that removing the identifiers in the center data from the identifiers in the original salt study returns a vector of length zero: this means there is no subject left in the original data frame for whom we do not have information on the center in the second data set - good! Flipping this, however, by removing the original salt data IDs from the center IDs, we get a non-zero vector of subject IDs for which we have center information, but none of the blood pressure or demographic variables. This means that we can augment our original data set by adding the Center variable to salt_ht. In database language, this is known as a left join, and we can use merge to do this:

> salt_ht_centers <- merge(x = salt_ht, y = centers, by = "ID", all.x = TRUE, all.y = FALSE)

We specify the original salt_ht as the first data set (argument x), and the center data set as the second (argument y); we tell merge which column holds the identifier (in this case column ID in both data frames); and then we decide what is kept and what is dropped: by setting all.x = TRUE, we state that we want all rows in x, that is salt_ht, to appear in the result. At the same time, we set all.y = FALSE, thereby stating that all rows of y (the center data) which have no match in x should be dropped from the result.
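The keep/drop logic is easy to verify on two tiny made-up tables, where only some IDs overlap:

```r
# Left join: every row of a survives, unmatched rows of b are dropped
a <- data.frame(ID = c(1, 2, 3), v = c(10, 20, 30))
b <- data.frame(ID = c(2, 3, 4), w = c("x", "y", "z"))
merge(a, b, by = "ID", all.x = TRUE, all.y = FALSE)
```

Here ID 1 survives with an NA filled in for w, while ID 4 from b does not appear in the result at all.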

For this combination of all.x and all.y, the result should be a data frame with the same number of rows as salt_ht, but extra columns from centers:

> summary(salt_ht_centers)
       ID           sex          sbp             dbp         saltadd        age        Center
 Min.   :1006   female:55   Min.   : 80.0   Min.   : 55.00   no  :37   Min.   :26.00   A:33  
 1st Qu.:2879   male  :45   1st Qu.:121.0   1st Qu.: 80.00   yes :43   1st Qu.:39.75   B:36  
 Median :5237               Median :148.5   Median : 96.00   NA's:20   Median :50.00   C:31  
 Mean   :5227               Mean   :154.3   Mean   : 98.51             Mean   :48.71         
 3rd Qu.:7309               3rd Qu.:184.0   3rd Qu.:116.25             3rd Qu.:58.00         
 Max.   :9962               Max.   :238.0   Max.   :158.00             Max.   :71.00         
> salt_ht_centers[1:10, ]
     ID    sex sbp dbp saltadd age Center
1  1006   male 192 118      no  42      A
2  1024 female 178 128     yes  63      C
3  1086 female 160 108     yes  47      A
4  1219 female 148 108     yes  55      A
5  1457   male 170  98      no  66      A
6  1511 female 149  94      no  48      B
7  1598   male 118  72      no  52      B
8  1635 female 115  73     yes  40      C
9  1640   male 171 105     yes  53      B
10 1684   male 184  98    <NA>  58      B

And this works: as we see here, we still have 55 women and 45 men, but we also have the extra variable Center, which breaks down roughly equally across A, B and C, with no missing values.

Example: adding rows and columns In contrast, we can request that all rows from both data sets are kept in the merged data frame, simply by also setting all.y = TRUE.

> salt_ht_centers_all <- merge(x = salt_ht, y = centers, by = "ID", all.x = TRUE, all.y = TRUE)
> summary(salt_ht_centers_all)
       ID           sex          sbp             dbp         saltadd        age        Center
 Min.   :1006   female:55   Min.   : 80.0   Min.   : 55.00   no  :37   Min.   :26.00   A:56  
 1st Qu.:2787   male  :45   1st Qu.:121.0   1st Qu.: 80.00   yes :43   1st Qu.:39.75   B:57  
 Median :5527   NA's  :50   Median :148.5   Median : 96.00   NA's:70   Median :50.00   C:37  
 Mean   :5367               Mean   :154.3   Mean   : 98.51             Mean   :48.71         
 3rd Qu.:7754               3rd Qu.:184.0   3rd Qu.:116.25             3rd Qu.:58.00         
 Max.   :9962               Max.   :238.0   Max.   :158.00             Max.   :71.00         
                            NA's   :50      NA's   :50                 NA's   :50            
> salt_ht_centers_all[1:10, ]
     ID    sex sbp dbp saltadd age Center
1  1006   male 192 118      no  42      A
2  1014   <NA>  NA  NA    <NA>  NA      B
3  1024 female 178 128     yes  63      C
4  1086 female 160 108     yes  47      A
5  1162   <NA>  NA  NA    <NA>  NA      B
6  1219 female 148 108     yes  55      A
7  1265   <NA>  NA  NA    <NA>  NA      A
8  1457   male 170  98      no  66      A
9  1458   <NA>  NA  NA    <NA>  NA      A
10 1511 female 149  94      no  48      B

Looking at the resulting data frame, we see the same columns (including Center) as before, but now also have missing values for all variables except ID and Center: these are precisely the 50 new missing values in the blood pressure and demographic variables for subjects who are part of centers, but not of salt_ht. This kind of merging is known as a full or outer join.
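On the same kind of tiny made-up tables, the full join looks like this; note that merge also accepts the shorthand all = TRUE for setting both all.x and all.y at once:

```r
# Full (outer) join: unmatched rows from both sides are kept, with NAs filled in
a <- data.frame(ID = c(1, 2), v = c(10, 20))
b <- data.frame(ID = c(2, 3), w = c("x", "y"))
merge(a, b, by = "ID", all = TRUE)
```

All three IDs appear in the result, with an NA in w for ID 1 and an NA in v for ID 3.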

6.6 Using pipelines

UNDER CONSTRUCTION

6.7 Next steps

For an alternative approach to data processing, group- and otherwise, see Section 7.


  19. This is called a named vector, and works quite similarly to the named lists we have already discussed: we can still use the basic indexing methods (by position, logical), but we can also use the names to reference elements of the vector. We have already seen examples of named vectors without knowing it, namely the result of calling summary for a basic vector.↩︎

  20. Everywhere the object MeanSd is defined, actually.↩︎

  21. As described in Section @ref(named_lists)↩︎

  22. Should your data set not include a unique identifier, one of your first steps should be to create a new variable that holds a unique identifier, e.g. as mydata$newid <- 1:nrow(mydata). And if anyone hands you multiple data sets without a shared unique identifier expecting you to merge them, you should complain loudly until they cough up one.↩︎

  23. setdiff stands for set difference, as in, remove all elements of the second vector from the first one.↩︎