6 Processing data
6.1 Overview
The previous section has already covered central aspects of data handling in base R: how to extract parts of a data frame, and how to change parts of a data frame, using either the bracket- or $-notation, or the helper functions subset and transform. However, we have not yet discussed a good solution for working with groups of data, or how to combine data from multiple sources, which is what will be covered in this section. This will also involve a short excursion into how to write your own functions in R, which is actually not hard and often very useful.
As example data, we will use the same data on blood pressure and salt consumption in 100 subjects already seen in Section 4:
> head(salt_ht)
ID sex sbp dbp saltadd age
1 4305 male 110 80 yes 58
2 6606 female 85 55 no 32
3 5758 female 196 128 <NA> 53
4 2265 male 167 112 yes 68
5 7408 female 145 110 yes 55
6 2160 female 179 120 no 60
As before, this section will only deal with methods from base R. For a demonstration of how to achieve the same tasks using the package that is part of the tidyverse, see Section 7.
6.2 Groupwise statistics
For groupwise descriptive statistics, base R has the function aggregate, which offers a convenient formula interface. If we want to calculate e.g. the average systolic blood pressure in our example data for males and for females separately, we can do this:
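The call pairs the formula with the data frame and the summary function; the output reproduces the groupwise means discussed below:

```r
> aggregate(sbp ~ sex, data = salt_ht, mean)
     sex      sbp
1 female 157.8182
2   male 150.0444
```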
The variable to be aggregated (systolic blood pressure) is specified on the left hand side of the formula, and the grouping variable (sex) on the right hand side; the second argument is the data frame from which we draw these variables, i.e. salt_ht; and the third and final argument is the function we want to use to calculate a groupwise summary, here just the usual mean-function.
The return value from the aggregate-function is a data frame with as many rows as groups in the data (so here two), which shows the grouping levels (male and female) and the summary statistic (an average of ca. 158 for women and 150 for men). Note that this is hugely more compact and elegant than our standard approach so far, which was defining two separate subsets for males and females and calculating the means on these.
If we want to calculate instead the standard deviation in each group, we just swap out the summary function in the call to aggregate, and replace mean with sd:
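For example:

```r
> aggregate(sbp ~ sex, data = salt_ht, sd)
     sex      sbp
1 female 43.24482
2   male 33.81632
```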
We see that women also have greater variability in their blood pressures than men.
This basic approach can be extended in several ways. To start with, we can use a more complicated summary function like MeanCI from package DescTools to calculate separate means and confidence intervals:
> library(DescTools)
> aggregate(sbp ~ sex, data = salt_ht, MeanCI)
sex sbp.mean sbp.lwr.ci sbp.upr.ci
1 female 157.8182 146.1275 169.5089
2 male 150.0444 139.8849 160.2040
MeanCI returns not just a single number like mean or sd, but a vector of length three, and consequently, the return value from aggregate has three separate columns for the summary statistics.
In addition, we can also specify multiple grouping variables at the same time, simply by listing them on the right hand side of the formula, separated by a plus sign. We then get summaries for all combinations of the specified grouping variables in the data set. Here, we get four groups for all combinations of male/female and salt-adder/non-salt-adder:
> aggregate(sbp ~ sex + saltadd, data = salt_ht, mean)
sex saltadd sbp
1 female no 133.7692
2 male no 139.4167
3 female yes 164.4643
4 male yes 160.2667
A point to remember is that aggregate will drop all rows of the data frame where any of the grouping variables are missing - you may remember from the initial example script that there are ca. 20% missing values for the saltadd variable, but this is not reflected in the results here. You can get around this by making the NAs a separate category of the saltadd variable in a pre-processing step, using the factor-function, but that is entirely your responsibility.
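One way to sketch this pre-processing step uses base R's addNA, which turns NA into an explicit factor level; the new column name saltadd_na is my choice for illustration:

```r
> salt_ht$saltadd_na <- addNA(salt_ht$saltadd)
> aggregate(sbp ~ sex + saltadd_na, data = salt_ht, mean)
```

The result then contains extra rows for the <NA> level, so the subjects with missing saltadd are no longer dropped.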
On top of this, we can also do on-the-fly transformations of the outcome variable on the left hand side as part of the formula. If we wanted for some reason to calculate the groupwise averages of the logarithmised blood pressures, we could do this:
> aggregate(log(sbp) ~ sex + saltadd, data = salt_ht, mean)
sex saltadd log(sbp)
1 female no 4.862900
2 male no 4.921043
3 female yes 5.067426
4 male yes 5.052278
Finally, we can process multiple variables at the same time by specifying them within a call to the function cbind, which stands for column-bind, and combines the columns we are interested in into one single object suitable for the left hand side of a formula.
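A sketch of such a call, combining the two blood pressure variables (output omitted; it gains one summary column per variable):

```r
> aggregate(cbind(sbp, dbp) ~ sex, data = salt_ht, mean)
```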
6.3 Using your own functions
Defining a function What if we want to calculate the groupwise means and standard deviations at the same time, instead of running aggregate twice, as above? Then we need a function that takes as argument a numerical vector, and returns a vector of length two, with the mean as first element and the standard deviation as second element. Among the 19,000+ add-on packages on CRAN, we are virtually certain to find such a function, but finding it will not be easy - indeed, it will be much simpler to write our own function for this.
Functions work a bit like recipes: ingredients go in, some processing takes place, and a delicious meal is returned. Let’s start with a trivial example, where we write a wrapper for the mean function. Instead of calling mean directly, we define a function with one argument x, which performs one action, namely calculating the mean value of x, and which returns exactly one value, namely that mean value, as result:
> aggregate(sbp ~ sex, data = salt_ht, function(x) mean(x))
sex sbp
1 female 157.8182
2 male 150.0444
Here, function tells R that we are about to define our own function, the x in parentheses tells R that we expect exactly one argument, which we decide to call x, and the rest is the actual instruction for what to do with the argument. Here, we just pass it on to the usual mean-function. By default, the function will return the value of the last command executed - as we only have one command, the call to mean, this is what we get back. And indeed, this works as intended.
Now that we have something simple that works, we can extend it to do something more useful: we have the same call to aggregate, starting with the same function definition, and the same argument x, but now we calculate both the mean and the standard deviation of x; and we know already how to combine two unrelated numbers into a vector, namely by using the function c():
> aggregate(sbp ~ sex, data = salt_ht, function(x) c(mean(x), sd(x)))
sex sbp.1 sbp.2
1 female 157.81818 43.24482
2 male 150.04444 33.81632
And compared to the results above, this works exactly as intended. The only thing that is less than nice is the names of the summary columns.
We can take this a step further and try to fix the name issue, simply by adding name definitions in our call to c(): we call the first element Mean and the second StdDev. Even though we have not discussed this yet, this is a valid way of calling the c-function, and the result is still a vector of length two, but now with names19.
> aggregate(sbp ~ sex, data = salt_ht, function(x) c(Mean = mean(x), StdDev = sd(x)))
sex sbp.Mean sbp.StdDev
1 female 157.81818 43.24482
2 male 150.04444 33.81632
Note that this stepwise refinement of our initial, very simple and not very useful function is a nice example for the rapid prototyping / fast iteration approach that works well with R: build something simple that works, and refine it step by step until you reach your goal.
Creating a function object This new function that we have defined is actually not completely useless, and we may want to use it in other settings. We could of course re-type the same function definition as above whenever we need it, but that is hardly useful. Fortunately, we can save this function definition as an object (because R is not just functional, but also object-oriented):
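A definition along these lines, reusing the named mean/standard deviation combination from above (the name MeanSd is used in the calls that follow):

```r
> MeanSd <- function(x) c(Mean = mean(x), StdDev = sd(x))
```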
We have now defined our first own function object. If we just type the name, R will display the definition (value) of the function:
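At the console:

```r
> MeanSd
function(x) c(Mean = mean(x), StdDev = sd(x))
```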
This works exactly as before:
> aggregate(sbp ~ sex, data = salt_ht, MeanSd)
sex sbp.Mean sbp.StdDev
1 female 157.81818 43.24482
2 male 150.04444 33.81632
However, this function now works everywhere20, even outside of aggregate:
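For instance, applied directly to a column of the data frame:

```r
> MeanSd(salt_ht$sbp)
```

This returns a named vector of length two, with elements Mean and StdDev, for all 100 subjects at once.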
6.4 Split - Apply - Combine
Background aggregate works well for simple summary statistics like a mean or a median, and reasonably well for slightly more complex summary functions like MeanCI or our own MeanSd. However, for more complicated groupwise operations, this becomes awkward. Therefore, we will consider a more general mechanism that allows groupwise operations of any complexity, and which is well supported in a number of data processing languages. This is a three step process:
- Split the data into the groups you are interested in.
- Apply whatever processing or modelling function you want to run to the separate data chunks you have created in the first step.
- Combine the results from applying the function to the different data chunks into an informative ensemble.
In base R, this can be implemented via a pair of functions, split and lapply, where
split takes as arguments a data frame and some grouping information, and returns a named list21 of data frames: each list element is a smaller data frame, with the same columns as the full data set, but only the rows corresponding to one group at a time, so that the list has as many elements (as many sub-data frames) as there are grouping levels.
lapply takes this list as first argument, together with the function we want to run on each sub-data frame as the second argument. lapply will run this for us, and collect the return values as a list of the same length as the input list.
You will notice that there is no extra function for the combination step here: as shown below, this can either be skipped, or implemented by a second call to lapply. With this approach, we can run any kind of per-group calculation, even if the output is complicated, like for regression models.
There is also a variant of lapply called sapply, for simplified apply, which under some circumstances will be able to return not a list, but some nice rectangular object, as demonstrated in the example below.
Example I: descriptives We start by splitting the salt data by levels of sex. The resulting object split_salt is a list with two elements, with names female and male, and each element is a data frame with same six columns as the original data:
> split_salt <- split(salt_ht, salt_ht$sex)
> str(split_salt)
List of 2
$ female:'data.frame': 55 obs. of 6 variables:
..$ ID : int [1:55] 6606 5758 7408 2160 8202 9571 4767 1024 2627 7707 ...
..$ sex : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
..$ sbp : int [1:55] 85 196 145 179 110 133 149 178 126 182 ...
..$ dbp : int [1:55] 55 128 110 120 70 75 72 128 78 96 ...
..$ saltadd: Factor w/ 2 levels "no","yes": 1 NA 2 1 1 2 2 2 1 2 ...
..$ age : int [1:55] 32 53 55 60 32 46 63 63 46 60 ...
$ male :'data.frame': 45 obs. of 6 variables:
..$ ID : int [1:45] 4305 2265 8846 9605 4137 1598 9962 1006 7120 6888 ...
..$ sex : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
..$ sbp : int [1:45] 110 167 111 198 171 118 140 192 118 121 ...
..$ dbp : int [1:45] 80 112 78 119 102 72 90 118 72 86 ...
..$ saltadd: Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 1 1 1 ...
..$ age : int [1:45] 58 68 59 63 58 52 67 42 40 36 ...
We start with a simple summary:
> lapply(split_salt, summary)
$female
ID sex sbp dbp saltadd age
Min. :1024 female:55 Min. : 80.0 Min. : 55.0 no :13 Min. :26.00
1st Qu.:2632 male : 0 1st Qu.:121.5 1st Qu.: 80.0 yes :28 1st Qu.:37.50
Median :5758 Median :152.0 Median :102.0 NA's:14 Median :48.00
Mean :5256 Mean :157.8 Mean :100.8 Mean :47.69
3rd Qu.:7536 3rd Qu.:194.5 3rd Qu.:119.0 3rd Qu.:58.00
Max. :9899 Max. :238.0 Max. :158.0 Max. :71.00
$male
ID sex sbp dbp saltadd age
Min. :1006 female: 0 Min. :110 Min. : 68.00 no :24 Min. :29.00
1st Qu.:3210 male :45 1st Qu.:121 1st Qu.: 82.00 yes :15 1st Qu.:41.00
Median :4592 Median :137 Median : 90.00 NA's: 6 Median :53.00
Mean :5192 Mean :150 Mean : 95.71 Mean :49.96
3rd Qu.:7120 3rd Qu.:171 3rd Qu.:110.00 3rd Qu.:58.00
Max. :9962 Max. :225 Max. :150.00 Max. :68.00
As expected, this returns a list with two elements, each of them a summary output.
At first glance, this may look very similar to the manual splitting we have been doing previously using the subset-function. However, this is not the case: there, we had to keep track of each level of the grouping variable and create each sub-data frame manually as a separate object. Here, this works the same whether the grouping variable has two or 18 levels, and the result is always one list object instead of two or 18 data frame objects, which can be processed with a single call to lapply.
This also works for a more complicated descriptive function like descr from package summarytools. We can even add extra arguments to lapply, like stats here, which will be passed on to descr:
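Assuming the package is loaded, the call might look like this (output omitted for space):

```r
> library(summarytools)
> lapply(split_salt, descr, stats = "common")
```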
This creates a list of two summary tables, using the somewhat shorter output format corresponding to stats = "common" (see ?descr for details).
We can even use this mechanism to implement something similar to aggregate by defining our own processing function: here, the argument x for the function is a data frame, the part of the original data corresponding to a single grouping level; we can therefore extract the variable sbp with the usual $-notation and pass this vector as argument to MeanCI. This returns a list with two confidence intervals, one for females, one for males:
> lapply(split_salt, function(x) MeanCI(x$sbp))
$female
mean lwr.ci upr.ci
157.8182 146.1275 169.5089
$male
mean lwr.ci upr.ci
150.0444 139.8849 160.2040
This is actually a situation where we can use the sapply function to improve the output: sapply understands that the outputs from the separate function calls are all vectors of the same length and combines them for us.
> sapply(split_salt, function(x) MeanCI(x$sbp))
female male
mean 157.8182 150.0444
lwr.ci 146.1275 139.8849
upr.ci 169.5089 160.2040
This is almost useful; if we combine this with the function t() (for transpose), which flips rows and columns of a matrix, we get something very similar to the original aggregate output:
> t(sapply(split_salt, function(x) MeanCI(x$sbp)))
mean lwr.ci upr.ci
female 157.8182 146.1275 169.5089
male 150.0444 139.8849 160.2040
In this sense, split/sapply is the more general approach than aggregate, but for simpler use cases like this example here, aggregate is easier to use.
Example II: tests & models The split/sapply approach works especially well with functions that have a formula interface, as these functions generally have a data-argument to specify the data frame from which to take the variables in the formula. Consequently, it is very easy to run e.g. groupwise t-tests:
> lapply(split_salt, function(x) t.test(sbp ~ saltadd, data = x))
$female
Welch Two Sample t-test
data: sbp by saltadd
t = -2.4579, df = 27.399, p-value = 0.02058
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
-56.301333 -5.088777
sample estimates:
mean in group no mean in group yes
133.7692 164.4643
$male
Welch Two Sample t-test
data: sbp by saltadd
t = -1.9155, df = 23.455, p-value = 0.0677
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
-43.343346 1.643346
sample estimates:
mean in group no mean in group yes
139.4167 160.2667
We get a list with two elements, female and male, and each element is a t-test result (here with a conventionally statistically significant difference for females, but not for males). Note that test results like these are relatively complicated R objects, so sapply is not going to do us much good in this situation.
This works exactly the same for regression: if we want to model systolic blood pressure as function of diastolic blood pressure for females and males separately, we use the same type of wrapping function where we pass in the data argument to function lm:
> split_lm <- lapply(split_salt, function(x) lm(sbp ~ dbp, data = x))
> split_lm
$female
Call:
lm(formula = sbp ~ dbp, data = x)
Coefficients:
(Intercept) dbp
11.63 1.45
$male
Call:
lm(formula = sbp ~ dbp, data = x)
Coefficients:
(Intercept) dbp
-1.655 1.585
The result is again a list of length two, and each element is the kind of very short summary we get for a linear regression object at the console, showing only the original function call and the estimated coefficients - note that the call looks the same for both elements, but the estimates are different, because x refers to different data for the two models.
Here it makes sense to run an extra post-processing step after the model fit(s). If we want to see a more informative summary of the model, including standard errors and p-values for the regression parameters and measures of model fit like \(R^2\), we can run summary on each model:
> lapply(split_lm, summary)
$female
Call:
lm(formula = sbp ~ dbp, data = x)
Residuals:
Min 1Q Median 3Q Max
-42.271 -12.851 -6.396 12.591 63.441
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.6308 12.5337 0.928 0.358
dbp 1.4503 0.1206 12.025 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.61 on 53 degrees of freedom
Multiple R-squared: 0.7318, Adjusted R-squared: 0.7267
F-statistic: 144.6 on 1 and 53 DF, p-value: < 2.2e-16
$male
Call:
lm(formula = sbp ~ dbp, data = x)
Residuals:
Min 1Q Median 3Q Max
-34.091 -10.973 -1.218 10.609 35.968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.6551 12.8944 -0.128 0.898
dbp 1.5850 0.1323 11.983 2.71e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.42 on 43 degrees of freedom
Multiple R-squared: 0.7695, Adjusted R-squared: 0.7642
F-statistic: 143.6 on 1 and 43 DF, p-value: 2.707e-15
We see rather similar slopes for both models, both highly statistically significant, but rather different intercepts, which however are not statistically significantly different from zero.
Alternatively, if we are just interested in the slopes of the regression models, we can just extract these and use sapply for a more compact output:
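A sketch using coef to pull out the dbp coefficient from each model:

```r
> sapply(split_lm, function(x) coef(x)["dbp"])
```

This yields a named vector with the two slopes of ca. 1.45 (females) and 1.59 (males).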
6.5 Merging data sets
Background So far, we have had our data conveniently available as one large data frame, with all variables and subjects of interest collected in one object. However, this is rarely how data starts out: a Swedish register study will commonly have multiple different sources of data, combining information from e.g. the cancer register, patient register, and drug prescription register. Consequently, one of the first steps of data preparation is to combine all required information from the available sources.
Merging data will generally require a unique subject identifier that is shared across the different data sources: for Swedish registers, this will usually be the personal identification number (PID, Swedish personnummer), or some pseudonymized version thereof (usually a sequential identification number, or running number, Swedish löpnummer).22 From there, the process is straightforward: you start with your canonical list of participants (aka your study cohort), and you keep linking in information from other data sources one at a time until you have collected all required information in one table.
So this reduces the problem to merging two tables at a time. In base R, this can be done via the merge-function: this function is not especially fast, and has a somewhat special interface, but it works out of the box.
Example: adding columns The data for the salt-intake example was actually collected at a number of different health centers. This information is not part of the original data, but we happen to have an extra text file that contains the health centers, and we want to add this information to the existing data frame salt_ht.
We first read in the file as usual:
> centers <- read.table("Data/center_saltadd.txt", stringsAsFactors = TRUE, header = TRUE)
> summary(centers)
ID Center
Min. :1006 A:56
1st Qu.:2787 B:57
Median :5527 C:37
Mean :5367
3rd Qu.:7754
Max. :9962
The new data frame centers has two variables, one which seems to be a unique identifier (ID) and one with the center information (Center, with levels A-C). However, it is also clear that the new data set has information on more subjects (150 instead of 100) than we have in the original data. A direct comparison also shows that these data frames are definitely not in the same order:
> centers[1:4, ]
ID Center
1 1006 A
2 1086 A
3 1219 A
4 1265 A
> salt_ht[1:4, ]
ID sex sbp dbp saltadd age
1 4305 male 110 80 yes 58
2 6606 female 85 55 no 32
3 5758 female 196 128 <NA> 53
4 2265 male 167 112 yes 68
In a first step, we want to see how many individuals in the original salt data also have information on the health center. We can do this via the function setdiff, which takes as arguments two vectors, and returns the vector of all elements which are part of the first, but not the second vector:23
> setdiff(salt_ht$ID, centers$ID)
integer(0)
> setdiff(centers$ID, salt_ht$ID)
[1] 1265 1458 1514 1678 2769 3388 3678 3966 5723 6358 6911 7076 7228 7689 8129 8396 8990 9113 9469
[20] 9470 9490 9943 9949 1014 1162 1533 1670 1700 1938 2621 3117 3597 3762 4346 5520 5764 5791 7915
[39] 8069 8179 8993 9226 9580 9754 2304 2523 5070 6210 8023 9358
We see that removing the identifiers in the center data from the identifiers in the original salt study returns a vector of length zero: this means there is no subject left from the original data frame for whom we do not have information on the center in the second data set - good! Flipping this, however, by removing the original salt data IDs from the center IDs, we get a non-zero vector of subject IDs for which we have center information, but none of the blood pressure- or demographic variables. This means that we can augment our original data set by adding the Center variable to salt_ht. In database language, this is known as a left join, and we can use merge to do this:
We specify the original salt_ht as the first data set (argument x), and the center data set as the second (argument y); we tell merge which column holds the identifier (in this case column ID in both data frames); and then we decide what is kept and what is dropped: by setting all.x = TRUE, we state that we want all rows in x, that is salt_ht, to appear in the result. At the same time, we set all.y = FALSE, thereby stating that all rows of y (the center data) which have no match in x should be dropped from the result.
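Spelled out, the call looks like this (the object name salt_ht_centers matches the summary below):

```r
> salt_ht_centers <- merge(x = salt_ht, y = centers, by = "ID",
+                          all.x = TRUE, all.y = FALSE)
```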
For this combination of all.x and all.y, the result should be a data frame with the same number of rows as salt_ht, but extra columns from centers:
> summary(salt_ht_centers)
ID sex sbp dbp saltadd age Center
Min. :1006 female:55 Min. : 80.0 Min. : 55.00 no :37 Min. :26.00 A:33
1st Qu.:2879 male :45 1st Qu.:121.0 1st Qu.: 80.00 yes :43 1st Qu.:39.75 B:36
Median :5237 Median :148.5 Median : 96.00 NA's:20 Median :50.00 C:31
Mean :5227 Mean :154.3 Mean : 98.51 Mean :48.71
3rd Qu.:7309 3rd Qu.:184.0 3rd Qu.:116.25 3rd Qu.:58.00
Max. :9962 Max. :238.0 Max. :158.00 Max. :71.00
> salt_ht_centers[1:10, ]
ID sex sbp dbp saltadd age Center
1 1006 male 192 118 no 42 A
2 1024 female 178 128 yes 63 C
3 1086 female 160 108 yes 47 A
4 1219 female 148 108 yes 55 A
5 1457 male 170 98 no 66 A
6 1511 female 149 94 no 48 B
7 1598 male 118 72 no 52 B
8 1635 female 115 73 yes 40 C
9 1640 male 171 105 yes 53 B
10 1684 male 184 98 <NA> 58 B
And this works: as we see here, we still have 55 women and 45 men, but we also have the extra variable Center, which breaks down roughly equally across A, B and C, with no missing values.
Example: adding rows and columns In contrast, we can request that all rows from both data sets are kept in the merged data frame, simply by also setting all.y = TRUE.
> salt_ht_centers_all <- merge(x = salt_ht, y = centers, by = "ID", all.x = TRUE, all.y = TRUE)
> summary(salt_ht_centers_all)
ID sex sbp dbp saltadd age Center
Min. :1006 female:55 Min. : 80.0 Min. : 55.00 no :37 Min. :26.00 A:56
1st Qu.:2787 male :45 1st Qu.:121.0 1st Qu.: 80.00 yes :43 1st Qu.:39.75 B:57
Median :5527 NA's :50 Median :148.5 Median : 96.00 NA's:70 Median :50.00 C:37
Mean :5367 Mean :154.3 Mean : 98.51 Mean :48.71
3rd Qu.:7754 3rd Qu.:184.0 3rd Qu.:116.25 3rd Qu.:58.00
Max. :9962 Max. :238.0 Max. :158.00 Max. :71.00
NA's :50 NA's :50 NA's :50
> salt_ht_centers_all[1:10, ]
ID sex sbp dbp saltadd age Center
1 1006 male 192 118 no 42 A
2 1014 <NA> NA NA <NA> NA B
3 1024 female 178 128 yes 63 C
4 1086 female 160 108 yes 47 A
5 1162 <NA> NA NA <NA> NA B
6 1219 female 148 108 yes 55 A
7 1265 <NA> NA NA <NA> NA A
8 1457 male 170 98 no 66 A
9 1458 <NA> NA NA <NA> NA A
10 1511 female 149 94 no 48 B
Looking at the resulting data frame, we see the same columns (including Center) as before, but now also have missing values for all variables except ID and Center: these are precisely the 50 new missing values in the blood pressure- and demographic variables for subjects who are part of centers, but not of salt_ht. This kind of merging is known as a full or outer join.
6.7 Next steps
For an alternative approach to data processing, group- and otherwise, see Section 7.
This is called a named vector, and works quite similarly to the named lists we have already discussed: we can still use the basic indexing methods (by position, logical), but we can also use the names to reference elements of the vector. We have already seen examples of named vectors without knowing it, namely the result of calling summary for a basic vector. FIXME: example?↩︎
Everywhere the object MeanSd is defined, actually.↩︎
As described in Section @ref(named_lists)↩︎
Should your data set not include a unique identifier, one of your first steps should be to create a new variable that holds a unique identifier, e.g. as mydata$newid <- 1:nrow(mydata). And if anyone hands you multiple data sets without a shared unique identifier expecting you to merge them, you should complain loudly until they cough up one.↩︎
setdiff stands for set difference, as in: remove all elements of the second vector from the first one.↩︎