5 Data types and structures
5.1 Overview
This document assumes that you can interact with R/RStudio in a basic manner (start the program, load data, perform simple analysis, quit safely) and that you have a working knowledge of basic data types (numerical, character, factors) and data structures (vectors and data frames). Having gone through the accompanying Starting in R document should provide the necessary context.
The goal of this document is to inform you about:
- vectors as basic data structures in R,
- calculations on vectors,
- extracting and modifying parts of a vector by position (indexing),
- the logical data type for storing true/false data in R,
- logical operators and functions that return true/false values,
- the use of logical expressions to extract and modify parts of a vector (logical indexing),
- indexing for rectangular data structures like data frames,
- lists as a general all-purpose data structure in R,
- names in data frames and lists.
After going through this document and working through the examples in it, you should be able to
- extract parts of a vector or data frame by position,
- build and evaluate logical conditions in your data,
- extract parts of a vector or data frame based on logical conditions,
- modify parts of an existing vector or data frame.
5.1.1 Data examples
Throughout, we will use two standard examples for demonstrating operations on vectors and data frames, respectively. For vectors, we have a mini data set of five subjects with body weights in kg before and after a diet:
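For concreteness, the two weight vectors can be set up as follows (the values are the same as those in the diet data frame constructed later in this document):
> before <- c(83, 102, 57, 72, 68)
> after <- c(81, 94, 59, 71, 62)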
For a data frame, we use the data on blood pressure and salt consumption in 100 subjects from the introduction, the top five rows of which are shown below: FIXME reference
5.2 Background
5.2.1 Recap
In the introduction, we have seen how statistical data can be combined into aggregated structures for manipulation, display and analysis. The two structures we have discussed were linear vectors and rectangular data frames. As originally mentioned, we still want to be able to extract, inspect, display and process smaller parts of the combined data, and we have very briefly looked at how to do this for data frames using the $-notation and the subset function.
5.2.2 Motivation
Data extraction and manipulation and their technical details may not be the most exciting subjects, but they are essential for any practical statistical work. They are crucial for transforming raw original data into clean analysis data; they are an integral part of descriptive statistics; and they will pop up naturally when doing subgroup or sensitivity analyses.
Additionally, the concepts discussed in this document (vectors, data frames, lists, indexing) are central for how R works. Understanding them at a somewhat more technical level makes it possible to read, understand and modify existing code for one’s own analysis, and provides context for extension methods (like the tidyverse) that build upon it. FIXME: reference
5.3 More about vectors
5.3.1 Vector calculations
As discussed, a vector is a simple linear arrangement of data items of the same type (e.g. all numerical or all character). It is also an extremely fundamental data type in R, both technically and conceptually: individual numbers or character strings are really just vectors of length one, rather than some different type. This is exactly why R displays a single number with the leading [1], just as it does for vectors:
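> 83
[1] 83
> before
[1]  83 102  57  72  68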
Consequently, many basic operations in R and many of the better functions in add-on packages are vector-oriented, i.e. they work for vectors in the same way as for individual data items, simply by acting component-wise. So in order to calculate the change in weight in our five individuals from before to after the diet, we can simply subtract the two vectors:
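> before - after
[1]  2  8 -2  1  6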
Note how the subtraction is performed for each matching pair of weights from the two operand vectors, and the resulting collection of differences is returned as a vector of the same length. The same also works for multiplication and division, so if we want to express the weight after the diet as a proportion of the weight before the diet, we can simply write:
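> after/before
[1] 0.9759036 0.9215686 1.0350877 0.9861111 0.9117647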
We can extend this calculation to the percentage in an intuitive manner:
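> round(100 * after/before)
[1]  98  92 104  99  91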
Clearly, the function round, which rounds a real number to the closest integer, is also vector-oriented, in that it can work component-wise on a vector of numbers and return a vector of results. You may be a bit puzzled by the role of the multiplier 100: while we can interpret this as a vector of length one, as per above, how is it combined with the vector after/before, which is of length five? The answer is that when two vectors of different lengths are combined in an operation, the shorter one is repeated as often as necessary to create two vectors of the same length, chopping off any extra parts (i.e. the last repetition may be partial). This is not a problem if the shorter vector has length one: it is just replaced by as many copies of itself as necessary. In our case, the operation above is the same as
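> round(c(100, 100, 100, 100, 100) * after/before)
[1]  98  92 104  99  91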
which makes perfect sense. If the shorter vector is not of length one, this is more often than not unintended (i.e. an error), and R will generate a warning whenever information is chopped off6.
Exercises:
- Using the conversion factor 1 kg = 2.205 lbs, convert the dieters’ weights to pounds.
- Use vector arithmetic to calculate the variance of a vector of numbers. Hint: use the functions mean, sum and length, and the exponentiation operator ^.
5.3.2 Indexing vectors by position
By construction, the easiest way to specify part of a vector is by position: data items are lined up one after the other, from first to last, so each datum has a specific position, or number, or address, starting with one (unlike our friends from computer science, we count starting with one). In R, this is specified via square brackets []; so if we want to extract the weight after diet of the second subject, we just write
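> after[2]
[1] 94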
Calling the bracket-expression an index just highlights the connection to the mathematical idea of a vector as an \(n\)-tuple of numbers \((x_1, \ldots, x_n)\). In other words, x[i] is just the R expression for the dreaded \(x_i\) so beloved by teachers of statistics courses.
As it turns out, the extraction function implemented by the brackets is itself vector-oriented, in the sense explained above. This means that we can specify a whole vector of index positions, e.g. to extract the weights before diet for the three first subjects in the data:
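> before[c(1, 2, 3)]
[1]  83 102  57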
A useful shorthand for writing an index vector in this context is the :-operator, which generates a vector of consecutive integers7, as in
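> 1:3
[1] 1 2 3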
which we can use to achieve the same extraction with much less typing:
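> before[1:3]
[1]  83 102  57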
We can use the same technique for changing parts of a vector, simply by moving the expression with the brackets to the left hand side of an assignment. So e.g. assuming that the weight after diet for the second subject was originally misspecified and should really be 96, we can fix this by
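> after[2] <- 96
> after
[1] 81 96 59 71 62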
And of course, if we come to the conclusion that 94 was correct all along, we can easily change it back via after[2] <- 94. This works in the same way for an index vector, so assuming that the last three pre-diet weights were measured on a mis-calibrated scale that added +2 kg to the true weights, we can fix this via
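> before[3:5] <- before[3:5] - 2
> before
[1]  83 102  55  70  66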
Now this may be fairly cool functionality from a pure data manipulation point of view, but it’s actually relatively uncommon that we want to extract or modify observations simply based on their position in a data set8. In practice, we are much more interested in selecting observations based on information contained in one or more other variables, like splitting a data set by sex or age groups. We can still do this using brackets, but we additionally need the concept of logical data introduced in the next section.
Exercises:
- Use indexing to extract the weight before diet of the last subject in the vector, regardless of the actual number of subjects.
- What happens if you evaluate the expression after[1:3] <- 72? Experiment and explain.
5.4 Logical data
5.4.1 Definition
We have encountered two basic data types so far: numeric data can take on any value that can be represented as a floating point number with 53 bit precision (see ?double and ?.Machine for details); character data can contain any sequence of characters (text). In contrast, logical is a basic data type in R that only allows two possible values: TRUE and FALSE9.
As such, it can be used to represent binary data; however, while it is not uncommon, it is not really necessary to do that, and often a factor with two levels is more informative and easier to read: e.g. I prefer using a factor variable smoking_status with levels smoker and non-smoker to a logical variable smoker with possible values TRUE and FALSE.
More importantly, this data type is used to store values of logical expressions and comparisons. One application for this is in programming with R, where different code may be executed depending on whether some value of interest is over or under a specified threshold (e.g. via an if-statement). Another application is the extraction and modification of parts of a data set based on conditions involving observed values.
5.4.2 Logical expressions
We can use the comparison operators == (for equality), <, >, <= and >= to compare two objects in an expression. R will evaluate the expression and return TRUE or FALSE depending on whether the comparison holds:
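> 2 < 3    # illustrative values
[1] TRUE
> 2 >= 3
[1] FALSE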
As for numerical expressions, we can include arithmetic operators, functions and objects (variables):
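> sqrt(32) <= 5
[1] FALSE
> cos(pi) <= 0
[1] TRUE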
Furthermore, we can also use standard logical operators to combine logical expressions: these are logical AND (&), logical OR (|) and logical NOT (!). E.g.:
> !TRUE
[1] FALSE
> (sqrt(32) <= 5) & (cos(pi) <= 0)
[1] FALSE
> (sqrt(32) <= 5) | (cos(pi) <= 0)
[1] TRUE
And then there are functions with logical return values. A simple example for such a function is is.numeric, which is often used when writing code to check that an input from the user was indeed numerical:
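> is.numeric(2.5)    # illustrative values
[1] TRUE
> is.numeric("2.5")
[1] FALSE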
5.4.3 Logical vectors
As for the other basic data types, the basic logical operations listed above are vector-oriented, so if we want to record for each subject in our little toy example whether their weight after diet was above 65 kg, we can just write
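> after > 65
[1]  TRUE  TRUE FALSE  TRUE FALSE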
And of course, we can store the result of any such expression as an object (variable) in R under any technically valid name:
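> heavy <- after > 65    # the object name heavy is chosen here for illustration
> heavy
[1]  TRUE  TRUE FALSE  TRUE FALSE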
And needless to say, we can use this object again to build further logical expressions, like
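> heavy & (before > 80)
[1]  TRUE  TRUE FALSE FALSE FALSE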
As an aside, logical expressions tabulate well, so this can actually be useful in data analysis. Switching to the data example on adding salt to food, we can easily count subjects over 60, with or without elevated systolic blood pressure:
> table(salt_ht$age > 60)
FALSE  TRUE 
   88    12 
> table(salt_ht$age > 60, salt_ht$sbp > 130)
        FALSE TRUE
  FALSE    37   51
  TRUE      1   11
R also has useful summary functions that can complement table when it is of interest whether e.g. any subject over 60 years of age has high systolic blood pressure, or whether all subjects under 30 are female10.
Logical vectors can also be generated by function calls. A useful and common example for such a function is is.na: it accepts (among other things) a vector as argument, and returns a vector of the same length, where each entry indicates whether the corresponding value in the original vector was indeed the special value NA indicating missingness (TRUE) or not (FALSE)11.
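> is.na(c(1, NA, 3))    # illustrative vector
[1] FALSE  TRUE FALSE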
5.4.4 Logical vectors for indexing
We can use logical vectors together with brackets to extract all observations for which a logical condition holds (or equivalently, dropping all observations where the condition does not hold). Conceptually, the bracketed vector (of any data type) and the logical index vector are lined up side by side, and only those values of the bracketed vector where the index vector evaluates to TRUE are returned. So e.g.
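> after[c(TRUE, TRUE, FALSE, FALSE, TRUE)]
[1] 81 94 62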
returns the first, second and fifth value of the vector after.
Of course this is not how logical vectors are commonly used (we know already how to extract by position). Rather, we use logical expressions as index vectors; if we want to extract all weights post-diet that are over 65 kg, we just write
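> after[after > 65]
[1] 81 94 71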
This is of course not limited to expressions involving the bracketed vector:
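> after[before > 80]
[1] 81 94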
And of course we can use extended logical expressions for indexing:
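> after[(before > 80) & (after > 90)]
[1] 94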
As before, the same technique can be used to change parts of a vector; as before, one has to be careful that the vectors on the left (assignee) and the right (assigned) have compatible lengths and line up as they should. Let’s look at a slightly convoluted example for our diet mini data: let’s assume that subjects should have been weighed twice after their diet (repeated measurements) to reduce technical variability, but by mistake this did not happen for all subjects. So we have a second vector of post-diet weights with some missing values:
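One such vector could look like this (the exact values are not shown here; these are chosen to be consistent with the averaged results reported below):
> after2 <- c(NA, NA, 60, 69, NA)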
Let’s also assume that the researchers decide to report the average of the two values where available, and otherwise only the single measured value (which is frankly not a great idea, but crazier things happen, so let’s go with this here). We can try some vector arithmetic:
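> (after + after2)/2
[1]   NA   NA 59.5 70.0   NA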
This works where we actually have two observations, but only shows NA where we only have one - which is actually correct, as the result of any arithmetic operation involving a missing value should properly be missing. However, we can specify that the averaging should only take place at the positions where the second vector of weights after2 has no missing values. The function is.na is vector-oriented, so we can do this:
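> ndx <- !is.na(after2)
> ndx
[1] FALSE FALSE  TRUE  TRUE FALSE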
Based on this logical vector, we can calculate the valid averages and store them at the correct locations:
> after_new <- after
> after_new[ndx] <- (after[ndx] + after2[ndx])/2
> after_new
[1] 81.0 94.0 59.5 70.0 62.0
Note how storing the logical vector as object ndx here saves some calculations (the index vector is only calculated once, but used three times in the averaging) and makes the expressions easier to write and read, at the expense of some extra memory for the object ndx.
Exercises:
- What would the expression after[TRUE] return? Experiment and explain.
- What does the expression before[before > 120] return? What is this, and does it make sense?
- Select the after-diet weights of those subjects whose weight has a) gone down by at least 2 kg, b) changed by at least 2 kg.
5.5 More on rectangular data
For all the fun we have had so far with indexing, analysis data is not usually processed as a collection of unrelated vectors. The standard data set format is still a rectangular table, with subjects as rows and measurements / variables as columns, for good reasons (not least to keep the variables and measurements synchronized and available for joint processing).
As it turns out, the idea of indexing and the use of brackets translate easily from vectors to rectangular data tables: here, observations are not lined up linearly, but in a grid of rows and columns; to uniquely identify an observation, we have to specify two indices, one for the rows and one for the columns. Correspondingly, we can refer to a specific observation via
x[<row index>, <column index>]
i.e. we still use brackets, specify the row index first and separate it from the column index with a comma. Note that this is again directly inspired by standard algebraic notation \(x_{ij}\) for the element in the \(i\)-th row and \(j\)-th column of a general matrix \(X\).
How this works in practice will be demonstrated directly below for data frames. I will then briefly introduce a simpler way of arranging data in a rectangular manner in R, the matrix, and compare it with the more general data frame (including the use of indexing).
5.5.1 Data frame
Let’s formalize our toy example from above as a data frame (reverting to the original observations), with a subject identifier12 added:
> diet <- data.frame(Before = before, After = after, ID = c("A", "B", "C", "D", "E"))
> diet
  Before After ID
1     83    81  A
2    102    94  B
3     57    59  C
4     72    71  D
5     68    62  E
5.5.1.1 Indexing by position
As before, we can simply indicate a specific observation by its position; so if we want to refer to the weight after diet (second column) for the third subject, we can just do
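> diet[3, 2]
[1] 59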
As before, we can also change the content of the data frame at the specified location by putting it on the left hand side of an assignment, as in
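diet[3, 2] <- 60    # hypothetical replacement value, not actually run here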
though we don’t want to do that here.
Again as before, we can use vectors of positions for both rows and columns: if we want to extract only the weight after diet (second column), but also keep the identifier (third column), only for subjects B-D, we can specify
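> diet[2:4, c(2, 3)]
  After ID
2    94  B
3    59  C
4    71  D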
In this way, we can extract any rectangular subset from the original data frame, i.e. any combination of rows and columns, in any order we want.
A useful shorthand applies when we only want to drop some subjects from the data, but keep all variables, or conversely, only drop some variables, but keep all subjects: by keeping the corresponding index slot empty, R will automatically return all rows or all columns. So we can get e.g. all variables for subjects A-C via
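> diet[1:3, ]
  Before After ID
1     83    81  A
2    102    94  B
3     57    59  C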
or all subjects for only the weight variables via
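> diet[, 1:2]
  Before After
1     83    81
2    102    94
3     57    59
4     72    71
5     68    62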
Note that we still need the comma to indicate whether the row- or column index was dropped13. Formally, this is the counterpart to the algebraic notation \(x_{i\cdot}\) and \(x_{\cdot j}\) for indicating the whole \(i\)-th row or \(j\)-th column of a general matrix \(X\).
Exercise: Use a column index to re-sort the variables in data frame diet so that the identifier is the first column.
5.5.1.2 Logical indexing
This works as we would expect at this point, i.e. we can plug in logical expressions for either the row or the column index. In practice, though, this is more natural for selecting subjects (rows): we have the same set of variables for all subjects, making them all comparable and addressable via a logical expression; variables (columns), on the other hand, can be of widely different types (numerical, categorical etc.), so more care must be taken when formulating a logical expression that makes sense for all columns.
As a simple example, let’s extract all subjects whose weight before diet was over 70 kg:
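> diet[diet$Before > 70, ]
  Before After ID
1     83    81  A
2    102    94  B
4     72    71  D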
What’s a bit awkward here is that we have to specify that the variable Before is part of the same diet data frame from which we extract anyway - more about that later.
Exercise: Extract all subjects with weight before diet less than or equal to 70 kg, using only bracket notation (i.e. no $-notation).
5.5.1.3 Mix & match
Just to point out what you would expect: you can combine different styles of indexing for rows and columns, so e.g.
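> diet[diet$Before > 70, 1:2]    # logical row index, positional column index
  Before After
1     83    81
2    102    94
4     72    71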
is absolutely ok and works as it should14.
5.5.2 Matrix
In contrast to a data frame, where all observations in the same column have the same type, a matrix in R is a rectangular arrangement where all elements have the same type, e.g. numeric or character. A matrix is more limited as a container of general data, but due to its simple structure can be efficient for large amounts of data of the same type, e.g. as generated in high-throughput molecular assays. For general data with different variable types (numerical, categorical, dates etc.) however, data frames (or some comparable general container, see below) are more appropriate.
On the other hand, for actual statistical calculations (model fitting etc.), general data has to be converted to a numerical matrix, using dummy coding for categorical variables etc. This is generally not done by hand, but internally by the respective statistical procedures. We will see some examples of this later.
For completeness' sake, let it be stated that brackets and indexing work exactly the same way as for data frames. If we e.g. construct a matrix from our toy example by binding the two weight vectors together as columns (cbind), we get
> diet_mat <- cbind(Before = before, After = after)
> diet_mat
     Before After
[1,]     83    81
[2,]    102    94
[3,]     57    59
[4,]     72    71
[5,]     68    62
which looks similar to a data frame, though without the default row names we see there (instead, we have an obvious extension of the [1] notation that R displays when printing vectors).
Now we can do
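> diet_mat[1:3, ]
     Before After
[1,]     83    81
[2,]    102    94
[3,]     57    59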
for extracting the first three rows / subjects.
A note15 on some further technical aspects of matrices. FIXME: clean up awkward reference.
5.5.3 Extensions & alternatives
An array is a generalization of a matrix which has more than two indices, e.g. a three-dimensional array has three indices, separated by two commas, and so on in higher dimensions. This has its specialized uses in data processing.
data.table is a re-implementation of the idea of a data frame (as a rectangular arrangement of data with mixed variable types) provided by the add-on package data.table. It is highly efficient, even for large data sets, and supports a range of database functionality, as well as the usual indexing via brackets.
> library(data.table)
> diet_dt <- as.data.table(diet)
> diet_dt[After > 70, ]
   Before After ID
1:     83    81  A
2:    102    94  B
3:     72    71  D
tibble is another re-implementation of the data frame concept, provided by the package tibble, which is part of the larger collection of packages known as the tidyverse, which will be discussed in more detail later. It also supports database operations and is efficient for large data sets. FIXME: clean up reference
5.6 Helpers: subset and transform
We have already used the function subset to achieve some of what we can do using brackets and indexing. Indeed, using a so far unused extra argument to the subset function, namely select, as well as the companion function transform, we can do all logical indexing for extraction as well as some modification, at least for data frames and objects that extend them, like data.table. And we can save some typing, too.
subset handles the extraction side:
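> subset(diet, Before < 80)    # e.g. all subjects lighter than 80 kg before the diet
  Before After ID
3     57    59  C
4     72    71  D
5     68    62  E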
The extra argument applies to variables, so if we only want the weight variables, we can use
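> subset(diet, select = c(Before, After))
  Before After
1     83    81
2    102    94
3     57    59
4     72    71
5     68    62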
And of course, we can combine both things16:
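> subset(diet, Before < 80, select = c(After, ID))
  After ID
3    59  C
4    71  D
5    62  E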
transform covers both the generation of new variables (as functions of existing ones) and the genuine transformation of existing variables, i.e. where the original variables get overwritten. Let’s add the weight loss to the current data frame:
> diet <- transform(diet, WeightLoss = Before - After)
> diet
  Before After ID WeightLoss
1     83    81  A          2
2    102    94  B          8
3     57    59  C         -2
4     72    71  D          1
5     68    62  E          6
We immediately change our mind and want to report the weight loss as a percentage of the original weight. We can modify the new variable:
> diet <- transform(diet, WeightLoss = round(100 * WeightLoss/Before, 1))
> diet
  Before After ID WeightLoss
1     83    81  A        2.4
2    102    94  B        7.8
3     57    59  C       -3.5
4     72    71  D        1.4
5     68    62  E        8.8
Note that both subset and transform are convenience functions for use at the command line and in scripts, but not for serious programming.
5.7 Free-style data: lists
5.7.1 Background
All data structures so far have had some kind of limitation with regard to the type of data they can hold: either all data items have to be of the same type (vectors, matrices), or all items in the same column have to be of the same type (data frames). In contrast, R also has a general purpose structure that can hold all kinds of data known to R: the list.
As often happens, with greater flexibility comes less specific utility: lists are not especially useful for holding generic tabular analysis data, compared to matrices and data frames. When you are starting out in R with more or less straightforward analyses, you can mostly do without lists. I still introduce them at this point for a number of reasons:
- Lists can be very handy for processing group-wise data, or data where a large number of outcome variables is of interest (say high-throughput molecular data).
- Parameters for more complex algorithms are often collected in lists and passed to functions for fine-tuning how the algorithms are run.
- Most complicated data structures in base R, like hypothesis tests or regression models, are at their core built as lists, with some extra magic for display; the same holds for many complicated data structures outside of base R, e.g. ggplot2-objects are essentially just fancy lists, too. Understanding lists therefore increases understanding of how data and results are handled in R, and allows direct access to results (e.g. the p-value in a t-test).
- Finally, lists emphasise that R is by design not just a statistics program, but rather a programming language and environment built on more general ideas about data, processing and structures than simple generic tables.
5.7.2 Basic list
A list can be generated by listing any number of R expressions, of any type, as arguments to the function list. So if we want to combine numerical data, character data and the result of a statistical procedure in one handy R object, we can just write
> mylist <- list(1:3, "a", t.test(rnorm(10)))
> mylist
[[1]]
[1] 1 2 3
[[2]]
[1] "a"
[[3]]
One Sample t-test
data: rnorm(10)
t = 1.1432, df = 9, p-value = 0.2824
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.4629581 1.4089867
sample estimates:
mean of x
0.4730143
In this sense, list() works similarly to c(), except for the part about any type of data. Specifically, this means that within the list, the different components are stored in the order in which they were originally specified. This is indicated in the output above by the component numbers written in double brackets [[ (as opposed to the single brackets we have used so far).
It may not come as much of a surprise at this point that we can use these double brackets to access an element of the list by position:
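> mylist[[1]]
[1] 1 2 3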
The important difference to single brackets is that a) we can only access a single element of the list (i.e. no vector indexing) and b) we cannot use logical indexing to extract a matching subset of elements from a list17. However, within these limitations, the double bracket works as one would expect, and specifically allows assignments and modifications. If we want to replace the second element in our toy list with the standard deviation of the first element, this works straightforwardly:
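> mylist[[2]] <- sd(mylist[[1]])
> mylist[[2]]
[1] 1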
5.7.3 Named lists
As an additional service, mostly for readability, we can also name the components of a list, e.g. by simply specifying the name in the call to list:
> mylist_named <- list(data = 1:3, label = "a", test_result = t.test(rnorm(10)))
> mylist_named
$data
[1] 1 2 3
$label
[1] "a"
$test_result
One Sample t-test
data: rnorm(10)
t = 0.58327, df = 9, p-value = 0.574
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.6056709 1.0265049
sample estimates:
mean of x
0.210417
The components of the list are the same, but now they have somewhat informative names, which are displayed in the output above instead of their numbers, with a leading $ instead of the double brackets.
And of course we can use this $-notation to access (and change) the elements of a named list via their name:
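> mylist_named$label
[1] "a"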
Note that a) the plot thickens (as the $-notation is already familiar at this point), and b) we can still use the double bracket as before for named lists:
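> mylist_named[[2]]
[1] "a"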
As a matter of fact, we can even use the double bracket with the name:
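> mylist_named[["label"]]
[1] "a"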
All three notations above for accessing the second element / the element with name label are equivalent18.
5.7.4 Example: data frames
What we have been leading up to is the simple fact that internally, data frames are just lists of vectors of the same length. There is some extra functionality implemented in R to make the row/column-indexing work as for a matrix, but at heart, data frames are somewhat constrained lists:
If we strip away all the extra data frame goodness with the function unclass, we see this directly:
> unclass(diet)
$Before
[1]  83 102  57  72  68
$After
[1] 81 94 59 71 62
$ID
[1] "A" "B" "C" "D" "E"
$WeightLoss
[1]  2.4  7.8 -3.5  1.4  8.8
attr(,"row.names")
[1] 1 2 3 4 5
This explains why we have been able to use the $-notation in our investigations so far: we just use the fact that this notation works for lists. As a consequence, we can also write e.g.
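> diet[["Before"]]
[1]  83 102  57  72  68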
though this is not very common.
Conceptually, one can think of a vector in R as a continuous stretch of sequential memory locations that hold the machine representation of the data elements in the vector in the correct order. What with the requirements of memory management, this is not literally true, but it’s close enough when thinking e.g. about vectors and matrices.↩︎
A more general helper function for generating regular integer vectors that can be useful in extracting or modifying data is seq.↩︎
R also has predefined constants T and F that evaluate to TRUE and FALSE, respectively. Note however that anyone can re-define these objects, intentionally or by mistake, with hilarious consequences (e.g. T = FALSE). Best to avoid, IMO.↩︎
Unsurprisingly, these functions are called any and all. They accept a logical vector of any length and return a single logical value, summarizing whether any entry in the argument is true, or whether all entries are true, respectively. The examples above would therefore translate into:
> any(salt_ht$age > 60 & salt_ht$sbp > 130)
[1] TRUE
> all(salt_ht$age < 30 & salt_ht$sex == "female")
[1] FALSE
↩︎
Indeed, the function is.na is (almost) the only legitimate way of checking for the special code NA in R. Should you try to use a comparison operator, as in
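> c(1, NA, 3) == NA    # illustrative: any comparison with NA yields NA
[1] NA NA NA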
the expression will simply evaluate to NA again, which is not helpful.↩︎
Useful to know: R has two pre-defined objects available, LETTERS and letters, which contain exactly what their names suggest; so we could have defined the identifiers here simply as ID = LETTERS[1:5].↩︎
Note that in R, the type of object that is returned from a bracketing operation on a data frame will depend on what index was specified: for a single element, it will always be a vector of length one, of whatever type the corresponding column in the data frame is; when selecting rows only, the operation will always return a data frame. However, when specifying columns only, it depends on whether one column was selected, in which case a vector is returned, or more than one column was specified, in which case a data frame is returned. This inconsistency is rarely a problem when doing interactive data analysis, but it can be annoying when writing code, and more database-oriented implementations of rectangular data structures like data.table or tibble will always return an object of the same class when bracketed.↩︎
Note that the select-argument to subset allows you to use the names of the variables (columns) in a very non-standard, un-R-y way (though written in and fully compatible with R, alas, such is its power \(\ldots\)). We can write vectors of variable names without quotation marks and even use the : for ranges of variables:
> subset(diet, Before < 80, select = c(Before, ID))
  Before ID
3     57  C
4     72  D
5     68  E
> subset(diet, Before < 80, select = Before:ID)
  Before After ID
3     57    59  C
4     72    71  D
5     68    62  E
This is done through some really clever programming, but is rather fragile, and one of the reasons that subset is not recommended for programming use.↩︎
Actually, we can use single brackets with lists, with all the goodness of vector indexing and logical indexing. However, the result is always going to be a list, even if the list only contains a single element:
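> mylist[1]    # single bracket: returns a (sub)list, not the element itself
[[1]]
[1] 1 2 3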
which may or may not be what we want (note however that this is consistent behavior: single brackets applied to a list will always return a list, which is not true e.g. for data frames, as outlined above).↩︎
Actually, the double bracket [[ requires the exact name, whereas the $-notation can be abbreviated. On the other hand, you are free to combine named and unnamed items in the same list, as in the example below.↩︎
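> list(data = 1:3, "a")    # a sketch: one named and one unnamed component
$data
[1] 1 2 3

[[2]]
[1] "a"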