5 Data types and structures

5.1 Overview

This document assumes that you can interact with R/RStudio in a basic manner (start the program, load data, perform simple analyses, quit safely) and have a working knowledge of basic data types (numerical, character, factors) and data structures (vectors and data frames). Working through the accompanying document Starting in R should provide the necessary context.

The goal of this document is to inform you about:

  • vectors as basic data structures in R,
  • calculations on vectors,
  • extracting and modifying parts of a vector by position (indexing),
  • the logical data type for storing true/false data in R,
  • logical operators and functions that return true/false values,
  • the use of logical expressions to extract and modify parts of a vector (logical indexing),
  • indexing for rectangular data structures like data frames,
  • lists as a general all-purpose data structure in R,
  • names in data frames and lists.

After going through this document and working through the examples in it, you should be able to

  • extract parts of a vector or data frame by position,
  • build and evaluate logical conditions in your data,
  • extract parts of a vector or data frame based on logical conditions,
  • modify parts of an existing vector or data frame.

5.1.1 Data examples

We will use two standard examples for demonstrating operations on vectors and data frames respectively throughout: for vectors, we have a mini data set of five subjects with body weights in kg before and after a diet:

> before <- c(83, 102, 57, 72, 68)
> after <- c(81, 94, 59, 71, 62)

For a data frame, we use the data on blood pressure and salt consumption in 100 subjects from the introduction, the first rows of which are shown below: FIXME reference

> head(salt_ht)
    ID    sex sbp dbp saltadd age
1 4305   male 110  80     yes  58
2 6606 female  85  55      no  32
3 5758 female 196 128    <NA>  53
4 2265   male 167 112     yes  68
5 7408 female 145 110     yes  55
6 2160 female 179 120      no  60

5.2 Background

5.2.1 Recap

In the introduction, we have seen how statistical data can be combined into aggregated structures for manipulation, display and analysis. The two structures we have discussed were linear vectors and rectangular data frames. As originally mentioned, we still want to be able to extract, inspect, display and process smaller parts of the combined data, and we have very briefly looked at how to do this for data frames using the $-notation and the subset-function.

5.2.2 Motivation

Data extraction and manipulation and their technical details may not be the most exciting subjects, but they are essential for any practical statistical work. They are crucial for transforming raw original data into clean analysis data; they are an integral part of descriptive statistics, and they pop up naturally when doing subgroup or sensitivity analyses.

Additionally, the concepts discussed in this document (vectors, data frames, lists, indexing) are central for how R works. Understanding them at a somewhat more technical level makes it possible to read, understand and modify existing code for one’s own analysis, and provides context for extension methods (like the tidyverse) that build upon it. FIXME: reference

5.3 More about vectors

5.3.1 Vector calculations

As discussed, a vector is a simple linear arrangement of data items of the same type (e.g. all numerical or all character). It is also a truly fundamental data type in R, both technically and conceptually: individual numbers or character strings are really just vectors of length one, rather than some different type. This is exactly why R displays a single number with the leading [1], just as it does for vectors:

> 42
[1] 42
> before
[1]  83 102  57  72  68

Consequently, many basic operations in R and many of the better functions in add-on packages are vector-oriented, i.e. they work for vectors in the same way as for individual data items, simply by acting component-wise. So in order to calculate the change in weight in our five individuals from before to after the diet, we can simply subtract the two vectors:

> after - before
[1] -2 -8  2 -1 -6

Note how the subtraction is performed for each matching pair of weights from the two operand vectors, and the resulting collection of differences is returned as a vector of the same length. The same also works for division, so if we want to express the weight after the diet as a proportion of the weight before the diet, we can simply write:

> after/before
[1] 0.9759036 0.9215686 1.0350877 0.9861111 0.9117647

We can extend this calculation to the percentage in an intuitive manner:

> round(100 * after/before)
[1]  98  92 104  99  91

Clearly, the function round, which rounds a real number to the closest integer, is also vector-oriented, in that it works component-wise on a vector of numbers and returns a vector of results. You may be a bit puzzled by the role of the multiplier 100: while we can interpret this as a vector of length one, as per above, how is it combined with the vector after/before, which is of length five? The answer is that when two vectors of different lengths are combined in an operation, the shorter one is repeated as often as necessary to create two vectors of the same length, chopping off extra parts if necessary (the last repetition may be partial). This is not a problem if the shorter vector has length one: it is just replaced by as many copies of itself as necessary. In our case, the operation above is the same as

> round(c(100, 100, 100, 100, 100) * after/before)
[1]  98  92 104  99  91

which makes perfect sense. If the shorter vector is not of length one, this is more often than not unintended (i.e. an error), and R will generate a warning whenever information is chopped off6.
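To see this warning in action, here is a small sketch with made-up numbers: adding a length-two vector to a length-five vector forces a partial last repetition, so R recycles and complains.

```r
# c(1, 2) is recycled as 1, 2, 1, 2, 1 to match length 5;
# R warns because 5 is not a multiple of 2
c(10, 20, 30, 40, 50) + c(1, 2)
# [1] 11 22 31 42 51   (plus a warning about object lengths)
```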

Exercises:

  1. Using the conversion factor 1 kg = 2.205 lbs, convert the dieters’ weights to pounds.
  2. Use vector arithmetic to calculate the variance of a vector of numbers. Hint: use functions mean, sum, length and the exponentiation operator ^.

5.3.2 Indexing vectors by position

By construction, the easiest way to specify part of a vector is by position: data items are lined up one after the other, from first to last, so each datum has a specific position, or number, or address (unlike our friends from the computer sciences, we count starting with one). In R, this is specified via square brackets []; so if we want to extract the weight after diet of the second subject, we just write

> after[2]
[1] 94

Calling the bracket-expression an index just highlights the connection to the mathematical idea of a vector, as an \(n\)-tuple of numbers \((x_1, \ldots, x_n)\). In other words, x[i] is just the R expression for the dreaded \(x_i\) so beloved by teachers of statistics courses.

As it turns out, the extraction function implemented by the brackets is itself vector-oriented, in the sense explained above. This means that we can specify a whole vector of index positions, e.g. to extract the weights before diet for the three first subjects in the data:

> before[c(1, 2, 3)]
[1]  83 102  57

A useful shorthand for writing an index vector in this context is the :-operator, which generates a vector of consecutive integers7, as in

> 1:3
[1] 1 2 3

which we can use to achieve the same extraction with much less typing:

> before[1:3]
[1]  83 102  57

We can use the same technique for changing parts of a vector, simply by moving the expression with the brackets to the left hand side of an assignment. So e.g. assuming that the weight after diet for the second subject was originally misspecified and should really be 96, we can fix this by

> after[2] <- 96
> after
[1] 81 96 59 71 62

And of course, if we come to the conclusion that 94 was correct all along, we can easily change it back via after[2] <- 94. This works in the same way for an index vector, so assuming that the last three pre-diet weights were measured on a mis-calibrated scale that added 2 kg to the true weights, we can fix this via

> before
[1]  83 102  57  72  68
> before[3:5] <- before[3:5] - 2
> before
[1]  83 102  55  70  66

Now this may be fairly cool functionality from a pure data manipulation point of view, but it’s actually relatively uncommon that we want to extract or modify observations simply based on their position in a data set8. In practice, we are much more interested in selecting observations based on information contained in one or more other variables, like splitting a data set by sex or age groups. We can still do this using brackets, but we additionally need the concept of logical data introduced in the next section.

Exercises:

  1. Use indexing to extract the weight before diet of the last subject in the vector, regardless of the actual number of subjects.
  2. What happens if you evaluate the expression after[1:3] <- 72? Experiment and explain.

5.4 Logical data

5.4.1 Definition

We have encountered two basic data types so far: numeric data can take on any value that can be represented as a floating point number with 53 bit precision (see ?double and ?.Machine for details); character data can contain any sequence of alphanumeric characters. In contrast, logical is a basic data type in R that only allows two possible values: TRUE and FALSE9.

As such, it can be used to represent binary data; however, while it is not uncommon, it is not really necessary to do that, and often a factor with two levels is more informative and easier to read: e.g. I prefer using a factor variable smoking_status with levels smoker and non-smoker to a logical variable smoker with possible values TRUE and FALSE.

More importantly, this data type is used to store values of logical expressions and comparisons. One application for this is in programming with R, where different code may be executed depending on whether some value of interest is over or under a specified threshold (e.g. via an if-statement). Another application is the extraction and modification of parts of a data set based on conditions involving observed values.
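As a minimal sketch of the first application (the value and the threshold of 130 are invented for illustration), a logical expression can decide which branch of the code is executed:

```r
sbp <- 150                 # a single made-up systolic blood pressure
if (sbp > 130) {           # the comparison evaluates to TRUE or FALSE
  status <- "elevated"
} else {
  status <- "normal"
}
status
# [1] "elevated"
```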

5.4.2 Logical expressions

We can use the comparison operators == (for equality), != (for inequality), <, >, <= and >= to compare two objects in an expression. R will evaluate the expression and return TRUE or FALSE depending on whether the comparison holds:

> 3 < 4
[1] TRUE
> 3 <= 4
[1] TRUE
> 3 == 4
[1] FALSE

As for numerical expressions, we can include arithmetic operators, functions and objects (variables):

> 2 * 3 > sqrt(17)
[1] TRUE
> sqrt(pi) < 1.4
[1] FALSE

Furthermore, we can also use standard logical operators to combine logical expressions: these are logical AND (&), logical OR (|) and logical NOT (!). E.g.:

> !TRUE
[1] FALSE
> (sqrt(32) <= 5) & (cos(pi) <= 0)
[1] FALSE
> (sqrt(32) <= 5) | (cos(pi) <= 0)
[1] TRUE

And then there are functions with logical return values. A simple example for such a function is is.numeric, which is often used when writing code to check that an input from the user was indeed numerical:

> is.numeric(42)
[1] TRUE
> is.numeric("a")
[1] FALSE

5.4.3 Logical vectors

As for the other basic data types, the basic logical operations listed above are vector-oriented, so if we want to record for each subject in our little toy example whether or not their weight after diet was above 65 kg, we can just write

> after > 65
[1]  TRUE  TRUE FALSE  TRUE FALSE

And of course, we can store the result of any such expression as an object (variable) in R under any technically valid name:

> over65after <- after > 65
> over65after
[1]  TRUE  TRUE FALSE  TRUE FALSE

And needless to say, we can use this object again to build further logical expressions, like

> over65after & (before > 65)
[1]  TRUE  TRUE FALSE  TRUE FALSE

As an aside, logical expressions tabulate well, so this can actually be useful in data analysis. Switching to the data example on adding salt to food, we can easily count subjects over 60, with or without elevated systolic blood pressure:

> table(salt_ht$age > 60)

FALSE  TRUE 
   88    12 
> table(salt_ht$age > 60, salt_ht$sbp > 130)
       
        FALSE TRUE
  FALSE    37   51
  TRUE      1   11

R also has useful summary functions that can complement table when it is of interest whether e.g. any subject over 60 years of age has high systolic blood pressure, or whether all subjects under 30 are female10.
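A related trick, as an aside: in arithmetic, TRUE counts as 1 and FALSE as 0, so sum and mean applied to a logical vector give the number and the proportion of entries fulfilling the condition, respectively.

```r
after <- c(81, 94, 59, 71, 62)
sum(after > 65)    # number of post-diet weights over 65 kg
# [1] 3
mean(after > 65)   # proportion of post-diet weights over 65 kg
# [1] 0.6
```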

Logical vectors can also be generated by function calls. A useful and common example for such a function is is.na: it accepts (among other things) a vector as argument, and returns a vector of the same length, where each entry indicates whether the corresponding value in the original vector was indeed the special value NA indicating missingness (TRUE) or not (FALSE)11.

> is.na(before)
[1] FALSE FALSE FALSE FALSE FALSE
> is.na(c("case", NA, "control"))
[1] FALSE  TRUE FALSE

5.4.4 Logical vectors for indexing

We can use logical vectors together with brackets to extract all observations for which a logical condition holds (or equivalently, dropping all observations where the condition does not hold). Conceptually, the bracketed vector (of any data type) and the logical index vector are lined up side by side, and only those values of the bracketed vector where the index vector evaluates to TRUE are returned. So e.g.

> after[c(TRUE, TRUE, FALSE, FALSE, TRUE)]
[1] 81 94 62

returns the first, second and fifth value of vector after.

Of course this is not how logical vectors are commonly used (we know already how to extract by position). Rather, we use logical expressions as index vectors; if we want to extract all weights post-diet that are over 65 kg, we just write

> after[after > 65]
[1] 81 94 71

This is of course not limited to expressions involving the bracketed vector (note that we have quietly restored before to its original values):

> before[after > 65]
[1]  83 102  72

And of course we can use extended logical expressions for indexing:

> before[after < 95 & before > 100]
[1] 102

As before, the same technique can be used to change parts of a vector; as before, one has to be careful that the vectors on the left (assignee) and the right (assigned) have compatible lengths and line up as they should. Let’s look at a slightly convoluted example for our diet mini data: let’s assume that subjects should have been weighed twice after their diet (repeated measurements) to reduce technical variability, but by mistake this did not happen for all subjects. So we have a second vector of post-diet weights with some missing values:

> after2 <- c(81, NA, 60, 69, NA)
> after2
[1] 81 NA 60 69 NA

Let’s also assume that the researchers decide to report the average of the two values where available, and otherwise only the single measured value (which is frankly not a great idea, but crazier things happen, so let’s go with it here). We can try some vector arithmetic:

> after_new <- (after + after2)/2
> after_new
[1] 81.0   NA 59.5 70.0   NA

This works where we actually have two observations, but yields NA where we have only one - which is actually correct, as the result of any arithmetic operation involving a missing value should properly be missing. However, we can specify that the averaging should only take place at the positions where the second vector of weights after2 has no missing values. The function is.na is vector-oriented, so we can do this:

> ndx <- !is.na(after2)
> ndx
[1]  TRUE FALSE  TRUE  TRUE FALSE

Based on this logical vector, we can calculate the valid averages and store them at the correct locations:

> after_new <- after
> after_new[ndx] <- (after[ndx] + after2[ndx])/2
> after_new
[1] 81.0 94.0 59.5 70.0 62.0

Note how storing the logical vector as object ndx here saves some calculations (the index vector is only calculated once, but used three times in the averaging) and makes the expressions easier to write and read, at the expense of some extra memory for object ndx.
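As an aside, the same result can be obtained in one line with the vectorized function ifelse, which picks element-wise from two alternatives based on a logical vector; whether this is clearer than the explicit index version above is a matter of taste.

```r
after  <- c(81, 94, 59, 71, 62)
after2 <- c(81, NA, 60, 69, NA)
# where after2 is missing, keep the single measurement, otherwise average
after_new <- ifelse(is.na(after2), after, (after + after2) / 2)
after_new
# [1] 81.0 94.0 59.5 70.0 62.0
```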

Exercises:

  1. What would the expression after[TRUE] return? Experiment and explain.
  2. What does the expression before[before > 120] return? What is this, and does it make sense?
  3. Select the after-diet weights of those subjects whose weight has a) gone down by at least 2 kg, b) changed by at least 2 kg.

5.5 More on rectangular data

For all the fun we have had so far with indexing, analysis data is not usually processed as a collection of unrelated vectors. The standard data set format is still a rectangular table, with subjects as rows and measurements / variables as columns, for good reasons (not least to keep the variables and measurements synchronized and available for joint processing).

As it turns out, the idea of indexing and the use of brackets translate easily from vectors to rectangular data tables: here, observations are not lined up linearly, but in a grid of rows and columns; to uniquely identify an observation, we have to specify two indices, one for the rows and one for the columns. Correspondingly, we can refer to a specific observation via

x[<row index>, <column index>]

i.e. we still use brackets, specify the row index first and separate it from the column index with a comma. Note that this is again directly inspired by standard algebraic notation \(x_{ij}\) for the element in the \(i\)-th row and \(j\)-th column of a general matrix \(X\).

How this works in practice will be demonstrated directly below for data frames. I will then briefly introduce a simpler way of arranging data in a rectangular manner in R, the matrix, and compare it with the more general data frame (including the use of indexing).

5.5.1 Data frame

Let’s formalize our toy example from above as a data frame (reverting to the original observations), with a subject identifier12 added:

> diet <- data.frame(Before = before, After = after, ID = c("A", "B", "C", "D", "E"))
> diet
  Before After ID
1     83    81  A
2    102    94  B
3     57    59  C
4     72    71  D
5     68    62  E

5.5.1.1 Indexing by position

As before, we can simply indicate a specific observation by its position; so if we want to refer to the weight after diet (second column) for the third subject, we can just do

> diet[3, 2]
[1] 59

As before, we can also change the content of the data frame at the specified location by putting it on the left hand side of an assignment, as in

> diet[3, 2] <- 60

though we don’t want to do that here.

Again as before, we can use vectors of positions for both rows and columns: if we want to extract only the weight after diet (second column), but also keep the identifier (third column), only for subjects B-D, we can specify

> diet[2:4, 2:3]
  After ID
2    94  B
3    59  C
4    71  D

In this way, we can extract any rectangular subset from the original data frame, i.e. any combination of rows and columns, in any order we want.

A useful shorthand applies when we only want to drop some subjects from the data, but keep all variables, or conversely, only drop some variables, but keep all subjects: by keeping the corresponding index slot empty, R will automatically return all rows or all columns. So we can get e.g. all variables for subjects A-C via

> diet[1:3, ]
  Before After ID
1     83    81  A
2    102    94  B
3     57    59  C

or all subjects for only the weight variables via

> diet[, 1:2]
  Before After
1     83    81
2    102    94
3     57    59
4     72    71
5     68    62

Note that we still need the comma to indicate whether the row- or column index was dropped13. Formally, this is the counterpart to the algebraic notation \(x_{i\cdot}\) and \(x_{\cdot j}\) for indicating the whole \(i\)-th row or \(j\)-th column of a general matrix \(X\).

Exercise: Use a column index to re-sort the variables in data frame diet so that the identifier is the first column.

5.5.1.2 Logical indexing

This works as we would expect at this point, i.e. we can plug in logical expressions for either the row or column index. In practice, though, this is more natural for selecting subjects (rows): we have the same set of variables for all subjects, making them all comparable and addressable via a logical expression; variables (columns), on the other hand, can be of widely different types (numerical, categorical etc.), so more care must be taken when formulating a logical expression that makes sense for all columns.

As a simple example, let’s extract all subjects whose weight before diet was over 70 kg:

> diet[diet$Before > 70, ]
  Before After ID
1     83    81  A
2    102    94  B
4     72    71  D

What’s a bit awkward here is that we have to specify that the variable Before is part of the same diet data frame from which we extract anyway - more about that later.

Exercise: Extract all subjects with weight before diet less or equal to 70 kg, using only bracket notation (i.e. no $-notation).

5.5.1.3 Mix & match

Just to point out what you would expect: you can combine different styles of indexing for rows and columns, so e.g.

> diet[diet$Before > 70, 1:2]
  Before After
1     83    81
2    102    94
4     72    71

is absolutely ok and works as it should14.

5.5.2 Matrix

In contrast to a data frame, where all observations in the same column have the same type, a matrix in R is a rectangular arrangement where all elements have the same type, e.g. numeric or character. A matrix is more limited as a container of general data, but due to its simple structure can be efficient for large amounts of data of the same type, e.g. as generated in high-throughput molecular assays. For general data with different variable types (numerical, categorical, dates etc.) however, data frames (or some comparable general container, see below) are more appropriate.

On the other hand, for actual statistical calculations (model fitting etc.), general data has to be converted to a numerical matrix, using dummy coding for categorical variables etc. This is generally not done by hand, but internally by the respective statistical procedures. We will see some examples of this later.
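As a small preview of this conversion (the smoking variable here is invented for illustration), the function model.matrix shows the numerical matrix, including the dummy coding, that model-fitting functions construct internally from a formula:

```r
smoking <- factor(c("smoker", "non-smoker", "smoker"))
# the first level ("non-smoker", alphabetically) becomes the reference;
# the level "smoker" is represented by a 0/1 dummy column
model.matrix(~ smoking)
```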

For completeness' sake, let it be stated that brackets and indexing work in exactly the same way as for data frames. If we e.g. construct a matrix from our toy example by binding the two weight vectors together as columns (cbind), we get

> diet_mat <- cbind(Before = before, After = after)
> diet_mat
     Before After
[1,]     83    81
[2,]    102    94
[3,]     57    59
[4,]     72    71
[5,]     68    62

which looks similar to a data frame, though without the default row names we see there (instead, we have an obvious extension of the [1] notation that R displays when printing vectors).

Now we can do

> diet_mat[1:3, ]
     Before After
[1,]     83    81
[2,]    102    94
[3,]     57    59

for extracting the first three rows / subjects.
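As an aside, since we gave the columns of diet_mat names via cbind, we can also use those names instead of positions inside the brackets, e.g. to select all subjects whose post-diet weight is over 70 kg (the snippet re-creates the toy data so it stands alone):

```r
before <- c(83, 102, 57, 72, 68)
after  <- c(81, 94, 59, 71, 62)
diet_mat <- cbind(Before = before, After = after)
diet_mat[diet_mat[, "After"] > 70, ]
#      Before After
# [1,]     83    81
# [2,]    102    94
# [3,]     72    71
```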

A note15 on some further technical aspects of matrices. FIXME: clean up awkward reference.

5.5.3 Extensions & alternatives

An array is a generalization of a matrix with more than two indices; e.g. a three-dimensional array has three indices, separated by two commas, and so on in higher dimensions. This has its specialized uses in data processing.
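A minimal sketch with made-up numbers: 24 values arranged as two rows, three columns and four layers, with elements addressed via three indices.

```r
a <- array(1:24, dim = c(2, 3, 4))
a[1, 2, 3]     # the element in row 1, column 2, layer 3
# [1] 15
dim(a[, , 1])  # the whole first layer is a 2 x 3 matrix
# [1] 2 3
```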

data.table is a re-implementation of the idea of a data frame (as a rectangular arrangement of data with mixed variable types) provided by the add-on package data.table. It is highly efficient, even for large data sets, and supports a range of database functionality, as well as the usual indexing via brackets.

> library(data.table)
> diet_dt <- as.data.table(diet)
> diet_dt[After > 70, ]
   Before After ID
1:     83    81  A
2:    102    94  B
3:     72    71  D

tibble is another re-implementation of the data frame concept, provided by package tibble, which is part of the larger collection of packages known as the tidyverse, which will be discussed in more detail later. It also supports database operations and is efficient for large data sets. FIXME: clean up reference

> library(tibble)
> diet_tib <- as_tibble(diet)
> diet_tib[diet_tib$After > 70, ]
# A tibble: 3 × 3
  Before After ID   
   <dbl> <dbl> <chr>
1     83    81 A    
2    102    94 B    
3     72    71 D    

5.6 Helpers: subset and transform

We have already used the function subset to achieve some of what we can do using brackets and indexing. Indeed, using a so far unused extra argument to the subset function, namely select, as well as the companion function transform, we can do all logical indexing for extraction as well as some modification, at least for data frames and objects that extend them, like data.table. And we can save some typing, too.

subset handles the extraction side:

> subset(diet, After > 70 & Before > 70)
  Before After ID
1     83    81  A
2    102    94  B
4     72    71  D

The extra argument applies to variables, so if we only want the weight variables, we can use

> subset(diet, select = 1:2)
  Before After
1     83    81
2    102    94
3     57    59
4     72    71
5     68    62
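As an aside, select also accepts the variable names directly (unquoted), which is usually more readable than positions; a small sketch, re-creating the toy data so the snippet stands alone:

```r
diet <- data.frame(Before = c(83, 102, 57, 72, 68),
                   After  = c(81, 94, 59, 71, 62),
                   ID     = c("A", "B", "C", "D", "E"))
# same result as select = 1:2, but self-documenting
subset(diet, select = c(Before, After))
```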

And of course, we can combine both things16:

> subset(diet, After > 70 & Before > 70, select = 1:2)
  Before After
1     83    81
2    102    94
4     72    71

transform covers both the generation of new variables (as functions of existing ones) and the genuine transformation of existing variables, i.e. the original variables get overwritten. Let’s add the weight loss to the current data frame:

> diet <- transform(diet, WeightLoss = Before - After)
> diet
  Before After ID WeightLoss
1     83    81  A          2
2    102    94  B          8
3     57    59  C         -2
4     72    71  D          1
5     68    62  E          6

We immediately change our mind and want to report the weight loss as a percentage of the original weight. We can modify the new variable:

> diet <- transform(diet, WeightLoss = round(100 * WeightLoss/Before, 1))
> diet
  Before After ID WeightLoss
1     83    81  A        2.4
2    102    94  B        7.8
3     57    59  C       -3.5
4     72    71  D        1.4
5     68    62  E        8.8

Note that both subset and transform are convenience functions for use at the command line and in scripts, but not for serious programming.

5.7 Free-style data: lists

5.7.1 Background

All data structures so far have had some kind of limitation with regard to the type of data they can hold: either all data items have to be of the same type (vectors, matrices), or all items in the same column have to be of the same type (data frames). In contrast, R also has a general-purpose structure that can hold all kinds of data known to R: the list.

As often happens, greater flexibility comes with less specific utility: lists are not especially useful for holding generic tabular analysis data, compared to matrices and data frames. When you are starting out in R with more or less straightforward analyses, you can mostly do without lists. I still introduce them at this point for a number of reasons:

  1. Lists can be very handy for processing group-wise data, or data where a large number of outcome variables is of interest (say high-throughput molecular data).

  2. Parameters for more complex algorithms are often collected in lists and passed to functions for fine-tuning how the algorithms are run.

  3. Most complicated data structures in base R, like hypothesis tests or regression models, are at their core built as lists, with some extra magic for display; the same holds for many complicated data structures outside of base R, e.g. ggplot2-objects are essentially just fancy lists, too. Understanding lists therefore increases understanding of how data and results are handled in R, and allows direct access to results (e.g. the p-value in a t-test).

  4. Finally, lists emphasise that R is by design not just a statistics program, but rather a programming language and environment built on more general ideas about data, processing and structures than simple generic tables.
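To illustrate the third point: the object returned by t.test is a list with, among others, a component named p.value, so we can extract the p-value directly (the data here is just random noise):

```r
tt <- t.test(rnorm(30))   # a one-sample t-test on random data
is.list(tt)               # the result is, at its core, a list
tt$p.value                # direct access to the p-value component
```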

5.7.2 Basic list

A list can be generated by listing any number of R expressions, of any type, as arguments to the function list. So if we want to combine numerical data, character data and the result of a statistical procedure in one handy R object, we can just write

> mylist <- list(1:3, "a", t.test(rnorm(10)))
> mylist
[[1]]
[1] 1 2 3

[[2]]
[1] "a"

[[3]]

    One Sample t-test

data:  rnorm(10)
t = 1.1432, df = 9, p-value = 0.2824
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.4629581  1.4089867
sample estimates:
mean of x 
0.4730143 

In this sense, list() works similarly to c(), except that the components can be of any type. Specifically, this means that within the list, the different components are stored in the order they were originally specified. This is indicated in the output above by the component number written in double brackets [[ (as opposed to the single brackets we have used so far).

It may not come as much of a surprise at this point that we can use these double brackets to access an element of the list by position:

> mylist[[1]]
[1] 1 2 3
> mylist[[2]]
[1] "a"

The important difference from single brackets is that a) we can only access a single element of the list (i.e. no vector indexing) and b) we cannot use logical indexing to extract a matching subset of elements from a list17. However, within these limitations, the double bracket works as one would expect, and specifically allows assignments and modifications. If we want to replace the second element in our toy list with the standard deviation of the first element, this works straightforwardly:

> mylist[[2]] <- sd(mylist[[1]])
> mylist
[[1]]
[1] 1 2 3

[[2]]
[1] 1

[[3]]

    One Sample t-test

data:  rnorm(10)
t = 1.1432, df = 9, p-value = 0.2824
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.4629581  1.4089867
sample estimates:
mean of x 
0.4730143 
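For completeness, single brackets do also work on lists, but they do something different from double brackets: they return a sub-list (possibly of length one) rather than the element itself. A small sketch with a fresh two-element list:

```r
lst <- list(1:3, "a")
lst[1]     # a list of length one, containing the vector 1:3
lst[[1]]   # the vector 1:3 itself
```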

5.7.3 Named lists

As an additional service, mostly for readability, we can also name the components of a list, e.g. by simply specifying the names in the call to list:

> mylist_named <- list(data = 1:3, label = "a", test_result = t.test(rnorm(10)))
> mylist_named
$data
[1] 1 2 3

$label
[1] "a"

$test_result

    One Sample t-test

data:  rnorm(10)
t = 0.58327, df = 9, p-value = 0.574
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.6056709  1.0265049
sample estimates:
mean of x 
 0.210417 

The components of the list are the same, but now they have somewhat informative names, which are displayed in the output above instead of their numbers, with a leading $ instead of the double brackets.

And of course we can use this $-notation to access (and change) the elements of a named list via their name:

> mylist_named$data
[1] 1 2 3
> mylist_named$label <- "A"

Note that a) the plot thickens (as the $ notation is already familiar at this point), and b) we can still use the double bracket as before for named lists:

> mylist_named$label
[1] "A"
> mylist_named[[2]]
[1] "A"

As a matter of fact, we can even use the double bracket with the name:

> mylist_named[["label"]]
[1] "A"

All three notations above for accessing the second element / element with name label are equivalent18.
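Relatedly, the function names lists the component names of a list, and can also be used on the left side of an assignment to change them; a small sketch with a fresh list:

```r
lst <- list(data = 1:3, label = "a")
names(lst)
# [1] "data"  "label"
names(lst)[2] <- "tag"   # rename the second component
names(lst)
# [1] "data" "tag"
```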

5.7.4 Example: data frames

What we have been leading up to is the simple fact that internally, data frames are just lists of vectors of the same length. There is some extra functionality implemented in R to make the row/column-indexing work as for a matrix, but at heart, data frames are somewhat constrained lists:

> is.list(diet)
[1] TRUE

If we strip away all the extra data frame goodness with the function unclass, we see this directly:

> unclass(diet)
$Before
[1]  83 102  57  72  68

$After
[1] 81 94 59 71 62

$ID
[1] "A" "B" "C" "D" "E"

$WeightLoss
[1]  2.4  7.8 -3.5  1.4  8.8

attr(,"row.names")
[1] 1 2 3 4 5

This explains why we have been able to use the $-notation in our investigations so far: we just use the fact that this notation works for lists. As a consequence, we can also write e.g. 

> diet[["Before"]]
[1]  83 102  57  72  68

though this is not very common.


  1. Conceptually, one can think of a vector in R as a contiguous stretch of sequential memory locations that hold the machine representation of the data elements in the vector in the correct order. What with the requirements of memory management, this is not literally true, but it’s close enough when thinking e.g. about vectors and matrices.↩︎

  2. A more general helper function for generating regular integer vectors that can be useful in extracting or modifying data is seq.↩︎
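A few typical calls to seq, as a quick sketch:

```r
seq(2, 10, by = 2)         # 2 4 6 8 10: every second position
seq(0, 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00: fixed length
seq_along(c("a", "b"))     # 1 2: indices along an existing vector
```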

  3. R also has predefined constants T and F that evaluate to TRUE and FALSE, respectively. Note however that anyone can re-define these objects, intentionally or by mistake, with hilarious consequences (e.g. T = FALSE). Best to avoid, IMO.↩︎

  4. Unsurprisingly, these functions are called any and all. They accept a logical vector of any length and return a single logical value, summarizing whether any entry in the argument is true, or whether all entries are true, respectively. The examples above would therefore translate into:

    > any(salt_ht$age > 60 & salt_ht$sbp > 130)
    [1] TRUE
    > all(salt_ht$age < 30 & salt_ht$sex == "female")
    [1] FALSE

    .↩︎

  5. Indeed, the function is.na is (almost) the only legitimate way of checking for the special code NA in R. Should you try to use a comparison operator, as in

    > 3 == NA
    [1] NA
    > NA == NA
    [1] NA

    the expression will simply evaluate to NA again, which is not helpful.↩︎
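A short sketch of the difference in practice:

```r
x <- c(3, NA, 7)
x == NA        # NA NA NA: comparing with NA never yields TRUE or FALSE
is.na(x)       # FALSE  TRUE FALSE: the proper check
sum(is.na(x))  # 1, i.e. the number of missing values in x
```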

  6. It is useful to know that R has two pre-defined objects, LETTERS and letters, which contain exactly what their names suggest; so we could have defined the identifiers here simply as ID = LETTERS[1:5].↩︎

  7. Note that in R, the type of object returned from a bracketing operation on a data frame depends on what index was specified: for a single element, it will always be a vector of length one, of whatever type the corresponding column in the data frame is; when selecting rows only, the operation will always return a data frame. However, when specifying columns only, it depends on whether one column was selected, in which case a vector is returned, or more than one, in which case a data frame is returned. This inconsistency is rarely a problem when doing interactive data analysis, but it can be annoying when writing code, and more database-oriented implementations of rectangular data structures like data.table or tibble will always return an object of the same class when bracketed.↩︎
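A small hypothetical data frame illustrates these return types; the drop = FALSE argument shown last is the standard base-R way to avoid the dropping behavior:

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))  # hypothetical example data

class(df[1, 2])                  # single element: a length-one vector
class(df[1, ])                   # rows only: still a data frame
class(df[, "a"])                 # one column: dropped to a vector
class(df[, c("a", "b")])         # several columns: a data frame
class(df[, "a", drop = FALSE])   # drop = FALSE keeps the data frame
```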

  8. Note that the select argument to subset allows you to use the names of the variables (columns) in a very non-standard, un-R-like way (though written in and fully compatible with R, alas, such is its power\(\ldots\)). We can write vectors of variable names without quotation marks and even use the : for ranges of variables:

    > subset(diet, Before < 80, select = c(Before, ID))
      Before ID
    3     57  C
    4     72  D
    5     68  E
    > subset(diet, Before < 80, select = Before:ID)
      Before After ID
    3     57    59  C
    4     72    71  D
    5     68    62  E

    This is done through some really clever programming, but is rather fragile, and one of the reasons that subset is not recommended for programming use.↩︎

  9. Actually, we can use single brackets with lists, with all the goodness of vector indexing and logical indexing. However, the result is always going to be a list, even if the list only contains a single element:

    > mylist[1:2]
    [[1]]
    [1] 1 2 3
    
    [[2]]
    [1] 1
    > mylist[1]
    [[1]]
    [1] 1 2 3

    which may or may not be what we want (note however that this is consistent behavior: single brackets applied to a list will always return a list, which is not true e.g. for data frames, as outlined above).↩︎

  10. Actually, the double bracket [[ requires the exact name, whereas the $-notation can be abbreviated. On the other hand, you are free to combine named and unnamed items in the same list, as in

    > list(1:3, letters = letters[1:3])
    [[1]]
    [1] 1 2 3
    
    $letters
    [1] "a" "b" "c"

    .↩︎
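To see the difference between exact and partial matching, a sketch (the exact argument of the double bracket can be used to opt back into partial matching):

```r
mylist_named <- list(data = 1:3, label = "A")

mylist_named$lab                      # "A": $ matches "label" partially
mylist_named[["lab"]]                 # NULL: [[ requires the exact name
mylist_named[["lab", exact = FALSE]]  # "A": opt in to partial matching
```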