8 Basic Statistics & Epidemiology

In this chapter, we look at basic statistical measures and methods in R, with special emphasis on tools and concepts commonly used in epidemiology, in particular measures of effect and risk.

This chapter assumes that you can interact with R/RStudio in a basic manner (start the program, load data, perform simple analysis, quit safely) and have a working knowledge of basic data types (numerical, character, factors) and data structures (vectors and data frames). Having gone through the introduction in Chapter 2 should provide the necessary context. This chapter also makes heavy use of add-on packages, so you should also be able to install and load such packages, as outlined in Section @ref{#def-lower-pane}.

The examples in this chapter make use of a classic data set37 that contains information on 189 mother-child pairs recorded at a US hospital. The data file comes as part of the collection of examples accompanying these notes:

> load("Data/birthweights.RData")
> names(bwts)
 [1] "LowBw"          "Age"            "LastWeight"     "Race"           "Smoking"       
 [6] "PrevPremature"  "Hypertension"   "UterineIrritab" "PhysVisits"     "BirthWeight"   

with the birth weight of the infant generally seen as the main outcome (either binary for low weight or continuous in g). The other eight variables are potential risk factors measured on the mothers: age at birth, weight before pregnancy (in lbs), the peculiar ethno-demographic race category popular in the US, smoking during pregnancy, number of previous premature labors, history of hypertension, uterine irritability, and number of physician visits during the first trimester.
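If you want to check the structure and variable types for yourself, the str function gives a compact technical overview of the data frame, with one line per variable (output not shown here):

> str(bwts)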

8.1 Descriptive statistics

8.1.1 In base R

Let’s start with what we know already, the summary function:

> summary(bwts)
 LowBw          Age          LastWeight       Race    Smoking   PrevPremature Hypertension
 no :130   Min.   :14.00   Min.   : 80.0   white:96   no :115   no  :159      no :177     
 yes: 59   1st Qu.:19.00   1st Qu.:110.0   black:26   yes: 74   yes : 24      yes: 12     
           Median :23.00   Median :121.0   other:67             NA's:  6                  
           Mean   :23.24   Mean   :129.8                                                  
           3rd Qu.:26.00   3rd Qu.:140.0                                                  
           Max.   :45.00   Max.   :250.0                                                  
 UterineIrritab   PhysVisits      BirthWeight  
 no :161        Min.   :0.0000   Min.   : 709  
 yes: 28        1st Qu.:0.0000   1st Qu.:2414  
                Median :0.0000   Median :2977  
                Mean   :0.7937   Mean   :2945  
                3rd Qu.:1.0000   3rd Qu.:3487  
                Max.   :6.0000   Max.   :4990  

As we have seen before, this serves as an excellent quality control for, and initial introduction to, a reasonably small data set, with relevant information for all variables: we see about 1/3 low birth weights; maternal ages between 14 and 45 years, with a median of 23 years; a range of weights before pregnancy corresponding to ca. 36 to 113 kg; a lot of smoking during pregnancy (this is old data); previous premature labor is rare, with some missing values; not much hypertension or uterine irritability; and, shockingly, more than half of the women had no physician's visit during the first trimester. The actual birth weights vary from a scary 700 g to a very solid 5 kg, with a median of ca. 3 kg.

Looking at these birth weights specifically, we may be interested in other quantities, too: e.g. we may want to complement the mean weight with the standard deviation, or we may be interested specifically in the lowest birth weights, e.g. the lowest 5% and 10%. In this case, we can use dedicated functions to calculate the statistics of interest, here sd and quantile: we see a standard deviation of more than 700 g, which seems quite large. We also see that 5% of births are under ca. 1800 g, and 10% under just about 2000 g, which seems a lot.

> summary(bwts$BirthWeight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    709    2414    2977    2945    3487    4990 
> sd(bwts$BirthWeight)
[1] 729.2143
> quantile(bwts$BirthWeight, c(0.05, 0.1))
    5%    10% 
1801.2 2038.0 

Moving on to discrete variables, we have already encountered the basic functions table and proportions in FIXME. Applied to the low birth weight indicator in our data, we see the same 59 cases as above, corresponding to 31% of all births.

> tab <- table(bwts$LowBw)
> tab

 no yes 
130  59 
> proportions(tab)

       no       yes 
0.6878307 0.3121693 

The table-function can also be used for cross-tabulating two variables, e.g. low birth weight and smoking during pregnancy.

> table(bwts$Smoking, bwts$LowBw)
     
      no yes
  no  86  29
  yes 44  30

Note that in a situation like this, where both variables have the same levels (yes/no), it is not obvious how to read the result. Fortunately, we can specify row and column names in the call to table, which makes the output far easier to interpret:

> tab_smo <- table(Smoking = bwts$Smoking, LowBwt = bwts$LowBw)
> tab_smo
       LowBwt
Smoking no yes
    no  86  29
    yes 44  30
> proportions(tab_smo, margin = 1)
       LowBwt
Smoking        no       yes
    no  0.7478261 0.2521739
    yes 0.5945946 0.4054054

We see almost the same number of low birth weights in both groups, but overall more non-smoking mothers. This corresponds to about 25% low birth weights among non-smoking mothers, and 40% among smoking mothers.

Though not strictly descriptive, we can also run this table through a \(\chi^2\)-test, and we find that the difference between smoking and non-smoking mothers is indeed (just about) statistically significant at the usual 5% level:

> chisq.test(tab_smo)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tab_smo
X-squared = 4.2359, df = 1, p-value = 0.03958

However, and this is epidemiologically disappointing, we get just a naked p-value, and no corresponding measure of effect strength, like a relative risk, so this is somewhat limited.

Assessment

While this assembly approach to descriptive statistics in base R can produce perfectly respectable results, it does have a couple of shortcomings: firstly, it may take a bit of patience to put together all the pieces, and we have to remember quite a number of different functions for doing so (like min, max, range, IQR etc.). Secondly, the main function summary only offers a fixed set of statistics, which does not include any measure of variability for the data38. Thirdly, there is generally a lack of information about the variability of the descriptive statistics themselves, like standard errors or confidence intervals39. And finally, base R has very little40 to show in terms of basic epidemiological descriptives like risk ratios or odds ratios.
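To illustrate the first point, here is a minimal sketch of the assembly approach for the birth weights, one function call per statistic (all functions shown are base R; output omitted, as all values appear in the descriptives above):

> c(mean = mean(bwts$BirthWeight),
+   sd   = sd(bwts$BirthWeight),
+   IQR  = IQR(bwts$BirthWeight),
+   min  = min(bwts$BirthWeight),
+   max  = max(bwts$BirthWeight))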

But that is the beauty of R as a platform for programming and code sharing: it does not have to do everything itself, because it allows users to implement and distribute additional functionality - if people don’t like the base R descriptives, they can roll their own - and they have! There are thousands of packages on CRAN that provide functions for calculating descriptive statistics, from the common (e.g. coefficient of variation) to the more specialized (e.g. winsorized means) to the obscure (e.g. trimmed L-moments).

Packages

For the purpose of these notes, I have selected three complementary packages that address the potential shortcomings of base R listed above to a useful degree:

  • summarytools, which offers flexible general descriptives, as seen in the rest of this section,
  • DescTools, which offers confidence intervals for a wide range of statistics,
  • epitools, which calculates epidemiological risk measures. FIXME: references

Importantly, these are not canonical solutions - you should absolutely explore the R package space for alternatives if they do not fulfill your needs (and maybe even if they mostly do).
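None of these packages ships with base R, but all three (plus package pander, used in Section 8.5 below) are available on CRAN and can be installed in one go:

> install.packages(c("summarytools", "DescTools", "epitools", "pander"))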

8.1.2 Using package summarytools

In contrast to base summary, summarytools splits the job of calculating numerical descriptives into two separate functions: descr for continuous variables and freq for discrete variables.

Let’s start by loading the package and looking at the descriptives for the birthweights:

> library(summarytools)
> descr(bwts$BirthWeight)
Descriptive Statistics  
bwts$BirthWeight  
N: 189  

                    BirthWeight
----------------- -------------
             Mean       2944.59
          Std.Dev        729.21
              Min        709.00
               Q1       2414.00
           Median       2977.00
               Q3       3487.00
              Max       4990.00
              MAD        834.70
              IQR       1073.00
               CV          0.25
         Skewness         -0.21
      SE.Skewness          0.18
         Kurtosis         -0.14
          N.Valid        189.00
        Pct.Valid        100.00

By default, we get everything base summary offers (mean, median, min/max, quartiles), but we also get the standard deviation, right below the mean. We also get two additional measures of variability further down, the interquartile range (IQR) and the median absolute deviation (MAD): like the standard deviation, these are non-negative measures of dispersion, with larger values implying larger dispersion, in the same units as the underlying variable41 (so for this example, reported values are in g); however, these measures are more robust to outliers, and can be useful for comparing variability between noisy data sets.
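Both robust measures are also available as standalone functions in base R, IQR and mad. Note that mad by default scales the raw median absolute deviation by the constant 1.4826, which makes it directly comparable to the standard deviation for normally distributed data; judging from the value of 834.70 reported above, this appears to be the convention used by descr as well. A quick check (output omitted, but the values match the table above):

> IQR(bwts$BirthWeight)   # 1073, as reported by descr
> mad(bwts$BirthWeight)   # ca. 835, as reported by descr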

descr also displays numerical descriptives for the shape of the distribution: skewness measures the asymmetry, with values around zero indicating approximate symmetry, large negative values indicating a long left tail, and large positive values indicating a long right tail in the data distribution. (Excess) kurtosis measures the presence of very large or very small observations in the tails of the distribution, relative to a normal distribution, with negative values indicating fewer extreme values, and positive values indicating more extreme values42. Note that while these measures may be somewhat useful in comparing distributions, they cannot replace inspecting the actual data for shape and outliers (e.g. via a histogram).
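Following that advice, a histogram of the actual data is a single function call in base R:

> hist(bwts$BirthWeight)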

Looking at the distribution of low birth weights, we see that freq generates a very comprehensive frequency table, which by default reports missing values and several types of percentages, in a manner not dissimilar to SAS PROC FREQ:

> freq(bwts$LowBw)
Frequencies  
bwts$LowBw  
Type: Factor  

              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
----------- ------ --------- -------------- --------- --------------
         no    130     68.78          68.78     68.78          68.78
        yes     59     31.22         100.00     31.22         100.00
       <NA>      0                               0.00         100.00
      Total    189    100.00         100.00    100.00         100.00

Both descr and freq allow you to configure which statistics to display. For example, we can display a shorter list of descriptive statistics, or suppress the missing value report, which generates more compact output:

> descr(bwts$BirthWeight, stats = "common")  # Fewer statistics
Descriptive Statistics  
bwts$BirthWeight  
N: 189  

                  BirthWeight
--------------- -------------
           Mean       2944.59
        Std.Dev        729.21
            Min        709.00
         Median       2977.00
            Max       4990.00
        N.Valid        189.00
      Pct.Valid        100.00
> freq(bwts$LowBw, report.nas = FALSE)  # Do not report missing values
Frequencies  
bwts$LowBw  
Type: Factor  

              Freq        %   % Cum.
----------- ------ -------- --------
         no    130    68.78    68.78
        yes     59    31.22   100.00
      Total    189   100.00   100.00

Details can be found in the documentation, e.g. via ?descr, the tool tips in RStudio, or vignette("introduction", package = "summarytools").

Both descr and freq can also be applied to whole data frames, as a kind of drop-in replacement for base summary, though descr will only report numerical variables, and freq only discrete variables43. We can combine these options to e.g. produce a nice compact table of means and standard deviations for all numeric variables:

> descr(bwts, stats = c("mean", "sd"))
Descriptive Statistics  
bwts  
N: 189  

                  Age   BirthWeight   LastWeight   PhysVisits
------------- ------- ------------- ------------ ------------
         Mean   23.24       2944.59       129.81         0.79
      Std.Dev    5.30        729.21        30.58         1.06

Note that both descr and freq do not just print to the console, but return the statistics as an R object that can be stored and further processed; e.g. we can take the calculated statistics and strip them of all decorative elements by turning them into a data frame for further plotting or display:

> mnsd <- descr(bwts, stats = c("mean", "sd"))
> as.data.frame(mnsd)
              Age BirthWeight LastWeight PhysVisits
Mean    23.238095   2944.5873  129.81481  0.7936508
Std.Dev  5.298678    729.2143   30.57938  1.0592861

FIXME: stby, groupwise processing; pander styles

8.2 Confidence intervals

Base R is not great with confidence intervals - some test functions like t.test and fisher.test sometimes include confidence intervals, but these are very limited and inflexible; and while base R includes the function confint, this only works for regression models, and I would rather not re-formulate simple confidence intervals for means or proportions as regression problems.
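For reference, the base R route to a confidence interval for a simple mean runs through the one-sample t-test, extracting the interval from the test result; this works, but does not generalize well beyond the few statistics with a corresponding test function:

> t.test(bwts$BirthWeight)$conf.int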

Package DescTools offers a family of functions for calculating simple and not-so-simple confidence intervals directly from the observed data44. Starting with the birth weights again, we can use MeanCI to calculate a 95% confidence interval for the mean birth weight:

> library(DescTools)
> MeanCI(bwts$BirthWeight)
    mean   lwr.ci   upr.ci 
2944.587 2839.952 3049.222 

For this example, the confidence interval is fairly tight, just about \(\pm\) 100 g around the mean.

BinomCI calculates proportions plus confidence intervals based on the frequency of events. Here we have to specify the number of events (successes) as the first argument, and the number of attempts (records) as the second; e.g. we know from our initial exploration of the birth weight data set that we have 59 low birth weights out of 189 live births, so we can calculate the proportion as

> BinomCI(x = 59, n = 189)
           est    lwr.ci    upr.ci
[1,] 0.3121693 0.2504031 0.3814188

So we get a proportion of ca. 31% with 95% confidence interval [25%, 38%]. We can also combine the call to BinomCI with a call to table and get the results in one go:

> BinomCI(table(bwts$LowBw), n = nrow(bwts))
          est    lwr.ci    upr.ci
no  0.6878307 0.6185812 0.7495969
yes 0.3121693 0.2504031 0.3814188

Here we use the function nrow to extract the number of rows (records) in data frame bwts. As the table function returns counts for both normal and low birth weights in the data, BinomCI returns proportions and confidence intervals for both (though as they are complementary, we would generally only report one of them).

Both MeanCI and BinomCI allow different confidence levels: if we are willing to accept a bit more uncertainty, e.g. at a 90% confidence level, we get slightly narrower confidence intervals, e.g. here for the mean:

> MeanCI(bwts$BirthWeight, conf.level = 0.9)  # Default: 0.95 (as is tradition)
    mean   lwr.ci   upr.ci 
2944.587 2856.908 3032.267 

Both functions also support a method-argument that allows you to specify different ways of calculating the confidence interval. For MeanCI, these methods are

  • classic, which calculates a conventional confidence interval based on the t-distribution45

  • boot, which calculates bootstrap confidence intervals.

Bootstrapping46 is a re-sampling based method for calculating standard errors and confidence intervals which does not rely on exact or approximate assumptions47 about the underlying distribution - this can be useful in situations where the usual assumptions may be suspect (e.g. small sample sizes, asymmetric data distributions for a t-test). For our example, we find that the classic and the bootstrap confidence intervals are very close, suggesting that our assumptions for the classic case are probably justified.

> MeanCI(bwts$BirthWeight, method = "boot")  # Bootstrap CI
    mean   lwr.ci   upr.ci 
2944.587 2835.090 3050.450 

BinomCI offers 16 different ways of calculating a confidence interval for a proportion (though no bootstrap option). In practice, the different versions will generally differ very little48, and the default method (wilson) will do nicely. E.g. this is the exact confidence interval for our example:

> BinomCI(tab, n = nrow(bwts), method = "clopper")  # exact CI
          est    lwr.ci    upr.ci
no  0.6878307 0.6165454 0.7531114
yes 0.3121693 0.2468886 0.3834546

This is extremely close to the approximate confidence interval above.

We can use the apropos function to list all functions whose name ends in CI using the following expression:

> apropos("CI$", ignore.case = FALSE)
 [1] "BinomCI"      "BinomDiffCI"  "BinomRatioCI" "BootCI"       "CoefVarCI"    "CorCI"       
 [7] "MADCI"        "MeanCI"       "MeanDiffCI"   "MedianCI"     "MultinomCI"   "PlotDotCI"   
[13] "PoissonCI"    "QuantileCI"   "VarCI"       

We see that DescTools offers a wide range of ready-made functions for confidence intervals for common descriptives, including the median, quantiles, correlation etc. This set of tools can be extended to (almost) any descriptive statistic of the data using the function BootCI, which allows you to calculate a bootstrap confidence interval for any function of one or two variables in your data set, including functions that you define yourself (Section 6.3).
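As a hedged sketch of how this might look (the argument names FUN and R follow my reading of the DescTools documentation; check ?BootCI before relying on them), here is a bootstrap confidence interval for the 10% trimmed mean of the birth weights, a statistic without a ready-made *CI function of its own:

> BootCI(bwts$BirthWeight,                       # data vector
+        FUN = function(x) mean(x, trim = 0.1),  # user-defined statistic
+        R = 999)                                # number of bootstrap samples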

8.3 Statistical tests

UNDER CONSTRUCTION

8.4 Epidemiological risk measures

Package epitools contains some specialized tools which are useful in epidemiological data analysis, but rarely well supported in general statistical software, including base R, e.g. for handling dates or for calculating age-adjusted rates. The main functionality, however, is for calculating three classic epidemiological risk measures, namely risk ratios, odds ratios and event rates, from simple tables of outcome vs exposure levels. In applications, these estimates will generally only be a starting point, as they are not adjusted for potential confounding - that will require at least some kind of regression model which can accommodate both an exposure variable and as many confounding variables as necessary, something that is actually well supported in R. However, these crude risk estimates are a natural part of the initial descriptive phase of an analysis, and epitools offers a convenient interface for them.

The main function for this is epitab. Assuming that we have already generated a table of the exposure-outcome association beforehand, with the exposure levels as rows and the (two) outcome levels as columns, we can just feed this table to epitab:

> library(epitools)
> tab_smo  ## Low birth weight vs smoking from above
       LowBwt
Smoking no yes
    no  86  29
    yes 44  30
> epitab(tab_smo)
$tab
       LowBwt
Smoking no        p0 yes        p1 oddsratio   lower    upper   p.value
    no  86 0.6615385  29 0.4915254  1.000000      NA       NA        NA
    yes 44 0.3384615  30 0.5084746  2.021944 1.08066 3.783112 0.0361765

$measure
[1] "wald"

$conf.level
[1] 0.95

$pvalue
[1] "fisher.exact"

This output is somewhat different from what we have seen so far, in that it is a list containing both the results of interest (as element $tab) and some parameter settings for the calculation of the results (see Section 5.7 for more about lists). We can easily extract the main result using the $-notation, in the same way as for data frames:

> epitab(tab_smo)$tab
       LowBwt
Smoking no        p0 yes        p1 oddsratio   lower    upper   p.value
    no  86 0.6615385  29 0.4915254  1.000000      NA       NA        NA
    yes 44 0.3384615  30 0.5084746  2.021944 1.08066 3.783112 0.0361765

This output shows first the frequencies of the exposure-outcome combinations, and the proportional split of the exposure levels within each outcome level (e.g. here, we have 66% non-smoking mothers vs 34% smoking mothers for the births where the infant did not have low birth weight). This is followed by the actual odds ratio (the default risk measure) with a confidence interval and an associated p-value for the null hypothesis that the true odds ratio is one: we see an odds ratio of ca. 2, with a fairly wide confidence interval [1.1, 3.8] and a marginally statistically significant p-value of 0.036. This is clearly a much more appealing summary than the simple table plus \(\chi^2\)-test we have seen in Section 8.1.1 above.

If we want to look at the risk ratio instead of the odds ratio (feasible, as this is a cross-sectional cohort design, not a case-control design), we just specify the corresponding method-argument:

> epitab(tab_smo, method = "riskratio")$tab  # RR
       LowBwt
Smoking no        p0 yes        p1 riskratio    lower    upper   p.value
    no  86 0.7478261  29 0.2521739  1.000000       NA       NA        NA
    yes 44 0.5945946  30 0.4054054  1.607642 1.057812 2.443262 0.0361765

We find a risk ratio of ca. 1.6, with a confidence interval from 1.06 to 2.4, and the same p-value as before (because the same test is used). This is a bit smaller than the odds ratio, but leads to the same conclusion, namely that the risk of a low-weight birth is increased in mothers smoking during pregnancy.

A short demonstration that epitab also works with more than two exposure levels:

> tab_r <- table(Race = bwts$Race, LowBw = bwts$LowBw)
> epitab(tab_r)$tab
       LowBw
Race    no        p0 yes        p1 oddsratio     lower    upper    p.value
  white 73 0.5615385  23 0.3898305  1.000000        NA       NA         NA
  black 15 0.1153846  11 0.1864407  2.327536 0.9385073 5.772385 0.08433263
  other 42 0.3230769  25 0.4237288  1.889234 0.9554577 3.735597 0.08111446

8.5 Improved display of descriptives

The descriptive statistics we have displayed so far in this chapter have been shown as they would appear in the R console, as pure text. This is perfectly reasonable for interactive work, but it is not really suitable for a report or a manuscript: when we e.g. compile a script from within RStudio, as described in Section 4.4, these results will be shown as unattractive blobs of unformatted text in the resulting .pdf or .docx files, and will require significant manual clean-up and editing to be presentable. We will talk extensively about how to write scripts that generate publication-quality tabular output in Chapters 13 and 14, but here I just want to introduce a quick solution that improves display quality immensely with almost no effort, namely the package pander.

The main function in this package is also called pander, with simple default usage: just wrap any output you want to show in a compiled script into a call to pander. At the console, this just produces a slightly different text-based output:

> library(pander)
> pander(descr(bwts, stats = c("mean", "sd")))

-------------------------------------------------------------
   &nbsp;       Age    BirthWeight   LastWeight   PhysVisits 
------------- ------- ------------- ------------ ------------
  **Mean**     23.24      2945         129.8        0.7937   

 **Std.Dev**   5.299      729.2        30.58        1.059    
-------------------------------------------------------------

However, in a compiled script or manuscript, like this document, this will be translated into an actual table object:

> pander(descr(bwts, stats = c("mean", "sd")))
             Age   BirthWeight   LastWeight   PhysVisits
  Mean      23.24      2945         129.8       0.7937
  Std.Dev   5.299      729.2        30.58       1.059

This does look ok-ish in its own right, and if you want to modify the appearance, you can either do this on the R side, by adding arguments to pander (as described in ?pandoc.table), or by editing the resulting .html or .docx file (which is substantially less work than working on raw unformatted text).
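As a small sketch of the R-side approach, assuming the caption and round arguments of pandoc.table (see ?pandoc.table for the full set of options):

> mnsd <- descr(bwts, stats = c("mean", "sd"))
> pander(as.data.frame(mnsd), caption = "Means and standard deviations", round = 1)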

The advantage of pander is that it will work with a very wide range of R objects, including all the descriptive statistics we have seen in this chapter:

> pander(summary(bwts))
Table continues below

  LowBw     Age             LastWeight      Race       Smoking   PrevPremature
  no :130   Min.   :14.00   Min.   : 80.0   white:96   no :115   no  :159
  yes: 59   1st Qu.:19.00   1st Qu.:110.0   black:26   yes: 74   yes : 24
  NA        Median :23.00   Median :121.0   other:67   NA        NA's:  6
  NA        Mean   :23.24   Mean   :129.8   NA         NA        NA
  NA        3rd Qu.:26.00   3rd Qu.:140.0   NA         NA        NA
  NA        Max.   :45.00   Max.   :250.0   NA         NA        NA

  Hypertension   UterineIrritab   PhysVisits       BirthWeight
  no :177        no :161          Min.   :0.0000   Min.   : 709
  yes: 12        yes: 28          1st Qu.:0.0000   1st Qu.:2414
  NA             NA               Median :0.0000   Median :2977
  NA             NA               Mean   :0.7937   Mean   :2945
  NA             NA               3rd Qu.:1.0000   3rd Qu.:3487
  NA             NA               Max.   :6.0000   Max.   :4990
> pander(MeanCI(bwts$BirthWeight))
  mean   lwr.ci   upr.ci
  2945    2840     3049
> pander(freq(bwts$LowBw))
          Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
  no       130     68.78          68.78     68.78          68.78
  yes       59     31.22            100     31.22            100
  <NA>       0        NA             NA         0            100
  Total    189       100            100       100            100
> pander(chisq.test(tab_smo))
Pearson’s Chi-squared test with Yates’ continuity correction: tab_smo

  Test statistic   df   P value
           4.236    1   0.03958 *
> pander(epitab(tab_smo))
  • tab:

            no       p0   yes       p1   oddsratio   lower   upper   p.value
      no    86   0.6615    29   0.4915           1      NA      NA        NA
      yes   44   0.3385    30   0.5085       2.022   1.081   3.783   0.03618
  • measure: wald

  • conf.level: 0.95

  • pvalue: fisher.exact

8.6 Next steps

For more on lists and other complex R objects, as seen in the epitab-output, see Chapter 5. For more in-depth modelling of exposure-outcome associations that can also account for confounding variables, see Chapters 9 and 10. Graphical descriptives as complement to the numerical descriptives here are discussed in Chapters 11 and 12. More on creating attractive tables in scripts in Chapter 13, and more on integrating R results into reports and manuscripts in Chapter 14.

CRAN has a task view (a curated collection of R packages) on the subject of epidemiology at https://cran.r-project.org/web/views/Epidemiology.html. Note however that most of the packages are very specifically intended for infectious disease epidemiology, which may or may not be your cup of tea. Still, the task view includes epitools and Epi (https://cran.r-project.org/web/packages/Epi/) as core packages which offer useful tools for general epidemiology.

graphPAF (https://cran.r-project.org/web/packages/graphPAF/) implements a general approach to estimating population attributable fractions.


  37. This is the same data as birthwt in package MASS, so help("birthwt", package = "MASS") will show extra information and references. Note however that our data set is much more nicely formatted (as an exercise, consider the R commands you would use to transform the data set in MASS to the one we are using).↩︎

  38. Ok, so technically, you can kind of read the interquartile range as the difference between the 75% and the 25% quantiles for continuous variables, as a robust counterpart to the standard deviation, much like the median is the robust counterpart to the mean, but frankly, that’s not great.↩︎

  39. Ok, so technically, we do move out of the region of purely descriptive statistics and into the region of statistical inference when we start talking about standard errors and confidence intervals. IMO, the general usefulness of these tools, especially for simple means and proportions, far outweighs the risks and burdens of having to keep in mind some very generic sampling model for the data, but the base R approach is not completely crazy either.↩︎

  40. Nothing, really; so yes, fisher.test reports an odds ratio for 2x2 tables, but seriously?!↩︎

  41. Unlike the variance, which would be \(g^2\).↩︎

  42. For a technical discussion and some great examples for what kurtosis actually captures, see Westfall, Kurtosis as Peakedness, 1905 – 2014. R.I.P., American Statistician, 2014, especially Figure 2 and Table 1.↩︎

  43. Or what freq thinks is discrete, which by default is every variable with no more than 25 distinct values - in our birth weight example, this would include e.g. the mother’s age at birth, which may not be intended; this can be modified via argument freq.ignore.threshold to function st_options.↩︎

  44. As well as many other graphical and numerical descriptives - if you are not content with summarytools, this is not a bad place to look for alternatives, see e.g. vignette("DescToolsCompanion").↩︎

  45. Respectively, based on the standard normal distribution if you specify a fixed (“known”) standard deviation, corresponding to what is taught as a z-test in many introductory statistics courses.↩︎

  46. Should you be interested, this blog post is a quite readable introduction to the idea, and this online book chapter offers a slightly more formal description.↩︎

  47. Like invoking the central limit theorem to argue that the two means we want to compare via a t-test have an approximately normal sampling distribution.↩︎

  48. Except when the number of events and/or records is so low that the different approximations do not work properly, in which case the exact Clopper-Pearson intervals are a good (if conservative) choice; see ?BinomCI for references and a brief discussion of the methods if you are concerned.↩︎