part | item | cond | resp |
---|---|---|---|
part_20 | item_46 | c2 | 6.55 |
part_16 | item_34 | c4 | 6.68 |
part_16 | item_6 | c2 | 6.38 |
part_1 | item_45 | c3 | 6.58 |
part_18 | item_25 | c2 | 6.49 |
Bayesian estimation in multiple comparisons
Submitted
1 Introduction
In recent years, studies in second language acquisition have slowly reduced their use of traditional ANOVAs (and potential post-hoc tests) and have instead adopted a wider range of methods, including mixed-effects regression models (e.g., Plonsky & Oswald, 2017). This is certainly good news, as the field has long relied on a narrow range of statistical methods (e.g., Plonsky, 2014). This change is especially welcome when we consider how often categorical responses are transformed into percentages, for example, to be later analyzed using ANOVAs (Jaeger, 2008). While an ANOVA is technically just a special type of (linear) regression, their outputs and goals tend to be interpreted slightly differently: while an ANOVA is mostly focused on differences between means of levels of a given factor (say, L1 group), a typical linear regression is focused on estimating (mean) responses, hence the different outputs from both methods. This difference might explain why analyses that employ ANOVAs sometimes carry out multiple comparisons to verify where specific differences may originate in their data. This practice tends to be accompanied by some type of \(p\)-value adjustment to reduce the rate of type I error.
In this paper, I explore a scenario where an analysis is based on a mixed-effects model, but where multiple comparisons can be useful and informative given our research question. More specifically, I show how to generate multiple comparisons using Bayesian estimation from a mixed-effects model. Besides the various advantages of Bayesian data analysis in general (Section 3.1), the context of multiple comparisons is especially instructive and beneficial as no \(p\)-value correction is needed once we assume non-flat priors (e.g., centred around zero) and a hierarchical structure that leads to shrinkage, naturally yielding more reliable estimates (Section 2). Finally, this paper is accompanied by an OSF repository (https://osf.io/37u56/) that contains all the necessary files to reproduce the analysis below.
2 A typical scenario
The data used in this paper is hypothetical: its structure is adapted from the danish
data set presented in Balling & Baayen (2008), which can be found in the languageR
package (Baayen, 2009). Noise has been added to a subset of the data such that the observations examined in the present paper do not reflect the original data set. Our data set d
, shown in Table 1, simulates a common scenario in language acquisition studies, where we are interested in a condition with multiple levels of interest (\(n = 5\)). By design, we only have 22 participants and 49 items. This situation emulates a small sample size when we consider the number of items per participant—this is certainly not uncommon in L2 or L3 studies (Cabrelli-Amaro et al., 2012; Garcia, 2023).
Our data set only has four (self-explanatory) variables: part
, item
, cond
and resp
. Ultimately, we want to verify if participants’ (part
) responses (resp
) are affected by our five conditions (cond
) taking into account the expected variation found across items (item
). There are five hypothetical conditions in the data set: c1
to c5
. We will start our analysis by visually exploring the patterns in the data, shown in Fig. 1.
Our response variable is continuous and approximately Gaussian (otherwise we should avoid box plots), so we can rely on our familiar parameters (e.g., our means for each condition) for statistical inference. Typically, we would now run a linear regression where cond
is our predictor variable using the lm()
function. By default, R will choose c1
as our reference level, which in turn means that the intercept (\(\hat\beta_0\)) of our model will represent our first condition, against which all other conditions will be compared (slopes; \(\hat\beta\)). The model in question is run in Code block 1 and its results are shown below in Table 2.
resp | |||
Predictors | Estimates | CI | p |
(Intercept) | 6.81 | 6.78 – 6.83 | <0.001 |
cond [c2] | -0.05 | -0.09 – -0.01 | 0.008 |
cond [c3] | 0.05 | 0.01 – 0.09 | 0.007 |
cond [c4] | 0.02 | -0.02 – 0.05 | 0.361 |
cond [c5] | -0.06 | -0.10 – -0.02 | 0.001 |
Observations | 1040 | ||
R^{2} / R^{2} adjusted | 0.045 / 0.041 |
The reader can verify that the linear regression just run is essentially an ANOVA, indicated in a comment in Code block 1. Nevertheless, the outputs of aov()
and lm()
(both of which are printed using the summary()
function) differ substantially. In many situations, choosing a particular level as reference and comparing all other levels to the intercept makes perfect sense. Sometimes, however, comparing multiple levels is exactly what we want in examining language acquisition data (e.g., proficiency levels, learners who belong to different L1 groups, delayed post-tests, etc.). This situation used to be quite common in the 90s and early 2000s, and was addressed by running ANOVAS and subsequently using a post-hoc test to locate the source(s) of an effect. Such a test would also adjust p-values in order to reduce the probability of type I error, also known as the familywise error rate (Tukey, 1953). Back then, the focus tended to be on p-values (null hypothesis significance testing), not on the estimation of effect sizes and confidence intervals, often referred to as “new statistics” (Cumming, 2014). In our context, having five conditions would mean ten pairwise comparisons. Assuming \(\alpha = 0.05\), the probability of making at least one type I error in these comparisons would be approximately 40% (\(1 - 0.95^{10}\)), which is obviously problematic.
One common method to adjust \(p\)-values is the Bonferroni correction, whereby \(p\)-value thresholds are divided by the number of comparisons. In our case, instead of using \(\alpha = 0.05\), we would use \(\alpha = 0.05/10 = 0.005\) as our threshold. Needless to say, this would reduce the statistical power in our analysis, making it much harder to reject the null hypothesis. Another method of correction is Tukey’s honest significance test, which compares the observed difference between means to a critical value based on the variability within groups, allowing us to determine if the observed difference is statistically significant. Let us examine one example of multiple comparisons using the TukeyHSD()
function in R—see also packages such as multcomp
(Hothorn et al., 2008) and emmeans
(Lenth, 2024) for multiple comparisons in general parametric models. Results are shown in Table 3.
diff | lwr | upr | p adj |
---|---|---|---|
-0.050 | -0.102 | 0.001 | 0.061 |
0.053 | 0.000 | 0.107 | 0.053 |
0.017 | -0.035 | 0.069 | 0.892 |
-0.061 | -0.113 | -0.009 | 0.012 |
0.104 | 0.050 | 0.157 | 0.000 |
0.068 | 0.016 | 0.119 | 0.003 |
-0.011 | -0.062 | 0.041 | 0.980 |
-0.036 | -0.089 | 0.017 | 0.351 |
-0.115 | -0.168 | -0.061 | 0.000 |
-0.079 | -0.130 | -0.027 | 0.000 |
Based on the multiple comparisons and the adjusted p-values, we can see that certain differences yield an adjusted p-value that is greater than 0.05. Two examples are c2-c1
and c3-c1
. It is interesting (but not surprising) to note that, had we decided a priori to only look at c2
and c1
conditions, we would find that a significant difference exists between them, which would lead us to reject the null hypothesis. Here, due to the multiple comparisons, we fail to reject the null. As we can see, our intentions may affect our conclusions (Kruschke & Liddell, 2018) exactly because adjustments are expected if we compare multiple levels in our factor—this specific situation is discussed in Lima Jr & Garcia (2021) for the danish
data set (Balling & Baayen, 2008).
Ultimately, if we do want multiple comparisons, using the aov()
and TukeyHSD()
functions is quite practical. There are, however, important problems with this approach. First and foremost, ANOVAs are overall much more limited than full-fledged regression models, especially when we consider the importance of mixed-effects (Section 4.1)—see Winter (2019) and Garcia (2021) for a comprehensive discussion. Furthermore, the use of ANOVAs becomes problematic if we examine categorical response variables, which are often modelled as percentages (see Jaeger (2008) on why we should not do that). Second, because analyses based on ANOVAs tend to be more strongly centred around \(p\)-values (not for technical reasons), adjusting them may lead us to numerous false negatives (because type I errors are typically seen as more problematic than type II errors, this is not usually judged as a serious issue by many given the alternative).
By using mixed-effects models (also known as hierarchical or multilevel models), we can address these (and other) issues. First, these models take into account the (often) hierarchical structure of our data. For example, we know that multiple data points from a given participant tend to be more similar relative to data points from other participants. This violates the independence assumption, which in turn can (and often does) affect the rates of type I error (e.g., Winter, 2011). Overall, we expect to see considerable variation across items and participants, so ignoring random effects is almost never a good idea (Barr et al., 2013). Second, by using mixed-effects models, not only do we not violate the independence assumption, we also drastically reduce the issue of multiple comparisons, since point estimates are shifted towards each other (shrinkage)—see Gelman & Tuerlinckx (2000), Gelman et al. (2012) and Kruschke (2015, pp. 567–568). This is very different from the typical methods of correcting \(p\)-values, in which estimates are stationary and only confidence intervals (or \(p\)-values themselves) are modified (e.g., Bonferroni correction).
The goal of the present paper is to demonstrate how to generate multiple comparisons from a hierarchical model by using Bayesian estimation (e.g., McElreath, 2020). As will be discussed below, Bayesian data analysis offers a wide range of advantages, and the method suggested here provides the reader with a flexible and powerful tool for statistical inference. Crucially, a hierarchical Bayesian model practically eliminates the problem of multiple comparisons, which in turn offers a valuable benefit to researchers.
3 Methods
In this section, I provide a brief introduction to Bayesian data analysis and review the packages that will be used in Section 4. I assume some level of familiarity with the R language (R Core Team, 2024) and with the tidyverse
package (Wickham et al., 2019).
3.1 A brief review of Bayesian data analysis
This section will focus on the minimum amount of information needed on Bayesian data analysis for the remainder of this paper. With such an immense topic, anything more than the basics would be unrealistic given the scope of the present paper. A user-friendly introduction to the topic is provided in Kruschke & Liddell (2018), Garcia (2021, ch. 10) and Garcia (2023). Comprehensive introductions to Bayesian data analysis are provided in McElreath (2020) and Kruschke (2015).
A typical statistical test or model generates the probability of observing data that is at least as extreme as the data we have given a particular statistic (e.g., \(\hat\beta\)) under the null hypothesis. This is exactly the definition of p-values. Simply put, we are after \(P(data|\pi)\), where \(\pi\) is any given statistic of interest. In this framework, we think of “probability” as the frequency of an event in a large sample—hence the name Frequentist and the importance of larger samples.
In a Bayesian framework, we are given the probability of a parameter given the data, i.e., the opposite of what we get in a Frequentist framework. In other words, we are given \(P(\pi|data)\), which allows us to estimate the probability of a given hypothesis. The essence of Bayesian data analysis comes from Bayes’ theorem (Bayes, 1763; also see McGrayne, 2011), shown below. The theorem states that the posterior, \(P(\pi|D)\), is equal to the likelihood of the data, \(P(D|\pi)\), times the prior \(P(\pi)\), divided by the marginal likelihood of the data, \(P(D)\).
\[P(\pi|D) = \frac{P(D|\pi) \times P(\pi)}{P(D)}\]
One crucial component in Bayes’ theorem is the prior, which allows us to incorporate our expectations into the model. These expectations are informed by the literature: if we know about a particular phenomenon, it is just natural that this knowledge should be included in a model. Not only is this aligned with the accumulation of knowledge assumed in science (e.g., Brand et al., 2019), but it also bridges a common gap between our theoretical assumptions and our data analysis. This is especially important in second language studies, given that we never assume that learners’ grammars start from scratch. Even though the notion of priors is intuitive, some may argue that priors can bias our model. This can be true: if we force our priors to be where we want them to be with a high degree of certainty, the posterior distribution will be strongly biased towards our priors. This, however, is not how priors are supposed to work—see Gelman (2008) on common objections to Bayesian data analysis and Gelman & Hennig (2017) on the notions of objectivity and subjectivity in statistics. Priors must be informed by the literature and by previous experiments. Furthermore, our certainty surrounding prior distributions must reflect the state of knowledge surrounding the object being examined. If we have over one hundred studies showing that an effect size is between 1.2 and 1.4, we could certainly assume a prior distribution centred around 1.3. Considering a Gaussian prior distribution, we could even set its standard deviation to a low number to incorporate the high degree of precision in the literature in this scenario.
Bayesian models estimate effects by sampling from the posterior distribution by using a wide range of algorithms, not by solving the theorem above analytically (which is computationally impossible even for relatively simple models). Think of this sampling as a “random walk” in the parameter space—this is indeed the typical metaphor employed for Markov-Chain Monte Carlo methods, which are commonly associated with posterior sampling. Suppose we are interested in the difference in means between two groups. Our model will consider a multitude of possible values for this difference. Each value will yield a probability: “take value 0.7 and calculate the probability of observing this value given the data being modelled”. After enough values are considered (i.e., iterations), the sampling algorithm will have visited values that are more likely (given the data) more frequently. This clustering gives us our posterior distribution for the parameter in question, i.e., the difference between the two groups. In reality, we will use a language called Stan (Carpenter et al., 2017), which employs a technique called Hamiltonian Monte Carlo. A “random walk” is not the best way to describe HMC, but the idea is still valid for our purposes here. Fig. 2 illustrates the end of a random walk for two parameters in a regression model, \(\hat\beta_0\) and \(\hat\beta\), which results in a joint posterior distribution where the most visited values for both parameters in question are at the peak of the distribution.
You may be wondering how we can know that the model has correctly converged onto the right parameter space, i.e., the right parameter values given the data. Our algorithms use multiple chains to explore said space. If all chains (typically four) converge by the end of the sampling, we know the model has converged. We can check this convergence in different ways. For example, we could visually inspect the chains to see if they have clustered in a particular region.
In this paper, we will use an intuitive R package to run our models, so you do not have to worry about learning the specifics of Stan to be able to run and interpret Bayesian models. Interestingly, on the surface, our models will look quite similar to their Frequentist counterparts. Deep down, however, they work in very different ways, and their interpretation is (fortunately) much more intuitive than what we are used to in Frequentist models. First, p-values no longer exist. Second, effects are not given as point-estimates. Rather, they are entire probability distributions of credible effects given the data, which substantially affects how we interpret and visualize our models’ results, as we will see later.
3.2 Data and packages
In Code block 3, you will find the data file as well as the necessary packages to reproduce the analysis in the paper. You will need to install all the packages below before proceeding. The package brms
(Bürkner, 2016) allows us to run Bayesian models by using the familiar syntax from functions such as lm()
and glm()
. I will assume that you have already used lme4
(Bates & Maechler, 2009) to run hierarchical models. You do not necessarily need this package for the remainder of the paper, but the package is here in case you wish to run equivalent models using lmer()
. Finally, tidybayes
(Kay, 2020) and bayestestR
(Makowski et al., 2019) will help us work with posterior samples.
4 Analysis
4.1 Statistical models
In this subsection, we will run two models, one non-hierarchical and one hierarchical. Both models will have a single predictor (cond
). I will assume you have never run a Bayesian model or used brms
. While specific details of our models are outside the scope of this paper, I will attempt to provide all the necessary information to reproduce and apply these methods to your own data. In the next subsection, we will work on our multiple comparisons based on the models we run and discuss below.
4.1.1 A non-hierarchical model
We will begin with a simple non-hierarchical linear regression identical to the regression model in Code block 1. The code for the equivalent Bayesian model is given in Code block 4. First, we define our model as resp ~ cond
using the same familiar syntax from the lm()
function. We then specify the type of model we want. Because we are running a linear regression, we assume that the response variable is normally distributed, hence family = "Gaussian"
in the code. We could also use another family value if we wanted to better accommodate outliers. For example, we could specify family = student()
, which would assume a student \(t\) distribution for our response variable resp
. This can be useful in situations where we have smaller sample sizes (higher uncertainty) and/or potential outliers.
By default, our model will use four chains to sample from the posterior. To speed up the process, we can use four processing cores. Finally, we set our priors (see below) and save the model (optionally) to have access to the compiled Stan code later. While this last step is not necessary, if you want to better understand the specifics of our model, it is useful to examine the raw Stan code later on.
The model in question will not run until we create the variable priors
referred to in the code. First, it should be clarified that we do not need to specify priors ourselves: brm()
will use its own set of default priors if we do not provide ours. Here, however, we will assume some mildly informative priors. The default priors used by brm()
will often be flat, which means the model starts by assuming that values are equally likely given the data. This, however, is neither realistic nor ideal: given the data at hand, we know that no responses are lower than 4 or higher than 8 given Fig. 1. Clearly (and unsurprisingly), not all values for our condition levels are equally likely. Imagine that we were estimating the height of a given person. A height of seven meters is clearly impossible, and our model should not start by assuming that such an extraordinary height is just as likely as, say, 1.75 meter.
In Code block 5, we first use the function get_prior()
to inspect the priors assumed by the model (default priors). We then set our own priors with the function set_prior()
. Here, we are assuming that all of slopes (c2
to c5
) follow a Gaussian (normal) distribution centred around zero with a standard deviation of 1: \(\mathcal{N}(\mu = 0, \sigma = 1)\). Thus, the priors in question contain no directional bias, and they also introduce a relatively high degree of uncertainty, given the standard deviation we assume for each distribution. For comparison, in our actual data, the standard deviation of our response variable resp
ranges from 0.16 to 0.21 for all five conditions cond
.^{1} These priors tell the model that absurd values are simply not plausible. Finally, our intercept (\(\hat\beta_0\)) is assumed to have a mean of 7 (which approximates the observed mean of c1
already visualized)—suppose that previous studies report similar means for conditions analogous to c1
.
In summary, we are telling our model that there should be no differences between the conditions, and we are constraining the range of potential differences by saying that values near zero are more likely than values far away from zero. This starting point is very distinct from assuming flat priors. We are merely telling the model that certain values are much more likely than others, and we are being conservative by assuming priors centred at zero with relatively wide distributions.
Since we have now defined our priors, we can finally run our model in Code block 4. Running a Bayesian model takes much longer than running an equivalent Frequentist model. For that reason, it is a good idea to save the model for later use: save(bFit0, file = "bFit0.RData")
will save the model in RData
format. You can later read it by using load("bFit0.RData")
. Importantly, we can save however many objects in a single RData
file: both Bayesian models run in the present paper are found in the bFits.RData
file found in the aforementioned OSF repository. The raw output of our model is displayed in Code block 6.
The most important part in our output can be examined under “Population-Level Effects”. The reader will notice that the estimates in our Bayesian model are not too different from the estimates in our Frequentist model, shown in Table 2. This makes sense, since our priors are only mildly informative. Ultimately, the posterior distribution in our model is (mostly) driven by the patterns in the data. In a more realistic application, we would attempt to set more precise priors given what we know from the literature on the topic of interest.
In the column Estimate
, we can see the mean estimate for each condition (recall that each estimate is in fact an entire posterior distribution). We also see the error for each estimate (i.e., the standard deviation of the posterior distribution), and the 95% highest density interval (HDI). Next, we see Rhat
(\(\hat{R}\) should be \(1.00\)),^{2} which tells us that the model has converged. Finally, we see two columns showing us the effective sample size (ESS) of our posterior samples. By default, each of the four chains draws 2,000 samples from the posterior. However, we use a warmup
of 1,000 samples. As a result, only 1,000 samples from each chain are actually used here, totalling 4,000 samples—you can see this information at the top of Code block 6. Of these 4,000, however, some “steps” in our random walk will be correlated with one another, which reduces the amount of information they provide about the parameter space. Therefore, we should prioritize uncorrelated steps, which results in a proper subset of the 4,000 steps in question. That’s why all the numbers in both ESS columns are less than 4,000. There is no universal rule of thumb as to how large the ESS should be (higher is always better, of course). If the ESS is much lower than the 4,000 steps in question, we could increase the number of steps by adding iter = 4000
, for example, which would give us 3,000 post-warmup
steps for each chain. We can also change the number of warmup
steps in the model, allowing it to “settle” in more informative regions of parameter values before extracting the samples we will actually use.
Although the output shown in Code block 6 is quite informative, visualizing posterior distributions is often the best way to inspect a Bayesian model. Let us examine Fig. 3, which plots posterior distributions for all the slopes in the model (c2
-c5
). Recall that every effect in question must be interpreted relative to the intercept (c1
), just like any other regression output (Frequentist or Bayesian). The distributions in question follow the same order seen in the output of the model. The interpretation of our results is straightforward: values closer to the peak of the distribution are more likely given the data. Under each distribution, the figure displays the 95% highest density intervals (HDIs), which are also listed on the right-hand side of the figure along with the mean of each posterior distribution. As a decision metric, we could say that HDIs containing zero do not allow us to categorically conclude that a credible effect is present—note that this is as arbitrary a threshold as Frequentist 95% intervals. We could, for example, set any credible interval (e.g., 92%) and said interval would be equally justified. Ultimately, we should not merely dismiss an effect because zero is included in the HDI, as the position of zero within each HDI matters: unlike confidence intervals, credible intervals are probability distributions. Finally, a shaded area is present in the figure around zero. The area in question is known as the region of practical equivalence (Kruschke, 2015), often represented by the acronym ROPE. If an entire posterior distribution is found within the ROPE, we could accept a null effect. To calculate the ROPE for a model, we can use the rope()
function from the bayestestR
package (Makowski et al., 2019). We could, and later we will, add to our figure the percentage of each HDI that is found within the ROPE, since that information is conveniently provided by rope()
.
To reproduce the figure above, the reader can visit the repository for this paper. The easiest way to generate a similar figure is to use the mcmc_areas()
function from the bayesplot
extension (Gabry & Mahr, 2024), which is loaded once we load brms
. The code is shown in Code block 7. You will notice the use of the argument regex_pars
, which allows the user to select which parameters should be plotted. Because we want to focus only on coefficients for our slopes, we can use a textual pattern to select only those posterior distributions. While the figure in question is simpler than Fig. 3, it does provide the same essential information. Furthermore, if the reader is familiarized with ggplot2
, it’s easy to customize the figure by adding more layers to it. Last but not least, Code block 7 also shows how to generate trace plots for each of the slopes. While this is a figure that we would not necessarily add to our papers, it does help us visually inspect the convergence of our chains. If all four chains are clustered together by the end of the our sampling (they are), our model has converged.
4.1.2 A hierarchical model
Now that we have a basic understanding of brms
models, we can run a hierarchical version of the model above. If you have already used lme4
, you will immediately recognize the syntax in the model below, which now adds by-participant random intercepts and slopes (cond
) as well as by-item random intercepts. Here, we use the same priors used above, since our population-level parameters are the same. Unsurprisingly, this model takes longer to run.
This time, besides population-level effects, we also get group-level effects. The reader should run bFit1
to explore the entire output of the model, as Table 4 only shows the population-level effects, our focus here. The means of our posterior distributions are again quite similar to those in our non-hierarchical model in Code block 6. This is not unusual, especially in situations where the random effects have relatively small variances and/or when the data is balanced, with consistent effects overall. However, if you pay close attention to the 95% HDIs in the model, you will notice that our intervals are wider now—this will be clear once we visually inspect both models in Fig. 5. Another important difference is the drastic reduction in both ESS columns. This is not surprising: with its hierarchical structure, our model is more complex. We now have many more parameters to estimate (population- and group-level effects), which means we have a higher probability of correlated samples in the joint posterior distributions. Nevertheless, our model has converged (\(\hat{R} = 1.00\)) and no warnings or errors are generated—the reader may try to run the equivalent model using lmer()
, which will yield a singular fit, suggesting that the model is probably too complex given the data at hand.
Estimate | Est.Error | l-95% CI | u-95% CI | Rhat | Bulk_ESS | Tail_ESS | |
---|---|---|---|---|---|---|---|
Intercept | 6.81 | 0.03 | 6.75 | 6.88 | 1.00 | 882 | 1718 |
condc2 | -0.06 | 0.04 | -0.13 | 0.02 | 1.00 | 1585 | 2196 |
condc3 | 0.05 | 0.04 | -0.03 | 0.12 | 1.00 | 1672 | 2368 |
condc4 | 0.01 | 0.04 | -0.06 | 0.08 | 1.00 | 1550 | 2355 |
condc5 | -0.07 | 0.04 | -0.14 | 0.01 | 1.00 | 1475 | 1394 |
We can visually inspect our posteriors in Fig. 4. The reader should note that all of our posterior distributions are wider relative to those in our non-hierarchical model shown in Fig. 3 (both figures use the same \(x\)-axis to make it easier to visually compare them). This should not be surprising: hierarchical models shift estimates (and their respective intervals) closer to each other (see, e.g., Gelman et al., 2012). Indeed, this is an important reason that hierarchical models are already better prepared to handle multiple comparisons than their non-hierarchical counterparts. In Fig. 4, all of our posterior distributions include zero in their 95% HDI, compared to only one distribution in Fig. 3. Note, however, the position of zero in each distribution. For example, as we compare the distributions of c5
and c4
, we can conclude that although zero is indeed a credible parameter value for both slopes, it is much more likely for c4
than for c5
. This, in turn, means that a difference between c1
and c5
is much more statistically credible than a difference between c1
and c4
(again: we are always comparing our distributions with c1
, our intercept).
As a didactic exercise, we could next compare both models using different criteria. For example, we could run waic(bFit0, bFit1)
to compare the models using the Watanabe-Akaike Information Criterion (Vehtari et al., 2017; Watanabe, 2013), also known as widely-applicable information criterion. In our situation, such a comparison is not very informative, since both models use the same fixed effects but differ greatly elsewhere (non-hierarchical vs. hierarchical models). In other words, it’s not surprising that our hierarhical model bFit1
has a better fit given the data, as it includes by-participant and by-item random effects. Model comparison is not our goal here, but it is worth noting that much like AIC (Akaike, 1974) is often used as a criterion to decide which model has the best fit among a set of models, WAIC can be used in a similar fashion for Bayesian models—although the best fit is not necessarily the best model given our object of analysis and what we know about it (see discussion in McElreath, 2020).
4.2 Multiple comparisons
In this section, we will use our hierarchical model bFit1
to extract all the multiple comparisons we wish. There are at least two ways of accomplishing this task. The easy (automatic) way involves a function called hypothesis()
from brms
. The less easy way involves extracting the posterior samples from the model and then subtracting them as we wish. For example, if we wish to compare c3
and c4
, we could simply subtract their posterior distributions and generate a figure with the resulting distribution—this is essentially what hypothesis()
does too, but the advantage of doing things manually is that we understand exactly what’s happening and we have complete control over what is shown in our figures.
In Code block 9, we first create the contrasts we wish to examine. These contrasts take the form of a hypothesis: here, we simulate three null hypotheses by stating that the posterior distributions of c3
and c2
are equal, as are the posteriors of c5
and c4
, and of c4
and c3
. We could also include directional hypotheses such as condc4 > condc2
^{3}. We then feed our variable contrasts
to the function hypothesis()
to generate the output shown in Table 5, with resulting estimates and credible intervals (HDIs).^{4} Because we have assigned the result to our variable h
, we can also plot the resulting posterior distributions simply by running plot(h)
.
hypothesis()
function from brms
. A star means that a given HDI does not include zero.
Hypothesis | Estimate | Est.Error | CI.Lower | CI.Upper | Star |
---|---|---|---|---|---|
c3-c2 = 0 |
0.11 | 0.04 | 0.03 | 0.18 | * |
c5-c4 = 0 |
-0.08 | 0.04 | -0.15 | 0.00 | * |
c4-c3 = 0 |
-0.04 | 0.04 | -0.11 | 0.04 |
In our comparisons table, we can see that two of the three hypotheses can be rejected (assuming \(\alpha = 0.05\)), namely, c3-c2
(\(\hat\beta = 0.11\)) and c5-c4
(\(\hat\beta = -0.08\)): in both cases, the HDI does not include zero. To be clear, the estimates in our table represent the means of the posterior distributions that result from each comparison of distributions. These are therefore single-point estimates that summarize entire posterior distributions (as is always the case in Bayesian models). For example, we saw in Table 4 that the mean estimates for c2
is \(\hat\beta = -0.06\) and for c3
is \(\hat\beta = 0.05\). Thus, it is not surprising that the posterior distribution of c2-c3
has a mean estimate of \(\hat\beta = -0.10\).
You may be wondering how c1
could be included in the multiple comparisons. It is true that all of the comparisons involving c1
are by definition in the result of bFit1
, since c1
is our intercept. However, if we want to list all comparisons, we will have to include those comparisons as well. There are two important details in this scenario: first, c1
does not exist in the model as it is called Intercept
. Thus, we cannot simply add condc1
to our contrasts
variable as the function hypothesis()
will not be able to find such a parameter in the model. Rather, you’d simply type Intercept
. Second, in a standard model slopes encompass only their differences relative to the intercept. As a result, if we want to test the hypothesis that c1
and c3
are identical, for example, we cannot simply write Intercept = c3
. Instead, you’d type Intercept = c3 + Intercept
. Indeed, c3 + Intercept
gives you the actual posterior draws for the absolute values of c3
. Note that this has nothing to do with the Bayesian framework that we are employing. This is just how a typical regression model works.
How could we report our results thus far assuming only the comparisons in Table 5? We can be more or less comprehensive on the amount of details we provide (see Kruschke, 2021). Here is a comprehensive example:
We have run a Bayesian hierarchical linear regression to estimate the effect of
cond
(our condition) on participants’ responses. Our model included random intercepts and effects (cond
) by participant as well as by-item random intercepts. The model specification included 2,000 iterations and 1,000 warm-up steps. Both \(\hat{R}\) and effective sample sizes were inspected, and trace plots were visually checked for convergence. By default, our model’s estimates assumec1
as our reference condition (intercept). Our posterior distributions (Fig. 4) suggest that conditionsc2
andc5
, whose posterior means are negative, are the most likely to have a statistically credible effect. Given that zero is included in their HDI, we cannot categorically conclude that an effect is credible assuming a 95% HDI as a threshold.^{5} Once we examine our multiple comparisons, however, we notice at least two comparisons^{6} whose 95% HDIs provide strong evidence for an effect:c3-c2
(\(\hat\beta = 0.11, [0.03, 0.18]\)) andc5-c4
(\(\hat\beta = -0.08, [-0.15, 0]\)). These effects would not be visible were we to consider only the slopes in Fig. 4.
In the remainder of this section, we will visualize all the comparisons for our variable cond
. We will, however, extract posterior samples manually to generate a custom figure that is informative, comprehensive, and that allows us to visually compare posterior distributions of both our models. This analysis does not imply that the same type of figure should necessarily be used. Instead, the goal here will be to better understand multiple comparisons in the process of generating a complete picture of what we can extract from our models. Using functions such as hypothesis()
, which automatically accomplish a task for us, can be extremely practical, but such functions can also conceal what is actually happening in the background, which can be a problem if we want to better understand the specifics.
We will start with our final product, i.e., a figure with all ten multiple comparisons for cond
. As mentioned above, we will use our hierarchical model as our main model below. That being said, our figure will also include our non-hierarchical model so that posterior distributions can be compared. Here are the elements shown in Fig. 5:
- Posterior distributions for all multiple comparisons based on both models run above. Wider distributions come from our hierarchical model, as do the HDIs under each distribution.
- Means and HDIs of our hierarchical model
bFit1
(right). - Percentage of each HDI from our hierarchical model contained within the ROPE (left).
- Random effects by participant, represented with grey circles around the means of the posterior distributions.
While in practice we would probably not visualize both models simultaneously in a paper (although this could certainly be useful), Fig. 5 allows us to see how our posteriors are affected by the hierarchical structure of model bFit1
(cf. bFit0
). While Fig. 5 is comprehensive in what it shows, it does require multiple steps to be generated. I recommend using the much simpler approach in Code block 9, which creates a similar result in very few steps. Nevertheless, as mentioned earlier, reproducing Fig. 5 can be useful to fully understand what exactly we are doing, which in turn can help us better understand the models we are using.
Below, I list the necessary steps involved in creating Fig. 5, some of which are required for both models. The reader can find a complete script to reproduce all the steps and the figure on the OSF repository mentioned at the beginning of the paper.
- Extract draws using
as_draws_df()
function - Select and rename variables (e.g., from
b_Intercept
to simplyc1
) - Create comparisons
- Long-transform the data
- Create statistical summaries for means and HDIs using
mean_qi()
function - Generate ROPE using
rope()
function - Extract random effects from
bFit1
- Create summary for mean (random) effects by participant
- Generate figure
4.2.1 Step-by-step guide to visualize comparisons
Each code block below represents one of the nine aforementioned steps. I will assume you have loaded all the necessary packages (Code block 3), and that you have already run the models, which are available as an RData
file in the project repository (bFits.RData
). The models are loaded in at the top of Code block 10. In this step, we extract the draws from the posterior for both models in question. We use the as_draws_df()
, which gives us a data.frame
object. We then turn both data frames into tibbles using the as_tibble()
function. The result from this step is a massive object (relative to our data set). For our hierarchical model, bFit1
, the table has 4,000 rows (our draws from the posterior) and 186 columns (our variables, which include the random effects in the model).
Next, we will select only the variables we want to visualize. If you inspect the tables we just created, you will notice different patterns of variables: some variables begin with b_
, some with sd_
, some with cor_
, etc. Our random variables, for example, begin with r_item
(by-item) or with r_part
(by-participant). You can see that yourself by running glimpse(post_raw_1)
, which will list all the columns in the tibble for our hierarchical model. In this step, we only want to extract our population-level effects, shown in Table 4. For that reason, we will only select columns that have the prefix b_
in them. We will also rename our variables to make our work easier. For example, instead of b_Intercept
and b_condc2
, we will simply call them c1
and c2
. This is what Code block 11 does (for both tibbles we created in step 1 above).
We are now ready to move on to step 3, where we create our multiple comparisons. Recall that we want to generate a figure containing both models, so here again we need to generate our comparisons for each model. The process is shown in Code block 12, where we create several columns. For example, the comparison between conditions 3 and 2 will now be a column called c3-c2
.^{7} The content of the column is simply c3
minus c2
. I have explicitly added the intercept (c1
) to the code to show that the actual predicted response values need to include the intercept for each slope (c2
to c5
), but you do not need that, of course, since ultimately we just need to take the difference between the conditions in each comparison. Let us call our new variables mc_0
and mc_1
, as they represent the multiple comparisons for both models.
In our previous step, each comparison is a column, which means our table has several columns, each of which contains the resulting posterior for our comparisons. This wide format is not ideal, so now we need to transform the data using pivot_longer()
to generate long tables (the notion of tidy data). This is what we do in Code block 13 below. The code also prints five random lines from the resulting table for model bFit1
(mc_1
).
We can see above that we now have a simple table with two columns, Comparison
and Estimate
. This already allows us to visualize the patterns very easily. Next, in step 5, we’ll need to create our summaries for each distribution. This step will calculate our 95% HDIs and their respective means, both of which are shown on the right in Fig. 5. Both summaries are shown in Code block 14 below.
Step 6 is the easiest step. This is where we calculate the region of practical equivalence (ROPE). As our ROPE is the same for both models, we only need to create one variable. This step is necessary if we want to add a shaded ROPE to our figure as well as the percentage of each HDI found within our ROPE, which are shown in parentheses on the left in Fig. 5.
Our next step is to extract the random effects of interest from our fit. If we want to add circles to the figure which represent by-participant effects for each comparison, this step is necessary. We will use the spread_draws()
function to accomplish our goal. Because our model (bFit0
) includes by-participant random intercepts as well as slopes, we will extract both population-level and group-level effects (ranef_1
). Later, we will apply pivot_longer()
again to generate a tidy table. Finally, we will add new columns for each of our five conditions. For each condition, we will add the population-level effects and the group-level effects. Only then can we calculate posterior distributions for multiple comparisons that take into account by-participant variation. These steps are detailed in Code block 16.
At last, in step 8, we calculate means for our random effects and actual percentages for our HDIs relative to the ROPE. Using means for our random effects is important because we want each circle in Fig. 5 to represent one participant, i.e., averaging across all values for any given participant-comparison combination. The percentage of each HDI that we find in the ROPE can be calculated using the ROPE
variable created in Code block 15 (step 6). In Code block 17, we accomplish both tasks.
We are now ready to create Fig. 5. The code using ggplot2
is shown in Code block 18. The reader will notice that several layers are necessary to achieve the desired result. The point here is not to reproduce the figure exactly—recall that Code block 9 shows how to generate a similar figure using only two functions. Rather, the goal is to understand the process involved in generating the figure in question. Being able to manually extract posterior samples and to work with them allows us to examine our data in much more detail. It also gives us much more flexibility when it comes to data visualization.
5 Discussion and conclusion
Not long ago, Plonsky (2013) pointed out that research in second language acquisition relies on a narrow range of statistical methods. Fortunately, it seems that the field is slowly moving towards more comprehensive methods, such as full-fledged regression analysis with random effects (Cabrelli & Pichan, 2021; Garcia, 2020). Consequently, as we abandoned ANOVAs and post-hoc tests, the use of multiple comparisons seems to have lost some of its popularity. In this paper, I hope to have shown how we can embrace better models and still make use of multiple comparisons when they make sense given our research questions. After all, sometimes multiple comparisons are exactly what we want in a second language study.
As discussed at the beginning of the paper, one key element involved in multiple comparisons is the notion of type I error, which is in turn connected to the assumption that the null hypothesis (\(H_0\)) is true. In our scenario, under \(H_0\), the effects of our conditions would be zero (\(\beta = 0\)). While the traditional type I error is not an issue in appropriate hierarchical Bayesian models, we should perhaps rethink the type I error paradigm in question. As pointed out in Gelman et al. (2012), our starting point under this pradigm (that the effect is exactly zero) is already suboptimal: we rarely believe that an effect is exactly zero, or that there’s absolutely no differences between groups. Instead, the authors argue that the direction of an effect is a more relevant point to consider: for example, if we assume that \(\beta > 0\) but, in reality, \(\beta < 0\)—this is what Gelman & Tuerlinckx (2000) call “type S errors” (S represents the sign of an effect). In addition to type S errors, the authors propose type M errors, which pertain to the magnitude of an effect: suppose an effect is very strong but we conclude it is close to zero in an experiment. Both type S and type M errors are proposed as replacements for the type I error paradigm, and both are, according to the proponents, “at least substantially ameliorated” in hierarchical Bayesian models (Gelman et al., 2012, p. 195).
Finally, this paper can be summarized in two recommendations for those wishing to include multiple comparisons in their analyses. First, when running regression models, we should favour hierarchical models (e.g., Barr et al., 2013). This decision alone already addresses several issues discussed at length in the literature of quantitative methods (Garcia, 2021; Gelman & Hill, 2006; Sonderegger, 2023; Winter, 2019), including the typical criticisms raised in the context of multiple comparisons and type I errors (Gelman et al., 2012; Gelman & Tuerlinckx, 2000). Second, researchers should explore multiple comparisons whenever such comparisons make sense given the research question under examination. With the appropriate model, this can be accomplished without the typical corrections employed in the literature. Importantly, researchers should not fear that a Bayesian approach may be less conservative than a typical Frequentist approach: as pointed out in Gelman (2016), “[…] with normal data and a normal prior centred at 0, the Bayesian interval is always more likely to include zero, compared to the classical interval; hence we can say that Bayesian inference is more conservative, in being less likely to result in claims with confidence”.
As the models discussed in this paper become more popular in the field of second language acquisition, researchers have a lot to gain in terms of flexibility in their statistical methods—the approach presented above can be easily adapted to any Bayesian model. Crucially, these models provide comprehensive results that are not only more reliable, but which can also directly inform future studies through the use of priors—just as they can be informed by past research in the field. Finally, by not needing to correct nor adjust p-values in multiple comparisons, we do not lose power in our analyses—yet another important benefit of the approach discussed in this paper.
Acknowledgments
The idea for this paper originated from a workshop I gave at the PhonLab group at Harvard during a research visit in March 2024. Thanks to Katie Franich for her comments and questions during the workshop. Thanks also to Ronaldo Lima Jr. for useful comments and suggestions. The visit was made possible by the Programme de mobilité professorale at Université Laval (grant number: DR140407
).
Copyright © 2024 Guilherme Duarte Garcia
References
‘brms’
: Bayesian regression models using Stan. R package version 1.5.0.
tidybayes
: Tidy data and geoms for Bayesian models. https://doi.org/10.5281/zenodo.1308151
Footnotes
While this is good news given the notion of homoscedasticity, Bayesian models make it much easier for us to assume difference variances across groups. This is not discussed here but is yet another advantage of Bayesian data analysis.↩︎
Also known as the Gelman-Rubin convergence diagnostic; see Brooks et al. (2011).↩︎
The reader can verify that this hypothesis would be confirmed (\(\hat\beta = 0.07\), 90% HDI = \([0.01, 0.13]\)).↩︎
By default, the function uses 90% credible intervals for one-sided and 95% credible intervals for two-sided hypotheses. The reader should consult the documentation for the
hypothesis()
function to fully understand its assumptions and its output.↩︎Just like any interval, this is an arbitrary threshold.↩︎
We will see more effects if we consider all possible comparisons, as shown in Fig. 5.↩︎
The grave accents are needed if we want to create a variable name with a hyphen, since hyphens are interpreted as a minus sign by R otherwise.↩︎
Citation
@article{garcia2024,
author = {Garcia, Guilherme D},
title = {Bayesian Estimation in Multiple Comparisons},
journal = {Submitted},
date = {2024},
url = {https://gdgarcia.ca/multcomp},
doi = {10.31219/osf.io/wzqxg},
langid = {en},
abstract = {Typical regression models do not automatically generate
estimates for all the comparisons of a given factor F. Instead, a
level is selected as a reference (i.e., intercept), and all other
levels are estimated relative to said reference (i.e., slopes).
While this situation is often aligned with our research questions
and hypotheses, we sometimes wish to generate multiple comparisons
involving different pairs of levels within F. This, however, is
typically accompanied by p-value corrections or adjustments to
minimize the rate of type I error. This paper shows how Bayesian
hierarchical models can be used to generate multiple comparisons of
entire posterior distributions. Crucially, these comparisons do not
require adjustments.}
}