Intro to Bayesian data analysis in R

Running and visualizing regression models

Guilherme D. Garcia

Université Laval • gdgarcia.ca

Workshop @ Harvard

March 5, 2024

What we’re doing today

Key goals1

  1. Compare Frequentist and Bayesian data analysis
  2. Discuss some practical implications
  3. Run, inspect and visualize models using brms + other packages

What we won’t have time to do

  • Discuss priors, compare models using information criteria
  • Consider multiple types of distributions
  • Perform a sensitivity analysis—but see Garcia (2023)

Basics

How we understand probabilities

Frequenstist perspective

  • We examine the frequency of events over large samples
  • Q: What’s the probability of the data given a parameter \(P(D|\pi)\)?
  • This is our \(p\)-value (Null Hypothesis Significance Testing, NHST)

How we understand probabilities

Bayesian perspective

  • We start with an assumption, then we examine the data
  • Q: What’s the probability of the parameter given the data \(P(\pi|D)\)?

An analytical example: A test for disease \(D\) has 95% accuracy in positive results. \(D\) occurs in 0.1% of the population. What’s \(P(D|+)\)?

Bayes’ rule

\[ {P(\pi|data)} = \frac{{P(data|\pi)}{P(\pi)}}{P(data)} \]

Defining our values

\[\begin{align*} P(D) &= \frac{1}{1000} \\ P(\neg D) &= 1 - P(D) = \frac{999}{1000} \\ P(+|D) &= 0.95 \quad P(+|\neg D) = 0.05\\ \end{align*}\]

How we understand probabilities

Bayesian perspective

  • We can now use Bayes’ rule to answer the question

Calculating our posterior probability

\[\begin{align*} P(D|+) &= \frac{P(+|D) \times P(D)}{{P(+)}} \\ {P(+)} &= P(+|D) \times P(D) + P(+|\neg D) \times P(\neg D) \\ &= 0.95 \times 0.001 + 0.05 \times 0.999 = 0.0509 \\ P(D|+) &= \frac{0.95 \times 0.001}{0.0509} \\ &\boxed{\approx 1.87\%} \end{align*}\]

  • Crucial: our prior, \(P(D)\), makes all the difference A second test?

Intuition

  • The stronger our priors, the harder it is to change our conclusions given the data. Let’s see this visually

  • Our methods should ideally incorporate what we already know about a given phenomenon (what the literature has already shown)

Illustrative example where our conclusion (posterior) is a compromise between our initial assumptions (prior) and the data (likelihood). Figure from Garcia (2021, p. 213)

Realistic applications

  • Assume a model with 7 variables (parameters)
  • If we want to consider the probability of 1,000 values for all 7 \(\rightarrow\) 1,000\(^7\) combinations of parameter values (!)
  • So even simple models can’t really be calculated analytically
  • That’s why we sample values from the posterior instead
  • Different samplers available: MCMC1 models (e.g., Metropolis, JAGS, Stan)

A random walk in the parameter space

A simplistic illustration of our random walk

A simple algorithm

  1. Initial guess: pick a candidate value \(A\) for a parameter
  2. Calculate \(P(A|D)\)
  3. Propose a new value, \(B\), and calculate \(P(B|D)\)
  • \(P(B|D) > P(A|D) \rightarrow\) move to \(B\)
  • \(P(B|D) < P(A|D) \rightarrow\) move to \(B\) with \(P = \frac{P(B|D)}{P(A|D)}\)
  1. Decide whether to accept/reject the proposal based on some threshold
  2. Repeat steps 1-4 for \(n\) iterations
  • Parameter values that are more likely will be “visited” more often
  • Everything in Bayesian data analysis is a distribution

A random walk in the parameter space

How do we know we have converged to the right parameter values?

  • We use more than one chain in our random walk
  • If all our chains end up in the same parameter space, that’s good
  • We can use traceplots to inspect out search:

Chains converged. Fig. from Garcia (2023, p. 749)

Chains did not converge. Fig. from Garcia (2023, p. 749)

Example: a simple linear model

  • Parameter estimation (e.g., \(\hat\beta\)) by sampling from the posterior
  • Here we see a joint posterior distribution (\(\hat\beta_0\) and \(\hat\beta_1\))

Example of a joint posterior distribution for an intercept and a slope in a regression model. Figure from Garcia (2021, p. 220)

  • Important: estimates are distributions (cf. single-point estimates)

How to run a Bayesian model in R

A simple model using Stan directly

data {
  int<lower=0> N;         // number of data points
  vector[N] x;            // predictor variable
  vector[N] y;            // outcome variable
}

parameters {
  real alpha;             // intercept
  real beta;              // slope
  real<lower=0> sigma;    // standard deviation of the errors
}

model {
  // Priors
  alpha ~ normal(0, 10);  // Gaussian prior for intercept
  beta ~ normal(0, 10);   // Gaussian prior for slope
  sigma ~ normal(0, 1);    // Gaussian prior for standard deviation
  
  // Likelihood
  y ~ normal(alpha + beta * x, sigma); // Likelihood function for linear regression
}

Practice

Our packages

  • We’ll use tidyverse + some specific packages to help us run and visualize Bayesian models
library(tidyverse)
library(sjPlot)
library(bayestestR)
library(brms)
library(bayesplot) # not necessary, technically
library(tidybayes) # helps with manipulation of samples

Our point of reference

load("danish.RData")
Sample of 5 rows from the data
Subject Word Affix LogRT RT
2s20 vE6ltede ede 6.538169 691.02
2s16 ruinere ere 6.664702 784.23
2s16 dyrkede ede 6.370569 584.39
2s01 vandrende ende 6.566841 711.12
2s18 lyttede ede 6.485597 655.63

Visualize data

  • Reaction times as a function of 5 affixes in Danish
ggplot(data = dan, 
       aes(x = fct_reorder(Affix, LogRT), 
           y = LogRT)) +
  geom_boxplot() +
  stat_summary(color = "darkorange") +
  theme_classic(base_family = "Futura",
                base_size = 15) +
  labs(x = "Affix",
       y = "RT in log(ms)",
       title = "Reaction times across 5 affixes")


  • Let’s run a linear model to examine the relationship in question

Frequentist linear regression

  • We’d typically use lm() to run our model here
fit0 = lm(LogRT ~ Affix, data = dan)
# In this scenario, this is equivalent to an ANOVA using aov()
  • By definition, a suffix is picked as our reference level (bar)
  • That’s our intercept and all other affixes are compared to it

Our model

\[\hat{y_i} = \hat\beta_0 + \hat\beta_{ede}x_i + \hat\beta_{ende}x_i + \hat\beta_{ere}x_i + \hat\beta_{lig}x_i + \hat{e}_i\]

  Log RT
Predictors Estimates CI p
(Intercept) 6.80 6.77 – 6.82 <0.001
Affix [ede] -0.05 -0.09 – -0.01 0.008
Affix [ende] 0.05 0.02 – 0.09 0.006
Affix [ere] 0.02 -0.02 – 0.06 0.354
Affix [lig] -0.06 -0.10 – -0.02 0.001
Observations 1040
R2 / R2 adjusted 0.045 / 0.041

Frequentist ANOVA

  • What if we want multiple comparisons?
  • Different methods. Here, we could simply run a post-hoc test (e.g., Tukey HSD), which will affect \(p\)-values and therefore our conclusions
aov(LogRT ~ Affix, data = dan) |> 
  TukeyHSD()
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = LogRT ~ Affix, data = dan)

$Affix
                diff           lwr          upr     p adj
ede-bar  -0.05099340 -1.034311e-01  0.001444266 0.0612021
ende-bar  0.05413199 -5.440768e-05  0.108318388 0.0503757
ere-bar   0.01781355 -3.468447e-02  0.070311574 0.8863835
lig-bar  -0.06136498 -1.139239e-01 -0.008806100 0.0127046
ende-ede  0.10512539  5.129924e-02  0.158951542 0.0000012
ere-ede   0.06880695  1.668084e-02  0.120933061 0.0029895
lig-ede  -0.01037158 -6.255898e-02  0.041815820 0.9827775
ere-ende -0.03631844 -9.020339e-02  0.017566517 0.3500510
lig-ende -0.11549697 -1.694412e-01 -0.061552722 0.0000001
lig-ere  -0.07917853 -1.314266e-01 -0.026930484 0.0003598

Multiple comparisons

Figure 1: Four comparisons have p > 0.05

Bayesian model

  • Here’s what the same model looks like using brms:
fit1a = brm(LogRT ~ Affix, data = dan,
           family = "Gaussian",
           cores = 4,
           save_model = "fit1a.stan")

save(fit1a, file = "fit1a.RData")
  • This takes much longer to run, so it’s useful to save its output
  • You can also save the model specification (Stan model)
  • And you can use multiple cores (one per chain)

Bayesian model

Typical output

load("fit1a.RData") # Assuming we have saved the model
fit1a
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: LogRT ~ Affix 
   Data: dan (Number of observations: 1040) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     6.80      0.01     6.77     6.82 1.00     2383     2828
Affixede     -0.05      0.02    -0.09    -0.01 1.00     2808     2998
Affixende     0.05      0.02     0.01     0.09 1.00     2878     2880
Affixere      0.02      0.02    -0.02     0.06 1.00     2889     2780
Affixlig     -0.06      0.02    -0.10    -0.02 1.00     2883     3171

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.20      0.00     0.19     0.21 1.00     4633     2965

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Bayesian model: visualizing output

# Visualize results (most standard fig)
plot(fit1a) # quick and easy visualization method: posterior + chains

Posterior distributions + traceplots
# Visualizing means of posterior only:
mcmc_plot(fit1a, variable = "b_Affix", regex = TRUE) 

# Visualizing posterior areas using regex to select variables of interest:
mcmc_plot(fit1a, type = "areas", variable = "b_Affix", regex = TRUE)

Bayesian model: increasing complexity

  • So far we’ve assumed a non-hierarchical model (not ideal)

  • In our final part, we will manually extract posterior samples, examine multiple comparisons, and run a hierarchical model

To RStudio

Extracting samples

Code

#
post1_raw = as_draws_df(fit1a) |>
  as_tibble()

post1_raw = post1_raw |>                    
  select(contains("b_")) |>
  rename("bar*" = b_Intercept,
         "ede" = b_Affixede,
         "ere" = b_Affixere,
         "ende" = b_Affixende,
         "lig" = b_Affixlig)

post1 = post1_raw |>
  pivot_longer(names_to = "Parameter",
               values_to = "Estimate",
               cols = `bar*`:lig)

post1_summary = post1 |>                    
  group_by(Parameter) |>
  mean_qi(Estimate) |>
  ungroup()

ROPE1 = rope(fit1a) |>
  as_tibble()
1
Extract samples and transform output into tibble
2
Select only a subset of columns of interest
3
Rename columns
4
Wide-to-long transformation
5
Create summary for each parameter (mean, etc.)
6
Calculate ROPE for model

Why do this?

  • More control over comparisons
  • More personalization in figures

Hover over numbers for annotations

Visualizing posterior

#
ggplot(data = post1 |> filter(Parameter != "bar*"), 
       aes(x = Estimate, y = Parameter)) + 
  annotate("rect", ymin = -Inf, ymax = +Inf,
           xmin = ROPE1$ROPE_low,
           xmax = ROPE1$ROPE_high,
           alpha = 0.3,
           fill = "gray90",
           color = "white") +
  stat_halfeye(fill = "#B6B7DB", .width = c(0.5, 0.95)) +
  theme_classic(base_family = "Futura") +
  coord_cartesian(xlim = c(-0.2, 0.2)) +
  labs(y = NULL,
       x = "Posterior distribution",
       title = "Effects relative to intercept (affix **bar**)") +
  theme(plot.title = ggtext::element_markdown()) +
  geom_label(data = post1_summary |> filter(Parameter != "bar*"),
             aes(x = Estimate, y = Parameter, label = Parameter),
             family = "Futura",
             # position = position_nudge(y = -0.25),
             label.padding = unit(0.4, "lines")) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  geom_text(data = post1_summary |> 
              filter(Parameter != "bar*") |>  
              mutate(across(where(is_double), ~round(., 2))),
            aes(label = glue::glue("{Estimate} [{.lower}, {.upper}]"), x = Inf),
            hjust = "inward", vjust = -0.5, family = "Futura", color = "gray60") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black")

Multiple comparisons

Manually generate comparisons from posterior samples

post1_raw_mc = post1_raw |> 
  mutate(ede_ende = (`bar*` + ede) - (`bar*` + ende),
         ede_ere = (`bar*` + ede) - (`bar*` + ere),
         ede_lig = (`bar*` + ede) - (`bar*` + lig),
         ere_lig = (`bar*` + ere) - (`bar*` + lig),
         ere_ende = (`bar*` + ere) - (`bar*` + ende),
         ende_lig = (`bar*` + ende) - (`bar*` + lig)) |> 
  select(-`bar*`) |> 
  rename("bar_ede" = ede,
         "bar_ende" = ende,
         "bar_lig" = lig,
         "bar_ere" = ere)

# Now, wide-to-long transform:
post1_mc = post1_raw_mc |> 
  pivot_longer(names_to = "Comparison",
               values_to = "Estimate",
               cols = bar_ede:ende_lig)

# Create some point and interval summaries for our posteriors:
post1_summary_mc = post1_mc |> 
  group_by(Comparison) |> 
  mean_qi(Estimate) |> 
  ungroup()

Visualize multiple comparisons

ggplot(data = post1_mc, 
       aes(x = Estimate, y = fct_reorder(Comparison, Estimate))) + 
  annotate("rect", ymin = -Inf, ymax = +Inf,
           xmin = ROPE1$ROPE_low,
           xmax = ROPE1$ROPE_high,
           alpha = 0.3,
           fill = "gray90",
           color = "white") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  stat_halfeye(fill = "#B6B7DB", .width = c(0.5, 0.95)) +
  theme_classic(base_family = "Futura") +
  coord_cartesian(xlim = c(-0.2, 0.4)) +
  labs(y = NULL,
       x = "Posterior distribution for each comparison",
       title = "**Multiple comparisons** of posterior distributions",
       subtitle = "All but **two** comparisons display a credible statistical difference",
       caption = "Cf. ANOVA + TukeyHSD, where half of all comparisons had p > 0.05") +
  theme(plot.title = ggtext::element_markdown(),
        plot.subtitle = ggtext::element_markdown()) +
  geom_label(data = post1_summary_mc,
             aes(x = Estimate, y = Comparison, label = Comparison),
             family = "Futura", size = 3,
             # position = position_nudge(y = -0.3),
             label.padding = unit(0.4, "lines")) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  geom_text(data = post1_summary_mc |> 
              mutate(across(where(is_double), ~round(., 2))),
            aes(label = glue::glue("{Estimate} [{.lower}, {.upper}]"), x = Inf),
            hjust = "inward", vjust = -0.5, family = "Futura", color = "gray60", size = 3)

A more realistic model?

  • We should now include random effects to our model
  • Thanks to brms, we can use the same familiar syntax from lme4
fit2 = brm(LogRT ~ Affix +
             (1 + Affix | Subject) +
             (1 | Word),
           data = dan,
           family = "Gaussian",
           cores = 4,
           save_model = "fit2.stan")
  • What happens to our posterior distributions now?
  • Let’s go straight to our multiple comparisons

Hierarchical model: multiple comparisons

Figure 2: Before, without random effects
Figure 3: After, with random effects

Final considerations

Great

  • Bayesian models are much more customizable
  • No \(p\)-values
  • Results are intuitive (e.g., CI) and comprehensive
  • We can easily incorporate our assumptions (priors)

Not so great

  • These models are more computationally intensive
  • They’re often unfamiliar (e.g., reviewers, priors)
  • If we run simple models without informed priors, similar results
  • Not having \(p\)-values may be an issue to many

Materials & Readings

All the materials on OSF

Readings

  • McElreath (2020)
  • Kruschke (2015), Kruschke (2021) and Kruschke & Liddell (2018)
  • Gelman (2008) (on common objections to Bayesian statistics)
  • Garcia (2021) and Garcia (2023)

References

Bürkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10(1), 395–411. https://doi.org/10.32614/RJ-2018-017
Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., … Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76(1), 1–32. https://doi.org/10.18637/jss.v076.i01
Garcia, G. D. (2021). Data visualization and analysis in second language research. New York NY: Routledge.
Garcia, G. D. (2023). Statistical modeling in L3/ln acquisition. In J. Cabrelli, A. Chaouch-Orozco, J. González Alonso, S. M. Pereira Soares, E. Puig-Mayenco, & J. Rothman (Eds.), The Cambridge Handbook of Third Language Acquisition (pp. 744–770). Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108957823.030
Gelman, A. (2008). Objections to Bayesian statistics. Bayesian Analysis, 3(3), 445–449.
Kruschke, J. K. (2015). Doing bayesian data analysis: A tutorial with r, JAGS, and stan (2nd ed.). London: Academic Press.
Kruschke, J. K. (2021). Bayesian analysis reporting guidelines. Nature Human Behaviour, 5(10), 1282–1291.
Kruschke, J. K., & Liddell, T. M. (2018). Bayesian data analysis for newcomers. Psychonomic Bulletin & Review, 25(1), 155–177.
McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan (2nd ed.). Boca Raton: Chapman & Hall/CRC.
Winther Balling, L., & Harald Baayen, R. (2008). Morphological effects in auditory word recognition: Evidence from Danish. Language and Cognitive Processes, 23(7-8), 1159–1190.

Appendix

A second test

Updating our posterior probability

\[\begin{align*} P(D|+) &= \frac{P(+|D) \times P(D)}{{P(+)}} \\ {P(+)} &= P(+|D) \times P(D) + P(+|\neg D) \times P(\neg D) \\ &= 0.95 \times \boxed{0.0187} + 0.05 \times 0.9813 = \boxed{0.06683} \\ P(D|+) &= \frac{0.95 \times \boxed{0.0187}}{0.06683} \\ &\boxed{\approx 26.58\%} \end{align*}\]

Go back