Data visualization and analysis

Publisher’s description

This introduction to visualization techniques and statistical models for second language research focuses on three types of data (continuous, binary, and scalar), helping readers to understand regression models fully and to apply them in their work. Garcia offers advanced coverage of Bayesian analysis, simulated data, exercises, implementable script code, and practical guidance on the latest R software packages.

The book, also demonstrating the benefits to the L2 field of this type of statistical work, is a resource for graduate students and researchers in second language acquisition, applied linguistics, and corpus linguistics who are interested in quantitative data analysis.

Available at Routledge and Amazon.

Files

Access the book files at http://osf.io/hpt4g.


Highlights

  • Intro to R
  • Focus on data visualization
  • Linear, logistic, and ordinal regression
  • Hierarchical (mixed-effects) models
  • Chapter on Bayesian data analysis
  • Comprehensive code that can be fully reproduced by the reader
  • File organization with RProjects

Highly recommended as an accessible introduction to the use of R for analysis of second language data. Readers will come away with an understanding of why and how to use statistical models and data visualization techniques in their research.

Lydia White, James McGill Professor Emeritus, McGill University

Curious where the field’s quantitative methods are headed? The answer is in your hands right now! Whether we knew it or not, this is the book that many of us have been waiting for. From scatter plots to standard errors and from beta values to Bayes theorem, Garcia provides us with all the tools we need—both conceptual and practical—to statistically and visually model the complexities of L2 development.

Luke Plonsky, Professor, Northern Arizona University

This volume is a timely and must-have addition to any quantitative SLA researcher’s data analysis arsenal, whether you are downloading R for the first time or a seasoned user ready to dive into Bayesian analysis. Guilherme Garcia’s accessible, conversational writing style and uncanny ability to provide answers to questions right as you’re about to ask them will give new users the confidence to make the move to R and will serve as an invaluable resource for students and instructors alike for years to come.

Jennifer Cabrelli, Associate Professor, University of Illinois at Chicago

[…] this book’s strength lies in giving readers just enough to enable them to quickly apply their newly acquired knowledge and skills to their own data in order to produce complex, journal-worthy analyses. The book is timely, with increasing expectations for more refined accounts of the diverse populations and intricate results stemming from studies of second language acquisition and bi/plurilingualism, as well as other fields of linguistic research.

Senécal & Sabourin (2023)


News & updates

Here are some updates and additional info related to the code used in the book. Some of these are based on questions I get about this code. This page will change from time to time to reflect updates in relevant packages and functions used in the book (e.g., mutate_...(); see here).

  1. The function mutate_if() has been superseded by across().
  • Before: mutate_if(is.character, as.factor)
  • Now: mutate(across(where(is_character), as_factor))
  2. Besides using scale_x_discrete(labels = abbreviate) to abbreviate axis labels, you can also use scale_x_discrete(labels = c(...)), which allows you to choose how labels are abbreviated.
  3. For guidelines regarding Bayesian analyses, see Kruschke’s recent paper Bayesian Analysis Reporting Guidelines.
  4. With R 4.1+, the native pipe |> can replace %>% (read more here).
  5. guide = FALSE is now deprecated and should be replaced by guide = "none".
  6. Instead of select(vars), where vars is a vector containing multiple columns of interest, you should now use select(all_of(vars)).
  7. Check out the useful changes to vector functions in dplyr 1.1.0 here.
  8. You can use read_csv() and bind_rows() (together with list.files() and full.names = TRUE) to combine multiple CSV files in a directory, so no for-loop is needed (thanks to Natália B. Guzzo for pointing this out).
  9. When you’re working with factors, the function fct_relevel() from the forcats package offers a lot more flexibility than the function relevel().
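As a quick illustration of some of these updates, here is a minimal sketch (the tibble and its column names below are made up for demonstration purposes):

```r
library(dplyr)
library(forcats)

d <- tibble(
  participant = c("p1", "p2", "p3"),
  group       = c("control", "test", "control"),
  score       = c(10, 12, 9)
)

# across(): convert all character columns to factors
d <- d |> mutate(across(where(is.character), as.factor))

# all_of(): select columns whose names are stored in a vector
vars <- c("participant", "score")
d |> select(all_of(vars))

# fct_relevel(): reorder factor levels explicitly
d |> mutate(group = fct_relevel(group, "test", "control"))
```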
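The pattern for combining multiple CSV files can be sketched as follows (the data/ directory is hypothetical; adjust the path to your own project):

```r
library(readr)
library(dplyr)

# List every CSV file in the directory, keeping full paths
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

# Read each file and stack the resulting data frames
all_data <- files |> lapply(read_csv) |> bind_rows()

# In readr 2.0+, read_csv() itself accepts a vector of paths:
# all_data <- read_csv(files)
```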

Useful packages and functions not mentioned in the book

  • The dtplyr package provides the power of data.table with the familiar tidyverse syntax. Check it out here.
  • case_when() (from dplyr) is a great function to avoid using multiple if_else()s. See documentation here.
  • sample_n() randomly samples n rows of your data (e.g., sample_n(data, 4) prints four random rows); in recent versions of dplyr, it has been superseded by slice_sample().
  • dplyr 1.1.0+ now offers pre-operation grouping with the .by argument within functions such as mutate() and summarize(). A key advantage is that we no longer need to ungroup() variables after applying the function. Check out my blog posts on pre-operation grouping and on snippets to automate your tasks.
# Before:
my_data |> 
  group_by(group, condition) |> 
  mutate(new_column = mean(number_column)) |> 
  ungroup()

# Now:
my_data |> 
  mutate(new_column = mean(number_column), .by = c(group, condition))
  • R 4.2.0 and 4.3.0 also provide some neat features, such as the use of _ as a placeholder for the native pipe |> (basically, the equivalent to . when you use %>%). In addition, you can extract specific values that are output in a pipeline in a clean and elegant way. For example, the code below extracts coefficients of a linear model.
data |> lm(response ~ predictor, data = _) |> _$coef
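A minimal sketch of case_when() replacing a chain of nested if_else() calls (the cutoffs and labels below are made up):

```r
library(dplyr)

scores <- tibble(score = c(35, 62, 88))

# Conditions are checked in order; the first match wins
scores |>
  mutate(band = case_when(
    score < 50 ~ "low",
    score < 80 ~ "mid",
    TRUE       ~ "high"  # fallback; in dplyr 1.1.0+ you can use .default = "high"
  ))
```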

Errata and clarifications
  • Typo on p. 152, paragraph 2: “we already know the probability”.
  • Typo on p. 218. For some mysterious reason, the published version of the book has “Monte Carlos Markov Chain” (first paragraph), which should obviously read “Markov Chain Monte Carlo”. This is, in fact, what is listed in the glossary at the end of the book on p. 254.
  • Clarification on p. 225, paragraph 2 (interpreting code block 55): “First, we have our estimates and their respective standard errors”. Bear in mind that in Bayesian models, the standard error of the coefficients corresponds to the standard deviation of the posterior distribution.
  • Clarification regarding the function mean_cl_boot mentioned on p. 95, paragraph 1: this function (from the Hmisc package) bootstraps confidence intervals, in contrast to the normal-approximation interval \(\bar{x} \pm z \cdot \frac{s}{\sqrt{n}}\) (where \(z\) is the critical value for the desired confidence level, e.g., 1.96 for 95%). Thus, even though standard errors are calculated in the process, the resulting error bars represent confidence intervals (and will therefore always be wider than the error bars representing standard errors, by definition).
  • On p. 20, “Simply go to RStudio > Preferences (or hit Cmd + , on a Mac)”. In more recent versions of RStudio, this has changed to “Tools > Global Options…” (same shortcut as before).
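To make the distinction concrete, here is a sketch of the normal-approximation interval on simulated data (mean_cl_boot instead resamples the data to obtain its interval):

```r
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)  # simulated data

z  <- qnorm(0.975)             # critical value for a 95% CI (~1.96)
se <- sd(x) / sqrt(length(x))  # standard error of the mean
ci <- mean(x) + c(-1, 1) * z * se  # x_bar +/- z * s / sqrt(n)
```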

How to cite

Garcia, G. D. (2021). Data visualization and analysis in second language research. New York, NY: Routledge.

@book{garcia_2021_dvaslr,
    title = {Data visualization and analysis in second language research},
    author = {Garcia, Guilherme Duarte},
    year = {2021},
    address = {New York, NY},
    publisher = {Routledge},
    isbn = {9780367469610}
}
Funding

Part of this project benefited from an ASPiRE Junior Faculty Award at Ball State University (2020–2021).


Copyright © 2024 Guilherme Duarte Garcia

References

Senécal, A., & Sabourin, L. (2023). Review of "Data Visualization and Analysis in Second Language Research" by Guilherme D. Garcia. Canadian Journal of Linguistics/Revue Canadienne de Linguistique, 1–4. https://doi.org/10.1017/cnj.2023.25