Teaching stats to undergrads in linguistics

My experience with a newly developed course

teaching
stats
R
Author

Guilherme D. Garcia

Published

December 30, 2023

This past fall, I taught a brand new data analysis course here at Université Laval: LNG-1100 Méthodes expérimentales et analyse de données. This is a course I had wanted to develop for years: an up-to-date course focused on applied quantitative data analysis using R for undergraduates. Having just arrived at Laval, I was quite happy to learn that I would be able to design and teach the course starting in 2023. On top of that, I was fortunate to be awarded an internal grant that helped me focus my teaching this term exclusively on the development of LNG-1100.[1] The course is now part of our new undergraduate program Sciences du langage.

The goals

Besides discussing research questions and experimental design, the course focused on three aspects of data analysis:

  1. Data cleaning, preparation and visualization with tidyverse
  2. Regression analysis (intro; no mixed-effects), both linear and logistic
  3. Document preparation using Quarto in both PDF and HTML formats — this also included reference management using bib files

Simply put, my aim was to shift the focus from traditional \(t\)-tests and ANOVAs in SPSS to more contemporary and applicable methods. I wanted to provide students with more powerful, up-to-date methods and more marketable skills. Whether or not they decide to go for an MA later on, some coding in R and Quarto coupled with some basic data analysis will certainly serve them well.

  • Formulate and test research hypotheses [in linguistics]
    • propose research questions [relevant to linguistics]
    • contrast appropriate experimental designs given a specific hypothesis
  • Become familiar with the basic elements of quantitative data analysis
    • manipulate data using the R language and the tidyverse package
    • describe relevant patterns in linguistic data using tables and figures
    • apply regression models to linguistic data
  • Interpret and synthesize statistical results in a scientific report
    • relate statistical results to research objectives
    • develop research reports using Quarto and R

Considering that this is a first-year course, the goals were ambitious.[2] Students were assumed to have no background in coding (or R) and no background in stats: this is, after all, an intro course. The main goal was to give students enough skills to grab a csv file, clean it, explore the data as needed, analyze it using regression models, and report and interpret the results in a polished Quarto document with appropriate reference management (including cross-references for sections, figures and tables). Fortunately, all of this can be accomplished within a single environment (RStudio), which made our lives easier.
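To make that workflow concrete, here is a minimal sketch of the csv-to-regression steps. It is not taken from the course materials: the data are simulated, and the variable names (`rt`, `frequency`, `correct`) and the temporary file are assumptions made only for illustration.

```r
library(dplyr)  # part of the tidyverse
library(readr)  # part of the tidyverse

# Simulate a small data set (hypothetical variables) and save it as a csv file
set.seed(1)
n <- 200
sim <- tibble(
  participant = rep(1:20, each = 10),
  frequency   = rnorm(n),
  rt          = 600 - 40 * frequency + rnorm(n, sd = 50),
  correct     = rbinom(n, size = 1, prob = plogis(1 + 0.8 * frequency))
)
sim$rt[c(3, 57)] <- NA  # two missing values, so there is something to clean
csv_path <- tempfile(fileext = ".csv")
write_csv(sim, csv_path)

# 1. Grab the csv file
d <- read_csv(csv_path, show_col_types = FALSE)

# 2. Clean it: drop rows with missing reaction times, treat participant as a factor
d_clean <- d %>%
  filter(!is.na(rt)) %>%
  mutate(participant = factor(participant))

# 3. Explore: a quick numerical summary
d_clean %>%
  summarize(mean_rt = mean(rt), sd_rt = sd(rt), accuracy = mean(correct))

# 4. Analyze: linear regression for a continuous response,
#    logistic regression for a binary response
fit_lm  <- lm(rt ~ frequency, data = d_clean)
fit_glm <- glm(correct ~ frequency, data = d_clean, family = binomial)

summary(fit_lm)
summary(fit_glm)
```

In a Quarto document, the same code would sit in R chunks, with the model summaries interpreted in the surrounding text.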

The challenges (and suggested solutions)

Before the course started, I expected the two main challenges to be:

  1. the coding in R, which is new to practically all incoming students in the program;
  2. the stats behind regression analysis, especially logistic regression.

Both points above were less challenging than I anticipated. Besides its official page (using monPortail here at Laval), the course was accompanied by a website with interactive apps and pages that supplemented each class. This certainly helped, as each page was comprehensive and offered commented solutions to exercises. These pages used animated GIFs whenever a particular sequence of steps was needed, for example to set up RStudio. In addition, interactive code was used throughout to make points (1) and (2) more user-friendly. The course did have recommended readings in both French and English (all available digitally), but these were mostly treated as extras, since the slides and the complementary pages were enough to successfully complete most of the assignments.

One challenging aspect in the first quarter of the course was locating files and navigating through folders, especially from the command line, which is completely new to the vast majority of students in the program. This difficulty in locating files has been noted in various articles online. The use of RProj files did help, as did the tutorial on file organization early on in the course,[3] but the difficulty was certainly there. About 20% of the students struggled for several classes until they were able to navigate to a particular file by typing paths and using Tab in RStudio.[4]
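As an illustration of the kind of navigation involved, here is a small base R sketch. The folder and file names are hypothetical, and the whole thing runs in a temporary directory; it is not the file structure used in the course.

```r
# Build a throwaway project folder (hypothetical names) in a temporary location
project <- file.path(tempdir(), "LNG-1100")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(
  data.frame(word = c("chat", "chien"), rt = c(512, 498)),
  file.path(project, "data", "rt.csv"),
  row.names = FALSE
)

# Opening an .Rproj file sets the working directory to the project folder;
# setwd() only simulates that step here
old_wd <- setwd(project)

getwd()             # where R is currently looking
list.files()        # what is in the project folder
list.files("data")  # what is in the data subfolder

# A path relative to the project folder, built without typing separators by hand
rel_path <- file.path("data", "rt.csv")
file.exists(rel_path)  # check that the file is where we think it is
rt <- read.csv(rel_path)

setwd(old_wd)  # restore the previous working directory
```

Because the path is relative to the project folder, the same script keeps working when the folder is moved or shared, which is the main argument for a fixed project structure.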

Dealing with technical challenges

This is one key challenge: with 20 or 30 students in a room, all installing R and tidyverse at once, we know something will go wrong. Students were instructed to install R, RStudio, and the packages before class (through a custom tutorial I provided). They were also instructed to create accounts on posit.cloud to have access to the online version of RStudio as a backup (about 10-15% of the students ended up using the online version for technical reasons). These measures certainly alleviated the problem.

Finally, a dedicated troubleshooting page was added during the course. It would have been most useful early on, but students still used it later in the course.

ChatGPT

AI was part of the course. Virtually no one doing data analysis will avoid using Google or AI to improve their coding. As a result, prohibiting such tools in this type of course is not only impractical, it’s also a disservice to students, who should view these tools as complementary to their own ability to analyse data.

ChatGPT was used in some classes to demonstrate how answers can be completely wrong or outdated (e.g., functions that no longer work as intended). Crucially, the discussion was centred around two skills: (a) knowing how to ask the right questions and (b) critically assessing technical answers. This topic was also incorporated into quizzes, where students were asked to explain why ChatGPT’s responses were not always appropriate or correct, especially in coding contexts.

For example, GPT (version 4) struggled to identify the issue when a csv file using ; as separator wasn’t read correctly by the function read_csv(). It then went on to provide overly complex solutions instead of simply suggesting read_csv2(). Likewise, as the course spent some time on data transformation, GPT wasn’t able to help students when the data file was originally in a wide format that required wide-to-long transformation, as the naming of the variables wasn’t transparent enough for GPT to perform the transformation itself.
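Both situations can be reproduced with a tiny example. The sketch below uses an invented file and invented column names (`rt_word1`, `rt_word2`), not the data from the course. `read_csv2()` expects ; as the field separator and , as the decimal mark, and `pivot_longer()` from tidyr handles the wide-to-long step.

```r
library(readr)  # part of the tidyverse
library(tidyr)  # part of the tidyverse

# A hypothetical semicolon-separated file in wide format:
# one row per participant, one column per word
path <- tempfile(fileext = ".csv")
writeLines(
  c("participant;rt_word1;rt_word2",
    "p1;512;498",
    "p2;601;577"),
  path
)

# read_csv() expects commas, so each line ends up in a single column
wrong <- read_csv(path, show_col_types = FALSE)
ncol(wrong)

# read_csv2() expects semicolons and recovers the three columns
d <- read_csv2(path, show_col_types = FALSE)
ncol(d)

# Wide to long: one row per participant and word
long <- pivot_longer(
  d,
  cols         = starts_with("rt_"),
  names_to     = "word",
  names_prefix = "rt_",  # drop the prefix from the new "word" column
  values_to    = "rt"
)
long
```

When column names follow a transparent pattern like the `rt_` prefix here, the transformation is a single call; opaque names are what make it hard to automate.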

The results and final thoughts

The course had two projects (in groups) and four quizzes. Here’s the project of one of the groups — the students (Auriane Bergeron, Émy Bouchard and Marianne Paradis) were kind enough to publicly share their excellent project.

All in all, the course met its goals. It did require a lot of work to develop (slides, scripts, simulated data, Quarto files, and a complementary external website with a dozen supplemental pages), but I do think it was worth it. Some aspects of the course, especially the coding in R, were particularly challenging for many students, but they certainly rose to the occasion. Later in the program, students also take a course in computational linguistics (where they learn some Python), and LNG-1100 can certainly serve as a good foundation for them. In addition, as noted throughout the course, the use of Quarto to produce academic reports and essays is an added benefit: for most students, this was the first time they managed references and produced academic reports. These skills are easily applicable to most other courses they will take in the program.

I will revisit this post in the future, especially when I have access to the course evaluation and the students’ feedback.


Copyright © 2024 Guilherme Duarte Garcia

Footnotes

  1. Programme d’appui à l’innovation pédagogique (PAIP).

  2. Bear in mind that undergraduate students in Québec go through CEGEP before starting their programs.

  3. File organization is also a point I emphasize in my book.

  4. The issue above was particularly problematic because some students did not follow the suggested file structure and instead decided to leave all the course files on their Desktop, a practice that seems to be very common. For future reference, I think these file organization “suggestions” should become obligatory.