Pre-operation grouping in dplyr

dplyr

A simpler way to group variables using dplyr that can save you some lines of code.

Author

Guilherme D. Garcia

Published

April 23, 2023

If you use R, you probably use the tidyverse, and dplyr in particular, all the time. I constantly employ a combination of group_by() and summarize() or mutate(). Of course, I tend to always ungroup() the variables at the end of the pipeline (even if/when I don’t have to). The .by argument, introduced in dplyr 1.1.0, simplifies that. I’ll use some dummy code below with variables Group, Proficiency, and Score. In the code below, I’m using R’s native pipe operator (|>), which for our purposes here is the same as %>% from magrittr. In RStudio, you can type it with the shortcut Cmd-Shift-M, provided the option to use the native pipe is enabled in the settings.

# Before:
myData |> 
  group_by(Group, Proficiency) |> 
  summarize(Mean = mean(Score)) |> 
  ungroup()

# Now:
myData |> 
  summarize(Mean = mean(Score), .by = c(Group, Proficiency))

If you have never grouped variables in R, it may be helpful to visualize what’s happening here. This is what the code above is doing assuming two groups of participants (Portuguese and Spanish) and three levels of proficiency (it all results in six mean scores):

flowchart TB
  A{myData} --> B[Portuguese]
  A --> C[Spanish]
  C --> E[Beginner] --> J(Mean score)
  C --> F[Intermediate] --> I(Mean score)
  C --> G[Advanced] --> H(Mean score)
  B --> K[Beginner] --> N(Mean score)
  B --> L[Intermediate] --> O(Mean score)
  B --> M[Advanced] --> P(Mean score)

This pre-operation grouping with .by applies only to the operation where it appears, so the result comes back ungrouped. In the example above, that removes two lines of code: group_by() and ungroup().
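To see the difference in grouping on actual data, here is a minimal sketch, assuming dplyr 1.1.0 or later. The values in myData are invented for illustration; only the column names come from the example above.

```r
library(dplyr)

# Hypothetical toy data: column names follow the post, values are invented
myData <- tibble(
  Group = rep(c("Spanish", "Portuguese"), each = 6),
  Proficiency = rep(rep(c("Beginner", "Intermediate", "Advanced"), each = 2), times = 2),
  Score = c(4, 6, 6, 8, 8, 10, 3, 5, 5, 7, 9, 9)
)

# group_by() + summarize() without ungroup(): the last grouping level
# (Proficiency) is dropped, but the result is still grouped by Group
grouped <- myData |>
  group_by(Group, Proficiency) |>
  summarize(Mean = mean(Score), .groups = "drop_last")

# .by groups only for this one call, so the result is a plain tibble
by_arg <- myData |>
  summarize(Mean = mean(Score), .by = c(Group, Proficiency))

group_vars(grouped)    # "Group"
is_grouped_df(by_arg)  # FALSE
```

Both objects hold the same six means; they differ only in row order and in whether any grouping is left behind.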

Here’s a use case I like: we often need to calculate proportions based on groupings. To do that, we could do:

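# Option 1 (more explicit)
myData |>
  group_by(Group, Proficiency, Score) |>
  count() |>
  group_by(Group, Proficiency) |>
  # mutate(), not summarize(): we want one proportion per row,
  # and we want to keep the Score and n columns
  mutate(prop = n / sum(n))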

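# Option 2 (more concise)
myData |>
  group_by(Group, Proficiency, Score) |>
  # summarize() drops the last grouping level (Score),
  # so mutate() below works within Group and Proficiency
  summarize(n = n()) |>
  mutate(prop = n / sum(n))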

Here, we’re doing two operations on two different groupings, so with .by we can state each grouping right where it’s used:

# Option 3 (pre-op grouping: even more concise?)
myData |> 
  summarize(n = n(), .by = c(Group, Proficiency, Score)) |> 
  mutate(prop = n / sum(n), .by = c(Group, Proficiency))

Apart from the fact that options 1 and 2 return a data frame that is still grouped, the only difference between option 3 and the other two is the order in which results are presented: group_by() sorts the output by the grouping variables, whereas .by keeps the groups in order of first appearance. You can always arrange() it however you like. What I like about option 3 is that it’s concise, sure, but it’s also readable, since the groupings are explicit and local, so it’s easy to see their scopes. And, of course, we don’t need to ungroup anything at the end.
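The ordering difference is easy to check on a small data set. The sketch below assumes dplyr 1.1.0 or later and uses invented values for myData, with Spanish deliberately listed first so that the two orderings diverge.

```r
library(dplyr)

# Hypothetical toy data: column names follow the post, values are invented
myData <- tibble(
  Group = rep(c("Spanish", "Portuguese"), each = 4),
  Proficiency = rep(c("Beginner", "Advanced"), each = 2, times = 2),
  Score = c(1, 1, 2, 3, 2, 2, 1, 3)
)

# Option 2: group_by() sorts the groups, so Portuguese comes first
opt2 <- myData |>
  group_by(Group, Proficiency, Score) |>
  summarize(n = n(), .groups = "drop_last") |>  # the default here, made explicit
  mutate(prop = n / sum(n)) |>
  ungroup()

# Option 3: .by keeps groups in order of first appearance, so Spanish comes first
opt3 <- myData |>
  summarize(n = n(), .by = c(Group, Proficiency, Score)) |>
  mutate(prop = n / sum(n), .by = c(Group, Proficiency))

# Sorting option 3 by the grouping variables lines its rows up with option 2
opt3_sorted <- opt3 |>
  arrange(Group, Proficiency, Score)
```

After the arrange() call, opt3_sorted and opt2 contain the same rows in the same order.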

For a final example, let’s consider a situation where we want to group the data by one variable, Proficiency, generate counts, and then calculate individual proportions relative to the whole data set.

# Before (using count(), not summarize()):
myData |> 
  group_by(Proficiency) |> 
  count() |> 
  ungroup() |> 
  mutate(Prop = n / sum(n))

# Now:
myData |> 
  summarize(n = n(), .by = Proficiency) |> 
  mutate(Prop = n / sum(n))
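
As a quick check on this last example, here is a minimal sketch with invented data, again assuming dplyr 1.1.0 or later. It also includes count(Proficiency), which does the grouping and counting in a single call; that variant is an addition for comparison, not part of the example above.

```r
library(dplyr)

# Hypothetical toy data: only the Proficiency column matters here
myData <- tibble(
  Proficiency = c("Beginner", "Beginner", "Intermediate",
                  "Advanced", "Advanced", "Advanced", "Advanced", "Beginner"),
  Score = c(3, 4, 6, 8, 9, 9, 10, 5)
)

# Counts per level via .by, then proportions over the whole data set
with_by <- myData |>
  summarize(n = n(), .by = Proficiency) |>
  mutate(Prop = n / sum(n))

# count() takes the grouping variable directly and returns an ungrouped result
with_count <- myData |>
  count(Proficiency) |>
  mutate(Prop = n / sum(n))

sum(with_by$Prop)  # 1
```

The two results hold the same proportions; count() sorts the levels, while .by keeps them in order of first appearance.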

There’s always a trade-off between being concise and being clear. While I think spelling out steps individually will tend to be clearer to those new to R and tidyverse, the more concise version is still quite readable.