Iterate functions with map

When your functions are too slow because they’re not vectorized, map() can be a great alternative to for-loops.

Author

Guilherme D. Garcia

Published

April 24, 2023

Modified

October 4, 2023

I’m often in a situation where I have a function and I need to apply it iteratively to a lot of data. This is especially necessary for the package I maintain (Fonology). We want functions to be fast, of course. In R, this means we want it to be vectorized. Not coming from computer science, I find this topic quite interesting.

Not all functions can be vectorized, and that’s the issue. So what can we do? A common option is to run a for-loop. Here’s quick example: suppose you want to write a sequence of numbers where each number n repeats n times. Here’s one way to do that with a for-loop:

numbers = 1:5

for(i in numbers){
  rep(i, i) |> 
    print()
}

[1] 1
[1] 2 2
[1] 3 3 3
[1] 4 4 4 4
[1] 5 5 5 5 5

For-loops tend to do a great job if you don’t have too much data. They also tend to be intuitive, so if you’re not familiar with more exoteric functions, they are a very good place to start. That being said, it’s usually a good idea to avoid for-loops if there’s a better option out there (for-loops tend to be much slower). One common alternative is to use the apply() family of functions in R.

numbers = 1:5

lapply(numbers, function(x){
  rep(x, x)}) |> 
  unlist()

 [1] 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5

A more recent option is to use the map() function from the purrr package, which is extremely useful. Here’s the same idea with map():

library(purrr)
map(1:5, \(x) rep(x, x)) |> unlist()

 [1] 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5

So, if you’ve created a non-vectorized function and now need to apply it to several inputs at once (say, to a whole column of data), you can use map() or apply() to speed up the process.

Using multiple functions

Here’s another example showing how useful the map() function can be. Suppose we want to sample from a population $n$ times and take the mean of each sample to store it in a vector (dbl). We could do an easy for-loop to accomplish this task, but map_dbl() is a much more efficient way to do that.

library(tidyverse)

# Simulate a population:
set.seed(1)
pop = rnorm(n = 20000, mean = 5, sd = 5)

# Take 100 samples, each of size 50:
set.seed(1)
means = map_dbl(1:100, ~mean(sample(pop, size = 50)))

ggplot(data = data.frame(x = means), aes(x = x)) + 
  geom_histogram(color = "white", fill = "darkorange2", alpha = 0.5, bins = 8) + 
  theme_classic() +
  labs(x = "Means", y = NULL)

Sampling distribution of the sample mean