Non-standard evaluation of arguments
One difference with purrr is that furrr has to evaluate the arguments passed through `...` in functions such as `future_map()`. This has to happen before they can be serialized and shipped off to the worker sessions. These arguments are guaranteed to be evaluated only once, but this prevents some of the "lazy" behavior that is possible with purrr. For example:
```r
filter_for_dogs <- function(data, col) {
  filter(data, {{ col }} == "dog")
}

df1 <- tibble(
  pets = c("dog", "cat"),
  names = c("Floofy", "Buttercup")
)

df2 <- tibble(
  pets = c("horse", "dog", "mouse"),
  names = c("Stalone", "Fido", "Cheesy")
)

dfs <- list(df1, df2)
```
`map()` delays evaluation as long as possible, so `pets` is evaluated in a context where the data frame exists, and it can be detected as a column of that data frame.
```r
map(dfs, filter_for_dogs, col = pets)
#> [[1]]
#> # A tibble: 1 × 2
#>   pets  names 
#>   <chr> <chr> 
#> 1 dog   Floofy
#> 
#> [[2]]
#> # A tibble: 1 × 2
#>   pets  names
#>   <chr> <chr>
#> 1 dog   Fido
```
`future_map()` has to evaluate each argument early, so `pets` is evaluated in a context where the data frame doesn't exist, and we get an error.
```r
future_map(dfs, filter_for_dogs, col = pets)
#> Error in get_globals_and_packages(options$globals, options$packages, map_fn, : object 'pets' not found
```
One alternative is to pass the column through as a character string and use the `.data` pronoun to retrieve the column.
```r
filter_for_dogs2 <- function(data, col) {
  filter(data, .data[[col]] == "dog")
}

future_map(dfs, filter_for_dogs2, col = "pets")
#> [[1]]
#> # A tibble: 1 × 2
#>   pets  names 
#>   <chr> <chr> 
#> 1 dog   Floofy
#> 
#> [[2]]
#> # A tibble: 1 × 2
#>   pets  names
#>   <chr> <chr>
#> 1 dog   Fido
```
Argument evaluation
In both purrr and furrr, there is a difference between passing arguments through `...` and specifying them directly inside an anonymous function. Arguments passed through `...` are evaluated just once. If you want an argument to be re-evaluated at each iteration, you'll need to put it inside the anonymous function. For example:
```r
x <- rep(0, 3)
plus <- function(x, y) x + y

set.seed(123)

map_dbl(x, plus, runif(1))
#> [1] 0.2875775 0.2875775 0.2875775

map_dbl(x, ~ plus(.x, runif(1)))
#> [1] 0.7883051 0.4089769 0.8830174
```
This behavior is shared by purrr and furrr, but it is a common question. Note that in the furrr case, an argument evaluated inside the anonymous function is evaluated on the worker itself. This means that to control reproducibility, you should pass a `furrr_options()` object with a specified seed through `.options`.
```r
plan(multisession, workers = 2)

options <- furrr_options(seed = 123)

future_map_dbl(x, plus, runif(1))
#> [1] 0.9404673 0.9404673 0.9404673

future_map_dbl(x, ~ plus(.x, runif(1)), .options = options)
#> [1] 0.1552317 0.4877356 0.5330014

plan(sequential)
```
Grouped data frames
A common source of frustration is swapping a `map()` for a `future_map()` and realizing that your computation is proceeding massively slower than it was with `map()`. One possible reason for this is that you have called `future_map()` on a column of a grouped data frame. For example, the following data frame can arise naturally if you have nested a grouped data frame.
```r
set.seed(123)

df <- tibble(
  g = 1:100,
  x = replicate(100, runif(10), simplify = FALSE)
)

df <- group_by(df, g)

df
#> # A tibble: 100 × 2
#> # Groups:   g [100]
#>        g x         
#>    <int> <list>    
#>  1     1 <dbl [10]>
#>  2     2 <dbl [10]>
#>  3     3 <dbl [10]>
#>  4     4 <dbl [10]>
#>  5     5 <dbl [10]>
#>  6     6 <dbl [10]>
#>  7     7 <dbl [10]>
#>  8     8 <dbl [10]>
#>  9     9 <dbl [10]>
#> 10    10 <dbl [10]>
#> # … with 90 more rows
#> # ℹ Use `print(n = ...)` to see more rows
```
If you'd like to map over this and perform some computation on each element of `x`, you might try `future_map_dbl()`, but you'll be surprised at how slow it can be.
```r
plan(multisession, workers = 2)

t1 <- proc.time()

df %>%
  mutate(y = future_map_dbl(x, mean))
#> # A tibble: 100 × 3
#> # Groups:   g [100]
#>        g x              y
#>    <int> <list>     <dbl>
#>  1     1 <dbl [10]> 0.578
#>  2     2 <dbl [10]> 0.523
#>  3     3 <dbl [10]> 0.616
#>  4     4 <dbl [10]> 0.538
#>  5     5 <dbl [10]> 0.345
#>  6     6 <dbl [10]> 0.433
#>  7     7 <dbl [10]> 0.554
#>  8     8 <dbl [10]> 0.425
#>  9     9 <dbl [10]> 0.559
#> 10    10 <dbl [10]> 0.415
#> # … with 90 more rows
#> # ℹ Use `print(n = ...)` to see more rows

t2 <- proc.time()

plan(sequential)

t2 - t1
#>    user  system elapsed 
#>   1.631   0.046   4.594
```
The issue here is that the grouped nature of the data frame prevents furrr from doing what it is good at: sharding the `x` column into equally sized chunks and sending them off to the workers to be processed in parallel. Instead, because this data frame is grouped, and each group corresponds to 1 row of the data frame, dplyr hands `future_map_dbl()` 1 element of `x` at a time to operate on. So `future_map_dbl()` is actually called 100 times here!
The easy solution is to ungroup the data frame before calling `future_map_dbl()`.
```r
plan(multisession, workers = 2)

t1 <- proc.time()

df %>%
  ungroup() %>%
  mutate(y = future_map_dbl(x, mean))
#> # A tibble: 100 × 3
#>        g x              y
#>    <int> <list>     <dbl>
#>  1     1 <dbl [10]> 0.578
#>  2     2 <dbl [10]> 0.523
#>  3     3 <dbl [10]> 0.616
#>  4     4 <dbl [10]> 0.538
#>  5     5 <dbl [10]> 0.345
#>  6     6 <dbl [10]> 0.433
#>  7     7 <dbl [10]> 0.554
#>  8     8 <dbl [10]> 0.425
#>  9     9 <dbl [10]> 0.559
#> 10    10 <dbl [10]> 0.415
#> # … with 90 more rows
#> # ℹ Use `print(n = ...)` to see more rows

t2 <- proc.time()

plan(sequential)

t2 - t1
#>    user  system elapsed 
#>   0.065   0.000   0.635
```
Graphics devices
If you use a multicore plan, you shouldn't try to generate and save plots with any graphics device, which includes using ggplot2. Graphics devices can't be safely run in a forked environment, and trying to do so can trigger a fatal X11 error that crashes your R session. Instead, use `plan(multisession)` to avoid these issues. See this issue for more details.
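As a minimal sketch of the safe pattern (the plot, file names, and use of `future_walk()` here are illustrative, not from the original text, and assume ggplot2 is installed), saving plots in parallel works under a multisession plan because each worker is a fresh R session rather than a fork:

```r
library(furrr)
library(ggplot2)

# multisession workers are separate R sessions, not forks, so a
# graphics device opened on a worker can't corrupt the parent session
plan(multisession, workers = 2)

future_walk(1:2, function(i) {
  p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
  # ggsave() opens and closes a graphics device on the worker itself
  ggsave(tempfile(fileext = ".png"), p, width = 4, height = 4)
})

plan(sequential)
```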
Package development
When developing a package that imports and calls functions from furrr, you'll likely be using `devtools::load_all()` as part of your development process. Unless you install your package, you might run into issues where functions internal to your package aren't exported to your workers (see issue #95). Specifically, you will probably have issues if:
- Your package has not yet been installed on your machine, or you have an old version installed.
- You call `devtools::load_all()`.
- You set up a multisession or multicore strategy for furrr.
- You call `future_map()` or any other furrr function from inside your package, and `.f` contains a function specific to your package.
In this scenario, the underlying globals package will likely think that the function you called from `.f` belongs to a package that is installed on your machine, so it won't try to export it to the workers. Instead, it will just try to load that package on each worker to get access to the function. Since the package hasn't actually been installed on your machine (`load_all()` just mocks an installation), the workers will fail to attach it.
The solution is to install your package with `devtools::install()` or the RStudio Build pane, and then restart R. Make sure that you re-install whenever you make additional changes to the package.
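To make the failure mode concrete, here is a hypothetical sketch (the package name `mypkg` and the helper `internal_helper()` are invented for illustration; this is not code from the original article):

```r
# During development of mypkg, after devtools::load_all() but
# before devtools::install():

# internal_helper() lives in mypkg's (mocked) namespace
internal_helper <- function(x) x + 1

run_parallel <- function(x) {
  plan(multisession, workers = 2)
  on.exit(plan(sequential), add = TRUE)
  # globals sees internal_helper() as belonging to an installed mypkg,
  # so instead of exporting the function it asks each worker to load
  # mypkg - which fails, because mypkg was never actually installed
  future_map(x, internal_helper)
}
```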
Function environments and large objects
When `future_map()` and friends are called from within another function, you have to be extremely careful about the `.f` function that you pass through. If the function that `future_map()` is called from contains a large object, it is possible for that object to get captured by the function environment of `.f` and be exported to the workers, even if `.f` never uses it.
```r
my_fast_fn <- function() {
  future_map(1:5, ~.x)
}

my_slow_fn <- function() {
  # Massive object - but we don't want it in `.f`
  big <- 1:1e8 + 0L
  future_map(1:5, ~.x)
}

plan(multisession, workers = 2)

system.time(
  my_fast_fn()
)
#>    user  system elapsed 
#>   0.033   0.000   0.598

system.time(
  my_slow_fn()
)
#>    user  system elapsed 
#>   0.701   0.697   2.255

plan(sequential)
```
In the above example, `big` is captured in the function environment of the anonymous function `~.x` and is exported. Note that the problem isn't that `big` is identified as a global by furrr. We can even prove that it is on the workers by using `get()` to look for an object called `"big"` in the current and enclosing environments. I'll use a smaller object here, but the concept is the same.
```r
plan(multisession, workers = 2)

my_slow_fn2 <- function() {
  big <- "can you find me?"
  future_map(1:2, ~get("big"))
}

my_slow_fn2()
#> [[1]]
#> [1] "can you find me?"
#> 
#> [[2]]
#> [1] "can you find me?"

plan(sequential)
```
One solution is to create `.f` somewhere it won't capture that massive object in its enclosing environment. For example:
```r
fn <- function(x) {
  x
}

my_not_so_slow_fn <- function() {
  big <- 1:1e8 + 0L
  future_map(1:5, fn)
}

plan(multisession, workers = 2)

system.time(
  my_not_so_slow_fn()
)
#>    user  system elapsed 
#>   0.273   0.262   1.127

plan(sequential)
```
Here, lexical scoping is used to find `fn`, but you could also pass it in as an argument to `my_not_so_slow_fn()`. This works naturally in a package development setting, where `fn()` would just be a helper function in your package that you can call from anywhere else in the package without issue.
Again, we can prove that the object doesn’t make it onto the workers:
```r
plan(multisession, workers = 2)

fn2 <- function(x) {
  # does an object called `"big"` exist anywhere we can find it?
  exists("big")
}

my_not_so_slow_fn2 <- function() {
  big <- "can you find me?"
  future_map(1:2, fn2)
}

my_not_so_slow_fn2()
#> [[1]]
#> [1] FALSE
#> 
#> [[2]]
#> [1] FALSE

plan(sequential)
```
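The pass-it-as-an-argument alternative mentioned above can be sketched like this (`my_not_so_slow_fn3()` is an invented name, not from the original text). Because `.f` is created by the caller, its enclosing environment never contains `big`:

```r
my_not_so_slow_fn3 <- function(fn) {
  big <- 1:1e8 + 0L
  # `fn` was created in the caller's environment, so it doesn't
  # enclose `big` - only `fn` itself is exported to the workers
  future_map(1:5, fn)
}

plan(multisession, workers = 2)
my_not_so_slow_fn3(function(x) x)
plan(sequential)
```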
Another alternative to help with this issue is to use the carrier package to crate the function. To learn more about this, see the article entitled Alternative to automatic globals detection.