Introduction

At some point, you might need to go beyond your local computer for increased performance and scalability. Luckily, because furrr depends on the powerful underlying future framework, this can be done with relative ease. In this vignette, you will learn how to scale furrr to be used with AWS EC2 in two cases:

  1. Running code remotely on a single EC2 instance
  2. Additionally running code in parallel on each EC2 instance

AWS EC2? What?

If you already know what AWS EC2 is, and what AMIs are, feel free to skip this section!

AWS EC2 = Amazon Web Services, Elastic Compute Cloud

AWS is the collection of web services that Amazon offers to help companies scale into the cloud. It includes hosted databases (RDS), domain management for websites (Route 53), and, of course, EC2.

EC2 is a way for people like you and me to essentially rent a computer (or several) in the cloud for a variable amount of time. The computer can be incredibly powerful, or really weak (and cheap!). It can run Linux or Windows. With furrr, we will run our code on these EC2 “instances” pre-loaded with R.

How do we get an instance pre-loaded with R? Great question. We will use an AMI. AMIs are “Amazon Machine Images”: in other words, machine images that already have software installed, rather than starting from nothing. A kind soul, Louis Aslett, keeps up-to-date RStudio AMIs on his website. We will use one of these for our instance.

At this point, I encourage you to look elsewhere for exactly how to set up an AWS instance based on this AMI. I have a blog post dedicated to this, located at my website blog.davisvaughan.com.

Running code remotely on a single EC2 instance

Imagine you have some models that you want to run, but you don’t want to run them on your local computer. You could test them locally, but then you’d ideally like to run them on a more powerful EC2 instance. What if you could change one line of your modeling code to switch between local and EC2 execution? That’s what you’ll learn how to do here.

Modeling code

First, we need some code that we want to run in parallel. For simplicity, say we want to fit three separate linear models on mtcars, split up by gear.

library(dplyr)
library(purrr)

cars2_mod <- mtcars %>%
  split(.$gear) %>%
  map(~lm(mpg ~ cyl + hp + wt, data = .))

cars2_mod
## $`3`
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt, data = .)
## 
## Coefficients:
## (Intercept)          cyl           hp           wt  
##    30.48956     -0.31883     -0.02326     -2.03083  
## 
## 
## $`4`
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt, data = .)
## 
## Coefficients:
## (Intercept)          cyl           hp           wt  
##    43.69353      0.05647     -0.12331     -3.20537  
## 
## 
## $`5`
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt, data = .)
## 
## Coefficients:
## (Intercept)          cyl           hp           wt  
##    45.29099     -2.03268      0.02655     -6.42290

With furrr, we can run this in parallel locally using the following:

library(furrr)
## Loading required package: future
plan(multisession)

cars2_mod_future <- mtcars %>%
  split(.$gear) %>%
  future_map(~lm(mpg ~ cyl + hp + wt, data = .))

Note that this is NOT faster than the sequential code; it just demonstrates how one might run the models in parallel.

Connecting to an EC2 instance

Now, what if these models took hours to run? You might want to run them on a different, more powerful computer, and have the results returned right back into your local R session. In that case, go start up your favorite AWS EC2 instance, preloaded with R, and come back when you’ve finished. Then, you’ll need to:

  • Get the Public IP of your EC2 instance. This is located under the Instances section of the EC2 console. Specifically, it is the IPv4 Public IP of your instance.

  • Make sure that your Security Group allows for SSH access either from Anywhere or My IP.

  • Find the path to your .pem file that is used to connect to the EC2 instance. This was created when you created the EC2 instance, and hopefully you know where you saved it!

# A t2.micro AWS instance
# Created from http://www.louisaslett.com/RStudio_AMI/
public_ip <- "18.206.46.236"

# This is where my pem file lives (password file to connect).
ssh_private_key_file <- "~/Desktop/programming/AWS/key-pair/dvaughan.pem"

With all of this in hand, the next step is to use future::makeClusterPSOCK() to connect to the instance. Traditionally, one would use parallel::makePSOCKcluster() to connect, but the future version has a few additional helpful arguments that let us pass extra options when connecting to the worker. If the connection is successful, the code below should start printing package installation messages in your local console.

cl <- makeClusterPSOCK(

  # Public IP number of EC2 instance
  workers = public_ip,

  # User name (always 'ubuntu')
  user = "ubuntu",

  # Use private SSH key registered with AWS
  rshopts = c(
    "-o", "StrictHostKeyChecking=no",
    "-o", "IdentitiesOnly=yes",
    "-i", ssh_private_key_file
  ),

  # Set up .libPaths() for the 'ubuntu' user and
  # install furrr
  rscript_args = c(
    "-e", shQuote("local({p <- Sys.getenv('R_LIBS_USER'); dir.create(p, recursive = TRUE, showWarnings = FALSE); .libPaths(p)})"),
    "-e", shQuote("install.packages('furrr')")
  ),

  # Switch this to TRUE to see the code that is run on the workers without
  # making the connection
  dryrun = FALSE
)

cl
## socket cluster with 1 nodes on host ‘18.206.46.236’
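
If you want to double-check the connection before moving on, you can evaluate a trivial expression on the worker with parallel::clusterEvalQ() (a quick sanity-check sketch, not part of the original setup; it works on any PSOCK cluster):

# Ask the worker which R version it is running
parallel::clusterEvalQ(cl, R.version.string)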

Let’s step through this a little.

  • workers - The public IP addresses of the workers you want to connect to. If you have multiple workers, you can list them all here.

  • user - Because we used the RStudio AMI, this is always "ubuntu".

  • rshopts - These are options passed to ssh on the command line of your local computer when connecting to the instance.

    • StrictHostKeyChecking=no - This is required because, by default, when connecting to the AWS instance for the first time you are asked whether you want to “continue connecting”, since the authenticity of the AWS instance can’t be verified. Setting this option to no means we won’t have to answer that question.
    • IdentitiesOnly=yes - This is not necessarily required, but specifies that we only want to connect using the identity we supply with -i, which ends up being the .pem file.
  • rscript_args - This very helpful argument allows you to specify R code to run when the command line executable Rscript is called on your worker. Essentially, it allows you to run “start up code” on each worker. In this case, it is used to create the package library path for the ubuntu user and to install furrr (which pulls in its dependencies, including future).

  • dryrun - This is already set to FALSE by default, but it’s useful to point this argument out, as setting it to TRUE lets you verify the code that would run on each worker without actually connecting (see the sketch below).
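
For example, here is a sketch of the same call with dryrun = TRUE. Nothing here touches AWS; it just prints the SSH command and worker setup code so you can inspect them:

# Same connection options as above (rscript_args omitted for brevity),
# but this only displays what would be run, without connecting
makeClusterPSOCK(
  workers = public_ip,
  user = "ubuntu",
  rshopts = c(
    "-o", "StrictHostKeyChecking=no",
    "-o", "IdentitiesOnly=yes",
    "-i", ssh_private_key_file
  ),
  dryrun = TRUE
)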

Running the code

Great! Now we have a connection to our EC2 instance running R. The next step is the one line change I promised earlier. We just need to set our plan() to use the EC2 instance rather than our local computer. Rather than using the multisession plan, we use the cluster plan with the extra argument workers set to the cluster connection (see ?future::cluster for more info).

library(furrr)
plan(cluster, workers = cl)

cars2_mod_future <- mtcars %>%
  split(.$gear) %>%
  future_map(~lm(mpg ~ cyl + hp + wt, data = .))

And that’s it! Your code just ran on the more powerful EC2 instance!
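
One housekeeping note: when you are finished with the instance, it is good practice to close the connection and reset your plan (and to stop the instance in the EC2 console so you aren’t billed for idle time). A minimal sketch:

# Shut down the worker R session and return to sequential processing
parallel::stopCluster(cl)
plan(sequential)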

Running code in parallel on each EC2 instance

Let’s crank it up a notch. Right now you have one EC2 instance, and the code on that instance runs sequentially. What if you had multiple EC2 instances, and each of those instances had multiple cores? For maximum efficiency, you’d want to:

  1. First, parallelize across the EC2 instances.
  2. Then, parallelize across the cores of each EC2 instance.

A concrete example might be having 2 t2.xlarge instances, each with 4 vCPUs. That is 2 × 4 = 8 workers, for a potential max of ~8x speedup (in reality, it will be more like ~4-6x, because half of those 8 vCPUs are hyperthreads rather than physical cores, so you likely won’t get a fully linear speedup from them when doing any substantial work).

In the future world, this is dubbed “future topology”, but I also like the term “multi-level parallel processing”.

Connecting to multiple EC2 instances

So, just like before, start up your EC2 instances. Except this time, start several of them, and make sure to check out the EC2 instance type reference to see how many vCPUs each one has.

Hint: to launch multiple instances, after clicking on the AMI you want to use from Louis’s page, change the “Number of instances” box under “Configure Instance” to whatever you require. You might also consider the “Request Spot instances” purchasing option for cheaper instances, if you don’t mind the possibility that Amazon could take an instance away from you temporarily at any time.

Note that you now have a vector of public IP addresses.

# Two t2.xlarge AWS instances
# Created from http://www.louisaslett.com/RStudio_AMI/
public_ip <- c("34.205.203.220", "52.3.235.148")

# This is where my pem file lives (password file to connect).
ssh_private_key_file <- "~/Desktop/programming/AWS/key-pair/dvaughan.pem"

Otherwise, the code remains the same to make the connection!

cl_multi <- makeClusterPSOCK(

  # Public IPs of the EC2 instances
  workers = public_ip,

  # User name (always 'ubuntu')
  user = "ubuntu",

  # Use private SSH key registered with AWS
  rshopts = c(
    "-o", "StrictHostKeyChecking=no",
    "-o", "IdentitiesOnly=yes",
    "-i", ssh_private_key_file
  ),

  # Set up .libPaths() for the 'ubuntu' user and
  # install furrr
  rscript_args = c(
    "-e", shQuote("local({p <- Sys.getenv('R_LIBS_USER'); dir.create(p, recursive = TRUE, showWarnings = FALSE); .libPaths(p)})"),
    "-e", shQuote("install.packages('furrr')")
  ),

  # Switch this to TRUE to see the code that is run on the workers without
  # making the connection
  dryrun = FALSE
)

cl_multi
## socket cluster with 2 nodes on hosts ‘34.205.203.220’, ‘52.3.235.148’
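
Before relying on those vCPU counts, you can ask each instance directly how many cores it sees. Here is a quick sketch using parallel::clusterEvalQ() to run future::availableCores() on every node (future is already installed on the workers as a dependency of furrr):

# One result per instance; each t2.xlarge should report 4
parallel::clusterEvalQ(cl_multi, future::availableCores())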

Running multi-level parallel code

Now for the fun part. How do we tell future to first distribute our code over the 2 instances, and then run in parallel on each instance? You pass a list of plans to plan(), and you have the option to tweak() each plan individually (which is required here to set the workers argument!).

library(furrr)
# The outer plan tells future to distribute over the 2 instances
# The inner plan says to run in parallel on each instance
plan(list(tweak(cluster, workers = cl_multi), multisession))

How do we know this is working? Let’s try something that takes a fixed amount of time when run sequentially, then try it in parallel. We are just going to wait for 5 seconds on each iteration, and then return the instance we are on and the core we are using. Run sequentially, this is 2 instances × 4 iterations × 5 seconds = 40 seconds in total.

plan(sequential)

t1 <- proc.time()

res <- future_map(

  # Map over the two instances
  .x = c(1, 2),

  .f = ~ {

    outer_idx <- .x

    future_map(

      # Each instance has 4 cores we can utilize
      .x = c(1, 2, 3, 4),

      .f = ~ {
        inner_idx <- .x
        Sys.sleep(5)
        paste0("Instance: ", outer_idx, " and core: ", inner_idx)
      }
    )

  }
)

t2 <- proc.time()

res
## [[1]]
## [[1]][[1]]
## [1] "Instance: 1 and core: 1"
## 
## [[1]][[2]]
## [1] "Instance: 1 and core: 2"
## 
## [[1]][[3]]
## [1] "Instance: 1 and core: 3"
## 
## [[1]][[4]]
## [1] "Instance: 1 and core: 4"
## 
## 
## [[2]]
## [[2]][[1]]
## [1] "Instance: 2 and core: 1"
## 
## [[2]][[2]]
## [1] "Instance: 2 and core: 2"
## 
## [[2]][[3]]
## [1] "Instance: 2 and core: 3"
## 
## [[2]][[4]]
## [1] "Instance: 2 and core: 4"
t2 - t1
##   user  system elapsed 
##  0.227   0.122  40.146 

Now, in parallel with our cluster. The outer future_map() call distributes over the two instances, and the inner future_map() call distributes over the cores of each instance. This should take ~5 seconds, with some overhead.

plan(list(tweak(cluster, workers = cl_multi), multisession))

t1 <- proc.time()

res <- future_map(

  # Map over the two instances
  .x = c(1, 2),

  .f = ~ {

    outer_idx <- .x

    future_map(

      # Each instance has 4 cores we can utilize
      .x = c(1, 2, 3, 4),

      .f = ~ {
        inner_idx <- .x
        Sys.sleep(5)
        paste0("Instance: ", outer_idx, " and core: ", inner_idx)
      }
    )

  }
)

t2 <- proc.time()
t2 - t1
##   user  system elapsed 
##  0.126   0.029   8.099 

Not bad! The extra few seconds come from the overhead of communicating with the AWS workers, but for long-running models this overhead would be negligible.
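
If you want more direct proof of where the work happened, a small variation on the example above (a sketch, using base R’s Sys.info()) returns each worker’s actual hostname instead of a constructed label:

# Should return two distinct EC2 hostnames, four copies each
future_map(
  .x = c(1, 2),
  .f = ~ future_map_chr(1:4, function(core) Sys.info()[["nodename"]])
)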

Conclusion

In this vignette, you learned how to distribute your code over AWS EC2 instances, and run code in parallel on each instance, using future and furrr. Note that the code used here can also be adapted to run on platforms such as Google Compute Engine, or on other remote clusters; you will just have to figure out the correct way to connect to them. Additionally, once you have the connection in place, you could also run basic future() commands to distribute code. This has the added benefit of not blocking your local session until you request the result with value().
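
As a closing sketch, here is roughly what that might look like, assuming cl is the single-instance cluster from earlier:

plan(cluster, workers = cl)

# Send the work to the EC2 instance; future() returns immediately
f <- future({
  mtcars %>%
    split(.$gear) %>%
    map(~lm(mpg ~ cyl + hp + wt, data = .))
})

# Your local session stays free; block and collect only when needed
cars2_mod_remote <- value(f)

Happy parallelizing!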