High performance pipelines

1. Introduction

This workshop is inspired by talks given by Bryan Lewis, Henrik Bengtsson and Will Landau, who worked on the foreach, future and targets R libraries. Their talks are referenced throughout the workshop.

What are pipelines?

Pipelines represent queues of tasks given to a computer to process. In their most basic form, they are simply series of functions that bring data from point A to point B. More advanced pipelines link functions as meaningful steps, conditional on one another, that can be run in parallel, on one or multiple computing clusters, locally or remotely, synchronously or asynchronously.

What is high performance computing (HPC)?

High performance computing refers to using all available resources to reduce the time required for a task. When linking tasks in a pipeline where each task requires large amounts of computing resources, it becomes vital to make your processing environment as efficient as possible.

A few rules can make this easier:
- Use all of your resources (parallel programming)
- Make your code easily scalable (future)
- Only process what you need (targets)
- Use the fastest known route (benchmark)
- Always test on small subsets when possible

Should you care?

If you spend multiple hours waiting for your scripts to finish, re-run the same scripts (or parts of them) often, or end up with long and complex scripts with a lot of back and forth, some of these tricks could save you a few headaches.

2. Refresher course on functions

Let’s steal some material from Charles Martin’s Numerilab on functions and iteration.

When you need to copy and paste >2 times, make a function! It’s even more important in pipelines because it makes tracking changes much easier.

Writing a basic function

A basic function requires a name, one or multiple arguments and some kind of R expression to evaluate.
Let’s build a function that returns Shannon’s diversity index for a given community.

# generate a random community of 5 species (relative abundances)
com1 = runif(5)

# define our function
div_shannon = function(community){
  -sum(community*log(community))
}

# call our function on our community
div_shannon(com1)
## [1] 1.440253

Which is equivalent to calculating the index manually

-sum(com1*log(com1))
## [1] 1.440253

Notes about defensive programming, conditions and exit points

Since functions can quickly become complex and hard to understand (especially for other users, which includes future you), it can be beneficial to write your functions in ways that either prevent errors or inform about their nature. This can be done by knowing the type of data expected by the function and adding conditions and exit points to your functions.

For example, let’s look at vegan’s diversity function, which does essentially the same thing as our function, but also allows obtaining Simpson’s index and its inverse.

print(vegan::diversity)
## function (x, index = "shannon", MARGIN = 1, base = exp(1))
## {
##     x <- drop(as.matrix(x))
##     if (!is.numeric(x))
##         stop("input data must be numeric")
##     if (any(x < 0, na.rm = TRUE))
##         stop("input data must be non-negative")
##     INDICES <- c("shannon", "simpson", "invsimpson")
##     index <- match.arg(index, INDICES)
##     if (length(dim(x)) > 1) {
##         total <- apply(x, MARGIN, sum)
##         x <- sweep(x, MARGIN, total, "/")
##     }
##     else {
##         x <- x/(total <- sum(x))
##     }
##     if (index == "shannon")
##         x <- -x * log(x, base)
##     else x <- x * x
##     if (length(dim(x)) > 1)
##         H <- apply(x, MARGIN, sum, na.rm = TRUE)
##     else H <- sum(x, na.rm = TRUE)
##     if (index == "simpson")
##         H <- 1 - H
##     else if (index == "invsimpson")
##         H <- 1/H
##     if (any(NAS <- is.na(total)))
##         H[NAS] <- NA
##     H
## }
## <bytecode: 0x000000001fd28a50>
## <environment: namespace:vegan>

Here, we can see a long sequence of conditions resulting in multiple exit points, custom error messages when the input data do not meet the conditions, and different behaviour depending on the values given to the arguments.
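
As a minimal sketch, here is how similar checks could be added to our own function (div_shannon_safe is a hypothetical name; the normalisation step assumes raw abundances are supplied):

div_shannon_safe = function(community){
  if(!is.numeric(community))
    stop("community must be a numeric vector")
  if(any(community < 0, na.rm = TRUE))
    stop("abundances must be non-negative")
  p = community / sum(community, na.rm = TRUE)  # convert abundances to proportions
  -sum(p * log(p), na.rm = TRUE)
}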

For more information on functions, refer to Charles Martin’s Numerilab on functions and iteration

Iterations with for loops and purrr

The biggest advantage of writing functions is that they can be used to evaluate the same expression a large number of times for different arguments.

The simplest way is probably the for loop, where for every element in a sequence, the same actions will be performed.

for(i in 1:5){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

It can be used to perform the same action multiple times:

for(i in 1:3){
  com = runif(10)
  ds = div_shannon(com)
  print(ds)
}
## [1] 2.676359
## [1] 2.482739
## [1] 2.729525

Or to iterate over a number of objects (list, vector, rows) to populate an output.

com1 = runif(100)
com2 = runif(100)
com3 = runif(100)
coms = list(com1,com2,com3)
names(coms) = c("com1","com2","com3")

# over a list
for(i in seq_along(coms)){
  di = div_shannon(unlist(coms[i]))
  print(di)
}
## [1] 26.57868
## [1] 24.64214
## [1] 24.77425
# over a list to populate a dataframe
df_shannon = data.frame(NULL)
for(i in seq_along(coms)){
  ds = data.frame(community = i, shannon = div_shannon(unlist(coms[i])))
  df_shannon = rbind(df_shannon,ds)
}
df_shannon
##   community  shannon
## 1         1 26.57868
## 2         2 24.64214
## 3         3 24.77425

Similarly, a function can be applied to multiple objects by using the apply (base R) and map (tidyverse) function families (which are very similar).

lapply(coms, div_shannon) # apply function over a list and return a list
## $com1
## [1] 26.57868
##
## $com2
## [1] 24.64214
##
## $com3
## [1] 24.77425
sapply(coms, div_shannon) # apply function over a list and return an array of the same length
##     com1     com2     com3
## 26.57868 24.64214 24.77425
library(purrr)
library(dplyr)

coms %>% 
  map(div_shannon)
## $com1
## [1] 26.57868
##
## $com2
## [1] 24.64214
##
## $com3
## [1] 24.77425
coms %>% 
  map_dbl(div_shannon)
##     com1     com2     com3
## 26.57868 24.64214 24.77425
library(tidyr)
coms %>% 
  map_df(div_shannon) %>% 
  pivot_longer(cols = everything())
## # A tibble: 3 x 2
##   name  value
##   <chr> <dbl>
## 1 com1   26.6
## 2 com2   24.6
## 3 com3   24.8

3. Parallel computing

When computing larger tasks, processing times can increase dramatically (sometimes exponentially), making it impractical to run some scripts sequentially on a single processor. Luckily, most modern processors have multiple cores able to handle multiple tasks at once.
On the other hand, R has long been considered ill-suited for parallel computing, with parallel functions not working on all operating systems (e.g. Windows) and too often not adapting to the user’s resources.
More recently, multiple wrappers have been developed to allow parallel computing for a wider array of functions and platforms.

The classics : parallel and foreach

parallel: R’s built-in solution for parallel computing is the parallel library.

Its forking-based “multicore” functions, designated by the mc prefix (such as mclapply()), do not work on Windows, but allow for very simple parallelisation where users can define the number of processing cores to use.

library(parallel)
# locally
detectCores()
mclapply(coms, div_shannon, mc.cores = detectCores()-1) # does not work on windows

# can also register a cluster using makeCluster() and clusterApply()
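
A more portable sketch (this also works on Windows) is to create a socket cluster explicitly with makeCluster() and apply our function with parLapply():

cl = makeCluster(detectCores() - 1)  # socket cluster, portable across operating systems
parLapply(cl, coms, div_shannon)
stopCluster(cl)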

However, different HPC setups use different protocols (different operating systems, cluster access structures, etc.), which means the same code may not run across environments.

This reveals major scaling issues:
- Will my code run if I get more cores?
- Will my code run if I have 5 computers?
- Will my code run if I have access to a remote cluster?
- When writing functions, do I need to maintain multiple code bases (one for sequential, one for local parallel, one for clusters…)?

foreach:

To answer these questions, the R community’s response was the foreach package, which gave a better structure to writing parallel code.
Using foreach, developers decide which parts of the code can run in parallel, while users decide whether it runs sequentially or in parallel and specify their preferred parallel back-end using one of the do[…] libraries (doParallel, doSNOW, doMC, doMPI, doFuture).

The structure of a foreach call sits somewhere between a for loop and the apply functions, combining a for-like call with a .combine argument that acts as a reduce function.
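
As a small illustration (run sequentially here with %do%; the .combine argument controls how results are assembled):

library(foreach)
foreach(i = 1:3, .combine = c) %do% {
  i^2
}
## [1] 1 4 9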

Let’s compare the time required to run a for loop sequentially and in parallel when working with heavier datasets.

n = 10000000
com1 = runif(n)
com2 = runif(n)
com3 = runif(n)
coms = list(com1,com2,com3)
names(coms) = c("com1","com2","com3")


system.time(
  for(i in seq_along(coms)){
  di = div_shannon(unlist(coms[i]))
  print(di)
  }
)
## [1] 2499836
## [1] 2500226
## [1] 2500127
##        user      system     elapsed
##       20.87        0.31       21.28
library(foreach)
# make sure no parallel backend is registered (reset foreach's internal state)
env <- foreach:::.foreachGlobals
rm(list = ls(name = env), pos = env)
  
system.time(
  foreach(i = seq_along(coms)) %dopar% {
  di = div_shannon(unlist(coms[i]))
  }
)
## Warning: executing %dopar% sequentially: no parallel backend registered
##        user      system     elapsed
##       19.42        0.22       20.07

Run like this, the code knows it can run in parallel, but since no parallel backend is registered, it runs sequentially.
Also, notice that the default output is a list, as with lapply(), but this can be adjusted using the .combine argument.
Now, let’s register a parallel backend using doParallel.

library(doParallel)
cores = detectCores()
myCluster = parallel::makeCluster(cores)

doParallel::registerDoParallel(cl = myCluster)

system.time(
  foreach(i = seq_along(coms), .combine=rbind) %dopar% {
  di = div_shannon(unlist(coms[i]))
  }
)
##        user      system     elapsed
##        2.45        1.17       17.41
stopCluster(cl = myCluster) #technically not required but recommended

While the gain in elapsed time is modest here (due to the time required to start the workers and send them the data), we still save some computing time.

foreach is now a dependency for hundreds of R packages used for computationally heavy tasks.

Here is a link to a great talk by Bryan Lewis about the foreach philosophy, its use and its contribution to the R community.

The new kid: future

https://github.com/HenrikBengtsson/future

To further simplify the development and use of parallel functions, the future package was created to unify asynchronous, parallel and distributed processing. The big idea behind future is to provide a single unifying solution for parallel APIs: “Write once, run anywhere”.

It is not necessarily meant to replace foreach, but to make its principles more accessible; as such, it can be plugged into classic foreach code using the doFuture backend.

See this talk by Henrik Bengtsson for a more complete presentation of the future package.

To summarize, a future can be presented as 3 building blocks:
1. the future() function, which designates an R expression to run in parallel in the background (without freezing the R session)
2. the resolved() function, which checks whether the future is complete
3. the value() function, which waits for the future to be resolved and returns its result
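
As a minimal illustration of these three blocks (a sketch, assuming a multisession plan):

library(future)
plan(multisession)

f = future({
  Sys.sleep(5)   # stands in for a long computation
  42
})
resolved(f)  # FALSE while the computation runs in the background; the session is not blocked
value(f)     # waits until the future is resolved, then returns 42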

To exemplify this, let’s steal this function from Henrik Bengtsson’s talk to make lapply into a future function

future_lapply = function(X, FUN, ...){
  futures = lapply(X, function(x) future(FUN(x, ...)))
  lapply(futures, value)
}

In other words: use lapply() to create a list of futures (values to be computed in the background), then return their values as a list once they are ready. Workers jump from one future to the next as they become available.
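
Used on our communities, this could look like the following (a sketch, assuming the coms list from earlier and a multisession plan):

library(future)
plan(multisession)
future_lapply(coms, div_shannon)  # same result as lapply(coms, div_shannon), computed in the background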

These functions are already implemented for the apply family in the future.apply package. The same type of package has been written for other parallel syntaxes, so if you already code for parallel computing, you can keep the coding style you like.

If you like using the apply family, use the future.apply library where

lapply(x, function)

becomes

future_lapply(x, function)

If you like the purrr / map family, use the furrr library where

x %>% map(function)

becomes

x %>% future_map(function)

or, if you are familiar with foreach, register your parallel backend with the doFuture library and use %dopar% as usual. As a bonus, you won’t have to use the .export argument to export global variables into your foreach loops.

doFuture::registerDoFuture()
foreach(z = x) %dopar% { some_function(z) }  # some_function() stands in for your own code

Now that your code knows to call future, the user simply has to specify how they want to distribute the processing load using the plan() function at the top of their code.

plan(sequential)
plan(multisession)
plan(cluster,
     workers = c("n1","m2.uni.edu","vm.cloud.org"))
plan(batchtools_slurm) # for HPC job schedulers, via the future.batchtools package


future_lapply(x, function)

In every case: the code remains the same for the developer, and it is the user who is responsible for assigning the available resources.

The future package is notably used in asynchronous shiny apps to prevent the UI from freezing when running heavier processes, and it is also an important part of some pipeline tools such as drake/targets.

4. Benchmarking (optional)

profvis
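
As a minimal sketch (not covered in detail here), profvis can profile where time is spent inside an expression, and system.time() or the microbenchmark package can compare the speed of alternative implementations:

library(profvis)
library(microbenchmark)

# interactive profile of a single expression
profvis({
  com = runif(1e7)
  div_shannon(com)
})

# compare two ways of applying div_shannon to the coms list
microbenchmark(
  loop   = for(i in seq_along(coms)) div_shannon(coms[[i]]),
  sapply = sapply(coms, div_shannon),
  times  = 10
)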

5. Managing pipelines with the targets package

https://github.com/ropensci/targets

Now, what if you have a very complex pipeline with numerous steps and dependencies that each require multiple adjustments for calibration, as in Bayesian data analysis, deep/machine learning, statistical genomics, etc.?

targets was built on the premise that R currently falls behind in the pipeline community, mainly because of its different culture: working in a single file (or a few files) and running it from top to bottom, versus building an ecosystem of functions and compiling it.

Running your functions in parallel might win you some time, but the biggest time sink in complex computing problems is often having to re-run code over and over while making sure your changes are tracked everywhere, leading to reproducibility issues and headaches.

e.g. re-calibrating a training set, changing a prior, etc.

In those cases, to favor performance and reproducibility, it might be worth using a workflow manager such as targets, which can feel very different from the literate programming style often used in Rmarkdown.

The strength of this approach is that it de-clutters complex scripts while mapping the dependencies between each function, tracking what does and does not need to be run depending on changes made to the pipeline.

Mapping the dependencies between functions creates multiple branches through the workflow, which can be run in parallel by multiple workers using future. This is especially useful if you have access to multiple clusters, or if some of your processing does not allow within-function parallelism.

For a deeper dive into the philosophy of targets, see this talk by Will Landau presenting the targets library, where he summarises the strengths of the approach as follows:
- scale up the work you need
- skip the work you don’t
- see evidence of reproducibility

By building relevant targets based on functions, you will:
- define relationships among inputs and outputs
- help targets decide which tasks to run or skip
- break down complicated ideas into manageable pieces
- make the code easier to read

So how do we build a target pipeline?

Structure: The structure is very flexible; the only real requirement is a _targets.R script or a Target Markdown document. You then need to define your functions, which can be done in a separate script to prevent clutter, and finally tell targets which libraries are needed and give it a list of targets.
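
As a rough sketch (the file and function names here are hypothetical), a minimal _targets.R typically looks like this:

# _targets.R
library(targets)
source("functions.R")                                # custom functions kept in a separate script
tar_option_set(packages = c("tidyverse"))            # packages needed by the targets

list(
  tar_target(raw_data, read.csv("data.csv")),        # hypothetical input
  tar_target(clean_data, clean_function(raw_data))   # hypothetical cleaning step
)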

Let’s build a pipeline to investigate the penguins dataset from the palmerpenguins library.

First, create a new R script called _targets.R and load the targets library. If you want to visualize your pipeline, load visNetwork and if you want to use future, also include future and set your plan() function.

library(targets)
## Warning: package 'targets' was built under R version 4.1.2
library(visNetwork)
## Warning: package 'visNetwork' was built under R version 4.1.2
library(future)
## Warning: package 'future' was built under R version 4.1.2
plan(multisession, workers = 2)

Then, either define your custom functions or load them from a separate script (e.g. functions.R) using the source() function.

# clean data
# note: the argument names (sp, isl) differ from the column names so that
# filter() compares the data columns to the function arguments
data_clean = function(x,
                      sp = NULL,
                      isl = NULL){
  
  if(!identical(x,penguins)){
    return("The only acceptable input is penguins")
  }
  
  res = x %>%
    drop_na()
  
  if(!is.null(sp)){
    if(sp %in% c("Adelie","Chinstrap","Gentoo")){
      res = res %>%
        filter(species == sp)
    }else{
      return("Accepted species are Adelie, Chinstrap and Gentoo")
    }
  }
  
  if(!is.null(isl)){
    if(isl %in% c("Biscoe","Dream","Torgersen")){
      res = res %>% 
        filter(island == isl)
    }else{
      return("Accepted islands are Biscoe, Dream and Torgersen")
    }
  }
  
  return(res)
}
## Establish _targets.R and _targets_r/globals/function1.R.

Tell targets which libraries are required to run the pipeline

tar_option_set(packages = c("tidyverse","palmerpenguins","future","furrr"))
## Establish _targets.R and _targets_r/globals/options.R.

Now define your targets within a list.
To create a target, use the tar_target() function, where the first argument is the name of the target and the second is the R expression (usually a function call) used to build it. Notice that you simply refer to a prior target by name inside the expression for the next one; this is how targets detects dependencies. This list should be the final element of the _targets.R file.

list(
  tar_target(data, palmerpenguins::penguins),
  tar_target(clean, data_clean(data))
)
## Establish _targets.R and _targets_r/targets/list_def.R.

After saving this file, you should be able to see your pipeline by running tar_visnetwork() in the R console (do not write it in your _targets.R file). You will see that targets automatically detects the dependencies in your pipeline and maps them accordingly.

tar_visnetwork()
## Warning message:
## package 'targets' was built under R version 4.1.2 

Then, to compile the pipeline, simply use tar_make(): it will run all required targets and skip the ones that are already up to date.

tar_make()
## * start target data
## * built target data
## * start target clean
## * built target clean
## * end pipeline
## Warning messages:
## 1: package 'targets' was built under R version 4.1.2
## 2: package 'palmerpenguins' was built under R version 4.1.2
## 3: package 'future' was built under R version 4.1.2
## 4: package 'furrr' was built under R version 4.1.2
## 5: 1 targets produced warnings. Run tar_meta(fields = warnings) for the messages. 

Now, you can visualize your network again to check that the whole pipeline ran without issues.

tar_visnetwork()
## Warning message:
## package 'targets' was built under R version 4.1.2 

When targets are correctly executed, their results can be loaded as regular R objects using the tar_read() function.

tar_read(clean)
## # A tibble: 333 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
## # ... with 323 more rows, and 2 more variables: sex <fct>, year <int>

Let’s add a few steps to the pipeline to make it more complex.
Let’s look at how body mass influences bill length for the different species and islands in the dataset. We will make a separate pipeline branch for two of the species and produce a figure from each branch. All branches will be run in parallel using the tar_make_future() function.

First, let’s build our plot function and add it to our functions script.

tar_option_set(packages = c("tidyverse","palmerpenguins","future","furrr"))
## Establish _targets.R and _targets_r/globals/options2.R.
# clean data
data_clean = function(x,
                      sp = NULL,
                      isl = NULL){
  
  if(!identical(x,penguins)){
    return("The only acceptable input is penguins")
  }
  
  res = x %>%
    drop_na()
  
  if(!is.null(sp)){
     if(sp %in% c("Adelie","Chinstrap","Gentoo")){
       res = res %>%
        filter(species == sp)
       }else{
         return("Accepted species are Adelie, Chinstrap and Gentoo")
       }
    }
  
  if(!is.null(isl)){
    if(isl %in% c("Biscoe","Dream","Torgersen")){
      res = res %>% 
        filter(island == isl)
    }else{
        return("Accepted islands are Biscoe, Dream and Torgersen")
      }
  }
  
  return(res)
}

# plot data
plot_species = function(x){
  x %>% 
    ggplot(aes(x=body_mass_g,
               y=bill_length_mm,
               col=island)) +
            geom_point()+
            geom_smooth(method="lm",se=F)+
            theme_minimal()+
            ggtitle(paste(unique(x$species), collapse = " "))
          
}
## Establish _targets.R and _targets_r/globals/function2.R.

Then adjust our pipeline

list(
  tar_target(data, palmerpenguins::penguins),
  
  tar_target(adelie, data_clean(data,
                                sp="Adelie")),
  
  tar_target(chinstrap, data_clean(data,
                                   sp="Chinstrap")),
  
  tar_target(adelie_plot, plot_species(adelie)),
  
  tar_target(chinstrap_plot, plot_species(chinstrap))
)
## Establish _targets.R and _targets_r/targets/list_def2.R.
tar_visnetwork()
## Warning message:
## package 'targets' was built under R version 4.1.2 
tar_make_future()
## v skip target data
## * start target adelie
## * built target adelie
## * start target adelie_plot
## * built target adelie_plot
## * start target chinstrap
## * built target chinstrap
## * start target chinstrap_plot
## * built target chinstrap_plot
## * end pipeline
## Warning messages:
## 1: package 'targets' was built under R version 4.1.2
## 2: package 'palmerpenguins' was built under R version 4.1.2
## 3: package 'future' was built under R version 4.1.2
## 4: package 'furrr' was built under R version 4.1.2
## 5: 1 targets produced warnings. Run tar_meta(fields = warnings) for the messages. 

We can now look at our targets and start visualising data.

tar_read(adelie_plot)
## `geom_smooth()` using formula 'y ~ x'

tar_read(chinstrap_plot)
## `geom_smooth()` using formula 'y ~ x'

targets is also compatible with within-function parallelism, so we can build functions that use future as seen previously, and the workers will distribute themselves between tasks and branches to minimize downtime in the code.

Let’s add a new function that calculates the linear coefficient between bill length and body mass for our species, using within-function parallelism.

tar_option_set(packages = c("tidyverse","palmerpenguins","future","furrr"))
## Establish _targets.R and _targets_r/globals/options3.R.
# clean data
data_clean = function(x,
                      sp = NULL,
                      isl = NULL){
  
  if(!identical(x,penguins)){
    return("The only acceptable input is penguins")
  }
  
  res = x %>%
    drop_na()
  
  if(!is.null(sp)){
     if(sp %in% c("Adelie","Chinstrap","Gentoo")){
       res = res %>%
        filter(species == sp)
       }else{
         return("Accepted species are Adelie, Chinstrap and Gentoo")
       }
    }
  
  if(!is.null(isl)){
    if(isl %in% c("Biscoe","Dream","Torgersen")){
      res = res %>% 
        filter(island == isl)
    }else{
        return("Accepted islands are Biscoe, Dream and Torgersen")
      }
  }
  
  return(res)
}

# plot data
plot_species = function(x){
  x %>% 
    ggplot(aes(x=body_mass_g,
               y=bill_length_mm,
               col=island)) +
            geom_point()+
            geom_smooth(method="lm",se=F)+
            theme_minimal()+
            ggtitle(paste(unique(x$species), collapse = " "))
          
}

# model data

mass2bill_coef = function(...){
  coefs = future_map(list(...),function(X){
    fit = lm(bill_length_mm~body_mass_g, data=X)
    return(data.frame(sp = X$species[1], coef = fit$coefficients[2]))
  })
  return(coefs)
}
## Establish _targets.R and _targets_r/globals/function3.R.

Then adjust our pipeline

list(
  tar_target(data, palmerpenguins::penguins),
  
  tar_target(adelie, data_clean(data,
                                sp="Adelie")),
  
  tar_target(chinstrap, data_clean(data,
                                   sp="Chinstrap")),
  
  tar_target(adelie_plot, plot_species(adelie)),
  
  tar_target(chinstrap_plot, plot_species(chinstrap)),
  
  tar_target(mass2bill, mass2bill_coef(adelie,chinstrap))
)
## Establish _targets.R and _targets_r/targets/list_def3.R.

We should now see that our new target depends on 2 prior targets, so changing anything in their definitions will force the coefficients to be re-estimated.

tar_visnetwork()
## Warning message:
## package 'targets' was built under R version 4.1.2 
tar_make_future()
## v skip target data
## v skip target adelie
## v skip target adelie_plot
## v skip target chinstrap
## v skip target chinstrap_plot
## * start target mass2bill
## * built target mass2bill
## * end pipeline
## Warning messages:
## 1: package 'targets' was built under R version 4.1.2
## 2: package 'palmerpenguins' was built under R version 4.1.2
## 3: package 'future' was built under R version 4.1.2
## 4: package 'furrr' was built under R version 4.1.2
## 5: 1 targets produced warnings. Run tar_meta(fields = warnings) for the messages. 
tar_read(mass2bill)
## [[1]]
##                 sp        coef
## body_mass_g Adelie 0.003159889
##
## [[2]]
##                    sp        coef
## body_mass_g Chinstrap 0.004462694

Because targets understands dependencies among functions, it is easier to prove that all outputs are synchronized with the latest version of your functions, parameters, etc.

This can be a major asset for reproducibility, as it provides proof that the data really goes from point A to point B, whereas literate programming such as RMarkdown is more prone to errors in complex iterative coding exercises.

Sadly, every tool has its downsides.

Most of us wouldn’t consider ourselves programmers or developers… we are environmental scientists. Is it really reasonable to write almost a complete package’s worth of functions every time we do something?

There is a reason why we often use R the way we do, so it is really your responsibility to question whether or not this approach would benefit your work.

Still, to make targets easier to adopt, the developers behind the project are working on what they call the R Targetopia: a group of extensions to targets that make large pipelines easier to deploy, adapting targets to other packages like Rstan through stantargets, or even providing target factories that allow users to deploy massive pipelines as single targets nesting multiple others.

These approaches are especially efficient if you, your group or your community plan on using similar pipelines often, recycling most of the work and saving huge amounts of time.

Finally, recent targets development includes the Target Markdown extension, which brings the targets universe into the literate style of rmarkdown, letting you narrate pipeline construction and interact with pipelines in a notebook interface. Target Markdown was used to construct this document. Specific information can be found here

[Optional section about progressr / beepr]

progressr: a unifying API for progress updates that works with futures, purrr, lapply, foreach, and for/while loops

It offers easy functions for developers and end users and works with shiny; beepr adds sounds to progress updates.

library(progressr)
library(future)
library(future.apply)

my_function = function(x){
  p = progressor(along = x)                    # one progress step per element of x
  y = future_lapply(x, function(z){
    p(paste0("z=", z, " by ", Sys.getpid()))   # signal progress from the worker
    some_function(z)                           # some_function() is a placeholder
  })
  sum(unlist(y))
}

handlers("progress", "beepr")
plan(multisession)
with_progress(y <- my_function(x))