-
A nested map can provide access to both sets of values:

``` r
library(targets)
library(tarchetypes)
library(dplyr)
library(tidyr)
library(tibble)
values <- tribble(
  ~study, ~batch,
  "study1", "batch1",
  "study1", "batch2",
  "study1", "batch3",
  "study1", "batch4",
  "study1", "batch5",
  "study1", "batch6",
  "study1", "batch7",
  "study2", "batch1",
  "study2", "batch2",
  "study3", "batch1"
)
tar_map(
  values,
  tar_target(
    name = x,
    command = paste(study, batch)
  ),
  tar_map(
    values = list(mode = c("full", "downsampled")),
    tar_target(
      y,
      list(
        x = x,
        study = study,
        batch = batch
      )
    )
  )
)
```
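The nesting effectively crosses the study/batch rows with the two modes. The resulting combinations can be previewed with base R (a sketch of the cross product, not output from tar_map() itself):

``` r
# Preview of the combinations the nested tar_map() produces: each
# study/batch row crossed with each mode.
values <- data.frame(
  study = rep(c("study1", "study2", "study3"), times = c(7, 2, 1)),
  batch = c(paste0("batch", 1:7), paste0("batch", 1:2), "batch1")
)
# merge() with no common columns performs a Cartesian product in base R.
combos <- merge(values, data.frame(mode = c("full", "downsampled")))
nrow(combos)  # 10 rows x 2 modes = 20
```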
You can handle this logic by setting …
-
Dear Will, thanks once more. I got quite far combining the full and downsampled modes in a nested tar_map() as discussed before, but the problem with that was my inability to specify resources separately for both modes. In my use case I used a resources.yaml file specifying resources for each of the steps (separately for each mode, as full mode requires more resources). Although the 'values' of tar_map() can be accessed within the command option of a tar_target(), they do not seem to be available to the resources option. E.g. assume a resources.yaml resulting in the following object:
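The object itself did not survive in this thread. As a hedged sketch only, a per-mode resources.yaml might parse into a nested list along these lines (every step name and number below is an assumption, not the original file):

``` r
# Hypothetical shape of the parsed resources.yaml (e.g. via yaml::read_yaml());
# all names and numbers here are assumptions.
resources_raw <- list(
  full        = list(memory_gb = 64, cores = 16),
  downsampled = list(memory_gb = 8,  cores = 2)
)
```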
I defined a function to set resources on our HPC system:
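That function is not reproduced in the thread either. A minimal stand-in might look like this (the name set_resources() and its arguments are hypothetical; on the real HPC system it would return tar_resources(clustermq = tar_resources_clustermq(...)), but a plain list is returned here so the sketch runs without {targets}):

``` r
# Hypothetical helper: look up the per-mode entry from the parsed
# resources.yaml. A plain list stands in for the tar_resources() object
# the real function would build.
set_resources <- function(mode, spec) {
  entry <- spec[[mode]]
  if (is.null(entry)) stop("no resources defined for mode: ", mode)
  list(template = entry)
}
```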
I then combined the tar_map() calls as suggested before (a preprocessing step a shared between modes, and a step b separate per mode), and tried to specify resources for each of the modes separately.
This gives an error. Sorry to keep bugging you with this; it would be really helpful to me to sort out. I also hope it makes sense that there are several options to solve this (as said, sequential maps, or this nested approach, especially if I am able to define per-mode resources).
-
The bit about resources helps, thanks for explaining. If there are few targets in each of the full and downsampled modes, you could use a single tar_map():

``` r
tar_map(
  values,
  tar_target(
    name = a,
    command = list(study, batch)
  ),
  tar_target(
    name = b_downsampled,
    command = list("downsampled", a),
    resources = resources[["downsampled"]]
  ),
  tar_target(
    name = b_full,
    command = list("full", a),
    resources = resources[["full"]]
  )
)
```

This could get inconvenient if there are many targets specific to the full and downsampled modes, so one workaround could be an inner tar_eval():

``` r
library(targets)
library(tarchetypes)
library(dplyr)
library(tibble)
values <- tribble(
  ~study, ~batch,
  "study1", "batch1",
  "study1", "batch2",
  "study1", "batch3",
  "study2", "batch1",
  "study2", "batch2",
  "study3", "batch1"
)
resources <- list(
  downsampled = tar_resources(
    clustermq = tar_resources_clustermq(
      template = list(mode = "downsampled")
    )
  ),
  full = tar_resources(
    clustermq = tar_resources_clustermq(
      template = list(mode = "full")
    )
  )
)
tar_map(
  values,
  tar_target(
    name = a,
    command = list(study, batch)
  ),
  tar_eval(
    expr = tar_target(
      name = name,
      command = list(mode, a),
      resources = resources[[mode]]
    ),
    values = tibble(mode = c("downsampled", "full")) %>%
      mutate(name = rlang::syms(mode))
  )
)
```
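For intuition, tar_eval() substitutes each column of `values` into `expr`, symbol by symbol. A base-R approximation of that substitution step (a sketch, not the package's actual implementation):

``` r
# Base-R approximation of the substitution tar_eval() performs: one quoted
# tar_target() call per mode, with `mode` inserted as a string and `name`
# as a bare symbol.
modes <- c("downsampled", "full")
calls <- lapply(modes, function(m) {
  bquote(tar_target(name = .(as.name(m)), command = list(.(m), a)))
})
```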
-
Description
I have a pipeline I would like to run over three studies ('study1', 'study2' and 'study3') with different numbers of batches (batches 1-7 for study1, batches 1-2 for study2, and batch 1 for study3).
I am able to create a small tibble, including study and batch as columns, to tar_map() over. So far so good.
After a couple of preprocessing steps, I would like to split the pipeline further, additionally looping over a third variable (mode, which can be 'full' or 'downsampled'). I do not want to repeat the steps in 'mapped' (which are shared between modes).
A nested map (#132) does not seem to solve this issue, as I need to be able to access the values in 'values' ('study' and 'batch'), in addition to 'new' values in 'mode'.
Would a sequential tar_map work?
In addition, steps are largely shared between studies, but some steps are different. Is it possible to include if() statements within the target list based on the values I map over (e.g. if (study == "study1") { tar_target(...) })? This currently results in the error 'object study not found'.
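The error arises because `study` is only a substitution value inside tar_map(), not a variable in scope when the target list is built. One workaround (my assumption, not advice from the thread) is to filter the values data frame itself and give study-specific targets their own tar_map() call, sketched in base R:

``` r
# `study` exists only as a column used for substitution, so
# if (study == "study1") fails at pipeline-definition time. Instead,
# subset `values` and pass the filtered rows to a second tar_map().
values <- data.frame(
  study = c("study1", "study1", "study2", "study3"),
  batch = c("batch1", "batch2", "batch1", "batch1")
)
values_study1 <- values[values$study == "study1", ]  # rows for study1-only targets
```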
I guess an alternative strategy would be to run the first steps ('mapped' in the example above) and define a pipeline for each study separately.
Any advice would be very welcome.