Blog post: Moving workflows from single files to collections - a case study #2050

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
115 changes: 115 additions & 0 deletions content/news/2023-06-from-single-files-to-collections/index.md
---
title: "Moving workflows from single files to collections - a case study"
date: "2023-06-20"
tease: "Allowing a complex workflow to be used on multiple datasets using collections."
hide_tease: false
authors: 'Paul Zierep, Engy Nasr'
authors_structured:
- github: paulzierep
- github: EngyNasr
tags: [EU]
subsites: [all-eu]
main_subsite: eu
---

Collections are a great way to bundle multiple datasets into a single entity (as shown in the history) that can be
processed collectively. In fact, once the number of datasets grows to 1000+, using collections becomes essential.
Galaxy can also use collections with tools that are not specifically designed to process
collections, via the mapping-over strategy (the tool is run once for each element of the collection).
Therefore, it should be a piece of cake to port complete workflows that
were based on processing single files to collections as well.
However, when applying this idea to our latest metagenomics workflow, [Foodborne Pathogen Detection](https://training.galaxyproject.org/training-material/topics/metagenomics/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html), we encountered some problems
that arise when switching from single files to collections.
In the following we would like to present some of those issues and how we solved them, in the hope that these strategies can help
others to port their workflows to collections.

> **Reviewer (Member):** if you also want to write them as FAQs in the GTN it could be useful, just sort of "how to do X" type FAQs, then we can easily link users to them when they encounter those issues later
>
> **Author:** when the post is finalized, we will condense this into an FAQ, thanks for the idea!
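The mapping-over behaviour can be pictured with a small, purely illustrative Python sketch (this is an analogy, not Galaxy code; all names here are made up):

```python
# Purely illustrative analogy of Galaxy's mapping-over strategy:
# a "tool" written for a single dataset is applied once per element
# of a collection, and the element identifiers are preserved.

def single_file_tool(dataset: str) -> str:
    """Stand-in for any tool that processes one dataset."""
    return dataset.upper()

# A collection: element identifiers mapped to dataset contents.
collection = {"sample_A": "acgt", "sample_B": "ttga"}

# Mapping over the collection yields a new collection with the
# same identifiers and one output per input element.
mapped = {ident: single_file_tool(data) for ident, data in collection.items()}
print(mapped)  # {'sample_A': 'ACGT', 'sample_B': 'TTGA'}
```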

# Case 1 - Simple inputs for workflow logic

It is often useful to add simple inputs to a workflow, such as integers or text, to specify particular tool parameters. Galaxy can also use the output of a tool as an input
parameter for another tool. Details are described in the tutorial [Using Workflow Parameters](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/workflow-parameters/tutorial.html). In the case of the `single file` Foodborne Pathogen Detection workflow, a text input `Sample ID` is used downstream
as input by multiple tools.

* `TODO Image of the single file workflow`
* `TODO Link to WF in IWC`

This could not be transformed into collection logic in a straightforward way, since the input in this case would need to be a list with matching `Sample IDs`.
The problem was solved by using the names of the collection elements as `Sample IDs` and transforming them into workflow parameters using the following
set of tools: `TODO describe`.

* `TODO Image of the collection workflow`
* `TODO Link to WF in IWC`

In case the `Sample IDs` do not match the element identifiers, a list with matching IDs could
be provided by the user, which is then processed similarly to the approach described above.
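Conceptually, the fix replaces the free-text `Sample ID` input with the identifiers the collection already carries. A hypothetical sketch (accessions and mapping invented for illustration):

```python
# Hypothetical illustration: element identifiers double as Sample IDs,
# so no separate text input is needed when moving to collections.
collection = {"ERR044595": "reads...", "ERR044596": "reads..."}

# Derive the per-element parameter from the identifier itself.
sample_ids = {ident: ident for ident in collection}

# If the desired Sample IDs differ from the element identifiers,
# the user can instead supply a matching mapping (the fallback keeps
# the identifier itself when no mapping entry exists).
id_mapping = {"ERR044595": "chicken_1", "ERR044596": "chicken_2"}
sample_ids = {ident: id_mapping.get(ident, ident) for ident in collection}
print(sample_ids)  # {'ERR044595': 'chicken_1', 'ERR044596': 'chicken_2'}
```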

# Case 2 - Failing or empty elements in a collection

Even if a workflow is well designed, in some cases it can happen that only a few elements of a collection fail. This happened to us rather randomly with Kraken2, since
it requires a large amount of memory (>70 GB), which was not assigned to every run of the tool by the server. That issue was solved by increasing the minimum memory required by the tool on the EU server (`TODO how was that done`). But there are various other scenarios where the failure of a tool can be attributed, e.g., to specific input data. In other cases only a few elements of a collection are empty (e.g. if an assembly cannot be made due to non-overlapping reads).

If an element of a collection fails or is empty, all downstream processing stops, which can be rather annoying if one wants to process a large amount of data and gets stuck because
of a few elements. Two solutions are proposed to handle such cases.

## Intermediate workflow specific solution
> **Reviewer (Member):** I do want to point out though that you could have taken your single file workflow and mapped it over a collection, and I think that would have solved all of the issues mentioned in this section. The workarounds here are nice, but they make the problem appear much more complicated than it should be. I would also appreciate a qualifier saying that you can ask these sort of questions in the IWC channel.

Collections can be filtered for failed or empty datasets using collection tools such as [filter empty datasets](https://usegalaxy.eu/?tool_id=__FILTER_EMPTY_DATASETS__&version=1.0.0)
and [filter failed datasets](https://usegalaxy.eu/?tool_id=__FILTER_FAILED_DATASETS__&version=1.0.0).

<div class="center">
<div class="img-sizer" style="width: 100%">

![Empty collection](./figs/empty_collection.png)

</div>
<figcaption>
Filter empty collection elements
</figcaption>
</div>

<div class="center">
<div class="img-sizer" style="width: 100%">

![Failed collection](./figs/failed_collection.png)

</div>
<figcaption>
Filter failed collection elements
</figcaption>
</div>
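What the two filter tools do can be pictured with a small Python analogy (illustrative only, not how Galaxy implements them):

```python
# Illustrative analogy of the filter tools: drop failed and empty
# elements from a collection before downstream processing continues.
collection = {
    "s1": {"state": "ok",    "content": "assembly..."},
    "s2": {"state": "error", "content": ""},  # failed element
    "s3": {"state": "ok",    "content": ""},  # empty element
}

# "Filter failed datasets": keep only elements that ran successfully.
non_failed = {k: v for k, v in collection.items() if v["state"] == "ok"}

# "Filter empty datasets": keep only elements with non-empty content.
filtered = {k: v for k, v in non_failed.items() if v["content"]}
print(sorted(filtered))  # ['s1']
```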


Although this can solve the issue immediately in some cases, further considerations need to be made.
First, one often cannot know in advance at which step the collection will be affected.
To cover all cases, one would need to add the filter steps for every produced collection, which would increase the number of workflow steps unreasonably.
Secondly, the filtering changes the size of the collection. If downstream tools depend on a specific collection size, which is always the case if a tool
takes two or more collections as input, those tools will also fail. This is essentially a follow-up problem of the first issue.
It can still be solved by an intermediate step where the second collection is reindexed using the same element identifiers as the collection with
missing elements.

<div class="center">
<div class="img-sizer" style="width: 100%">

![Empty collection](./figs/reindex_collection.png)

</div>
<figcaption>
Proposed solution to reindex second collection using elements of the first collection
</figcaption>
</div>
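The reindexing step can be sketched as follows (a conceptual analogy; in Galaxy this is done with collection tools, not Python):

```python
# Conceptual sketch of reindexing: after filtering, collection_a has
# lost element "s2"; collection_b must be reduced to the same set of
# element identifiers, in the same order, before a tool that takes
# both collections as input can consume them together.
collection_a = {"s1": "fwd_reads_1", "s3": "fwd_reads_3"}  # after filtering
collection_b = {"s1": "rev_reads_1", "s2": "rev_reads_2", "s3": "rev_reads_3"}

# Keep only the elements of collection_b whose identifiers survived in
# collection_a, preserving collection_a's order.
reindexed_b = {ident: collection_b[ident] for ident in collection_a}
print(reindexed_b)  # {'s1': 'rev_reads_1', 's3': 'rev_reads_3'}
```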

However, if this logic needs to be applied to multiple collections, or repeatedly for multiple steps, the workflow becomes even larger and unreasonably complex.
> **Reviewer (Member):** You can also run the workflow again and enable the job cache, this might be the easiest and most clean solution.
>
> **Author:** @mvdbeek do you mean to run the workflow again without the failed elements? Then, if another collection fails, run the complete WF again and so on? How do you enable the job cache and what does it do?
>
> **Reviewer (@mvdbeek):** No, just run the whole workflow again. Everything that had already been run successfully will be skipped. If enabled by the admin, go to User > Preferences -> Manage Information. There were some additional fixes in 23.1 and usegalaxy.eu isn't running this yet. If you want to test this you can try on usegalaxy.org.
>
> **Author:** Thank you for this info, cached jobs are indeed really cool, but I don't really see how they can solve the current issue. The same elements will still fail again. Only in the case of random failures could this help, which should not be the case anyway. Therefore, an additional step would still be required, by:
> * Filtering the elements in the collection (-> reindexing problem)
> * Running the WF again without the datasets that fail (-> laborious)
> * Working on the tool to overcome the issue
>
> Or am I missing something?
>
> **Reviewer (@mvdbeek, Jun 28, 2023):** Your example said
> > Even if a workflow is well designed, in some cases it can happen that only a few elements of a collection fail. This happened to us rather randomly with Kraken2, since it requires a large amount of memory (>70 GB), which was not assigned to every run of the tool by the server.
>
> so it should help with that?
>
> **Author:** Ah ok, sorry, I thought you referred to the complete paragraph. For that particular example your solution can help, although since it's random, one could get some failed elements in the new run as well and ultimately would need to repeat the workflow multiple times, which is also not ideal.
>
> **Reviewer (@mvdbeek, Jun 28, 2023):** No, those will be picked up by the cache. And it's generally a good thing if you have a single complete invocation without partially failed items.


## Global tool dependent solution

Since this issue cannot be solved satisfactorily at the workflow level, one can still aim to improve the situation for the community by solving it at the tool level.
In general, the aim should be that the tool neither fails nor produces empty output. This is much more work but will ultimately benefit all users.
Two things need to be considered here.

* Why does the tool fail, and can that be fixed? In our case we had to increase the memory assigned to Kraken2 on the server. In other cases it could be necessary to inspect the tool wrapper
and find individual solutions.
* Why does the tool produce empty elements? If the tool produces empty elements when it actually ran successfully but found nothing (e.g. a detection tool that detects nothing), it
might be as simple as making the tool emit an empty table or FASTA file, to allow downstream tools to keep working.
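For the second point, the kind of wrapper-level fix we mean can be sketched like this (a hypothetical Python helper with an invented file name and header; a real Galaxy wrapper would do the equivalent in the tool's command block):

```python
# Hypothetical sketch: after the underlying tool has run, make sure the
# declared output exists and carries at least a header line, so that
# downstream tools receive a valid (if empty) table instead of failing
# on an empty dataset.
import os

def ensure_nonempty_table(path: str, header: str = "read_id\ttaxon\tscore") -> None:
    """Write a header-only table if the tool produced no output."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        with open(path, "w") as fh:
            fh.write(header + "\n")

# Example: a detection tool found nothing and left an empty file behind.
open("hits.tsv", "w").close()
ensure_nonempty_table("hits.tsv")
print(open("hits.tsv").read())  # 'read_id\ttaxon\tscore\n'
```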

# Case 3 - Collection workflow logic does not fully comply with single file logic

`TODO explain unzip problem, how it was solved`
> **Reviewer (Member):** That sounds like a bug, would really welcome a reproduction here.
>
> **Author:** we think so too, and also double-checked; we will write an issue, but I think it's still a good finding for the blog post, and we can link the issue here. The idea is to motivate users to write issues for such cases.
>
> **Author:** we will write a PR for that, but for the blog post I think it's enough to state that it is being investigated.