Blog post: Moving workflows from single files to collections - a case study #2050

Open · wants to merge 11 commits into base: master
Conversation

paulzierep (Contributor)

Collections in workflows pitfalls and strategies from foodborne pathogen detection

… were based on processing single files to use collections as well.
However, when applying this idea to our latest metagenomics workflow [Foodborne Pathogen Detection](https://training.galaxyproject.org/training-material/topics/metagenomics/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html), we encountered some problems that arise when switching from single files to collections.
In the following, we would like to present some of those issues and how we solved them, in the hope that these strategies can help others.
Member

If you also want to write them as FAQs in the GTN, that could be useful: just sort of "how to do X" type FAQs. Then we can easily link users to them when they encounter those issues later.

Contributor Author

When the post is finalized, we will condense it into a FAQ, thanks for the idea!

@paulzierep changed the title from init to Collections in workflows pitfalls and strategies from foodborne pathogen detection, Jun 21, 2023
@paulzierep changed the title from Collections in workflows pitfalls and strategies from foodborne pathogen detection to Plog post: Moving workflows from single files to collections - a case study, Jun 21, 2023

However, if this logic needs to be applied to multiple collections, or again for multiple steps, the workflow becomes unreasonably large and complex.
Member

You can also run the workflow again with the job cache enabled; this might be the easiest and cleanest solution.

Contributor Author

@mvdbeek do you mean to run the workflow again without the failed elements? Then, if another collection fails, run the complete WF again, and so on? How do you enable the job cache, and what does it do?

Member

No, just run the whole workflow again. Everything that has already run successfully will be skipped. If enabled by the admin, go to User > Preferences > Manage Information. There were some additional fixes in 23.1, and usegalaxy.eu isn't running this yet. If you want to test this you can try it on usegalaxy.org.
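
For readers who want to try the job cache programmatically, here is a minimal sketch against the Galaxy API. The instance URL, API key, and all ids are placeholders, and the `use_cached_job` flag is assumed to take effect only where the admin has enabled job caching:

```python
# Minimal sketch: re-invoke a workflow with the job cache enabled, so jobs
# whose tool, inputs, and parameters match a previous successful run are
# skipped instead of recomputed. All ids below are placeholders.
import requests

GALAXY_URL = "https://usegalaxy.org"  # instance with job caching (>= 23.0/23.1)
API_KEY = "..."                       # your personal API key
WORKFLOW_ID = "..."                   # id of the stored workflow
HISTORY_ID = "..."                    # history that holds the inputs

payload = {
    "history_id": HISTORY_ID,
    # map the workflow input index to the input collection (HDCA)
    "inputs": {"0": {"src": "hdca", "id": "..."}},
    "use_cached_job": True,  # reuse equivalent jobs from earlier runs
}
resp = requests.post(
    f"{GALAXY_URL}/api/workflows/{WORKFLOW_ID}/invocations",
    headers={"x-api-key": API_KEY},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["id"])  # invocation id, useful for monitoring
```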

Contributor Author

Thank you for this info, cached jobs are indeed really cool, but I don't really see how they can solve the current issue.
The same elements will still fail again. Only in the case of random failures could this help ... which should not happen in any case.

Therefore, an additional step would still be required, by:

  • Filtering the elements in the collection (-> reindexing problem)
  • Running the WF again without the datasets that fail (-> laborious)
  • Working on the tool to overcome the issue

Or am I missing something?

@mvdbeek (Member), Jun 28, 2023

Your example said:

> Even if a workflow is well designed, in some cases it can happen that only a few elements of a collection fail. This happened to us rather randomly in the case of Kraken2, since it requires large amounts of memory (>70 GB), which were not assigned to every run of the tool by the server.

so it should help with that?

Contributor Author

Ah ok, sorry, I thought you referred to the complete paragraph. For that particular example your solution can help, although since the failure is random, one could get some failed elements in the new run as well and would ultimately need to repeat the workflow multiple times, which is also not ideal.

@mvdbeek (Member), Jun 28, 2023

No, those will be picked up by the cache. And it's generally a good thing if you have a single complete invocation without partially failed items.


# Case 3 - Collection workflow logic does not fully comply with single file logic

`TODO explain unzip problem, how it was solved`
Member

That sounds like a bug, would really welcome a reproduction here.

Contributor Author

We think so too, and also double-checked. We will write an issue, but I think it's still a good finding for the blog post, and we can link the issue here. The idea is to motivate users to write issues for such cases.

Contributor Author

We will write a PR for that, but for the blog post I think it's enough to state that it is being investigated.

@hexylena changed the title from Plog post: Moving workflows from single files to collections - a case study to Blog post: Moving workflows from single files to collections - a case study, Jun 21, 2023
…ion workflows and explaining the problem of the decompressing
@EngyNasr (Contributor)

I have added some parts in this PR

@paulzierep marked this pull request as ready for review, June 30, 2023 08:40
Even if a workflow is well designed, in some cases it can happen that only a few elements of a collection fail. This happened to us rather randomly in the case of Kraken2, since
it requires large amounts of memory (>70 GB), which were not assigned to every run of the tool by the server. That issue was solved by increasing the minimum memory required by the tool on the EU server. But there are various other scenarios where the failure of the tool can be attributed, for example, to specific input data. In other cases only a few elements of a collection are empty (e.g. if an assembly cannot be made due to non-overlapping reads).

If an element of a collection fails or is empty, the entire downstream processing is stopped, which can be rather annoying if one wants to process a large amount of data and gets stuck due to a few elements. Two solutions are proposed to handle such cases.
Member

This is not correct, generally speaking. As long as you're not reducing the elements, the workflow will proceed until the end.

### Rerun the workflow with rerun activated

In some cases a rerun of the workflow can also solve the issue, for example in our use case, where the failures were rather random.
For this strategy it is beneficial to activate the rerun option of Galaxy (User/Preferences/Manage Information) on instances with version > 23.0.
Member

Is this meant to refer to the job cache? Re-running single elements has been possible for many years.

Contributor Author

Is this not the cache option: "Do you want to be able to re-use equivalent jobs?" That's what I found on .org ... or did you mean another option?

Member

"Do you want to be able to re-use equivalent jobs" -> this is the job cache.
I don't see how that is related to "Rerun the workflow with rerun activated"?


Some of the tools used in the *Foodborne pathogen detection workflow using collections* failed when we ran the workflow but succeeded when run alone, outside a workflow, for example `Krakentools: Extract Kraken Reads By ID` and `Filter Sequence by ID`.

After a bit of investigation we noticed that when these tools run alone (without being in a workflow), or in a workflow with single files, Galaxy performs an implicit decompression of the zipped input files, which channels the correct file to the underlying wrapper tool. However, when the same tools run on a collection in a workflow, this implicit decompression does not take place, which causes the runs to fail.
@mvdbeek (Member), Jun 30, 2023

It's unclear what you're doing there. Implicit converters sure run as part of workflows, using collections or not. Overall I don't think this is a statement that should be published.

Contributor Author

Removed, but still an issue, will raise next week.

However, this initial solution is not optimal, since the data size in the user's history will increase (due to file decompression) when running the workflow. Therefore, we proposed another solution: updating the tool wrappers themselves to perform the decompression internally, without the need to use the `Convert compressed file to uncompressed` tool (https://github.com/galaxyproject/tools-iuc/pull/5360). In fact, the tool did not require an internal decompression step, since it can indeed work with zipped files; it only needed to know that the input is zipped.
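
To illustrate the wrapper-level idea (a hypothetical sketch, not the code from the linked tools-iuc PR): a tool that can read compressed data directly only needs to detect that its input is gzipped, for example by checking the magic bytes:

```python
# Hypothetical sketch of the wrapper-level idea, not the actual fix from
# the linked PR: detect gzip input by its magic bytes and open it
# transparently, so no separate decompression step is required.
import gzip

def open_maybe_gzipped(path):
    """Open a text file whether or not it is gzip-compressed."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")

# works the same for reads.fastq and reads.fastq.gz:
# with open_maybe_gzipped("reads.fastq.gz") as fh:
#     header = fh.readline()
```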

Although the problem could again be solved at the tool wrapper level, the general problem of inconsistent conversion logic between single files and collections is most likely a bug, which is currently being investigated for upcoming Galaxy releases.
Member

Please, please open issues for this when you encounter them, and before writing a blog post. We do have test cases that exercise this, so this is rather surprising.

Contributor Author

I will remove that part and we will do the PR


If an element of a collection fails or is empty, the entire downstream processing is stopped, which can be rather annoying if one wants to process a large amount of data and gets stuck due to a few elements. Two solutions are proposed to handle such cases.

## Intermediate workflow specific solution
Member

I do want to point out, though, that you could have taken your single-file workflow and mapped it over a collection, and I think that would have solved all of the issues mentioned in this section. The workarounds here are nice, but they make the problem appear more complicated than it should be. I would also appreciate a qualifier saying that you can ask these sorts of questions in the IWC channel.

Furthermore, it needs to be considered that filtering empty or failed elements from a collection
can hide the root cause of the problem. The question of why some tool runs fail or produce empty output should always be investigated further.
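
Filtering can also be scripted. Below is a minimal sketch using Galaxy's built-in "Filter failed datasets" collection operation through the API; the tool id `__FILTER_FAILED_DATASETS__` and the `input` parameter name are assumptions based on current Galaxy releases, and all other ids are placeholders:

```python
# Minimal sketch: run the built-in "Filter failed datasets" collection
# operation via the Galaxy API to drop failed elements from a collection.
import requests

GALAXY_URL = "https://usegalaxy.org"
API_KEY = "..."
HISTORY_ID = "..."
COLLECTION_ID = "..."  # id of the collection (HDCA) with failed elements

payload = {
    "history_id": HISTORY_ID,
    "tool_id": "__FILTER_FAILED_DATASETS__",  # assumed built-in tool id
    "inputs": {"input": {"src": "hdca", "id": COLLECTION_ID}},
}
resp = requests.post(
    f"{GALAXY_URL}/api/tools",
    headers={"x-api-key": API_KEY},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["output_collections"][0]["id"])  # the filtered collection
```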

### Rerun the workflow with rerun activated
Member

Suggested change
### Rerun the workflow with rerun activated
### Rerun the workflow with the re-use equivalent jobs option

Is that what you meant? I read this as the normal replace-element-in-collection option.


In some cases a rerun of the workflow can also solve the issue, for example in our use case, where the failures were rather random.
For this strategy it is beneficial to activate the rerun option of Galaxy (User/Preferences/Manage Information) on instances with version > 23.0.
This allows rerunning only the failed elements, which saves computing resources and time.
Member

That's also true if you click on rerun and replace collection element. I think I would say:

This creates a new workflow run where successful outputs are copied from the previous execution instead of being run again, and only failed jobs and jobs depending on the failed outputs will be submitted again.
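
A related sketch for this thread: before re-invoking, one can check how many jobs of an invocation actually failed. This assumes the `jobs_summary` invocation endpoint is available on your instance; all ids are placeholders:

```python
# Minimal sketch: count the failed jobs of a workflow invocation before
# deciding whether to re-invoke with the job cache enabled.
import requests

GALAXY_URL = "https://usegalaxy.org"
API_KEY = "..."
INVOCATION_ID = "..."  # id of the finished workflow invocation

resp = requests.get(
    f"{GALAXY_URL}/api/invocations/{INVOCATION_ID}/jobs_summary",
    headers={"x-api-key": API_KEY},
)
resp.raise_for_status()
states = resp.json().get("states", {})  # e.g. {"ok": 120, "error": 3}
if states.get("error"):
    print(f"{states['error']} job(s) failed; consider a cached re-run")
else:
    print("invocation completed without failed jobs")
```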

@EngyNasr (Contributor) commented Jul 2, 2023

I have created an issue on tools-iuc for the implicit dataset conversion that does not occur when using the tool in a workflow with an input collection of zipped files.

We can now link it in the blog post.

@EngyNasr (Contributor) commented Aug 7, 2023

@paulzierep is it ready to be merged?
