Check InterProScan seqtype #5891

TomHarrop · 2024-03-20T02:36:59Z

FOR CONTRIBUTOR:

I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
License permits unrestricted use (educational + commercial)
This PR adds a new tool or tool collection
This PR updates an existing tool or tool collection
This PR does something else (explain below)

This tool is causing a bit of support work when users input nucleotide sequences but select protein as the sequence type.

This greps for anything that is not an IUPAC nucleotide and checks the result against the selected seqtype.

It won't always be correct because a protein consisting only of amino acid residues that are in the nucleotide alphabet is valid (but probably not common). It could also take a long time to search through valid nucleotide input. I could limit it to the first 1000 lines or something?
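A minimal sketch of the check described above, including the suggested cap on how much input is scanned. The function name `guess_seqtype` and the `MAX_LINES` variable are hypothetical and are not taken from the PR; the actual command section may differ.

```bash
# Sketch only: guess whether a FASTA file holds nucleotide or protein sequence.
# guess_seqtype and MAX_LINES are illustrative names, not part of the wrapper.

MAX_LINES=1000  # scan only the head of the file to bound runtime

guess_seqtype() {
    # $1: path to a FASTA file
    # Drop header lines, then look for any character that is not whitespace,
    # a gap or stop symbol, or an IUPAC nucleotide code (case-insensitive).
    if head -n "$MAX_LINES" "$1" \
        | grep -v '^>' \
        | grep -qi '[^ACGTURYSWKMBDHVN*[:space:]-]'; then
        echo "p"   # found a non-nucleotide character: treat as protein
    else
        echo "n"   # nothing outside the nucleotide alphabet was seen
    fi
}
```

Capping the scan trades accuracy for speed: a protein whose first 1000 lines happen to use only letters from the nucleotide alphabet would still be reported as `n`.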

Ping @igormakunin

TomHarrop · 2024-03-20T02:39:10Z

@neoformit please LMK if you have any comments on the UX or anything else.

neoformit · 2024-03-20T03:12:26Z

Looks good to me. Is it possible to add "detected nucleotide when you selected protein" to the error message, just to be a bit more specific for the user?

TomHarrop · 2024-03-20T04:44:23Z

@neoformit that would be better, but the values of the seqtype variable are "p" and "n" so it makes the command section a bit nastier. See 7a25359 - worth it?

neoformit · 2024-03-20T05:05:39Z

Since you already have the `if grep -q '[^[:space:]]' <<< \${match}; then` line, yeah I think this is worth it. Looks good, thanks Tom!
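For illustration, the more specific message could be built by mapping the one-letter seqtype values to readable names. This is a sketch under assumptions: `seqtype_label` and `check_seqtype` are hypothetical helpers, and the wording in 7a25359 may differ.

```bash
# Sketch only: turn the "p"/"n" seqtype values into a readable error message.
# seqtype_label and check_seqtype are illustrative names, not from the PR.

seqtype_label() {
    case "$1" in
        p) echo "protein" ;;
        n) echo "nucleotide" ;;
        *) echo "unknown" ;;
    esac
}

check_seqtype() {
    # $1: seqtype selected by the user, $2: seqtype detected in the input
    if [ "$1" != "$2" ]; then
        echo "Detected $(seqtype_label "$2") input when you selected $(seqtype_label "$1")." >&2
        return 1
    fi
    return 0
}
```

Calling `check_seqtype p n` writes the message to stderr and returns a non-zero status, which a wrapper could use to stop before the main job starts.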

bgruening · 2024-03-20T07:22:15Z

Thanks! @TomHarrop can you look at the failing lint, please?

How is interproscan failing and how fast? We could also add a stdio, catch this error and provide your better error description.

Some thoughts for the future, because this is now coming up more and more often:

- do we need separate datatypes for nt and prot?
- do we need a library/script that checks the content and can be shared?
- should this be part of the fasta sniffer? (it will have performance implications)

TomHarrop · 2024-03-20T10:30:56Z

Thanks @bgruening. The lint failure is because of InterProScan's non-standard version numbering.

> How is interproscan failing and how fast? We could also add a stdio, catch this error and provide your better error description.

I believe it doesn't fail but creates huge jobs that run for days, and there is a similar problem with blastp. My understanding is that @igormakunin has to cancel jobs manually to keep the queue moving. Hopefully he can comment.

> do we need separate datatypes for nt and prot?

I don't think there is any technical difference in FASTA that we can rely on, e.g. "ATGC" is a valid peptide string.
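To make the overlap concrete, this small sketch removes every IUPAC nucleotide code from the 20 standard one-letter amino-acid codes. The variable names are illustrative only.

```bash
# Sketch only: which amino-acid letters can never occur in nucleotide input?
amino='ACDEFGHIKLMNPQRSTVWY'      # 20 standard amino-acid codes
nucleotide='ACGTURYSWKMBDHVN'     # IUPAC nucleotide codes

# Delete every nucleotide code from the amino-acid alphabet.
distinct=$(printf '%s' "$amino" | tr -d "$nucleotide")
echo "$distinct"   # prints EFILPQ
```

Only six letters are exclusive to protein, so a peptide that contains none of E, F, I, L, P or Q cannot be told apart from nucleotide sequence by its characters alone.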

> do we need a library/script that checks the content and can be shared?

This seems like a good option to me. I could look into that if you point me in the right direction.

> should this be part of the fasta sniffer? (it will have performance implications)

As above, we might not detect them reliably, but I don't know how the sniffer works under the hood.

neoformit · 2024-03-20T22:47:36Z

> do we need a library/script that checks the content and can be shared?

Ages ago I did start developing a Python lib fastkit for that reason: https://github.com/neoformit/fastkit

It has a validator for DNA, protein, sequence count etc. and some common formatting for FASTA files. It could also be implemented for FASTQ, but I haven't done that yet. It writes errors to stderr, which results in user-friendly error messages in Galaxy.

Maybe there are existing libraries but I don't know of one that does this exactly. It wraps BioPython etc. under the hood.

neoformit · 2024-03-20T22:48:36Z

> should this be part of the fasta sniffer? (it will have performance implications)

This seems like a nice solution but the logic for checking and raising error messages still has to be re-implemented in every wrapper, right?

TomHarrop added 5 commits March 20, 2024 11:31

- grep for non-nucleotide characters in the input (0811df7)
- test sequence type check (ecb7838)
- add param to disable check (3f1414d)
- bump (acecf66)
- fix indent (3e40c78)
- add more detailed message (7a25359)
