MAFFT error when running ppanggolin MSA #210

Open
jvfe opened this issue Apr 12, 2024 · 8 comments


jvfe commented Apr 12, 2024

  • PPanGGOLIN v2.0.5

Hi, after creating the pangenome file for ~1600 GFFs with this command:

ppanggolin \
    workflow \
    --cpu 24 \
    --anno ppanggolin_samplesheet.tsv \
    --output ppanggolin

I started running the MSA command:

ppanggolin \
    msa \
    --cpu 24 \
    --pangenome pangenome.h5 \
    --output ppanggolin_msa \
    --partition all

But I'm always getting this MAFFT error:


Command output:
  2024-04-04 12:56:08 utils.py:l168 INFO	Command: /usr/local/bin/ppanggolin msa --cpu 24 --pangenome copied_pangenome.h5 --output ppanggolin_msa --partition all
  2024-04-04 12:56:08 utils.py:l169 INFO	PPanGGOLiN version: 2.0.5
  2024-04-04 12:56:08 utils.py:l722 INFO	3 parameters have a non-default value.
  2024-04-04 12:56:09 readBinaries.py:l94 INFO	Getting the current pangenome status
  2024-04-04 12:56:09 readBinaries.py:l715 INFO	Reading pangenome annotations...
  2024-04-04 13:03:59 readBinaries.py:l722 INFO	Reading pangenome gene dna sequences...
  2024-04-04 13:14:16 readBinaries.py:l730 INFO	Reading pangenome gene families...
  2024-04-04 13:16:38 writeMSA.py:l310 INFO	Doing MSA for all families...
  2024-04-04 13:16:38 writeMSA.py:l203 INFO	Preparing input files for MSA...
  2024-04-04 13:42:51 writeMSA.py:l212 INFO	Computing the MSA ...

Command error:
   20%|██        | 91094/446396 [3:30:14<10:15:22,  9.62family/s]
   20%|██        | 91096/446396 [3:30:16<21:15:50,  4.64family/s]
   20%|██        | 91098/446396 [3:30:16<17:11:41,  5.74family/s]
   20%|██        | 91100/446396 [3:30:16<14:35:28,  6.76family/s]
   20%|██        | 91103/446396 [3:30:16<10:45:47,  9.17family/s]
   20%|██        | 91105/446396 [3:30:16<12:04:14,  8.18family/s]
   20%|██        | 91109/446396 [3:30:17<11:07:37,  8.87family/s]
   20%|██        | 91111/446396 [3:30:17<10:04:28,  9.80family/s]
   20%|██        | 91113/446396 [3:30:17<12:38:48,  7.80family/s]
   20%|██        | 91115/446396 [3:30:18<13:31:27,  7.30family/s]
   20%|██        | 91116/446396 [3:30:18<14:02:16,  7.03family/s]
   20%|██        | 91117/446396 [3:30:18<17:43:26,  5.57family/s]
   20%|██        | 91118/446396 [3:30:18<20:16:36,  4.87family/s]
   20%|██        | 91119/446396 [3:30:18<18:58:24,  5.20family/s]
   20%|██        | 91120/446396 [3:30:19<17:36:03,  5.61family/s]
   20%|██        | 91122/446396 [3:30:19<14:24:48,  6.85family/s]
   20%|██        | 91124/446396 [3:30:19<14:07:42,  6.98family/s]
   20%|██        | 91128/446396 [3:30:19<8:11:13, 12.05family/s] 
   20%|██        | 91130/446396 [3:30:20<10:02:37,  9.83family/s]
   20%|██        | 91131/446396 [3:30:20<13:39:58,  7.22family/s]
  multiprocessing.pool.RemoteTraceback: 
  """
  Traceback (most recent call last):
    File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 125, in worker
      result = (True, func(*args, **kwds))
    File "/usr/local/lib/python3.9/site-packages/ppanggolin/formats/writeMSA.py", line 182, in launch_multi_mafft
      launch_mafft(*args)
    File "/usr/local/lib/python3.9/site-packages/ppanggolin/formats/writeMSA.py", line 172, in launch_mafft
      subprocess.run(cmd, stdout=open(outname, "w"), stderr=subprocess.DEVNULL, check=True)
    File "/usr/local/lib/python3.9/subprocess.py", line 528, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['mafft', '--thread', '1', '/tmp/tmp8ye30tt_/BLONNJ_19495.fasta']' returned non-zero exit status 1.
  """

I couldn't figure out why, since I can't access the MAFFT error directly, so I'm out of ideas at the moment. I had previously managed to run both of these commands on a smaller subset of this dataset (~200 GFFs). I'm using the PPanGGOLiN Docker image from BioContainers, in case that's relevant.

Thanks!

Member

axbazin commented Apr 12, 2024

Hi,

Thank you for opening this issue. PPanGGOLiN's error reporting for this specific case can definitely be improved.

To find out what is happening, ideally I'd run this command again and check the content of "/tmp/tmp8ye30tt_/BLONNJ_19495.fasta", and rerun mafft on it outside of PPanGGOLiN, but I don't remember if it's easy to keep the tmp files as I have not used "msa" in a while now.

I assume there is something odd with this family, BLONNJ_19495. Is there something different or strange about it?
Alternatively, if you can't do what I suggested above, here are some things you can check:

  • How many members does it have? (The number of lines with this ID in gene_families.tsv should give that answer.)
  • Is it a multigenic family? (That is, does it often have more than one gene per genome in your pangenome? I think mean_persistent_duplication.tsv can tell you, if this is a persistent family.)
  • Is it mostly made of fragments? (The number of lines with this ID and an F in the third column of gene_families.tsv should give that answer.)
  • Are there any unexpected characters in the FASTA sequences of its genes?
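The first and third checks above can be scripted. The following is a minimal sketch, assuming gene_families.tsv is tab-separated with no header row and has the columns family ID, gene ID, and an optional F flag, as in the rows quoted later in this thread. The function name `summarize_family` is hypothetical and not part of PPanGGOLiN.

```python
import csv
import io


def summarize_family(tsv_lines, family_id):
    """Return (member_count, fragment_count) for one family ID.

    Assumes tab-separated rows: family ID, gene ID, optional "F" flag.
    """
    members = 0
    fragments = 0
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if not row or row[0] != family_id:
            continue
        members += 1
        # The third column is assumed to hold "F" for fragmented genes.
        if len(row) > 2 and row[2].strip() == "F":
            fragments += 1
    return members, fragments


# Rows shaped like the gene_families.tsv excerpt posted in this thread.
sample = io.StringIO(
    "ELALCC_40030\tNOFIGB_22955\tF\n"
    "ELALCC_40030\tELALCC_40030\t\n"
)
print(summarize_family(sample, "ELALCC_40030"))  # prints (2, 1)
```

For a real file, pass an open file handle instead of the in-memory sample, for example `summarize_family(open("gene_families.tsv", newline=""), "ELALCC_40030")`.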

Adelme

Author

jvfe commented Apr 15, 2024

Hi, so I ended up removing the genome that had 'BLONNJ_19495' and tried running it again. Unfortunately I overwrote the old results, but got the exact same error, this time for a different ID, 'ELALCC_40030'. So I'll answer your questions based on this last run, since it should point to the same issue.

I can't access ELALCC_40030.fasta directly since it ran in a tmp directory under a docker container, but I can give you the gff3 file it came from (It's in contig 76, see below).
SAMEA2267045.gff3.txt

  • 2 lines with this ID in gene_families.tsv, one of them does contain F:
ELALCC_40030    NOFIGB_22955    F
ELALCC_40030    ELALCC_40030
  • It's not a multigenic family, as far as I could gather.
  • Couldn't see any unexpected characters either.

Member

axbazin commented Apr 16, 2024

Alright I see, thank you. Maybe the fact that it's the only non-fragment member of the family is linked to the problem?
I will try to replicate the error and get back to you if there is something.

axbazin self-assigned this on Apr 17, 2024
axbazin added the bug label on Apr 17, 2024
Member

axbazin commented Apr 22, 2024

Hi,

I did not manage to reproduce this problem using our testing dataset, nor with a real dataset I was working on, nor using the genome you uploaded.

Would it be possible for you to share a (possibly small-ish) "pangenome.h5" file that resulted in a problem like this?

Adelme

Member

axbazin commented Apr 22, 2024

Just in case: if you have no means of sharing your pangenome.h5 file, share your email address with us and someone from the ppanggolin dev team can provide you with a link where you can upload the file.

Author

jvfe commented May 6, 2024

> Just in case: if you have no means of sharing your pangenome.h5 file, share your email address with us and someone from the ppanggolin dev team can provide you with a link where you can upload the file.

Sorry for such a late reply, but here is the pangenome.h5 file that is failing in the way I described above. Unfortunately it's not that small (~3GB). I'll see if I can create a smaller one that returns this same error.

Member

axbazin commented May 22, 2024

Hi

After some testing, I managed to reproduce something that looks like your error... accidentally.

In my case, it was not directly related to ppanggolin but to missing permissions on the TMPDIR of the system on which PPanGGOLiN is executed. When mafft tries to access that directory, it fails and crashes. The error I got was the same as this one: https://forum.qiime2.org/t/plugin-error-from-phylogeny/19519
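A quick way to test for this kind of problem is to try creating a file in the temporary directory before launching the alignment. The sketch below assumes the relevant directory is whatever the TMPDIR environment variable points to, with /tmp as the fallback. The helper name `dir_is_writable` is hypothetical, and the check needs to run in the same environment (for example, inside the same container) as the failing command.

```python
import os
import tempfile


def dir_is_writable(path):
    """Return True if a temporary file can be created inside `path`."""
    try:
        # The file is deleted automatically when the context exits.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError:
        # Covers a missing directory as well as missing permissions.
        return False


# Assumption: the tool honours TMPDIR and otherwise falls back to /tmp.
candidate = os.environ.get("TMPDIR", "/tmp")
print(candidate, "writable:", dir_is_writable(candidate))
```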

This was, however, impossible to guess because of the way ppanggolin handles the mafft stderr: it is discarded, as the `stderr=subprocess.DEVNULL` in the traceback above shows. The PR linked to this issue improves this.
I'll close this issue once this gets into a release.
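To illustrate the general idea of surfacing a child process's stderr instead of discarding it, here is a minimal sketch. It is not the code from the linked PR: the function name `run_with_stderr` is hypothetical, and a small Python one-liner stands in for a failing aligner so that the example runs without MAFFT installed.

```python
import os
import subprocess
import sys
import tempfile


def run_with_stderr(cmd, outname):
    """Run `cmd`, write its stdout to `outname`, and report stderr on failure."""
    with open(outname, "w") as out:
        result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)
    if result.returncode != 0:
        # Include the captured stderr so the real cause is visible to the user.
        raise RuntimeError(
            f"{cmd[0]} exited with status {result.returncode}:\n{result.stderr}"
        )
    return result


# Stand-in for a failing aligner (hypothetical message, not real MAFFT output).
failing = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('cannot write to tmp dir\\n'); sys.exit(1)",
]
outfile = os.path.join(tempfile.mkdtemp(), "demo.aln")
try:
    run_with_stderr(failing, outfile)
except RuntimeError as err:
    print(err)
```

With stderr captured this way, a permission problem in the temporary directory would show up in the error message rather than as a bare non-zero exit status.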

Adelme

Author

jvfe commented May 30, 2024

I see! Thank you so much for your patience. I managed to fix the issue on my side as well after changing the TMPDIR that Singularity was using.
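The thread does not say how the TMPDIR was changed. One generic way to do it, independent of any container runtime, is to set the variable in the environment of the child process. The sketch below assumes the tool honours TMPDIR; the helper name `run_with_tmpdir` is hypothetical, and a Python one-liner stands in for the real command.

```python
import os
import subprocess
import sys
import tempfile


def run_with_tmpdir(cmd, tmpdir):
    """Run `cmd` with TMPDIR pointing at `tmpdir`, leaving our own environment untouched."""
    env = dict(os.environ, TMPDIR=tmpdir)
    return subprocess.run(cmd, env=env, check=True, capture_output=True, text=True)


# A scratch directory the current user is guaranteed to own.
scratch = tempfile.mkdtemp(prefix="msa_tmp_")

# Stand-in command that just reports the TMPDIR it sees.
probe = [sys.executable, "-c", "import os; print(os.environ['TMPDIR'])"]
print(run_with_tmpdir(probe, scratch).stdout.strip())
```

In practice `probe` would be replaced by the real command line, and `scratch` by a directory that is writable from inside the container.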
