Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid alleles and invalid genes #189

Open
lskatz opened this issue Dec 18, 2023 · 8 comments
Open

Invalid alleles and invalid genes #189

lskatz opened this issue Dec 18, 2023 · 8 comments
Assignees
Labels
Status: In Progress Has been assigned and is being worked on.

Comments

@lskatz
Copy link

lskatz commented Dec 18, 2023

Hi, I am using ChewBBACA v3 and imported the EnteroBase Salmonella cgMLST scheme and got invalid genes. These are the steps I took.

# get scheme: fasta + trn files
wget https://enterobase.warwick.ac.uk/schemes/Salmonella.cgMLSTv2/ --recursive --level 1 --continue # or something similar -- I think I lost this exact command.
wget 'https://chewbbaca.online/api/NS/api/download/prodigal_training_files/961572d47338721d4312af253fab174d3ef974f32e3af1e2e94e3cdf7dd459e98f6ebe04f1372a3ddfb120d2c0457772fd19d264c963c78e6a421606e81bb3bf' -O Salmonella_enterica.enterobase/Salmonella_enterica.trn # download URL determined by clicking the link from chewbbaca.online/stats for the trn file
# avoid the references file being seen as a locus by moving it to a subfolder
mkdir Salmonella_enterica.enterobase/_ref 
mv Salmonella_enterica.enterobase/cgMLST_v2_ref.fasta.ref Salmonella_enterica.enterobase/_ref/cgMLST_v2_ref.fasta

And then I prepped the scheme with

nohup chewie PrepExternalSchema -i Salmonella_enterica.enterobase -o /scicomp/groups/OID/NCEZID/DFWED/EDLB/projects/validation/mlstComparison/MLST.db/Salmonella_enterica.enterobase --ptf Salmonella_enterica.enterobase/Salmonella_enterica.trn --cpu 12 >& Salmonella_enterica.enterobase.chewie.log &
cat Salmonella_enterica.enterobase.chewie.log
nohup: ignoring input

chewBBACA version: 3.1.0
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt

==================================
  chewBBACA - PrepExternalSchema
==================================
Started at: 2023-12-18T16:34:15

Adapting schema in the following directory:
/scicomp/groups-pure/OID/NCEZID/DFWED/Projects_DFWED/cgMLST-comparison/MLST.db/Salmonella_enterica.enterobase
Prodigal training file:
Salmonella_enterica.enterobase/Salmonella_enterica.trn
Number of cores: 12
BLAST Score Ratio: 0.6
Translation table: 11
Minimum accepted sequence length: 0
Size threshold: 0.2
Number of genes to adapt: 3002

Determining the total number of alleles and allele mean length per gene...

Adapting 3002 genes...

 [====================] 100%

Number of invalid genes: 2
Number of invalid alleles: 332651

Successfully adapted 3000/3002 genes present in the input schema.

Finished at: 2023-12-18T16:47:05
Took  12m 49s.

Why do I see invalid genes and alleles? How can I fix it? Or is there anything to fix?

@rfm-targa rfm-targa self-assigned this Dec 19, 2023
@rfm-targa rfm-targa added the Status: In Progress Has been assigned and is being worked on. label Dec 19, 2023
@rfm-targa
Copy link
Contributor

Hello @lskatz!

Thank you for your interest in chewBBACA and for sharing the steps to adapt the schema.
Regarding the loci and alleles that are excluded during the adaptation. chewBBACA excludes any allele that is not a complete coding sequence (CDS) (1 start and stop codon, length multiple of 3, no ambiguous bases, etc.). The S. enterica schema includes many alleles that are not complete CDSs and are removed during schema adaptation. There are no complete CDSs in the FASTA files for two loci, which is why those loci are not included in the adapted schema.
Feel free to let us know if there is anything else.

Kind regards,

Rafael

@lskatz
Copy link
Author

lskatz commented Dec 19, 2023

Then, are you saying that these two loci are never going to be called when using ChewBBACA even though they are in the curated scheme? How do I identify which two these are? And which alleles they are?

@lskatz
Copy link
Author

lskatz commented Dec 19, 2023

Does this also mean that the chewbbaca.online wgMLST scheme excludes these two cgMLST loci?

@lskatz
Copy link
Author

lskatz commented Dec 19, 2023

The loci seem to have alternative start sites. Is it possible to specify alternative start sites when running PrepExternalSchema? I translated them and then show the first allele for each below.

grep -A 1 ">" -m 1 -h STMMW_09661.faa STMMW_11461.faa
>STMMW_09661_1_1
LAHYPAARCQVTPIVYFTVPLYVALMSIHQAVVSGKRCITRRCRVSRNHVLNLVCCVLNA
--
>STMMW_11461_1_1
LLSIAITTGILSGIWGWVAVSLGLLSWAGFLGCTAYFACPQGGFKGLLISACTLLSGMVW

@rfm-targa
Copy link
Contributor

Yes, those two loci will not be called. They are excluded during schema adaptation, meaning that the adapted schema will not contain any FASTA files or information about those loci. You can define the genetic code/table used by Prodigal (chewie <3.3.0)/Pyrodigal (chewie >= 3.3.0) and for translation, but I doubt that will help with those two loci. The default is genetic code 11 (The Bacterial, Archaeal and Plant Plastid Code). I have just checked, and as expected, those two loci are not in the wgMLST schema available on Chewie-NS.

The PrepExternalSchema creates three output files, which are the following:

  • <output_directory_basename>_invalid_alleles.txt: contains the list of alleles that were excluded and a small description of the issues found for each allele (will include a line for every single allele of the loci that were excluded).
  • <output_directory_basename>_invalid_loci.txt: contains the list of loci that were excluded (will include the identifiers of the two loci that were excluded, STMMW_09661 (inner membrane protein, partial) and STMMW_11461 (putative membrane protein)).
  • <output_directory_basename>_summary_stats.tsv: table with the number of total alleles in the original schema, the number of valid/accepted alleles and the number of representative alleles selected per locus.

These files are created in the parent directory of the output directory you pass to the -o parameter. I have also noticed that the process does not print information about these files or output anything about the excluded loci and alleles. This has to change. #190 suggests some changes to print the list of invalid alleles and loci. I think printing some information about the output files would be enough to avoid printing too much information if there are a lot of invalid alleles, such as in the case of the S. enterica schema. This is something that will be added to the next version.

Also, I noticed that you are using Chewie v3.1.0. The latest version, v3.3.1, includes new features and several fixes, which you might find important. You can check the list of changes in the Changelog.
Let me know if there is anything else.

Rafael

@lskatz
Copy link
Author

lskatz commented Dec 20, 2023

Thank you! These insights are helpful! One more question is, what if I want to force these loci into the scheme? Is there a method to bring in one locus at a time?

@ramirma
Copy link
Member

ramirma commented Dec 28, 2023

Not easily Lee, we have no option to do that. But even if we had the problem then becomes how and if would prodigal (or pyrodigal in the latest chewBBACA versions) recognize the loci in the query genomes. If these are not valid ORFs for pyrodigal, they will not be passed onto chewBBACA and will not be called. We do not seem to have a straightforward and simple solution to forcefully include the loci that would guarantee they would be called even if included.
Do not forget that this is an "adapted" schema which was not created to chewBBACA's requirements so we are bound to have differences when making it suitable for chewBBACA.
Mario

@lskatz
Copy link
Author

lskatz commented Jan 4, 2024

Thank you for helping me understand it so well! I understand where you are coming from.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: In Progress Has been assigned and is being worked on.
Projects
None yet
Development

No branches or pull requests

3 participants