Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

problems with schema creating #107

Open
KasiaTluscik opened this issue Nov 30, 2021 · 23 comments
Open

problems with schema creating #107

KasiaTluscik opened this issue Nov 30, 2021 · 23 comments
Assignees
Labels
Priority: Medium Status: In Progress Has been assigned and is being worked on. Type: Bug

Comments

@KasiaTluscik
Copy link

Hi :)

I had such problem while creating a new scheme on 200 ref sequences from NCBI. I have no idea how to solve it. I'll be grateful for help.

Kasia

CPU cores: 22
BLAST Score Ratio: 0.6
Translation table: 11
Minimum sequence length: 201
Size threshold: 0.2
Word size: 5
Window size: 5
Clustering similarity: 0.2
Representative filter: 0.9
Intra-cluster filter: 0.9
Number of inputs: 201

Predicting gene sequences...

[====================] 100%

Extracting coding sequences...

[ ] 0%
Error on cds_batch_extractor:
Traceback (most recent call last):
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/multiprocessing_operations.py", line 39, in function_helper
results = input_args-1
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/gene_prediction.py", line 235, in cds_batch_extractor
total = save_extracted_cds(g, identifier, orf_file_path,
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/gene_prediction.py", line 182, in save_extracted_cds
genome_info = extract_genome_cds(reading_frames, contigs, 1)
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/gene_prediction.py", line 99, in extract_genome_cds
sequence = contigs[contig_id]
KeyError: 'NZ_CP031775'

[====================] 100%Traceback (most recent call last):
File "/home/msszwarc/miniconda3/envs/chewbbaca/bin/chewBBACA.py", line 10, in
sys.exit(main())
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/chewBBACA.py", line 1480, in main
functions_info[process]1
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/process_datetime.py", line 149, in wrapper
func(*args, **kwargs)
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/chewBBACA.py", line 193, in create_schema
CreateSchema.main(**vars(args))
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/createschema/CreateSchema.py", line 1192, in main
results = create_schema_seed(input_files, output_directory, schema_name,
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/createschema/CreateSchema.py", line 963, in create_schema_seed
cds_files = extract_genes(fasta_files, prodigal_path,
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/createschema/CreateSchema.py", line 296, in extract_genes
total_extracted = sum([f[2] for f in extracted_cdss])
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/createschema/CreateSchema.py", line 296, in
total_extracted = sum([f[2] for f in extracted_cdss])
IndexError: list index out of range

@rfm-targa rfm-targa self-assigned this Nov 30, 2021
@rfm-targa
Copy link
Contributor

Greetings @KasiaTluscik,

Thank you for your interest in chewBBACA. Can you tell us what version of chewBBACA you are using and what was the command? Also, you refer that you want to create a schema based on 200 genome assemblies from the NCBI, but chewBBACA is detecting 201 input files. If you only have 200 genome assemblies to pass as input, I ask you to make sure that there are not other files in the directory that might be causing this issue.
Looking forward to hearing from you.

Rafael

@KasiaTluscik
Copy link
Author

KasiaTluscik commented Dec 1, 2021

Thanks for the quick reply. I meant ,,about 200" :) . There are actually 201 FASTA sequences. The program version is: 2.8.5. Command below:
chewBBACA.py CreateSchema -i ~/e_paz/cgMLST/referencje_cgMLST/ -o ~/e_paz/cgMLST/shewanella_cgMLST_scheme --n shewanella_cgMLST_scheme --cpu 22 --ptf ~/e_paz/cgMLST/shewanellatrainedfile.trn

@ramirma
Copy link
Member

ramirma commented Dec 1, 2021

@KasiaTluscik thanks for the clarification and for posting the command. Can you make sure none of the files in the directory are corrupted nor have unusual characters? Also, how did you install chewBBACA?
Looking forward to hearing from you.
Mario

@KasiaTluscik
Copy link
Author

I've checked all files with seqkit stats. Seems to be ok. Maybe I should validate this data with something else? Can you give me an idea? I installed chewBBACA via conda.

@rfm-targa
Copy link
Contributor

@KasiaTluscik, we know that in some cases this type of error was solved by redownloading the genome assemblies or by running a script, such as sequence_cleaner.py from BioPython, to filter out low quality sequences and solve file format issues.
We have tried to reproduce this type of error by downloading the same set of genome assemblies that we were told were leading to the issue, but the process always completed successfully on our side. That is why we think it is an issue related with one or several of the input files that may be corrupted. If it is not much to ask, can you ZIP the whole directory that contains the genome assemblies and send us the data to imm-bioinfo@medicina.ulisboa.pt? It might help us to identify the cause of the issue, after which we can share a solution and implement a verification to detect this issue during execution.

@KasiaTluscik
Copy link
Author

Hi,

Thanks for the hint, but it didn't solve the problem. I downloaded everything again and it doesn't work. I ran sequences through the suggested script (sequence_cleaner.py) and ... still the same error ....I enclose a link to download my dataset (the file is too large for e-mail).
https://we.tl/t-0uEY0SEoJ4
This is new set with 211 fasta.

@mickaelsilva
Copy link
Member

Can you try to run prodigal independently on your dataset? Just to check if it runs an error on any of the fastas

@rfm-targa
Copy link
Contributor

Hello @KasiaTluscik,

Thank you for sharing the dataset with us. We used the files to reproduce the error and found what is causing the issue.
The dataset includes multiple versions for three RefSeq assemblies. The assemblies are the following:

GCF_002209245 (GCF_002209245.1_ASM220924v1_genomic.fna and GCF_002209245.2_ASM220924v2_genomic.fna)
GCF_007004545 (GCF_007004545.2_ASM700454v2_genomic.fna and GCF_007004545.3_ASM700454v3_genomic.fna)
GCF_007923045 (GCF_007923045.2_ASM792304v3_genomic.fna and GCF_007923045.3_ASM792304v4_genomic.fna)

chewBBACA does not expect to find multiple assemblies with the same unique prefix (it selects all characters before the first "." as the unique identifier). You can find more details in an issue that I have created to describe the problem and suggest a solution, #108.
I suggest that you only keep the latest version for each assembly. The NCBI replaced/suppressed the previous versions. To do that you only need to remove the following files:

GCF_002209245.1_ASM220924v1_genomic.fna
GCF_007004545.2_ASM700454v2_genomic.fna
GCF_007923045.2_ASM792304v3_genomic.fna

The CreateSchema process should run without issues after removing those files. If you really need to include multiple versions, you will have to rename the files to ensure that the prefixes are different.
Once again, thank you for sharing the dataset with us. It allowed us to identify the issue and we will include a fix in the next version.
Please let us know if you can create the schema after removing or renaming the files.

Rafael

@rfm-targa rfm-targa added Priority: Medium Status: In Progress Has been assigned and is being worked on. Type: Bug labels Dec 6, 2021
@KasiaTluscik
Copy link
Author

Hello :)
I had an unexpected quarantine pause, but now I'm on the job. Your suggestion actually solved my problem. Now everything works fine. Thank you for your help.
One more question. I've created a schema containing 46039 loci. Isn't that a lot? When I look at schemas for other bacterias, it's usually only a few thousand loci...
greetings
Kasia

@rfm-targa
Copy link
Contributor

rfm-targa commented Dec 15, 2021

Hello @KasiaTluscik,

It is great to know that we could solve this problem.
If I recall correctly, your dataset includes several species for the Shewanella genus. The schemas that have a few thousand loci are usually created with datasets for a single species, although the number of loci in a schema can vary greatly based on the species and dataset composition. A schema with over 40k loci is certainly possible if you are working at genus-level and there is enough interspecific diversity. However, I would suggest that you check if the genome assemblies that you have selected are correctly classified, as the inclusion of misclassified strains can considerably increase the number of loci in a schema and affect the determination of the core-genome.

Rafael

@ramirma
Copy link
Member

ramirma commented Dec 15, 2021

Indeed Kasia, great to know that it worked out. In addition to what Rafael already pointed out I would like to remind you that the CreateSchema process always creates a wgMLST schema. To identify the cgMLST schema you should run the allele call on a set of suitable isolates and then identify which loci are present in the isolates at your desired frequency (100%, 95%, 90%,...). I suspect that if you do this the now with the schema you have the cgMLST schema will come down to a few hundred genes, given the diversity of species you are including. Do let us know how it goes.

@ocarabali
Copy link

Hello
you could help me.

I'm trying to run the command but I get this
chewBBACA.py CreateSchema -i ~/fasta_pseudomona/ --ptf PAO1.trn --cpu 6 If 'chewBBACA.py' is not a typo you can use command-not-found to lookup the package that contains it, like this:
cnf chewBBACA.py
cnf chewBBACA.py CreateSchema -i ~/fasta_pseudomona/ --ptf PAO1.trn --cpu 6

@ocarabali
Copy link

0, in
raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.
(base) ocarabali@bioinf-hpc15:~/tesis> chewBBACA.py CreateSchema -i ~/tesis/fasta_pseudomona
Traceback (most recent call last):
File "/vault2/homehpc/ocarabali/miniconda3/bin/chewBBACA.py", line 6, in
from CHEWBBACA.chewBBACA import main
File "/vault2/homehpc/ocarabali/miniconda3/lib/python3.9/site-packages/CHEWBBACA/chewBBACA.py", line 8, in
from Bio.Alphabet import generic_dna
File "/vault2/homehpc/ocarabali/miniconda3/lib/python3.9/site-packages/Bio/Alphabet/init.py", line 20, in
raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

1 similar comment
@ocarabali
Copy link

0, in
raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.
(base) ocarabali@bioinf-hpc15:~/tesis> chewBBACA.py CreateSchema -i ~/tesis/fasta_pseudomona
Traceback (most recent call last):
File "/vault2/homehpc/ocarabali/miniconda3/bin/chewBBACA.py", line 6, in
from CHEWBBACA.chewBBACA import main
File "/vault2/homehpc/ocarabali/miniconda3/lib/python3.9/site-packages/CHEWBBACA/chewBBACA.py", line 8, in
from Bio.Alphabet import generic_dna
File "/vault2/homehpc/ocarabali/miniconda3/lib/python3.9/site-packages/Bio/Alphabet/init.py", line 20, in
raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

@rfm-targa
Copy link
Contributor

Hello @ocarabali,

The error you report seems to be related with the Bio.Alphabet module that was removed in Biopython 1.78. We updated chewBBACA's code based on the recommendations from the Biopython developers and this issue should not affect the latest version. Can you please check if the version you are using is 2.8.5? Please update chewBBACA if the version that is installed is not the latest.
Let us know if this solves the issue.

Rafael

@ocarabali
Copy link

Dear Rafael Mamede

I already solved the previous error.

now i have the following error.

could you help me?

Error on translate_coding_sequences:
Traceback (most recent call last):
File "/vault2/soft/miniconda2/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/multiprocessing_operations.py", line 39, in function_helper
results = input_args-1
File "/vault2/soft/miniconda2/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/sequence_manipulation.py", line 526, in translate_coding_sequences
cds_index = SeqIO.index(sequences_file, 'fasta')
File "/vault2/soft/miniconda2/envs/chewbbaca/lib/python3.9/site-packages/Bio/SeqIO/init.py", line 875, in index
return _IndexedSeqFileDict(
File "/vault2/soft/miniconda2/envs/chewbbaca/lib/python3.9/site-packages/Bio/File.py", line 199, in init
raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key '0000-0056-5600'

@ocarabali
Copy link

chewBBACA version: 2.8.5

@ocarabali
Copy link

It seems that your schema was created with chewBBACA 2.1.0 or lower.
It is highly recommended that you run the PrepExternalSchema process to guarantee full compatibility with the new chewBBACA version.
If you wish to continue, the AlleleCall process will convert the schema to v2.8.5, but will not determine if schema structure respects configuration values.

@massiizsve
Copy link

I've got the same problem with version 3.1.2
Provided path does not include all the necessary schema files. Please verify that you have passed the correct path to the schema.
could you help me?
I've checked for same assemblies names and I didn't got...

@ramirma
Copy link
Member

ramirma commented Apr 24, 2023

Did you use the PrepExternalSchema module to convert your schema to a format usable by chewBBACA 3.1.2? What is the origin of the schema you are trying to use?

@massiizsve
Copy link

Hi Ramirma,
thanks for your quick reply, I used the command: chewBBACA.py CreateSchema -i Genomes -o schema --cpu 16 --ptf Salmonella_enterica.trn
to generate the schema, it didn't return me any error message.
thanks!

@ramirma
Copy link
Member

ramirma commented Apr 26, 2023

Thank you for the clarification @massiizsve . What command did you use to do allele calling? Did you do allele calling on the same genomes you used to create the schema?
Mario

@rfm-targa
Copy link
Contributor

Hello @massiizsve,

Just to add another check to what @ramirma has already asked. Please verify that you have passed the path to the schema directory to perform allele calling. The schema directory contains the schema FASTA files and a folder named short that includes the FASTA files with the representative alleles. Based on the command you used to create the schema, the schema directory should be schema/schema_seed. chewBBACA prints the warning you shared if it cannot find the FASTA files and the short folder in the path you provided.

Rafael

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority: Medium Status: In Progress Has been assigned and is being worked on. Type: Bug
Projects
None yet
Development

No branches or pull requests

6 participants