problems with schema creating #107

KasiaTluscik · 2021-11-30T11:47:53Z

Hi :)

I had such problem while creating a new scheme on 200 ref sequences from NCBI. I have no idea how to solve it. I'll be grateful for help.

Kasia

CPU cores: 22
BLAST Score Ratio: 0.6
Translation table: 11
Minimum sequence length: 201
Size threshold: 0.2
Word size: 5
Window size: 5
Clustering similarity: 0.2
Representative filter: 0.9
Intra-cluster filter: 0.9
Number of inputs: 201

Predicting gene sequences...

[====================] 100%

Extracting coding sequences...

[ ] 0%
Error on cds_batch_extractor:
Traceback (most recent call last):
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/multiprocessing_operations.py", line 39, in function_helper
results = input_args-1
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/gene_prediction.py", line 235, in cds_batch_extractor
total = save_extracted_cds(g, identifier, orf_file_path,
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/gene_prediction.py", line 182, in save_extracted_cds
genome_info = extract_genome_cds(reading_frames, contigs, 1)
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/gene_prediction.py", line 99, in extract_genome_cds
sequence = contigs[contig_id]
KeyError: 'NZ_CP031775'

[====================] 100%Traceback (most recent call last):
File "/home/msszwarc/miniconda3/envs/chewbbaca/bin/chewBBACA.py", line 10, in
sys.exit(main())
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/chewBBACA.py", line 1480, in main
functions_info[process]1
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/process_datetime.py", line 149, in wrapper
func(*args, **kwargs)
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/chewBBACA.py", line 193, in create_schema
CreateSchema.main(**vars(args))
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/createschema/CreateSchema.py", line 1192, in main
results = create_schema_seed(input_files, output_directory, schema_name,
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/createschema/CreateSchema.py", line 963, in create_schema_seed
cds_files = extract_genes(fasta_files, prodigal_path,
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/createschema/CreateSchema.py", line 296, in extract_genes
total_extracted = sum([f[2] for f in extracted_cdss])
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/createschema/CreateSchema.py", line 296, in
total_extracted = sum([f[2] for f in extracted_cdss])
IndexError: list index out of range

rfm-targa · 2021-11-30T17:01:03Z

Greetings @KasiaTluscik,

Thank you for your interest in chewBBACA. Can you tell us what version of chewBBACA you are using and what was the command? Also, you refer that you want to create a schema based on 200 genome assemblies from the NCBI, but chewBBACA is detecting 201 input files. If you only have 200 genome assemblies to pass as input, I ask you to make sure that there are not other files in the directory that might be causing this issue.
Looking forward to hearing from you.

Rafael

KasiaTluscik · 2021-12-01T06:03:46Z

Thanks for the quick reply. I meant ,,about 200" :) . There are actually 201 FASTA sequences. The program version is: 2.8.5. Command below:
chewBBACA.py CreateSchema -i ~/e_paz/cgMLST/referencje_cgMLST/ -o ~/e_paz/cgMLST/shewanella_cgMLST_scheme --n shewanella_cgMLST_scheme --cpu 22 --ptf ~/e_paz/cgMLST/shewanellatrainedfile.trn

ramirma · 2021-12-01T11:43:20Z

@KasiaTluscik thanks for the clarification and for posting the command. Can you make sure none of the files in the directory are corrupted nor have unusual characters? Also, how did you install chewBBACA?
Looking forward to hearing from you.
Mario

KasiaTluscik · 2021-12-02T06:04:21Z

I've checked all files with seqkit stats. Seems to be ok. Maybe I should validate this data with something else? Can you give me an idea? I installed chewBBACA via conda.

rfm-targa · 2021-12-02T17:38:26Z

@KasiaTluscik, we know that in some cases this type of error was solved by redownloading the genome assemblies or by running a script, such as sequence_cleaner.py from BioPython, to filter out low quality sequences and solve file format issues.
We have tried to reproduce this type of error by downloading the same set of genome assemblies that we were told were leading to the issue, but the process always completed successfully on our side. That is why we think it is an issue related with one or several of the input files that may be corrupted. If it is not much to ask, can you ZIP the whole directory that contains the genome assemblies and send us the data to imm-bioinfo@medicina.ulisboa.pt? It might help us to identify the cause of the issue, after which we can share a solution and implement a verification to detect this issue during execution.

KasiaTluscik · 2021-12-06T12:31:34Z

Hi,

Thanks for the hint, but it didn't solve the problem. I downloaded everything again and it doesn't work. I ran sequences through the suggested script (sequence_cleaner.py) and ... still the same error ....I enclose a link to download my dataset (the file is too large for e-mail).
https://we.tl/t-0uEY0SEoJ4
This is new set with 211 fasta.

mickaelsilva · 2021-12-06T12:37:46Z

Can you try to run prodigal independently on your dataset? Just to check if it runs an error on any of the fastas

rfm-targa · 2021-12-06T15:44:42Z

Hello @KasiaTluscik,

Thank you for sharing the dataset with us. We used the files to reproduce the error and found what is causing the issue.
The dataset includes multiple versions for three RefSeq assemblies. The assemblies are the following:

GCF_002209245 (GCF_002209245.1_ASM220924v1_genomic.fna and GCF_002209245.2_ASM220924v2_genomic.fna)
GCF_007004545 (GCF_007004545.2_ASM700454v2_genomic.fna and GCF_007004545.3_ASM700454v3_genomic.fna)
GCF_007923045 (GCF_007923045.2_ASM792304v3_genomic.fna and GCF_007923045.3_ASM792304v4_genomic.fna)

chewBBACA does not expect to find multiple assemblies with the same unique prefix (it selects all characters before the first "." as the unique identifier). You can find more details in an issue that I have created to describe the problem and suggest a solution, #108.
I suggest that you only keep the latest version for each assembly. The NCBI replaced/suppressed the previous versions. To do that you only need to remove the following files:

GCF_002209245.1_ASM220924v1_genomic.fna
GCF_007004545.2_ASM700454v2_genomic.fna
GCF_007923045.2_ASM792304v3_genomic.fna

The CreateSchema process should run without issues after removing those files. If you really need to include multiple versions, you will have to rename the files to ensure that the prefixes are different.
Once again, thank you for sharing the dataset with us. It allowed us to identify the issue and we will include a fix in the next version.
Please let us know if you can create the schema after removing or renaming the files.

Rafael

KasiaTluscik · 2021-12-15T12:18:53Z

Hello :)
I had an unexpected quarantine pause, but now I'm on the job. Your suggestion actually solved my problem. Now everything works fine. Thank you for your help.
One more question. I've created a schema containing 46039 loci. Isn't that a lot? When I look at schemas for other bacterias, it's usually only a few thousand loci...
greetings
Kasia

rfm-targa · 2021-12-15T14:51:26Z

Hello @KasiaTluscik,

It is great to know that we could solve this problem.
If I recall correctly, your dataset includes several species for the Shewanella genus. The schemas that have a few thousand loci are usually created with datasets for a single species, although the number of loci in a schema can vary greatly based on the species and dataset composition. A schema with over 40k loci is certainly possible if you are working at genus-level and there is enough interspecific diversity. However, I would suggest that you check if the genome assemblies that you have selected are correctly classified, as the inclusion of misclassified strains can considerably increase the number of loci in a schema and affect the determination of the core-genome.

Rafael

ramirma · 2021-12-15T16:52:05Z

Indeed Kasia, great to know that it worked out. In addition to what Rafael already pointed out I would like to remind you that the CreateSchema process always creates a wgMLST schema. To identify the cgMLST schema you should run the allele call on a set of suitable isolates and then identify which loci are present in the isolates at your desired frequency (100%, 95%, 90%,...). I suspect that if you do this the now with the schema you have the cgMLST schema will come down to a few hundred genes, given the diversity of species you are including. Do let us know how it goes.

ocarabali · 2022-01-29T16:28:20Z

Hello
you could help me.

I'm trying to run the command but I get this
chewBBACA.py CreateSchema -i ~/fasta_pseudomona/ --ptf PAO1.trn --cpu 6 If 'chewBBACA.py' is not a typo you can use command-not-found to lookup the package that contains it, like this:
cnf chewBBACA.py
cnf chewBBACA.py CreateSchema -i ~/fasta_pseudomona/ --ptf PAO1.trn --cpu 6

ocarabali · 2022-01-29T17:15:46Z

0, in
raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.
(base) ocarabali@bioinf-hpc15:~/tesis> chewBBACA.py CreateSchema -i ~/tesis/fasta_pseudomona
Traceback (most recent call last):
File "/vault2/homehpc/ocarabali/miniconda3/bin/chewBBACA.py", line 6, in
from CHEWBBACA.chewBBACA import main
File "/vault2/homehpc/ocarabali/miniconda3/lib/python3.9/site-packages/CHEWBBACA/chewBBACA.py", line 8, in
from Bio.Alphabet import generic_dna
File "/vault2/homehpc/ocarabali/miniconda3/lib/python3.9/site-packages/Bio/Alphabet/init.py", line 20, in
raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

ocarabali · 2022-01-29T17:16:07Z

0, in
raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.
(base) ocarabali@bioinf-hpc15:~/tesis> chewBBACA.py CreateSchema -i ~/tesis/fasta_pseudomona
Traceback (most recent call last):
File "/vault2/homehpc/ocarabali/miniconda3/bin/chewBBACA.py", line 6, in
from CHEWBBACA.chewBBACA import main
File "/vault2/homehpc/ocarabali/miniconda3/lib/python3.9/site-packages/CHEWBBACA/chewBBACA.py", line 8, in
from Bio.Alphabet import generic_dna
File "/vault2/homehpc/ocarabali/miniconda3/lib/python3.9/site-packages/Bio/Alphabet/init.py", line 20, in
raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

rfm-targa · 2022-01-31T11:32:08Z

Hello @ocarabali,

The error you report seems to be related with the Bio.Alphabet module that was removed in Biopython 1.78. We updated chewBBACA's code based on the recommendations from the Biopython developers and this issue should not affect the latest version. Can you please check if the version you are using is 2.8.5? Please update chewBBACA if the version that is installed is not the latest.
Let us know if this solves the issue.

Rafael

ocarabali · 2022-02-19T02:25:00Z

Dear Rafael Mamede

I already solved the previous error.

now i have the following error.

could you help me?

Error on translate_coding_sequences:
Traceback (most recent call last):
File "/vault2/soft/miniconda2/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/multiprocessing_operations.py", line 39, in function_helper
results = input_args-1
File "/vault2/soft/miniconda2/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/sequence_manipulation.py", line 526, in translate_coding_sequences
cds_index = SeqIO.index(sequences_file, 'fasta')
File "/vault2/soft/miniconda2/envs/chewbbaca/lib/python3.9/site-packages/Bio/SeqIO/init.py", line 875, in index
return _IndexedSeqFileDict(
File "/vault2/soft/miniconda2/envs/chewbbaca/lib/python3.9/site-packages/Bio/File.py", line 199, in init
raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key '0000-0056-5600'

ocarabali · 2022-02-19T02:37:26Z

chewBBACA version: 2.8.5

ocarabali · 2022-02-19T23:54:09Z

It seems that your schema was created with chewBBACA 2.1.0 or lower.
It is highly recommended that you run the PrepExternalSchema process to guarantee full compatibility with the new chewBBACA version.
If you wish to continue, the AlleleCall process will convert the schema to v2.8.5, but will not determine if schema structure respects configuration values.

massiizsve · 2023-04-24T14:23:10Z

I've got the same problem with version 3.1.2
Provided path does not include all the necessary schema files. Please verify that you have passed the correct path to the schema.
could you help me?
I've checked for same assemblies names and I didn't got...

ramirma · 2023-04-24T14:44:47Z

Did you use the PrepExternalSchema module to convert your schema to a format usable by chewBBACA 3.1.2? What is the origin of the schema you are trying to use?

massiizsve · 2023-04-25T14:22:19Z

Hi Ramirma,
thanks for your quick reply, I used the command: chewBBACA.py CreateSchema -i Genomes -o schema --cpu 16 --ptf Salmonella_enterica.trn
to generate the schema, it didn't return me any error message.
thanks!

ramirma · 2023-04-26T07:12:26Z

Thank you for the clarification @massiizsve . What command did you use to do allele calling? Did you do allele calling on the same genomes you used to create the schema?
Mario

rfm-targa · 2023-04-26T08:46:55Z

Hello @massiizsve,

Just to add another check to what @ramirma has already asked. Please verify that you have passed the path to the schema directory to perform allele calling. The schema directory contains the schema FASTA files and a folder named short that includes the FASTA files with the representative alleles. Based on the command you used to create the schema, the schema directory should be schema/schema_seed. chewBBACA prints the warning you shared if it cannot find the FASTA files and the short folder in the path you provided.

Rafael

rfm-targa self-assigned this Nov 30, 2021

rfm-targa added Priority: Medium Status: In Progress Has been assigned and is being worked on. Type: Bug labels Dec 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

problems with schema creating #107

problems with schema creating #107

KasiaTluscik commented Nov 30, 2021

rfm-targa commented Nov 30, 2021

KasiaTluscik commented Dec 1, 2021 •

edited

ramirma commented Dec 1, 2021

KasiaTluscik commented Dec 2, 2021

rfm-targa commented Dec 2, 2021

KasiaTluscik commented Dec 6, 2021

mickaelsilva commented Dec 6, 2021

rfm-targa commented Dec 6, 2021

KasiaTluscik commented Dec 15, 2021

rfm-targa commented Dec 15, 2021 •

edited

ramirma commented Dec 15, 2021

ocarabali commented Jan 29, 2022

ocarabali commented Jan 29, 2022

ocarabali commented Jan 29, 2022

rfm-targa commented Jan 31, 2022

ocarabali commented Feb 19, 2022

ocarabali commented Feb 19, 2022

ocarabali commented Feb 19, 2022

massiizsve commented Apr 24, 2023

ramirma commented Apr 24, 2023

massiizsve commented Apr 25, 2023

ramirma commented Apr 26, 2023

rfm-targa commented Apr 26, 2023

problems with schema creating #107

problems with schema creating #107

Comments

KasiaTluscik commented Nov 30, 2021

rfm-targa commented Nov 30, 2021

KasiaTluscik commented Dec 1, 2021 • edited

ramirma commented Dec 1, 2021

KasiaTluscik commented Dec 2, 2021

rfm-targa commented Dec 2, 2021

KasiaTluscik commented Dec 6, 2021

mickaelsilva commented Dec 6, 2021

rfm-targa commented Dec 6, 2021

KasiaTluscik commented Dec 15, 2021

rfm-targa commented Dec 15, 2021 • edited

ramirma commented Dec 15, 2021

ocarabali commented Jan 29, 2022

ocarabali commented Jan 29, 2022

ocarabali commented Jan 29, 2022

rfm-targa commented Jan 31, 2022

ocarabali commented Feb 19, 2022

ocarabali commented Feb 19, 2022

ocarabali commented Feb 19, 2022

massiizsve commented Apr 24, 2023

ramirma commented Apr 24, 2023

massiizsve commented Apr 25, 2023

ramirma commented Apr 26, 2023

rfm-targa commented Apr 26, 2023

KasiaTluscik commented Dec 1, 2021 •

edited

rfm-targa commented Dec 15, 2021 •

edited