problems with schema creating #107
Greetings @KasiaTluscik, Thank you for your interest in chewBBACA. Can you tell us which version of chewBBACA you are using and what command you ran? Also, you mention that you want to create a schema based on 200 genome assemblies from the NCBI, but chewBBACA is detecting 201 input files. If you only have 200 genome assemblies to pass as input, please make sure that there are no other files in the directory that might be causing this issue. Rafael
Thanks for the quick reply. I meant "about 200" :). There are actually 201 FASTA files. The program version is 2.8.5. Command below:
@KasiaTluscik thanks for the clarification and for posting the command. Can you make sure that none of the files in the directory are corrupted or contain unusual characters? Also, how did you install chewBBACA?
I've checked all files with seqkit stats and they seem to be OK. Maybe I should validate the data with something else? Can you give me an idea? I installed chewBBACA via conda.
@KasiaTluscik, we know that in some cases this type of error was solved by redownloading the genome assemblies or by running a script, such as sequence_cleaner.py from Biopython, to filter out low-quality sequences and fix file format issues.
Hi, Thanks for the hint, but it didn't solve the problem. I downloaded everything again and it still doesn't work. I also ran the sequences through the suggested script (sequence_cleaner.py) and got the same error. I enclose a link to download my dataset (the file is too large for e-mail).
Can you try to run Prodigal independently on your dataset? Just to check whether it raises an error on any of the FASTA files.
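A minimal sketch of how that check could be scripted, for anyone following along. It assumes the `prodigal` executable is on the PATH and that the assemblies use the `.fna` extension; the function names, the output directory and the `.gbk` suffix are illustrative choices, not something chewBBACA provides. Translation table 11 matches the value reported in the log.

```python
import subprocess
from pathlib import Path


def prodigal_command(fasta_path, output_path, translation_table=11):
    """Build a Prodigal command line for one FASTA file (hypothetical helper)."""
    return [
        "prodigal",
        "-i", str(fasta_path),          # input genome
        "-o", str(output_path),         # gene coordinates output
        "-g", str(translation_table),   # translation table
        "-q",                           # quiet mode
    ]


def find_failing_inputs(fasta_dir, output_dir):
    """Run Prodigal on every *.fna file and return (file name, stderr) for failures."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    failures = []
    for fasta in sorted(Path(fasta_dir).glob("*.fna")):
        # Keep the full file name in the output name so two assembly
        # versions never overwrite each other's results.
        out = Path(output_dir) / (fasta.name + ".gbk")
        proc = subprocess.run(
            prodigal_command(fasta, out), capture_output=True, text=True
        )
        if proc.returncode != 0:
            failures.append((fasta.name, proc.stderr.strip()))
    return failures
```

Calling `find_failing_inputs("genomes", "prodigal_out")` would return an empty list if Prodigal accepts every file, which suggests the problem lies elsewhere in the pipeline.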
Hello @KasiaTluscik, Thank you for sharing the dataset with us. We used the files to reproduce the error and found what is causing the issue. The dataset contains two versions of the assembly GCF_002209245 (GCF_002209245.1_ASM220924v1_genomic.fna and GCF_002209245.2_ASM220924v2_genomic.fna). chewBBACA does not expect to find multiple assemblies with the same unique prefix (it takes all characters before the first "." as the unique identifier). You can find more details in an issue that I have created to describe the problem and suggest a solution, #108. Duplicated older version: GCF_002209245.1_ASM220924v1_genomic.fna. The CreateSchema process should run without issues after removing such files. If you really need to include multiple versions, you will have to rename the files to ensure that the prefixes are different. Rafael
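To spot this kind of collision before running CreateSchema, something like the following sketch may help. It applies the rule described above (everything before the first "." is the identifier) to a list of file names; `find_prefix_collisions` is a hypothetical helper written for this thread, not part of chewBBACA.

```python
from collections import defaultdict
from pathlib import Path


def find_prefix_collisions(filenames):
    """Group file names by the text before the first '.' and
    return only the prefixes shared by more than one file."""
    groups = defaultdict(list)
    for name in filenames:
        # Use the base name only, then cut at the first dot.
        prefix = Path(name).name.split(".", 1)[0]
        groups[prefix].append(name)
    return {
        prefix: sorted(names)
        for prefix, names in groups.items()
        if len(names) > 1
    }


# Example with the two files from this dataset plus an unrelated
# placeholder file name.
example = [
    "GCF_002209245.1_ASM220924v1_genomic.fna",
    "GCF_002209245.2_ASM220924v2_genomic.fna",
    "other_assembly.fna",
]
for prefix, files in find_prefix_collisions(example).items():
    print(prefix, "->", ", ".join(files))
```

To check a real input directory, pass the file names from `Path(directory).iterdir()` instead of the example list. Any prefix that is reported needs one of its files removed or renamed.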
Hello :)
Hello @KasiaTluscik, It is great to know that we could solve this problem. Rafael
Indeed Kasia, great to know that it worked out. In addition to what Rafael already pointed out, I would like to remind you that the CreateSchema process always creates a wgMLST schema. To identify the cgMLST schema, you should run the allele call on a set of suitable isolates and then identify which loci are present in the isolates at your desired frequency (100%, 95%, 90%, ...). I suspect that if you do this now with the schema you have, the cgMLST schema will come down to a few hundred genes, given the diversity of species you are including. Do let us know how it goes.
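As a rough illustration of the frequency criterion, the sketch below keeps the loci found in at least a given fraction of isolates. The input structure (a dict mapping each isolate to the set of loci detected in it) and the function name are assumptions made for this example; they do not reflect the format of chewBBACA's allele call results.

```python
def loci_at_frequency(presence, threshold=0.95):
    """Return the loci present in at least `threshold` of the isolates.

    presence: dict mapping isolate name -> set of loci found in it
              (hypothetical structure, for illustration only).
    """
    n_isolates = len(presence)
    if n_isolates == 0:
        return []
    counts = {}
    for loci in presence.values():
        for locus in loci:
            counts[locus] = counts.get(locus, 0) + 1
    return sorted(
        locus for locus, count in counts.items()
        if count / n_isolates >= threshold
    )


# Toy data: four isolates and three loci.
toy = {
    "isolate_a": {"locus1", "locus2"},
    "isolate_b": {"locus1"},
    "isolate_c": {"locus1", "locus2"},
    "isolate_d": {"locus1", "locus3"},
}
print(loci_at_frequency(toy, threshold=1.0))  # loci found in every isolate
print(loci_at_frequency(toy, threshold=0.5))  # loci found in at least half
```

Lowering the threshold tolerates loci that are missing from a few isolates, which is why a diverse dataset tends to yield a small core at 100%.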
Hello, I'm trying to run the command but I get this:
0, in |
Hello @ocarabali, The error you report seems to be related to the Bio.Alphabet module, which was removed in Biopython 1.78. We updated chewBBACA's code based on the recommendations from the Biopython developers, and this issue should not affect the latest version. Can you please check whether the version you are using is 2.8.5? Please update chewBBACA if the installed version is not the latest. Rafael
Dear Rafael Mamede, I already solved the previous error. Now I have the following error. Could you help me? Error on translate_coding_sequences:
chewBBACA version: 2.8.5
It seems that your schema was created with chewBBACA 2.1.0 or lower.
I've got the same problem with version 3.1.2.
Did you use the PrepExternalSchema module to convert your schema to a format usable by chewBBACA 3.1.2? What is the origin of the schema you are trying to use?
Hi Ramirma,
Thank you for the clarification, @massiizsve. What command did you use for allele calling? Did you run allele calling on the same genomes you used to create the schema?
Hello @massiizsve, Just to add another check to what @ramirma has already asked. Please verify that you have passed the path to the schema directory to perform allele calling. The schema directory contains the schema FASTA files and a folder named Rafael |
Hi :)
I ran into the problem below while creating a new schema from about 200 reference sequences from NCBI. I have no idea how to solve it and would be grateful for help.
Kasia
CPU cores: 22
BLAST Score Ratio: 0.6
Translation table: 11
Minimum sequence length: 201
Size threshold: 0.2
Word size: 5
Window size: 5
Clustering similarity: 0.2
Representative filter: 0.9
Intra-cluster filter: 0.9
Number of inputs: 201
Predicting gene sequences...
[====================] 100%
Extracting coding sequences...
[ ] 0%
Error on cds_batch_extractor:
Traceback (most recent call last):
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/multiprocessing_operations.py", line 39, in function_helper
results = input_args-1
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/gene_prediction.py", line 235, in cds_batch_extractor
total = save_extracted_cds(g, identifier, orf_file_path,
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/gene_prediction.py", line 182, in save_extracted_cds
genome_info = extract_genome_cds(reading_frames, contigs, 1)
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/gene_prediction.py", line 99, in extract_genome_cds
sequence = contigs[contig_id]
KeyError: 'NZ_CP031775'
[====================] 100%
Traceback (most recent call last):
File "/home/msszwarc/miniconda3/envs/chewbbaca/bin/chewBBACA.py", line 10, in
sys.exit(main())
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/chewBBACA.py", line 1480, in main
functions_info[process]1
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/utils/process_datetime.py", line 149, in wrapper
func(*args, **kwargs)
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/chewBBACA.py", line 193, in create_schema
CreateSchema.main(**vars(args))
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/createschema/CreateSchema.py", line 1192, in main
results = create_schema_seed(input_files, output_directory, schema_name,
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/createschema/CreateSchema.py", line 963, in create_schema_seed
cds_files = extract_genes(fasta_files, prodigal_path,
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/createschema/CreateSchema.py", line 296, in extract_genes
total_extracted = sum([f[2] for f in extracted_cdss])
File "/home/msszwarc/miniconda3/envs/chewbbaca/lib/python3.9/site-packages/CHEWBBACA/createschema/CreateSchema.py", line 296, in
total_extracted = sum([f[2] for f in extracted_cdss])
IndexError: list index out of range