Bakta run times on cluster #282
Hi @loulanomics,
I hope this helps you find the best settings for your situation. If you have any further questions, please do not hesitate to keep asking.
Thank you for your responses, they are very helpful! I do have a few follow-up questions I will try to explain as clearly as possible. Below are rows from a Bakta TSV output (using
Here is the
Prodigal creates unique CDS identifiers by appending an underscore and number to the original contig header. It appears that Bakta's identifiers are in the
But ideally, the
Question 1
What is
Question 2
Is there a built-in way to create unique contig + CDS identifiers? I could create my own (e.g.
Thanks again!
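On question 2, rolling your own Prodigal-style identifiers is short to do by hand. This is a hypothetical sketch (the function name and input format are mine, not part of Bakta's API), following the scheme described above: append an underscore and a per-contig counter to the contig header.

```python
# Hypothetical sketch: build Prodigal-style unique CDS IDs by appending
# "_<n>" to the contig header, numbering CDSs separately per contig.
from collections import defaultdict

def make_cds_ids(cds_contigs):
    """cds_contigs: iterable of contig names, one entry per predicted CDS,
    in the order the CDSs appear. Returns one unique ID per CDS."""
    counter = defaultdict(int)
    ids = []
    for contig in cds_contigs:
        counter[contig] += 1
        ids.append(f"{contig}_{counter[contig]}")
    return ids

print(make_cds_ids(["contig_1", "contig_1", "contig_2"]))
# ['contig_1_1', 'contig_1_2', 'contig_2_1']
```

The same few lines work on any per-CDS column of the TSV output, as long as the rows are grouped or sorted by contig.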
Hi @loulanomics,
These are locus tags, a way to identify genes independently of their names. The letters are randomly generated, unless you provide a prefix yourself, as in the genome submission example. The numbers index the genes. The increment of +5 is actually a smart way to keep the numbering consistent even if new genes are discovered in between, e.g. because of a better annotation. For reference: https://www.ncbi.nlm.nih.gov/genbank/genome_locustag/
Thanks @loulanomics for asking, and @cpauvert for jumping in with a perfect answer to question 1. Regarding question 2:
Hello, we are currently integrating Bakta v1.9.3 (updated from v1.3.1) into a metagenomics pipeline. The entire pipeline runs on our HPC cluster, where the necessary databases are also stored. It takes about a day to annotate a ~500 MB contigs FASTA using the full (not light) Bakta database with 500 GB of memory. This is okay, but not ideal, so in hopes of speeding it up, we have a few questions.
Question 1
About this FAQ,
Given we are on a SLURM-managed cluster, is it possible that this is the reason run times are so long? As mentioned, the database is stored "locally" in our working directory, but is it considered remote/network because the task is transferred to compute nodes (or some other reason)?
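If network I/O to the shared database is indeed the bottleneck, one common SLURM pattern is to copy the database to node-local scratch at the start of the job and point Bakta at that copy. A minimal sketch, assuming `$TMPDIR` points at node-local storage; the database path, thread count, and file names below are placeholders, not our actual setup:

```python
# Sketch: run Bakta against a node-local copy of the database.
# All concrete paths here are hypothetical placeholders.
import os

scratch = os.environ.get("TMPDIR", "/tmp")   # node-local scratch on many SLURM sites
local_db = os.path.join(scratch, "bakta-db")

# In the job script, the shared database would first be copied once per node, e.g.:
#   cp -r /shared/path/to/bakta-db "$TMPDIR"/
cmd = ["bakta", "--db", local_db, "--threads", "16",
       "--output", "annotation", "contigs.fasta"]
print(" ".join(cmd))
```

Whether this helps depends on whether the compute nodes read the database over a network filesystem; if the working directory is already on fast local disk, the copy adds overhead for no gain.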
Question 2
You mention that `db-light` could still suit our needs because we are not interested in all possible/hypothetical proteins. However, we do care about all possible CDSs/ORFs as a starting point for more-specific gene annotations later on. If we use `db-light`, are we losing any ORF detection? Or would accompanying it with a tool like `Prodigal` be more suitable?
Question 3
Generally, what are the differences between the light and full versions? (feel free to link if this is explained elsewhere)
Thank you!