a lighter database? #95

Closed
GaioTransposon opened this issue Feb 20, 2022 · 5 comments

@GaioTransposon

Hi there and thank you for the tool,

Is there an option to download only part of the database?
https://zenodo.org/record/5961398/files/db.tar.gz is nearly 30 GB and it takes about 12 hours to download (I am using bakta_db download --output . with bakta installed via conda).

What if one wants to use only one of the DBs (e.g. UniProtKB/Swiss-Prot 2021_04)?

Kind Regards
Dany

@oschwengers
Owner

oschwengers commented Feb 21, 2022

Hi Dany,
thanks for reaching out. Yes, the DB size is an issue for some users. Since we decided on a taxonomically untargeted approach and database, it has become fairly large.

The two largest parts of the DB are the PSC Diamond db (UniRef90 cluster representative sequences) and the SQLite db storing the ~200 million IPS sequence hashes (UniRef100) along with all pre-compiled annotations. Therefore, excluding everything except a single annotation DB wouldn't significantly reduce the DB size.
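
To sketch the idea behind the IPS lookup (a minimal illustration only; the db path, table and column names below are assumptions, not Bakta's actual schema, and MD5 is used here just as an example hash function): each protein sequence is hashed and the hash is used as a key into the SQLite db.

# hash an amino-acid sequence and look up its pre-compiled annotation
# (db path, table and column names are illustrative assumptions)
hash=$(printf '%s' "$aa_sequence" | md5sum | cut -d ' ' -f 1)
sqlite3 db/bakta.db "SELECT * FROM ips WHERE hash = '${hash}' LIMIT 1;"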

One option to reduce the database size (which I have already thought about) is to compile sub-databases for certain phyla. Of course, that would require a couple of things to be developed, implemented and tested, and would therefore take some time on a mid-term schedule. If this were of interest to more users, we'd happily address it.

Another option would be to host the database on additional servers distributed around the globe, which might provide more bandwidth and better download times. Would that help in your case? Do you know of any free hosting services that would be eligible?

Best regards,
Oliver

@oschwengers oschwengers added enhancement New feature or request help wanted Extra attention is needed labels Feb 21, 2022
@oschwengers oschwengers pinned this issue Mar 10, 2022
@oschwengers
Owner

Another idea (inspired by @tseemann) is to use a ranked set of broader protein clusters. This could be achieved by skipping the IPS and PSC from the normal database and using only a size-filtered subset of the PSCC.

A quick check on UniProt/UniRef50 revealed 2,660,356 UniRef50 proteins. I'd estimate this would reduce the entire database to roughly 3-4 GB.
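
As a rough back-of-envelope check (assuming an average protein length of roughly 300 residues, which is an assumed figure): 2,660,356 sequences × ~300 aa ≈ 0.8 GB of raw amino-acid data, so a few GB in total including the Diamond index and pre-compiled annotations looks plausible.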

@oschwengers oschwengers self-assigned this Feb 22, 2023
oschwengers added a commit that referenced this issue Feb 22, 2023
oschwengers added a commit that referenced this issue Feb 22, 2023
@oschwengers oschwengers added this to the v1.7.0 milestone Feb 22, 2023
@oschwengers
Owner

Hi @GaioTransposon,
FYI: you might be interested in v1.7.0, which introduces a light database version as described in #196.

This lightweight version is only 1.2 GB zipped and 3 GB unzipped.
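
A minimal usage sketch (the output directory and the extracted db path below are placeholders, not prescribed names):

# download the light database and point bakta at it (paths are illustrative)
bakta_db download --output ./bakta-db --type light
bakta --db ./bakta-db/db-light genome.fasta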

@jfy133

jfy133 commented Mar 2, 2023

EDIT: it was a faulty conda installation (I think 'scale' was missing); it's working now :), and using the latest biocontainer build also works now :)

I just tried this with 1.7.0 but I get the following error (both with the conda tool installed via bioconda, and also with the corresponding Singularity biocontainer):

$ bakta_db download --type light
Bakta software version: 1.7.0
Required database schema version: 5

fetch DB versions...
	... compatible DB versions: 1
download database: v5.0, type=light, 2023-02-20, DOI: 10.5281/zenodo.7669534, URL: https://zenodo.org/record/7669534/files/db-light.tar.gz...
Traceback (most recent call last):
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 91, in validator
    result = CONFIG_VARS[key](value)
KeyError: 'scale'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jfellows/.conda/envs/bakta/bin/bakta_db", line 10, in <module>
    sys.exit(main())
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/bakta/db.py", line 203, in main
    download(db_url, tarball_path)
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/bakta/db.py", line 119, in download
    with alive_bar(total=total_length, scale='SI') as bar:
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/progress.py", line 95, in alive_bar
    config = config_handler(**options)
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 82, in create_context
    local_config.update(_parse(theme, options))
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 106, in _parse
    return {k: validator(k, v) for k, v in options.items()}
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 106, in <dictcomp>
    return {k: validator(k, v) for k, v in options.items()}
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 96, in validator
    raise ValueError('invalid config name: {}'.format(key))
ValueError: invalid config name: scale

Did I miss something in my command, for example?

Conda environment creation: conda create -n bakta -c bioconda bakta

@oschwengers
Owner

Yes, the third-party dependencies needed an update. It should work now.
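
For anyone who hit the same error, a sketch of how to pick up the updated dependencies (assuming a conda-based install; exact channels and commands may vary):

# refresh bakta and its pinned third-party dependencies
conda update -n bakta -c conda-forge -c bioconda bakta
# or recreate the environment from scratch
conda create -n bakta -c conda-forge -c bioconda bakta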
