Optimizing KMCP with HumGut #38

Open
ericvdtoorn opened this issue Aug 29, 2023 · 9 comments

@ericvdtoorn
Hello Shenwei,
Thank you for making this new metagenomic tool! I'm interested in benchmarking its performance, and for that I want to run classification on a large number (10,000+) of samples. I've come across a number of minor things, but I have an end-to-end sample running now. My main objective in opening this issue is to discuss potential optimizations to reduce the total running time.

Currently, this single sample took 9 hours (08:58) end to end with 32 cores and 23 GB of RAM (100 GB was the maximum available).

  • Database used was the HumGut database, split over 10 indices
  • Working directory was in RAM (/dev/shm), populated by copying the samples there first (done to avoid being limited by NAS speeds)
  • Times
    • Step 1: kmcp search, run through GNU parallel with 12 threads per job, took 2 hours per search
    • Step 2: kmcp merge took 15 min with 64 threads
    • Step 3: kmcp profile took 4 h 52 min with 64 threads
      • Adding the --no-amb-corr flag would reduce this by 1 h 53 min

From this, I could potentially reduce the time to ~6 hours by allocating more cores, increasing the number of searches running in parallel, and adding the --no-amb-corr flag.
Are there other optimizations I'm missing?

I've created a gist with the actual SLURM submission (a rough sketch of the pipeline is included below); I can also upload the job logs if that's helpful.
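A minimal sketch of the three steps above, assuming 10 split indices searched through GNU parallel; the sample and database names are placeholders, and the flag names follow the KMCP documentation as I recall them (double-check against kmcp <command> --help):

# step 1: search the sample against each of the 10 split indices, 12 threads per search
ls -d kmcp_humgut.n10-*.kmcp \
    | parallel -j 3 'kmcp search -j 12 -d {} sample_R1.fq.gz sample_R2.fq.gz -o sample@{/}.tsv.gz'

# step 2: merge the per-index search results into one file
kmcp merge -j 64 -o sample.kmcp.tsv.gz sample@*.tsv.gz

# step 3: profile, optionally skipping ambiguity correction with --no-amb-corr
kmcp profile -j 64 -X taxdump/ -T taxid.map sample.kmcp.tsv.gz -o sample.kmcp.profile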

@shenwei356
Owner

shenwei356 commented Aug 29, 2023

Hi Eric, thanks for your interest.

Firstly, for the database: I'm wondering why you split it over 10 indices. How many reference genomes are you using? Previously, I created one database with 30,691 genomes and a total size of 21.6 GB. That would fit into your HPC node (100 GB). I mean, you could simply create one KMCP database and search one sample against it on an HPC node with all CPUs (64 cores), so the merging step would be unnecessary.

$ du -sh humgut.kmcp
22G     humgut.kmcp


$ more humgut.kmcp/R001/__db.yml 
version: 4
unikiVersion: 4
alias: humgut.kmcp
k: 0
ks:
- 21
hashed: true
canonical: true
scaled: false
scale: 1
minimizer: false
minimizer-w: 0
syncmer: false
syncmer-s: 0
split-seq: true
split-size: 0
split-num: 10
split-overlap: 0
compact-size: true
hashes: 1
fpr: 0.3
numNameGroups: 306910
blocksize: 9592
totalKmers: 61025880337
files:

But according to #36, it seemed you did not split the genome set.
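As a rough sketch of that single-database setup (paths are placeholders and flag names follow the KMCP tutorial as I recall them, so double-check with kmcp <command> --help), the merge step disappears entirely:

# search one sample against the single database using all 64 cores
kmcp search -j 64 -d humgut.kmcp/ sample_R1.fq.gz sample_R2.fq.gz -o sample.kmcp.tsv.gz

# no kmcp merge needed; profile the search result directly
kmcp profile -j 64 -X taxdump/ -T taxid.map sample.kmcp.tsv.gz -o sample.kmcp.profile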

@shenwei356
Owner

To be honest, there's little room to significantly improve the speed of KMCP :(. I have to say, 10,000+ is a very large number; unless you have a large number of HPC nodes, that will take a long time with KMCP. Maybe you can use a portion of the samples for benchmarking purposes.

@ericvdtoorn
Author

ericvdtoorn commented Aug 29, 2023

Firstly, for the database: I'm wondering why you split it over 10 indices. [...] But according to #36, it seemed you did not split the genome set.

Oh? I followed the instructions in the second part of this to build the split database, and I did get multiple DBs:

# number of split databases; one SLURM array task builds one subset
N_DB=10
CURRENT=$SLURM_ARRAY_TASK_ID
# the subsets were created beforehand with:
# split -n r/$N_DB -d  ${DB_NAME}.files.txt ${DB_NAME}.n$N_DB-

# pick the genome file list for this array task
f=$(sed -n "${CURRENT}p" subsets.txt)
echo "$f"

# compute k-mers (k=21, 10 chunks per genome) and build the index (1 hash, FPR 0.3)
kmcp compute -i "$f" -O "$f-k21-n10" -k 21 -n 10 -l 150 -B plasmid \
    --log "$f-k21-n10.log" --force -j 32
kmcp index -j 32 -I "$f-k21-n10" -O "$f.kmcp" -n 1 -f 0.3 \
    --log "$f.kmcp.log" --force

# copy taxid and name mapping files into the database directory
cp taxid.map name.map "$f.kmcp/"

> fd __db.yml
kmcp_humgut.n10-05.kmcp/R001/__db.yml
kmcp_humgut.n10-03.kmcp/R001/__db.yml
kmcp_humgut.n10-00.kmcp/R001/__db.yml
kmcp_humgut.n10-08.kmcp/R001/__db.yml
kmcp_humgut.n10-09.kmcp/R001/__db.yml
kmcp_humgut.n10-06.kmcp/R001/__db.yml
kmcp_humgut.n10-07.kmcp/R001/__db.yml
kmcp_humgut.n10-04.kmcp/R001/__db.yml
humgut.kmcp.single/R001/__db.yml
kmcp_humgut.n10-02.kmcp/R001/__db.yml
kmcp_humgut.n10-01.kmcp/R001/__db.yml

And I'm using a list of the n10-NN databases as my index.

I did this because, from your results, it seemed to be a lot faster to work with a split database.

@ericvdtoorn
Author

To be honest, there's little room to significantly improve the speed of KMCP :(. I have to say, 10,000+ is a very large number; unless you have a large number of HPC nodes, that will take a long time with KMCP. Maybe you can use a portion of the samples for benchmarking purposes.

Unfortunate to hear it can't go faster yet, though that won't be much of an issue for most users (it's not every day you're running this many samples ;-) ).
I've requested access to more compute; hopefully that will be enough to run (a significant part of) the samples.

@shenwei356
Owner

I did this because, from your results, it seemed to be a lot faster to work with a split database.

Yes, for one sample, but not for multiple samples. You can also test with the single database (humgut.kmcp.single).

Using split databases mainly serves two scenarios (a sketch of the first follows below):

  1. Parallelization across HPC nodes, with each node searching against one split database. This greatly improves the search speed for ONE sample, since all CPUs work simultaneously on a single sample, and a merge step is then needed. However, for many samples, the total CPU hours are roughly the same as searching each sample against one single database on an HPC node.
  2. A single database built from a very large number of reference genomes can't fit into RAM, so you have to use split databases. Obviously, you don't need to here.
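A rough sketch of scenario 1, assuming a SLURM array in which each task searches one split index on its own node (names, paths, and flags are illustrative, not an actual submission script):

#!/bin/bash
#SBATCH --array=0-9
#SBATCH --cpus-per-task=64

# each array task searches the sample against one split index
DB=$(printf "kmcp_humgut.n10-%02d.kmcp" "$SLURM_ARRAY_TASK_ID")
kmcp search -j 64 -d "$DB" sample_R1.fq.gz sample_R2.fq.gz -o "sample@${DB}.tsv.gz"

# afterwards, on any single node, merge the per-index results and profile once:
# kmcp merge -o sample.kmcp.tsv.gz sample@*.tsv.gz
# kmcp profile -X taxdump/ -T taxid.map sample.kmcp.tsv.gz -o sample.kmcp.profile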

@ericvdtoorn
Author

ericvdtoorn commented Sep 1, 2023

Hi, thank you for your comments. I'm setting up the single server now.

Separately, I added some code to the kmcp profile command to do performance tracking, and got the attached flame graph (it can be viewed with go tool pprof -http=:8080 ./file).

From this it seems that a significant share of the time (30%+) is spent parsing the input data. Given that the input of the profile function is internal anyway, I think it's feasible to first convert it into a different format with static typing (e.g. Parquet), or even an in-memory structure like Arrow.

In the process, I also noticed a slight bug in the merged TSV output: there is an extra tab at the end of every row but the first, which makes strict parsers panic.

[flame graph image attached]

DRR171473.kmcp.perf.profile.zip
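For anyone hitting the trailing-tab issue on an older binary, a possible stopgap (assuming gzip-compressed merge output and GNU sed) is to strip the trailing tab before handing the file to a strict parser:

# hypothetical workaround: drop a trailing tab from every line of the merged result
zcat sample.kmcp.tsv.gz | sed 's/\t$//' | gzip > sample.kmcp.fixed.tsv.gz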

@shenwei356
Owner

I'm thrilled that you used the pprof tool to analyze the code. Previously, I used it a lot to improve performance.

From this it seems that a significant share of the time (30%+) is spent parsing the input data.

Yes, parsing floats (fpr and qcov) is slow. I've tried some alternative methods, but there was no significant improvement.

Given that the input of the profile function is internal anyway, I think it's feasible to first convert it into a different format with static typing (e.g. Parquet), or even an in-memory structure like Arrow.

The input of profile, i.e., the search result, needs to be in a plain format for easy interpretation, because KMCP can also act as a sequence/genome search tool.

But I think a good approach would be to use a binary format for fast downstream parsing, accompanied by a view command to convert it back to the plain format.

I also noticed a slight bug in the merged TSV output: there is an extra tab at the end of every row but the first, which makes strict parsers panic.

Thanks, it's fixed. Please use the new binaries:

changes

@ericvdtoorn
Author

I'm thrilled that you used the pprof tool to analyze the code. Previously, I used it a lot to improve performance.

It was a pleasure to look at!

Yes, parsing floats (fpr and qcov) is slow. I've tried some alternative methods, but there was no significant improvement.
The input of profile, i.e., the search result, needs to be in a plain format for easy interpretation, because KMCP can also act as a sequence/genome search tool.
But I think a good approach would be to use a binary format for fast downstream parsing, accompanied by a view command to convert it back to the plain format.

Ah yes, I'm focused on the profiling for my use case, but that's a valid point. Perhaps the format could also be inferred from the outFile parameters of the search and merge commands?
The buffered reading also happens 5 times; it would also be possible to cache the entire file in memory (if enough is available, perhaps with something like the -w param of search?).

Thanks, it's fixed. Please use the new binaries:

changes

Thank you for the quick fix!

@shenwei356
Owner

I checked the pprof output again. Besides float parsing, gzip reading/writing and column-value splitting are also performance bottlenecks (HTML: kmcp.cpu.zip, https://github.com/shenwei356/kmcp/files/12505199/kmcp.cpu.zip).

[pprof screenshot attached]

As for buffering search results in memory: I don't think it's feasible, because some gzip-compressed search result files are quite large (>10 GB) in my analyses, and they would occupy a huge amount of memory even with compact data structures.

Using serialized binary files should be the best way to enable quick downstream parsing.
