long read metagenomic profiling #27

Open
JensUweUlrich opened this issue Mar 27, 2023 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments

@JensUweUlrich

Dear Wei Shen,
I really like your tool and your tutorials. I just have a question regarding long-read metagenomic profiling. Is there a specific parameter combination you would recommend for taxonomic profiling? It seems like I'm missing some organisms from the Zymo Mock Community, even when using profiling mode m=0.
Thanks
Jens

@shenwei356
Owner

Thanks for your interest.

KMCP is only suitable for short-read metagenomic profiling; its sensitivity on long-read datasets is much lower. My initial plan was to support both short and long reads, but the read matching strategy, i.e., keeping reads with enough (>= 50%) of their k-mers contained in a genome chunk, has low sensitivity for long reads, even for HiFi reads.
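To make the matching rule above concrete, here is a minimal Python sketch of k-mer containment with a 50% threshold, plus a back-of-the-envelope estimate of how many k-mers survive sequencing errors. It is not KMCP's implementation: the function names are hypothetical, `k = 21` is only an illustrative value, and the error model (independent per-base errors) is a simplifying assumption.

```python
def kmers(seq, k):
    """Return the set of all k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(read, reference_kmers, k):
    """Fraction of the read's distinct k-mers that occur in the reference set."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & reference_kmers) / len(read_kmers)


def passes_threshold(read, reference_kmers, k, min_fraction=0.5):
    """Keep a read only if enough of its k-mers are contained in the reference."""
    return containment(read, reference_kmers, k) >= min_fraction


def expected_intact_fraction(error_rate, k):
    """Probability that a k-mer carries no error.

    Assumes independent per-base errors (a simplification).
    """
    return (1.0 - error_rate) ** k


if __name__ == "__main__":
    k = 21  # illustrative value only
    for error_rate in (0.001, 0.01, 0.05, 0.10):
        frac = expected_intact_fraction(error_rate, k)
        print(f"error rate {error_rate:.3f}: expected intact k-mers {frac:.3f}")
```

Under this model, a 10% per-base error rate leaves only about 11% of 21-mers intact, far below a 50% cutoff. The model is too simple to account for everything reported here, for example the reduced sensitivity on HiFi reads.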

Several strategies were tried, but none of them worked as well as expected.

  1. Setting a lower similarity threshold. For our probabilistic data structure, a lower threshold significantly increases the false-positive rate (FPR) of a read; the FPR can be reduced again, but only at the cost of bigger databases.
  2. Using sketching algorithms. Scaled MinHash, Closed Syncmers, and Minimizer were all implemented (available in the current version) and tested, but they gave lower sensitivity on error-prone long reads. Though tools like minimap2 benefit from Minimizer with location information for seeding and chaining in sequence alignment, we failed to utilize them in taxonomic profiling.
  3. Using multiple k-mer sizes. K-mers of different lengths, e.g., 17, 21, 31, didn't do better than a single value and doubled the database size.
  4. Using SimHash, which tolerates base substitutions better than exact k-mers. Unexpectedly, it was slower and less sensitive.
  5. Breaking long reads into short ones. This only applies to HiFi reads, and it wastes the strength of long reads.
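The trade-off in item 1 can be illustrated with a small calculation. The sketch below assumes that each k-mer of an unrelated read is reported as present independently with some per-k-mer false-positive probability (as a Bloom-filter-like index might do), and computes the chance that such a read still reaches the similarity threshold. The function name and all numbers are hypothetical; this is a simplified model, not KMCP's actual FPR computation.

```python
from math import ceil, comb


def read_false_positive_rate(n_kmers, kmer_fpr, min_fraction):
    """Probability that an unrelated read passes the similarity threshold.

    Assumes each of the read's n_kmers k-mers is falsely reported as present
    independently with probability kmer_fpr, and that the read is kept when
    at least ceil(min_fraction * n_kmers) k-mers are reported present.
    """
    needed = ceil(min_fraction * n_kmers)
    # Upper tail of a binomial distribution: P(X >= needed).
    return sum(
        comb(n_kmers, i) * kmer_fpr ** i * (1.0 - kmer_fpr) ** (n_kmers - i)
        for i in range(needed, n_kmers + 1)
    )


if __name__ == "__main__":
    n_kmers = 100   # hypothetical number of k-mers in a read
    kmer_fpr = 0.3  # hypothetical per-k-mer false-positive probability
    for threshold in (0.5, 0.4, 0.3, 0.2):
        rate = read_false_positive_rate(n_kmers, kmer_fpr, threshold)
        print(f"threshold {threshold:.1f}: read-level FPR {rate:.3g}")
```

In this model the read-level FPR grows quickly as the threshold drops, and the only way to bring it back down at a fixed threshold is a smaller per-k-mer false-positive probability, which for a Bloom-filter-style index generally means a larger database.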

@shenwei356 shenwei356 added the documentation Improvements or additions to documentation label Jun 6, 2023