Efficiently extracting counts from a subset of kmers? #36

nikostr · 2024-05-07T09:47:15Z

I have run kmdiff and identified overrepresented kmers between two groups. Following this, I created a membership matrix to identify kmers present in all my case samples, and intersected these with the overrepresented kmers identified by kmdiff. Now I would like to get the counts of these kmers in each of my case samples. I already have the count matrices produced by kmdiff. Dumping these to text and grepping them is one way of doing it, but not a very efficient one. What would you recommend here? Unfortunately, my C++ is terrible.

nikostr · 2024-05-30T13:34:35Z

I posted this question before I understood the merge and aggregate commands. In case someone else has the same issue, I solved it by doing the following:

kmtricks merge \
    --recurrence-min $N_CASES \
    --cpr \
    --run-dir kmdiff-count \
    --threads 16

kmtricks aggregate \
    --run-dir kmdiff-count \
    --matrix kmer \
    --format text \
    --cpr-in \
    --output count-matrix.out \
    --threads 16

The first command creates a matrix with kmers occurring in at least as many samples as I have cases (N_CASES), and the second command dumps this matrix as a text file. Following this, I grepped count-matrix.out for the kmers I had identified previously.
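For anyone who wants something stricter than a plain grep for the last step, here is a minimal sketch using awk. It assumes that each line of count-matrix.out is a kmer followed by whitespace-separated counts, and that the kmer list has one kmer per line written in the same form as in the matrix (for example canonical). The function name filter_counts and the file name selected-kmers.txt are only placeholders for illustration.

```shell
# filter_counts KMER_LIST COUNT_MATRIX
# Print the rows of COUNT_MATRIX whose first field appears in KMER_LIST.
# Assumed layout: one kmer per line in KMER_LIST; in COUNT_MATRIX the kmer
# is the first whitespace-separated field, followed by one count per sample.
filter_counts() {
    awk '
        # While reading the first file, remember every kmer in a hash table.
        FILENAME == ARGV[1] { if (NF) wanted[$1]; next }
        # While reading the matrix, keep rows whose kmer was remembered.
        $1 in wanted
    ' "$1" "$2"
}

# Example call (file names are placeholders):
# filter_counts selected-kmers.txt count-matrix.out > selected-counts.txt
```

Since the lookup is an exact match on the first field, a kmer in the list cannot accidentally match a longer string or a count, and the matrix is read only once regardless of how many kmers are in the list.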

Note: using this count matrix it should be possible to find these kmers without creating the membership matrix.
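A rough sketch of how that could look, again with awk: keep only the rows whose count is non-zero in every case column. This assumes the same text layout as above (kmer first, then one count per sample) and that you know which sample columns correspond to your cases. I have not verified how the column order is determined, so check it against your input file list first. The function name present_in_all is just a placeholder.

```shell
# present_in_all SAMPLE_INDICES COUNT_MATRIX
# SAMPLE_INDICES: comma-separated 1-based sample positions, e.g. "1,3,4",
# where sample 1 is the first count column after the kmer (assumed layout).
# Prints rows whose count is non-zero in every listed sample.
present_in_all() {
    awk -v cols="$1" '
        BEGIN { n = split(cols, idx, ",") }
        {
            # Skip the row as soon as one of the listed samples has count 0.
            for (i = 1; i <= n; i++)
                if ($(idx[i] + 1) + 0 == 0) next
            print
        }
    ' "$2"
}

# Example call (indices are placeholders for the case columns):
# present_in_all 1,2,3 count-matrix.out > kmers-in-all-cases.txt
```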

nikostr closed this as completed May 30, 2024
