Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: Does two reads with opposite strand with same mapping location regarded as duplicates? #67

Open
zztin opened this issue Jul 5, 2023 · 1 comment

Comments

@zztin
Copy link

zztin commented Jul 5, 2023

Hi Andrew,

I have a nanopore sequencing library where one molecule could result in reads on both forward or reverse strands. I would like to know if preseq regard these pairs as "distinct" observation, or as "repeated" observation?

The definition of "distinct" observation is not very clearly stated in the documentation. I was trying to find it in the implementation:

Related lines:

curr_gr.get_start() != prev_gr.get_start())

Related class: GenomicRegion.hpp

It seems like get_start() is getting the smaller coordinate of a bam read?
If so, does it mean that both reverse and forward-mapped reads would be seen as the same molecule (same 'start') -- This is the desired behavior in my use case.

I would love to hear from you!

The command I used:

./preseq lc_extrap -B -o ./yield-estimates.txt <input-sorted-bam>

Software version:
v2.0 downloaded precompiled binary

Best,
Li-Ting

@andrewdavidsmith
Copy link
Contributor

@zztin I think you are probably right about the meaning of that code. Frankly, it's a weak criterion, but it should work for generating big-picture complexity results. It would not be a great approach if the purpose were, e.g., to accurately identify SNVs. It might lose information that would help at some important locus. Also, I've only ever considered how this should work for short reads.

Unfortunately I don't think I can answer your question now and I will leave this issue open, but here's one thing that might help: the best way to use Preseq is by providing a counts histogram (or the counts themselves). That would allow you to count duplicates according to your own definition.

I'll try go address this in more detail soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants