Systematic bias at low coverage (under 20%) #45

tylerbarnum · 2020-08-28T17:14:03Z

(For others who come across this: this issue concerns an edge use case of Nonpareil. I'm otherwise very happy with the program and trust it for higher-coverage samples.)

I designed an experiment to see how the output of Nonpareil changes when a FASTQ file is repeatedly halved in size. Above a redundancy value of 20%, the subsampled FASTQ files follow the Nonpareil curve of the larger FASTQ file. Below 20%, however, the subsampled data show a systematic bias towards low redundancy (the plot below shows an example of the data within the affected range). The bias affects estimates of diversity and of how much additional sequencing effort is needed. I suspect that the issue may lie, using the language of the original paper, in the assumptions behind how the total number of reads affects the probability of observing matches between reads. As the total number of reads decreases, matches between reads become less and less likely; is the binomial distribution still appropriate in such a context?
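For anyone who wants to reproduce the halving step, here is a minimal sketch of how such a series of subsamples could be built. It is not the exact script used for this issue: the function names `read_fastq` and `halving_series` are hypothetical, and it assumes unwrapped four-line FASTQ records, nested subsamples (each half is drawn from the previous one), and a file small enough to hold in memory.

```python
import io
import random
from typing import Iterator, List, Tuple

# One FASTQ record: header, sequence, separator, and quality lines.
Record = Tuple[str, str, str, str]


def read_fastq(handle) -> Iterator[Record]:
    """Yield records from a text handle; assumes unwrapped 4-line FASTQ."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # skip stray blank lines between records
            continue
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def halving_series(records: List[Record], min_reads: int = 1,
                   seed: int = 0) -> List[List[Record]]:
    """Return [full, half, quarter, ...]; each set is a random half of the previous one."""
    rng = random.Random(seed)  # fixed seed so the series is reproducible
    series = [list(records)]
    while len(series[-1]) // 2 >= min_reads:
        current = series[-1]
        # Sample without replacement; read order is not preserved.
        series.append(rng.sample(current, len(current) // 2))
    return series


def write_fastq(records: List[Record], handle) -> None:
    """Write records back out as 4-line FASTQ."""
    for rec in records:
        handle.write("\n".join(rec) + "\n")
```

Each element of the returned list can be written to its own file with `write_fastq` and run through Nonpareil separately, so that the resulting curves can be compared against the curve of the full dataset.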

lmrodriguezr self-assigned this Aug 28, 2020

lmrodriguezr added the bug label Aug 28, 2020
