Data for "Per base sequence quality" section differ from FastQC #25

Shelestova-Anastasia · 2022-01-25T11:08:52Z

falco version 0.3.0

Fastqc and falco results are differrent for section "Per base sequence quality"

Per base sequence quality seqtion to compare in results:
pbsq_falco.txt
pbsq_fastqc.txt

Could not attach sample (25.3 Mb). Let me know if I can provide more information.

Shelestova-Anastasia · 2022-02-24T10:16:35Z

Now I made small fastq to demonstrate per base quality difference.

Sample: test_quality.fastq.gz

Falco result:
falco_fastqc_data.txt

Fastqc result:
fastqc_data.txt

See >>Per base sequence quality section, 10-14 group.

falco:
10-14 36.6908 38.0 38.0 38.0 36.0 38.0
fastqc:
10-14 36.11448 37.4 36.6 37.4 34.4 37.4

I see Median, Lower, Quartile, Upper Quartile, 10th Percentile, 90th Percentile are always ints - not doubles.
Why the results are different?

guilhermesena1 · 2022-02-25T20:28:38Z

Hello,

Thank you so much for looking into the differences and my sincere apologies for not reaching out sooner!

We can merge your fix to the main branch if you submit a pull request. The only thing I notice is you are using C notation to cast as double, whereas we use static_cast to encapsulate the casting and not require convoluted parentheses.

In other words the line

y = (double) x;

would be written as

y = static_cast<double>(x);

I can also fix this myself if you submit the PR as-is.

Thank you again for looking into the issue!

… to project style

Shelestova-Anastasia · 2022-02-28T01:47:13Z

Hello, @guilhermesena1. I fixed cast. Thank you!

infotmatician · 2022-10-11T08:16:38Z

Hi, thank you for your effort to fixed this issue @Shelestova-Anastasia @guilhermesena1. I am using Nanopore Sequencing data, and getting the same error in Falco v1.2.1 the per base sequence quality section.
The error is the NaN values of FastQC per base sequence quality section differ from the Falco output. I am not sure but Falco assigns the mean values for all Quartiles columns.
Is that true? Does falco has threshold for total number of reads on those positions?
Best Regards
[
falco_fastqc_data.txt
fastqc_data.txt
](url)

guilhermesena1 · 2022-10-11T10:39:18Z

Hello,

Thanks for reporting the different output.

I need to look a little further into FastQC's code, but intuitively it doesn't make sense to me that any of the quantiles would be NaNs. If there is at least one read of a given size (or within a size range), then the mean, median, and all quantiles would be well-defined. For example, if there is a single read quality, then all values would be identical. I don't see a case where "NaN" would be the desired outcome.

I'll look into it though.

guilhermesena1 · 2022-10-11T10:43:36Z

Yeah it looks like FastQC only considers a position to be "valid" if they have at least 100 reads covering that position. I guess we'll have to emulate that functionality as well?

This is the getPercentile() function they use in the module

	
	private double getPercentile (int minbp, int maxbp, int offset, int percentile) {
		int count = 0;
		double total = 0;
	
		for (int i=minbp-1;i<maxbp;i++) {
			if (qualityCounts[i].getTotalCount() > 100) {
				count++;
				total += qualityCounts[i].getPercentile(offset, percentile);
			}
		}
		
		if (count > 0) {
			return total/count;
		}
		return Double.NaN;
		
	}

infotmatician · 2022-10-11T12:23:30Z

Thank you for the quick response.
I agree with you. We dont need to add. Better idea to show the mean, median, and all quantiles values even if there is a one read. I just did not understand to output of Falco, because all values are the same on some base intervals.

guilhermesena1 · 2022-10-11T14:40:39Z

It may be a bug, but this is also the correct output if there is only one read within a certain group (very possible in ONT data with high variability of read lengths). If there is only one read of size 50,000, then all quantiles of quality would be the quality value of the 50,000th base of that read, right?

infotmatician · 2022-10-11T18:40:23Z

Yeah it should be, that's correct you are right. Thanks for the response, and such a great tool

Shelestova-Anastasia mentioned this issue Feb 25, 2022

per base sequence quality fix: https://github.com/smithlabcode/falco/… #27

Merged

Shelestova-Anastasia pushed a commit to Shelestova-Anastasia/falco that referenced this issue Feb 28, 2022

per base sequence quality fix: smithlabcode#25 - fixed cast to double…

c07b610

… to project style

guilhermesena1 closed this as completed in a774688 Mar 1, 2022

guilhermesena1 reopened this Oct 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data for "Per base sequence quality" section differ from FastQC #25

Data for "Per base sequence quality" section differ from FastQC #25

Shelestova-Anastasia commented Jan 25, 2022

Shelestova-Anastasia commented Feb 24, 2022

guilhermesena1 commented Feb 25, 2022 •

edited

Shelestova-Anastasia commented Feb 28, 2022

infotmatician commented Oct 11, 2022

guilhermesena1 commented Oct 11, 2022

guilhermesena1 commented Oct 11, 2022

infotmatician commented Oct 11, 2022 •

edited

guilhermesena1 commented Oct 11, 2022

infotmatician commented Oct 11, 2022

Data for "Per base sequence quality" section differ from FastQC #25

Data for "Per base sequence quality" section differ from FastQC #25

Comments

Shelestova-Anastasia commented Jan 25, 2022

Shelestova-Anastasia commented Feb 24, 2022

guilhermesena1 commented Feb 25, 2022 • edited

Shelestova-Anastasia commented Feb 28, 2022

infotmatician commented Oct 11, 2022

guilhermesena1 commented Oct 11, 2022

guilhermesena1 commented Oct 11, 2022

infotmatician commented Oct 11, 2022 • edited

guilhermesena1 commented Oct 11, 2022

infotmatician commented Oct 11, 2022

guilhermesena1 commented Feb 25, 2022 •

edited

infotmatician commented Oct 11, 2022 •

edited