Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data for "Per base sequence quality" section differ from FastQC #25

Open
Shelestova-Anastasia opened this issue Jan 25, 2022 · 9 comments

Comments

@Shelestova-Anastasia
Copy link

falco version 0.3.0

Fastqc and falco results are differrent for section "Per base sequence quality"

Per base sequence quality seqtion to compare in results:
pbsq_falco.txt
pbsq_fastqc.txt

Could not attach sample (25.3 Mb). Let me know if I can provide more information.

@Shelestova-Anastasia
Copy link
Author

Now I made small fastq to demonstrate per base quality difference.

Sample: test_quality.fastq.gz

Falco result:
falco_fastqc_data.txt

Fastqc result:
fastqc_data.txt

See >>Per base sequence quality section, 10-14 group.

falco:
10-14 36.6908 38.0 38.0 38.0 36.0 38.0
fastqc:
10-14 36.11448 37.4 36.6 37.4 34.4 37.4

I see Median, Lower, Quartile, Upper Quartile, 10th Percentile, 90th Percentile are always ints - not doubles.
Why the results are different?

@guilhermesena1
Copy link
Collaborator

guilhermesena1 commented Feb 25, 2022

Hello,

Thank you so much for looking into the differences and my sincere apologies for not reaching out sooner!

We can merge your fix to the main branch if you submit a pull request. The only thing I notice is you are using C notation to cast as double, whereas we use static_cast to encapsulate the casting and not require convoluted parentheses.

In other words the line

y = (double) x;

would be written as

y = static_cast<double>(x);

I can also fix this myself if you submit the PR as-is.

Thank you again for looking into the issue!

Shelestova-Anastasia pushed a commit to Shelestova-Anastasia/falco that referenced this issue Feb 28, 2022
@Shelestova-Anastasia
Copy link
Author

Hello, @guilhermesena1. I fixed cast. Thank you!

@infotmatician
Copy link

Hi, thank you for your effort to fixed this issue @Shelestova-Anastasia @guilhermesena1. I am using Nanopore Sequencing data, and getting the same error in Falco v1.2.1 the per base sequence quality section.
The error is the NaN values of FastQC per base sequence quality section differ from the Falco output. I am not sure but Falco assigns the mean values for all Quartiles columns.
Is that true? Does falco has threshold for total number of reads on those positions?
Best Regards
[
falco_fastqc_data.txt
fastqc_data.txt
](url)

@guilhermesena1
Copy link
Collaborator

Hello,

Thanks for reporting the different output.

I need to look a little further into FastQC's code, but intuitively it doesn't make sense to me that any of the quantiles would be NaNs. If there is at least one read of a given size (or within a size range), then the mean, median, and all quantiles would be well-defined. For example, if there is a single read quality, then all values would be identical. I don't see a case where "NaN" would be the desired outcome.

I'll look into it though.

@guilhermesena1
Copy link
Collaborator

Yeah it looks like FastQC only considers a position to be "valid" if they have at least 100 reads covering that position. I guess we'll have to emulate that functionality as well?

This is the getPercentile() function they use in the module

	
	private double getPercentile (int minbp, int maxbp, int offset, int percentile) {
		int count = 0;
		double total = 0;
	
		for (int i=minbp-1;i<maxbp;i++) {
			if (qualityCounts[i].getTotalCount() > 100) {
				count++;
				total += qualityCounts[i].getPercentile(offset, percentile);
			}
		}
		
		if (count > 0) {
			return total/count;
		}
		return Double.NaN;
		
	}

@infotmatician
Copy link

infotmatician commented Oct 11, 2022

Thank you for the quick response.
I agree with you. We dont need to add. Better idea to show the mean, median, and all quantiles values even if there is a one read. I just did not understand to output of Falco, because all values are the same on some base intervals.

@guilhermesena1
Copy link
Collaborator

It may be a bug, but this is also the correct output if there is only one read within a certain group (very possible in ONT data with high variability of read lengths). If there is only one read of size 50,000, then all quantiles of quality would be the quality value of the 50,000th base of that read, right?

@infotmatician
Copy link

Yeah it should be, that's correct you are right. Thanks for the response, and such a great tool

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants