Non-identical length distribution #42

Open
schorlton opened this issue Sep 22, 2022 · 5 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@schorlton

Same file. Running falco v1.2.1 from bioconda and MultiQC 1.12. Can reproduce by running on nanopore data from SRA with long read lengths.

MultiQC report of FastQC:
[image]

MultiQC report of falco:
[image]

I believe falco reports a count for every individual read length, while FastQC writes a binned histogram to fastqc_data.txt. Which is better? The granularity and detail are nice, but they can also make the plot hard to read. Should falco reproduce FastQC's behaviour or perform some kind of binning of read lengths? Interested in your thoughts.
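To make the binning option concrete, here is a minimal sketch of collapsing per-length counts into fixed-width ranges. It is an illustration only, not falco's or FastQC's actual code: the function names, the 50-bin cap, and the choice of widths from the 1, 2, 5 × 10^k series are all assumptions made for this example.

```python
from collections import Counter


def pick_bin_width(min_len, max_len, max_bins=50):
    """Return the smallest width in 1, 2, 5, 10, 20, 50, ... that keeps
    the number of aligned bins covering [min_len, max_len] <= max_bins.
    (Hypothetical helper; the cap of 50 is an assumption.)"""
    base = 1
    while True:
        for mult in (1, 2, 5):
            width = base * mult
            n_bins = max_len // width - min_len // width + 1
            if n_bins <= max_bins:
                return width
        base *= 10


def bin_lengths(length_counts, max_bins=50):
    """Collapse {read_length: count} into {(start, end): count}."""
    if not length_counts:
        return {}
    width = pick_bin_width(min(length_counts), max(length_counts), max_bins)
    binned = Counter()
    for length, count in length_counts.items():
        # Align each bin to a multiple of the width.
        start = (length // width) * width
        binned[(start, start + width - 1)] += count
    return dict(sorted(binned.items()))
```

With a bin width of 1 this reduces to the per-length output, so the two behaviours differ only in how the width is chosen.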

andrewdavidsmith added the question label Sep 22, 2022
@andrewdavidsmith
Collaborator

andrewdavidsmith commented Sep 22, 2022

@schorlton I personally am not sure which I think is "better". More info is rarely bad. But I'm definitely interested in your opinion on "better" in a general sense. Input is always appreciated! We will definitely consider any suggested change or enhancement.

@schorlton
Author

Thanks for the quick response! I tend to agree that more info is better. However, this is a somewhat breaking change for use with MultiQC (which I expect many falco users also use). What I would possibly suggest is a PR to MultiQC to smooth the line, or to format this plot as a bar graph (basically a very granular histogram) instead of a line graph. It is hard to tell what it would look like before it is implemented, and it would need to work with both tools, but it seems the trend is more important than the individual sizes. Between 0 and ~7500 bp on the plot above, the line is definitely too thick to be useful. An alternative would be to reproduce FastQC's behaviour, or something closer to it than a bin size of 1 for the read length distribution.
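As a rough illustration of the smoothing idea, the sketch below applies a centered moving average to per-length counts before plotting. It is not MultiQC code: the function name and the default window size are hypothetical, and lengths with no reads are assumed to count as zero.

```python
def smooth_counts(length_counts, window=101):
    """Centered moving average over {read_length: count}.

    Lengths missing from the input are treated as zero counts. Near the
    ends of the range the window is truncated rather than padded.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not length_counts:
        return {}
    lo, hi = min(length_counts), max(length_counts)
    dense = [length_counts.get(n, 0) for n in range(lo, hi + 1)]

    # Prefix sums make each window total an O(1) lookup.
    prefix = [0]
    for count in dense:
        prefix.append(prefix[-1] + count)

    half = window // 2
    smoothed = {}
    for i in range(len(dense)):
        a = max(0, i - half)
        b = min(len(dense), i + half + 1)
        smoothed[lo + i] = (prefix[b] - prefix[a]) / (b - a)
    return smoothed
```

Smoothing keeps one point per length, so the trend becomes readable without changing the data that falco writes.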

@andrewdavidsmith
Collaborator

We'll see how to take a first stab at this and leave this issue open until we can say something on it.

andrewdavidsmith added the enhancement label Sep 22, 2022
@guilhermesena1
Collaborator

guilhermesena1 commented Sep 22, 2022

Hello,

When implementing the sequence length module, I made a somewhat executive decision not to group lengths, because I assumed that, in any long-read dataset (where this module is most often relevant), the number of reads would never generate gigantic bar plots.

That said, this was a bad decision. It's not our call to decide on the behavior of the module, but rather to emulate it faithfully. I'll work on creating base groups for this module. Grouping will be disabled if --nogroup is provided, but I worry a bit that if someone wants to leave lengths ungrouped while grouping, say, sequence content (which generates very large plots), they can't detach one from the other. We might need to add some more falco-specific flags so that grouping can be applied to only certain modules.

If I may ask: I'm very curious about what additional insights MultiQC provides that are not already available in falco's HTML output. One of our goals in making falco was modernizing the FastQC plots, which I believe is similar to what MultiQC provides. In that spirit, the falco HTML plots for sequence lengths are bar plots, like you suggested (and I fully agree).

Is MultiQC advantageous in this case because you can merge QC metrics for multiple datasets? Or create customized tools for additional summary statistics beyond what FastQC provides?

@schorlton
Author

Hi @guilhermesena1, sorry for the very late reply and thank you for your input and work on Falco. I look forward to a solution that facilitates integration with MultiQC. MultiQC aggregates reports from many tools which go well beyond read quality control (see MultiQC Modules). The functionality of MultiQC and HTML reports from FastQC/Falco are hard to compare as they are different in aim and scope. Thanks again!
