Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

half genome length from tutorial-strawberry #84

Open
fancyge opened this issue Mar 2, 2021 · 4 comments
Open

half genome length from tutorial-strawberry #84

fancyge opened this issue Mar 2, 2021 · 4 comments
Labels
genomescope_included potential_problems for potential problems given different scenarios (How does XXX infulence sumdgeplot?) smudgeplot_included if smudgeplot was posted with the quesiton / problem

Comments

@fancyge
Copy link

fancyge commented Mar 2, 2021

This seems a silly question, but I really confused with why I get a half estimated compared to the tutorial with the same commands.

So I have followed https://github.com/KamilSJaron/smudgeplot/wiki/tutorial-strawberry to get started with smudgeplot and genomescope. The commands I used:

mkdir -p strawberry_iinumae && cd strawberry_iinumae
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR013/DRR013884/DRR013884_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR013/DRR013884/DRR013884_2.fastq.gz

mkdir tmp 
ls DRR013884_1.fastq.gz DRR013884_2.fastq.gz > FILES 
kmc -k21 -t16 -m64 -ci1 -cs10000 @FILES kmer_counts tmp 

kmc_dump -ci100 -cx3000 kmer_counts kmer_k21.dump 
smudgeplot.py hetkmers -o kmer_pairs < kmer_k21.dump 
smudgeplot.py plot -o f_iinumae -t "Fragaria iinumae" -q 0.99 kmer_pairs_coverages.tsv

## genomescope
kmc_tools transform kmer_counts histogram kmer_k21.hist -cx10000
Rscript genomescope.R -i kmer_k21.hist -k 21 -p 2 -o . -n Fiinumae_genomescope

Actually, smudgeplot gave the same results as shown in the tutorial and estimated as tetraploid. However, genomescope showed 100Mbp in length with -p 2 and the heterozygosity rate is much higher (13.8%). Is the number following "len:" the estimated genome size or should multiply with p? Can you please give me insights on this? I appreciate you help.
Fiinumae_genomescope_linear_plot
k21_smudgeplot_log10

@KamilSJaron
Copy link
Owner

The thing with genomescope is that it's all based on guessing right the 1n coverage. The model you posted got it wrong, instead of 146x it estimated twice as much. As a consequence, the real diploid peak is the haploid peak in the model, and real tetraploid peak is consideed the diploid one. All that leads to an unrealistically high heterozygosity estimate (for a strawberry) and about half of the expected genome size.

The genome size usually means the haploid genome size (i.e. counting each chromosome only once) - this is the value reported by genomescope and also the value you find in genome browsers etc. Flow cytometry and polyploidy folks sometimes also talk about the total genomic content in a cell, and that is ploidy * haploid genome size (what you measure using fc).

So, the answer is the model you show here is wrong and I am nor sure why. It could be just a freak convergence, but I should check if i can reconstruct the tutorial with the latest version of genomescope. Would you mind zipping the kmer histogram and posting it here? I don't think I currently have the data by my hand :-)

@fancyge
Copy link
Author

fancyge commented Mar 3, 2021

Thank you very much for the quick response. I see the kcov 287 is consistent with the coverage in the x-axis in genomescope. Is the 1n coverage inferred from the highest peak in the genomescope plot? I used genomescope v2.0.

kmer_k21.hist.zip

@fancyge
Copy link
Author

fancyge commented Mar 3, 2021

Hi Kamil,

I just tried with genomescope v1.0 using the same hist file and this time it gave the same as in the tutorial. This seems strange and I hope you can help me if I missed something in v2. Thank you.
plot

@KamilSJaron
Copy link
Owner

Indeed, the older versions of genomescope certainly converged on the expected coverage (~140x) right away, but the latest version really does not - the default run estimates 1n = 280.

If you specify the coverage prior (-l 140) the model converges as expected (pic attached), but that's not completely satisfying.

I am out of my depth @tbenavi1 ... I recalled this tweet, and tried reducing number of rounds - https://twitter.com/t_rhyker/status/1288863398374014979?s=20, but that changes nothing (also the situation is quite different, here there is soooo much coverage)

Fiinumae_genomescope_linear_plot

@KamilSJaron KamilSJaron added the potential_problems for potential problems given different scenarios (How does XXX infulence sumdgeplot?) label Apr 9, 2021
@KamilSJaron KamilSJaron added smudgeplot_included if smudgeplot was posted with the quesiton / problem genomescope_included labels Apr 13, 2022
@KamilSJaron KamilSJaron mentioned this issue Aug 22, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
genomescope_included potential_problems for potential problems given different scenarios (How does XXX infulence sumdgeplot?) smudgeplot_included if smudgeplot was posted with the quesiton / problem
Projects
None yet
Development

No branches or pull requests

2 participants