half genome length from tutorial-strawberry #84

fancyge · 2021-03-02T21:35:46Z

This may seem a silly question, but I am really confused about why I get half the estimated genome length compared to the tutorial when using the same commands.

So I have followed https://github.com/KamilSJaron/smudgeplot/wiki/tutorial-strawberry to get started with smudgeplot and genomescope. The commands I used:

mkdir -p strawberry_iinumae && cd strawberry_iinumae
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR013/DRR013884/DRR013884_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR013/DRR013884/DRR013884_2.fastq.gz

mkdir tmp 
ls DRR013884_1.fastq.gz DRR013884_2.fastq.gz > FILES 
kmc -k21 -t16 -m64 -ci1 -cs10000 @FILES kmer_counts tmp 

kmc_dump -ci100 -cx3000 kmer_counts kmer_k21.dump 
smudgeplot.py hetkmers -o kmer_pairs < kmer_k21.dump 
smudgeplot.py plot -o f_iinumae -t "Fragaria iinumae" -q 0.99 kmer_pairs_coverages.tsv

## genomescope
kmc_tools transform kmer_counts histogram kmer_k21.hist -cx10000
Rscript genomescope.R -i kmer_k21.hist -k 21 -p 2 -o . -n Fiinumae_genomescope

Smudgeplot gave the same results as shown in the tutorial and estimated the genome as tetraploid. However, genomescope reported a length of 100 Mbp with -p 2, and the heterozygosity rate is much higher (13.8%). Is the number following "len:" the estimated genome size, or should it be multiplied by p? Could you please give me some insight on this? I appreciate your help.


KamilSJaron · 2021-03-03T10:26:20Z

The thing with genomescope is that it all depends on guessing the 1n coverage correctly. The model you posted got it wrong: instead of 146x, it estimated twice as much. As a consequence, the real diploid peak is treated as the haploid peak in the model, and the real tetraploid peak is considered the diploid one. All that leads to an unrealistically high heterozygosity estimate (for a strawberry) and about half of the expected genome size.
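To make the "half of the expected genome size" effect concrete, here is a rough sketch of the arithmetic. It assumes the simplification that the haploid length is roughly the total number of genomic k-mer occurrences divided by the 1n coverage; genomescope itself fits a full model rather than doing a single division. The value of `total_kmers` is made up for illustration and was not taken from the attached histogram.

```shell
# Hypothetical total of genomic k-mer occurrences (illustrative value only).
total_kmers=29200000000

# Compare the expected 1n coverage (146x) with the doubled estimate (292x).
for kcov in 146 292; do
  echo "kcov=${kcov} -> haploid length ~ $(( total_kmers / kcov )) bp"
done

# Keep both estimates in variables to make the factor of two explicit.
len_expected=$(( total_kmers / 146 ))
len_doubled=$(( total_kmers / 292 ))
echo "ratio: $(( len_expected / len_doubled ))"
```

With the same data, doubling the assumed 1n coverage halves the length estimate, which is the pattern reported in this thread.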

The genome size usually means the haploid genome size (i.e. counting each chromosome only once) - this is the value reported by genomescope and also the value you find in genome browsers etc. Flow cytometry and polyploidy folks sometimes also talk about the total genomic content in a cell, and that is ploidy * haploid genome size (what you measure using fc).
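As a small worked example of the distinction between the two quantities, the snippet below multiplies a haploid genome size by the ploidy. Both input values are placeholders: `haploid_len` is a hypothetical length, and `ploidy` simply mirrors the `-p 2` used in the genomescope command in the original post.

```shell
# Hypothetical haploid genome size in bp (what genomescope reports as "len").
haploid_len=200000000
# Ploidy as passed to genomescope via -p in the original post.
ploidy=2

# Total genomic content per cell, the quantity flow cytometry measures.
total_content=$(( ploidy * haploid_len ))
echo "haploid: ${haploid_len} bp, total content: ${total_content} bp"
```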

So, the answer is that the model you show here is wrong, and I am not sure why. It could be just a freak convergence, but I should check whether I can reproduce the tutorial with the latest version of genomescope. Would you mind zipping the k-mer histogram and posting it here? I don't think I currently have the data at hand :-)

fancyge · 2021-03-03T13:55:48Z

Thank you very much for the quick response. I see that the kcov of 287 is consistent with the coverage on the x-axis of the genomescope plot. Is the 1n coverage inferred from the highest peak in the genomescope plot? I used genomescope v2.0.

kmer_k21.hist.zip

fancyge · 2021-03-03T19:03:26Z

Hi Kamil,

I just tried genomescope v1.0 with the same hist file, and this time it gave the same result as in the tutorial. This seems strange, and I hope you can help me figure out whether I missed something in v2. Thank you.

KamilSJaron · 2021-03-13T11:41:54Z

Indeed, the older versions of genomescope certainly converged on the expected coverage (~140x) right away, but the latest version really does not - the default run estimates 1n = 280.

If you specify the coverage prior (-l 140) the model converges as expected (pic attached), but that's not completely satisfying.
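For reference, the rerun with the coverage prior can be sketched as the genomescope command from the original post plus `-l 140`. This assumes `genomescope.R` (v2.0) and `kmer_k21.hist` sit in the working directory; the output name `Fiinumae_genomescope_l140` is a hypothetical choice made here so the earlier results are not overwritten.

```shell
# Coverage prior suggested in this thread (expected 1n coverage is ~140x).
kcov_prior=140

# Run only if both inputs are present; otherwise just report what is missing.
if [ -f genomescope.R ] && [ -f kmer_k21.hist ]; then
  Rscript genomescope.R -i kmer_k21.hist -k 21 -p 2 -l "$kcov_prior" \
    -o . -n Fiinumae_genomescope_l140
else
  echo "genomescope.R or kmer_k21.hist not found; skipping the run" >&2
fi
```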

I am out of my depth @tbenavi1 ... I recalled this tweet, and tried reducing number of rounds - https://twitter.com/t_rhyker/status/1288863398374014979?s=20, but that changes nothing (also the situation is quite different, here there is soooo much coverage)

KamilSJaron added the question label Mar 3, 2021

KamilSJaron added the potential_problems label Apr 9, 2021

KamilSJaron added the smudgeplot_included and genomescope_included labels Apr 13, 2022

KamilSJaron removed the question label Aug 17, 2023

KamilSJaron mentioned this issue Aug 22, 2023

Update wiki #54
