Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sg.config configuration (The homology of the de novo assembled genome is not known) #4

Open
hushaoqiang opened this issue Apr 28, 2022 · 24 comments

Comments

@hushaoqiang
Copy link

Hello, thank you very much for your SubPhaser software, which is very useful for genotyping. I used HIFI and HIC to assemble the genome of Fragaria x ananassa but I do not know the homology of the 28 scaffolds. Should I put all the chromosomes in the sg.config file in one line?
I am looking forward to your reply.

@zhangrengang
Copy link
Owner

No. It may be very easy to know the homology by aligning your chromosomes to the genome of Fragaria vesca using tools such as minimap2 etc. You may be able to get a figure like below:

fan-fve

So the sg.config configuration can be set like:

1-1 1-2 1-3 1-4
2-1 2-2 2-3 2-4
...

@zhangrengang
Copy link
Owner

zhangrengang commented Apr 28, 2022

Another way for you, the HiC contact heatmap can also show diagonal signals between homeologous chromosomes like:
hic-heatmap
So the sg.config configuration can be set like:

chr01a chr01b chr01c chr01d
chr02a chr02b chr02c chr02d
...

Note that the HiC heatmap should not be filtered by mapping qualtiy of reads.

@hushaoqiang
Copy link
Author

谢谢老师您的解答,我将使用您所推荐的方式来做同源分析。另外还有个问题请教一下老师您,为了说详细点,我就是用中文了。
老师,使用Camarosa草莓的基因组序列想要试运行一下,基因组文件是发表的序列。配置文件内容:
Fvb1-1 Fvb1-2 Fvb1-3 Fvb1-4
Fvb2-1 Fvb2-2 Fvb2-3 Fvb2-4
Fvb3-1 Fvb3-2 Fvb3-3 Fvb3-4
Fvb4-1 Fvb4-2 Fvb4-3 Fvb4-4
Fvb5-1 Fvb5-2 Fvb5-3 Fvb5-4
Fvb6-1 Fvb6-2 Fvb6-3 Fvb6-4
Fvb7-1 Fvb7-2 Fvb7-3 Fvb7-4

运行命令是:subphaser -i F_ana_Camarosa_6-28-17_hardmasked.fasta -c sg.config
但是会出现:
22-04-28 18:21:39 [INFO] Consistent with subgenome assignment: 16 (6.93%); potential exchange: 149 (64.50%)
22-04-28 18:21:39 [INFO] Output: /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.bin.enrich
22-04-28 18:21:39 [INFO] ###Step: LTR
22-04-28 18:21:39 [INFO] Identifying LTR-RTs by ['ltr_harvest']
22-04-28 18:21:45 [INFO] Check point file: /home/hsq/hhh/subphaser/kmls/tmp/LTR.scn.ok exists; skip this step
22-04-28 18:21:45 [INFO] 8186 LTRs identified
22-04-28 18:21:45 [INFO] Check point file: /home/hsq/hhh/subphaser/kmls/tmp/LTR.tesort.ok exists; skip this step
22-04-28 18:21:45 [INFO] By TEsorter, 190 (2.3%) are classified as LTRs, of which 0 (0.0%) are intact with complete protein domains
22-04-28 18:21:45 [INFO] After filtering, 186 / 8186 (2.3%) LTRs retained
22-04-28 18:21:45 [INFO] Outputing coordinate - LTR maps to /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.ltr.bin.count
22-04-28 18:21:45 [INFO] Start Pool with 32 process(es)
22-04-28 18:21:45 [INFO] Processed 0 sequences
Traceback (most recent call last):
File "/home/hsq/miniconda3/envs/SubPhaser/bin/subphaser", line 33, in
sys.exit(load_entry_point('subphaser==1.2.5', 'console_scripts', 'subphaser')())
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 779, in main
pipeline.run()
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 516, in run
ltr_bedlines, enrich_ltr_bedlines = self.step_ltr(d_kmers) if not self.disable_ltr else ([],[])
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 563, in step_l tr
Seqs.map_kmer3([ltrfile], d_kmers, fout=fout, k=self.k, ncpu=self.ncpu,
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/Seqs.py", line 97, in map_kmer3
logger.info('{} ({:.2%}) sequences contain subgenome-specific kmers'.format(mapped_seqs, mapped_seqs/i))
ZeroDivisionError: division by zero

k15_q200_f2.kmer_pca.pdf是成功生成的。生成的pca图与老师您附表中的图不一样。
请问一下老师我是不是需要调整参数或者其他设置有什么问题。
期待老师您的回复

@zhangrengang
Copy link
Owner

Do not use hardmasked genome sequences, because subphaser is based on repeated sequences (TEs in most cases) in fact.

@hushaoqiang
Copy link
Author

老师,请问您对开麦罗莎草莓做分析时(Fig. S27 Subgenome phasing with SubPhaser in the Fragaria x ananassa genome.),使用的文件是哪一个啊。

@zhangrengang
Copy link
Owner

F_ana_Camarosa_6-28-17.fasta.gz. But you may not get the identical results but similar results.

@hushaoqiang
Copy link
Author

老师,我刚刚又去GDR网站看了一下,我只看到了 F_ana_Camarosa_6-28-17_hardmasked.fasta.gz跟[F_ana_Camarosa_6-28-17_unmasked.fasta.gz两个文件,请问一下您告诉我的F_ana_Camarosa_6-28-17.fasta.gz文件这个在哪个链接下载啊,麻烦老师您了。

@zhangrengang
Copy link
Owner

F_ana_Camarosa_6-28-17_unmasked.fasta.gz

@hushaoqiang
Copy link
Author

老师,已成功运行,4套亚基因组成功分型。图非常好看,感谢老师您的耐心指导与为科研工作者所开发的SubPhaser软件。

@hushaoqiang
Copy link
Author

Teacher, I have successfully found homologous chromosomes using MUMmer software, and then performed genotyping using SubPhaser. The results are very good.
Thank you!

@zhangrengang
Copy link
Owner

Great. Thanks for your feedback.

@kashiff007
Copy link

kashiff007 commented Jun 5, 2022

Hi @zhangrengang, Thanks for this excellent software. I have been using this for my newly assembled allopolyploidy genome (2n = 4x = 40). The genome has two subgenomes, but I don't know the homologous pairs. So as you suggested in previous comments, I used mummer and minimapper2 to find the homologous pairs and assigned them in the configurations file.

But after running the SubPhaser with default parameters I got 19 chr in SG1 and only one chr in SG2. Idially, it should be 10 chr in each subgenome.
image
The differential k-mer heatmap also shows the similar result:
image

Can you suggest why it happend? Am I approaching rightly?

@zhangrengang
Copy link
Owner

@kashiff007 It seems that your genome is contig-level, but not chromosome-level?

@kashiff007
Copy link

kashiff007 commented Jun 5, 2022

Yes, It is a newly assembled genome and from cytometry, we speculate that the largest 20 contigs are full genome length.

@zhangrengang
Copy link
Owner

@kashiff007 Can you show the homoelogous relationship such as dot plot, and your config file? How much percent do the largest 20 contigs account for the whole genome?

@kashiff007
Copy link

kashiff007 commented Jun 5, 2022

image

here y-axis is my new genome and when we align with another closely related species Oropetium (x-asix), id gives the following plot. You can see each of the x-axis has two aligned positions in y-axis, which tells that these are homologous.

Based on this I made a config file which looks like:
ptg000016l ptg000003l
ptg000013l ptg000004l
ptg000022l ptg000015l
ptg000011l ptg000007l
ptg000012l ptg000006l
ptg000019l ptg000008l
ptg000017l ptg000018l
ptg000002l ptg000010l
ptg000001l ptg000005l
ptg000020l ptg000014l
Each row has two contigs (assuming homologous chromosomes) separated by space.

Total genoem size is 587 mb (587146212) and top 20 contigs are abount 573mb (573812465). or 97.72% of the total genome.

@zhangrengang
Copy link
Owner

@kashiff007 It seems no problem. Do you have evidence for allo?

@kashiff007
Copy link

I don't have evidence for allo. My primary job was to separate the subgenemes so I followed this thread. The circos looks like this:
circos

@kashiff007
Copy link

is it something to do with clustering?

@zhangrengang
Copy link
Owner

@kashiff007 If there are no much assembly errors (swtich errors, etc,), your genome is likely to be an auto-polyploid or be heavily recombined, which can not by phased by subphaser. You can try to change the parameters, such as setting -q 100 or reducing the number of chromosome sets like:

ptg000016l ptg000003l
ptg000013l ptg000004l
ptg000022l ptg000015l
ptg000011l ptg000007l
ptg000012l 
ptg000006l
ptg000019l 
...
ptg000020l 
ptg000014l

But these may also not work.

@kashiff007
Copy link

OK, I appreciate the explanation.

@kedduck
Copy link

kedduck commented Dec 1, 2023

No. It may be very easy to know the homology by aligning your chromosomes to the genome of Fragaria vesca using tools such as minimap2 etc. You may be able to get a figure like below:

fan-fve

So the sg.config configuration can be set like:

1-1 1-2 1-3 1-4
2-1 2-2 2-3 2-4
...

Hi, @zhangrengang

I'm soryy to disturb you. The plot is very fancy. Is it plotted by "dosPlotly" ?

Thanks!

@zhangrengang
Copy link
Owner

@kedduck No, it is from the paper: Edger P P, Poorten T J, VanBuren R et. al. Origin and evolution of the octoploid strawberry genome [J]. Nature Genet., 2019, 51 (3): 541–547 [http://doi.org/10.1038/s41588-019-0356-4]

@kedduck
Copy link

kedduck commented Dec 1, 2023

Thanks for your quick reply!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants