Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trans-spliced gene on different sequences #907

Open
sjackman opened this issue Dec 6, 2018 · 3 comments
Open

Trans-spliced gene on different sequences #907

sjackman opened this issue Dec 6, 2018 · 3 comments

Comments

@sjackman
Copy link
Contributor

sjackman commented Dec 6, 2018

Problem description

Trans-spliced genes on the same sequence work as expected. Trans-spliced genes on different sequences give this error:

gt gff3: error: child on line 19 in file "trans.gff" has different sequence id than its parent on line 14 ('02' vs. '01')

A related but different error occurs when multi-features are on separate sequences.

gt gff3: error: the multi-feature with ID "CDS1" on line 10 in file "trans.gff" has a different sequence id than its counterpart on line 9

Exact command line call triggering the problem

gt gff3 trans.gff

Example minimal input triggering the problem

##gff-version 3
##sequence-region   01 1 1645232
##sequence-region   02 1 769924
01	.	gene	389963	390016	.	+	.	ID=gene1
01	.	mRNA	389963	390016	.	+	.	ID=mRNA1;Parent=gene1,gene2
01	.	gene	222566	222865	.	-	.	ID=gene2
01	.	mRNA	222566	222865	.	-	.	ID=mRNA2;Parent=gene1,gene2
###
01	.	gene	389963	390016	.	+	.	ID=gene3
01	.	mRNA	389963	390016	.	+	.	ID=mRNA3;Parent=gene3,gene4
02	.	gene	222566	222865	.	-	.	ID=gene4
02	.	mRNA	222566	222865	.	-	.	ID=mRNA4;Parent=gene3,gene4
##
01	.	CDS	389963	390016	.	+	.	ID=CDS1
02	.	CDS	222566	222865	.	-	.	ID=CDS1

What GenomeTools version are you reporting an issue for (as output by gt -version)?

gt (GenomeTools) 1.5.10

Did you compile GenomeTools from source? If so, please state the make parameters used.

I installed GenomeTools using brew. brew install genometools

What operating system (e.g. Ubuntu, Mac OS X), OS version (e.g. 15.10, 10.11) and platform (e.g. x86_64) are you using?

macOS 10.11.6 on x86_64

@sjackman
Copy link
Contributor Author

sjackman commented Dec 7, 2018

Here's a related issue: #225
and a useful reference: http://sequenceontology.org/so_wiki/index.php/Discontinuous_features

@sjackman
Copy link
Contributor Author

sjackman commented Dec 7, 2018

@gordon wrote… #225 (comment)

I also don't understand how one would translate such a CDS into a protein sequence. Does anyone understand what the spec says about this? My guess is that the NCBI annotation is not standard conform.

NCBI uses the annotation part=1/2 to indicate which order the parts come in. See for example the annotation of the trans-spliced Arabidopsis thaliana chloroplast gene rps12.
https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=nuccore&report=gff3&id=NC_000932.1

##gff-version 3
##sequence-region NC_000932.1 1 154478
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3702
NC_000932.1	RefSeq	gene	69611	69724	.	-	.	Dbxref=GeneID:844801;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ArthCp047;part=1/2
NC_000932.1	RefSeq	gene	139856	140650	.	+	.	Dbxref=GeneID:844801;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ArthCp047;part=2/2
NC_000932.1	RefSeq	CDS	69611	69724	.	-	0	ID=cds-NP_051038.1;Parent=gene-ArthCp047;Dbxref=Genbank:NP_051038.1,GeneID:844801;Name=NP_051038.1;Note=trans-spliced;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=NP_051038.1;transl_table=11
NC_000932.1	RefSeq	CDS	139856	140087	.	+	0	ID=cds-NP_051038.1;Parent=gene-ArthCp047;Dbxref=Genbank:NP_051038.1,GeneID:844801;Name=NP_051038.1;Note=trans-spliced;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=NP_051038.1;transl_table=11
NC_000932.1	RefSeq	CDS	140625	140650	.	+	2	ID=cds-NP_051038.1;Parent=gene-ArthCp047;Dbxref=Genbank:NP_051038.1,GeneID:844801;Name=NP_051038.1;Note=trans-spliced;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=NP_051038.1;transl_table=11

Could Genometools support this part=1/2 attribute for gt extractfeat -type CDS -join?

@sjackman
Copy link
Contributor Author

sjackman commented Dec 7, 2018

For future folk who may stumble on this issue, here's my workaround for now to extract and translate trans-spliced genes that span multiple sequences. I name the sequences Name=gene_part1 Name=gene_part2 and so forth, and then stitch them together.

# Extract DNA sequences of GFF CDS features from a FASTA file
%.gff.cds.fa: %.gff %.fa
	bin/gt-bequeath Name <$< \
	| gsed -E 's/Name=([^;]*)/Target=\1 0 0/' \
	| gt extractfeat -type CDS -join -coords -target -matchdescstart -retainids -seqid -seqfile $*.fa - \
	| gsed -E 's/>(.*target IDs ([^]|]*).*)/>\2_\1/' >$@

# Extract the trans-spliced coding sequences.
%.cds.part.fa: %.cds.fa
	seqmagick convert --pattern-include=_part --sort=name-asc --line-wrap=0 $< $@

# Splice the trans-spliced coding sequences.
%.cds.trans.fa: %.cds.part.fa
	sed -e 's/_part1//g' -e '/_part[2-9]/d' $< >$@

# Translate protein sequences of GFF CDS features from a FASTA file
%.aa.fa: %.fa
	gt seqtranslate -reverse no -fastawidth 0 $< \
	| gsed -n '/ (1+)$$/{s/ (1+)$$//;p;n;p;n;n;n;n;}' >$@

For the gt-bequeath script, see
#616 (comment)
https://gist.github.com/satta/e4edeb36fbd52e9e4875
https://github.com/sjackman/pgmtdna/blob/master/bin/gt-bequeath

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant