Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repairing gff #1

Open
pachiras opened this issue Jan 30, 2020 · 8 comments
Open

Repairing gff #1

pachiras opened this issue Jan 30, 2020 · 8 comments

Comments

@pachiras
Copy link

First of all, let me say thank you for sharing the great project!

But I'm having a hard time importing a new genome.
I found my gff needs to be modified before the import.
So I added some lines in the ini file.

[GFF]
[FILES]
         SCAFFOLD = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz ]
         GFF = [ gff ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz ]
         ; /import/src/modify_gff3.sh FES_r1.0.genes.gff3 > temp.gff3
         ; mv temp.gff3 FES_r1.0.genes.gff3
         PROTEIN = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz ]

Here, the script for gff modification (/import/src/modify_gff3.sh) was as follows:

#!/usr/bin/env bash

cat $1 | awk -F'\t' 'BEGIN {
    OFS = "\t"
}
{
    if ($3 == "gene") {
        print # print gene

        split($9, GWORDS, ";")
        PARENT = GWORDS[1]
        gsub("ID=", "Parent=", PARENT)
        gsub("gene", "transcript", $9)

        $3 = "mRNA"
        $9 = $9";"PARENT
    } else {
        gsub("gene", "transcript", $9);
    }

    print
}

I checked the modified gff by this script was imported correctly.

The script was shared with the container by the docker command option:

docker run --rm -u $UID:$GROUPS --name easy-import-fagopyrum_esculentum_fes1_core_1_1 --network docker_genomehubs-network -v /data/nas1.1/genomehubs/v1/import/conf:/import/conf  -v /data/nas1.1/genomehubs/v1/import/data:/
import/data -v /data/nas1.1/genomehubs/v1/import/src:/import/src -e DATABASE=fagopyrum_esculentum_fes1_core_1_1 -e FLAGS="-s -p" genomehubs/easy-import:19.05

But the command output many warnings and the script didn't seem to be working.
Am I doing wrong?

Any comment would be appreciated.
Thank you.

@rjchallis
Copy link
Contributor

I think I can see what the problem is here. The script that you want to run on the input GFF won't be run by the import container. The lines:

         ; /import/src/modify_gff3.sh FES_r1.0.genes.gff3 > temp.gff3
         ; mv temp.gff3 FES_r1.0.genes.gff3

are treated as comments. Where I've used this in examples, it is meant to link any notes on steps that had to be run manually to the configuration file, but there is no code in the container to actually run such commands.

If you run your script on the downloaded file (which will be in a subdirectory of /data/nas1.1/genomehubs/v1/import/data) then rerun the import, it should skip the download (as the file already exists locally) and use your modified file. You will also need to be sure to delete any other files with this filename as a prefix (e.g. *.sorted.gff) to make sure it resumes the import process from the new file and not a processed version of the old file.

If you are still having problems could you share a small portion of your GFF and I can try to see what else will help. It looks like the changes you are making in your script could probably be handled by options to the [GFF] configuration, but I realise that this doesn't use the most intuitive syntax.

I should also note that the import will usually generate quite a few warnings about the sequence names not being valid accessions that can be ignored. If you could show me any other error messages that you get then that will help with working out how best to get the file imported.

@pachiras
Copy link
Author

pachiras commented Feb 4, 2020

Hi @rjchallis,

Thank you for your detailed comment.
It was my misunderstanding. I thought the command after the semi colons are executed in the process.

I still want some kind of preprocessing mechanism in the import process since:

  • it will improve the reproducibility of the database creation
  • I don't want to use extra disk space for the modified gff
  • I would easily forget what I had done before ;-)

So I have been testing some experimental code in https://github.com/pachiras/easy-import/tree/preprocessing

Firstly I prepared the ini code:

[DATABASE_CORE]
        NAME = fagopyrum_esculentum_fes1_core_1_1
[META]
        SPECIES.PRODUCTION_NAME = fagopyrum_esculentum_fes1
        SPECIES.SCIENTIFIC_NAME = Fagopyrum esculentum
        SPECIES.COMMON_NAME = buckwheat
        SPECIES.DISPLAY_NAME = Genus species
        SPECIES.DIVISION = EnsemblMetazoa
        SPECIES.URL = Genus_species_asm
        SPECIES.TAXONOMY_ID = 1
        SPECIES.ALIAS = [ ]                   
        ASSEMBLY.NAME = Assembly.name
        ASSEMBLY.DATE = 2018-30-10
        ASSEMBLY.ACCESSION = 
        ASSEMBLY.DEFAULT = Assembl.name
        PROVIDER.NAME = Provider name
        PROVIDER.URL = http://example.com      
        GENEBUILD.ID = 1
        GENEBUILD.START_DATE = 2018-10
        GENEBUILD.VERSION = 1
        GENEBUILD.METHOD = import
[GFF]
[FILES]
        SCAFFOLD = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz ]
        GFF = [ gff ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz ]
        PROTEIN = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz ]
[PREPROCESSING]
        COMMAND1 = /import/src/fagopyrum_esculentum_fes1_core_1_1/modify_gff3.sh FES_r1.0.genes.gff3 > temp.gff3
        COMMAND2 = mv temp.gff3 FES_r1.0.genes.gff3
[GENE_STABLE_IDS]
        GFF = [ gene->ID /(.+)/ /gene:// ]
[TRANSCRIPT_STABLE_IDS]
        GFF = [ SELF->ID /(.+)/ /transcript:// ]
[TRANSLATION_STABLE_IDS]
        GFF = [ SELF->ID /(.+)/ /transcript:// ]
[MODIFY]
        OVERWRITE_DB = 1
        TRUNCATE_SEQUENCE_TABLES = 1
        TRUNCATE_GENE_TABLES = 1
        INVERT_PHASE = 1
        AUTO_PHASE = 1

Here, I added PREPROCESSING section.

And then I modified core/prepare_gff.pl to execute the commands:

## preprocess the download/obtain files using the commands in the ini file
foreach my $subsection (sort keys %{$params->{'PREPROCESSING'}}){
    preprocess_files($params->{'PREPROCESSING'}{$subsection});
}

The function preprocess_files was defined in modules/EasyImport/Core.pm

sub preprocess_files {
    my ($command);
    $command = shift;
    system "$command";
}

Cloning those modified files into a docker container, I tested the code from the inside of container:

eguser@cf5a689d91ee:/import$ perl /ensembl/easy-import/core/prepare_gff.pl /import/conf/default.ini /import/conf/overwrite.ini /import/
conf/fagopyrum_esculentum_fes1_core_1_1.ini
--2020-02-04 00:57:28--  ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz
=> 'FES_r1.0.genes.gff3.gz'
Resolving ftp.kazusa.or.jp (ftp.kazusa.or.jp)... 61.117.144.201
Connecting to ftp.kazusa.or.jp (ftp.kazusa.or.jp)|61.117.144.201|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/buckwheat ... done.
==> SIZE FES_r1.0.genes.gff3.gz ... 25416389
==> PASV ... done.    ==> RETR FES_r1.0.genes.gff3.gz ... done.
Length: 25416389 (24M) (unauthoritative)

FES_r1.0.genes.gff3.gz            100%[============================================================>]  24.24M  7.95MB/s    in 3.0s

2020-02-04 00:57:31 (7.95 MB/s) - 'FES_r1.0.genes.gff3.gz' saved [25416389]

--2020-02-04 00:57:32--  ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz
=> 'FES_r1.0.pep.fa.gz'
Resolving ftp.kazusa.or.jp (ftp.kazusa.or.jp)... 61.117.144.201
Connecting to ftp.kazusa.or.jp (ftp.kazusa.or.jp)|61.117.144.201|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/buckwheat ... done.
==> SIZE FES_r1.0.pep.fa.gz ... 41372522
==> PASV ... done.    ==> RETR FES_r1.0.pep.fa.gz ... done.
Length: 41372522 (39M) (unauthoritative)

FES_r1.0.pep.fa.gz                100%[============================================================>]  39.46M  6.90MB/s    in 5.7s

2020-02-04 00:57:39 (6.88 MB/s) - 'FES_r1.0.pep.fa.gz' saved [41372522]

--2020-02-04 00:57:40--  ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz
=> 'FES_r1.0.genome.fa.gz'
Resolving ftp.kazusa.or.jp (ftp.kazusa.or.jp)... 61.117.144.201
Connecting to ftp.kazusa.or.jp (ftp.kazusa.or.jp)|61.117.144.201|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/buckwheat ... done.
==> SIZE FES_r1.0.genome.fa.gz ... 278644757
==> PASV ... done.    ==> RETR FES_r1.0.genome.fa.gz ... done.
Length: 278644757 (266M) (unauthoritative)

FES_r1.0.genome.fa.gz             100%[============================================================>] 265.74M  5.86MB/s    in 31s

2020-02-04 00:58:12 (8.45 MB/s) - 'FES_r1.0.genome.fa.gz' saved [278644757]

WARNING: Fes_sc0000543.1.g000011.gia.1 cds coordinates (44877, 45821) do not match provided protein
WARNING: Fes_sc0001485.1.g000007.gia.1 cds coordinates (29528, 29922) do not match provided protein
WARNING: Fes_sc0003521.1.g000002.gia.1 cds coordinates (3932, 4261) do not match provided protein
WARNING: Fes_sc0005272.1.g000006.gia.1 cds coordinates (24276, 24325) do not match provided protein
WARNING: Fes_sc0005539.1.g000005.gia.1 cds coordinates (22002, 22410) do not match provided protein
WARNING: Fes_sc0007369.1.g000002.gia.1 cds coordinates (12940, 13258) do not match provided protein
WARNING: Fes_sc0007394.1.g000006.gia.1 cds coordinates (29857, 30006) do not match provided protein
WARNING: Fes_sc0009818.1.g000002.gia.1 cds coordinates (9581, 9925) do not match provided protein
WARNING: Fes_sc0011505.1.g000003.gia.1 cds coordinates (12216, 12641) do not match provided protein
WARNING: Fes_sc0012848.1.g000001.gia.1 cds coordinates (152, 350) do not match provided protein
WARNING: Fes_sc0015416.1.g000003.gia.1 cds coordinates (9064, 9297) do not match provided protein
WARNING: Fes_sc0016234.1.g000003.gia.1 cds coordinates (17194, 17601) do not match provided protein
WARNING: Fes_sc0016393.1.g000001.gia.1 cds coordinates (3544, 4007) do not match provided protein
WARNING: Fes_sc0021518.1.g000002.gia.1 cds coordinates (5722, 5948) do not match provided protein

eguser@cf5a689d91ee:/import$ perl /ensembl/easy-import/core/import_gene_models.pl /import/conf/default.ini /import/conf/overwrite.ini /import/conf/fagopyrum_esculentum_fes1_core_1_1.ini
286768 genes in database
286768 transcripts in database
286768 translations in database
917376 exons in database

Although it gave 14 WARNINGs, I could say it was successful.

As for GFF itself (ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz), the file doesn't contain any mRNAs which I need for our study.
My modification script modify_gff3.sh inserts mRNAs with the same size of its genes
creating proper Parent-Child relations.

I tried some techniques equipped in easy-import, but I failed to get my desired result.

I would appreciate if you could give me any comment about the preprocessing.

@rjchallis
Copy link
Contributor

Thanks for putting int the effort in to make preprocessing! I really like the idea and this looks like a very flexible implementation. I had anticipated including one-liners as comments but having them run like this is a good step towards greater reproducibility. I be happy to take a pull request for this.

One thing that would be good to resolve is that as you are referencing a script, the commands in modify_gff3.sh would be unknown to someone trying to recreate this from the ini file alone. Maybe just having the script as a Github gist with a note of its URL would be a good idea.

Only 14 warnings is good, presumably these are stop codons or non-ACGT bases that are just one character out in the file versus the database.

For the particular problem of missing mRNAs, the line:

_CONDITION_9 = [ EXPECTATION exon     hasParent transcript|mrna|mirna|trna|ncrna|rrna force ]

in the default.ini file should be creating parent transcripts. I'm not able to test this at the moment, but I wonder if you were having problems as any new rule you tried were being applied after this. If so, you should be able to create a _CONDITION_9 in your ini to overwrite this default that has mrna first (or only mrna) then the GFF processing should be able to fill in the missing mRNAs:

_CONDITION_9 = [ EXPECTATION exon     hasParent mrna force ]

@pachiras
Copy link
Author

pachiras commented Feb 12, 2020

Hi @rjchallis,

Sorry, for the late reply.
I am reporting the result of your advice.

This is the snippet of the original gff:

Fes_sc0000001.1	augustus_arabi	gene	560	4505	0.05	+	.	ID=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	560	645	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.1;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	589	645	0.39	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.1;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	727	959	1	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.2;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	727	959	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.2;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	1078	1326	0.72	+	1	ID=CDS:Fes_sc0000001.1.g000001.aua.1.3;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	1078	1326	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.3;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	1414	1558	0.98	+	1	ID=CDS:Fes_sc0000001.1.g000001.aua.1.4;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	1414	1558	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.4;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	1692	1793	1	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.5;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	1692	1793	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.5;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	2046	2138	0.91	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.6;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	2046	2138	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.6;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	2211	2252	0.96	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.7;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	2211	2252	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.7;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	2379	2489	0.8	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.8;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	2379	2489	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.8;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	2605	2676	0.99	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.9;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	2605	2676	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.9;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	3172	3354	1	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.10;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	3172	3354	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.10;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	3856	4212	1	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.11;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	3856	4212	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.11;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	4295	4405	1	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.12;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	4295	4505	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.12;Parent=gene:Fes_sc0000001.1.g000001.aua.1

There is no mRNA in the file.
And this is the output of my modification script (modify_gff3.sh):

Fes_sc0000001.1	augustus_arabi	gene	560	4505	0.05	+	.	ID=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	mRNA	560	4505	0.05	+	.	ID=transcript:Fes_sc0000001.1.g000001.aua.1;Parent=gene:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	560	645	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.1;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	589	645	0.39	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.1;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	727	959	1	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.2;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	727	959	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.2;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	1078	1326	0.72	+	1	ID=CDS:Fes_sc0000001.1.g000001.aua.1.3;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	1078	1326	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.3;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	1414	1558	0.98	+	1	ID=CDS:Fes_sc0000001.1.g000001.aua.1.4;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	1414	1558	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.4;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	1692	1793	1	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.5;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	1692	1793	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.5;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	2046	2138	0.91	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.6;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	2046	2138	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.6;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	2211	2252	0.96	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.7;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	2211	2252	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.7;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	2379	2489	0.8	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.8;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	2379	2489	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.8;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	2605	2676	0.99	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.9;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	2605	2676	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.9;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	3172	3354	1	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.10;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	3172	3354	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.10;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	3856	4212	1	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.11;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	3856	4212	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.11;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	CDS	4295	4405	1	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.12;Parent=transcript:Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	augustus_arabi	exon	4295	4505	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.12;Parent=transcript:Fes_sc0000001.1.g000001.aua.1

Here, a mRNA with the same size of gene was inserted and CDSs and eons belonged to it.

Finally, the result after setting the condition

_CONDITION_9 = [ EXPECTATION exon     hasParent mrna force ]

in the ini file (fagopyrum_esculentum_fes1_core_1_1.ini):

Fes_sc0000001.1	augustus_arabi	gene	560	4505	0.05	+	.	ID=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=Fes_sc0000001.1.g000001.aua.1
Fes_sc0000001.1	GFFTree	mrna	1078	1326	.	+	.	ID=GFFTree_mrna670;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna670;translation_stable_id=GFFTree_mrna670
Fes_sc0000001.1	augustus_arabi	CDS	1078	1326	0.72	+	1	ID=CDS:Fes_sc0000001.1.g000001.aua.1.3;Parent=GFFTree_mrna670
Fes_sc0000001.1	augustus_arabi	exon	1078	1326	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.3;Parent=GFFTree_mrna670
Fes_sc0000001.1	GFFTree	mrna	1414	1558	.	+	.	ID=GFFTree_mrna671;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna671;translation_stable_id=GFFTree_mrna671
Fes_sc0000001.1	augustus_arabi	CDS	1414	1558	0.98	+	1	ID=CDS:Fes_sc0000001.1.g000001.aua.1.4;Parent=GFFTree_mrna671
Fes_sc0000001.1	augustus_arabi	exon	1414	1558	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.4;Parent=GFFTree_mrna671
Fes_sc0000001.1	GFFTree	mrna	1692	1793	.	+	.	ID=GFFTree_mrna672;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna672;translation_stable_id=GFFTree_mrna672
Fes_sc0000001.1	augustus_arabi	CDS	1692	1793	1	+	2	ID=CDS:Fes_sc0000001.1.g000001.aua.1.5;Parent=GFFTree_mrna672
Fes_sc0000001.1	augustus_arabi	exon	1692	1793	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.5;Parent=GFFTree_mrna672
Fes_sc0000001.1	GFFTree	mrna	2046	2138	.	+	.	ID=GFFTree_mrna673;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna673;translation_stable_id=GFFTree_mrna673
Fes_sc0000001.1	augustus_arabi	CDS	2046	2138	0.91	+	2	ID=CDS:Fes_sc0000001.1.g000001.aua.1.6;Parent=GFFTree_mrna673
Fes_sc0000001.1	augustus_arabi	exon	2046	2138	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.6;Parent=GFFTree_mrna673
Fes_sc0000001.1	GFFTree	mrna	2211	2252	.	+	.	ID=GFFTree_mrna674;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna674;translation_stable_id=GFFTree_mrna674
Fes_sc0000001.1	augustus_arabi	CDS	2211	2252	0.96	+	1	ID=CDS:Fes_sc0000001.1.g000001.aua.1.7;Parent=GFFTree_mrna674
Fes_sc0000001.1	augustus_arabi	exon	2211	2252	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.7;Parent=GFFTree_mrna674
Fes_sc0000001.1	GFFTree	mrna	2379	2489	.	+	.	ID=GFFTree_mrna675;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna675;translation_stable_id=GFFTree_mrna675
Fes_sc0000001.1	augustus_arabi	CDS	2379	2489	0.8	+	1	ID=CDS:Fes_sc0000001.1.g000001.aua.1.8;Parent=GFFTree_mrna675
Fes_sc0000001.1	augustus_arabi	exon	2379	2489	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.8;Parent=GFFTree_mrna675
Fes_sc0000001.1	GFFTree	mrna	2605	2676	.	+	.	ID=GFFTree_mrna676;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna676;translation_stable_id=GFFTree_mrna676
Fes_sc0000001.1	augustus_arabi	CDS	2605	2676	0.99	+	1	ID=CDS:Fes_sc0000001.1.g000001.aua.1.9;Parent=GFFTree_mrna676
Fes_sc0000001.1	augustus_arabi	exon	2605	2676	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.9;Parent=GFFTree_mrna676
Fes_sc0000001.1	GFFTree	mrna	3172	3354	.	+	.	ID=GFFTree_mrna677;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna677;translation_stable_id=GFFTree_mrna677
Fes_sc0000001.1	augustus_arabi	CDS	3172	3354	1	+	2	ID=CDS:Fes_sc0000001.1.g000001.aua.1.10;Parent=GFFTree_mrna677
Fes_sc0000001.1	augustus_arabi	exon	3172	3354	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.10;Parent=GFFTree_mrna677
Fes_sc0000001.1	GFFTree	mrna	3856	4212	.	+	.	ID=GFFTree_mrna678;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna678;translation_stable_id=GFFTree_mrna678
Fes_sc0000001.1	augustus_arabi	CDS	3856	4212	1	+	2	ID=CDS:Fes_sc0000001.1.g000001.aua.1.11;Parent=GFFTree_mrna678
Fes_sc0000001.1	augustus_arabi	exon	3856	4212	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.11;Parent=GFFTree_mrna678
Fes_sc0000001.1	GFFTree	mrna	4295	4405	.	+	.	ID=GFFTree_mrna679;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna679;translation_stable_id=GFFTree_mrna679
Fes_sc0000001.1	augustus_arabi	CDS	4296	4405	1	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.12;Parent=GFFTree_mrna679
Fes_sc0000001.1	GFFTree	mrna	589	645	.	+	.	ID=GFFTree_mrna680;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna680;translation_stable_id=GFFTree_mrna680
Fes_sc0000001.1	augustus_arabi	CDS	590	645	0.39	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.1;Parent=GFFTree_mrna680
Fes_sc0000001.1	GFFTree	mrna	727	959	.	+	.	ID=GFFTree_mrna681;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna681;translation_stable_id=GFFTree_mrna681
Fes_sc0000001.1	augustus_arabi	CDS	727	959	1	+	0	ID=CDS:Fes_sc0000001.1.g000001.aua.1.2;Parent=GFFTree_mrna681
Fes_sc0000001.1	augustus_arabi	exon	727	959	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.2;Parent=GFFTree_mrna681
Fes_sc0000001.1	GFFTree	mrna	4295	4505	.	+	.	ID=GFFTree_mrna682;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna682;translation_stable_id=GFFTree_mrna682
Fes_sc0000001.1	augustus_arabi	exon	4295	4505	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.12;Parent=GFFTree_mrna682
Fes_sc0000001.1	GFFTree	mrna	560	645	.	+	.	ID=GFFTree_mrna683;Parent=gene:Fes_sc0000001.1.g000001.aua.1;stable_id=GFFTree_mrna683;translation_stable_id=GFFTree_mrna683
Fes_sc0000001.1	augustus_arabi	exon	560	645	.	+	.	ID=exon:Fes_sc0000001.1.g000001.aua.1.1;Parent=GFFTree_mrna683
###

Here, mRNA was created for each exon, which creates a lot of WARNINGS in the validation process.
Removing _CONDITION_9 and using only default.ini gave, basically, the same result.
So, I will go with my script for a while. Sorry but thank you for your advice.


As for the preprocessing mechanism, I am glad you like it.
I have created a pull request.

But before merging the code, we have to consider one thing.
I have inserted the preprocessing code only in easy-import/core/prepare_gff.pl.
So some user could skip the preparation process and import the data directly.
Or some user could try to execute the script multiple times, which destroy the gff.

Do you have any idea about about avoiding those happen?

@rjchallis
Copy link
Contributor

Thanks for the detailed report, sorry the built in methods didn't work but good that it can be solved by preprocessing.

I don't think the first you raised will be too much of a problem in practice. Skipping the prepare step also means that the built in methods would not be applied so the preprocessing step is no different here. If the GFF needs fixing it should generate enough warnings/errors to indicate something is wrong.

As for running the script multiple times, this could be a problem and I'm not sure of the best solution. I like that preprocessing is generic so it could be used to update sequence file headers as well/instead of gff, but that complicates finding a way to test if the file has been changed. One idea would be to calculate checksums for each GFF/FASTA infile before preprocessing and write them to disk after preprocessing:

  • If no previous checksums exist, run preprocessing.
  • If checksums exist and are different from the current value, warn and don't run preprocessing

Of course having just typed the checksum version I'm realising that this could be simplified to just touch a file to flag that preprocessing has been done.

@pachiras
Copy link
Author

I like the idea of checksum.
In that case, we'd better to change the format of ini, like,

[FILES]
        SCAFFOLD = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz ]
        GFF = [ gff ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz ]
        PROTEIN = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz ]
[PREPROCESSING]
        GFF = [ /import/src/fagopyrum_esculentum_fes1_core_1_1/modify_gff3.sh FES_r1.0.genes.gff3 > temp.gff3; mv temp.gff3 FES_r1.0.genes.gff3 ]
        SCAFFOLD = [ ...(Scaffold modification commands)... ]

Since this format clarifies the target file for each preprocess, it is easier for the program to determine which checksum should be checked before executing the command.

Or you could represent the target file by {}

[FILES]
        SCAFFOLD = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genome.fa.gz ]
        GFF = [ gff ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.genes.gff3.gz ]
        PROTEIN = [ fa ftp://ftp.kazusa.or.jp/pub/buckwheat/FES_r1.0.pep.fa.gz ]
[PREPROCESSING]
        GFF = [ /import/src/fagopyrum_esculentum_fes1_core_1_1/modify_gff3.sh {} > temp.gff3; mv temp.gff3 {} ]
        SCAFFOLD = [ ...(Scaffold modification commands)... ]

@rjchallis
Copy link
Contributor

Thanks, reformatting the ini makes sense to associate files with appropriate checksums.

I prefer your second suggestion using {} to represent the file.

@pachiras
Copy link
Author

OK. I'll think about the implementation and will renew the pull request.
But it could take a few weeks because I'm kind of busy now...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants