Repairing gff #1
Comments
I think I can see what the problem is here. The script that you want to run on the input GFF won't be run by the import container. The lines:
are treated as comments. Where I've used this in examples, it is meant to link notes on any steps that had to be run manually to the configuration file, but there is no code in the container to actually run such commands. If you run your script on the downloaded file (which will be in a subdirectory of If you are still having problems, could you share a small portion of your GFF so I can try to see what else will help? It looks like the changes you are making in your script could probably be handled by options to the I should also note that the import will usually generate quite a few warnings about the sequence names not being valid accessions; these can be ignored. If you could show me any other error messages that you get, that will help with working out how best to get the file imported.
Hi @rjchallis, Thank you for your detailed comment. I would still like some kind of preprocessing mechanism in the import process since:
So I have been testing some experimental code in Firstly I prepared the ini code:
Here, I added a PREPROCESSING section. And then I modified
The function
After cloning those modified files into a docker container, I tested the code from inside the container:
Although it gave 14 WARNINGs, I would say it was successful. As for the GFF itself ( I tried some of the techniques built into easy-import, but I failed to get the result I wanted. I would appreciate any comments you have on the preprocessing.
Thanks for putting in the effort to make preprocessing work! I really like the idea and this looks like a very flexible implementation. I had anticipated including one-liners as comments, but having them run like this is a good step towards greater reproducibility. I'd be happy to take a pull request for this. One thing that would be good to resolve is that as you are referencing a script, the commands in Only 14 warnings is good; presumably these are stop codons or non-ACGT bases that are just one character out in the file versus the database. For the particular problem of missing mRNAs, the line:
in the
Hi @rjchallis, Sorry for the late reply. This is a snippet of the original GFF:
There is no mRNA in the file.
Here, an mRNA with the same coordinates as the gene was inserted, and the CDSs and exons were assigned to it. Finally, the result after setting the condition
in the ini file (
Here, an mRNA was created for each exon, which produces a lot of WARNINGs in the validation process. As for the preprocessing mechanism, I am glad you like it. But before merging the code, we have to consider one thing. Do you have any ideas about how to avoid these problems?
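For readers following the thread, the repair described above (one mRNA per gene, spanning the gene, with the CDSs and exons re-parented to it) could be sketched roughly as follows. This is an illustrative Python sketch only, not the script used in this issue: the function name `add_mrna_features` and the `.mRNA` ID suffix are hypothetical, and it assumes that each gene row precedes its children and that exon/CDS rows point at the gene with `Parent=<gene ID>`.

```python
def add_mrna_features(lines):
    """Yield GFF3 lines with one mRNA inserted after each gene.

    Assumptions (not taken from the original script): gene rows come
    before their children, and exon/CDS rows use Parent=<gene ID>.
    """
    gene_ids = set()
    for line in lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Pass through comments, directives and malformed rows unchanged.
        if line.startswith("#") or len(cols) != 9:
            yield line
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if cols[2] == "gene" and "ID" in attrs:
            gene_ids.add(attrs["ID"])
            yield line
            # The new mRNA copies the gene's coordinates, strand, etc.
            mrna = cols[:8] + [f"ID={attrs['ID']}.mRNA;Parent={attrs['ID']}"]
            mrna[2] = "mRNA"
            yield "\t".join(mrna)
        elif cols[2] in ("exon", "CDS") and attrs.get("Parent") in gene_ids:
            # Re-parent the child feature from the gene to the new mRNA.
            attrs["Parent"] += ".mRNA"
            cols[8] = ";".join(f"{key}={value}" for key, value in attrs.items())
            yield "\t".join(cols)
        else:
            yield line
```

Because the function takes and yields plain lines, it could be applied to an open file handle and the result written to a new file, leaving the downloaded GFF untouched.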
Thanks for the detailed report. Sorry the built-in methods didn't work, but it's good that this can be solved by preprocessing. I don't think the first issue you raised will be too much of a problem in practice. Skipping the prepare step also means that the built-in methods would not be applied, so the preprocessing step is no different here. If the GFF needs fixing, it should generate enough warnings/errors to indicate that something is wrong. As for running the script multiple times, this could be a problem and I'm not sure of the best solution. I like that preprocessing is generic, so it could be used to update sequence file headers as well as/instead of the GFF, but that complicates finding a way to test whether the file has been changed. One idea would be to calculate checksums for each GFF/FASTA infile before preprocessing and write them to disk after preprocessing:
Of course, having just typed the checksum version, I'm realising that this could be simplified to just touching a file to flag that preprocessing has been done.
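For illustration, a checksum guard along these lines might look like the Python sketch below. It is not the code proposed in this thread or anything in easy-import: the function names and the `.preprocessed.sha256` record file are hypothetical, and for simplicity this variant records the checksum of the file as it is after preprocessing (rather than before) and skips the command when the current file still matches that record.

```python
import hashlib
import subprocess
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preprocess_once(infile: Path, command: list[str]) -> bool:
    """Run `command` unless `infile` already matches its stored checksum.

    Returns True if the command was run, False if it was skipped.
    The record file name is a hypothetical convention.
    """
    record = infile.with_name(infile.name + ".preprocessed.sha256")
    if record.exists() and record.read_text().strip() == file_checksum(infile):
        return False  # file is unchanged since the last preprocessing run
    subprocess.run(command, check=True)  # raises if the command fails
    # Store the checksum of the processed file so a rerun can detect it.
    record.write_text(file_checksum(infile) + "\n")
    return True
```

Compared with simply touching a flag file, the stored checksum also detects the case where the infile has been downloaded again and needs to be preprocessed a second time.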
I like the checksum idea.
Since this format clarifies the target file for each preprocess, it is easier for the program to determine which checksum should be checked before executing the command. Or you could represent the target file by
Thanks, reformatting the I prefer your second suggestion using |
OK. I'll think about the implementation and update the pull request.
First of all, thank you for sharing this great project!
However, I'm having a hard time importing a new genome.
I found that my GFF needs to be modified before the import,
so I added some lines to the ini file.
Here, the script for the GFF modification (/import/src/modify_gff3.sh) was as follows:
I checked that the GFF modified by this script was imported correctly.
The script was shared with the container via the docker command option:
But the command output many warnings and the script didn't seem to be run.
Am I doing something wrong?
Any comments would be appreciated.
Thank you.