[FEATURE REQUEST] Preserve prodigal metadata for anvi-export-gene-calls #2181

Open
Ge0rges opened this issue Nov 28, 2023 · 6 comments

@Ge0rges
Collaborator

Ge0rges commented Nov 28, 2023

The need

Identifying things like the RBS motif, start codon, etc. can come in handy with gene-aware analyses. Such analyses may become more frequent in the future, especially given the current effort "to soon implement a general framework for storing the coordinates of any type of genomic feature" (#2152).

The solution

From @ivagljiva on discord:

to modify what we store in the [contigs] DB, and then add a flag to anvi-export-gene-calls (ie, in the genes_in_contigs table) to include that information in the output.

Perhaps this should be relegated to the effort mentioned by @semiller10 in #2152, but I thought it pertinent to bring it up.

Beneficiaries

Anyone doing gene-aware analyses.

@meren
Member

meren commented Nov 28, 2023

Just a quick note as I'm passing through this: when we do this, we need to think of a way that doesn't require a specific design that locks us in with Prodigal for gene calling. What we need is a generic design that can keep track of additional features for genes (or genomic regions, or nucleotides, or codons) and that can also be populated from Prodigal output.
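
As a rough illustration of what a caller-agnostic design could look like, here is a minimal sketch. The `FeatureAnnotation` record and the `annotations_from_attributes` helper are hypothetical names invented for this example and are not part of anvi'o; the idea is only that each extra value is stored as a (target, source, key, value) record, so Prodigal becomes one possible source among others.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureAnnotation:
    """One extra piece of information about a gene, region, nucleotide, or codon (hypothetical)."""
    target_type: str  # e.g. 'gene', 'region', 'nucleotide', 'codon'
    target_id: str    # identifier of the annotated entity, e.g. a gene caller id
    source: str       # tool that produced the value, e.g. 'prodigal'
    key: str
    value: str


def annotations_from_attributes(target_type, target_id, source, attributes):
    """Turn one tool's key/value attributes into source-agnostic records."""
    return [FeatureAnnotation(target_type, str(target_id), source, key, str(value))
            for key, value in attributes.items()]


# Prodigal is just one possible source; another gene caller could fill the same records.
records = annotations_from_attributes(
    'gene', 7, 'prodigal',
    {'start_type': 'ATG', 'rbs_motif': 'GGAG/GAGG', 'rbs_spacer': '5-10bp'})

for record in records:
    print(record)
```

Because nothing in the record type names a Prodigal-specific field, swapping in a different gene caller would only change the `source` value and the keys it chooses to report.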

Ge0rges changed the title from "[FEATURE REQUEST] Preserve prodigal metadata for anti-export-gene-calls" to "[FEATURE REQUEST] Preserve prodigal metadata for anvi-export-gene-calls" on Jan 17, 2024
@Ge0rges
Collaborator Author

Ge0rges commented Feb 21, 2024

Because I understand this exists in the context of broader changes that need to be made (and that are beyond my current mastery of the anvi'o codebase), here is a temporary pseudo-solution for anyone who ends up here.

I wrote this script, which essentially piggybacks on anvi'o's Prodigal caller and output parser:

import sys
sys.path.insert(1, '/path/to/anvio/anvio/drivers')
from prodigal import Prodigal

class ExtendedProdigal(Prodigal):
    def __init__(self):
        super().__init__()

        self.available_parsers = {'v2.6.3': self.extended_parser,
                                  'v2.6.2': self.extended_parser,
                                  'v2.6.0': self.extended_parser}
        
        super().check_version()

    def extended_parser(self, defline):
        """Parses this defline, but keeps more information than the default anvi'o parser:

            204_10M_MERGED.PERFECT.gz.keep_contig_1720_7 # 7086 # 7262 # 1 # ID=3_7;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.294

        """
    
        hit = self._Prodigal__parser_1(defline)
        
        fields = defline.split()

        additional_attributes = fields[8].split(';')

        hit['start_codon'] = additional_attributes[2].split('=')[1]
        hit['rbs_motif'] = additional_attributes[3].split('=')[1]
        hit['rbs_spacer'] = additional_attributes[4].split('=')[1]
        hit['gc_cont'] = additional_attributes[5].split('=')[1]

        return hit

# usage: python script_name.py input_fasta_path output_folder_for_anvio thread_number
prodigal = ExtendedProdigal()
prodigal.num_threads = int(sys.argv[3])
gene_calls, amino_acids = prodigal.process(sys.argv[1], sys.argv[2])

This won't update your contigs database or otherwise modify any anvi'o functionality. However, if you call it with a FASTA file as input, it returns the same dict anvi'o generates, plus the additional Prodigal fields chosen here (e.g. 'gc_cont'). Example run command: python prodigal_caller.py mags/Pelagibacter_r-contigs.fna output_folder 40, where the format is python script_name.py input_fasta_path output_folder_for_anvio thread_number.
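
For readers who only want to see how the extra fields sit in a Prodigal defline, the following standalone sketch parses the example header from the docstring above without importing anvi'o. It is not the anvi'o parser: the function name `parse_prodigal_defline` and the layout of the returned dict are assumptions made for this example. It looks attributes up by key rather than by position, so it keeps Prodigal's own names (e.g. `start_type` instead of `start_codon`).

```python
def parse_prodigal_defline(defline):
    """Split a Prodigal defline into coordinates plus its key=value attributes."""
    # Prodigal separates header fields with ' # '; the last field holds
    # semicolon-delimited key=value pairs. rsplit keeps the sequence name
    # intact even if it happens to contain ' # ' itself.
    name, start, stop, strand, attributes = defline.lstrip('>').rsplit(' # ', 4)

    hit = {'name': name.strip(),
           'start': int(start),
           'stop': int(stop),
           'strand': int(strand)}

    # Keep every attribute under the key Prodigal gave it.
    for pair in attributes.strip().strip(';').split(';'):
        key, _, value = pair.partition('=')
        hit[key] = value

    return hit


defline = ("204_10M_MERGED.PERFECT.gz.keep_contig_1720_7 # 7086 # 7262 # 1 # "
           "ID=3_7;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.294")

hit = parse_prodigal_defline(defline)
print(hit['start_type'], hit['rbs_motif'], hit['rbs_spacer'], hit['gc_cont'])
```

Looking fields up by key means the parse does not depend on the order in which the attributes appear in the header.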

@Ge0rges
Collaborator Author

Ge0rges commented Feb 21, 2024

@meren sidenote: while working on this I noticed an important typo here. I believe that line should read v2.6.0. I didn't want to make a pull request just for that one typo though (but I can if you'd like!).

meren added a commit that referenced this issue Feb 23, 2024
@meren
Member

meren commented Feb 23, 2024

Thanks for letting me know about the typo, @Ge0rges. I'm not sure how it survived this long. I guess because no one is using Prodigal v2.6.0 anymore.

Your temporary workaround is masterful and beautiful. Regarding the original feature request: this has been a difficult one to address because it requires a change in the way we keep gene calls in our relevant table with the addition of a few new columns, which will likely add millions of additional data points to that table, increasing the contigs-db size by a lot while only being relevant to a fraction of the users.

A better solution would be to extend that table only if anvi-gen-contigs-db receives a specific flag (e.g., --store-extended-gene-call-data) to store a larger amount of information regarding gene calls, rather than doing it every time anvi-gen-contigs-db is run. But since this table will already go through a change soon with the eukaryotic gene calls, I've been sitting on my hands :)
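
To make the opt-in idea concrete, here is a minimal sketch using Python's `sqlite3` and `argparse`. It is not how anvi-gen-contigs-db works: the table names `gene_calls_demo` and `gene_calls_extended_demo`, their columns, and the `store_gene_calls` function are all hypothetical, and the flag name is simply the one suggested above. The only point is that the extra table is created when the flag is set, so databases built without it do not grow.

```python
import argparse
import sqlite3


def store_gene_calls(db, gene_calls, store_extended=False):
    """Always write the core gene calls; write extra attributes only on request."""
    db.execute("CREATE TABLE gene_calls_demo "
               "(gene_id INTEGER PRIMARY KEY, contig TEXT, start INTEGER, stop INTEGER)")
    db.executemany("INSERT INTO gene_calls_demo VALUES (?, ?, ?, ?)",
                   [(g['gene_id'], g['contig'], g['start'], g['stop']) for g in gene_calls])

    if store_extended:
        # Hypothetical optional table: one row per (gene, key, value).
        db.execute("CREATE TABLE gene_calls_extended_demo "
                   "(gene_id INTEGER, key TEXT, value TEXT)")
        db.executemany("INSERT INTO gene_calls_extended_demo VALUES (?, ?, ?)",
                       [(g['gene_id'], key, value)
                        for g in gene_calls
                        for key, value in g.get('extended', {}).items()])
    db.commit()


def table_names(db):
    """Return the names of all tables in the database."""
    return {row[0] for row in db.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}


parser = argparse.ArgumentParser()
parser.add_argument('--store-extended-gene-call-data', action='store_true')
# An explicit argument list keeps the example runnable outside a real command line.
args = parser.parse_args(['--store-extended-gene-call-data'])

gene_calls = [{'gene_id': 0, 'contig': 'contig_1720', 'start': 7086, 'stop': 7262,
               'extended': {'start_type': 'ATG', 'rbs_motif': 'GGAG/GAGG'}}]

db = sqlite3.connect(':memory:')
store_gene_calls(db, gene_calls, store_extended=args.store_extended_gene_call_data)
print(sorted(table_names(db)))
```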

@Ge0rges
Collaborator Author

Ge0rges commented Feb 24, 2024

That makes sense. I was wondering whether the revamp mentioned in #2152 is expected to affect the contigs-db, or to involve the creation of a new type of artifact centered around genomes/MAGs? If the latter is the case, this feature could be relegated to that artifact rather than expanding the contigs-db.

@meren
Member

meren commented Feb 26, 2024

I think it will have to be new, optional tables in contigs-db. We already have the code to mark nucleotide, codon/amino acid positions in contig sequences in contigs-db files, but they are not used outside of anvi'o structure currently. We will have to make them more accessible to mainstream programs :)

The best way to get these things done is to have a project in the lab that needs this solution to be in place. That's why there is a delay currently :(
