Guarantee compatibility between reference package components and rank recommendation #95

Open
7 tasks
cmorganl opened this issue Oct 9, 2022 · 0 comments
Labels
feature request A request for a new feature unlike one that already exists

cmorganl commented Oct 9, 2022

Taxonomic classification relies on a linear model of evolutionary distance (i.e. branch length, or number of substitutions). The distances between query and reference sequences inferred during phylogenetic placement are influenced by the underlying reference alignment, and therefore by the MSA trimming process. This causes a conflict when, for example, a model trained on a BMGE-trimmed MSA is used to correct classifications derived from a ClipKIT-trimmed MSA.

Potential Solutions

  1. Every time treesapp assign is executed, its parameters are compared to those that were used to create the reference package. If any differences could influence the phylogeny, the reference package is automatically re-trained. The MSA-trimming software name, mode and parameters would need to be stored. Writing a parser to extract these attributes for each trimming tool would be inconvenient, and potentially unstable across versions.
  2. Use relative evolutionary distance (RED) to dynamically set taxonomic rank boundaries, which would make the linear model obsolete. Even this route, however, would require repeating phylogenetic inference of the reference phylogeny so that the MSA is the same.
  3. Remove the option of trimming the MSA during phylogenetic placement and allow it only during treesapp create/update. The raw reference leaf sequences would need to be stored in the refpkg so that treesapp update and treesapp train can access them.
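
The comparison described in solution 1 could look roughly like the sketch below. It is only an illustration: `TrimSettings`, `normalise` and `needs_retraining` are hypothetical names, not part of TreeSAPP, and it assumes the trimming software name, mode and parameters have already been captured as plain strings (the parsing step that the issue calls inconvenient is not shown).

```python
from typing import NamedTuple, Tuple


class TrimSettings(NamedTuple):
    """Hypothetical record of how an MSA was trimmed (not a TreeSAPP class)."""
    software: str                                  # trimming tool name
    mode: str = ""                                 # tool-specific mode, if any
    parameters: Tuple[Tuple[str, str], ...] = ()   # (flag, value) pairs


def normalise(settings: TrimSettings) -> TrimSettings:
    """Make the comparison insensitive to letter case and parameter order."""
    return TrimSettings(
        software=settings.software.strip().lower(),
        mode=settings.mode.strip().lower(),
        parameters=tuple(sorted(settings.parameters)),
    )


def needs_retraining(refpkg_trim: TrimSettings, assign_trim: TrimSettings) -> bool:
    """Return True when the trimming requested at assign time differs from
    the trimming recorded when the reference package was built."""
    return normalise(refpkg_trim) != normalise(assign_trim)


# Illustrative values only.
built_with = TrimSettings("BMGE", parameters=(("-m", "BLOSUM30"),))
requested = TrimSettings("ClipKIT", mode="smart-gap")
if needs_retraining(built_with, requested):
    print("Trimming differs from the reference package: re-train the model")
```

Comparing normalised tuples keeps the check cheap, but it treats any difference as significant. Deciding which differences actually influence the phylogeny would need tool-specific knowledge.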

Acceptance criteria

  • Reference package includes a namedtuple that stores trimming parameters
  • --trim_align and related arguments are removed from all subcommands except create and update
  • Linear model used for rank recommendation stores MSA and phylogeny dimensions to ensure compatibility
  • Store the raw, dereplicated (at 99% identity) amino acid and, if available, nucleotide sequence records that were input to treesapp create, including candidate sequences that were not used. These include records that passed the taxonomic screen and filter, and the length thresholds.
    • Sequence and sequence name (i.e. FASTA attributes)
    • Genome, chromosome, contig, and/or ORF position of sequence
  • Compress reference package to decrease space required for additional sequences
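
As a rough sketch of how several of these criteria might fit together, the example below stores trimming parameters and model dimensions as namedtuples, keeps raw sequence records alongside them, and compresses the result. All names here (`TrimParams`, `ModelDims`, `pack_refpkg`, `unpack_refpkg`, `model_is_compatible`) are hypothetical. The choice of gzip-compressed JSON and the interpretation of "dimensions" as MSA rows, MSA columns and tree leaves are assumptions, not the format TreeSAPP uses.

```python
import gzip
import json
from collections import namedtuple

# Hypothetical containers; field names are assumptions for illustration.
TrimParams = namedtuple("TrimParams", ["software", "mode", "arguments"])
ModelDims = namedtuple("ModelDims", ["msa_rows", "msa_columns", "tree_leaves"])


def pack_refpkg(trim, dims, raw_sequences):
    """Serialise the components to gzip-compressed JSON bytes."""
    payload = {
        "trim": trim._asdict(),
        "model_dims": dims._asdict(),
        "raw_sequences": raw_sequences,
    }
    return gzip.compress(json.dumps(payload, sort_keys=True).encode("utf-8"))


def unpack_refpkg(blob):
    """Inverse of pack_refpkg: rebuild the namedtuples and sequence records."""
    payload = json.loads(gzip.decompress(blob).decode("utf-8"))
    return (
        TrimParams(**payload["trim"]),
        ModelDims(**payload["model_dims"]),
        payload["raw_sequences"],
    )


def model_is_compatible(stored, msa_rows, msa_columns, tree_leaves):
    """True when the model was trained on an MSA and tree of the same shape."""
    return stored == ModelDims(msa_rows, msa_columns, tree_leaves)


# Illustrative values only.
trim = TrimParams(software="BMGE", mode="", arguments=["-m", "BLOSUM30"])
dims = ModelDims(msa_rows=120, msa_columns=310, tree_leaves=120)
records = [
    {"name": "seq_1", "sequence": "MKV", "contig": "contig_7", "start": 10, "end": 19},
]

blob = pack_refpkg(trim, dims, records)
restored_trim, restored_dims, restored_records = unpack_refpkg(blob)
print(model_is_compatible(restored_dims, 120, 310, 120))
```

Matching dimensions are only a necessary condition for compatibility: two differently trimmed alignments could share the same shape, so the stored trimming parameters would still need to be compared as well.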
@cmorganl cmorganl added the feature request A request for a new feature unlike one that already exists label Oct 9, 2022
@cmorganl cmorganl changed the title Retrain reference package based on build parameters Guarantee compatibility between reference package components and rank recommendation Oct 10, 2022
@cmorganl cmorganl added this to To do in v0.12.0 via automation Oct 10, 2022
@cmorganl cmorganl self-assigned this Oct 10, 2022
@cmorganl cmorganl moved this from To do to In progress in v0.12.0 Oct 10, 2022