This repository has been archived by the owner on Dec 8, 2022. It is now read-only.

seq and Kmer are impractical for everyday use #121

Open
7 tasks
inumanag opened this issue Mar 24, 2020 · 1 comment

Comments

@inumanag
Collaborator

inumanag commented Mar 24, 2020

Off the top of my head:

  • When loading a seq via bio.FASTA, comparisons often fail because s'a' != s'A' (and most FASTAs are soft-masked and thus contain loads of lowercase letters). One has to work around this by doing seq = seq(str(seq).upper()).
  • seq = str does not work
  • seq1 + seq1 does not work
  • seq1 + str1 does not work
  • How do you get a k-mer from a sequence? k = Kmer[20](s)?
  • How do you get a sequence from a k-mer? I can get string via str(k), but not a sequence (seq(k) fails).
  • Many slicing operators do not work on seqs and Kmers, which greatly reduces their usability.
@arshajii
Member

This is because a sequence is essentially just a string internally right now. Maybe we should have stricter requirements on what can be included in a sequence (i.e. just IUPAC uppercase characters? -- that would require converting when we read sequence data from disk).
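
To make the "convert on read" idea concrete, here is a rough sketch in plain Python (not Seq, and not how bio.FASTA works today). The function name normalize_sequence and the choice of alphabet are assumptions for illustration: it uppercases soft-masked input and rejects anything outside the uppercase IUPAC nucleotide codes. Gap characters are deliberately left out of this toy alphabet.

```python
# Uppercase IUPAC nucleotide codes (assumed alphabet for this sketch;
# gap characters such as '-' are intentionally not included).
IUPAC_NUCLEOTIDES = frozenset("ACGTURYSWKMBDHVN")


def normalize_sequence(raw: str) -> str:
    """Uppercase a possibly soft-masked sequence and reject non-IUPAC characters.

    Hypothetical helper: it models what a reader could do once per record,
    so that later comparisons never see lowercase letters.
    """
    upper = raw.upper()
    bad = set(upper) - IUPAC_NUCLEOTIDES
    if bad:
        raise ValueError(f"non-IUPAC characters: {sorted(bad)}")
    return upper
```

With something like this applied at load time, a soft-masked "acgt" and an unmasked "ACGT" would compare equal, at the cost of losing the masking information.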

TBH I don't think + and = should be overloaded for seq+str -- they are different types and they should be treated differently IMO. If this is really needed then I think an explicit seq1 + seq(str2) is better -- just my opinion. seq1 + seq2 is something we could support pretty easily.
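
A small plain-Python model of that behaviour, just to show what I mean (the Seq class below is a toy stand-in, not the actual implementation): seq + seq concatenates, seq + str is rejected, and the caller converts explicitly.

```python
class Seq:
    """Toy stand-in for a sequence type; an illustration only."""

    def __init__(self, data):
        self.data = str(data)

    def __add__(self, other):
        # Only seq + seq is supported; anything else is left to Python,
        # which raises TypeError because str defines no __radd__.
        if not isinstance(other, Seq):
            return NotImplemented
        return Seq(self.data + other.data)

    def __eq__(self, other):
        return isinstance(other, Seq) and self.data == other.data

    def __hash__(self):
        return hash(self.data)

    def __str__(self):
        return self.data


seq1 = Seq("ACGT")
str2 = "TTAA"
combined = seq1 + Seq(str2)  # explicit conversion, as suggested above
```

Writing seq1 + str2 directly raises TypeError in this model, which keeps the two types visibly distinct.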

k = Kmer[20](s) is right for that. seq(k) to get a sequence from a k-mer is something we should probably add too.

We can also support more slices on seq. On Kmer it's a lot harder since slices change the type: e.g. k[:3] is of type Kmer[3] and k[:4] is of type Kmer[4] -- not sure what the best way to handle this is. Longer-term I'd prefer to unify k-mer types into a single type and have the compiler deduce and optimize cases where the k-mer length is constant.
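
For illustration, a unified k-mer type could look roughly like the plain-Python toy below, where the length is a runtime field instead of a type parameter, so k[:3] and k[:4] have the same type. The two-bits-per-base packing and the restriction to A, C, G, T are assumptions of this sketch, and it says nothing about how the compiler would specialize constant lengths.

```python
class Kmer:
    """Toy k-mer with a runtime length, packed two bits per base (A, C, G, T only)."""

    _ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    _DECODE = "ACGT"

    def __init__(self, bases: str):
        self.k = len(bases)
        self.bits = 0
        for base in bases:
            # Shift in each base as two bits; the first base ends up most significant.
            self.bits = (self.bits << 2) | self._ENCODE[base]

    def __len__(self):
        return self.k

    def __str__(self):
        return "".join(
            self._DECODE[(self.bits >> (2 * (self.k - 1 - i))) & 3]
            for i in range(self.k)
        )

    def __getitem__(self, index):
        # Slices return another Kmer whose length is only known at run time;
        # a single index returns the base as a one-character string.
        if isinstance(index, slice):
            return Kmer(str(self)[index])
        return str(self)[index]


k = Kmer("ACGTAC")
prefix3 = k[:3]
prefix4 = k[:4]
```

Here prefix3 and prefix4 are both plain Kmer objects of lengths 3 and 4. Slicing goes through a string round trip for simplicity; a real implementation would shift and mask the packed bits directly.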
