Reduces memory usage and improves speed #48

courcelm · 2018-05-29T16:48:59Z

The current slicing/join concatenates the whole chromosome sequence even when only a slice is required. That required 2-3 GB of memory in some cases, and it was slow in my use case:

Before

I suggest reverting to the original pyGeno code that was changed by @ericloud.

After

@ericloud
Is this a problem for the bug you reported here:
2fd3fbe#diff-9a0352f44c9b0e9b00e4e2df44eae54b

Current slicing was concatenating the whole chromosome sequence when only a slice was required. That required 2-3 GB of memory in some cases and it was slow.
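As a rough illustration of the difference described above, here is a toy sketch. It is not pyGeno code: the list-of-strings layout and both function names are assumptions made up for this example.

```python
# Toy model (hypothetical, not pyGeno's internals):
# the chromosome is a list holding one string per position.

def join_then_slice(data, start, end):
    # Builds a string as long as the whole chromosome, then throws most of it away.
    return "".join(data)[start:end]

def slice_then_join(data, start, end):
    # Only the requested positions are ever concatenated.
    return "".join(data[start:end])

data = list("ACGTACGTAC")
same = join_then_slice(data, 2, 6) == slice_then_join(data, 2, 6)
print(same)
```

The two functions agree here only because every position holds exactly one character. Once a position can hold an empty string (deletion) or several characters (insertion), the two orders of operation can return different results.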

ericloud · 2018-05-29T21:26:57Z

If I remember correctly, I changed this part of the code to return the correct sequence when insertions, deletions, or indels are present.

example:
_ _ _ _ D _ _ _ _ _ S _ _ _ _ _ _ _ _ _ _
_: 1 unchanged nucleotide
D: a deletion of 1 nucleotide
S: a SNP on 1 nucleotide
We want to return X nucleotides on the left and on the right of the SNP. This case is very common when you load dbSNP.

If we apply the slice to the data (old version), the function returns only (X-1) nucleotides on the left, because one of them is deleted.
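The off-by-one can be reproduced with a toy model of the example above. This is only a sketch: the per-position list with `""` standing for a deleted nucleotide is an assumed representation, not pyGeno's actual data structure.

```python
# Toy model (hypothetical): one string per reference position,
# "" marks a deleted nucleotide.
reference = "ACGTACGTACGTACGTACGTA"  # 21 positions, as in the diagram
data = list(reference)
data[4] = ""     # D: deletion of 1 nucleotide
data[10] = "C"   # S: SNP replacing the reference nucleotide

snp = 10
X = 6

# Old behaviour: slice the per-position data, then join.
left = "".join(data[snp - X:snp])
right = "".join(data[snp + 1:snp + 1 + X])

print(len(left), len(right))
```

Because the deletion falls inside the left window, `left` comes back one nucleotide short, while `right` has the full X nucleotides.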

But as you mentioned, applying the slice to the sequence (new version, 3bb0518 and 2fd3fbe) requires loading the whole chromosome, which increases memory usage and time.

There is certainly a more efficient way to do it, but going back to the old version just moves the problem somewhere else.
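One possible direction, sketched on a toy model only: walk outward from the SNP and stop as soon as X nucleotides have been collected on each side, so that only nearby positions are touched. The `flanks` helper and the list-of-strings layout are hypothetical, and the sketch assumes the stored data can be indexed by position without loading the whole chromosome, which this thread does not confirm.

```python
def flanks(data, center, x):
    """Collect x nucleotides on each side of data[center].

    data is a hypothetical per-position list: "" for a deletion,
    more than one character for an insertion.
    """
    left, n_left, i = [], 0, center - 1
    while i >= 0 and n_left < x:
        left.append(data[i])
        n_left += len(data[i])
        i -= 1

    right, n_right, j = [], 0, center + 1
    while j < len(data) and n_right < x:
        right.append(data[j])
        n_right += len(data[j])
        j += 1

    # An insertion can overshoot the window, so trim to at most x characters.
    left_seq = "".join(reversed(left))[-x:] if x else ""
    right_seq = "".join(right)[:x]
    return left_seq, right_seq

data = list("ACGTACGTACGTACGTACGTA")
data[4] = ""     # deletion of 1 nucleotide
data[10] = "C"   # SNP

left_seq, right_seq = flanks(data, 10, 6)
print(left_seq, right_seq)
```

The walk skips the deleted position and reaches one position further left, so both flanks have the requested length while only the positions near the SNP are read.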

ericloud · 2018-05-30T13:33:57Z

Also, have you tried scanning your mutations by chromosome?
That way the sequence should stay in cache and be loaded only once.

Something like:

for chr in genome.iterGet(Chromosome):
    for snp in chr.get(dbSNPSNP):
        ...

courcelm · 2018-05-30T14:55:19Z

I'm already processing by chromosome and it reloads the sequence for every transcript.

ericloud · 2018-05-30T15:17:56Z

Even with something like:

for chrom in genome.iterGet(Chromosome):
    # Load the chromosome sequence data once and reuse it for every SNP
    chr_seq = chrom.getSequenceData()
    for snp in chrom.get(dbSNPSNP):
        bin_seq = NucBinarySequence(chr_seq[slice(snp.start, snp.end, 1)])
        ...

That's my last suggestion to help you.

I'm working on another project at the moment; sorry, but I don't have the time to improve this specific case.

courcelm · 2018-05-30T15:58:02Z

I can't apply this since I don't want to load the whole chromosome sequence into memory. Anyway, I have solved my problem. Thanks for your help.

tariqdaouda · 2018-05-30T17:51:49Z

What is your solution?

courcelm · 2018-05-30T18:23:08Z

I don't have a solution that fixes both the high memory usage and @ericloud's bug. I found a way in my script to skip the call to _getSequence.

tariqdaouda · 2018-05-31T16:19:44Z

I have also noticed this drop in performance. This is something that would need to be improved in the future.

Reduces memory usage and improves speed

89bf915

Current slicing was concatenating the whole chromosome sequence when only a slice was required. That required 2-3 GB of memory in some cases and it was slow.
