Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalized the middle_one_away function to make parallelization possible #71

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

RNieuwenhuis
Copy link

As discussed in #65 I generalized the middle_one_away function to make parallelization possible.
It is rebuilt as get_pairs_at_pos function and is now deterministic. Thus it contains a fix for #70 .
Required was the implementation of a function aggregate, that is now also used for the all_one_away function.

Furthermore I made some stylistic changes that PyCharm was bugging about. Also included some more comments. Hope it improves readability for all.

Updated README accordingly.

TODO: A value error is raised but I think it is up to @KamilSJaron to decide on how to handle that.
TODO: Check README if it needs any further changes.
TODO: Find out why hetkmers all_one_away and hetkmers get_pairs_at_pos + hetkmers aggregate report different numbers of unique pairs. Manually inspecting a few pairs missing in either methods results did not yield any insight. Needs a more thorough and systematic inspection. Nevertheless, the outcomes are in the same ball park.

nieuw133 added 8 commits August 23, 2020 18:29
… still the same, except for an output file for each position. Also changed the corresponding commandline structure.
…e kmer. By doing this for all positions separately, it can be done in parallel using e.g. gnu parallel or a cluster scheduling system like slurm. For each position, 3 .tsv files are generated, each containg pairs of either sequences, coverages or kmer IDs. The pairs differ in that certain position. The IDs refers to the line number in the original kmer dump file. (0-based). Using smudgeplot.py aggregate on all produced indices.tsv files, unique pairs are selected that can be used for smudgeplot.py plot.
@KamilSJaron
Copy link
Owner

That's cool, thanks for resolving the deterministic problem (I am still not sure if I understand where was the problem and furthermore, sorry for misleading you with the SA, I used them in a different module I wrote for mapping of kmers).

re 3rd TODO: I actually never checked if all and middle find all the kmer pairs at least for the middle nucleotide. I will try to get it in the minimal reproducible framework.

@KamilSJaron
Copy link
Owner

Sorry, the timing is not the best for me. I am going to annual leave in a few days and I have quite a backlog of things I need to do. I am afraid I will get back to this PR in ~3 weeks the soonest.

nieuw133 added 3 commits September 29, 2020 17:49
…n#70 (comment) by renaming some variables so they won't be overwritten.
…ry. Still the memory footprint is very big, basically equal to the original all_one_away method.
@KamilSJaron
Copy link
Owner

3 weeks? More like 3 years!

It was too big of a task for a single evening and I never managed to commit a longer stratch of time to this, I am very sorry. I went through the changes and they look really nice! I wish I would have appreciated more!!!

We actually returned to the development of smudgeplot, we have a working fully parallelised C version, which is the only reason why I won't integrate this pull request, but again, great job!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants