-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalized the middle_one_away function to make parallelization possible #71
base: master
Are you sure you want to change the base?
Conversation
… still the same, except for an output file for each position. Also changed the corresponding commandline structure.
…e kmer. By doing this for all positions separately, it can be done in parallel using e.g. gnu parallel or a cluster scheduling system like slurm. For each position, 3 .tsv files are generated, each containg pairs of either sequences, coverages or kmer IDs. The pairs differ in that certain position. The IDs refers to the line number in the original kmer dump file. (0-based). Using smudgeplot.py aggregate on all produced indices.tsv files, unique pairs are selected that can be used for smudgeplot.py plot.
… aggregate function.
That's cool, thanks for resolving the deterministic problem (I am still not sure if I understand where was the problem and furthermore, sorry for misleading you with the SA, I used them in a different module I wrote for mapping of kmers). re 3rd TODO: I actually never checked if all and middle find all the kmer pairs at least for the middle nucleotide. I will try to get it in the minimal reproducible framework. |
Sorry, the timing is not the best for me. I am going to annual leave in a few days and I have quite a backlog of things I need to do. I am afraid I will get back to this PR in ~3 weeks the soonest. |
…n#70 (comment) by renaming some variables so they won't be overwritten.
…ry. Still the memory footprint is very big, basically equal to the original all_one_away method.
… the first value in the tsv file
3 weeks? More like 3 years! It was too big of a task for a single evening and I never managed to commit a longer stratch of time to this, I am very sorry. I went through the changes and they look really nice! I wish I would have appreciated more!!! We actually returned to the development of smudgeplot, we have a working fully parallelised C version, which is the only reason why I won't integrate this pull request, but again, great job! |
As discussed in #65 I generalized the
middle_one_away
function to make parallelization possible.It is rebuilt as
get_pairs_at_pos
function and is now deterministic. Thus it contains a fix for #70 .Required was the implementation of a function
aggregate
, that is now also used for theall_one_away
function.Furthermore I made some stylistic changes that PyCharm was bugging about. Also included some more comments. Hope it improves readability for all.
Updated README accordingly.
TODO: A value error is raised but I think it is up to @KamilSJaron to decide on how to handle that.
TODO: Check README if it needs any further changes.
TODO: Find out why
hetkmers all_one_away
andhetkmers get_pairs_at_pos
+hetkmers aggregate
report different numbers of unique pairs. Manually inspecting a few pairs missing in either methods results did not yield any insight. Needs a more thorough and systematic inspection. Nevertheless, the outcomes are in the same ball park.