Generalized the middle_one_away function to make parallelization possible #71

RNieuwenhuis · 2020-08-28T13:55:17Z

As discussed in #65, I generalized the middle_one_away function to make parallelization possible.
It has been rebuilt as the get_pairs_at_pos function and is now deterministic, so it also contains a fix for #70.
This required implementing an aggregate function, which is now also used by the all_one_away function.
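
For readers who want the idea without opening the diff, here is a minimal sketch of pairing k-mers that differ at a single position. It is an illustration only, not the code in this PR: the function name `pairs_at_position`, the in-memory list input, and the choice to emit every pair inside a group are assumptions. It also assumes the k-mers are unique and all of the same length.

```python
from collections import defaultdict
from itertools import combinations


def pairs_at_position(kmers, pos):
    """Return sorted index pairs of k-mers that differ only at `pos`.

    Hypothetical helper for illustration; assumes unique k-mers of
    equal length. Indices are 0-based positions in `kmers`.
    """
    groups = defaultdict(list)
    for idx, kmer in enumerate(kmers):
        # Drop the nucleotide at `pos`; k-mers sharing this key are
        # identical everywhere except (possibly) at `pos`.
        key = kmer[:pos] + kmer[pos + 1:]
        groups[key].append(idx)

    pairs = []
    for members in groups.values():
        # Every pair inside a group is exactly one nucleotide apart.
        pairs.extend(combinations(members, 2))

    # Sorting makes the output independent of dict iteration order.
    return sorted(pairs)
```

Each position is handled independently of the others, which is what would allow one job per position. Whether groups with more than two members should be kept or discarded is a design decision this sketch does not settle.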

Furthermore, I made some stylistic changes that PyCharm was flagging and added some more comments. I hope this improves readability for everyone.

Updated README accordingly.

TODO: A ValueError is raised, but I think it is up to @KamilSJaron to decide how to handle that.
TODO: Check whether the README needs any further changes.
TODO: Find out why hetkmers all_one_away and hetkmers get_pairs_at_pos + hetkmers aggregate report different numbers of unique pairs. Manually inspecting a few pairs missing from either method's results did not yield any insight; this needs a more thorough and systematic inspection. Nevertheless, the outcomes are in the same ballpark.
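
One way the systematic inspection in the third TODO could start is a set comparison of the two outputs. The sketch below is a suggestion, not part of this PR: the function names are hypothetical, and it assumes the pairs from both methods have already been loaded into Python as two-element tuples (for example of k-mer sequences).

```python
def normalize(pairs):
    """Order each pair internally so (a, b) and (b, a) compare equal."""
    return {tuple(sorted(pair)) for pair in pairs}


def compare_pair_sets(pairs_a, pairs_b):
    """Split two collections of pairs into shared and method-specific sets."""
    a, b = normalize(pairs_a), normalize(pairs_b)
    return {
        "shared": sorted(a & b),
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
    }
```

Looking at what the `only_in_a` and `only_in_b` pairs have in common (position of the difference, coverage, number of partners) may be more informative than inspecting a few missing pairs by hand.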

…e kmer. By doing this for all positions separately, the work can be done in parallel using e.g. GNU parallel or a cluster scheduling system like Slurm. For each position, three .tsv files are generated, containing pairs of sequences, coverages, or kmer IDs respectively. The pairs differ at that position. The IDs refer to the line numbers in the original kmer dump file (0-based). Running smudgeplot.py aggregate on all produced indices.tsv files selects the unique pairs that can be used for smudgeplot.py plot.
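
The aggregation step could look roughly like the sketch below. It rests on assumptions that the text above does not confirm: that each indices file holds one tab-separated pair of integer IDs per line, and that "unique pairs" means pairs whose two k-mers take part in no other pair at any position. The function names are hypothetical and this is not the PR's aggregate implementation.

```python
from collections import Counter


def read_index_pairs(lines):
    """Parse 'i<TAB>j' lines into (i, j) tuples of ints (assumed format)."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        left, right = line.split("\t")[:2]
        pairs.append((int(left), int(right)))
    return pairs


def aggregate_unique_pairs(sources):
    """Merge per-position pair lists and keep pairs with unshared members.

    `sources` is an iterable of line iterables, e.g. open file handles,
    one per position.
    """
    all_pairs = set()
    for lines in sources:
        for i, j in read_index_pairs(lines):
            all_pairs.add((min(i, j), max(i, j)))

    # Count in how many pairs each k-mer ID appears across all positions.
    usage = Counter(idx for pair in all_pairs for idx in pair)
    return sorted(
        pair for pair in all_pairs
        if usage[pair[0]] == 1 and usage[pair[1]] == 1
    )
```

Since open files iterate over their lines, the per-position files could be passed in directly as handles.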

KamilSJaron · 2020-08-28T14:54:24Z

That's cool, thanks for resolving the determinism problem (I am still not sure I understand where the problem was; also, sorry for misleading you with the SA, I used them in a different module I wrote for mapping of kmers).

Re the 3rd TODO: I never actually checked whether all and middle find all the kmer pairs, at least for the middle nucleotide. I will try to get it into a minimal reproducible framework.

KamilSJaron · 2020-09-07T12:23:35Z

Sorry, the timing is not the best for me. I am going on annual leave in a few days and I have quite a backlog of things I need to do. I am afraid I will get back to this PR in ~3 weeks at the soonest.

KamilSJaron · 2023-08-10T15:08:22Z

3 weeks? More like 3 years!

It was too big a task for a single evening and I never managed to commit a longer stretch of time to this, I am very sorry. I went through the changes and they look really nice! I wish I had appreciated them more!

We actually returned to the development of smudgeplot and now have a working, fully parallelised C version, which is the only reason why I won't integrate this pull request. But again, great job!

nieuw133 added 8 commits August 23, 2020 18:29

Changed the middle_one_away function into get_pairs_at_pos. Output is still the same, except for an output file for each position. Also changed the corresponding commandline structure.

66ad99b

Improved readability, commenting and adherence to style guide I hope.

2d6dd99

Removed a piece of duplicated code.

e4b7b12

Fixed test for hetkmers all or positional mode.

b793ca3

Removed the duplicate reading of infile that was implemented with the aggregate function.

164ef7d

Updated README for parallel use.

46f5f2b

Small changes to README.

360eda2

KamilSJaron reviewed Aug 29, 2020

.idea/inspectionProfiles/Project_Default.xml (outdated)

Removed IDE related file.

9779783

nieuw133 added 3 commits September 29, 2020 17:49

Added fix for faulty behaviour as mentioned by @sunzhig in KamilSJaron#70 (comment) by renaming some variables so they won't be overwritten.

daee4ba

Added cov only option so not all kmers are necessarily read into memory. Still the memory footprint is very big, basically equal to the original all_one_away method.

f7771c4

Fixed bug in aggregate function that caused the larger coverage to be the first value in the tsv file

527a753
