Using h5 region references to index trials #250

Open
tensionhead opened this issue Mar 28, 2022 · 0 comments
Labels
Design (Proposal to investigate/change code structure/layout), Explore (Examine novel functionality/proposed changes etc. Does not necessarily involve coding things.)

tensionhead commented Mar 28, 2022

Current Situation

At the moment the time axis is overloaded: it holds both the actual samples of a trial (for data that really has a time axis) and the trials themselves, even for data that has no time axis:

import syncopy as spy
import numpy as np

# two trials of 10 samples x 2 channels each, matching the shapes shown below
mockup = np.ones((10, 2))
ad = spy.AnalogData([mockup, mockup])
spec = spy.freqanalysis(ad)

Now looking at the shapes we get:

>>> ad.data.shape
(20, 2)
>>> spec.data.shape
(2, 1, 6, 2)

The time axis of the spec object is not a real time axis but just holds the trials, whereas for ad it holds both samples and trials.
This leads to a lot of index gymnastics and therefore mental load, which makes the code error-prone and hard to read and maintain. This is also reflected in our bug history, where confusing sample indices with trials led to numerous bugs, for example #180, #207, #239 and #240.
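To make the two meanings of the first axis concrete, here is a minimal sketch in plain NumPy. It is not Syncopy code: the arrays, the `bounds` list and all variable names are hypothetical stand-ins that only mimic the shapes shown above.

```python
import numpy as np

n_samples, n_channels, n_trials = 10, 2, 2

# Time-domain layout: trials are stacked along the first axis,
# so that axis mixes "sample within a trial" and "which trial".
stacked = np.concatenate(
    [np.full((n_samples, n_channels), float(i)) for i in range(n_trials)]
)

# Getting a trial back requires explicit (start, stop) sample bookkeeping.
bounds = [(i * n_samples, (i + 1) * n_samples) for i in range(n_trials)]
time_trials = [stacked[start:stop] for start, stop in bounds]

# Spectral-like layout: the first axis has one entry per trial,
# so the very same axis now means "trial index" only.
spec_like = np.zeros((n_trials, 1, 6, n_channels))
spec_trials = [spec_like[i:i + 1] for i in range(n_trials)]

print(stacked.shape, time_trials[1].shape, spec_trials[0].shape)
```

The same axis has to be sliced by sample ranges in one case and by trial number in the other, which is where the index bookkeeping comes from.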

Proposed Solution

Using h5 region references allows us to clearly and elegantly decouple the time axis from the trials, while still allowing the same stacking and overlapping in the backing HDF5 dataset:

import h5py

myfile = h5py.File("test", driver='core', mode='w')
myds = myfile.create_dataset('dset', (5, 2))
# define pseudo trials
myds[:3] = 1
myds[2:] = 2
# indicate overlapping region
myds[2:3] = 12
# define (overlapping) trials via region references
trl1 = myds.regionref[:3]
trl2 = myds.regionref[2:]

Indexing is now straightforward:

>>> myds[trl1]
array([[ 1.,  1.],
       [ 1.,  1.],
       [12., 12.]], dtype=float32)
>>> myds[trl2]
array([[12., 12.],
       [ 2.,  2.],
       [ 2.,  2.]], dtype=float32)

Implementing this in our data classes would make the code less opaque and substantially improve maintainability and the speed at which features can be added. Downstream of the data classes themselves, streaming trials to the actual computations would also become almost trivial.

Implementation

Rather than creating these references once and storing them alongside the actual data in the HDF5 containers, I suggest creating them on the fly from the trialdefinition provided by any Syncopy dataset. Something like this already happens in the current implementation to populate attributes like .trials. This would allow seamless integration into the current code base and would not break compatibility with Syncopy data that is already saved somewhere.
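A rough sketch of what creating the references on the fly could look like, using only h5py and NumPy. The helper `make_trial_refs` is hypothetical and not part of Syncopy, and the trial table layout (one row per trial, with start and stop sample in the first two columns) is an assumption made for illustration.

```python
import h5py
import numpy as np


def make_trial_refs(dset, trialdefinition):
    """Hypothetical helper: one region reference per (start, stop, ...) row."""
    return [
        dset.regionref[int(start):int(stop)]
        for start, stop, *_ in trialdefinition
    ]


# In-memory file, nothing is written to disk (backing_store=False).
with h5py.File("trials_sketch.h5", mode="w", driver="core",
               backing_store=False) as h5f:
    data = np.arange(10, dtype="float32").reshape(5, 2)
    dset = h5f.create_dataset("data", data=data)

    # Assumed layout: start sample, stop sample, offset; trials overlap at row 2.
    trialdefinition = np.array([[0, 3, 0], [2, 5, 0]])

    # References are built from the trial table, not stored in the file.
    refs = make_trial_refs(dset, trialdefinition)

    # Streaming trials is then a plain loop over the references.
    trials = [dset[ref] for ref in refs]
    trial_shapes = [trial.shape for trial in trials]

print(trial_shapes)
```

Because the references are derived from the trial table each time, the dataset on disk stays exactly as it is and no extra objects need to be saved with it.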

I know that one argument for the current state is that, even without Syncopy, the (raw) data is still fairly directly accessible, if people are able to parse the trialdefinition provided in the .info JSON file. I would argue that this does not change: we would still provide the very same trialdefinitions in the respective .info files, and the actual arrays holding the data would not change a bit (pun intended) on disk compared to the current implementation :)

tensionhead added this to To do in Codebase Improvements via automation Mar 28, 2022