Using h5 region references to index trials #250

tensionhead · 2022-03-28T12:59:38Z

Current Situation

At the moment the time axis is overloaded: it holds the actual samples of a trial for data that has a real time axis, and it also stacks the trials, even for data that has no time axis at all:

import syncopy as spy
import numpy as np

mockup = np.ones((10, 2))
ad = spy.AnalogData([mockup, mockup])
spec = spy.freqanalysis(ad)

Now looking at the shapes we get:

>>> ad.data.shape
(20, 2)
>>> spec.data.shape
(2, 1, 6, 2)

The time axis of the spec object is not a real time axis; it only holds the trials, whereas for ad it serves both purposes.
This leads to a lot of index gymnastics and mental load, which in turn makes the code error-prone and hard to read and maintain. Our bug history reflects this: confusing sample indices with trials led to numerous bugs, for example #180, #207, #239 and #240.
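To make the bookkeeping concrete, here is a minimal NumPy-only sketch of how trials have to be cut out of a stacked array. The array `stacked`, the `sampleinfo` table and the helper `get_trial` are hypothetical stand-ins for illustration, not Syncopy's actual internals:

```python
import numpy as np

# Two trials of 10 samples each, stacked along the first ("time") axis
nsamples, nchannels = 10, 2
stacked = np.vstack([np.full((nsamples, nchannels), 1.0),
                     np.full((nsamples, nchannels), 2.0)])

# Hypothetical trial table: one (start, stop) row of sample indices per trial
sampleinfo = np.array([[0, 10],
                       [10, 20]])

def get_trial(data, sampleinfo, trial_idx):
    """Cut one trial out of the stacked array via explicit sample indices."""
    start, stop = sampleinfo[trial_idx]
    return data[start:stop]

trial_shapes = [get_trial(stacked, sampleinfo, i).shape
                for i in range(len(sampleinfo))]
print(stacked.shape, trial_shapes)
```

Every consumer of the data has to repeat this translation from trial number to sample range, and it is easy to mix up the two kinds of indices along the same axis.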

Proposed Solution

Using h5 region references allows us to cleanly and elegantly decouple the time axis from the trials, while still allowing the same stacking and overlapping in the backing HDF5 dataset:

import h5py

myfile = h5py.File("test", driver='core', mode='w')
myds = myfile.create_dataset('dset', (5, 2))
# define pseudo trials
myds[:3] = 1
myds[2:] = 2
# indicate overlapping region
myds[2:3] = 12
# define (overlapping) trials via region references
trl1 = myds.regionref[:3]
trl2 = myds.regionref[2:]

Indexing is now straightforward:

>>> myds[trl1]
array([[ 1.,  1.],
       [ 1.,  1.],
       [12., 12.]], dtype=float32)
>>> myds[trl2]
array([[12., 12.],
       [ 2.,  2.],
      [ 2.,  2.]], dtype=float32)

Implementing this in our data classes would make the code less opaque and greatly improve maintainability and the speed of feature additions. Downstream of the pure data class implementations, streaming trials to the actual computations would also become almost trivial.
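As a rough illustration of the streaming idea, the following sketch feeds trials one by one into a placeholder computation. The generator `iter_trials`, the file name and the per-channel mean are assumptions made for this example only and are not part of Syncopy:

```python
import h5py
import numpy as np

def iter_trials(dset, trial_refs):
    """Yield one trial at a time; only the referenced region is read."""
    for ref in trial_refs:
        yield dset[ref]

# In-memory file, nothing is written to disk (backing_store=False)
with h5py.File("stream_sketch.h5", "w", driver="core", backing_store=False) as h5f:
    dset = h5f.create_dataset(
        "data", data=np.arange(10, dtype="float32").reshape(5, 2)
    )
    # Two overlapping pseudo trials covering rows 0-2 and rows 2-4
    trial_refs = [dset.regionref[:3], dset.regionref[2:]]
    # Placeholder computation: per-channel mean of each trial
    trial_means = [trial.mean(axis=0) for trial in iter_trials(dset, trial_refs)]

print(trial_means)
```

The computation never sees a sample index; it only receives complete trials.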

Implementation

Rather than creating these references only once and storing them alongside the actual data in the HDF5 containers, I suggest creating them on the fly from the trialdefinition provided by any Syncopy dataset. Something like this already happens in the current implementation to populate attributes like .trials. This would allow for seamless integration into the current code base and would not break compatibility with Syncopy data that is already saved somewhere.
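A minimal sketch of what creating the references on the fly could look like. It assumes the trialdefinition is an array with one (start, stop, offset) row per trial; the helper `make_trial_refs` and the file name are hypothetical, and the offset column is simply ignored here:

```python
import h5py
import numpy as np

def make_trial_refs(dset, trialdefinition):
    """Create one region reference per trial from (start, stop, offset) rows."""
    # Only the start/stop sample indices select the region; the offset
    # column is ignored in this sketch
    return [dset.regionref[int(start):int(stop)]
            for start, stop, _offset in trialdefinition]

# Assumed layout: one row per trial with start sample, stop sample, offset
trialdefinition = np.array([[0, 3, 0],
                            [2, 5, 0]])

with h5py.File("trldef_sketch.h5", "w", driver="core", backing_store=False) as h5f:
    dset = h5f.create_dataset(
        "data", data=np.arange(10, dtype="float32").reshape(5, 2)
    )
    refs = make_trial_refs(dset, trialdefinition)
    trials = [dset[ref] for ref in refs]
    # Same result as slicing with the sample indices directly
    matches = [np.array_equal(trl, dset[int(start):int(stop)])
               for trl, (start, stop, _offset) in zip(trials, trialdefinition)]

print(matches)
```

Because the references are derived from the trialdefinition at load time, nothing extra needs to be stored in the file.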

I know that one argument for the current state is that, even without Syncopy, the (raw) data is still fairly directly accessible, provided people are able to parse the trialdefinition in the .info JSON file. I would argue that this does not change: we would still provide the very same trialdefinitions in the respective .info files, and the actual arrays holding the data would not change a bit (pun intended) on disk compared to the current implementation :)


tensionhead added the Design (proposal to investigate/change code structure/layout) and Explore (examine novel functionality/proposed changes etc.; does not necessarily involve coding things) labels Mar 28, 2022

tensionhead added this to To do in Codebase Improvements via automation Mar 28, 2022
