
Issues when read_biosemi_bdf.m reads broken files #2318

Open
rbruna opened this issue Sep 8, 2023 · 4 comments

@rbruna
Contributor

rbruna commented Sep 8, 2023

Hello,

I have recently come across some BDF files that appear to be broken for some reason, but still contain valid data (up to the point where they broke). The BDF reader implementation in FieldTrip failed to read them, and I found three issues:

  1. When no data length was indicated, it was calculated as 2-byte data, while Biosemi stores 3-byte data. This was fixed in FIX BDF file reading error #2221.
  2. When the read_24bit MEX file was called with an incorrect data length, the code stalled and finally broke.
  3. Both the MEX file and the Matlab implementation read the file per data epoch.

In my case, the data was continuous and each epoch was 1 sample. Therefore, the code 1) opened the file for reading, 2) read 1 sample times nchannel 24-bit values, 3) closed the file, and 4) repeated. This made reading the data very inefficient. I am not sure if this is common in BDF files, but it was the case in my files, at least.

I propose to change the reading function, getting rid of the MEX file (Matlab includes bit24 support natively), and reading all the data payload at once. That improves the efficiency of the reading, does not worsen the code readability (for educational purposes), and removes the dependency on an external MEX file.
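As a rough sketch of the idea (the header field names here are illustrative, not the actual FieldTrip variables):

```matlab
% Illustrative sketch of the proposed reader: MATLAB's fread supports a
% 'bit24' precision, so the whole payload can be read in one call instead
% of a per-epoch loop. Field names (hdr.HeadLen, hdr.NS, hdr.SPR,
% hdr.NRec) are made up for this example.
fid = fopen(filename, 'r', 'ieee-le');
fseek(fid, hdr.HeadLen, 'bof');                   % skip the ASCII header
nvalues = hdr.NS * hdr.SPR * hdr.NRec;            % channels * samples/record * records
raw = fread(fid, nvalues, 'bit24=>double');       % one read for the full payload
fclose(fid);
dat = reshape(raw, hdr.SPR, hdr.NS, hdr.NRec);    % samples x channels x records
dat = reshape(permute(dat, [2 1 3]), hdr.NS, []); % channels x total samples
```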

I can implement the change if required.

Best,
Ricardo

@robertoostenveld
Contributor

Hi Ricardo,

It appears you already had a good look at this, thanks!

I don't know the history of the FieldTrip implementation any more, but it might predate MATLAB support for 24-bit data types. As long as MATLAB 2018 supports it, I am fine with using bit24 by default (assuming there are no other disadvantages).

However, always reading all data might not fit some frequent usage patterns. See also https://www.fieldtriptoolbox.org/faq/why_are_the_fileio_functions_stateless_does_the_fseek_not_make_them_very_slow/ which discusses reading many short segments versus reading one large segment at once. A common situation is that BDF files are very large (since they often have many channels and a high sampling rate), even to the extent that (when cast to double precision) they won't fully fit into memory. That is not a problem per se with the current code, which would read only small blocks, pick out the trigger channel (keeping only that in memory), parse the triggers, and then continue reading only the segments around the triggers.

Reflecting on BDF and EDF readers: these two have always been a bit of a mess, with shared code (because BDF was derived from EDF), but also being sufficiently different (annotations in EDF+, variable sampling rates in EDF) that a shared function which only differed in the 16/24 bit reading also did not make sense.

Fixing the issue with the broken BDF file is one thing (1); improving the speed and code clarity another (2).

Re 1) does your problem still apply following the fix in #2221? Vladimir shared a broken file, and that one works now (it is part of our nightly regression test). If not, then something else appears to be broken in your files - if so, can you share one with me?

Re 2) Have you used the MATLAB profiler to check where the time is spent when reading a large file? That would help making decisions towards a better implementation.

@rbruna
Contributor Author

rbruna commented Sep 8, 2023

Hello, Robert,

Regarding bit24: I just checked, and it is supported at least as far back as Matlab 2016a.

Regarding the block-wise read: indeed, it makes more sense to read a segment than to read the whole file, but with my proposed modification we can still read a segment. BDF seems to be stored in epochs, which can be short (like FIFF's 1-second epochs), long (like 30-second epochs), or very short (in my case, 0.5 ms, i.e. one sample at 2000 samples per second). Reading a 500 ms segment from a file with 0.5 ms epochs would require opening the file, reading one small chunk of data (39 values of 3 bytes each, in my case), and closing the file, sequentially, 1000 times. That is a lot of time wasted in communication with the OS.

Re 1) the issue seems solved by #2221. My broken file just stops in the middle of an epoch, but the floor in line 179 prevents the MEX file from trying to read the incomplete epoch. It would still fail for otherwise "complete" broken files, that is, files where the header defines a number of epochs (not -1) but the file is broken.

Re 2) I have used tic/toc on my file and it went from 141.4 seconds (current implementation, pure Matlab) or 42.8 seconds (current implementation, with the MEX file) to 7.5 seconds (my implementation). I am not sure how common it is to have 1-sample epochs, but my implementation should be faster for any epoch length.

@robertoostenveld
Contributor

Ah, the fact that you are working with one-sample blocks clarifies one of the causes of the slowness. One-sample BDF (or EDF) files basically have the data multiplexed; treating that as blocks is indeed inefficient.

Just a question: looking at https://www.edfplus.info/specs/edf.html (which I believe also applies to BDF) and considering the header duration ("of a data record, in seconds", 8 ascii characters), are you specifying that as 1/fsample, truncated to 8 characters? With 2048Hz that would result in 0.000488, right?
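For concreteness, the exact value does not fit in the 8-character field, so some truncation has to happen (a quick MATLAB check, purely illustrative):

```matlab
% 1/2048 s does not fit exactly in the 8-character duration field:
num2str(1/2048, '%0.9f')  % '0.000488281' (11 characters)
% truncated to 8 characters it would read '0.000488', losing precision
```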

I am all in favor of speeding up the code, even if it were only for those who work with one-sample-block BDF files (I have no idea how common that is). But the challenge remains not to break current functionality (like reading the trigger channel from a 4GB BDF file on a laptop with only 8GB of RAM).

When all channels have the same sampling rate, an integral solution regardless of block size could consist of fread() with the skip option. That allows fseeking to the first block and reading the desired channel. Hmm, I now realize that here I assume there is a single trigger channel to be read, which is also not a given.
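A sketch of that idea for the one-sample-record case (field names are illustrative; note that for the bitN precisions fread expects the skip argument in bits, not bytes):

```matlab
% Read one channel from a BDF file with 1-sample records using the skip
% argument of fread. With 1-sample records the file is effectively
% multiplexed: one 3-byte value per channel per record, so skipping the
% other NS-1 values lands on the next sample of the same channel.
% hdr.HeadLen, hdr.NS, hdr.NRec and chan are illustrative names.
fid = fopen(filename, 'r', 'ieee-le');
fseek(fid, hdr.HeadLen + (chan - 1) * 3, 'bof'); % first value of the channel
skip = (hdr.NS - 1) * 24;                        % skip is in BITS for 'bit24'
trig = fread(fid, hdr.NRec, 'bit24=>double', skip);
fclose(fid);
```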

Perhaps the logic should be like

if blocksize is 1 sample
  if reading 1 channel
    ... 1) use fseek and skip
  elseif reading >1 channel
    ... 2) use fseek, read everything, and select the desired channels
  end
elseif blocksize is >1 sample
  if reading 1 channel
    ... 3) read the blocks with a for loop and select the desired channel
  elseif reading >1 channel
    ... 4) read the blocks with a for loop and select the desired channels (actually same as above)
  end
end

The memory overrun risk then only applies to "2".

What do you think?

@rbruna
Contributor Author

rbruna commented Sep 11, 2023

In my case, the sampling rate is 2000 Hz, so the record length would be "0.0005 " (with two spaces). But I can see the issue with power-of-two sampling rates...

For the 4GB file on an 8GB laptop I think it would be wise to 1) read the data as 32-bit integers (using 'bit24=>uint32'), 2) keep only the trigger channel(s), and 3) convert only those channel(s) to double precision. That would add an overhead of only 8 bits per sample, so one 4GB file would take 5.3 GB of RAM. Nothing would be broken, and it would be more efficient in any scenario where the read channels are not the complete set.
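Something along these lines (illustrative field names, not the actual FieldTrip code; and since BDF stores signed 24-bit values, a signed target class like int32 may be the safer choice, which is an assumption on my part):

```matlab
% Illustrative sketch: read everything as 32-bit integers (1 byte/sample
% overhead instead of 5 for double), keep only the trigger channel, and
% cast just that channel to double.
fid = fopen(filename, 'r', 'ieee-le');
fseek(fid, hdr.HeadLen, 'bof');                                % skip the header
raw = fread(fid, hdr.NS * hdr.SPR * hdr.NRec, 'bit24=>int32'); % 4 bytes/sample in RAM
raw = reshape(raw, hdr.SPR, hdr.NS, hdr.NRec);                 % samples x channels x records
trig = double(raw(:, trigchan, :));                            % cast only this channel
trig = trig(:)';                                               % 1 x total samples
clear raw                                                      % release the int32 copy
fclose(fid);
```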

I will run some tests on my machine to see how efficient the skip is; it might turn out not to be efficient at all, in which case it makes more sense to read continuous chunks of data (say, an integer number of epochs just below 1GB) and take the desired channels from there. The overhead penalty would be small, and so would the RAM cost. In fact, this could apply to any scenario (if skip is not efficient).
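If the skip turns out to be slow, the chunked fallback could look roughly like this (field names, the chans variable, and the 1GB threshold are all illustrative):

```matlab
% Illustrative sketch: read an integer number of records per chunk (sized
% below ~1GB) and keep only the wanted channels, bounding memory use.
bytesperrec  = hdr.NS * hdr.SPR * 3;              % 3 bytes per 24-bit value
recsperchunk = max(1, floor(1e9 / bytesperrec));  % records per ~1GB chunk
fid = fopen(filename, 'r', 'ieee-le');
fseek(fid, hdr.HeadLen, 'bof');
dat = zeros(numel(chans), hdr.SPR * hdr.NRec);
for rec = 1:recsperchunk:hdr.NRec
  nrec = min(recsperchunk, hdr.NRec - rec + 1);
  raw  = fread(fid, hdr.NS * hdr.SPR * nrec, 'bit24=>double');
  raw  = reshape(raw, hdr.SPR, hdr.NS, nrec);     % samples x channels x records
  smp  = (rec - 1) * hdr.SPR + (1:nrec * hdr.SPR);
  dat(:, smp) = reshape(permute(raw(:, chans, :), [2 1 3]), numel(chans), []);
end
fclose(fid);
```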

Note: I just checked the memory taken by Matlab, and it looks like both the uint32 and the double versions of the data are kept for a short time, so it would be less efficient, memory-wise, when the whole data is read. We can use the uint32 trick only when reading a few channels...
