Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] allow timestamps for physio table #1796

Open
bendichter opened this issue Apr 18, 2024 · 15 comments
Open

[ENH] allow timestamps for physio table #1796

bendichter opened this issue Apr 18, 2024 · 15 comments
Labels

Comments

@bendichter
Copy link
Contributor

Your idea

Tables of time series data, e.g. physio and motion, currently require SamplingFrequency, which assumes regularly sampled data. For irregularly sampled data, it would be helpful to have an OPTIONAL column "timestamp." (which acts similar to the "onset" column of the current events table). If this column is provided, SamplingFrequency should be optional. This feature would also be helpful when you want to capture temporal drift between two acquisition systems.

Motivated by #1792 (comment) by @oesteban

@sappelhoff
Copy link
Member

cc @sjeung @JuliusWelzel who have thought about this / worked on this.

@JuliusWelzel
Copy link
Collaborator

Hi, thanks for tagging us :)

In Motion-BIDS we introduced the channel type Latency.

On the topic of timestamp vs. latency, we decided that each data file assumes a recording start at 0 and each sample can be assigned a latency. Multiple recordings can be synchronized via the scans.tsv file, where the precise onset of each recording is stored. With the Merge of our BEP latency should be available as a common channel type in BIDS for other modalities as well.

Does this help with your idea?

@Remi-Gau
Copy link
Collaborator

For physio files I think that the StartTime metadata plays an equivalent role:

https://bids-specification.readthedocs.io/en/latest/modality-specific-files/physiological-recordings.html#physiological-recordings

@oesteban
Copy link
Collaborator

In my opinion, this is more to gauge the possibility that the SamplingFrequency metadata is allowed to have a special value (e.g., "nonregular") instead of a Number (which is currently the only possibility).

There's no specific limitation to having a column (channel?) called latency, time, or timestamp. The limitation is that, if SamplingFrequency is not modified, this column is just a mapping to the implicit row index (StartTime and the actual frequency tell you the mapping).

Currently, you cannot have a physio file with two rows that were sampled simultaneously (i.e., two rows with same value in the time/latency/timestamp column) AND you have to have a row for every sample at the given frequency. The latter can be workarounded by filling the row with n/a values in all columns, but that is pretty ugly nonetheless.

@sjeung
Copy link
Collaborator

sjeung commented Apr 19, 2024

Yeah there are definitely some limitations to relying on the sampling frequency value. We made a distinction between nominal sampling rate (SamplingFrequency) that is used as hardware spec description and the effective sampling rate (SamplingFrequencyEffective, link) computed by the measured duration and the number of samples. Effective sampling rate allows for deviation from nominal sampling rate, for example 50.5 Hz instead of 50 Hz, but still time stamps are needed for restoring precise timing information when sample intervals are irregular (like @oesteban described, missing samples or two consecutive timestamps repeating).

I was not aware that using a latency (or time stamps) channel, you are still not allowed to sample two data points in one channel simultaneously. The other case of having to fill the cells with n/a is also something I have not encountered myself.
Are these restrictions documented somewhere, or does BIDS validator check for it?

My current understanding is that by having a latency channel that goes for example

1.00, 1.01, 1.01, 1.02, 1.05

(repeated 1.01s and missing 1.03 and 1.04 when nominal sampling rate is 100 hz)
it can match with the indices of the data points in associated channels.

@effigies
Copy link
Collaborator

effigies commented Apr 19, 2024

I am +1 for adding an optional timestamp column, and either make SamplingFrequency optional or acknowledge that timestamp takes precedence and SamplingFrequency records the nominal frequency of the recording device and not a transform from index to timestamp.


Just to throw another proposal into the mix (and hoping I'm not stating the obvious), what we're looking at are sparse vs dense 1D arrays with vector values, where the index is the time stamp:

Dense

(Implicit) Timestamp Columns
StartTime + 0 / SamplingFrequency (col1[0], col2[0], ... colN[0])
StartTime + 1 / SamplingFrequency (col1[1], col2[1], ... colN[1])
... ...
StartTime + M / SamplingFrequency (col1[M], col2[M], ... colN[M])

Sparse

Timestamp (Other) Columns
timestamp[0] (col1[0], col2[0], ... colN[0])
timestamp[1] (col1[1], col2[1], ... colN[1])
... ...
timestamp[M] (col1[M], col2[M], ... colN[M])

BEP 17 has proposed adding .sparse.tsv and .dense.tsv as ways of indicating the interpretation of the contents. Would _<densesuffix>.sparse.tsv[.gz] be a workable alternative here?

It's actually not difficult in the schema to make metadata optional based on the contents of the file, but it might help other tools to know what they're getting into, if they've already hard-coded assumptions about what _<dense>.tsv[.gz] needs to be interpreted.

@oesteban
Copy link
Collaborator

the nominal frequency of the recording device

I agree with @sjeung and like @effigies' options. The best solution here, in my opinion, is to create an explicit NominalSamplingFrequency field.

Regarding the proposal of .sparse.tsv and .dense.tsv, I don't see why there should be a difference between the naming of dense and sparse approaches, basically because we will write dense signals with the sparse encoding you show above. Those timestamps are just the implicit timestamp plus the sampling instant error.

@sjeung
Copy link
Collaborator

sjeung commented Apr 19, 2024

It's nice to see the two cases clearly written out like that. "Synchronisation and latency channel" section in this Motion-BIDS preprint is describing those cases, depending on the availability of metadata/latency. But we did not think of assigning labels to make the distinction clearer.

"For tracking systems that do not provide single-sample timestamp information, the latency of each sample is can be reconstructed based on the effective sampling frequency (recommended field “SamplingFrequencyEffective”), if available, or the nominal sampling frequency (required field “SamplingFrequency”), both found in “motion.json” metadata file.

Synchronizing the onset of motion data between different tracking systems and/or with data from other modalities is achieved using the “sub-XX_scans.tsv” file, which contains an optional column “acq_time” that documents the onset of acquisition in datetime format."

If I can make a suggestion there it would be to stick with the name "latency", in numeric, instead of "time stamps", in datetime format?

@oesteban
Copy link
Collaborator

oesteban commented Apr 19, 2024

@sjeung I couldn't find where the timestamps are defined in datetime format. I prefer timestamp because it is generic -- it can be an integer (Unix sense) it can be a float in seconds (like latency), etc.

Perhaps we need to think about the format itself rather than the device:

For "sparse" (explicit indexing):

{
    "NominalSamplingFrequency": 1000.0,
    "StartTime": 23.3,
    "TabularDataIndex": "<column_name>"
}

where "<column_name>" can be anything such as latency, timestamp, time or index (depending on what section of the specs is being discussed and could even be left open to the user). Start time is still necessary to synchronize with other experiments/acquisitions.

For dense (implicit indexing):

{
    "SamplingFrequency": 1000.0,
    "StartTime": 23.3,
}

(i.e., current physio specs, just making "SamplingFrequency" optional)

@effigies
Copy link
Collaborator

effigies commented Apr 19, 2024

I would have expected latency to encode a $\pm\Delta t$ from nominal, not a time from 0, but setting that aside, I agree it would make more sense to have it be seconds since some zero time than an hh:mm:ss[.00000] time stamp. If latency already covers that, then I would vote for consistency with the existing spec and adopt that.

@oesteban Timestamps are defined in https://bids-specification.readthedocs.io/en/stable/common-principles.html#units.

Time stamp information MUST be expressed in the following format: hh:mm:ss[.000000] For example 13:45:30.

It's not a glossary term, so I'm not sure it's a "definition" so much as describing what to do when storing times without dates. This item would presumably need to be clarified in any PR.

FWIW I don't like TabularDataIndex. I would rather have an inconsistency where physio uses timestamp columns and motion uses LATENCY channels than make tools have to consult the JSON sidecar for the simplest actions.

@sjeung
Copy link
Collaborator

sjeung commented Apr 19, 2024

where "<column_name>" can be anything such as latency, timestamp, time or index (depending on what section of the specs is being discussed and could even be left open to the user). Start time is still necessary to synchronize with other experiments/acquisitions.

The start time approach was explored with respect to cross-modality synchronization and the issue we had back then was that some modalities like EEG do not come with the field and we would have to look at scans.tsv file to unambiguously determine the temporal alignment between recordings. Of course, scans.tsv file itself and acq_time fields are both optional....

@oesteban
Copy link
Collaborator

oesteban commented Apr 19, 2024

FWIW I don't like TabularDataIndex. I would rather have an inconsistency where physio uses timestamp columns and motion uses LATENCY channels than make tools have to consult the JSON sidecar for the simplest actions.

What about:

{
    "NominalSamplingFrequency": 1000.0,
    "StartTime": 23.3,
    "IndexColumn": 1
}

For dense (implicit indexing):

{
    "SamplingFrequency": 1000.0,
    "StartTime": 23.3,
    "IndexColumn": 0  # Or just omitted
}

I just realized that physio tsvs are headerless, so having a name there requires you resolve the "Columns" metadata too. What brings the other side of your distaste for this -- these tsv files are headerless and therefore you have to look at the JSON anyway.

@effigies
Copy link
Collaborator

effigies commented Apr 19, 2024

Good point that physio requires reading the JSON anyway. I was forgetting that.

In any event, I would want the logic as simple as possible. Index columns, if they exist, should come first. Ideally the name is set by the spec, not the user. It is less painful to me to consider the edge cases of "When I see motion, I look for latency, when I see physio, I look for timestamp because that's what the spec says" than "When I see motion, I look for latency, when I see physio, I look for an IndexColumn metadata field and then look for that value in Columns. Hope they ran the validator, or else this is likely to be a ValueError that most users won't be able to interpret." This isn't a hill I'm willing to die on, though.

As a minor aside, the schema uses index_column to indicate columns with uniqueness constraints (coming from database terminology). We may need to rework that if index columns take on this meaning, but the tail should not wag the dog here. Do what makes sense for the data, not the schema.

@bendichter
Copy link
Contributor Author

I think "timestamp" is a more natural choice than "latency", but I'd be fine with bending on that for the sake of consistency with the motion modality.

The plan for physio then would simply be to have "latency" be (optionally) the first item in the "Columns" list of the associated *_physio.json. The units would default to seconds but maybe the JSON could override that by specifying a different Unit for that column. Then the times of these samples would be the first column of the corresponding tsv file.

It looks like this is done a bit differently in motion, where the columns of the motion.tsv are described by the corresponding channels.tsv, in which one of the channels can be marked as the LATENCY type.

I like the idea of explicitly using NominalSamplingFrequency for irregularly sampled data. I would also be OK with keeping SamplingFrequency and having that be interpreted as the nominal frequency if the latency column is present, though I will point out that I think this might cause mistakes for downstream tools that are not aware of this new feature.

@JuliusWelzel
Copy link
Collaborator

JuliusWelzel commented Apr 24, 2024

To be honest I also like "timestamp" more than "latency". This would give the option to synchronize datastream across modalities. For now, recordings have to be synchronized based on the onset from the "scans.tsv" file and the latency channel.

I believe for BIDS 2.0 recordings.tsv will replace scans.tsv. Maybe the option to synchronize recordings via timestamps can be introduced in BIDS 2.0 where backward compatibility is broken?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants