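If we assume a highly limited profile of LinkML (for example, no `multivalued` slots), then CSV/TSV loading is mostly trivial, because the CSV is isomorphic to the underlying objects. In practical terms, we can feed the dict into Python's `csv.DictReader`/`csv.DictWriter` and it will do the right thing.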
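As a minimal sketch of the trivial case, using only the standard library (the records and slot names here are made up for illustration, and this is not the actual LinkML loader/dumper code):

```python
import csv
import io

# Hypothetical flat records: every slot is single-valued with a simple range.
rows = [
    {"id": "P:1", "name": "Alice", "age": "33"},
    {"id": "P:2", "name": "Bob", "age": "41"},
]


def dump_tsv(records):
    """Serialize a list of flat dicts to a TSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(records[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def load_tsv(text):
    """Parse a TSV string back into a list of flat dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


text = dump_tsv(rows)
roundtripped = load_tsv(text)
```

Note that everything comes back as a string, so the round trip is only exact here because `age` was a string to begin with.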
Of course, even with this limited subset, there are multiple annoying edge cases due to the lack of consistently followed standards around CSVs, in particular how missing data is represented.
When we start to extend the profile, it gets harder to have predictable behavior. Even something as simple as allowing `multivalued` has annoying edge cases. We can pick an internal delimiter, such as `|`. This mostly works, modulo the missing data case above, and also assuming sensible escaping rules. There are edge cases where we might want a range to be an `any_of` of multivalued and single-valued, and there is no way to distinguish these cases.
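To make the delimiter approach concrete, here is a rough sketch. The backslash-escaping convention is my own assumption for illustration, not something LinkML or any CSV standard prescribes:

```python
DELIM = "|"


def join_multivalued(values):
    """Join strings with '|', backslash-escaping literal pipes and backslashes."""
    escaped = [v.replace("\\", "\\\\").replace(DELIM, "\\" + DELIM) for v in values]
    return DELIM.join(escaped)


def split_multivalued(cell):
    """Split a cell on unescaped '|' and undo the backslash escaping."""
    values, current, i = [], [], 0
    while i < len(cell):
        ch = cell[i]
        if ch == "\\" and i + 1 < len(cell):
            # Escaped character: take the next character literally.
            current.append(cell[i + 1])
            i += 2
        elif ch == DELIM:
            values.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    values.append("".join(current))
    return values


# The missing-data ambiguity: an empty list and a list holding one empty
# string both serialize to the same empty cell.
empty_list_cell = join_multivalued([])
empty_string_cell = join_multivalued([""])
```

The round trip is fine for non-empty values, but `empty_list_cell` and `empty_string_cell` are the same empty string, and an absent value would typically be written the same way too.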
`any_of` is problematic in general: consider trying to distinguish `"1"` from `1` in a CSV.
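A quick illustration of that collapse using the standard `csv` module (the `value` column is a made-up example):

```python
import csv
import io

# One record holds the integer 1, the other the string "1".
records = [{"value": 1}, {"value": "1"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["value"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# Reading back: both cells are now the string "1"; the type is gone.
loaded = list(csv.DictReader(io.StringIO(csv_text)))
```

Without a schema that pins the range to a single type, nothing in the file says which of the two was intended.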
Things get harder when we start to allow ranges to be classes; sometimes there are "obvious" ways to flatten this but usually not.
Another approach with `multivalued` is to use the relmodel_transformer, which introduces new linking tables, but here let's assume that by "tabular format" we mean a wide, denormalized table where all observations fit into one table, rather than linked relational tables.
For modern tabular databases and formats, there is no problem:

- jsonl
- parquet
- arrow
- duckdb
- pickled data frames

These allow for richer tabular profiles, including multivalued and inlined object references, with clear non-YOLO distinctions between different base types.
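For comparison, a small sketch with JSON Lines using only the standard library (the records are hypothetical): lists, nested objects, nulls, and the `1` vs `"1"` distinction all survive a round trip.

```python
import json

# Hypothetical records mixing multivalued, inlined-object, and null slots.
records = [
    {"id": "P:1", "aliases": ["Al", "Ally"], "score": 1, "address": {"city": "Oslo"}},
    {"id": "P:2", "aliases": [], "score": "1", "address": None},
]


def dump_jsonl(objs):
    """Write one JSON document per line."""
    return "".join(json.dumps(o) + "\n" for o in objs)


def load_jsonl(text):
    """Read one JSON document per non-blank line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_roundtripped = load_jsonl(dump_jsonl(records))
```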
An orthogonal concern is the lack of standards for dataset-level metadata. A sensible paradigm is to follow SSSOM and include a schema-controlled YAML block in the header, but everyone does this differently.
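As a sketch of the header-block idea, assuming the metadata lines are prefixed with `#` as in SSSOM TSV files (the metadata keys and columns below are invented for the example, and the YAML itself is left unparsed so that the sketch needs no third-party library):

```python
import csv
import io


def split_commented_header(text, prefix="#"):
    """Separate leading '#'-prefixed lines (the YAML block) from the table body."""
    header_lines, body_lines = [], []
    in_header = True
    for line in text.splitlines():
        if in_header and line.startswith(prefix):
            stripped = line[len(prefix):]
            # Drop a single space after the prefix, keeping any YAML indentation.
            header_lines.append(stripped[1:] if stripped.startswith(" ") else stripped)
        else:
            in_header = False
            body_lines.append(line)
    return "\n".join(header_lines), "\n".join(body_lines)


# Hypothetical file content: two metadata lines followed by a TSV table.
example = "# license: CC0\n# creator: example\nid\tname\nP:1\tAlice\n"
yaml_text, table_text = split_commented_header(example)
table_rows = list(csv.DictReader(io.StringIO(table_text), delimiter="\t"))
```

The `yaml_text` string could then be handed to a YAML parser and validated against a metadata schema.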
But if we are limited to CSV/TSV, one approach is to treat this as a transformation problem. Different schemas can define different transforms for flattening data. See https://github.com/linkml/linkml-transformer. This is the most principled approach. However, it may be overkill for cases where the transformation is "obvious" (e.g. use a pipe to separate multivalued). The problem is that when everyone's "obvious" transforms come together, it can be hard to reason over the combination.
The current approach with the TSV loaders and dumpers is to try to do the "obvious" transforms, and it mostly works. It uses this under the hood: https://github.com/cmungall/json-flattener. It will do things like use JSON serialization for nested objects, etc.
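The general idea of serializing nested values as JSON inside a cell can be sketched like this. To be clear, this is not json-flattener's actual API, just a hand-rolled illustration with hypothetical function names:

```python
import json


def flatten_row(obj):
    """Serialize any dict- or list-valued slot as a JSON string so the row is flat."""
    return {
        key: json.dumps(value) if isinstance(value, (dict, list)) else value
        for key, value in obj.items()
    }


def unflatten_row(row, json_columns):
    """Decode the named columns from JSON; the caller must know which were encoded."""
    return {
        key: json.loads(value) if key in json_columns and value != "" else value
        for key, value in row.items()
    }


# Hypothetical object with one inlined nested object.
person = {"id": "P:1", "address": {"city": "Oslo"}}
flat = flatten_row(person)
restored = unflatten_row(flat, json_columns={"address"})
```

The catch is visible in `unflatten_row`: the reader needs out-of-band knowledge (a schema or a config) of which columns hold JSON.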
Can we do better than this? There is an argument for having a separate library just for tabular data, where we could escape from the "isomorphism assumption" underpinning the current runtime loaders/dumpers. This could also have a plugin architecture that would make it easy for people to use other tabular/columnar formats like parquet, dask, etc. It would allow for a certain amount of flexibility without requiring the use of linkml-transformer, with some reasonable defaults.
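One possible shape for the plugin idea is a simple registry of per-format loaders. Everything here (`register_loader`, `load`, the format keys) is hypothetical and only meant to show the pattern, not a proposed API:

```python
import csv
import io
import json
from typing import Callable, Dict, List

Loader = Callable[[str], List[dict]]
_LOADERS: Dict[str, Loader] = {}


def register_loader(fmt: str):
    """Decorator that registers a loader function under a format name."""
    def decorator(fn: Loader) -> Loader:
        _LOADERS[fmt] = fn
        return fn
    return decorator


@register_loader("tsv")
def _load_tsv(text: str) -> List[dict]:
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


@register_loader("jsonl")
def _load_jsonl(text: str) -> List[dict]:
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def load(text: str, fmt: str) -> List[dict]:
    """Dispatch to whichever loader was registered for the given format."""
    try:
        loader = _LOADERS[fmt]
    except KeyError:
        raise ValueError(f"no loader registered for format {fmt!r}") from None
    return loader(text)
```

A parquet or duckdb plugin would then just register another function under its own key, without the core library depending on those packages.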