RDF conversion/dumping does not honour list_elements_ordered and produces questionable prefixes #2069

Open
druimalban opened this issue Apr 16, 2024 · 1 comment
Labels
bug Something that should work but isn't, with an example and a test case. community-generated rdf-generator

druimalban commented Apr 16, 2024

Describe the bug

Multi-valued slots behave oddly on conversion to RDF: by default, their values are unordered.

I know of the list_elements_ordered slot, and I assumed that setting it would ensure that the values are emitted as an ordered RDF list (i.e. a container).

Instead, the order is not preserved, i.e. the option appears to have no effect. I have tested both adding this field to the multi-valued slot in question and removing it. The output is identical.
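
To illustrate the distinction, here is a library-free sketch (not how LinkML or rdflib implement it). An RDF graph is a set of triples, so repeating the scope predicate once per value records no order, whereas an RDF collection chains the values through rdf:first and rdf:rest and can be read back in sequence. The function names and the blank-node labels _:b0, _:b1, ... are made up for the example.

```python
# Library-free sketch: triples are modelled as plain Python tuples in a set.
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

scope = ["latitude", "longitude", "easting", "northing", "depth", "sampling_notes"]


def as_plain_triples(subject, predicate, values):
    """One triple per value; the resulting set has no inherent order."""
    return {(subject, predicate, value) for value in values}


def as_collection(subject, predicate, values):
    """Encode values as a linked list of hypothetical blank nodes _:b0, _:b1, ..."""
    nodes = [f"_:b{i}" for i in range(len(values))]
    triples = {(subject, predicate, nodes[0] if nodes else RDF_NIL)}
    for i, value in enumerate(values):
        next_node = nodes[i + 1] if i + 1 < len(nodes) else RDF_NIL
        triples.add((nodes[i], RDF_FIRST, value))
        triples.add((nodes[i], RDF_REST, next_node))
    return triples


def read_collection(triples, subject, predicate):
    """Follow rdf:rest links from the head node to recover the original order."""
    lookup = {(s, p): o for s, p, o in triples}
    node, values = lookup[(subject, predicate)], []
    while node != RDF_NIL:
        values.append(lookup[(node, RDF_FIRST)])
        node = lookup[(node, RDF_REST)]
    return values


plain = as_plain_triples("sampled_data", "scope", scope)
collection = as_collection("sampled_data", "scope", scope)

# The plain encoding cannot tell the original order from a sorted one.
print(plain == as_plain_triples("sampled_data", "scope", sorted(scope)))
# The collection encoding gives the original order back.
print(read_collection(collection, "sampled_data", "scope") == scope)
```

Both comparisons print True: the plain triples are identical whatever order the values were supplied in, while the collection round-trips the order exactly. This is the behaviour I assumed list_elements_ordered would switch on.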

A closely related point: while shaping my data model so that every class instance has an element at the top level (i.e. RDF should know about the slots I've set as identifiers, and use them), I needed to specify @base manually. I have been using the prefix_map argument of RDFLibDumper.dumps() for this. It produces an invalid prefix line starting with @prefix @base:, in addition to the single expected @base line. I don't know enough RDF to say whether this is invalid in all cases, however.

To reproduce

A minimum working example is as follows. Unfortunately, due to the prefix-map it is not very minimal, in the sense that I had to write a Python script rather than play around with the command-line tools.

I have attached a Python script called convert.py, in addition to a test schema, schema.ordered.yaml, and a test data file, data.yaml. This rudimentary script takes three arguments: the 'mode' (from-turtle or to-turtle, you get the idea), the schema (see the attached schema.ordered.yaml) and the data file (first the attached data.yaml, then whatever the script generated from it, which I redirected to data.ttl before removing the duff second prefix line).

Python script:

# file `convert.py'
import logging
import sys

from linkml.generators.pythongen import PythonGenerator
from linkml_runtime.dumpers import RDFLibDumper, YAMLDumper
from linkml_runtime.loaders import RDFLibLoader, YAMLLoader

logging.basicConfig (level = logging.INFO)

args = sys.argv

mode          = args[1]
input_schema  = args[2] 
input_data    = args[3]

logging.info (f"Mode is {mode}, input schema is {input_schema}, input data is {input_data}")

logging.info (f"Calling `PythonGenerator ({input_schema})'")
schema_base = PythonGenerator (input_schema)
logging.info ("Compiling Python module")
schema_mod  = schema_base.compile_module()
logging.info ("Getting `schemaview' object")
schema_view = schema_base.schemaview

logging.info ("Polling target class")
schema_class = schema_mod.ColumnDesc

if (mode == "to-turtle"):
    loader = YAMLLoader   ()
    dumper = RDFLibDumper ()

    logging.info ("Loading data from YAML using `schemaview' object")
    data = loader.load (
        source       = input_data
      , target_class = schema_class
      , schemaview   = schema_view
    )
    logging.info ("Dumping turtle representation to command line")
    res = dumper.dumps (data, schemaview = schema_view, prefix_map = {"@base": "http://localhost/misc/scratch/"})

else:
    loader = RDFLibLoader ()
    dumper = YAMLDumper   ()

    logging.info ("Loading data from turtle using `schemaview' object")
    data = loader.load (
        source       = input_data
      , target_class = schema_class
      , schemaview  = schema_view
    )
    logging.info ("Dumping YAML representation to command line")
    res = dumper.dumps (data)
    
print (res)

Schema file:

# file `schema.ordered.yaml'
# MWE to demonstrate RDF conversion issues
id: http://localhost/misc/scratch/
name: scratch
title: Minimum working example to demonstrate RDF conversion issues
prefixes:
  scratch: http://localhost/misc/scratch/
  dc:      http://purl.org/dc/elements/1.1/
  linkml:  https://w3id.org/linkml/
imports:
  - linkml:types
default_prefix: scratch
default_range:  string
slots:
  atom:
    description:          Short-form name or atom
    range:                string
    broad_mappings:       dc:title
    pattern:              "^:?[a-z]+[[a-z]|_|]*$"
    exact_mappings:       dc:identifier
    identifier:           true
    required:             true
  scope:
    description:           A collection of column names
    range:                 string
    list_elements_ordered: true
    multivalued:           true
    required:              true
classes:
  ColumnDesc:
    tree_root:   true
    description: Class description of set of columns
    slots:
      - atom
      - scope

Data file:

# file `data.yaml'
atom:  sampled_data
scope:
  - latitude
  - longitude
  - easting
  - northing
  - depth
  - sampling_notes

Running the script

Here's the initial conversion from YAML data to turtle. Note the second prefix line, and that the elements of scope are not emitted as a container. They are plain repeated values, which come out in alphabetical order rather than in the order given in data.yaml. This is not at all what I expected.

% python convert.py to-turtle schema.ordered.yaml data.yaml
INFO:root:Mode is to-turtle, input schema is schema.ordered.yaml, input data is data.yaml
INFO:root:Calling `PythonGenerator (schema.ordered.yaml)'
INFO:root:Compiling Python module
INFO:root:Importing linkml:types as /var/db/scratch/linkml/scratch/lib/python3.9/site-packages/linkml_runtime/linkml_model/model/schema/types from source schema.ordered.yaml; base_dir=None
test:86: FutureWarning: Possible nested set at position 10
INFO:root:Getting `schemaview' object
INFO:root:Polling target class
INFO:root:Loading data from YAML using `schemaview' object
INFO:root:Dumping turtle representation to command line

@base <http://localhost/misc/scratch/> .
@prefix @base: <http://localhost/misc/scratch/> .

<sampled_data> a <ColumnDesc> ;
    <scope> "depth",
        "easting",
        "latitude",
        "longitude",
        "northing",
        "sampling_notes" .

Here's the conversion of this turtle back to YAML, after removing the duff second prefix line. Interestingly, the order isn't coerced back into the original order of the data file; it stays alphabetical. I would honestly have expected it to be restored, though this could be an incorrect assumption on my part.

% python convert.py from-turtle schema.ordered.yaml data.ttl
INFO:root:Mode is from-turtle, input schema is schema.ordered.yaml, input data is data.ttl
INFO:root:Calling `PythonGenerator (schema.ordered.yaml)'
INFO:root:Compiling Python module
INFO:root:Importing linkml:types as /var/db/scratch/linkml/scratch/lib/python3.9/site-packages/linkml_runtime/linkml_model/model/schema/types from source schema.ordered.yaml; base_dir=None
test:86: FutureWarning: Possible nested set at position 10
INFO:root:Getting `schemaview' object
INFO:root:Polling target class
INFO:root:Loading data from turtle using `schemaview' object
INFO:root:Triple processed = 7, unprocessed = 0
INFO:root:Dumping YAML representation to command line

atom: scratch:sampled_data
scope:
- depth
- easting
- latitude
- longitude
- northing
- sampling_notes
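
The round trip above can be summarised with a small check. This is only a sketch: check_round_trip is a hypothetical helper, not part of LinkML, and the two lists are copied by hand from data.yaml and from the YAML output above.

```python
def check_round_trip(before, after):
    """Classify how a multivalued slot survived a YAML -> RDF -> YAML round trip."""
    if before == after:
        return "order preserved"
    if sorted(before) == sorted(after):
        return "same elements, order lost"
    return "elements differ"


# Values of `scope` as written in data.yaml.
original = ["latitude", "longitude", "easting", "northing", "depth", "sampling_notes"]
# Values of `scope` as they came back after the round trip.
round_tripped = ["depth", "easting", "latitude", "longitude", "northing", "sampling_notes"]

print(check_round_trip(original, round_tripped))  # same elements, order lost
```

No data is dropped, but the sequence is not the one I wrote, and with list_elements_ordered set I would expect "order preserved" here.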

Expected behavior

Firstly, and most severely, I expect LinkML to honour the documented behaviour of the list_elements_ordered slot, which it currently does not. The slot's description claims the following:

"If True, then the order of elements of a multivalued slot is guaranteed to be preserved. If False, the order may still be preserved but this is not guaranteed"


Secondly, there is this invalid second @prefix @base line. This does not appear to be valid RDF, but it's not clear what actually triggers this.
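
On the validity question: in the Turtle grammar a prefix name (the PN_PREFIX production) must start with a letter and cannot contain @, so @prefix @base: should be rejected by a conforming parser. The check below is a rough sketch of that rule. It is an ASCII-only approximation (the real grammar also admits many non-ASCII letters), and is_plausible_prefix_name is a made-up helper name.

```python
import re

# ASCII-only approximation of Turtle's PN_PREFIX: a leading letter, then
# letters, digits, '_', '-' or '.', not ending in '.'. The empty prefix
# (as in ":name") is also allowed.
PREFIX_NAME = re.compile(r"(?:[A-Za-z](?:[A-Za-z0-9_.-]*[A-Za-z0-9_-])?)?")


def is_plausible_prefix_name(name):
    """Return True if `name` could appear before the colon in an @prefix line."""
    return PREFIX_NAME.fullmatch(name) is not None


for candidate in ["scratch", "dc", "linkml", "@base"]:
    print(candidate, is_plausible_prefix_name(candidate))
```

The three prefixes from the schema pass and @base fails, which matches my having to delete that line by hand before the turtle could be loaded again.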

It should come from linkml-runtime's rdflib_dumper file, but that file seems to have a special case for @base, which implies that there may be something else going on here, i.e. it is possible that the line is added later:
https://github.com/linkml/linkml-runtime/blob/main/linkml_runtime/dumpers/rdflib_dumper.py#L50

There seem to be various open issues about prefixes and RDF generation, but none that specifically covers the problem I am having.


Finally, when converting back to the YAML representation, the identifier field (atom) comes back with the prefix attached (scratch:sampled_data rather than sampled_data). I don't think this is the correct behaviour: the default prefix is already known from the schema and should be implicit, even though the YAMLDumper.dump() methods don't accept a schemaview argument like the RDFLibDumper's.
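
For the atom value, the mapping I would expect looks roughly like this. It is a sketch of the expectation only: strip_default_prefix is a hypothetical helper, not something LinkML provides, and "scratch" is the default_prefix from the schema above.

```python
def strip_default_prefix(curie, default_prefix):
    """Drop 'prefix:' from an identifier when the prefix is the schema default."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix == default_prefix:
        return local
    # Leave other prefixes, and values with no prefix at all, untouched.
    return curie


print(strip_default_prefix("scratch:sampled_data", "scratch"))  # sampled_data
print(strip_default_prefix("dc:title", "scratch"))              # dc:title
```

Applying something like this after loading would restore the value as written in data.yaml, but it is a workaround on my side rather than a fix.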

This final thing is a quirk, but it's a fairly severe issue for me as I am using LinkML's YAML representation to make it easier for others to edit input data files, which are actually processed by a different computer program, as RDF. If there isn't a 1:1 mapping, it can be problematic, although it hasn't been so far in the way that the first two elements of this issue have been.

About your computer (if applicable, please complete the following information):

OS: Mac OS X Sonoma 14.4 Darwin darwin 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:51:37 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T8112 arm64

@druimalban druimalban added the bug Something that should work but isn't, with an example and a test case. label Apr 16, 2024

druimalban commented Apr 17, 2024

OK, I tried to fix the @base thing myself, and was adding more logging, since nothing is logged as each prefix is added. Only in the source did I find an inline comment indicating that passing _base would do the job, which it did, without adding the second duff @prefix @base line. So, that solves that. This is a documentation/tooling issue, I think.

https://github.com/linkml/linkml-runtime/blob/main/linkml_runtime/dumpers/rdflib_dumper.py#L48-L69
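
In terms of the convert.py script above, the change amounts to using the key _base instead of @base in prefix_map. The sketch below shows that renaming on its own. normalise_prefix_map is a hypothetical helper written for this comment, and the only thing it relies on is the observation above that _base is the key the dumper accepts.

```python
def normalise_prefix_map(prefix_map):
    """Return a copy with a Turtle-style '@base' key renamed to '_base'."""
    fixed = dict(prefix_map)
    if "@base" in fixed:
        # Keep an explicit '_base' if the caller supplied both keys.
        fixed.setdefault("_base", fixed["@base"])
        del fixed["@base"]
    return fixed


prefix_map = normalise_prefix_map({"@base": "http://localhost/misc/scratch/"})
print(prefix_map)  # {'_base': 'http://localhost/misc/scratch/'}

# In convert.py the result would then be passed along as before:
#   res = dumper.dumps(data, schemaview=schema_view, prefix_map=prefix_map)
```

With that key the output contained the single expected @base line and no @prefix @base: line.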

Since RDF generation/dumping does have quirks, and I've been using it a lot, I'll try to distill how I've used it, and see about contributing to the documentation.

Still outstanding is this issue with lack of order of lists.
