Implement New Schema management system #3761

dafeder · 2022-03-06T19:55:47Z

The way schemas are defined and discovered in DKAN is still quite fragile, and not conducive to customization. Flexibility is the major value proposition of DKAN's unorthodox approach to metadata storage, so this is technical debt that would be very useful to pay off, in terms of stability, developer experience and ease of adoption.

The current schema system infers a set of schemas by simply looking at any JSON files that exist in one of a few possible filesystem locations. There is no way to differentiate between core, required schemas and user-defined schemas -- if the user has a custom schema directory, all core schemas will be ignored. As well, there is an assumption that schema filenames will match their machine names in the system and the property name they are referenced from, and that there will necessarily be a dataset and distribution schema with a few expected properties. Finally, there is no differentiation between the actual metastore schemas and the form-specific UI schemas; instead, we again rely on filename-based conditionals to choose one or the other when appropriate.

After thinking through a number of approaches I'm proposing what I believe is the best solution to this problem. It is less of a break from the current paradigm than other ideas in this direction (implementing schemas as PHP classes, or as Drupal entity bundles), and has what I think is a better separation of concerns between PHP code, declarative definitions in YAML, and schema documents in JSON. It gives us a unified way to abstract the non-negotiable parts of schemas in the DKAN business logic and allow everything else to be fully override-able. This solves some inconsistencies in the referencing and storing logic between datasets, referenced items like distributions and keywords, and "resources" (files).

This is accomplished through a standardized set of reference types and schema classes. The latter are equivalent to what we have been calling schema "behaviors" in previous planning documents. They allow us to create universal patterns in our DKAN methods to replace some very brittle conditionals that will begin to break down as soon as anyone significantly alters the default DCAT-US based metastore schemas.

Overview

A new schema system would define schemas in .yaml files in a module's root directory, similar to Drupal services and routes. No configuration is guessed/inferred simply from the presence or lack of certain files in the filesystem. The intention is for schemas to be tightly coupled to specific, focused modules.

The current admin pages where reference and trigger fields are set will be removed, as these are all covered in the new schemas file.

This will of course be a significant upgrade of DKAN and will involve these changes:

We will move the core default collection of schemas from the root dkan/schema folder into a new module called dkan_dcat_us. Sites using the default schema will need to enable this module, which we can do via hook_update. The module will be optional and only a dependency of the sample_content module.
Existing sites that have custom schema in their docroot/schema folder will need to create a new module, following the template of dkan_dcat_us, and be sure to have this module enabled and dkan_dcat_us disabled.
Much of what's described here can be refactored incrementally to take advantage of the new abstractions, but the minimum updates for release would be:

Updating the SchemaRetriever class to use the new yaml files, and provide new public methods to provide UI Schemas, reference and trigger information, but keeping it backwards compatible for the time being so that new functionality can be adopted gradually.
Updating the Referencer classes to use the new reference definitions from the YAML in place of property-name-based conditionals. This will hopefully not change any public method signatures and only affect internal logic and private methods.
Updating the datastore event subscriber to pull its list of triggering fields from the schemas YAML instead of the config.
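
As a rough illustration of the last two points, the sketch below looks up reference properties and fired triggers from an already-parsed schemas definition instead of from property-name conditionals or config. It is written in Python purely for brevity (the real code would be PHP), and the names `DEFINITIONS`, `reference_properties` and `fired_triggers` are hypothetical, not existing DKAN APIs.

```python
# Hypothetical stand-in for a parsed schemas YAML file.
DEFINITIONS = {
    "dataset": {
        "references": [
            {"property": "distribution", "schema": "distribution"},
            {"property": "keyword", "schema": "keyword"},
        ],
        "triggers": [
            {"property": "modified", "trigger": "datastore_import"},
        ],
    },
    "organization": {},
}


def reference_properties(definitions, schema_id):
    """Return the property names declared as references for a schema."""
    entry = definitions.get(schema_id, {})
    return [ref["property"] for ref in entry.get("references", [])]


def fired_triggers(definitions, schema_id, old_item, new_item):
    """Return the trigger names whose watched property changed on update."""
    entry = definitions.get(schema_id, {})
    fired = []
    for trig in entry.get("triggers", []):
        prop = trig["property"]
        # Only top-level properties are compared, matching the proposal.
        if old_item.get(prop) != new_item.get(prop):
            fired.append(trig["trigger"])
    return fired
```

With this shape, neither the referencer nor the event subscriber would need to know property names such as "keyword" or "modified" in advance; they would only read what the schemas file declares.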

Structure of a DKAN schemas file

Similar to Drupal's .services.yml files, a .schemas.yml file in a module's root directory will register schemas with DKAN and make them available to Drupal\metastore\SchemaRetriever. The default DCAT-US schemas can be described like this:

catalog:
  schema_path: schema/catalog.json
  references:
    - { property: dataset, schema: dataset }
  class: catalog
  endpoint_base: data

dataset:
  schema_path: schema/dataset.json
  ui_schema_path: schema/dataset.ui.json
  identifier: { property: identifier, type: uuid }
  references:
    - { property: distribution, schema: distribution }
    - { property: publisher, schema: organization }
    - { property: keyword, schema: keyword }
    - { property: theme, schema: theme }
  triggers:
    - { property: modified, trigger: datastore_import }
  class: dataset

distribution:
  schema_path: schema/distribution.json
  references:
    - { property: downloadURL, type: resource }
    - { property: describedBy, schema: data-dictionary, type: id }
  class: distribution

organization:
  schema_path: schema/organization.json
  ui_schema_path: schema/organization.ui.json

keyword:
  schema: { type: string }
  class: literal

theme:
  schema: { type: string }
  class: literal
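
The example above mixes two ways of supplying a schema: a `schema_path` pointing at a JSON file, and an inline `schema`. A minimal sketch of how an entry might be resolved, assuming the YAML has already been parsed into a dict, is shown here in Python for brevity. `resolve_schema` is a hypothetical helper, not an existing DKAN function.

```python
import json
from pathlib import Path


def resolve_schema(entry, schemas_file):
    """Return the JSON Schema for one entry of a parsed schemas file.

    An inline `schema` is returned as-is; otherwise `schema_path` is read
    relative to the directory that contains the schemas file.
    """
    if "schema" in entry:
        return entry["schema"]
    if "schema_path" in entry:
        path = Path(schemas_file).parent / entry["schema_path"]
        return json.loads(path.read_text(encoding="utf-8"))
    raise ValueError("entry needs either 'schema' or 'schema_path'")
```

Giving the inline form precedence over the path is an arbitrary choice in this sketch; the proposal does not say what should happen if both are present.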

Schema definitions

class: Schema class, to map an arbitrary schema to core DKAN functionality. Currently available classes are catalog, dataset, distribution, dictionary, and literal.
identifier: Object defining an arbitrary identifier property in the JSON, used to inject item ID into JSON body on creation.
- property: The JSON property. For now, only top-level properties supported.
- type: Format of identifier allowed. For now, only uuid supported which injects the internal item uuid as a value. Other types may be made available in the future.
endpoint_base: Pattern to use for the endpoint to this type of metastore item. For instance, if set to "data", the catalog JSON endpoint will be /data.json. Currently only supported for catalog class.
references: Object to tell the DKAN metastore which properties should have their values stored in referenced items rather than directly in the item's JSON.
- type: Type of reference. Accepts one of the following values:
  - schema: Swap JSON object or string literal with corresponding item UUID.
  - id: Swap a URL to a metastore item with a simple item UUID (maintains portability).
  - resource: Swap a URL to a DKAN resource ID
  - ~~file~~: Possible future reference type if we replace the resource system with DKAN's core file entities, which has been proposed a few times recently.
- schema: The schema to use for the referenced item. Not needed if type is file.
schema: Full JSON schema, expressed in YAML.
schema_path: Path to JSON Schema file, relative to the schemas file.
triggers: An array of objects with properties corresponding to "trigger" names. The trigger will run if the property value changes on item update.
- property: The JSON property. For now, only top-level properties supported.
- trigger: Currently only datastore_import, which will re-import the datastores associated with this item, is available.
ui_schema: Full UI schema, expressed in YAML.
ui_schema_path: Path to the UI schema for use by form builder.
- property: The property name. Only top-level properties supported at the moment.
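
To make the `identifier` and `references` definitions more concrete, here is a hedged Python sketch of what they might drive when an item is saved. The helper names and the `store` callable (standing in for the metastore) are hypothetical, and the sketch assumes that a reference with no explicit `type` behaves like type `schema`, which the example file suggests but does not state.

```python
def inject_identifier(item, entry, item_uuid):
    """Copy the internal item UUID into the property named by `identifier`."""
    ident = entry.get("identifier")
    if ident and ident.get("type") == "uuid":
        # Return a copy so the caller's item is not mutated.
        return {**item, ident["property"]: item_uuid}
    return item


def swap_references(item, entry, store):
    """Replace the values of `schema`-type references with item UUIDs.

    `store(schema_id, value)` must return the UUID of the referenced item.
    References of other types (id, resource) are left untouched here.
    """
    result = dict(item)
    for ref in entry.get("references", []):
        prop = ref["property"]
        # Assumption: a missing `type` means a plain schema reference.
        if ref.get("type", "schema") != "schema" or prop not in result:
            continue
        value = result[prop]
        if isinstance(value, list):
            result[prop] = [store(ref["schema"], v) for v in value]
        else:
            result[prop] = store(ref["schema"], value)
    return result
```

Both functions only look at top-level properties, in line with the limitation noted in the definitions above.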

Classes

Schema classes are not references to PHP classes, but map roughly to classes from DCAT and related RDF vocabularies. DKAN can work with essentially any metadata schema as long as it maps to these basic DCAT concepts.

catalog: Schema used for custom catalog endpoint, usually data.json. Equivalent to dcat:Catalog. There may be only one catalog record in the
dataset: The main dataset schema in the metastore. Equivalent to dcat:Dataset.
distribution: A specific representation of the dataset. Usually, a file. Equivalent to dcat:Distribution.
dictionary: A schema with column-level metadata, such as a table schema or a shared data dictionary. For now, implemented in core and there is no support for standards other than the Frictionless Table Schema. Closest equivalent would be dct:Standard, though in DKAN this class is much more specific.
literal: A "schema" whose items are unstructured values, such as strings. Literals will be wrapped in a JSON-LD structure. For instance, a keyword "health" would be stored as a simple string in the database, but retrieved from the API with the JSON {"@value": "health"}. Equivalent to rdfs:Literal and similar RDF literal types.
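
The literal wrapping described above can be sketched in a few lines. This is illustrative Python only; `wrap_literal` and `unwrap_literal` are hypothetical names, and accepting a bare value on input is an assumption rather than something the proposal specifies.

```python
def wrap_literal(stored_value):
    """Wrap a stored literal in the JSON-LD style structure served by the API."""
    return {"@value": stored_value}


def unwrap_literal(api_value):
    """Return the bare value, whether or not it arrived wrapped."""
    if isinstance(api_value, dict) and "@value" in api_value:
        return api_value["@value"]
    return api_value
```

For the keyword example, `wrap_literal("health")` gives `{"@value": "health"}`, and `unwrap_literal` reverses it before storage.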

Schemas may map to other DCAT classes (for instance, foaf:Organization in the case of the default organization schema) but they play no special role in DKAN's architecture and do not need to be assigned a class in the schemas file.

For the moment, there can only be one schema defined for any class other than literal. In future iterations we may want to make this more flexible to address use-cases where multiple types of datasets or distributions are supported -- for example, a catalog might allow different schemas for geospatial or financial metadata.
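
One way to enforce the one-schema-per-class restriction when a schemas file is loaded is a simple duplicate check, sketched here in Python over an already-parsed definitions dict. `duplicate_classes` is a hypothetical helper, not part of the proposal.

```python
from collections import defaultdict


def duplicate_classes(definitions):
    """Return {class: [schema ids]} for non-literal classes used more than once."""
    by_class = defaultdict(list)
    for schema_id, entry in definitions.items():
        cls = entry.get("class")
        # Schemas without a class, and literals, are exempt from the rule.
        if cls and cls != "literal":
            by_class[cls].append(schema_id)
    return {cls: ids for cls, ids in by_class.items() if len(ids) > 1}
```

An empty result means the file satisfies the restriction; anything else names the class that has been claimed by more than one schema.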

Typical class relationships in a DKAN catalog:

classDiagram
    class catalog {
        array~dataset~ dataset
    }
    class dataset {
        array~distribution~ distribution
        literal keyword
        literal theme
    }
    class distribution {
        dictionary describedBy
    }

    class literal {
        string value
    }

    class dictionary {
        array fields
    }
    
    catalog "1" o-- "many" dataset
    dataset "1" o-- "many" distribution
    dataset "many" o-- "many" literal
    distribution "1" o-- "1" dictionary

Related changes

Speccing out this work has provided some clarity on how we could better handle identifiers and references in the API; I will document this in a second issue.


clayliddell · 2022-03-08T22:43:31Z

I like the approach being suggested here, however I do have some questions surrounding the proposed .schemas.yml file format:

What is the purpose of the class property? It seems like schema and schema_path already accomplish its purpose.
The /data.json endpoint is confusing to me since it's separate from the rest of the DKAN API. What benefit does the /data.json endpoint provide, and is the endpoint_base property necessary for the standard?
Do references need to be specified in the schemas.yml file? I thought we were already specifying them in the individual schema files.

dafeder · 2022-03-09T00:20:23Z

"Class" is basically a recognition that certain schemas have unique business logic around them but we don't, IMO, want that to be hard-coded to the actual schema name. That way you could have, for instance, packages and resources (the frictionless terms) instead of datasets and distributions, and still know what to do with what schemas. But we could forget about that for now, and just say you need to use our names no matter what the filename or contents are. I do think marking certain schemas as literals is important, but that could be a boolean option of its own instead of a "class" at the same level as "dataset".
DCAT-US federal open data standard expects a catalog object at the URL /data.json. All US government data sites have a /data.json. Other open data standards (DCAT-AP, for instance) have something equivalent. Our datasets list endpoint is missing the required catalog schema and fields, and not at a standard URL. This could be handled outside the schemas file for now (for instance, in a custom route in the dkan_dcat_us module) if it seems like too much.
No, the schema files don't contain any information about references, and IMO they should not. The reference fields are currently specified in config and can be edited at /admin/dkan/properties. This is maybe the most useful part of this change IMO, that the reference settings are in code and that we can define different types of references.

dafeder · 2022-03-28T19:00:34Z

Update - after further discussion I am thinking more and more that at least for initial rollout it makes more sense to just have a few basic schema names -- dataset and distribution, plus anything like data-dictionary which is internal. The addition of "class" seems too complex -- if necessary we can find a way to "alias" these core types if people really want them to appear differently in the API paths.

I also don't think it's necessary to require "literals" to be flagged in a schemas file; simply having a schema be of type "string" should be enough.

Will update the above plan/diagram when possible.

grugnog · 2024-05-01T14:53:20Z

First pass at this - get the YAML in place and start using it to compose schemas.

catalog:
  schema_path: schema/catalog.json
  references:
    - { property: dataset, schema: dataset }

dataset:
  schema_path: schema/dataset.json
  ui_schema_path: schema/dataset.ui.json
  references:
    - { property: distribution, schema: distribution }
    - { property: publisher, schema: organization }
    - { property: keyword, schema: keyword }
    - { property: theme, schema: theme }

distribution:
  schema_path: schema/distribution.json
  references:
    - { property: downloadURL, type: resource }
    - { property: describedBy, schema: data-dictionary, type: id }

organization:
  schema_path: schema/organization.json
  ui_schema_path: schema/organization.ui.json

keyword:
  schema: { type: string }
  class: literal

theme:
  schema: { type: string }
  class: literal

dafeder created this issue from a note in High-level roadmap (To-do) Mar 6, 2022

github-actions bot added this to Incoming/Triage in DKAN 2 Issue Triage Mar 6, 2022

dafeder removed this from To-do in High-level roadmap Mar 6, 2022

dafeder changed the title ~~New Schema system~~ New Schema managment system Mar 6, 2022

dafeder mentioned this issue Mar 24, 2022

Standardize reference types and logic across schemas #3772

Open

nplathe mentioned this issue Aug 19, 2022

Migration Issues with DKAN for an already existing Drupal7-DKAN-based Data Platform #3818

Closed

This was referenced Aug 23, 2022

Metastore code refactor for clearer naming and flow #3736

Open

Refactor metastore and merge with common and dkan modules #3825

Closed

dafeder self-assigned this Aug 24, 2022

TheETupper changed the title ~~New Schema managment system~~ New Schema management system Apr 30, 2024

TheETupper changed the title ~~New Schema management system~~ Implement New Schema management system May 1, 2024
