RDF Serialization - issue with the dct:format property value #231

Open
giorgialodi opened this issue Sep 14, 2018 · 2 comments

@giorgialodi

giorgialodi commented Sep 14, 2018

In the RDF serialization of a dataset, the dct:format property may take the value OP_DATPRO even if the source catalog correctly indicates the format using the EU controlled vocabulary, as required by the DCAT-AP_IT specs.
This does not happen if the format of the distribution is CSV, for instance. It seems to happen during the harvesting phase and only for specific formats (e.g., all those related to RDF serializations such as RDF_XML, RDF_TURTLE, RDF_N_TRIPLES, etc.).

Example:
Source Catalogue: Linked Data Platform with metadata compliant with DCAT-AP_IT

<http://dati.beniculturali.it/resource/Distribuzione/complessoArchivistico-GGASI-nt> a dcatapit:Distribution,
        dcat:Distribution ;
    dct:description "Distribuzione in formato N triples del dataset complessoArchivistico-GGASI " ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/OP_DATPRO> ;
    dct:license <https://w3id.org/italia/controlled-vocabulary/licences/C1_Unknown>,
        "https://creativecommons.org/licenses/by-nc/2.5/it/legalcode/" ;
    dct:title "Distribuzione in formato N triples del dataset complessoArchivistico-GGASI" ;
    dcat:downloadURL <http://dati.san.beniculturali.it/dataset/nt/complessoArchivistico-GGASI.nt> .

In this case the serialized format is OP_DATPRO, while in the source catalogue it is correctly set to the following URI: http://publications.europa.eu/resource/authority/file-type/RDF_N_TRIPLES

It may be a problem of a limited set of format_mapping values https://github.com/geosolutions-it/ckanext-dcatapit/blob/master/ckanext/dcatapit/dcat/profiles.py#L76 ?
In any case, if the source correctly includes the format using the requested controlled vocabulary, no format mapping should be applied. We should simply use what is included in the source catalogue.

@tdipisa
Member

tdipisa commented Sep 26, 2018

@giorgialodi

It may be a problem of a limited set of format_mapping values https://github.com/geosolutions-it/ckanext-dcatapit/blob/master/ckanext/dcatapit/dcat/profiles.py#L76 ?

The format_mapping object is used by this extension only during the serialization procedure, not during the harvesting phase. So if the source you are harvesting is not a CKAN instance with this extension properly installed, the problem is not related to the format_mapping. The format_mapping is used only as a fallback by this extension; normally the resource's distribution_format is used if it is set. Of course, the format_mapping can be improved by making it configurable in some way and by adding more formats to the mapping.
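To make the fallback order easier to follow, here is a minimal sketch of it in Python. It is not the extension's actual code: `resolve_format_uri`, `FORMAT_MAPPING` and its entries are hypothetical names chosen for illustration, and the default of OP_DATPRO is assumed from the behavior reported in this issue.

```python
# Hypothetical sketch of the serialization fallback; not the extension's real code.
FILE_TYPE_BASE = "http://publications.europa.eu/resource/authority/file-type/"
DEFAULT_FORMAT = "OP_DATPRO"  # assumed default, as observed in this issue

# Assumed, deliberately small mapping from CKAN format labels to EU codes.
FORMAT_MAPPING = {"CSV": "CSV", "JSON": "JSON"}


def resolve_format_uri(resource):
    """Return the dct:format URI for a resource dict."""
    distribution_format = resource.get("distribution_format")
    if distribution_format:
        # Preferred path: the DCAT-AP_IT field is set, so use it directly.
        code = distribution_format
    else:
        # Fallback path: map the free-text CKAN format, else use the default.
        ckan_format = (resource.get("format") or "").strip().upper()
        code = FORMAT_MAPPING.get(ckan_format, DEFAULT_FORMAT)
    return FILE_TYPE_BASE + code
```

With this logic, a resource that only carries a CKAN format outside the mapping ends up as OP_DATPRO, whatever the source catalogue declared.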

In any case, if the source correctly includes the format using the requested controlled vocabulary, no format mapping should be applied. We should simply use what is included in the source catalogue.

As mentioned above, the harvester does not use the format_mapping; it uses what is defined in the RDF element by the source catalog. So it is probably the harvester's code that needs to be investigated for this particular behavior.

@giorgialodi
Author

@tdipisa @etj @cezio With the help of a developer we were able to identify the issue, which seems to be more complex than expected. It involves the format field (CKAN metadata) and the distribution_format field (which we introduced for the DCAT-AP_IT profile).
We have three possible cases:

  1. people upload metadata through the web form --> no issue, since both format and distribution_format are set. If I remember correctly, we verified this with Tobia;

  2. harvesting of a source that is not compliant with the DCAT-AP_IT profile --> there is no distribution_format, only the format. If distribution_format is not available, in the serialization phase the code calls the format_mapping object, which covers very few formats. Hence the mapping works for the most common formats, while for all the others "OP_DATPRO" is used. In the visualization phase an empty field is shown, since visualization uses distribution_format;

  3. harvesting of a source that is compliant with the profile --> it seems, from the code, that distribution_format is never materialized, even if the data source uses the correct format from the EU controlled vocabulary. Only the format field is materialized. The result is the same as in case 2; that is, the format_mapping object is called once again. For common formats (e.g., CSV, JSON) everything is fine, while for all the others "OP_DATPRO" is used. This explains what I reported above: CSV works, while the RDF formats do not.
    The code involved should be the following:

    resource_dict[key] = value

    distribution_format is never set.

In general we have many OP_DATPRO values in the central registry, because we harvest from the RDF serializations of PAs that introduce this error.

Possible solution to fix the issue
We need to materialize distribution_format.

In case 3, we need to take the last part of the EU controlled vocabulary URI and use it to set distribution_format. In the serialization phase we will then use it to create the right node of the graph, and in the visualization phase we will display it. This is the basic idea, and I may propose a PR.
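A minimal sketch of the case 3 idea, assuming the harvester has the dct:format URI at hand when it builds the resource dict. The function names are hypothetical and do not come from the extension; only the base URI is taken from the example above.

```python
# Hypothetical sketch for case 3; function names are illustrative only.
FILE_TYPE_BASE = "http://publications.europa.eu/resource/authority/file-type/"


def distribution_format_from_uri(format_uri):
    """Return the code at the end of an EU file-type URI, or None otherwise."""
    if not format_uri:
        return None
    uri = str(format_uri).strip()
    if not uri.startswith(FILE_TYPE_BASE):
        # Not a reference to the EU controlled vocabulary: nothing to extract.
        return None
    code = uri[len(FILE_TYPE_BASE):].strip("/")
    return code or None


def set_distribution_format(resource_dict, format_uri):
    """Set distribution_format only when the URI is a valid EU file-type URI."""
    code = distribution_format_from_uri(format_uri)
    if code is not None:
        resource_dict["distribution_format"] = code
    return resource_dict
```

For the distribution in the example, this would store RDF_N_TRIPLES in distribution_format, so the serialization would no longer need the format_mapping fallback.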

Case 2 is more complex. If the data source is not compliant, we have no reference to the EU controlled vocabulary, only CKAN's format (including some very strange ones: we saw "geo json" or the names of PDF files used as formats!). In this case, format_mapping should be applied. However, since it is limited, we need to extend it to cover all the remaining formats. For the strange values included by PAs we will use OP_DATPRO. We may create a mapping file starting from these CKAN formats: https://github.com/ckan/ckan/blob/master/ckan/config/resource_formats.json. We would apply this mapping once, when we get the data from the harvesting, so that distribution_format is always set during serialization and visualization. We should verify the feasibility of this solution.
Alternatively, we could derive distribution_format dynamically from CKAN's format every time the serialization and the visualization are executed.
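A minimal sketch of how case 2 could work, usable both at harvest time and dynamically. The mapping below is a hypothetical excerpt, not the extension's format_mapping and not derived from resource_formats.json; the normalization rule (lowercase, drop non-alphanumeric characters) is also an assumption.

```python
import re

DEFAULT_FORMAT = "OP_DATPRO"  # used for values that cannot be mapped

# Hypothetical excerpt of an extended mapping: normalized CKAN label -> EU code.
EXTENDED_FORMAT_MAPPING = {
    "csv": "CSV",
    "json": "JSON",
    "ntriples": "RDF_N_TRIPLES",
    "turtle": "RDF_TURTLE",
}


def normalize_label(label):
    """Lowercase a free-text format and drop everything but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", (label or "").lower())


def derive_distribution_format(resource):
    """Keep an existing distribution_format, else map CKAN's format field."""
    existing = resource.get("distribution_format")
    if existing:
        return existing
    key = normalize_label(resource.get("format"))
    return EXTENDED_FORMAT_MAPPING.get(key, DEFAULT_FORMAT)
```

Calling `derive_distribution_format` once after harvesting corresponds to the first option; calling it at serialization and visualization time corresponds to the dynamic alternative. File names used as formats simply miss the mapping and fall back to OP_DATPRO.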

BTW: it would be better for CKAN's filters to show the DCAT-AP_IT formats instead of the current mess in CKAN, which allows anyone to enter a free-text format if it is not included in the JSON I pointed out above.
