Skip to content

Download and merge DCAT RDF data from a DCAT catalog endpoint to query it locally.

Notifications You must be signed in to change notification settings

berlinonline/dcat-catalog-downloader

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DCAT Catalog Downloader

TL;DR

Download and merge DCAT RDF data from a DCAT catalog endpoint to query it locally.

Description

The ckanext-dcat extension implements a DCAT metadata export for CKAN. Among other things, it implements a catalog endpoint to download CKAN's dataset metadata as RDF. The output of the catalog endpoint is paged, and there is no way to download the complete metadata dump at once. Also, CKAN and ckanext-dcat have no SPARQL endpoint to query the metadata online.

The DCAT Catalog Downloader makes it easier to query the DCAT data from a CKAN instance by paging through the catalog endpoint, downloading each page and merging it.

The repository contains:

  • a Python-based command line tool download_catalog that pages through the DCAT catalog endpoint to download the complete catalog
  • a Makefile to orchestrate downloading and merging the data
  • a handful of SPARQL queries that you can run on the downloaded data

Requirements

  • Python 3 for the downloader
  • To query RDF data locally, you need an RDF framework with SPARQL capabilities. The excellent rdflib library that is installed as a dependency of the downloader lets you use SPARQL on local data files, so you could just use that. I prefer to use the Java-based Apache Jena, because it is very performant and comes with a range of useful command line tools. Some package managers offer packaged versions of Jena. E.g., on Mac OS you can use homebrew to install Jena: brew install jena.

Installing the Downloader

  • Clone the repository:
$ git clone https://github.com/berlinonline/dcat-catalog-downloader
  • Create and activate a Python virtual environment:
$ cd dcat-catalog-downloader
$ python -m venv venv
$ . venv/bin/activate
  • Install the required python libraries:
(venv) $ pip install -r requirements.txt
  • Install the command line tool:
(venv) $ python setup.py develop

Running the Downloader

You now have the downloader available as a command line tool.

(venv) $ download_catalog --help                                                             
usage: download_catalog [-h] --endpoint ENDPOINT [--output OUTPUT]

Page through the DCAT-catalog endpoint of CKAN instance and save all pages into an output folder.

options:
  -h, --help           show this help message and exit
  --endpoint ENDPOINT  endpoint URL of the CKAN's DCAT catalog, e.g. https://datenregister.berlin.de/catalog.ttl
  --output OUTPUT      path to the output folder containing the catalog pages. Default is data/temp.

You can either use the downloader directly or use the Makefile to create a folder structure, download the data and merge it to a specific location.

(venv) $ make data/output/catalog.ttl                                                                
creating data/temp directory ...
creating data/temp/parts ...
downloading all catalog pages to data/temp/parts ...
INFO:root:Reading first catalog page from https://datenregister.berlin.de/catalog.ttl ...
INFO:root:Last page is 33 ...
INFO:root:Saving catalog page: https://datenregister.berlin.de/catalog.ttl?page=2 to data/temp/parts/catalog.ttl?page=2
INFO:root:Saving catalog page: https://datenregister.berlin.de/catalog.ttl?page=3 to data/temp/parts/catalog.ttl?page=3
...
INFO:root:Saving catalog page: https://datenregister.berlin.de/catalog.ttl?page=32 to data/temp/parts/catalog.ttl?page=32
INFO:root:Saving catalog page: https://datenregister.berlin.de/catalog.ttl?page=33 to data/temp/parts/catalog.ttl?page=33
creating data/output directory ...
merging all catalog pages in data/temp/parts to data/output/catalog.ttl ...

The Makefile downloads from https://datenregister.berlin.de. If you want to download from a different CKAN instance, you need to change ckan_base variable at the top of the file.

Querying the Data

How you run a query depends on which SPARQL engine you have installed. If you have Jena, you can use the sparql command line tool. There are some queries that come as part of the repository, which you can just use. Or you write your own queries, of course.

$ sparql --data data/output/catalog.ttl --query queries/distinct_authors.rq --results=csv
publisher
 Bezirksamt Friedrichshain-Kreuzberg von Berlin - Vermessung
Abt. III C - Freiraumplanung und Stadtgrün
Abteilung II - Integrativer Umweltschutz
Amt für Statistik Berlin-Brandenburg
Berlin.de
BerlinOnline Stadtportal GmbH & Co KG
BerlinOnline Stadtportal GmbH & Co. KG
Berliner Feuerwehr
Berliner Forsten
Berliner Stadtreinigung (BSR)
Berliner Umweltportal
...
Statistisches Landesamt Bremen
Stromnetz Berlin GmbH
Tegel Projekt GmbH
Tegel Projekt GmbH - Urban Data & Plattform
VBB - Verkehrsverbund Berlin-Brandenburg GmbH
Verkehrslenkung Berlin
WISTA Management GmbH
openstreetmap.org

License

This material is copyright ©BerlinOnline Stadtportal GmbH & Co. KG.

All software in this repository is published under the MIT License.

About

Download and merge DCAT RDF data from a DCAT catalog endpoint to query it locally.

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published