curie-data-factory/health-data-metrics

https://youtu.be/8YmNQUuj6-E

Goal

The main goal of HDM is to help assess data quality by regularly running ad-hoc programs that "scan" databases to compute metrics and detect divergence in either the structure or the content of those databases. It then generates alerts that give Data Engineers insight into what broke.

To do this we have developed the following features:

  • Calculate metrics on the data from our warehouses.
  • Set up rules to be able to apply operational / business constraints on the databases in connection with the calculated metrics.
  • Detect breaks and regressions in the database structure or in the data itself by generating alerts using business rules.
  • Allow constraints to be centralized and create a unified HUB to manage data quality in order to deliver the best possible quality data to doctors and researchers.
  • Create dashboards on metrics to be able to visualize and explore them.

Get Started

Health Data Metrics relies on an ecosystem of applications in order to work.

Dependencies

  • Elasticsearch >=v7.10.0

    • Elasticsearch Installed and API Endpoint accessible.
  • Kibana >=v7.10.0

    • Kibana Installed and API Endpoint accessible.
  • Airflow >=v2.1.0

    • Airflow Installed and API Endpoint accessible.
    • HDM Pipeline imported, setup & running (See More on Airflow Pipeline).
  • Nexus >=3.29.2-02

    • Nexus Installed and API Endpoint accessible.
    • Default Repository
    • User / Password with rights to [Read artifacts, Search Queries]
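
A minimal sketch for checking a component against the minimum versions listed above. The helper names `version_ge` and `check_min` are hypothetical, the comparison relies on GNU `sort -V`, and the Elasticsearch URL in the usage comment is an assumption about your own setup:

```shell
#!/usr/bin/env bash
# Sketch: compare an installed version with a required minimum.
# Relies on GNU `sort -V` (version sort).

version_ge() {
  # Succeeds when version $1 is greater than or equal to version $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

check_min() {
  # $1 = component name, $2 = version found, $3 = minimum required
  if version_ge "$2" "$3"; then
    echo "$1 $2: OK (>= $3)"
  else
    echo "$1 $2: too old (needs >= $3)"
    return 1
  fi
}

# Hypothetical usage (the host and port are assumptions):
#   es_version=$(curl -fsS http://localhost:9200 \
#     | sed -n 's/.*"number" *: *"\([^"]*\)".*/\1/p')
#   check_min Elasticsearch "$es_version" 7.10.0
```

The same `check_min` call can be reused for Kibana, Airflow and Nexus once you have retrieved their version strings by whatever means your deployment allows.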

Configuration

Application Configuration File

/var/www/html/conf/appli/conf-appli.json

See default file : docs/templates/conf-appli.json

Database Configuration File

/var/www/html/conf/db/conf-db.json

See default file : docs/templates/conf-db.json

LDAP Configuration File

/var/www/html/conf/ldap/conf-ldap.json

See default file : docs/templates/conf-ldap.json

Mail Configuration File

/var/www/html/conf/mail/msmtprc

See default file : docs/templates/msmtprc
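
As an illustration, the four files above could be seeded from the templates in one step. This is only a sketch: the function name `init_conf` is hypothetical, and it assumes a local `conf/` directory that is later mounted at /var/www/html/conf/ with the same sub-folders (appli, db, ldap, mail) as the paths listed above.

```shell
#!/usr/bin/env bash
# Sketch: copy the default templates into a local conf/ tree.
# Existing files are left untouched so local edits are not overwritten.

init_conf() {
  # $1 = templates directory (e.g. docs/templates), $2 = target conf directory
  templates_dir="$1"
  conf_dir="$2"

  mkdir -p "$conf_dir/appli" "$conf_dir/db" "$conf_dir/ldap" "$conf_dir/mail"

  # Each pair is "template name:destination path relative to conf_dir".
  for pair in \
    "conf-appli.json:appli/conf-appli.json" \
    "conf-db.json:db/conf-db.json" \
    "conf-ldap.json:ldap/conf-ldap.json" \
    "msmtprc:mail/msmtprc"
  do
    src="$templates_dir/${pair%%:*}"
    dest="$conf_dir/${pair#*:}"
    # Copy only when the destination does not exist yet.
    [ -e "$dest" ] || cp "$src" "$dest"
  done
}

# Usage, from the repository root:
#   init_conf docs/templates conf
```

After copying, edit each file to match your environment; the templates only provide default values.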

Run it !

You can run HDM in 3 different ways :

Docker Image

To run anywhere :

docker run -p 80:80 -v "$PWD/conf/:/var/www/html/conf/" ghcr.io/curie-data-factory/hdm:latest

Helm Chart

To deploy in production environments :

helm repo add curiedfcharts https://curie-data-factory.github.io/helm-charts
helm repo update

helm upgrade --install --namespace default --values ./my-values.yaml my-release curiedfcharts/hdm

More info Here
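
The `my-values.yaml` file used above has to be written by you. One possible starting point, sketched here, is to dump the chart's default values with `helm show values` and edit the result. The function name `hdm_default_values` is hypothetical, and the sketch assumes the `curiedfcharts` repository has already been added as shown above:

```shell
#!/usr/bin/env bash
# Sketch: write the chart's default values to a file for editing.
# Assumes `helm repo add curiedfcharts ...` has already been run.

hdm_default_values() {
  # $1 = output file (defaults to ./my-values.yaml)
  helm show values curiedfcharts/hdm > "${1:-./my-values.yaml}"
}

# Usage:
#   hdm_default_values ./my-values.yaml
```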

From sources

For dev purposes :

  1. Clone the git repository :
git clone https://github.com/curie-data-factory/health-data-metrics.git
cd health-data-metrics/
  2. Create the configuration files & folders :
touch conf/ldap/conf-ldap.json
  3. Set the configuration variables (see the templates above).
  4. Run the Docker Compose stack :
docker-compose up -d
  5. Resolve the Composer package dependencies. See Here for installing and using composer.
docker exec -ti hdm sh -c "composer install --no-dev --optimize-autoloader"

Going deeper

You can install Airflow and run the entire stack locally if you have enough RAM & CPU (4 cores & 16 GB RAM recommended). To see how : go Here

Screenshots & User Guide

Screenshots : home, explorator, rule-editor

Build Doc

The documentation is compiled from markdown sources using Material for MkDocs. To compile the documentation :

  1. Go to your source directory :
cd health-data-metrics
  2. Run the docker build command :
docker run --rm -i -v "$PWD:/docs" squidfunk/mkdocs-material:latest build

Airflow Pipeline

See https://airflow.apache.org/docs/apache-airflow/stable/index.html


Data Factory - Institut Curie - 2021