Skip to content

equipe22/BioQuality

Repository files navigation

---
title: "Biology quality and CDW"
author: "Vincent Looten from the Anita Burgun's Lab"
date: "31 août 2017"
output:
  html_document: default
  pdf_document: default
---

The objective of these scripts is to provide diagnostic tools of the temporal quality of biological data in a clinical data warehouse.
These scripts are based on: "WHAT CAN MILLIONS OF LABORATORY VALUES TELL US ABOUT THE TEMPORAL ASPECT OF DATA QUALITY? STUDY OF DATA SPANNING 17 YEARS IN A CLINICAL DATA WAREHOUSE". Comput Methods Programs Biomed. 2018 Dec 29. pii: S0169-2607(18)30708-9. doi: 10.1016/j.cmpb.2018.12.030.
https://www.ncbi.nlm.nih.gov/pubmed/30612785

# The data input format

Two examples files have to be put in a directory called 'Example' in the working directory:
+ Count.csv
+ BIO.20170821101046.csv

These files helps you to adapt your SQL query for the proposed data profiling algorithm.
The file 'Count.csv' includes the example of the list of exams as an input.
The file 'BIO.20170821101046.csv' includes the require input format for lab tests data.

# How to use scripts without real data ?

To run for the first time the data profiling algorithm, we recommend to try it on simulated data:
+ Put all the R files in a R project directory.
+ Launch the master script called 'main.R'

Report and derived files will be create with the simulated data. The simulated data are generated with the script 'datasimu.R'. This script generates the differents patterns described in the article.

# How to use scripts with real data ?

After lauching programs with simulated data. You will show the data directory where to put your real data. You have also to generate Count.csv which contains the index of lab tests.
If you have an I2B2 data warehouse you can also adapt the datareal.R file.

# Landscape of R files

There are 13 R files :

+ init.R: install and load packages, create directories, source fun_ R files
+ main.R: master script call all script
+ datasimu.R: simulation of data to test the programs. This script generates the differents patterns described in the article.
+ datareal.R: extract files from i2b2 clinical data warehouse in the right format
+ movingquantiles.R: compute the moving quantiles
+ fun_movingquantiles.R: auxiliary file associated to the movingquantiles.R file
+ missingdata.R: detect missing data (not described in the article, previous version)
+ fun_missingdata.R: auxiliary file associated to the missingdata.R file (not described in the article, previous version)
+ discretization.R: detect discritization
+ fun_discretization.R: auxiliary file associated to the discretization.R file
+ breakpoints.R: detect breakpoints
+ fun_breakpoints.R: auxiliary file associated to the breakpoints.R file
+ trends.R: performe regressions
+ fun_trends.R: auxiliary file associated to the trends.R file