
PLM-ICD-multi-label-classifier

An unofficial multi-label classifier based on the PLM-ICD paper.

Basically this is my personal side project. The goal was to gain a deep understanding of the paper; the result is a more concise and clear implementation, which makes customization and extension easier.

Although the model comes from the paper, I tried my best to make this a general program for text multi-label classification tasks.

Usage

Python Env

python -m venv ./_venv --copies
source ./_venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
# deactivate

Run Tests

python -m pytest ./test --cov=./src/plm_icd_multi_label_classifier --durations=0 -v

ETL

The ETL consists of the following steps:

  • Original JSON line dataset preparation.
  • Transforming the JSON line file to a limited JSON line file, which means all lists and dicts are converted to strings.
  • Data dictionary generation.

Note, the final data folder should contain 4 files: train.jsonl, dev.jsonl, test.jsonl and dict.json.

Prepare (Specific) Original JSON Line Dataset

The data should be in JSON line format. A MIMIC-III data ETL program is provided:

python ./bin/etl/etl_mimic3_processing.py ${YOUR_MIMIC3_DATA_DIRECTORY} ${YOUR_TARGET_OUTPUT_DIRECTORY}

To use this program for text multi-label classification on your own customized dataset, just transform it into a JSON line file and use the training config file to specify which field is the text and which is the label.

NOTE, since this is a multi-label classification task, the label field should be a CSV string, for example:

{"text": "this is a fake text.", "label": "label1,label2,label3,label4"}

So you are not limited to MIMIC-III; any dataset following this format works.
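
For illustration, here is a minimal sketch of how a record in this format maps to a multi-hot target vector (hypothetical code, not part of this repo; the label vocabulary is assumed, and the field names are the ones set via text_col/label_col):

import json

# Hypothetical example record in the required format.
record = json.loads('{"text": "this is a fake text.", "label": "label1,label2,label3,label4"}')

# Assumed label vocabulary; in practice it comes from the generated data dictionary.
label_vocab = ["label1", "label2", "label3", "label4", "label5"]
label_to_idx = {label: i for i, label in enumerate(label_vocab)}

# Split the CSV label string and build the multi-hot target vector.
multi_hot = [0.0] * len(label_vocab)
for label in record["label"].split(","):
    multi_hot[label_to_idx[label]] = 1.0

print(multi_hot)  # [1.0, 1.0, 1.0, 1.0, 0.0]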

Transform To Limited JSON Line Dataset

Although JSON line files are used, lists and dicts are not allowed inside the JSON. I believe "flat" JSON makes things clearer, so a tool is provided that converts lists and dicts contained in the JSON to strings:

python ./bin/etl/etl_jsonl2limited_jsonl.py ${ORIGINAL_JSON_LINE_DATASET} ${TRANSFORMED_JSON_LINE_DATASET}
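
Conceptually, the conversion works like the following sketch (hypothetical code, not the tool's actual implementation; the exact string encoding may differ):

import json

def flatten_record(record: dict) -> dict:
    # Serialize any list/dict value to a string so the record stays "flat".
    return {
        key: json.dumps(value) if isinstance(value, (list, dict)) else value
        for key, value in record.items()
    }

raw = {"text": "a fake text.", "label": ["label1", "label2"]}
print(flatten_record(raw))
# {'text': 'a fake text.', 'label': '["label1", "label2"]'}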

NOTE, although you can put the datasets in any directory you like, you HAVE TO name them train.jsonl, dev.jsonl and test.jsonl.

Data Dictionary Generation

Generate the data dictionaries (dict.json) by scanning the train, dev and test data. Run:

python ./bin/etl/etl_generate_data_dict.py ${TRAIN_CONFIG_JSON_FILE_PATH}
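
The idea behind this step can be sketched as follows (hypothetical code; the real etl_generate_data_dict.py reads paths from the training config and may store more than a label dictionary):

import json

labels = set()
for split in ("train.jsonl", "dev.jsonl", "test.jsonl"):
    # Assumed data_dir taken from the training config.
    with open(f"./_data/etl/mimic3/{split}") as f:
        for line in f:
            labels.update(json.loads(line)["label"].split(","))

# Assign a stable integer ID to every label seen in any split.
label_dict = {label: i for i, label in enumerate(sorted(labels))}
with open("./_data/etl/mimic3/dict.json", "w") as f:
    json.dump(label_dict, f)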

Training and Evaluation

CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ${TRAIN_CONFIG_JSON_FILE_PATH}

Training Config File

The format should be JSON. Most of the parameters are easy to understand if you are an MLE or researcher:

  • chunk_size: The number of token IDs in each chunk.
  • chunk_num: The number of chunks each text/document should have; short texts are padded at the front.
  • hf_lm: HuggingFace language model name/path. Each hf_lm may have a different lm_hidden_dim; I personally tried 2 LMs:
    • "distilbert-base-uncased" with lm_hidden_dim as 768
    • "medicalai/ClinicalBERT" with lm_hidden_dim as 768
  • lm_hidden_dim: Language model's hidden output layer's dimension.
  • data_dir: Data directory, which should contain at least these files generated by etl_mimic3_processing.py:
    • train.jsonl
    • dev.jsonl
    • (test.jsonl)
  • training_engine: Training engine; can be "torch" or "ray". The "torch" mode is mainly used for debugging and does not support distributed training.
  • single_worker_batch_size: Each worker's batch size. Note that with the "torch" engine there is only one worker.
  • lr: Initial learning rate.
  • epochs: Training epochs.
  • gpu: Whether to use GPU for training.
  • workers: Number of workers in distributed training. Only effective when using "ray" as the training engine.
  • single_worker_eval_size: Each worker's maximum evaluation sample size. Again, with the "torch" engine there is only one worker.
  • random_seed: Random seed; this ensures training is 100% reproducible.
  • text_col: Text column name in train/dev/test JSON line dataset.
  • label_col: Label column name in train/dev/test JSON line dataset.
  • ckpt_dir: Checkpoint directory name.
  • log_period: Number of batches between evaluation log prints.
  • dump_period: Number of steps between checkpoint dumps.
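
As a reference, a config could look like the following sketch (all values are illustrative, not tuned recommendations; see train_mimic3_icd.json in the repo for the actual example):

{
  "chunk_size": 128,
  "chunk_num": 10,
  "hf_lm": "distilbert-base-uncased",
  "lm_hidden_dim": 768,
  "data_dir": "./_data/etl/mimic3/",
  "training_engine": "ray",
  "single_worker_batch_size": 8,
  "lr": 5e-5,
  "epochs": 20,
  "gpu": true,
  "workers": 4,
  "single_worker_eval_size": 1000,
  "random_seed": 42,
  "text_col": "text",
  "label_col": "label",
  "ckpt_dir": "./_ckpt",
  "log_period": 100,
  "dump_period": 1000
}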

Examples

Training an ICD10 Classification Model with MIMIC-III Data

Preparation - Get Raw MIMIC-III Data

Suppose you put the original MIMIC-III data under ./_data/raw/mimic3/ like:

./_data/raw/mimic3/
├── DIAGNOSES_ICD.csv
├── NOTEEVENTS.csv
└── PROCEDURES_ICD.csv

0 directories, 3 files

ETL - Training Dataset Building

This step joins the necessary tables' data together and builds the training dataset. Suppose we are going to put the training data under ./_data/etl/mimic3/; per this program's rules, the directory should contain train.jsonl, dev.jsonl and test.jsonl (alongside dict.json and an intermediate dim_processed_base_data.jsonl), like:

./_data/etl/mimic3/
├── dev.jsonl
├── dict.json
├── dim_processed_base_data.jsonl
├── test.jsonl
└── train.jsonl

0 directories, 5 files

You can run:

python ./bin/etl/etl_mimic3_processing.py ./_data/raw/mimic3/ ./_data/etl/mimic3/ 

Config - Prepare Your Training Config File

The data_dir in this config will be needed by the next ETL step; you can just refer to train_mimic3_icd.json.

ETL - Convert Training Dataset JSONL to Limited JSONL File

Note this step is unnecessary here, since the outputs of ./bin/etl/etl_mimic3_processing.py are already limited JSON line files; even if you run the following program, you will get exactly the same files:

python ./bin/etl/etl_jsonl2limited_jsonl.py ./_data/etl/mimic3/${INPUT_JSONL_FILE} ./_data/etl/mimic3/${OUTPUT_JSONL_FILE}

Training - Train the ICD10 Classification Model with the MIMIC-III Dataset

CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ./train_mimic3_icd.json

Other Implementation Details

  • After chunk_size and chunk_num are defined, each text's token ID length is fixed to chunk_size * chunk_num; if a text is not long enough, it is automatically padded at the front (see the sketch below).
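
To make the chunking concrete, here is a minimal sketch (hypothetical code, not the repo's implementation; special-token handling may differ):

from transformers import AutoTokenizer

CHUNK_SIZE = 128  # chunk_size from the training config
CHUNK_NUM = 10    # chunk_num from the training config
TOTAL_LEN = CHUNK_SIZE * CHUNK_NUM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
token_ids = tokenizer("some long clinical note ...", add_special_tokens=False)["input_ids"]

# Truncate long texts; pad short texts at the front ("padding first").
token_ids = token_ids[:TOTAL_LEN]
token_ids = [tokenizer.pad_token_id] * (TOTAL_LEN - len(token_ids)) + token_ids

# Reshape into chunk_num chunks of chunk_size token IDs each.
chunks = [token_ids[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE] for i in range(CHUNK_NUM)]
assert all(len(chunk) == CHUNK_SIZE for chunk in chunks)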