Characterization of Stigmatizing Language in Medical Records

This is the official repository for the ACL 2023 paper, "Characterization of Stigmatizing Language in Medical Records." If you publish any research which uses the code, data, and/or models within this repo, we kindly ask you to cite us:

@inproceedings{harrigian2023characterizing,
  title={Characterization of Stigmatizing Language in Medical Records},
  author={Harrigian, Keith and 
          Zirikly, Ayah and 
          Chee, Brant and 
          Ahmad, Alya and 
          Links, {Anne R.} and 
          Saha, Somnath and 
          Beach, {Mary Catherine} and 
          Dredze, Mark},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2023}
}

If you encounter issues with the code in this repository, we encourage you to open a new GitHub issue or contact us directly via email. We will be more than happy to help you get up and running with our code, models, and data.

Status

Last Updated: November 6, 2023

Annotations and models have officially been published on the PhysioNet platform. To gain access, you will need to complete PhysioNet's credentialing process, which includes submitting documentation confirming that you have completed appropriate human subjects (IRB) training.

Resources

This repository only provides an API for interacting with our data and models. The data (including annotations) and models themselves are not hosted within this repository. To access these resources, you must first complete the appropriate credentialing process on PhysioNet. Once you have signed our usage agreement on PhysioNet, you can make full use of our toolkit.

Data (MIMIC-IV)

To replicate our experiments or train new models, you will need access to the MIMIC-IV and MIMIC-IV-Notes datasets (v2.2). Both of these resources are hosted on PhysioNet and require completion of IRB-related training.

Once you have completed the credentialing process, you can acquire the minimum required data resources using our utility script ./scripts/acquire/get_mimic.sh. You will be asked for your PhysioNet username and password. Files will be downloaded to data/resources/datasets/mimic-iv/.
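
For example, assuming a bash shell (the script will prompt for your PhysioNet credentials):

bash scripts/acquire/get_mimic.sh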

Labels and Models

We have opted to keep our labels and models behind a credentialing gate for a few reasons. First, although we do not expect our training procedure to encode sensitive information from the MIMIC dataset, the risk is nonzero and worth respecting. Furthermore, if we release models in the future that do allow end-users to extract sensitive information, existing end-users will be able to acquire them seamlessly. Finally, by requiring end-users to complete IRB training prior to accessing our models, we limit the risk of malevolent use.

Our models can be acquired from PhysioNet after completing a data usage agreement. If you already have access to MIMIC, downloading our models should only require you to sign our data usage agreement.

Once you have completed this process, you can use our utility script ./scripts/acquire/get_models_and_labels.sh to download the pretrained models and annotations. Models will be downloaded to data/resources/models/, while annotations will be downloaded to data/resources/annotations/.
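
As with the MIMIC download, the script can be invoked directly (again assuming a bash shell, with your PhysioNet credentials at hand):

bash scripts/acquire/get_models_and_labels.sh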

If you have downloaded the MIMIC-IV dataset, you can create an augmented annotated dataset for training new models using scripts/acquire/build_mimic.py.

python scripts/acquire/build_mimic.py \
    --annotations_dir data/resources/annotations/ \
    --keywords data/resources/keywords/keywords.json \
    --load_chunksize 1000 \
    --load_window_size 10

If you'd only like to run the search procedure for notes in the annotated dataset, you can do so by adding the --load_annotations_only flag to the command above.
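
For example, the command above becomes:

python scripts/acquire/build_mimic.py \
    --annotations_dir data/resources/annotations/ \
    --keywords data/resources/keywords/keywords.json \
    --load_chunksize 1000 \
    --load_window_size 10 \
    --load_annotations_only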

Expected Resource Structure

If the utility scripts above worked appropriately, you should see the directory structure below.

data/
    resources/
        keywords/
            keywords.json
        annotations/
            annotations.csv
        datasets/
            mimic-iv/
                admissions.csv.gz
                diagnoses_icd.csv.gz
                discharge.csv.gz
                patients.csv.gz
                services.csv.gz
                transfers.csv.gz
        models/
            mimic-iv-discharge_clinical-bert/
                adamant_fold-0/
                compliance_fold-0/
                other_fold-0/
            ...
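
If you would like to verify the layout programmatically, a minimal sketch such as the one below (written against the paths above; adjust it if you changed any download locations) can help catch missing files. This snippet is purely illustrative and is not part of the stigma package.

from pathlib import Path

## Paths taken from the expected resource structure above
RESOURCES = Path("data/resources")
expected = [
    RESOURCES / "keywords" / "keywords.json",
    RESOURCES / "annotations" / "annotations.csv",
    RESOURCES / "datasets" / "mimic-iv" / "discharge.csv.gz",
    RESOURCES / "models" / "mimic-iv-discharge_clinical-bert",
]

## Report anything that was not downloaded
missing = [str(p) for p in expected if not p.exists()]
if missing:
    print("Missing resources:\n" + "\n".join(missing))
else:
    print("All expected resources found.")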

Installation

We recommend interacting with the resources above using our stigma API (Python package). Installing this package should be relatively straightforward. However, please feel free to reach out to us on GitHub or via email if you encounter issues.

To install the package, run the command below from the root of this repository. This command will install all external dependencies, as well as the stigma package itself. It is extremely important to keep the -e (editable install) flag, as it ensures the default data and model paths are preserved.

pip install -e .

We strongly recommend using a virtual environment manager (e.g., conda) when working with this codebase. This will help limit unintended consequences arising from, e.g., dependency upgrades. The conda documentation provides all the information you need to set up your first environment.

Note: Our toolkit was developed and tested using Python 3.10. We cannot guarantee that other versions of Python will support the entirety of our codebase. That said, we expect the majority of functionality to be preserved as long as you are using Python >= 3.7.
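
Putting the above together, a minimal setup (assuming conda is installed; the environment name "stigma" is just an example) might look like:

conda create -n stigma python=3.10
conda activate stigma
pip install -e .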

Testing

To validate that data was downloaded correctly and the package was installed appropriately, you can make use of our small test suite.

  • pytest -v -Wignore tests/test_mimic.py: Ensures we are able to load the MIMIC-IV dataset and annotations as expected.
  • pytest -v -Wignore tests/test_api.py: Ensures we are able to load default models and arrive at expected predictions.
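
Both tests should also be runnable in a single invocation by pointing pytest at the tests directory (assuming no other tests in that directory require resources you have not downloaded):

pytest -v -Wignore tests/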

API Usage

For a quick introduction to our API, we recommend exploring our quickstart notebook. We have abstracted most of the codebase into a few modules to make interacting with the pretrained models easy.

## Import API Modules
from stigma import StigmaSearch
from stigma import StigmaBaselineModel, StigmaBertModel

## Examples of Clinical Notes
examples = [
    """
    Despite my best advice, the patient remains adamant about leaving the hospital today. 
    Social services is aware of the situation.
    """,
    """
    The patient claims they have remained sober since their last visit, though I smelled
    alcohol on their clothing.
    """
]

## Initialize Keyword Search Wrapper
search_tool = StigmaSearch(context_size=10)

## Run Keyword Search
search_results = search_tool.search(examples)

## Prepare Inputs for the Model
example_ids, example_keywords, example_text = search_tool.format_for_model(search_results=search_results,
                                                                           keyword_category="adamant")

## Initialize Model Wrapper
model = StigmaBertModel(model="mimic-iv-discharge_clinical-bert",
                        keyword_category="adamant",
                        batch_size=8,
                        device="cpu")

## Run Prediction Procedure
predictions = model.predict(text=example_text,
                            keywords=example_keywords)
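
The same workflow applies to the other keyword categories. The sketch below reuses the objects defined above and assumes the outputs of format_for_model are list-like; a separate model is loaded for each classification task.

## Run the full pipeline for each classification task
all_predictions = {}
for category in ["adamant", "compliance", "other"]:
    ## Extract keyword matches for this category
    ex_ids, ex_keywords, ex_text = search_tool.format_for_model(search_results=search_results,
                                                                keyword_category=category)
    ## Skip categories without any keyword matches
    if len(ex_text) == 0:
        continue
    ## Load the category-specific model and predict
    cat_model = StigmaBertModel(model="mimic-iv-discharge_clinical-bert",
                                keyword_category=category,
                                batch_size=8,
                                device="cpu")
    all_predictions[category] = cat_model.predict(text=ex_text,
                                                  keywords=ex_keywords)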

A Note on Phrasing

Throughout the repository, you may notice certain naming conventions which do not align with what was presented in the ACL paper. The main differences to be aware of are as follows:

  1. keyword is what we use to denote the anchors referenced in the paper.
  2. keyword_category is what we use to refer to the 3 stigma classification tasks.
  3. adamant, compliance, and other are shorthand keyword categories which refer to the Credibility and Obstinance, Compliance, and Other Descriptors tasks, respectively.
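
For reference, the shorthand-to-task mapping above can be expressed as a simple lookup (illustrative only; this dictionary is not part of the stigma package):

## Shorthand keyword categories -> task names from the paper
KEYWORD_CATEGORY_TO_TASK = {
    "adamant": "Credibility and Obstinance",
    "compliance": "Compliance",
    "other": "Other Descriptors",
}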

Other Functionalities

Although the API shown above should be sufficient for most purposes, this repository contains a substantial amount of additional code which some users may find helpful. This includes scripts which may be used to reproduce our published results. The bash files in jobs/ showcase most of this functionality; please see the README file for more information about each set of commands.

License

Code, models, and data are released under the PhysioNet Credentialed Health Data License (Version 1.5.0). You may only use these resources for non-commercial, scientific research purposes.

Contributors

This toolkit would not have been possible without the help of the contributors below. If you are interested in extending the functionality of the toolkit and would like to contribute, please reach out!

  • Ayah Zirikly
  • Brant Chee
  • Yahan Li
  • Alya Ahmad
  • Anne R. Links
  • Somnath Saha
  • Mary Catherine Beach
  • Mark Dredze