loubnabnl/canine-mednli

CANINE for Medical Natural Language Inference on MedNLI data

We are interested in Natural Language Inference (NLI) on medical data using CANINE, a pre-trained tokenization-free encoder that operates directly on character sequences, without explicit tokenization or a fixed vocabulary; it is available in this repo. We want to predict the relation between a hypothesis and a premise as entailment, contradiction, or neutral using MedNLI, a medical dataset annotated by doctors for NLI. We also fine-tune BERT for comparison.
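To illustrate what "operating directly on character sequences" means, here is a minimal sketch that maps text to Unicode code points instead of looking words up in a vocabulary. The helper `to_code_points` is hypothetical and is not part of this repo; the actual CANINE preprocessing also handles special symbols and padding, which are omitted here.

```python
from typing import List


def to_code_points(text: str) -> List[int]:
    """Map each character to its Unicode code point (no vocabulary lookup)."""
    return [ord(ch) for ch in text]


clean = "hypertension"
noisy = "hypertenison"  # same word with two adjacent letters swapped

clean_ids = to_code_points(clean)
noisy_ids = to_code_points(noisy)

# Both spellings map to valid inputs of the same length; only two
# positions differ, so there is no "unknown word" case to handle.
differing = [i for i, (a, b) in enumerate(zip(clean_ids, noisy_ids)) if a != b]
print(differing)
```

A subword tokenizer may split a misspelled word into unusual pieces, whereas a character-level input changes only at the corrupted positions.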

This work is part of a project for the course Algorithms for Speech and Natural Language Processing in the MVA master's program. The repo for the project, with more experiments using CANINE on different NLP tasks, can be found here.

Setup

# Clone this repository
git clone https://github.com/loubnabnl/canine-mednli.git
cd canine-mednli/
# Install packages
pip install -r requirements.txt

Data

Access to the data can be requested here. It contains training, validation and test sets of sentence pairs, each with a label for their relation. The data must be placed in the folder data/.

NLI

To use our fine-tuned BERT and CANINE models on MedNLI, download the weights from this link and place them in the folder trained-models/. To train a new model on MedNLI, run the following command:

python main.py --model canine --noisy False

Noise robustness

Since CANINE doesn't use a fixed vocabulary, it can be interesting to use it on noisy data containing many out-of-vocabulary words, misspellings and errors. We provide code to generate noisy versions of MedNLI for a given noise level by adding, deleting, replacing and swapping letters in the words. You can run the following commands:

cd ./utils
python noisy_data.py --noise_level 0.4
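The four letter-level operations described above can be sketched as follows. This is an illustrative sketch, not the code in noisy_data.py: the function names `add_noise_to_word` and `add_noise_to_sentence` are hypothetical, and it assumes the noise level is a per-character probability of corruption, which the repo may define differently.

```python
import random
import string


def add_noise_to_word(word: str, noise_level: float, rng: random.Random) -> str:
    """Corrupt each character with probability `noise_level` (assumed meaning)."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        if rng.random() >= noise_level:
            out.append(ch)  # keep the character unchanged
        else:
            op = rng.choice(["add", "delete", "replace", "swap"])
            if op == "add":
                # keep the character and insert a random letter after it
                out.append(ch)
                out.append(rng.choice(string.ascii_lowercase))
            elif op == "replace":
                out.append(rng.choice(string.ascii_lowercase))
            elif op == "swap":
                if i + 1 < len(word):
                    # exchange this character with the next one
                    out.append(word[i + 1])
                    out.append(ch)
                    i += 1  # the next character has been consumed
                else:
                    out.append(ch)  # nothing to swap with at the end
            # "delete": append nothing
        i += 1
    return "".join(out)


def add_noise_to_sentence(sentence: str, noise_level: float, seed: int = 0) -> str:
    """Apply word-level noise to every whitespace-separated word."""
    rng = random.Random(seed)  # seeded so the noisy data is reproducible
    return " ".join(add_noise_to_word(w, noise_level, rng) for w in sentence.split())


print(add_noise_to_sentence("the patient has a history of hypertension", 0.4))
```

Fixing the seed makes the generated noisy dataset reproducible, which helps when comparing models trained on the same corrupted data.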

To train and evaluate CANINE on noisy data, you can run:

python main.py --model canine --noisy True 

Results

Results on clean data:

| Model | Test accuracy (%) |
| --- | --- |
| BERT | 77.6±0.6 |
| CANINE-C | 73.07±0.3 |

Results of the noise robustness experiments: the left plot corresponds to training on clean data and testing on noisy data, and the right plot corresponds to training on noisy data as well.

[Figure: nli_noise2]

For the NLI task on clean MedNLI, we get an accuracy of 77.6% with BERT and 73.07% with CANINE. However, when we add noise with probability 0.4 to the test data, the accuracy of BERT drops to 59.92%, while the accuracy of CANINE drops only to 65.75%. Training the models on noisy data improves both, and CANINE still outperforms BERT, with a 1.4% difference in accuracy. This suggests that CANINE can be more suitable than BERT for noisy text, but on clean data we did not see an advantage for CANINE in this task.
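The size of each drop follows directly from the numbers reported above; the short sketch below only redoes that arithmetic (the dictionary layout is ours, the accuracies are those quoted in the text).

```python
# Test accuracies (%) quoted above: clean test set vs. test set with noise 0.4,
# both for models trained on clean data.
results = {
    "BERT": {"clean": 77.6, "noisy_test": 59.92},
    "CANINE": {"clean": 73.07, "noisy_test": 65.75},
}

# Absolute drop in accuracy, in percentage points, rounded to 2 decimals.
drops = {
    model: round(acc["clean"] - acc["noisy_test"], 2)
    for model, acc in results.items()
}

for model, drop in drops.items():
    print(f"{model}: accuracy drop of {drop:.2f} points")
```

BERT loses 17.68 points under noise, while CANINE loses 7.32 points.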
