Contrastive Predictive Coding for Natural Language

This repository contains a PyTorch implementation of CPC v1 for natural language (Section 3.3) from the paper Representation Learning with Contrastive Predictive Coding.

Implementation Details

I followed the details described in Section 3.3 of the paper and obtained the missing details directly from one of the paper's authors. A PyTorch sketch of the resulting architecture is included below, after the layer descriptions.

Embedding layer

  • vocabulary size: 20 000
  • dimension: 620

Encoder layer (g_enc)

  • 1D-convolution + ReLU + mean-pooling
  • output dimension: 2400

Recurrent Layer (g_ar)

  • GRU
  • dimension: 2400

Prediction Layer {W_k}

  • Fully connected
  • timesteps: 3
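
The following is a minimal PyTorch sketch of the architecture described above. It is not the repository's code: the convolution kernel size, class and variable names, and tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CPCSentenceModel(nn.Module):
    """Sketch of the CPC v1 sentence model: embedding -> g_enc -> g_ar -> {W_k}."""
    def __init__(self, vocab_size=20000, emb_dim=620, enc_dim=2400, n_predictions=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)        # 20,000 x 620 embedding
        # g_enc: 1D convolution + ReLU + mean-pooling (kernel size assumed)
        self.conv = nn.Conv1d(emb_dim, enc_dim, kernel_size=3, padding=1)
        # g_ar: GRU over the sequence of sentence embeddings
        self.gru = nn.GRU(enc_dim, enc_dim, batch_first=True)
        # {W_k}: one fully connected layer per future step, k = 1..3
        self.predictors = nn.ModuleList(
            [nn.Linear(enc_dim, enc_dim) for _ in range(n_predictions)]
        )

    def encode(self, tokens):
        # tokens: (batch, n_sentences, seq_len) of word ids
        b, s, t = tokens.shape
        x = self.embedding(tokens.reshape(b * s, t))       # (b*s, t, emb_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))       # (b*s, enc_dim, t)
        return x.mean(dim=2).reshape(b, s, -1)             # mean-pool over words

    def forward(self, tokens):
        z = self.encode(tokens)                  # sentence embeddings z_t
        c, _ = self.gru(z)                       # context vectors c_t
        preds = [w(c) for w in self.predictors]  # predictions of z_{t+k} from c_t
        return z, c, preds
```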

Training details

  • the input is a window of 6 sentences
  • maximum sequence length of 32 tokens per sentence
  • negative samples are drawn from both the batch and time dimensions of the minibatch (see the loss sketch after this list)
  • Adam optimizer with a learning rate of 2e-4
  • trained on 8 GPUs, each with a batch size of 64
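
Below is a minimal sketch of an InfoNCE loss with negatives drawn from both the batch and time dimensions, written against the z, c, preds returned by the model sketch above. It is an illustration under those assumptions, not the repository's loss implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(preds, z):
    """For each prediction W_k(c_t), the positive is z_{t+k}; every other
    encoded sentence in the minibatch (batch and time dimensions) is a negative."""
    b, s, d = z.shape
    z_flat = z.reshape(b * s, d)                       # all candidate sentences
    loss = torch.zeros((), device=z.device)
    for k, pred in enumerate(preds, start=1):          # pred = W_k(c), shape (b, s, d)
        if s - k <= 0:
            continue
        p = pred[:, : s - k, :].reshape(-1, d)         # predictions with a valid target
        scores = p @ z_flat.t()                        # similarity against all candidates
        # row index of the true z_{t+k} inside z_flat for each (batch, t) pair
        batch_idx = torch.arange(b, device=z.device).repeat_interleave(s - k)
        time_idx = torch.arange(s - k, device=z.device).repeat(b) + k
        loss = loss + F.cross_entropy(scores, batch_idx * s + time_idx)
    return loss / len(preds)
```

Optimization would then use Adam with a learning rate of 2e-4, e.g. torch.optim.Adam(model.parameters(), lr=2e-4), with a batch size of 64 per GPU.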

Requirements

Usage Instructions

1. Pretraining

Configuration File

This implementation uses a configuration file for convenient setup of the model. The config_cpc.yaml file includes the original parameters by default. To get started, you only have to adjust the following parameters (see the example below):

  • logging_dir: directory for logging files
  • books_path: directory containing the dataset

Optionally, if you want to log your experiments with comet.ml, you just need to install the library and set your api_key.
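
For example, a minimal edit of config_cpc.yaml might look like this (the paths are placeholders; every other parameter keeps its default value):

```yaml
logging_dir: /path/to/logs        # directory for logging files
books_path: /path/to/BookCorpus   # directory containing the dataset
```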

Dataset

This model uses the BookCorpus dataset for pretraining. You have to organize your data according to the following structure:

├── BookCorpus
│   └── data
│       ├── file_1.txt
│       ├── file_2.txt 

Then set the path of your dataset in the books_path parameter of the config_cpc.yaml file.

Note: You can use the publicly available files provided by Igor Brigadir at your own risk.

Training

When you have completed all the steps above, you can run:

python main.py

The implementation automatically saves a log of the experiment with the name cpc-date-hour and also saves the model checkpoints with the same name.

Resume Training

If you want to resume training your model, you just need to write the name of your experiment (cpc-date-hour) in the resume_name parameter of the config_cpc.yaml file and then run main.py again.
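
For example (the experiment name shown is a placeholder):

```yaml
resume_name: cpc-date-hour   # name of the experiment to resume
```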

2. Vocabulary Expansion

The CPC model employs vocabulary expansion in the same way as the Skip-Thought model: a linear mapping is learned from a pretrained word2vec space to the model's embedding space, so that words outside the training vocabulary can still be embedded. You just need to modify the run_name and word2vec_path parameters and then execute:

python vocab_expansion.py

The result is a NumPy file of embeddings and a pickle file of the vocabulary; both will appear in a folder named vocab_expansion/.
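
A minimal sketch of the idea follows. It is not the repository's vocab_expansion.py: the function and variable names are illustrative, and gensim and scikit-learn are assumed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def expand_vocabulary(model_vocab, model_emb, w2v):
    """model_vocab: dict mapping word -> row index of model_emb.
    model_emb: (V_model, 620) trained CPC embedding matrix as a NumPy array.
    w2v: gensim KeyedVectors holding the larger word2vec vocabulary."""
    shared = [w for w in model_vocab if w in w2v]
    X = np.stack([w2v[w] for w in shared])                      # word2vec vectors
    Y = np.stack([model_emb[model_vocab[w]] for w in shared])   # CPC embedding vectors
    mapping = LinearRegression().fit(X, Y)                      # linear map: word2vec -> CPC
    expanded_vocab = {w: i for i, w in enumerate(w2v.index_to_key)}
    expanded_emb = mapping.predict(w2v.vectors)                 # embeddings for every w2v word
    return expanded_vocab, expanded_emb
```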

3. Training a Classifier

Configuration File

This implementation also uses a configuration file for the classifier. You have to set the following parameters of the config_clf.yaml file (an example follows the list):

  • logging_dir: directory for logging files
  • cpc_path: path of the pretrained cpc model file
  • expanded_vocab: set to True to use the expanded vocabulary
  • dataset_path: directory containing the benchmark datasets
  • dataset_name: name of the task (e.g. CR, TREC, etc.)
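
For example (paths and the task name are placeholders; the checkpoint filename is an assumption):

```yaml
logging_dir: /path/to/logs                 # directory for logging files
cpc_path: /path/to/logs/cpc-date-hour.pt   # pretrained CPC checkpoint (filename assumed)
expanded_vocab: True                       # use the expanded vocabulary
dataset_path: /path/to/benchmarks          # directory containing the benchmark datasets
dataset_name: TREC                         # name of the task
```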

Dataset

The classifier uses common NLP benchmark datasets. You have to organize your data according to the following structure:

├── dataset_name
│   └── data
│       └── task_name
│           ├── task_name.train.txt
│           ├── task_name.dev.txt 

Then you have to set the path of your data (dataset_path) and the task name (dataset_name) in the config_clf.yaml file.

Note: You can use the publicly available files provided by zenRRan.

Training

When you have completed the steps above, you can run:

python main_clf.py

The implementation automatically saves a log of the experiment with the name cpc-clf-date-hour and also saves the model checkpoints with the same name.

Disclaimer

The model should be trained for 1e8 steps with a batch size of 64 on each of 8 GPUs. The authors provided me with a snapshot of the first 1M training steps, which you can find here, and you can find the results of my implementation here. There is a slight difference, which may be due to factors such as the dataset or the initialization. I have not been able to train the model fully, so I have not replicated the benchmark results.

If anyone is able to fully train the model, feel free to share the results. I will be happy to answer any questions or comments.

References

  • Aaron van den Oord, Yazhe Li, Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748, 2018.
  • Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler. Skip-Thought Vectors. arXiv:1506.06726, 2015.
