
No Syntaxation Without Representation: Syntactic Considerations for Neural Machine Translation Data Augmentation

Paper: NoSyntaxationWithoutRepresentation.pdf

files and descriptions

  • notebooks to obtain the data:
    • Download_dataset_iwslt2017.ipynb: downloads the dataset and produces the 10% sample used in the paper
  • notebooks to train models:
    • TrainLSTM.ipynb: all LSTM methods excluding sequence matching methods
    • similarity_ds_k=2.ipynb and similarity_ds_k=10.ipynb: LSTM sequence matching methods using similarity
    • TrainTransformer.ipynb: all Transformer methods
    • LanguageModel.ipynb: training the language model used in LMsample and soft methods
  • other notebooks:
    • BEAM_BLEU.ipynb: evaluation; re-computes BLEU scores with beam search and computes POS BLEU scores
    • LM_POS_Experiments.ipynb: experiment; examines how well the language model matches part of speech
    • CustomTransformer.ipynb: development; used to develop and test the transformer architecture, with links to transformer resources
  • functions for transformer models:
    • embeddingTF.py: Embedder and PositionalEncoding
    • sublayersTF.py: SublayerConnection (layer norm & residual connection), FeedForward, attention, MultiHeadedAttention, and clones (replicates layers); see the first sketch after this list
    • layersTF.py: EncoderLayer and DecoderLayer
    • stacksTF.py: Encoder and Decoder, which construct the encoder and decoder stacks from the encoder and decoder layers, respectively
    • encoderTF.py: FullEncoder, which allows augmentation to occur within the embedding → positional encoding → encoder pipeline
    • decoderTF.py: FullDecoder, which allows augmentation to occur within the embedding → positional encoding → decoder pipeline
    • seq2seqTF.py: Seq2SeqTF, which contains the custom encoder and decoders and fully defines the transformer seq2seq model
    • batchTF.py: BatchTF, which formats source and target inputs to yield shifted targets, a source mask, and a target mask (future_mask provides the decoder-specific masking; see the second sketch after this list)
    • trainTF.py: train, which builds the training scheme from train_epoch and val_epoch, plus greedy_decode and translate_corpus
  • functions for lstm models:
    • train.py: training functions for LSTM models
    • Seq2Seq.py: model class
    • EncoderLSTM.py: encoder class, including functions for all augmentations
    • DecoderLSTM.py: decoder class, including functions for seqmix augmentations
  • other functions:
    • load_data.py: creating and loading pickled datasets and dataloaders
    • load_lm.py: load the language model developed in LanguageModel.ipynb
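
For orientation, here is a minimal sketch of the clones and SublayerConnection building blocks named above, following the Annotated Transformer style this architecture draws on; the repository's implementations may differ in detail (e.g. pre- vs. post-normalization).

import copy
import torch.nn as nn

def clones(module: nn.Module, n: int) -> nn.ModuleList:
    # n independent deep copies of a layer, used to build the encoder/decoder stacks.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])

class SublayerConnection(nn.Module):
    # Residual connection wrapped around a layer-normalized sublayer.
    def __init__(self, size: int, dropout: float):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Pre-norm residual: x + dropout(sublayer(norm(x)))
        return x + self.dropout(sublayer(self.norm(x)))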
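And a minimal sketch of the future-position masking that batchTF.py's future_mask provides, assuming the standard transformer convention of blocking attention to later positions; exact shapes and dtypes in the repository may differ.

import torch

def future_mask(size: int) -> torch.Tensor:
    # True where position i may attend to position j (j <= i), so the
    # decoder cannot peek at future target tokens during training.
    attn_shape = (1, size, size)
    subsequent = torch.triu(torch.ones(attn_shape, dtype=torch.uint8), diagonal=1)
    return subsequent == 0

# Example: future_mask(4)[0] is a 4x4 lower-triangular matrix of True values.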

data access

option 1. download and re-build dataloaders

  • Download the full data from torchtext:

from torchtext.datasets import IWSLT2017

# German-to-English language pair; each iterator yields (de, en) sentence pairs.
train_iter, valid_iter, test_iter = IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))

  • Run Download_dataset_iwslt2017.ipynb to take a 10% sample of the dataset and save it as pickles
  • Run load_and_save(batch1 = True) from load_data.py to build the dataloaders used in our LSTM models and save them to pickle files
  • Run load_and_save(batch1 = False) from load_data.py to build the dataloaders used in our Transformer models and save them to pickle files
  • In our code, we use load_pickled_dataloaders(batch1 = True) and load_pickled_dataloaders(batch1 = False) from load_data.py to load the dataloaders from the pickle files. You'll need to pass in PARENT_DIR as the location of your data folder (see the sketch below).
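
A minimal sketch of the workflow above; the unpacked return values of load_pickled_dataloaders are assumptions here, so check load_data.py for the actual signature.

from load_data import load_and_save, load_pickled_dataloaders

# One-time: build the dataloaders and save them to pickle files.
load_and_save(batch1=True)   # dataloaders for the LSTM models
load_and_save(batch1=False)  # dataloaders for the Transformer models

# Later: reload them from the pickles. PARENT_DIR is the location of your
# data folder; the unpacked names below are illustrative, not guaranteed.
PARENT_DIR = './data'
train_dl, valid_dl, test_dl = load_pickled_dataloaders(batch1=True, PARENT_DIR=PARENT_DIR)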

option 2. use our pre-built dataloaders

Packages

  • sys 3.7.12
  • tqdm 4.62.3
  • numpy 1.19.5
  • matplotlib 3.2.2
  • pandas 1.1.5
  • torch 1.10.0+cu111
  • torchtext 0.11.0
  • spacy 2.2.4
  • transformers 4.6.0
  • sentence_transformers 2.1.0
  • os *
  • typing *
  • pickle *
  • timeit *
  • operator *
  • collections *
  • copy *
  • random *
  • math *

* (Python 3.6.9 Standard Library)
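
A quick way to confirm that an environment matches the pinned versions above (only the non-standard-library packages with a __version__ attribute are checked):

import torch, torchtext, spacy, transformers

print('torch', torch.__version__)                # expect 1.10.0+cu111
print('torchtext', torchtext.__version__)        # expect 0.11.0
print('spacy', spacy.__version__)                # expect 2.2.4
print('transformers', transformers.__version__)  # expect 4.6.0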

About

Machine Translation Data Augmentation Methods Maintaining Part of Speech
