
No Syntaxation Without Representation: Syntactic Considerations for Neural Machine Translation Data Augmentation

Paper: NoSyntaxationWithoutRepresentation.pdf

files and descriptions

  • notebooks to obtain the data:
    • Download_dataset_iwslt2017.ipynb: downloads the dataset and produces the 10% sample used in the paper
  • notebooks to train models:
    • TrainLSTM.ipynb: all LSTM methods excluding sequence matching methods
    • similarity_ds_k=2.ipynb and similarity_ds_k=10.ipynb: LSTM sequence matching methods using similarity
    • TrainTransformer.ipynb: all Transformer methods
    • LanguageModel.ipynb: training the language model used in LMsample and soft methods
  • other notebooks:
    • BEAM_BLEU.ipynb: evaluation; re-computes BLEU scores with beam search and computes POS BLEU scores
    • LM_POS_Experiments.ipynb: experiment; examines how well the language model matches part of speech
    • CustomTransformer.ipynb: development; used to develop and test the transformer architecture, with links to transformer resources
  • functions for transformer models:
    • embeddingTF.py: Embedder and PositionalEncoding
    • sublayersTF.py: SublayerConnection (layer norm & residual connection), FeedForward, attention, MultiHeadedAttention, and clones (replicates layers); see the first sketch after this list
    • layersTF.py: EncoderLayer and DecoderLayer
    • stacksTF.py: Encoder and Decoder, which construct the encoder and decoder stacks from the encoder and decoder layers, respectively
    • encoderTF.py: FullEncoder, which allows augmentation to occur within the embedding → positional encoding → encoder pipeline
    • decoderTF.py: FullDecoder, which allows augmentation to occur within the embedding → positional encoding → decoder pipeline
    • seq2seqTF.py: Seq2SeqTF, which contains the custom encoder and decoders and fully defines the transformer seq2seq model
    • batchTF.py: BatchTF, which formats source and target inputs to yield shifted targets, a source mask, and a target mask (future_mask provides the decoder-specific masking; see the second sketch after this list)
    • trainTF.py: train, which builds the training scheme from train_epoch and val_epoch, plus greedy_decode and translate_corpus
  • functions for lstm models:
    • train.py: training functions for LSTM models
    • Seq2Seq.py: model class
    • EncoderLSTM.py: encoder class, including functions for all augmentations
    • DecoderLSTM.py: decoder class, including functions for seqmix augmentations
  • other functions:
    • load_data.py: creating and loading pickled datasets and dataloaders
    • load_lm.py: load the language model developed in LanguageModel.ipynb
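
For orientation, here is a minimal sketch of the clones and SublayerConnection building blocks named above, following the Annotated Transformer style this architecture draws on; the repository's implementations may differ in detail (e.g. pre- vs. post-normalization).

import copy
import torch.nn as nn

def clones(module: nn.Module, n: int) -> nn.ModuleList:
    # n independent deep copies of a layer, used to build the encoder/decoder stacks.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])

class SublayerConnection(nn.Module):
    # Residual connection wrapped around a layer-normalized sublayer.
    def __init__(self, size: int, dropout: float):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Pre-norm residual: x + dropout(sublayer(norm(x)))
        return x + self.dropout(sublayer(self.norm(x)))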
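And a minimal sketch of the future-position masking that batchTF.py's future_mask provides, assuming the standard transformer convention of blocking attention to later positions; exact shapes and dtypes in the repository may differ.

import torch

def future_mask(size: int) -> torch.Tensor:
    # True where position i may attend to position j (j <= i), so the
    # decoder cannot peek at future target tokens during training.
    attn_shape = (1, size, size)
    subsequent = torch.triu(torch.ones(attn_shape, dtype=torch.uint8), diagonal=1)
    return subsequent == 0

# Example: future_mask(4)[0] is a 4x4 lower-triangular matrix of True values.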

data access

option 1. download and re-build dataloaders

  • Download the full data from torchtext:

from torchtext.datasets import IWSLT2017

# German-to-English language pair; each iterator yields (de, en) sentence pairs.
train_iter, valid_iter, test_iter = IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))

  • Run Download_dataset_iwslt2017.ipynb to take a 10% sample of the dataset and save it as pickles
  • Run load_and_save(batch1 = True) from load_data.py to build the dataloaders used in our LSTM models and save them to pickle files
  • Run load_and_save(batch1 = False) from load_data.py to build the dataloaders used in our Transformer models and save them to pickle files
  • In our code, we use load_pickled_dataloaders(batch1 = True) and load_pickled_dataloaders(batch1 = False) from load_data.py to load the dataloaders from the pickle files. You'll need to pass in PARENT_DIR as the location of your data folder (see the sketch below).
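
A minimal sketch of the workflow above; the unpacked return values of load_pickled_dataloaders are assumptions here, so check load_data.py for the actual signature.

from load_data import load_and_save, load_pickled_dataloaders

# One-time: build the dataloaders and save them to pickle files.
load_and_save(batch1=True)   # dataloaders for the LSTM models
load_and_save(batch1=False)  # dataloaders for the Transformer models

# Later: reload them from the pickles. PARENT_DIR is the location of your
# data folder; the unpacked names below are illustrative, not guaranteed.
PARENT_DIR = './data'
train_dl, valid_dl, test_dl = load_pickled_dataloaders(batch1=True, PARENT_DIR=PARENT_DIR)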

option 2. use our pre-built dataloaders

Packages

  • sys 3.7.12
  • tqdm 4.62.3
  • numpy 1.19.5
  • matplotlib 3.2.2
  • pandas 1.1.5
  • torch 1.10.0+cu111
  • torchtext 0.11.0
  • spacy 2.2.4
  • transformers 4.6.0
  • sentence_transformers 2.1.0
  • os *
  • typing *
  • pickle *
  • timeit *
  • operator *
  • collections *
  • copy *
  • random *
  • math *

* (Python 3.6.9 Standard Library)
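
A quick way to confirm that an environment matches the pinned versions above (only the non-standard-library packages with a __version__ attribute are checked):

import torch, torchtext, spacy, transformers

print('torch', torch.__version__)                # expect 1.10.0+cu111
print('torchtext', torchtext.__version__)        # expect 0.11.0
print('spacy', spacy.__version__)                # expect 2.2.4
print('transformers', transformers.__version__)  # expect 4.6.0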

About

Machine Translation Data Augmentation Methods Maintaining Part of Speech
