
Investigate BERT on Non-linearity and Layer Commutativity

This repository implements a set of empirical experiments built on the official BERT code.

The accompanying paper is Of Non-Linearity and Commutativity in BERT.

Get started

Code Structure:

  • modeling, optimization, tokenization. Define the BERT model, optimizer, and tokenizer.
  • run_pretraining, run_classifier. Pre-training and fine-tuning on GLUE tasks.
  • create_pretraining_data. Create pre-processed data for pre-training (unlabeled large corpus --> TFRecord files).
  • run_finetune_glue.sh. Script for fine-tuning BERT on all GLUE tasks.
  • data_utils. Data processors for GLUE fine-tuning.
  • graph-mode. Refactored versions of run_pretraining.py and run_classifier.py that build the graph directly instead of using the TensorFlow Estimator API.
  • non-linearity. Experiments including training linear/non-linear approximators, replacing, removing, and freezing layers, and extracting hidden embeddings.
  • layer-commutativity. Experiments including swapping and shuffling layers (see the sketch after this list).
  • comparison. Comparison with simple MLP and CNN models.
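
For illustration, here is a minimal sketch of one way to realize a layer swap on a saved checkpoint. The helper swap_layers and its arguments are hypothetical, the scripts in layer-commutativity/ may implement this differently, and it assumes the standard bert/encoder/layer_<k>/ variable naming of official BERT checkpoints.

import tensorflow as tf  # TensorFlow 1.14, matching the repository

def swap_layers(in_ckpt, out_ckpt, i, j):
    """Write a new checkpoint in which encoder layers i and j exchange weights."""
    reader = tf.train.load_checkpoint(in_ckpt)
    prefix_i = "bert/encoder/layer_%d/" % i
    prefix_j = "bert/encoder/layer_%d/" % j
    swapped = {}
    for name in reader.get_variable_to_shape_map():
        new_name = name
        if name.startswith(prefix_i):
            new_name = prefix_j + name[len(prefix_i):]
        elif name.startswith(prefix_j):
            new_name = prefix_i + name[len(prefix_j):]
        swapped[new_name] = reader.get_tensor(name)

    # Rebuild the variables under their (possibly swapped) names and save them.
    with tf.Graph().as_default(), tf.Session() as sess:
        for name, value in swapped.items():
            tf.Variable(value, name=name)
        sess.run(tf.global_variables_initializer())
        tf.train.Saver().save(sess, out_ckpt)

# e.g. swap_layers("bert_model.ckpt", "bert_model_swapped.ckpt", 0, 5)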

Main dependencies:

  • Python 3.7.5
  • TensorFlow 1.14.0
  • PyTorch 1.5.0

For other dependencies, refer to the requirements file.

Pre-training & Fine-tuning

In this work, we mainly experiment on BERT-base and BERT-small. For BERT-base, we use the pre-trained weights released by the official BERT team. For BERT-small, we pre-trained the model ourselves, but you can also use the official pre-trained weights.

| Model      | Layers | Heads | Hidden Size | Max Seq Length | #Params | Pre-trained Weights |
|------------|--------|-------|-------------|----------------|---------|---------------------|
| BERT-base  | 12     | 12    | 768         | 512            | 110M    | Official            |
| BERT-small | 6      | 8     | 512         | 128            | 35M     | Official            |
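
As a concrete reading of the BERT-small row, the sketch below builds the corresponding configuration with the BertConfig class from modeling.py. The intermediate_size, vocab_size, and max_position_embeddings values are assumptions (4x the hidden size, the standard uncased English WordPiece vocabulary, and the usual 512 positions); check bert_config.json in the released weights for the exact values.

from modeling import BertConfig  # modeling.py from this repository

# Assumed BERT-small configuration; verify against the released bert_config.json.
bert_small_config = BertConfig(
    vocab_size=30522,             # assumption: standard uncased WordPiece vocabulary
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,       # assumption: 4 * hidden_size
    max_position_embeddings=512)  # assumption; pre-training uses max_seq_length=128
print(bert_small_config.to_json_string())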

The official BERT team did not release the pre-processed data used for pre-training, and the corpora they used, such as English Wikipedia and BookCorpus, are not readily available. We therefore pre-train BERT-small on OpenWebText, which comprises 38 GB of text. Note that pre-training BERT-small took around 8 days on a single Nvidia Tesla V100 GPU and around 5 days on two cards.

Converting the original unlabeled corpus to TFRecord files is both time- and resource-consuming, so we recommend using the pre-trained weights directly. If you want to pre-train BERT on your own text, you can use the following commands.

To save time, we provide our pre-trained BERT-small weights: Download

Here are some commands for pre-training.

Create pre-training data

python create_pretraining_data.py \
  --input_file=$RAW_TEXT_DIR \
  --output_file=$TFRECORD_DIR \
  --vocab_file=$MODEL_DIR/vocab.txt \
  --do_lower_case=true \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
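
Before launching an expensive pre-training run, it can help to sanity-check the generated files. The helper peek_tfrecord below is a hypothetical sketch; the feature names it should print (input_ids, input_mask, segment_ids, masked_lm_positions, masked_lm_ids, masked_lm_weights, next_sentence_labels) follow the output of the official create_pretraining_data.py.

import tensorflow as tf  # TensorFlow 1.14

def peek_tfrecord(path, n=1):
    """Print the feature names and value counts of the first n examples."""
    for i, record in enumerate(tf.python_io.tf_record_iterator(path)):
        if i >= n:
            break
        example = tf.train.Example.FromString(record)
        for name, feature in sorted(example.features.feature.items()):
            values = feature.int64_list.value or feature.float_list.value
            print("%s: %d values" % (name, len(values)))

# e.g. peek_tfrecord(<one of the files in $TFRECORD_DIR>)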

Pre-training using the Estimator API on a single GPU card

CUDA_VISIBLE_DEVICES="0" python run_pretraining.py \
  --input_file=$TFRECORD_DIR/*.tfrecord \
  --output_dir=$MODEL_DIR \
  --do_train=true \
  --do_eval=true \
  --bert_config_file=$MODEL_DIR/bert_config.json \
  --train_batch_size=256 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=1000000 \
  --num_warmup_steps=10000 \
  --learning_rate=1e-4 \
  --model_type=origin

Multiple GPU cards, implemented using Horovod

CUDA_VISIBLE_DEVICES="2,3" mpirun -np 2 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python run_pretraining_hvd.py \
    --input_file=$TFRECORD_DIR/*.tfrecord \
    --output_dir=$MODEL_DIR \
    --do_train=true \
    --do_eval=true \
    --bert_config_file=$MODEL_DIR/bert_config.json \
    --train_batch_size=256 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --num_train_steps=1000000 \
    --num_warmup_steps=10000 \
    --learning_rate=1e-4  
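
For reference, the snippet below sketches the Horovod-specific pieces such a script typically adds. This is a hedged outline, not the exact contents of run_pretraining_hvd.py, which may differ in the optimizer it uses and in whether the learning rate is scaled by the number of workers.

import horovod.tensorflow as hvd
import tensorflow as tf  # TensorFlow 1.14

hvd.init()

# Pin each process to a single GPU.
session_config = tf.ConfigProto()
session_config.gpu_options.visible_device_list = str(hvd.local_rank())

# Wrap the optimizer so gradients are averaged across workers.
# (BERT itself uses AdamWeightDecayOptimizer; plain Adam is only a placeholder here.)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
optimizer = hvd.DistributedOptimizer(optimizer)

# Broadcast the initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]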

Pre-training using the refactored graph-mode version, on a single GPU card

cd graph-mode/
bash run_pretrain.sh

In this work, we use the GLUE benchmark for fine-tuning. There are several ways to run it.

Fine-tune on a single GLUE task, e.g. MNLI-matched, using the Estimator API.

CUDA_VISIBLE_DEVICES="0" python run_classifier.py \
    --task_name=MNLIM  \
    --do_train=true  \
    --do_eval=true  \
    --do_predict=true  \
    --data_dir=$GLUE_DIR/MNLI  \
    --vocab_file=$MODEL_DIR/vocab.txt  \
    --bert_config_file=$MODEL_DIR/bert_config.json \
    --init_checkpoint=$MODEL_DIR/bert_model.ckpt  \
    --max_seq_length=128  \
    --train_batch_size=32  \
    --learning_rate=2e-5  \
    --num_train_epochs=3.0  \
    --output_dir=$MODEL_Finetune_DIR/mnlim_output/
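
With --do_predict=true, run_classifier.py writes tab-separated class probabilities to test_results.tsv in the output directory. The helper read_predictions below is a hypothetical sketch for turning those probabilities into predicted label indices.

import csv

def read_predictions(path):
    """Return the argmax label index for each line of tab-separated probabilities."""
    predictions = []
    with open(path) as f:
        for row in csv.reader(f, delimiter="\t"):
            probabilities = [float(p) for p in row]
            predictions.append(probabilities.index(max(probabilities)))
    return predictions

# e.g. read_predictions("mnlim_output/test_results.tsv")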

Use the refactored graph-mode version (you need to modify the parameters inside the bash script)

cd graph-mode/
bash run_finetune.sh

Fine-tune all GLUE tasks.

bash run_finetune_glue.sh

To read the fine-tuning results more easily.

python glue_result.reader.py --dir=$MODEL_Finetune_DIR
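
The sketch below shows the idea behind such a reader; collect_glue_results is a hypothetical helper. It assumes each task directory under $MODEL_Finetune_DIR contains the eval_results.txt file (with "key = value" lines) written by run_classifier.py during evaluation.

import os

def collect_glue_results(finetune_dir):
    """Gather the eval_results.txt metrics from every task output directory."""
    results = {}
    for task_dir in sorted(os.listdir(finetune_dir)):
        path = os.path.join(finetune_dir, task_dir, "eval_results.txt")
        if not os.path.exists(path):
            continue
        with open(path) as f:
            results[task_dir] = dict(
                line.strip().split(" = ", 1) for line in f if " = " in line)
    return results

# e.g. collect_glue_results("$MODEL_Finetune_DIR")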
