
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Introduction

This repo contains both the data and code to train and run automatic dubbing: translating the speech in a video into a new language such that the new speech sounds natural when overlaid on the original video.

Our system takes in German videos like this, along with their transcripts (e.g. "Sein Hauptgebiet waren es, romantische und gotische Poesie aus dem Englischen ins Japanische zu übersetzen.", in English: "His main field was translating romantic and Gothic poetry from English into Japanese."):

[Video: test_set.178.mp4]

It produces videos like this, dubbed into English:

[Video: test_set.178.en.mp4]

Setting up the environment

Install Miniconda/Anaconda if needed. For training models, it is assumed that you have at least 1 GPU, with CUDA drivers set up. This has been tested on 1 NVIDIA V100 GPU with CUDA 11.7, on Ubuntu 20.04.

sudo apt install git-lfs awscli ffmpeg
git lfs install --skip-repo
git clone https://github.com/amazon-science/iwslt-autodub-task.git --recursive
cd iwslt-autodub-task

# Create a conda environment
conda env create --file environment.yml
conda activate iwslt-autodub

# Download Prism model for evaluation
cd third_party/prism
conda create -n prism python=3.7 -y
conda activate prism
pip install -r requirements.txt
conda deactivate
wget http://data.statmt.org/prism/m39v1.tar
tar xf m39v1.tar
rm m39v1.tar
cd ../..

Download and extract data

Download the CoVoST2 en-de dataset following these steps, or directly follow instructions at https://github.com/facebookresearch/covost#covost-2.

  • First, download the Common Voice audio clips and transcripts (English, version 4). Note that after filling out the form, you can copy the URL from the download button and fetch the archive with wget.
  • Next, extract validated.tsv from it:
mkdir covost_tsv
tar -xvf en.tar.gz validated.tsv
mv validated.tsv covost_tsv/
  • Then extract the required TSV files:
# Download and split CoVoST2 TSV files
pushd covost_tsv
wget https://dl.fbaipublicfiles.com/covost/covost_v2.en_de.tsv.tar.gz https://raw.githubusercontent.com/facebookresearch/covost/main/get_covost_splits.py
tar -xzf covost_v2.en_de.tsv.tar.gz
python3 get_covost_splits.py -v 2 --src-lang en --tgt-lang de --root . --cv-tsv validated.tsv
popd

You should now have covost_v2.en_de.dev.tsv, covost_v2.en_de.test.tsv, and covost_v2.en_de.train.tsv in the covost_tsv directory.

  • Then extract MFA files:
mkdir covost_mfa
tar -xf data/training/covost2_mfa.tz -C covost_mfa
mv covost_mfa/covost2_mfa covost_mfa/data

All the JSON files should now be in covost_mfa/data.

Compute the distribution of speech durations

python3 get_durations_frequencies.py ./covost_mfa/data 

This computes how often each speech duration occurs in the training data, so that durations can be binned correctly.
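For intuition, the sketch below shows roughly what this step computes, under the assumption that each MFA JSON file in covost_mfa/data exposes a per-utterance duration (the "duration" key here is hypothetical; see get_durations_frequencies.py for the actual field names and binning logic):

# Illustrative sketch only; get_durations_frequencies.py is the real implementation.
# Assumes each MFA JSON file stores an utterance duration under a "duration" key.
import json
from collections import Counter
from pathlib import Path

counts = Counter()
for path in Path("covost_mfa/data").glob("*.json"):
    record = json.loads(path.read_text())
    counts[round(record["duration"], 1)] += 1  # bucket durations to 0.1s

for duration, freq in sorted(counts.items()):
    print(f"{duration:5.1f}s: {freq}")

The resulting frequency distribution informs how durations are mapped to bin tokens (such as the <bin44> and <bin87> tokens that appear in the decoding inputs later).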

Build dataset

Create the processed datasets for training/evaluation ("text and noised binned segments -> phones and durations" is the provided baseline):

# text -> text. Used to generate references for automatic evaluation of translation quality.
python3 build_datasets.py --en en-text-without-durations --de de-text-without-durations

# text and noised binned segments -> phones and durations. For the baseline model.
python3 build_datasets.py --en en-phones-durations --de de-text-noisy-durations --noise-std 0.1 --upsampling 10 --write-segments-to-file

For full usage options, run python3 build_datasets.py -h.
For factored models, make sure you use the --write-segments-to-file option, since it generates files that are required for preparing the factored data.

Prepare target factor files

For the factored baselines, you need to prepare the datasets in the factored formats and generate the auxiliary factors.

python3 separate_factors.py -i processed_datasets/de-text-noisy-durations0.1-en-phones-durations -o multi_factored

This will generate target factor input files in processed_datasets/de-text-noisy-durations0.1-en-phones-durations/multi_factored/.

  • *.en.text: Main output containing the original text, with <shift> tokens to account for internal factor shifts so that the factors are conditioned on the main output.
  • *.en.duration: Main target factor to predict durations. Contains the durations (number of frames) corresponding to each phoneme in *.en.text.
  • *.en.total_duration_remaining: Auxiliary factor to count down the number of frames remaining in each line. This is calculated from the (noised) segment durations, and counts down by the number of frames generated at each time step. Note that this may not count down to 0 due to the noise added to the segment durations.
  • *.en.segment_duration_remaining: Auxiliary factor to count down the number of frames remaining in each segment, i.e. until a [pause] token is encountered. Similar to the previous factor, but initialized by the corresponding target segment duration for each segment within a line.
  • *.en.pauses_remaining: Auxiliary factor that counts down the number of [pause] tokens remaining in a line.

This is what they should look like:

$ head -2 processed_datasets/de-text-noisy-durations0.1-en-phones-durations/multi_factored/test.en.{text,duration,total_duration_remaining,segment_duration_remaining,pauses_remaining}
==> processed_datasets/de-text-noisy-durations0.1-en-phones-durations/multi_factored/test.en.text <==
<shift> F AO1 R CH AH0 N AH0 T L IY0 <eow> DH AH1 <eow> R EY1 T S <eow> AH1 V <eow> D EH1 TH S <eow> IH1 N <eow> DH IY0 <eow> Y UW0 N AY1 T IH0 D <eow> K IH1 NG D AH0 M <eow> HH AE1 V <eow> R IH0 D UW1 S T <eow> sp
<shift> AY1 <eow> D IH1 D <eow> HH AE1 V <eow> sp T AH0 <eow> K AH1 T <eow> sp AH0 <eow> sp F Y UW1 <eow> sp K AO1 R N ER0 Z <eow>

==> processed_datasets/de-text-noisy-durations0.1-en-phones-durations/multi_factored/test.en.duration <==
0 12 3 8 12 4 5 5 9 8 14 0 5 7 0 7 17 5 5 0 10 10 0 3 12 13 3 0 13 13 0 2 3 0 3 5 5 13 8 6 10 0 5 8 12 6 8 7 0 5 8 7 0 3 12 8 16 12 14 0 24
0 11 0 17 9 15 0 1 4 5 0 9 3 5 0 20 6 7 0 8 7 0 7 5 4 16 0 3 11 13 4 6 26 24 0

==> processed_datasets/de-text-noisy-durations0.1-en-phones-durations/multi_factored/test.en.total_duration_remaining <==
413 401 398 390 378 374 369 364 355 347 333 333 328 321 321 314 297 292 287 287 277 267 267 264 252 239 236 236 223 210 210 208 205 205 202 197 192 179 171 165 155 155 150 142 130 124 116 109 109 104 96 89 89 86 74 66 50 38 24 24 0
246 235 235 218 209 194 194 193 189 184 184 175 172 167 167 147 141 134 134 126 119 119 112 107 103 87 87 84 73 60 56 50 24 0 0

==> processed_datasets/de-text-noisy-durations0.1-en-phones-durations/multi_factored/test.en.segment_duration_remaining <==
413 401 398 390 378 374 369 364 355 347 333 333 328 321 321 314 297 292 287 287 277 267 267 264 252 239 236 236 223 210 210 208 205 205 202 197 192 179 171 165 155 155 150 142 130 124 116 109 109 104 96 89 89 86 74 66 50 38 24 24 0
246 235 235 218 209 194 194 193 189 184 184 175 172 167 167 147 141 134 134 126 119 119 112 107 103 87 87 84 73 60 56 50 24 0 0

==> processed_datasets/de-text-noisy-durations0.1-en-phones-durations/multi_factored/test.en.pauses_remaining <==
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
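The countdown factors can be reproduced directly from the duration factor. As a sanity check, this illustrative snippet recomputes the second line of test.en.total_duration_remaining from the second line of test.en.duration (separate_factors.py is the authoritative implementation; in the real data the countdown is initialized from the noised segment durations, so it need not end exactly at 0):

# Recompute total_duration_remaining from the per-token durations.
durations = [0, 11, 0, 17, 9, 15, 0, 1, 4, 5, 0, 9, 3, 5, 0, 20, 6, 7,
             0, 8, 7, 0, 7, 5, 4, 16, 0, 3, 11, 13, 4, 6, 26, 24, 0]
left = sum(durations)  # 246 frames in total for this line
remaining = []
for d in durations:
    left -= d
    remaining.append(left)
print(" ".join(map(str, remaining)))
# -> 246 235 235 218 209 194 ... 50 24 0 0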

Decode test set and evaluate using provided Sockeye baseline model

NOTE: Test set here does not refer to the specific subsets in data/test/; rather it refers to the full test sets generated from CoVoST2.

We provide a baseline model checkpoint in models/sockeye/trained_baselines/baseline_factored_noised0.1. It uses a target factor to predict durations, plus additional target factors that help the model keep track of time. Gaussian noise (std. dev. 0.1) is added to the training segment durations to teach the model to be flexible about timing, in the hope of striking a balance between speech overlap, speech naturalness, and translation quality. (Note that the speech overlap in real human dubs is only about 70%.)
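How exactly the noise is applied is defined in build_datasets.py (via --noise-std 0.1); one plausible reading, multiplicative Gaussian noise relative to the segment length, would look like the following sketch (an assumption for illustration, not the script's confirmed formula):

# Assumed multiplicative noising of a segment duration (relative std. dev. 0.1).
# Check build_datasets.py for the actual scheme.
import random

def noise_duration(frames, std=0.1):
    return max(1, round(frames * random.gauss(1.0, std)))

print(noise_duration(413))  # typically within ~10% of 413 frames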

Before you proceed, set ROOT in sockeye_scripts/config to the path of this repo, e.g. ROOT=~/iwslt-autodub-task.

Before decoding, make sure you have run the data and factor preparation steps, so that you have at least processed_datasets/de-text-noisy-durations0.1-en-phones-durations prepared with its multi_factored subdirectory, and processed_datasets/de-text-without-durations-en-text-without-durations for the translation reference text files. If you ran the steps in the previous sections, you will already have these.

Decoding using baseline_factored_noised0.1

Decoding expects a specific JSON input format, which can be prepared using:

$ python3 sockeye_scripts/decoding/create-json-inputs.py -d processed_datasets/de-text-noisy-durations0.1-en-phones-durations --subset test --output-segment-durations -o processed_datasets/de-text-noisy-durations0.1-en-phones-durations/test.de.json

# Check JSON file looks like this
$ head -2 ~/iwslt-autodub-task/processed_datasets/de-text-noisy-durations0.1-en-phones-durations/test.de.json | jq
{
  "text": "Glück@@ licherweise sind die Ster@@ ber@@ aten im Vereinigten Königreich ges@@ unken <||> <bin87>",
  "target_prefix": "<shift>",
  "target_prefix_factors": [
    "0",
    "413",
    "413",
    "0"
  ],
  "target_segment_durations": [
    413
  ],
  "use_target_prefix_all_chunks": false
}
{
  "text": "Ich musste einige Ab@@ str@@ ich@@ e mach@@ en@@ . <||> <bin44>",
  "target_prefix": "<shift>",
  "target_prefix_factors": [
    "0",
    "246",
    "246",
    "0"
  ],
  "target_segment_durations": [
    246
  ],
  "use_target_prefix_all_chunks": false
}
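Each of these input lines can also be assembled programmatically. The sketch below mirrors the example above (create-json-inputs.py is the authoritative implementation; the factor order follows the --force-factors-stepwise flags used for decoding below):

# Assemble one decoding input line in the JSON format shown above.
# Illustrative sketch; create-json-inputs.py is the real implementation.
import json

def make_input(bpe_text, bin_token, segment_durations):
    total = sum(segment_durations)
    return json.dumps({
        "text": f"{bpe_text} <||> {bin_token}",
        "target_prefix": "<shift>",
        # Order: frames, total_remaining, segment_remaining, pauses_remaining
        "target_prefix_factors": ["0", str(total), str(segment_durations[0]), "0"],
        "target_segment_durations": segment_durations,
        "use_target_prefix_all_chunks": False,
    })

print(make_input("Ich musste einige Ab@@ str@@ ich@@ e mach@@ en@@ .", "<bin44>", [246]))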

To decode using baseline_factored_noised0.1, run

mkdir -p models/sockeye/trained_baselines/baseline_factored_noised0.1/eval
sockeye-translate \
    -i processed_datasets/de-text-noisy-durations0.1-en-phones-durations/test.de.json \
    -o models/sockeye/trained_baselines/baseline_factored_noised0.1/eval/test.en.output \
    --models models/sockeye/trained_baselines/baseline_factored_noised0.1/model \
    --checkpoints 78 \
    -b 5 \
    --batch-size 32 \
    --chunk-size 20000 \
    --output-type translation_with_factors \
    --max-output-length 768 \
    --force-factors-stepwise frames total_remaining segment_remaining pauses_remaining \
    --json-input \
    --quiet

This should take around 5 minutes on 1 V100 GPU, or around an hour without a GPU.

Evaluate baseline_factored_noised0.1 output

./sockeye_scripts/evaluation/evaluate-factored.sh processed_datasets/de-text-noisy-durations0.1-en-phones-durations/test.en models/sockeye/trained_baselines/baseline_factored_noised0.1/eval/test.en.output

This will print:

  • Translation quality metrics
    • BLEU
    • Prism
    • COMET
  • Speech overlap metrics (one illustrative definition is sketched below)
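The exact metric definitions live in sockeye_scripts/evaluation/evaluate-factored.sh and the scripts it calls. As a rough illustration of the overlap idea, here is one plausible score, the fraction of dubbed speech time that falls inside the original speech segments (an assumed definition, not necessarily the one the script uses):

# Assumed speech-overlap score for illustration; see evaluate-factored.sh
# for the actual metric. Segments are (start, end) times in seconds.
def overlap_score(src_segments, dub_segments):
    def intersect(a, b):
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    covered = sum(intersect(s, d) for s in src_segments for d in dub_segments)
    dubbed = sum(end - start for start, end in dub_segments)
    return covered / dubbed if dubbed else 0.0

print(overlap_score([(0.0, 4.1)], [(0.2, 4.0)]))  # 1.0: dub lies entirely within the source speech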

Reproduce Sockeye baseline models

Scripts are provided to reproduce the Sockeye baselines. Before launching training for the factored models, you need specially created vocab files, which can be generated using

cd sockeye_scripts/training
wget https://raw.githubusercontent.com/Proyag/sockeye/factor-pe/sockeye_contrib/create_seq_vocab.py
python create_seq_vocab.py --min-val -4000 --max-val 5000  --output seq_vocab_expanded.json

Training can then be launched using

./train_factored_noised0.1.sh

If you're using more than 1 GPU, first adjust the following settings in the script:

# Set the number of GPUs for distributed training
# Adjust BATCH_SIZE and UPDATE_INTERVAL according to your GPU situation.
# For example, if you change N_GPU to 2, you should set update-interval to 8 to have the same effective batch size
N_GPU=1
BATCH_SIZE=4096
UPDATE_INTERVAL=16
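The effective batch size is N_GPU * BATCH_SIZE * UPDATE_INTERVAL (in tokens, assuming Sockeye's default token-based batching): the defaults give 1 * 4096 * 16 = 65536 tokens per update, and 2 GPUs with UPDATE_INTERVAL=8 give 2 * 4096 * 8 = 65536, i.e. the same.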

The trained models will be in models/sockeye.

Generate test set dubs with Sockeye models

There are German videos for two subsets of the test set in data/test/subset{1,2}. We want to generate English dubbed videos for these.

Extract the test set audio/video in data/test

pushd data/test
tar -xzf subset1/subset1.tgz -C subset1
tar -xzf subset2/subset2.tgz -C subset2
popd

Set up FastSpeech2 (first-time setup only)

sudo apt-get install ffmpeg
cd third_party/FastSpeech2
# Create separate environment for FastSpeech2 dependencies
conda create -n fastspeech2 python=3.8
conda activate fastspeech2
pip install -r requirements.txt
pip install gdown subword-nmt==0.3.7
# Download and extract pretrained model
mkdir -p output/ckpt/LJSpeech output/result/LJSpeech preprocessed_data/LJSpeech/duration
cd output/ckpt/LJSpeech
gdown https://drive.google.com/uc?id=1r3fYhnblBJ8hDKDSUDtidJ-BN-xAM9pe
unzip LJSpeech_900000.zip
cd ../../../hifigan
unzip generator_LJSpeech.pth.tar.zip
conda deactivate

Generate dubbed audio for the test set subsets using FastSpeech2. Note: this will take some time to run (roughly an hour per subset on 1 V100 GPU in our tests).

cd ~/iwslt-autodub-task  # Or the path to the repo home, if different
python synthesize_speech.py --subset 1
python synthesize_speech.py --subset 2

For each file *.X.mov or *.X.mp4 in data/test/subset{1,2}, a dubbed video will be created at data/test/subset{1,2}/dubbed/*.X.mp4.
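The final muxing step, replacing the original audio track with the synthesized English speech, can be done with ffmpeg. synthesize_speech.py handles this for you; a standalone sketch of that last step (with illustrative flags and file names, not the script's exact invocation) looks like:

# Replace a video's audio track with the synthesized dub using ffmpeg.
# Illustrative; synthesize_speech.py performs this step internally.
import subprocess

def mux_dub(video_in, dubbed_audio, video_out):
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_in,                # original video
        "-i", dubbed_audio,            # synthesized English speech
        "-map", "0:v", "-map", "1:a",  # video from input 0, audio from input 1
        "-c:v", "copy",                # keep the video stream as-is
        "-shortest",
        video_out,
    ], check=True)

mux_dub("data/test/subset1/test_set.178.mp4",
        "test_set.178.en.wav",         # hypothetical FastSpeech2 output
        "data/test/subset1/dubbed/test_set.178.mp4")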
