Task ASR Reported: Caught ZeroDivisionError in DataLoader worker process 0. #2547

Closed
sunsdy2018 opened this issue May 12, 2024 · 4 comments
Labels: bug (Something isn't working)

sunsdy2018 commented May 12, 2024

Describe the bug

With speechbrain version 1.0.0, the unchanged source code, and the Mini LibriSpeech dataset, I followed the Google Colab notebook (https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing#scrollTo=6xcAJ4OlYZCh). When training the ASR recognizer with the command line python train.py /mnt/e/05_AIAudio/01_Codes/speechbrain/templates/speech_recognition/ASR/train.yaml --number_of_epochs 1 --batch_size 2 --enable_add_reverb False, after correctly loading several batches of data, it raised the following exception:

ZeroDivisionError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/e/05_AIAudio/01_Codes/speechbrain/templates/speech_recognition/ASR/train.py", line 497, in <module>
    asr_brain.fit(
  File "/mnt/e/05_AIAudio/01_Codes/speechbrain/speechbrain/core.py", line 1607, in fit
    self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
  File "/mnt/e/05_AIAudio/01_Codes/speechbrain/speechbrain/core.py", line 1426, in _fit_train
    for batch in t:
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/mnt/e/05_AIAudio/01_Codes/speechbrain/speechbrain/dataio/batch.py", line 129, in __init__
    padded = PaddedData(*padding_func(values, **padding_kwargs))
  File "/mnt/e/05_AIAudio/01_Codes/speechbrain/speechbrain/utils/data_utils.py", line 505, in batch_pad_right
    padded, valid_percent = pad_right_to(
  File "/mnt/e/05_AIAudio/01_Codes/speechbrain/speechbrain/utils/data_utils.py", line 444, in pad_right_to
    valid_vals.append(tensor.shape[j] / target_shape[j])
ZeroDivisionError: division by zero

The following screenshot shows that when pad_right_to() runs, target_shape[0] is 0.
(screenshot)

Moving up one frame, the following screenshot shows that the two tensors in this batch are both empty.
(screenshot)

Moving up another frame, the following screenshot shows that when the key is 'tokens', the two tensors of the batch are both empty.

(screenshot)

The training-data entry corresponding to the first tensor of this batch looks normal:
(screenshot)

After some debugging, I found this might be caused by an empty "tokens_list" returned by "hparams["tokenizer"].encode_as_ids(words)", and therefore an empty "tokens" returned by "tokens = torch.LongTensor(tokens_list)", as shown below:
(screenshot)
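
For reference, here is a minimal sketch (plain PyTorch, hypothetical values, not the actual SpeechBrain code) of how an empty tokens tensor produces exactly this failure: the padding target length becomes 0, so the valid-fraction ratio in pad_right_to() divides by zero.

    import torch

    # Two items whose 'tokens' tensors are empty, as in the batch above.
    batch = [torch.LongTensor([]), torch.LongTensor([])]
    # The collate step pads every item up to the max length in the batch...
    target_len = max(t.shape[0] for t in batch)  # 0 here
    # ...and records the fraction of each padded tensor that is valid data.
    valid = batch[0].shape[0] / target_len  # ZeroDivisionError: division by zero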

Expected behaviour

When the input wav and its text are not empty, the "tokens_list" returned by "hparams["tokenizer"].encode_as_ids(words)" and the "tokens" returned by "torch.LongTensor(tokens_list)" shouldn't be empty.
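
One way to surface this earlier (a hedged sketch modeled on the template's text_pipeline; the real pipeline also provides tokens_bos/tokens_eos and may differ between versions) is to assert that the tokenizer actually produced ids before building the tensor:

    import torch
    import speechbrain as sb

    @sb.utils.data_pipeline.takes("words")
    @sb.utils.data_pipeline.provides("words", "tokens_list", "tokens")
    def text_pipeline(words):
        yield words
        tokens_list = hparams["tokenizer"].encode_as_ids(words)
        # Fail fast with a readable message instead of an empty tensor
        # that later crashes the collate function in a DataLoader worker.
        assert len(tokens_list) > 0, f"tokenizer returned no ids for: {words!r}"
        yield tokens_list
        yield torch.LongTensor(tokens_list)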

To Reproduce

No response

Environment Details

No response

Relevant Log Output

No response

Additional Context

1. tokenizer.yaml
# ############################################################################
# Tokenizer: subword BPE tokenizer with unigram 1K
# Training: Mini-LibriSpeech
# Authors:  Abdel Heba 2021
#           Mirco Ravanelli 2021
# ############################################################################


# Set up folders for reading from and writing to
data_folder: /mnt/i/02_AIData/03_AudioData/06_MiniLibriSpeech
output_folder: ./save

# Path where data-specification files are stored
skip_prep: True
train_annotation: ../train.json
valid_annotation: ../valid.json
test_annotation: ../test.json

# Tokenizer parameters
token_type: unigram  # ["unigram", "bpe", "char"]
token_output: 1000  # index(blank/eos/bos/unk) = 0
character_coverage: 1.0
annotation_read: words # field to read

# Tokenizer object
tokenizer: !name:speechbrain.tokenizers.SentencePiece.SentencePiece
   model_dir: !ref <output_folder>
   vocab_size: !ref <token_output>
   annotation_train: !ref <train_annotation>
   annotation_read: !ref <annotation_read>
   model_type: !ref <token_type> # ["unigram", "bpe", "char"]
   character_coverage: !ref <character_coverage>
   annotation_list_to_check: [!ref <train_annotation>, !ref <valid_annotation>]
   annotation_format: json

================
3. RNNLM.yaml

# ############################################################################
# Model: Language model with a recurrent neural network (RNNLM)
# Training: mini-librispeech transcripts
# Authors:  Ju-Chieh Chou 2020, Jianyuan Zhong 2021, Mirco Ravanelli 2021
# ############################################################################

# Seed needs to be set at top of yaml, before objects with parameters are made
seed: 2602
__set_seed: !apply:torch.manual_seed [!ref <seed>]
data_folder: /mnt/e/05_AIAudio/01_Codes/speechbrain/templates/speech_recognition/LM/data/
output_folder: !ref results/RNNLM/
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# If you plan to train a system on an HPC cluster with a big dataset,
# we strongly suggest doing the following:
# 1- Compress the dataset in a single tar or zip file.
# 2- Copy your dataset locally (i.e., the local disk of the computing node).
# 3- Uncompress the dataset in the local folder.
# 4- Set lm_{train,valid,test}_data with the local path.
# Reading data from the local disk of the compute node (e.g. $SLURM_TMPDIR with SLURM-based clusters) is very important.
# It allows you to read the data much faster without slowing down the shared filesystem.
lm_train_data: !ref <data_folder>/train.txt
lm_valid_data: !ref <data_folder>/valid.txt
lm_test_data: !ref <data_folder>/test.txt

# The train logger writes training statistics to a file, as well as stdout.
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

# Tokenizer model (you must use the same tokenizer for LM and ASR training)
tokenizer_file: /mnt/e/05_AIAudio/01_Codes/speechbrain/save/1000_unigram.model

# Training parameters
number_of_epochs: 200
batch_size: 80
lr: 0.001
grad_accumulation_factor: 1 # Gradient accumulation to simulate large batch training
ckpt_interval_minutes: 15 # save checkpoint every N min

# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>
    shuffle: True

valid_dataloader_opts:
    batch_size: 1

test_dataloader_opts:
    batch_size: 1

# Model parameters
emb_dim: 256 # dimension of the embeddings
rnn_size: 512 # dimension of hidden layers
layers: 2 # number of hidden layers

# Outputs
# output_neurons: 1000 # index(blank/eos/bos) = 0
# blank_index: 0
bos_index: 0
eos_index: 0


# To design a custom model, either just edit the simple CustomModel
# class that's listed here, or replace this `!new` call with a line
# pointing to a different file you've defined.
model: !new:custom_model.CustomModel
    embedding_dim: !ref <emb_dim>
    rnn_size: !ref <rnn_size>
    layers: !ref <layers>


# Cost function used for training the model
compute_cost: !name:speechbrain.nnet.losses.nll_loss

# This optimizer will be constructed by the Brain class after all parameters
# are moved to the correct device. Then it will be added to the checkpointer.
optimizer: !name:torch.optim.Adam
    lr: !ref <lr>
    betas: (0.9, 0.98)
    eps: 0.000000001

# This function manages learning rate annealing over the epochs.
# We here use the NewBob algorithm, which anneals the learning rate if
# the improvement over two consecutive epochs is less than the defined
# threshold.
lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0


# The first object passed to the Brain class is this "Epoch Counter"
# which is saved by the Checkpointer so that training can be resumed
# if it gets interrupted at any point.
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>

# Objects in "modules" dict will have their parameters moved to the correct
# device, as well as having train()/eval() called on them by the Brain class.
modules:
    model: !ref <model>

# Tokenizer initialization
tokenizer: !new:sentencepiece.SentencePieceProcessor

# This object is used for saving the state of training both so that it
# can be resumed if it gets interrupted, and also so that the best checkpoint
# can be later loaded for evaluation or inference.
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        model: !ref <model>
        scheduler: !ref <lr_annealing>
        counter: !ref <epoch_counter>

# Pretrain the tokenizer
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        tokenizer: !ref <tokenizer>
    paths:
        tokenizer: !ref <tokenizer_file>

================
4. train.yaml

# ############################################################################
# Model: E2E ASR with attention-based ASR
# Encoder: CRDNN
# Decoder: GRU + beamsearch + RNNLM
# Tokens: 1000 BPE
# losses: CTC+ NLL
# Training: mini-librispeech
# Pre-Training: librispeech 960h
# Authors:  Ju-Chieh Chou, Mirco Ravanelli, Abdel Heba, Peter Plantinga, Samuele Cornell 2020
# ############################################################################

# Seed needs to be set at top of yaml, before objects with parameters are instantiated
seed: 2602
__set_seed: !apply:torch.manual_seed [!ref <seed>]

# If you plan to train a system on an HPC cluster with a big dataset,
# we strongly suggest doing the following:
# 1- Compress the dataset in a single tar or zip file.
# 2- Copy your dataset locally (i.e., the local disk of the computing node).
# 3- Uncompress the dataset in the local folder.
# 4- Set data_folder with the local path
# Reading data from the local disk of the compute node (e.g. $SLURM_TMPDIR with SLURM-based clusters) is very important.
# It allows you to read the data much faster without slowing down the shared filesystem.

data_folder: /mnt/i/02_AIData/03_AudioData/06_MiniLibriSpeech #../data # In this case, data will be automatically downloaded here.
data_folder_noise: !ref <data_folder>/noise # The noisy sequences for data augmentation will automatically be downloaded here.
data_folder_rir: !ref <data_folder>/rir # The impulse responses used for data augmentation will automatically be downloaded here.

# Data for augmentation
NOISE_DATASET_URL: https://www.dropbox.com/scl/fi/a09pj97s5ifan81dqhi4n/noises.zip?rlkey=j8b0n9kdjdr32o1f06t0cw5b7&dl=1
RIR_DATASET_URL: https://www.dropbox.com/scl/fi/linhy77c36mu10965a836/RIRs.zip?rlkey=pg9cu8vrpn2u173vhiqyu743u&dl=1

output_folder: !ref results/CRDNN_BPE_960h_LM/<seed>
test_wer_file: !ref <output_folder>/wer_test.txt
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# Language model (LM) pretraining
# NB: To avoid mismatch, the speech recognizer must be trained with the same
# tokenizer used for LM training. Here, we download everything from the
# speechbrain HuggingFace repository. However, a local path pointing to a
# directory containing the lm.ckpt and tokenizer.ckpt may also be specified
# instead, e.g., if you want to use your own LM / tokenizer.
pretrained_path: speechbrain/asr-crdnn-rnnlm-librispeech


# Path where data manifest files will be stored. The data manifest files are created by the
# data preparation script
train_annotation: ../train.json
valid_annotation: ../valid.json
test_annotation: ../test.json
noise_annotation: ../noise.csv
rir_annotation: ../rir.csv

skip_prep: False

# The train logger writes training statistics to a file, as well as stdout.
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

# Training parameters
number_of_epochs: 15
number_of_ctc_epochs: 5
batch_size: 8
lr: 1.0
ctc_weight: 0.5
sorting: ascending
ckpt_interval_minutes: 15 # save checkpoint every N min
label_smoothing: 0.1

# Optimal number of workers for data reading.
# The ideal value depends on your machine's hardware, such as the number of available CPUs.
num_workers: 4

# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>
    num_workers: !ref <num_workers>

valid_dataloader_opts:
    batch_size: !ref <batch_size>
    num_workers: !ref <num_workers>

test_dataloader_opts:
    batch_size: !ref <batch_size>
    num_workers: !ref <num_workers>

# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 40

# NOTE ON DATA AUGMENTATION
# This template demonstrates the use of all available data augmentation strategies
# to illustrate how they work and how you can combine them with the augmenter.
# In practical applications (e.g., refer to other recipes), it is usually advisable
# to select a subset of these strategies for better performance.

# Waveform Augmentation Functions
snr_low: 0  # Min SNR for noise augmentation
snr_high: 15  # Max SNR for noise augmentation
speed_changes: [85, 90, 95, 105, 110, 115]  # List of speed changes for time-stretching
drop_freq_low: 0  # Min frequency band dropout probability
drop_freq_high: 1  # Max frequency band dropout probability
drop_freq_count_low: 1  # Min number of frequency bands to drop
drop_freq_count_high: 3  # Max number of frequency bands to drop
drop_freq_width: 0.05  # Width of frequency bands to drop
drop_chunk_count_low: 1  # Min number of audio chunks to drop
drop_chunk_count_high: 3  # Max number of audio chunks to drop
drop_chunk_length_low: 1000  # Min length of audio chunks to drop
drop_chunk_length_high: 2000  # Max length of audio chunks to drop
clip_low: 0.1  # Min amplitude to clip
clip_high: 0.5  # Max amplitude to clip
amp_low: 0.05  # Min waveform amplitude
amp_high: 1.0  # Max waveform amplitude
babble_snr_low: 5  # Min SNR for babble (batch sum noise)
babble_snr_high: 15  # Max SNR for babble (batch sum noise)

# Feature Augmentation Functions
min_time_shift: 0  # Min random shift of spectrogram in time
max_time_shift: 15  # Max random shift of spectrogram in time
min_freq_shift: 0  # Min random shift of spectrogram in frequency
max_freq_shift: 5  # Max random shift of spectrogram in frequency
time_drop_length_low: 5  # Min length for temporal chunk to drop in spectrogram
time_drop_length_high: 15  # Max length for temporal chunk to drop in spectrogram
time_drop_count_low: 1  # Min number of chunks to drop in time in the spectrogram
time_drop_count_high: 3  # Max number of chunks to drop in time in the spectrogram
time_drop_replace: "zeros"  # Method of dropping chunks
freq_drop_length_low: 1  # Min length for chunks to drop in frequency in the spectrogram
freq_drop_length_high: 5  # Max length for chunks to drop in frequency in the spectrogram
freq_drop_count_low: 1  # Min number of chunks to drop in frequency in the spectrogram
freq_drop_count_high: 3  # Max number of chunks to drop in frequency in the spectrogram
freq_drop_replace: "zeros"  # Method of dropping chunks
time_warp_window: 20  # Length of time warping window
time_warp_mode: "bicubic"  # Time warping method
freq_warp_window: 4  # Length of frequency warping window
freq_warp_mode: "bicubic"  # Frequency warping method

# Enable Waveform Augmentation Flags (useful for hyperparameter tuning)
enable_codec_augment: False
enable_add_reverb: True
enable_add_noise: True
enable_speed_perturb: True
enable_drop_freq: True
enable_drop_chunk: True
enable_clipping: True
enable_rand_amp: True
enable_babble_noise: True
enable_drop_resolution: True

# Enable Feature Augmentations Flags (useful for hyperparameter tuning)
enable_time_shift: True
enable_freq_shift: True
enable_time_drop: True
enable_freq_drop: True
enable_time_warp: True
enable_freq_warp: True

# Waveform Augmenter (combining augmentations)
time_parallel_augment: False  # Apply augmentations in parallel if True, or sequentially if False
time_concat_original: True  # Concatenate original signals to the training batch if True
time_repeat_augment: 1  # Number of times to apply augmentation
time_shuffle_augmentations: True  # Shuffle order of augmentations if True, else use specified order
time_min_augmentations: 1  # Min number of augmentations to apply
time_max_augmentations: 10  # Max number of augmentations to apply
time_augment_prob: 1.0     # Probability to apply time augmentation

# Feature Augmenter (combining augmentations)
fea_parallel_augment: False  # Apply feature augmentations in parallel if True, or sequentially if False
fea_concat_original: True  # Concatenate original signals to the training batch if True
fea_repeat_augment: 1  # Number of times to apply feature augmentation
fea_shuffle_augmentations: True  # Shuffle order of feature augmentations if True, else use specified order
fea_min_augmentations: 1  # Min number of feature augmentations to apply
fea_max_augmentations: 6  # Max number of feature augmentations to apply
fea_augment_prob: 1.0     # Probability to apply feature augmentation

# Model parameters
activation: !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 2
cnn_channels: (128, 256)
inter_layer_pooling_size: (2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 4
rnn_neurons: 1024
rnn_bidirectional: True
dnn_blocks: 2
dnn_neurons: 512
emb_size: 128
dec_neurons: 1024
output_neurons: 1000  # Number of tokens (same as LM)
blank_index: 0
bos_index: 0
eos_index: 0

# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 1.0
valid_beam_size: 8
test_beam_size: 80
eos_threshold: 1.5
using_max_attn_shift: True
max_attn_shift: 240
lm_weight: 0.50
ctc_weight_decode: 0.0
coverage_penalty: 1.5
temperature: 1.25
temperature_lm: 1.25

# The first object passed to the Brain class is this "Epoch Counter"
# which is saved by the Checkpointer so that training can be resumed
# if it gets interrupted at any point.
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>

# Feature extraction
compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>

# Feature normalization (mean and std)
normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global


# Download and prepare the dataset of noisy sequences for augmentation
prepare_noise_data: !name:speechbrain.augment.preparation.prepare_dataset_from_URL
    URL: !ref <NOISE_DATASET_URL>
    dest_folder: !ref <data_folder_noise>
    ext: wav
    csv_file: !ref <noise_annotation>

# Download and prepare the dataset of room impulse responses for augmentation
prepare_rir_data: !name:speechbrain.augment.preparation.prepare_dataset_from_URL
    URL: !ref <RIR_DATASET_URL>
    dest_folder: !ref <data_folder_rir>
    ext: wav
    csv_file: !ref <rir_annotation>


# ----- WAVEFORM AUGMENTATION ----- #

# Codec augmentation
codec_augment: !new:speechbrain.augment.codec.CodecAugment
    sample_rate: !ref <sample_rate>

# Add reverberation to input signal
add_reverb: !new:speechbrain.augment.time_domain.AddReverb
    csv_file: !ref <rir_annotation>
    reverb_sample_rate: !ref <sample_rate>
    clean_sample_rate: !ref <sample_rate>
    num_workers: !ref <num_workers>

# Add noise to input signal
add_noise: !new:speechbrain.augment.time_domain.AddNoise
    csv_file: !ref <noise_annotation>
    snr_low: !ref <snr_low>
    snr_high: !ref <snr_high>
    noise_sample_rate: !ref <sample_rate>
    clean_sample_rate: !ref <sample_rate>
    num_workers: !ref <num_workers>

# Speed perturbation
speed_perturb: !new:speechbrain.augment.time_domain.SpeedPerturb
    orig_freq: !ref <sample_rate>
    speeds: !ref <speed_changes>

# Frequency drop: randomly drops a number of frequency bands to zero.
drop_freq: !new:speechbrain.augment.time_domain.DropFreq
    drop_freq_low: !ref <drop_freq_low>
    drop_freq_high: !ref <drop_freq_high>
    drop_freq_count_low: !ref <drop_freq_count_low>
    drop_freq_count_high: !ref <drop_freq_count_high>
    drop_freq_width: !ref <drop_freq_width>

# Time drop: randomly drops a number of temporal chunks.
drop_chunk: !new:speechbrain.augment.time_domain.DropChunk
    drop_length_low: !ref <drop_chunk_length_low>
    drop_length_high: !ref <drop_chunk_length_high>
    drop_count_low: !ref <drop_chunk_count_low>
    drop_count_high: !ref <drop_chunk_count_high>

# Clipping
clipping: !new:speechbrain.augment.time_domain.DoClip
    clip_low: !ref <clip_low>
    clip_high: !ref <clip_high>

# Random Amplitude
rand_amp: !new:speechbrain.augment.time_domain.RandAmp
    amp_low: !ref <amp_low>
    amp_high: !ref <amp_high>

# Noise sequence derived by summing up all the signals in the batch
# It is similar to babble noise
sum_batch: !name:torch.sum
    dim: 0
    keepdim: True

babble_noise: !new:speechbrain.augment.time_domain.AddNoise
    snr_low: !ref <babble_snr_low>
    snr_high: !ref <babble_snr_high>
    noise_funct: !ref <sum_batch>

drop_resolution: !new:speechbrain.augment.time_domain.DropBitResolution
    target_dtype: 'random'


# Augmenter: Combines previously defined augmentations to perform data augmentation
wav_augment: !new:speechbrain.augment.augmenter.Augmenter
    parallel_augment: !ref <time_parallel_augment>
    concat_original: !ref <time_concat_original>
    repeat_augment: !ref <time_repeat_augment>
    shuffle_augmentations: !ref <time_shuffle_augmentations>
    min_augmentations: !ref <time_min_augmentations>
    max_augmentations: !ref <time_max_augmentations>
    augment_prob: !ref <time_augment_prob>
    augmentations: [
        !ref <codec_augment>,
        !ref <add_reverb>,
        !ref <add_noise>,
        !ref <babble_noise>,
        !ref <speed_perturb>,
        !ref <clipping>,
        !ref <drop_freq>,
        !ref <drop_chunk>,
        !ref <rand_amp>,
        !ref <drop_resolution>]
    enable_augmentations: [
        !ref <enable_codec_augment>,
        !ref <enable_add_reverb>,
        !ref <enable_add_noise>,
        !ref <enable_babble_noise>,
        !ref <enable_speed_perturb>,
        !ref <enable_clipping>,
        !ref <enable_drop_freq>,
        !ref <enable_drop_chunk>,
        !ref <enable_rand_amp>,
        !ref <enable_drop_resolution>]


# ----- FEATURE AUGMENTATION ----- #

# Time shift
time_shift: !new:speechbrain.augment.freq_domain.RandomShift
    min_shift: !ref <min_time_shift>
    max_shift: !ref <max_time_shift>
    dim: 1

# Frequency shift
freq_shift: !new:speechbrain.augment.freq_domain.RandomShift
    min_shift: !ref <min_freq_shift>
    max_shift: !ref <max_freq_shift>
    dim: 2

# Time Drop
time_drop: !new:speechbrain.augment.freq_domain.SpectrogramDrop
    drop_length_low: !ref <time_drop_length_low>
    drop_length_high: !ref <time_drop_length_high>
    drop_count_low: !ref <time_drop_count_low>
    drop_count_high: !ref <time_drop_count_high>
    replace: !ref <time_drop_replace>
    dim: 1

# Frequency Drop
freq_drop: !new:speechbrain.augment.freq_domain.SpectrogramDrop
    drop_length_low: !ref <freq_drop_length_low>
    drop_length_high: !ref <freq_drop_length_high>
    drop_count_low: !ref <freq_drop_count_low>
    drop_count_high: !ref <freq_drop_count_high>
    replace: !ref <freq_drop_replace>
    dim: 2

# Time warp
time_warp: !new:speechbrain.augment.freq_domain.Warping
    warp_window: !ref <time_warp_window>
    warp_mode: !ref <time_warp_mode>
    dim: 1

freq_warp: !new:speechbrain.augment.freq_domain.Warping
    warp_window: !ref <freq_warp_window>
    warp_mode: !ref <freq_warp_mode>
    dim: 2

fea_augment: !new:speechbrain.augment.augmenter.Augmenter
    parallel_augment: !ref <fea_parallel_augment>
    concat_original: !ref <fea_concat_original>
    repeat_augment: !ref <fea_repeat_augment>
    shuffle_augmentations: !ref <fea_shuffle_augmentations>
    min_augmentations: !ref <fea_min_augmentations>
    max_augmentations: !ref <fea_max_augmentations>
    augment_start_index: !ref <batch_size> # This leaves original inputs unchanged
    concat_end_index: !ref <batch_size> # This leaves original inputs unchanged
    augment_prob: !ref <fea_augment_prob>
    augmentations: [
        !ref <time_shift>,
        !ref <freq_shift>,
        !ref <time_drop>,
        !ref <freq_drop>,
        !ref <time_warp>,
        !ref <freq_warp>]
    enable_augmentations: [
        !ref <enable_time_shift>,
        !ref <enable_freq_shift>,
        !ref <enable_time_drop>,
        !ref <enable_freq_drop>,
        !ref <enable_time_warp>,
        !ref <enable_freq_warp>]

# The CRDNN model is an encoder that combines CNNs, RNNs, and DNNs.
encoder: !new:speechbrain.lobes.models.CRDNN.CRDNN
    input_shape: [null, null, !ref <n_mels>]
    activation: !ref <activation>
    dropout: !ref <dropout>
    cnn_blocks: !ref <cnn_blocks>
    cnn_channels: !ref <cnn_channels>
    cnn_kernelsize: !ref <cnn_kernelsize>
    inter_layer_pooling_size: !ref <inter_layer_pooling_size>
    time_pooling: True
    using_2d_pooling: False
    time_pooling_size: !ref <time_pooling_size>
    rnn_class: !ref <rnn_class>
    rnn_layers: !ref <rnn_layers>
    rnn_neurons: !ref <rnn_neurons>
    rnn_bidirectional: !ref <rnn_bidirectional>
    rnn_re_init: True
    dnn_blocks: !ref <dnn_blocks>
    dnn_neurons: !ref <dnn_neurons>
    use_rnnp: False

# Embedding (from indexes to an embedding space of dimension emb_size).
embedding: !new:speechbrain.nnet.embedding.Embedding
    num_embeddings: !ref <output_neurons>
    embedding_dim: !ref <emb_size>

# Attention-based RNN decoder.
decoder: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
    enc_dim: !ref <dnn_neurons>
    input_size: !ref <emb_size>
    rnn_type: gru
    attn_type: location
    hidden_size: !ref <dec_neurons>
    attn_dim: 1024
    num_layers: 1
    scaling: 1.0
    channels: 10
    kernel_size: 100
    re_init: True
    dropout: !ref <dropout>

# Linear transformation on the top of the encoder.
ctc_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dnn_neurons>
    n_neurons: !ref <output_neurons>

# Linear transformation on the top of the decoder.
seq_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dec_neurons>
    n_neurons: !ref <output_neurons>

# Final softmax (for log posteriors computation).
log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True

# Cost definition for the CTC part.
ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
    blank_index: !ref <blank_index>


# Tokenizer initialization
tokenizer: !new:sentencepiece.SentencePieceProcessor

# Objects in "modules" dict will have their parameters moved to the correct
# device, as well as having train()/eval() called on them by the Brain class
modules:
    encoder: !ref <encoder>
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    ctc_lin: !ref <ctc_lin>
    seq_lin: !ref <seq_lin>
    normalize: !ref <normalize>
    lm_model: !ref <lm_model>

# Gathering all the submodels in a single model object.
model: !new:torch.nn.ModuleList
    - - !ref <encoder>
      - !ref <embedding>
      - !ref <decoder>
      - !ref <ctc_lin>
      - !ref <seq_lin>

# This is the RNNLM that is used according to the Huggingface repository
# NB: It has to match the pre-trained RNNLM!!
lm_model: !new:speechbrain.lobes.models.RNNLM.RNNLM
    output_neurons: !ref <output_neurons>
    embedding_dim: !ref <emb_size>
    activation: !name:torch.nn.LeakyReLU
    dropout: 0.0
    rnn_layers: 2
    rnn_neurons: 2048
    dnn_blocks: 1
    dnn_neurons: 512
    return_hidden: True  # For inference

# Define scorers for beam search

# If ctc_scorer is set, the decoder uses CTC + attention beamsearch. This
# improves the performance, but slows down decoding.
ctc_scorer: !new:speechbrain.decoders.scorer.CTCScorer
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    ctc_fc: !ref <ctc_lin>

# If coverage_scorer is set, coverage penalty is applied based on accumulated
# attention weights during beamsearch.
coverage_scorer: !new:speechbrain.decoders.scorer.CoverageScorer
    vocab_size: !ref <output_neurons>

# If the lm_scorer is set, a language model
# is applied (with a weight specified in scorer).
rnnlm_scorer: !new:speechbrain.decoders.scorer.RNNLMScorer
    language_model: !ref <lm_model>
    temperature: !ref <temperature_lm>

# Gathering all scorers in a scorer instance for beamsearch:
# - full_scorers are scorers which score on full vocab set, while partial_scorers
# are scorers which score on pruned tokens.
# - The number of pruned tokens is decided by scorer_beam_scale * beam_size.
# - For some scorers (e.g., ctc_scorer, ngramlm_scorer), putting them
# into the full_scorers list would be too heavy. partial_scorers are more
# efficient because they score on pruned tokens at a small cost in
# performance. For other scorers, please see speechbrain.decoders.scorer.
test_scorer: !new:speechbrain.decoders.scorer.ScorerBuilder
    scorer_beam_scale: 1.5
    full_scorers: [
        !ref <rnnlm_scorer>,
        !ref <coverage_scorer>]
    partial_scorers: [!ref <ctc_scorer>]
    weights:
        rnnlm: !ref <lm_weight>
        coverage: !ref <coverage_penalty>
        ctc: !ref <ctc_weight_decode>

valid_scorer: !new:speechbrain.decoders.scorer.ScorerBuilder
    full_scorers: [!ref <coverage_scorer>]
    weights:
        coverage: !ref <coverage_penalty>

# Beamsearch is applied on the top of the decoder. For a description of
# the other parameters, please see the speechbrain.decoders.S2SRNNBeamSearcher.

# It makes sense to have a lighter search during validation. In this case,
# we don't use scorers during decoding.
valid_search: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    linear: !ref <seq_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <valid_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    temperature: !ref <temperature>
    scorer: !ref <valid_scorer>

# The final decoding on the test set can be more computationally demanding.
# In this case, we use the LM + CTC probabilities during decoding as well,
# which are defined in scorer.
# Please, remove scorer if you need a faster decoder.
test_search: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    linear: !ref <seq_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    temperature: !ref <temperature>
    scorer: !ref <test_scorer>

# This function manages learning rate annealing over the epochs.
# We here use the NewBob algorithm, which anneals the learning rate if
# the improvement over two consecutive epochs is less than the defined
# threshold.
lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0

# This optimizer will be constructed by the Brain class after all parameters
# are moved to the correct device. Then it will be added to the checkpointer.
opt_class: !name:torch.optim.Adadelta
    lr: !ref <lr>
    rho: 0.95
    eps: 1.e-8

# Functions that compute the statistics to track during the validation step.
error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
    split_tokens: True

# This object is used for saving the state of training both so that it
# can be resumed if it gets interrupted, and also so that the best checkpoint
# can be later loaded for evaluation or inference.
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        model: !ref <model>
        scheduler: !ref <lr_annealing>
        normalizer: !ref <normalize>
        counter: !ref <epoch_counter>

# This object is used to pretrain the language model and the tokenizers
# (defined above). In this case, we also pretrain the ASR model (to make
# sure the model converges on a small amount of data)
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>
        model: !ref <model>
    paths:
        lm: !ref <pretrained_path>/lm.ckpt
        tokenizer: !ref <pretrained_path>/tokenizer.ckpt
        model: !ref <pretrained_path>/asr.ckpt

================
This code is running in WSL 2 in a conda virtual environment; the packages are as follows:

(envSpeechBrainTorch221Py310) (base) ubuntu@hp-lmj:/mnt/e/05_AIAudio/01_Codes/speechbrain$ pip list
Package                   Version      Editable project location
------------------------- ------------ --------------------------------------
accelerate                0.29.2
aiofiles                  23.2.1
aiohttp                   3.9.4
aiosignal                 1.3.1
altair                    5.3.0
annotated-types           0.6.0
antlr4-python3-runtime    4.13.1
anyio                     4.3.0
async-timeout             4.0.3
attrs                     23.2.0
audioread                 3.0.1
augly                     1.0.0
av                        11.0.0
bitarray                  2.9.2
bitsandbytes              0.43.1
black                     24.3.0
blinker                   1.7.0
certifi                   2024.2.2
cffi                      1.16.0
cfgv                      3.4.0
charset-normalizer        3.3.2
click                     8.1.7
coloredlogs               15.0.1
contourpy                 1.2.1
ctranslate2               4.1.0
cycler                    0.12.1
dataclasses               0.6
datasets                  2.18.0
decorator                 5.1.1
dill                      0.3.8
distlib                   0.3.8
docstring_parser_fork     0.0.5
evaluate                  0.4.1
exceptiongroup            1.2.0
fastapi                   0.110.1
faster-whisper            1.0.1
ffmpeg-python             0.2.0
ffmpy                     0.3.2
filelock                  3.14.0
flake8                    7.0.0
Flask                     3.0.3
flatbuffers               24.3.25
fonttools                 4.51.0
frozenlist                1.4.1
fsspec                    2024.2.0
future                    1.0.0
gradio                    4.26.0
gradio_client             0.15.1
h11                       0.14.0
httpcore                  1.0.5
httpx                     0.27.0
huggingface-hub           0.22.2
humanfriendly             10.0
HyperPyYAML               1.2.2
identify                  2.5.36
idna                      3.7
importlib_resources       6.4.0
iniconfig                 2.0.0
iopath                    0.1.10
isort                     5.13.2
itsdangerous              2.1.2
Jinja2                    3.1.2
jiwer                     3.0.3
joblib                    1.4.0
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
kiwisolver                1.4.5
lazy_loader               0.4
librosa                   0.10.1
llvmlite                  0.42.0
markdown-it-py            3.0.0
MarkupSafe                2.1.3
matplotlib                3.8.4
mccabe                    0.7.0
mdurl                     0.1.2
mpmath                    1.3.0
msgpack                   1.0.8
multidict                 6.0.5
multiprocess              0.70.16
mypy-extensions           1.0.0
networkx                  3.2.1
nodeenv                   1.8.0
numba                     0.59.1
numpy                     1.26.3
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         8.9.2.26
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.19.3
nvidia-nvjitlink-cu12     12.1.105
nvidia-nvtx-cu12          12.1.105
onnxruntime               1.17.3
opencv-contrib-python     4.9.0.80
opencv-python             4.9.0.80
orjson                    3.10.0
packaging                 24.0
pandas                    2.2.2
pathspec                  0.12.1
peft                      0.10.0
pillow                    10.2.0
pip                       23.3.1
platformdirs              4.2.0
pluggy                    1.5.0
pooch                     1.8.1
portalocker               2.8.2
pre-commit                3.7.0
protobuf                  5.26.1
psutil                    5.9.8
pyarrow                   15.0.2
pyarrow-hotfix            0.6
pycodestyle               2.11.0
pycparser                 2.22
pydantic                  2.7.0
pydantic_core             2.18.1
pydoclint                 0.4.1
pydub                     0.25.1
pyflakes                  3.2.0
Pygments                  2.17.2
pygtrie                   2.5.0
pyparsing                 3.1.2
pytest                    7.4.0
python-dateutil           2.9.0.post0
python-magic              0.4.27
python-multipart          0.0.9
pytz                      2024.1
PyYAML                    6.0.1
rapidfuzz                 3.8.1
referencing               0.34.0
regex                     2023.12.25
requests                  2.31.0
responses                 0.18.0
rich                      13.7.1
rpds-py                   0.18.0
ruamel.yaml               0.18.6
ruamel.yaml.clib          0.2.8
ruff                      0.3.7
safetensors               0.4.2
scikit-learn              1.4.2
scipy                     1.12.0
semantic-version          2.10.0
sentencepiece             0.2.0
setuptools                68.2.2
shellingham               1.5.4
six                       1.16.0
sniffio                   1.3.1
SoundCard                 0.4.3
soundfile                 0.12.1
soxr                      0.3.7
speechbrain               1.0.0        /mnt/e/05_AIAudio/01_Codes/speechbrain
starlette                 0.37.2
sympy                     1.12
tensorboardX              2.6.2.2
threadpoolctl             3.4.0
tokenizers                0.15.2
tomli                     2.0.1
tomlkit                   0.12.0
toolz                     0.12.1
torch                     2.2.1+cu121
torchaudio                2.2.1+cu121
torchvision               0.17.1+cu121
tqdm                      4.66.2
transformers              4.39.3
triton                    2.2.0
typer                     0.12.3
typing_extensions         4.8.0
tzdata                    2024.1
urllib3                   2.2.1
uvicorn                   0.29.0
virtualenv                20.26.1
websockets                11.0.3
Werkzeug                  3.0.2
wheel                     0.41.2
xxhash                    3.4.1
yamllint                  1.35.1
yarl                      1.9.4
zhconv                    1.4.3
sunsdy2018 added the bug label on May 12, 2024

sunsdy2018 (Author) commented:

To give an extra report: to avoid the ZeroDivisionError shown above, I modified the code in speechbrain/utils/data_utils.py as follows:

In the function pad_right_to(), from:

    valid_vals.append(tensor.shape[j] / target_shape[j])

to:

    if target_shape[j] != 0:
        valid_vals.append(tensor.shape[j] / target_shape[j])
    else:
        valid_vals.append(1.0)

So it looks like the following:
(screenshot)
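
The guard can be exercised directly on the collate helper (assuming batch_pad_right takes a list of tensors and returns a (padded, valid_fraction) pair, as the traceback above suggests):

    import torch
    from speechbrain.utils.data_utils import batch_pad_right

    # With the guard in place, a batch of empty token tensors collates
    # without raising; previously this divided by a zero target length.
    padded, valid = batch_pad_right([torch.LongTensor([]), torch.LongTensor([])])
    print(padded.shape, valid)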

After this modification, and with a new batch size of 8, the training process went on well and finished one epoch, as shown below:
(screenshot)

However, before going any further, it reported an error "sentencepiece_processor.cc(954) LOG(ERROR) src/sentencepiece_processor.cc(294) [model_] Model is not initialized."

And it threw an exception as follows:
(screenshot)

Going back to the compute_objectives() frame, the state is as follows:
(screenshot)

sunsdy2018 (Author) commented:

This problem was caused by my mistake.

scj0709 commented May 24, 2024

I am also experiencing the same problem.
How did you solve it?

sunsdy2018 (Author) commented:

Sorry for not seeing your comment until now.

My problem was caused by my own mistake: I had commented out the initialization code of the pretrainer. As shown in the red box in the following figure, I commented those lines out and got the ZeroDivisionError. Besides, I had not given the correct configuration arguments for the pretrainer section in train.yaml.
(screenshot)
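
For anyone hitting the same thing: in the template's train.py, the pretrainer has to run before asr_brain.fit() so that the bare sentencepiece.SentencePieceProcessor declared in train.yaml actually loads tokenizer.ckpt (and the LM loads lm.ckpt). A sketch of the lines that must stay uncommented (names follow the SpeechBrain templates; exact arguments may differ between versions):

    from speechbrain.utils.distributed import run_on_main

    # Fetch lm.ckpt / tokenizer.ckpt / asr.ckpt as listed under "pretrainer"
    # in train.yaml (only on the main process), then load them into the
    # bare tokenizer, LM, and ASR model objects.
    run_on_main(hparams["pretrainer"].collect_files)
    hparams["pretrainer"].load_collected()

With these lines commented out, encode_as_ids() runs on an uninitialized SentencePiece model, which explains both the empty tokens_list and the later "Model is not initialized" error.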
