Describe the bug

With speechbrain 1.0.0, unmodified source code, and the Mini LibriSpeech dataset, I followed the Google Colab notebook (https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing#scrollTo=6xcAJ4OlYZCh). When using the command line

python train.py /mnt/e/05_AIAudio/01_Codes/speechbrain/templates/speech_recognition/ASR/train.yaml --number_of_epochs 1 --batch_size 2 --enable_add_reverb False

to train the ASR recognizer, after correctly loading some batches of data, it raised the following exception:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/e/05_AIAudio/01_Codes/speechbrain/templates/speech_recognition/ASR/train.py", line 497, in <module>
    asr_brain.fit(
  File "/mnt/e/05_AIAudio/01_Codes/speechbrain/speechbrain/core.py", line 1607, in fit
    self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
  File "/mnt/e/05_AIAudio/01_Codes/speechbrain/speechbrain/core.py", line 1426, in _fit_train
    for batch in t:
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/anaconda3/envs/envSpeechBrainTorch221Py310/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/mnt/e/05_AIAudio/01_Codes/speechbrain/speechbrain/dataio/batch.py", line 129, in __init__
    padded = PaddedData(*padding_func(values, **padding_kwargs))
  File "/mnt/e/05_AIAudio/01_Codes/speechbrain/speechbrain/utils/data_utils.py", line 505, in batch_pad_right
    padded, valid_percent = pad_right_to(
  File "/mnt/e/05_AIAudio/01_Codes/speechbrain/speechbrain/utils/data_utils.py", line 444, in pad_right_to
    valid_vals.append(tensor.shape[j] / target_shape[j])
ZeroDivisionError: division by zero
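To see why this divides by zero, here is a minimal sketch (my simplification, not the actual pad_right_to() in speechbrain/utils/data_utils.py): target_shape is the per-dimension maximum over the batch, so when every "tokens" tensor in the batch is empty, target_shape[0] is 0:

import torch

def valid_fraction(tensor, target_shape):
    # For each dimension, record which fraction of the padded size
    # is occupied by real data.
    valid_vals = []
    for j in range(len(target_shape)):
        valid_vals.append(tensor.shape[j] / target_shape[j])  # 0 / 0 here
    return valid_vals

empty_tokens = torch.LongTensor([])               # shape: torch.Size([0])
valid_fraction(empty_tokens, empty_tokens.shape)  # ZeroDivisionError: division by zero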
The following screenshot shows that, when running pad_right_to(), target_shape[0] is 0.
Going back one frame, the next screenshot shows that the two tensors in this batch are both empty.
Going back one more frame, the screenshot shows that when the key is 'tokens', the two tensors of the batch are both empty.
The training data entry corresponding to the first tensor of this batch looks normal:
After some debugging, I found this might be caused by an empty "tokens_list" returned by hparams["tokenizer"].encode_as_ids(words), and therefore an empty "tokens" tensor returned by tokens = torch.LongTensor(tokens_list), as shown below:
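For reference, this is the failure chain in isolation (a sketch; the words value is hypothetical, and the empty list stands in for what encode_as_ids() returned in my run):

import torch

words = "THE CAT SAT"   # hypothetical non-empty transcript
tokens_list = []        # observed: hparams["tokenizer"].encode_as_ids(words) returned []
tokens = torch.LongTensor(tokens_list)
print(tokens.shape)     # torch.Size([0]); a batch of these makes
                        # target_shape[0] == 0 inside pad_right_to()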
Expected behaviour
When the input wav and its text are not empty, the "tokens_list" returned by hparams["tokenizer"].encode_as_ids(words) and the "tokens" returned by torch.LongTensor(tokens_list) shouldn't be empty.
To Reproduce
No response
Environment Details
No response
Relevant Log Output
No response
Additional Context
tokenizer.yaml
# ############################################################################
# Tokenizer: subword BPE tokenizer with unigram 1K
# Training: Mini-LibriSpeech
# Authors: Abdel Heba 2021
#          Mirco Ravanelli 2021
# ############################################################################

# Set up folders for reading from and writing to
data_folder: /mnt/i/02_AIData/03_AudioData/06_MiniLibriSpeech
output_folder: ./save

# Path where data-specification files are stored
skip_prep: True
train_annotation: ../train.json
valid_annotation: ../valid.json
test_annotation: ../test.json

# Tokenizer parameters
token_type: unigram  # ["unigram", "bpe", "char"]
token_output: 1000  # index(blank/eos/bos/unk) = 0
character_coverage: 1.0
annotation_read: words  # field to read

# Tokenizer object
tokenizer: !name:speechbrain.tokenizers.SentencePiece.SentencePiece
    model_dir: !ref <output_folder>
    vocab_size: !ref <token_output>
    annotation_train: !ref <train_annotation>
    annotation_read: !ref <annotation_read>
    model_type: !ref <token_type>  # ["unigram", "bpe", "char"]
    character_coverage: !ref <character_coverage>
    annotation_list_to_check: [!ref <train_annotation>, !ref <valid_annotation>]
    annotation_format: json
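As a sanity check (not part of the template; the model path is assumed from output_folder above and the <vocab_size>_<model_type>.model naming convention), the trained tokenizer can be loaded directly with sentencepiece to confirm that a non-empty transcript encodes to non-empty ids:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("./save/1000_unigram.model")   # assumed path of the trained model
ids = sp.encode_as_ids("THE CAT SAT ON THE MAT")
assert len(ids) > 0, "tokenizer returned an empty id list"
print(ids)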
================
3. RNNLM.yaml
# ############################################################################
# Model: Language model with a recurrent neural network (RNNLM)
# Training: mini-librispeech transcripts
# Authors: Ju-Chieh Chou 2020, Jianyuan Zhong 2021, Mirco Ravanelli 2021
# ############################################################################

# Seed needs to be set at top of yaml, before objects with parameters are made
seed: 2602
__set_seed: !apply:torch.manual_seed [!ref <seed>]
data_folder: /mnt/e/05_AIAudio/01_Codes/speechbrain/templates/speech_recognition/LM/data/
output_folder: !ref results/RNNLM/
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# If you plan to train a system on an HPC cluster with a big dataset,
# we strongly suggest doing the following:
# 1- Compress the dataset in a single tar or zip file.
# 2- Copy your dataset locally (i.e., the local disk of the computing node).
# 3- Uncompress the dataset in the local folder.
# 4- Set lm_{train,valid,test}_data with the local path.
# Reading data from the local disk of the compute node (e.g. $SLURM_TMPDIR
# with SLURM-based clusters) is very important. It allows you to read the data
# much faster without slowing down the shared filesystem.
lm_train_data: !ref <data_folder>/train.txt
lm_valid_data: !ref <data_folder>/valid.txt
lm_test_data: !ref <data_folder>/test.txt

# The train logger writes training statistics to a file, as well as stdout.
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

# Tokenizer model (you must use the same tokenizer for LM and ASR training)
tokenizer_file: /mnt/e/05_AIAudio/01_Codes/speechbrain/save/1000_unigram.model

# Training parameters
number_of_epochs: 200
batch_size: 80
lr: 0.001
grad_accumulation_factor: 1  # Gradient accumulation to simulate large batch training
ckpt_interval_minutes: 15  # save checkpoint every N min

# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>
    shuffle: True

valid_dataloader_opts:
    batch_size: 1

test_dataloader_opts:
    batch_size: 1

# Model parameters
emb_dim: 256  # dimension of the embeddings
rnn_size: 512  # dimension of hidden layers
layers: 2  # number of hidden layers

# Outputs
# output_neurons: 1000  # index(blank/eos/bos) = 0
# blank_index: 0
bos_index: 0
eos_index: 0

# To design a custom model, either just edit the simple CustomModel
# class that's listed here, or replace this `!new` call with a line
# pointing to a different file you've defined.
model: !new:custom_model.CustomModel
    embedding_dim: !ref <emb_dim>
    rnn_size: !ref <rnn_size>
    layers: !ref <layers>

# Cost function used for training the model
compute_cost: !name:speechbrain.nnet.losses.nll_loss

# This optimizer will be constructed by the Brain class after all parameters
# are moved to the correct device. Then it will be added to the checkpointer.
optimizer: !name:torch.optim.Adam
    lr: !ref <lr>
    betas: (0.9, 0.98)
    eps: 0.000000001

# This function manages learning rate annealing over the epochs.
# We here use the NewBoB algorithm, that anneals the learning rate if
# the improvements over two consecutive epochs is less than the defined
# threshold.
lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0

# The first object passed to the Brain class is this "Epoch Counter"
# which is saved by the Checkpointer so that training can be resumed
# if it gets interrupted at any point.
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>

# Objects in "modules" dict will have their parameters moved to the correct
# device, as well as having train()/eval() called on them by the Brain class.
modules:
    model: !ref <model>

# Tokenizer initialization
tokenizer: !new:sentencepiece.SentencePieceProcessor

# This object is used for saving the state of training both so that it
# can be resumed if it gets interrupted, and also so that the best checkpoint
# can be later loaded for evaluation or inference.
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        model: !ref <model>
        scheduler: !ref <lr_annealing>
        counter: !ref <epoch_counter>

# Pretrain the tokenizer
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        tokenizer: !ref <tokenizer>
    paths:
        tokenizer: !ref <tokenizer_file>
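A rough Python equivalent of the pretrainer block above (my sketch, not SpeechBrain's code): it fetches the file at tokenizer_file and loads it into the bare SentencePieceProcessor defined under tokenizer:

import sentencepiece as spm
from speechbrain.utils.parameter_transfer import Pretrainer

tokenizer = spm.SentencePieceProcessor()
pretrainer = Pretrainer(
    loadables={"tokenizer": tokenizer},
    paths={"tokenizer": "/mnt/e/05_AIAudio/01_Codes/speechbrain/save/1000_unigram.model"},
)
pretrainer.collect_files()   # fetch/symlink the file locally
pretrainer.load_collected()  # load it into the SentencePieceProcessor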
================
4. train.yaml
# ############################################################################
# Model: E2E ASR with attention-based ASR
# Encoder: CRDNN
# Decoder: GRU + beamsearch + RNNLM
# Tokens: 1000 BPE
# losses: CTC + NLL
# Training: mini-librispeech
# Pre-Training: librispeech 960h
# Authors: Ju-Chieh Chou, Mirco Ravanelli, Abdel Heba, Peter Plantinga, Samuele Cornell 2020
# ############################################################################

# Seed needs to be set at top of yaml, before objects with parameters are instantiated
seed: 2602
__set_seed: !apply:torch.manual_seed [!ref <seed>]

# If you plan to train a system on an HPC cluster with a big dataset,
# we strongly suggest doing the following:
# 1- Compress the dataset in a single tar or zip file.
# 2- Copy your dataset locally (i.e., the local disk of the computing node).
# 3- Uncompress the dataset in the local folder.
# 4- Set data_folder with the local path
# Reading data from the local disk of the compute node (e.g. $SLURM_TMPDIR
# with SLURM-based clusters) is very important. It allows you to read the data
# much faster without slowing down the shared filesystem.
data_folder: /mnt/i/02_AIData/03_AudioData/06_MiniLibriSpeech #../data # In this case, data will be automatically downloaded here.
data_folder_noise: !ref <data_folder>/noise # The noisy sequences for data augmentation will automatically be downloaded here.
data_folder_rir: !ref <data_folder>/rir # The impulse responses used for data augmentation will automatically be downloaded here.

# Data for augmentation
NOISE_DATASET_URL: https://www.dropbox.com/scl/fi/a09pj97s5ifan81dqhi4n/noises.zip?rlkey=j8b0n9kdjdr32o1f06t0cw5b7&dl=1
RIR_DATASET_URL: https://www.dropbox.com/scl/fi/linhy77c36mu10965a836/RIRs.zip?rlkey=pg9cu8vrpn2u173vhiqyu743u&dl=1

output_folder: !ref results/CRDNN_BPE_960h_LM/<seed>
test_wer_file: !ref <output_folder>/wer_test.txt
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# Language model (LM) pretraining
# NB: To avoid mismatch, the speech recognizer must be trained with the same
# tokenizer used for LM training. Here, we download everything from the
# speechbrain HuggingFace repository. However, a local path pointing to a
# directory containing the lm.ckpt and tokenizer.ckpt may also be specified
# instead. E.g if you want to use your own LM / tokenizer.
pretrained_path: speechbrain/asr-crdnn-rnnlm-librispeech

# Path where data manifest files will be stored. The data manifest files
# are created by the data preparation script
train_annotation: ../train.json
valid_annotation: ../valid.json
test_annotation: ../test.json
noise_annotation: ../noise.csv
rir_annotation: ../rir.csv
skip_prep: False

# The train logger writes training statistics to a file, as well as stdout.
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

# Training parameters
number_of_epochs: 15
number_of_ctc_epochs: 5
batch_size: 8
lr: 1.0
ctc_weight: 0.5
sorting: ascending
ckpt_interval_minutes: 15  # save checkpoint every N min
label_smoothing: 0.1

# Optimal Number of Workers for Data Reading
# The ideal value depends on your machine's hardware, such as the number of available CPUs.
num_workers: 4

# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>
    num_workers: !ref <num_workers>

valid_dataloader_opts:
    batch_size: !ref <batch_size>
    num_workers: !ref <num_workers>

test_dataloader_opts:
    batch_size: !ref <batch_size>
    num_workers: !ref <num_workers>

# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 40

# NOTE ON DATA AUGMENTATION
# This template demonstrates the use of all available data augmentation strategies
# to illustrate how they work and how you can combine them with the augmenter.
# In practical applications (e.g., refer to other recipes), it is usually advisable
# to select a subset of these strategies for better performance.

# Waveform Augmentation Functions
snr_low: 0  # Min SNR for noise augmentation
snr_high: 15  # Max SNR for noise augmentation
speed_changes: [85, 90, 95, 105, 110, 115]  # List of speed changes for time-stretching
drop_freq_low: 0  # Min frequency band dropout probability
drop_freq_high: 1  # Max frequency band dropout probability
drop_freq_count_low: 1  # Min number of frequency bands to drop
drop_freq_count_high: 3  # Max number of frequency bands to drop
drop_freq_width: 0.05  # Width of frequency bands to drop
drop_chunk_count_low: 1  # Min number of audio chunks to drop
drop_chunk_count_high: 3  # Max number of audio chunks to drop
drop_chunk_length_low: 1000  # Min length of audio chunks to drop
drop_chunk_length_high: 2000  # Max length of audio chunks to drop
clip_low: 0.1  # Min amplitude to clip
clip_high: 0.5  # Max amplitude to clip
amp_low: 0.05  # Min waveform amplitude
amp_high: 1.0  # Max waveform amplitude
babble_snr_low: 5  # Min SNR for babble (batch sum noise)
babble_snr_high: 15  # Max SNR for babble (batch sum noise)

# Feature Augmentation Functions
min_time_shift: 0  # Min random shift of spectrogram in time
max_time_shift: 15  # Max random shift of spectrogram in time
min_freq_shift: 0  # Min random shift of spectrogram in frequency
max_freq_shift: 5  # Max random shift of spectrogram in frequency
time_drop_length_low: 5  # Min length for temporal chunk to drop in spectrogram
time_drop_length_high: 15  # Max length for temporal chunk to drop in spectrogram
time_drop_count_low: 1  # Min number of chunks to drop in time in the spectrogram
time_drop_count_high: 3  # Max number of chunks to drop in time in the spectrogram
time_drop_replace: "zeros"  # Method of dropping chunks
freq_drop_length_low: 1  # Min length for chunks to drop in frequency in the spectrogram
freq_drop_length_high: 5  # Max length for chunks to drop in frequency in the spectrogram
freq_drop_count_low: 1  # Min number of chunks to drop in frequency in the spectrogram
freq_drop_count_high: 3  # Max number of chunks to drop in frequency in the spectrogram
freq_drop_replace: "zeros"  # Method of dropping chunks
time_warp_window: 20  # Length of time warping window
time_warp_mode: "bicubic"  # Time warping method
freq_warp_window: 4  # Length of frequency warping window
freq_warp_mode: "bicubic"  # Frequency warping method

# Enable Waveform Augmentation Flags (useful for hyperparameter tuning)
enable_codec_augment: False
enable_add_reverb: True
enable_add_noise: True
enable_speed_perturb: True
enable_drop_freq: True
enable_drop_chunk: True
enable_clipping: True
enable_rand_amp: True
enable_babble_noise: True
enable_drop_resolution: True

# Enable Feature Augmentations Flags (useful for hyperparameter tuning)
enable_time_shift: True
enable_freq_shift: True
enable_time_drop: True
enable_freq_drop: True
enable_time_warp: True
enable_freq_warp: True

# Waveform Augmenter (combining augmentations)
time_parallel_augment: False  # Apply augmentations in parallel if True, or sequentially if False
time_concat_original: True  # Concatenate original signals to the training batch if True
time_repeat_augment: 1  # Number of times to apply augmentation
time_shuffle_augmentations: True  # Shuffle order of augmentations if True, else use specified order
time_min_augmentations: 1  # Min number of augmentations to apply
time_max_augmentations: 10  # Max number of augmentations to apply
time_augment_prob: 1.0  # Probability to apply time augmentation

# Feature Augmenter (combining augmentations)
fea_parallel_augment: False  # Apply feature augmentations in parallel if True, or sequentially if False
fea_concat_original: True  # Concatenate original signals to the training batch if True
fea_repeat_augment: 1  # Number of times to apply feature augmentation
fea_shuffle_augmentations: True  # Shuffle order of feature augmentations if True, else use specified order
fea_min_augmentations: 1  # Min number of feature augmentations to apply
fea_max_augmentations: 6  # Max number of feature augmentations to apply
fea_augment_prob: 1.0  # Probability to apply feature augmentation

# Model parameters
activation: !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 2
cnn_channels: (128, 256)
inter_layer_pooling_size: (2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 4
rnn_neurons: 1024
rnn_bidirectional: True
dnn_blocks: 2
dnn_neurons: 512
emb_size: 128
dec_neurons: 1024
output_neurons: 1000  # Number of tokens (same as LM)
blank_index: 0
bos_index: 0
eos_index: 0

# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 1.0
valid_beam_size: 8
test_beam_size: 80
eos_threshold: 1.5
using_max_attn_shift: True
max_attn_shift: 240
lm_weight: 0.50
ctc_weight_decode: 0.0
coverage_penalty: 1.5
temperature: 1.25
temperature_lm: 1.25

# The first object passed to the Brain class is this "Epoch Counter"
# which is saved by the Checkpointer so that training can be resumed
# if it gets interrupted at any point.
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>

# Feature extraction
compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>

# Feature normalization (mean and std)
normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global

# Download and prepare the dataset of noisy sequences for augmentation
prepare_noise_data: !name:speechbrain.augment.preparation.prepare_dataset_from_URL
    URL: !ref <NOISE_DATASET_URL>
    dest_folder: !ref <data_folder_noise>
    ext: wav
    csv_file: !ref <noise_annotation>

# Download and prepare the dataset of room impulse responses for augmentation
prepare_rir_data: !name:speechbrain.augment.preparation.prepare_dataset_from_URL
    URL: !ref <RIR_DATASET_URL>
    dest_folder: !ref <data_folder_rir>
    ext: wav
    csv_file: !ref <rir_annotation>

# ----- WAVEFORM AUGMENTATION ----- #

# Codec augmentation
codec_augment: !new:speechbrain.augment.codec.CodecAugment
    sample_rate: !ref <sample_rate>

# Add reverberation to input signal
add_reverb: !new:speechbrain.augment.time_domain.AddReverb
    csv_file: !ref <rir_annotation>
    reverb_sample_rate: !ref <sample_rate>
    clean_sample_rate: !ref <sample_rate>
    num_workers: !ref <num_workers>

# Add noise to input signal
add_noise: !new:speechbrain.augment.time_domain.AddNoise
    csv_file: !ref <noise_annotation>
    snr_low: !ref <snr_low>
    snr_high: !ref <snr_high>
    noise_sample_rate: !ref <sample_rate>
    clean_sample_rate: !ref <sample_rate>
    num_workers: !ref <num_workers>

# Speed perturbation
speed_perturb: !new:speechbrain.augment.time_domain.SpeedPerturb
    orig_freq: !ref <sample_rate>
    speeds: !ref <speed_changes>

# Frequency drop: randomly drops a number of frequency bands to zero.
drop_freq: !new:speechbrain.augment.time_domain.DropFreq
    drop_freq_low: !ref <drop_freq_low>
    drop_freq_high: !ref <drop_freq_high>
    drop_freq_count_low: !ref <drop_freq_count_low>
    drop_freq_count_high: !ref <drop_freq_count_high>
    drop_freq_width: !ref <drop_freq_width>

# Time drop: randomly drops a number of temporal chunks.
drop_chunk: !new:speechbrain.augment.time_domain.DropChunk
    drop_length_low: !ref <drop_chunk_length_low>
    drop_length_high: !ref <drop_chunk_length_high>
    drop_count_low: !ref <drop_chunk_count_low>
    drop_count_high: !ref <drop_chunk_count_high>

# Clipping
clipping: !new:speechbrain.augment.time_domain.DoClip
    clip_low: !ref <clip_low>
    clip_high: !ref <clip_high>

# Random Amplitude
rand_amp: !new:speechbrain.augment.time_domain.RandAmp
    amp_low: !ref <amp_low>
    amp_high: !ref <amp_high>

# Noise sequence derived by summing up all the signals in the batch
# It is similar to babble noise
sum_batch: !name:torch.sum
    dim: 0
    keepdim: True

babble_noise: !new:speechbrain.augment.time_domain.AddNoise
    snr_low: !ref <babble_snr_low>
    snr_high: !ref <babble_snr_high>
    noise_funct: !ref <sum_batch>

drop_resolution: !new:speechbrain.augment.time_domain.DropBitResolution
    target_dtype: 'random'

# Augmenter: Combines previously defined augmentations to perform data augmentation
wav_augment: !new:speechbrain.augment.augmenter.Augmenter
    parallel_augment: !ref <time_parallel_augment>
    concat_original: !ref <time_concat_original>
    repeat_augment: !ref <time_repeat_augment>
    shuffle_augmentations: !ref <time_shuffle_augmentations>
    min_augmentations: !ref <time_min_augmentations>
    max_augmentations: !ref <time_max_augmentations>
    augment_prob: !ref <time_augment_prob>
    augmentations: [
        !ref <codec_augment>,
        !ref <add_reverb>,
        !ref <add_noise>,
        !ref <babble_noise>,
        !ref <speed_perturb>,
        !ref <clipping>,
        !ref <drop_freq>,
        !ref <drop_chunk>,
        !ref <rand_amp>,
        !ref <drop_resolution>]
    enable_augmentations: [
        !ref <enable_codec_augment>,
        !ref <enable_add_reverb>,
        !ref <enable_add_noise>,
        !ref <enable_babble_noise>,
        !ref <enable_speed_perturb>,
        !ref <enable_clipping>,
        !ref <enable_drop_freq>,
        !ref <enable_drop_chunk>,
        !ref <enable_rand_amp>,
        !ref <enable_drop_resolution>]

# ----- FEATURE AUGMENTATION ----- #

# Time shift
time_shift: !new:speechbrain.augment.freq_domain.RandomShift
    min_shift: !ref <min_time_shift>
    max_shift: !ref <max_time_shift>
    dim: 1

# Frequency shift
freq_shift: !new:speechbrain.augment.freq_domain.RandomShift
    min_shift: !ref <min_freq_shift>
    max_shift: !ref <max_freq_shift>
    dim: 2

# Time Drop
time_drop: !new:speechbrain.augment.freq_domain.SpectrogramDrop
    drop_length_low: !ref <time_drop_length_low>
    drop_length_high: !ref <time_drop_length_high>
    drop_count_low: !ref <time_drop_count_low>
    drop_count_high: !ref <time_drop_count_high>
    replace: !ref <time_drop_replace>
    dim: 1

# Frequency Drop
freq_drop: !new:speechbrain.augment.freq_domain.SpectrogramDrop
    drop_length_low: !ref <freq_drop_length_low>
    drop_length_high: !ref <freq_drop_length_high>
    drop_count_low: !ref <freq_drop_count_low>
    drop_count_high: !ref <freq_drop_count_high>
    replace: !ref <freq_drop_replace>
    dim: 2

# Time warp
time_warp: !new:speechbrain.augment.freq_domain.Warping
    warp_window: !ref <time_warp_window>
    warp_mode: !ref <time_warp_mode>
    dim: 1

freq_warp: !new:speechbrain.augment.freq_domain.Warping
    warp_window: !ref <freq_warp_window>
    warp_mode: !ref <freq_warp_mode>
    dim: 2

fea_augment: !new:speechbrain.augment.augmenter.Augmenter
    parallel_augment: !ref <fea_parallel_augment>
    concat_original: !ref <fea_concat_original>
    repeat_augment: !ref <fea_repeat_augment>
    shuffle_augmentations: !ref <fea_shuffle_augmentations>
    min_augmentations: !ref <fea_min_augmentations>
    max_augmentations: !ref <fea_max_augmentations>
    augment_start_index: !ref <batch_size>  # This leaves original inputs unchanged
    concat_end_index: !ref <batch_size>  # This leaves original inputs unchanged
    augment_prob: !ref <fea_augment_prob>
    augmentations: [
        !ref <time_shift>,
        !ref <freq_shift>,
        !ref <time_drop>,
        !ref <freq_drop>,
        !ref <time_warp>,
        !ref <freq_warp>]
    enable_augmentations: [
        !ref <enable_time_shift>,
        !ref <enable_freq_shift>,
        !ref <enable_time_drop>,
        !ref <enable_freq_drop>,
        !ref <enable_time_warp>,
        !ref <enable_freq_warp>]

# The CRDNN model is an encoder that combines CNNs, RNNs, and DNNs.
encoder: !new:speechbrain.lobes.models.CRDNN.CRDNN
    input_shape: [null, null, !ref <n_mels>]
    activation: !ref <activation>
    dropout: !ref <dropout>
    cnn_blocks: !ref <cnn_blocks>
    cnn_channels: !ref <cnn_channels>
    cnn_kernelsize: !ref <cnn_kernelsize>
    inter_layer_pooling_size: !ref <inter_layer_pooling_size>
    time_pooling: True
    using_2d_pooling: False
    time_pooling_size: !ref <time_pooling_size>
    rnn_class: !ref <rnn_class>
    rnn_layers: !ref <rnn_layers>
    rnn_neurons: !ref <rnn_neurons>
    rnn_bidirectional: !ref <rnn_bidirectional>
    rnn_re_init: True
    dnn_blocks: !ref <dnn_blocks>
    dnn_neurons: !ref <dnn_neurons>
    use_rnnp: False

# Embedding (from indexes to an embedding space of dimension emb_size).
embedding: !new:speechbrain.nnet.embedding.Embedding
    num_embeddings: !ref <output_neurons>
    embedding_dim: !ref <emb_size>

# Attention-based RNN decoder.
decoder: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
    enc_dim: !ref <dnn_neurons>
    input_size: !ref <emb_size>
    rnn_type: gru
    attn_type: location
    hidden_size: !ref <dec_neurons>
    attn_dim: 1024
    num_layers: 1
    scaling: 1.0
    channels: 10
    kernel_size: 100
    re_init: True
    dropout: !ref <dropout>

# Linear transformation on the top of the encoder.
ctc_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dnn_neurons>
    n_neurons: !ref <output_neurons>

# Linear transformation on the top of the decoder.
seq_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dec_neurons>
    n_neurons: !ref <output_neurons>

# Final softmax (for log posteriors computation).
log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True

# Cost definition for the CTC part.
ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
    blank_index: !ref <blank_index>

# Tokenizer initialization
tokenizer: !new:sentencepiece.SentencePieceProcessor

# Objects in "modules" dict will have their parameters moved to the correct
# device, as well as having train()/eval() called on them by the Brain class
modules:
    encoder: !ref <encoder>
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    ctc_lin: !ref <ctc_lin>
    seq_lin: !ref <seq_lin>
    normalize: !ref <normalize>
    lm_model: !ref <lm_model>

# Gathering all the submodels in a single model object.
model: !new:torch.nn.ModuleList
    - - !ref <encoder>
      - !ref <embedding>
      - !ref <decoder>
      - !ref <ctc_lin>
      - !ref <seq_lin>

# This is the RNNLM that is used according to the Huggingface repository
# NB: It has to match the pre-trained RNNLM!!
lm_model: !new:speechbrain.lobes.models.RNNLM.RNNLM
    output_neurons: !ref <output_neurons>
    embedding_dim: !ref <emb_size>
    activation: !name:torch.nn.LeakyReLU
    dropout: 0.0
    rnn_layers: 2
    rnn_neurons: 2048
    dnn_blocks: 1
    dnn_neurons: 512
    return_hidden: True  # For inference

# Define scorers for beam search

# If ctc_scorer is set, the decoder uses CTC + attention beamsearch. This
# improves the performance, but slows down decoding.
ctc_scorer: !new:speechbrain.decoders.scorer.CTCScorer
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    ctc_fc: !ref <ctc_lin>

# If coverage_scorer is set, coverage penalty is applied based on accumulated
# attention weights during beamsearch.
coverage_scorer: !new:speechbrain.decoders.scorer.CoverageScorer
    vocab_size: !ref <output_neurons>

# If the lm_scorer is set, a language model
# is applied (with a weight specified in scorer).
rnnlm_scorer: !new:speechbrain.decoders.scorer.RNNLMScorer
    language_model: !ref <lm_model>
    temperature: !ref <temperature_lm>

# Gathering all scorers in a scorer instance for beamsearch:
# - full_scorers are scorers which score on full vocab set, while partial_scorers
#   are scorers which score on pruned tokens.
# - The number of pruned tokens is decided by scorer_beam_scale * beam_size.
# - For some scorers like ctc_scorer, ngramlm_scorer, putting them
#   into full_scorers list would be too heavy. partial_scorers are more
#   efficient because they score on pruned tokens at little cost of
#   performance drop. For other scorers, please see the speechbrain.decoders.scorer.
test_scorer: !new:speechbrain.decoders.scorer.ScorerBuilder
    scorer_beam_scale: 1.5
    full_scorers: [
        !ref <rnnlm_scorer>,
        !ref <coverage_scorer>]
    partial_scorers: [!ref <ctc_scorer>]
    weights:
        rnnlm: !ref <lm_weight>
        coverage: !ref <coverage_penalty>
        ctc: !ref <ctc_weight_decode>

valid_scorer: !new:speechbrain.decoders.scorer.ScorerBuilder
    full_scorers: [!ref <coverage_scorer>]
    weights:
        coverage: !ref <coverage_penalty>

# Beamsearch is applied on the top of the decoder. For a description of
# the other parameters, please see the speechbrain.decoders.S2SRNNBeamSearcher.

# It makes sense to have a lighter search during validation. In this case,
# we don't use scorers during decoding.
valid_search: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    linear: !ref <seq_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <valid_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    temperature: !ref <temperature>
    scorer: !ref <valid_scorer>

# The final decoding on the test set can be more computationally demanding.
# In this case, we use the LM + CTC probabilities during decoding as well,
# which are defined in scorer.
# Please, remove scorer if you need a faster decoder.
test_search: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    linear: !ref <seq_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    temperature: !ref <temperature>
    scorer: !ref <test_scorer>

# This function manages learning rate annealing over the epochs.
# We here use the NewBoB algorithm, that anneals the learning rate if
# the improvements over two consecutive epochs is less than the defined
# threshold.
lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0

# This optimizer will be constructed by the Brain class after all parameters
# are moved to the correct device. Then it will be added to the checkpointer.
opt_class: !name:torch.optim.Adadelta
    lr: !ref <lr>
    rho: 0.95
    eps: 1.e-8

# Functions that compute the statistics to track during the validation step.
error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
    split_tokens: True

# This object is used for saving the state of training both so that it
# can be resumed if it gets interrupted, and also so that the best checkpoint
# can be later loaded for evaluation or inference.
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        model: !ref <model>
        scheduler: !ref <lr_annealing>
        normalizer: !ref <normalize>
        counter: !ref <epoch_counter>

# This object is used to pretrain the language model and the tokenizers
# (defined above). In this case, we also pretrain the ASR model (to make
# sure the model converges on a small amount of data)
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>
        model: !ref <model>
    paths:
        lm: !ref <pretrained_path>/lm.ckpt
        tokenizer: !ref <pretrained_path>/tokenizer.ckpt
        model: !ref <pretrained_path>/asr.ckpt
================
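To inspect what train.py actually receives, the hyperparameters can be loaded the same way the template does (a sketch, run from the ASR folder so train.yaml resolves). Until the pretrainer runs, hparams["tokenizer"] is still an uninitialized SentencePieceProcessor:

from hyperpyyaml import load_hyperpyyaml

with open("train.yaml") as f:
    hparams = load_hyperpyyaml(f, overrides={"number_of_epochs": 1, "batch_size": 2})
print(type(hparams["tokenizer"]))   # <class 'sentencepiece.SentencePieceProcessor'>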
This code runs in WSL 2 inside a conda virtual environment; the installed packages are as follows:
As a workaround, I modified pad_right_to() in speechbrain/utils/data_utils.py to guard the division:

if target_shape[j] != 0:
    valid_vals.append(tensor.shape[j] / target_shape[j])
else:
    valid_vals.append(1.0)
After this modification, and with the batch size increased to 8, training went on well and finished one epoch, as shown below:
However, before going any further, it reported the error "sentencepiece_processor.cc(954) LOG(ERROR) src/sentencepiece_processor.cc(294) [model_] Model is not initialized."
It then threw an exception, as shown below:
Going back to the compute_objectives() frame, the information is as follows:
My problem was caused by my own mistake: I had commented out the initialization code of the pretrainer. As shown in the red box of the following figure, commenting it out produced the ZeroDivisionError. In addition, I had not supplied the correct configuration arguments for the pretrainer section in train.yaml.
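For completeness, the lines I had commented out look like the following (paraphrased from the template's train.py; they run after hparams is loaded). They are what actually populate the tokenizer and the pretrained models:

from speechbrain.utils.distributed import run_on_main

# Collect the pretrained LM, tokenizer, and ASR checkpoints (from the
# HuggingFace repo or a local path given in train.yaml), then load them.
# Without this step, the bare sentencepiece.SentencePieceProcessor stays
# uninitialized and encode_as_ids() yields empty token lists.
run_on_main(hparams["pretrainer"].collect_files)
hparams["pretrainer"].load_collected()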