online mix noise audio data in training step #2622

Open
wants to merge 32 commits into base: master

Conversation

mychiux413
Contributor

Mixing noise into the training files before runtime can make the data monotonous, but mixing noise at runtime can perform very badly if we read a noise audio file to augment each training row (e.g. on an HDD, mixing one audio file takes almost 100 times longer than freq_time_mask does).

To reduce the online mixing time, I use another tf.data.Dataset to cache the noise audio arrays, then mix them into the training data.
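A minimal sketch of the idea (illustrative only; the helper name create_noise_dataset and the exact pipeline details are assumptions, not the code in this PR): build a second tf.data.Dataset over the noise files, decode them once, cache the arrays, and cycle through them so each training row can draw one noise window.

import tensorflow as tf

def create_noise_dataset(noise_wav_paths, cache_path=''):
    # Decode each noise wav once; cache() with an empty path keeps the
    # decoded arrays in memory, a file path caches them on disk.
    def decode(path):
        samples, _ = tf.audio.decode_wav(tf.io.read_file(path), desired_channels=1)
        return tf.squeeze(samples, axis=-1)

    return (tf.data.Dataset.from_tensor_slices(noise_wav_paths)
            .map(decode, num_parallel_calls=tf.data.experimental.AUTOTUNE)
            .cache(cache_path)
            .shuffle(buffer_size=64)
            .repeat())

# An iterator over this dataset is then consumed inside the feeding
# pipeline, one noise window per training sample.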

usage:

python -u DeepSpeech.py --noshow_progressbar \
  --train_files data/ldc93s1/ldc93s1.csv \
  --test_files data/ldc93s1/ldc93s1.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 200 \
  --epochs 200 \
  --checkpoint_dir <checkpoint_dir> \
  --audio_aug_mix_noise_walk_dirs <directory1-contains-wav-files>,<directory2-contains-wav-files>
  • Just specify the noise file directory; the process will automatically walk the whole directory recursively and collect the .wav files (but it doesn't check the sample rate).
  • This program assumes the volume of every noise audio file has already been maximized, to save the time of calculating each speech/noise volume balance; it simply attenuates the speech audio by a value between 0 and -10 dB and the noise audio by a value between -25 and -50 dB.
  • The augmentation time can be as fast as freq_time_mask.
  • --audio_aug_mix_noise_walk_dirs can take multiple directories, comma separated.

To manually adjust the volume suppression:

python -u DeepSpeech.py \
...
--audio_aug_mix_noise_max_noise_db -25 \
--audio_aug_mix_noise_min_noise_db -50 \
--audio_aug_mix_noise_max_audio_db 0 \
--audio_aug_mix_noise_min_audio_db -10 \
...
  • If your noise files are pure non-speech noise, my experience suggests --audio_aug_mix_noise_max_noise_db -15 and --audio_aug_mix_noise_min_noise_db -25 (a sketch after this list shows how these dB values translate into gain factors).
  • If your noise files come from speakers, like a cocktail party, my experience suggests --audio_aug_mix_noise_max_noise_db -30 and --audio_aug_mix_noise_min_noise_db -50; otherwise the noise has a chance of covering the main speaker's voice.
  • If you want to cache the audio arrays on local disk, set --audio_aug_mix_noise_cache <your cache path>; otherwise they are cached in memory.
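For reference, here is a rough sketch of how such dB offsets become multiplicative gains; it mirrors the pow(10, db / 10) conversion visible in the augment_noise snippet quoted later in this thread, while the random draw itself is an assumption about the implementation.

import tensorflow as tf

def random_gain(min_db, max_db):
    # Draw a dB value in [min_db, max_db] and convert it to a linear ratio.
    db = tf.random.uniform([], minval=min_db, maxval=max_db)
    return tf.math.pow(10.0, db / 10.0)

audio_ratio = random_gain(-10.0, 0.0)    # speech scaled by 0 to -10 dB
noise_ratio = random_gain(-50.0, -25.0)  # noise scaled by -25 to -50 dB
# mixed = audio * audio_ratio + noise * noise_ratio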

@community-tc-integration

No Taskcluster jobs started for this pull request
The `allowPullRequests` configuration for this repository (in `.taskcluster.yml` on the
default branch) does not allow starting tasks for this pull request.

@DanBmh
Contributor

DanBmh commented Feb 5, 2020

I tested it with the Freesound Dataset Kaggle 2019, which has about 103 h of noise data.
Everything worked as intended. Only I didn't see a great difference in my training results (using the Voxforge DE dataset). Maybe it's too small.

@alokprasad

I tested it with the Freesound Dataset Kaggle 2019, which has about 103 h of noise data.
Everything worked as intended. Only I didn't see a great difference in my training results (using the Voxforge DE dataset). Maybe it's too small.

Did you mean the noise dataset is small, or the Voxforge dataset is small, comparatively?
One suggestion: if you feel the noise dataset is small, you can use rnnoise's dataset (https://people.xiph.org/~jm/demo/rnnoise/rnnoise_contributions.tar.gz)

@DanBmh
Contributor

DanBmh commented Feb 6, 2020

I did mean the Voxforge dataset. It has only around 32 h of speech data.

I think the rnnoise dataset is smaller than the Freesound one (6 vs 22 GB; I did not find the length in hours).

Also, the noise files of rnnoise are in .raw format, while Freesound already uses .wav format. So you need to convert them to wav somehow first.

Contributor

@DanBmh left a comment

Also think about replacing the cache() call with prefetch(tf.data.experimental.AUTOTUNE). For me it reduced the memory usage by about 64 GB without an impact on training speed.
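For illustration, the suggested change is a one-line swap in the noise pipeline (the variable name noise_set is an assumption):

# Before: cache() keeps every decoded noise array resident for the whole run
noise_set = noise_set.cache()

# After: prefetch only keeps a small look-ahead buffer in memory
noise_set = noise_set.prefetch(tf.data.experimental.AUTOTUNE)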

… for memory cost [MOD] deprecate FLAGS.audio_aug_mix_noise_cache
@mychiux413
Contributor Author

To use the rnnoise datasets, we should normalize the volume and convert the frame rate to 16000 manually; many of the rnnoise audio files are nearly silent without volume normalization.
This mix-noise process assumes the volume of every single noise file has been maximized, so it doesn't calculate dBFS to balance the speech/noise volume during processing.

@alokprasad

@mychiux413 Any idea how this can be done? Should it be an online process?

@mychiux413
Contributor Author

@mychiux413 Any idea how this can be done? Should it be an online process?

You should prepare the normalized noise files yourself before training starts.

There is no standard way to normalize volume; I can only offer an example, which you can optimize yourself. Don't forget to listen to the output audio to make sure everything sounds right.

notice:

  1. I use pydub in the example; before pip install pydub, you should install ffmpeg via sudo apt-get install ffmpeg.
  2. The raw data I downloaded from rnnoise is .raw, for which the frame rate, sample size, and channels must be specified manually.
  3. Some rnnoise files are almost 5 minutes long, which is unnecessary for online mixing, so the example splits them into chunks of around 30 seconds.
  4. The script targets a Python 3.7 environment (typing supported).

usage:

python <python_file.py> --from_dir <directory include rnnoise data> --to_dir <directory to output normalized data>
from __future__ import absolute_import, division, print_function
from pydub import AudioSegment
from multiprocessing import Pool
from functools import partial
import math
import argparse
import sys
import os


def detect_silence(sound: AudioSegment, silence_threshold=-50.0,
                   chunk_size=10) -> (int, int):
    start_trim = 0  # ms
    sound_size = len(sound)
    assert chunk_size > 0  # to avoid infinite loop
    while sound[start_trim:(
            start_trim +
            chunk_size)].dBFS < silence_threshold and start_trim < sound_size:
        start_trim += chunk_size

    end_trim = sound_size
    while sound[(end_trim - chunk_size):end_trim].dBFS < silence_threshold \
            and end_trim > 0:
        end_trim -= chunk_size

    start_trim = min(sound_size, start_trim)
    end_trim = max(0, end_trim)

    return min([start_trim, end_trim]), max([start_trim, end_trim])


def trim_silence_audio(sound: AudioSegment,
                       silence_threshold=-50.0,
                       chunk_size=10) -> AudioSegment:
    start_trim, end_trim = detect_silence(sound, silence_threshold, chunk_size)
    return sound[start_trim:end_trim]


def convert(filename: str, src_dir: str, dst_dirpath: str, dirpath: str,
            normalize: bool, trim_silence: bool, min_duration_seconds: float,
            max_duration_seconds: float):
    if not filename.endswith(('.wav', '.raw')):
        return
    filepath = os.path.join(dirpath, filename)
    if filename.endswith('.wav'):
        sound: AudioSegment = AudioSegment.from_file(filepath)
    else:
        try:
            sound: AudioSegment = AudioSegment.from_raw(filepath,
                                                        sample_width=2,
                                                        frame_rate=44100,
                                                        channels=1)
        except Exception as err:
            print('[retry] {}'.format(err))
            try:
                sound: AudioSegment = AudioSegment.from_raw(filepath,
                                                            sample_width=2,
                                                            frame_rate=48000,
                                                            channels=1)
            except Exception as err:
                print('bypass audio {}, got error: {}'.format(filepath, err))
                return
        try:
            sound = sound.set_frame_rate(16000)
        except Exception as err:
            print('[bypass] {}'.format(err))
            return

    n_splits: int = max(
        1, math.floor(sound.duration_seconds / max_duration_seconds))
    chunk_duration_ms = math.ceil(len(sound) / n_splits)
    chunks = []
    for i in range(n_splits):
        end_ms = min((i + 1) * chunk_duration_ms, len(sound))
        chunk = sound[(i * chunk_duration_ms):end_ms]
        chunks.append(chunk)
    for i, chunk in enumerate(chunks):
        dst_path = os.path.join(dst_dirpath, str(i) + '_' + filename)
        if dst_path.endswith('.raw'):
            dst_path = dst_path[:-4] + '.wav'
        if os.path.exists(dst_path):
            print('audio exists: {}'.format(dst_path))
            return
        if normalize:
            chunk = chunk.normalize()
            if chunk.dBFS < -30.0:
                chunk = chunk.compress_dynamic_range().normalize()
            if chunk.dBFS < -30.0:
                chunk = chunk.compress_dynamic_range().normalize()
        if trim_silence:
            chunk = trim_silence_audio(chunk)

        if chunk.duration_seconds < min_duration_seconds:
            return
        chunk.export(dst_path, format='wav')


def main(src_dir: str,
         dst_dir: str,
         min_duration_seconds: float,
         max_duration_seconds: float,
         normalize=True,
         trim_silence=True):
    assert os.path.exists(src_dir)
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir, exist_ok=False)
    src_dir = os.path.abspath(src_dir)
    dst_dir = os.path.abspath(dst_dir)

    # n_data = 0
    for dirpath, _, filenames in os.walk(src_dir):
        dirpath = os.path.abspath(dirpath)
        dst_dirpath = os.path.join(dst_dir,
                                   dirpath.replace(src_dir, '').lstrip('/'))
        print('converting dirpath: {} -> {}'.format(dirpath, dst_dirpath))
        if not os.path.exists(dst_dirpath):
            os.makedirs(dst_dirpath, exist_ok=False)

        convert_func = partial(convert,
                               src_dir=src_dir,
                               dst_dirpath=dst_dirpath,
                               dirpath=dirpath,
                               normalize=normalize,
                               trim_silence=trim_silence,
                               min_duration_seconds=min_duration_seconds,
                               max_duration_seconds=max_duration_seconds)
        # Use the pool as a context manager so worker processes are cleaned up
        # after each directory instead of leaking across loop iterations.
        with Pool() as p:
            p.map(convert_func, filenames)


if __name__ == "__main__":
    PARSER = argparse.ArgumentParser(description='Optimize noise files')
    PARSER.add_argument('--from_dir',
                        help='Convert wav from directory',
                        type=str)
    PARSER.add_argument('--to_dir', help='save wav to directory', type=str)
    PARSER.add_argument('--min_sec',
                        help='min duration seconds of saved file',
                        type=float,
                        default=1.0)
    PARSER.add_argument('--max_sec',
                        help='max duration seconds of saved file',
                        type=float,
                        default=30.0)
    PARSER.add_argument('--normalize',
                        action='store_true',
                        help='Normalize volume, default is true',
                        default=True)
    PARSER.add_argument('--trim',
                        action='store_true',
                        help='Trim silence, default is true',
                        default=True)
    PARAMS = PARSER.parse_args()

    main(PARAMS.from_dir, PARAMS.to_dir, PARAMS.min_sec, PARAMS.max_sec,
         PARAMS.normalize, PARAMS.trim)

@DanBmh
Copy link
Contributor

DanBmh commented Feb 17, 2020

There is no standard way to normalize volume; I can only offer an example, which you can optimize yourself. Don't forget to listen to the output audio to make sure everything sounds right.

Could you add this script to your pull request?

I added a progressbar and a summary to it, feel free to copy it back. The updated code is here: https://github.com/DanBmh/deepspeech-german/blob/master/data/normalize_noise_audio.py

@mychiux413
Copy link
Contributor Author

I added bin/normalize_noise_audio.py, and did some modifications:

  1. Removed typing for environment compatibility
  2. Fixed pylint errors, and added a warning message for ImportError of tqdm & pydub, because they are not standard packages in requirements.txt
  3. Replaced seconds_to_hours() with util/feeding.py::secs_to_hours()

Usage:

python bin/normalize_noise_audio.py --from_dir <directory include noise data> --to_dir <directory to output normalized data>

@alokprasad

@mychiux413 Is there any way we can dump the mixed files and see how effective the mixing of noise into the speech files is? Just to make sure the mixing is proper.

@mychiux413
Contributor Author

@alokprasad You're right. In fact, all the augmented audio should be reviewable in the pipeline, even augmentations on the spectrogram like pitch/tempo/mask, or we would have no basis for tuning the proper parameters.
But in TensorFlow's pipeline it's not as simple as offline augmentation: we should dump the audio data into TensorBoard via tf.summary.audio. I'm still studying this method, and also trying to figure out how much refactoring it would require.
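A minimal sketch of that approach (assuming TF 1.x graph mode and a 1-D mixed_audio tensor; not code from this PR):

# Attach an audio summary to the augmented sample so it can be reviewed
# in TensorBoard next to the training curves.
audio_summary = tf.summary.audio('augmented_audio',
                                 tf.expand_dims(mixed_audio, 0),  # [1, samples]
                                 sample_rate=16000,
                                 max_outputs=3)
# The summary op still has to be fetched in the session run and written by
# the existing FileWriter, which is where the refactoring cost comes from.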

@alokprasad

alokprasad commented Feb 20, 2020

@mychiux413
I also tried to save the audio using tf.print's output_stream option in the following function:

"def augment_noise"
    noise_ratio = tf.math.pow(10.0, choosen_noise_db / 10)
    mixed_audio = tf.multiply(audio, audio_ratio) + tf.multiply(mixed_noise, noise_ratio)
    #save to wav file              
    final_pcm = contrib_audio.encode_wav(mixed_audio,16000)
    tf.print(final_pcm,output_stream="file:///tmp/test.wav",summarize=-1)
    return mixed_audio
    #return tf.multiply(audio, audio_ratio) + tf.multiply(mixed_noise, noise_ratio)

But I am facing two problems:

  1. I am not able to change the output_stream parameter dynamically, so that multiple wav files could be saved.
  2. The file size keeps growing, so we have to stop training with Ctrl+C after a few steps.

Anyway, when I listen to the audio, I don't think the noise is being mixed into the speech at all.
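A possible workaround for problem 1, sketched here as an assumption rather than code from this PR: build the filename as a tensor and write each example with tf.io.write_file, so every mixed sample lands in its own wav file (step_counter stands for any per-example integer tensor, e.g. the global step).

# Assumes mixed_audio already has shape [samples, 1] as expected by encode_wav.
wav_bytes = tf.audio.encode_wav(mixed_audio, 16000)
filename = tf.strings.join(
    ['/tmp/mixed_', tf.strings.as_string(step_counter), '.wav'])
write_op = tf.io.write_file(filename, wav_bytes)
# write_op must be run (or added as a control dependency) in graph mode,
# otherwise nothing is written.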

@mychiux413
Contributor Author

@alokprasad I tried tf.print and listened to the audio; it really is augmented. Maybe my default parameters are too conservative (because some noise data are "speech noise", and I don't know what they would cause if too loud). Also, the process will not augment every single audio time step, but just randomly augments an interval for each audio, and many intervals in the noise files are actually silence.
Don't forget to delete test.wav before each execution, or you will always hear the same output.
Try an extreme example: --audio_aug_mix_noise_max_noise_db=5, --audio_aug_mix_noise_min_noise_db=10, to make sure the noise really is there.

Here is another tip: you can also try --audio_aug_mix_noise_max_audio_db=10, which can simulate an over-boosted microphone effect.

@alokprasad

@mychiux413 "the process will not augment every single audio time step, but just randomly augments an interval for each audio": I think this might not produce a good result; I think every interval should be mixed with noise (i.e. the complete file should be mixed with noise).

In fact, it would be good if the same audio were fed to the network twice:

  1. mixed with noise
  2. without noise.

I have added an extra "noise_flag" column to the transcript .csv file, whose value is 0 or 1.
E.g. the csv file will contain the following:

wav_filename,wav_filesize,transcript,noise_flag
test1.wav,3423,"where are you?",1
test1.wav,3423,"where are you?",0

1 means mix noise and 0 means do not mix noise.

Relevant code changes:

if train_phase and noise_iterator:
    audio = tf.cond(noise_flag > 0,
        lambda: augment_noise(
            audio,
            noise_iterator.get_next(),
            change_audio_db_max=FLAGS.audio_aug_mix_noise_max_audio_db,
            change_audio_db_min=FLAGS.audio_aug_mix_noise_min_audio_db,
            change_noise_db_max=FLAGS.audio_aug_mix_noise_max_noise_db,
            change_noise_db_min=FLAGS.audio_aug_mix_noise_min_noise_db,
        ),
        lambda: audio)

@DanBmh
Contributor

DanBmh commented Mar 25, 2020

Yes, it makes sense, and I will try it, but there would be twice as many arguments as in the previous version. How about specifying the number of sub-speakers for each speech sample? Would that be helpful for your experiments?

Do you mean augmenting with not only one but multiple background speech or noise files at once? If you don't think it's too complicated, this is an interesting idea. It would make the background noises even more realistic. In this case I would suggest making the number not fixed, but random with an upper bound, to simulate different environments.
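A small NumPy sketch of that idea (offline and purely illustrative; the helper and its arguments are assumptions, not part of this PR): overlay a random number of noise windows, up to an upper bound, each with its own random gain.

import numpy as np

def mix_random_noises(audio, noise_windows, max_sources=3, rng=np.random):
    # Overlay 1..max_sources noise windows onto the speech signal.
    mixed = audio.astype(np.float32)
    for _ in range(rng.randint(1, max_sources + 1)):
        noise = noise_windows[rng.randint(len(noise_windows))]
        noise = np.resize(noise, audio.shape)  # crude length matching
        gain_db = rng.uniform(-50.0, -25.0)
        mixed = mixed + noise.astype(np.float32) * (10.0 ** (gain_db / 10.0))
    return mixed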

@dabinat
Collaborator

dabinat commented Mar 28, 2020

Here’s a question: is it necessary to run augmentation on every epoch? It seems like augmentation is probably more valuable as the model nears convergence. I wonder if you could balance out the performance hit by not augmenting the first x epochs, when the model still has a high WER.

Daniel added 4 commits March 29, 2020 12:49
…setest

# Conflicts:
#	DeepSpeech.py
#	evaluate.py
#	util/feeding.py
#	util/flags.py
@mychiux413
Contributor Author

Here’s a question: is it necessary to run augmentation on every epoch? It seems like augmentation is probably more valuable as the model nears convergence. I wonder if you could balance out the performance hit by not augmenting the first x epochs, when the model still has a high WER.

Here are my recent experiment results below (still continuing...). I trained 20 epochs for every model with different parameters.

  • noise files: rnnoise, pointsources noise
  • train dataset: librivox clean-100.csv, clean-300.csv, other-500.csv
  • test dataset: test-clean.csv
  • the loss records are from the final step (epoch = 19)
  • in addition, I also mixed zh-TW speech into librivox and tested the WER.
| Name | min_audio_dbfs | max_audio_dbfs | min_snr_db | max_snr_db | limit_audio_peak_dbfs | limit_noise_peak_dbfs | train loss | dev loss | test loss | test wer | test loss (mix TW speech) | test wer (mix TW speech) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline (No Augmentation) | | | | | | | 27.685342 | 24.046401 | 23.756416 | 0.137232 | 121.442734 | 0.454246 |
| Default mix noise | 0 | -35 | 3 | 30 | 7 | 3 | 69.323678 | 21.669104 | 21.383959 | 0.112958 | 60.703743 | 0.270337 |
| speech non over boosted | 0 | -35 | 3 | 30 | 0 | 3 | 64.432057 | 21.491052 | 21.344168 | 0.11471 | 60.352631 | 0.261519 |
| noise non over boosted | 0 | -35 | 3 | 30 | 7 | 0 | 66.458655 | 21.09868 | 21.09868 | 0.111596 | 62.270283 | 0.269928 |
| Wide speech volume | 0 | -45 | 3 | 30 | 7 | 3 | 67.366901 | 21.060449 | 20.68895 | 0.116559 | 59.696766 | 0.2673 |

The result shows:

  1. Whatever the noise parameters are, the test WER is always better than that of the "No Aug" model.
  2. The robustness to noise (column test wer (mix TW speech)) improves greatly with mixed-noise training.
  3. Don't be misled by the training loss when mixing with noise, because the space of data covered is larger than without augmentation.
  4. Looking at the noise-mix training, the parameters may involve some trade-offs: if we want to handle cocktail-party speech better, we might lose some accuracy on the clean test. In my opinion, skipping the first x epochs to emphasize the clean environment should be equivalent to increasing the max SNR, so the noise test would then be worse.

So my conclusion is:

  • Tune the noise parameters according to your target application environment, which should be equivalent to tuning "skip the first x epochs".
  • Of course I will also try your idea if I have free resources later.

@alokprasad

@mychiux413 How are you generating the test samples? Is it natural voice with a noisy background, or have you mixed clean speech with noise and then used that as the test wav?

@mychiux413
Contributor Author

@alokprasad Mixed clean speech with noise, using the new feature --test_augmentation_files; the test dataset is always librivox-test-clean.csv.

…g, add option to mix multi noise into one audio [MOD] change FLAGS name, gla iterations is optional
@tilmankamp
Contributor

@mychiux413 Master changed quite a bit since you opened this in December. Could you rebase (and squash) it?

@mychiux413
Contributor Author

@tilmankamp Maybe I should wait for it; the latest master did so much refactoring that ./bin/run-ldc93s1.sh doesn't even work, and furthermore I haven't fully understood the new project structure.

@DanBmh
Contributor

DanBmh commented Apr 9, 2020

What's the reason for the last commit (no-sort merge)?

Daniel added 4 commits April 12, 2020 20:01
# Conflicts:
#	DeepSpeech.py
#	evaluate.py
#	training/deepspeech_training/util/feeding.py
@DanBmh
Contributor

DanBmh commented Apr 17, 2020

@tilmankamp Maybe I should wait for it; the latest master did so much refactoring that ./bin/run-ldc93s1.sh doesn't even work, and furthermore I haven't fully understood the new project structure.

@mychiux413 Sent you a pull request.

@reuben
Contributor

reuben commented Apr 17, 2020

What's the reason for the last commit (no-sort merge)?

I think @carlfm01 just did an incorrect push at some point. @mychiux413 should be able to just force-push over it.

Daniel and others added 3 commits April 23, 2020 10:47
Revert "Merge branch 'no-sort' into more-augment-options"

This reverts commit 7792226, reversing
changes made to f7d1279.
Merge current master for rebase to v0.7
@DanBmh
Contributor

DanBmh commented May 19, 2020

Am I right that this is now outdated with @tilmankamp's merged pull request #2897?

The overlay augmentation docs describe the same mixing features of noise and speech files.

@tilmankamp
Contributor

@DanBmh Unfortunately yes. Due to the massive amount of data that we plan to use for overlaying, things had to be integrated more tightly with the sample-reading facilities in util/sample_collections.py. Also, some of the augmentations would have been hard to realize on the TensorFlow side of things. Sorry for this decision!

@DanBmh
Contributor

DanBmh commented May 19, 2020

@DanBmh Unfortunately yes. Due to the massive amount of data that we plan to use for overlaying, things had to be integrated more tightly with the sample-reading facilities in util/sample_collections.py. Also, some of the augmentations would have been hard to realize on the TensorFlow side of things. Sorry for this decision!

Maybe you could have informed us earlier, but I'm glad this feature is now in the master branch :) And you also added some other interesting augmentations.

Do you already plan to use the noise augmentation for the next checkpoint release?

@JRMeyer
Contributor

JRMeyer commented Sep 22, 2020

@lissyx @reuben -- it seems like this PR can be closed

@DanBmh
Contributor

DanBmh commented Sep 22, 2020

Just wanted to note that this PR still has an important feature which is missing in @tilmankamp's overlay implementation: The possibility to run tests with noise mixing.

@lissyx
Collaborator

lissyx commented Sep 23, 2020

Just wanted to note that this PR still has an important feature which is missing in @tilmankamp's overlay implementation: The possibility to run tests with noise mixing.

This needs rebasing anyway, but if someone wants to do it and address the issues, it's welcome
