InaGVAD : a Challenging French TV and Radio Corpus annotated for Speech Activity Detection and Speaker Gender Segmentation

To be released soon

This corpus will be released on May 20th 2024. Stay in touch!

About

InaGVAD is an annotated audiovisual corpus designed for Voice Activity Detection (VAD) and Speaker Gender Segmentation (SGS), aimed at representing the acoustic diversity of French TV and radio programs. The corpus is freely available for research purposes and can be downloaded from the website of the French National Audiovisual Institute (INA). A detailed description of InaGVAD, together with a benchmark of 6 freely available VAD systems and 3 SGS systems, is provided in a paper presented at LREC-COLING 2024.

InaGVAD contains 277 1-minute-long annotated recordings, partitioned into a 1h development subset and a 3h37 test subset, allowing fair and reproducible system evaluation. The evaluation scripts distributed with the corpus produce performance estimates under the same conditions as those used for the 6 VAD and 3 SGS systems presented in the associated paper. Recordings were collected from 10 French radio and 18 TV channels, categorized into 4 groups associated with diverse acoustic conditions: generalist radio, music radio, news TV, and generalist TV.

InaGVAD provides an extended VAD and SGS annotation scheme that makes it possible to characterize the diverse abilities of systems, based on:

Speaker Traits categories
- 3 Genders: Female, Male, I Don't Know (IDK)
- 3 Age groups: Young (prepubescent), Adult, Elderly (Senior)
- 3 Speech Qualities: standard, interjections (ah, oh, eg, aie), atypical (crying, laughing or shouted speech, ill person's voice, artificially distorted voices, vocal performance, monster voice...)
10 Non-Speech event categories: applause, environmental noise, hubbub, jingle, foreground music, background music, respiration, non-intelligible laughter, other, empty
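
As an illustration only, the annotation scheme listed above could be modelled in Python as follows. This is a minimal sketch: the class names, label strings and the `SpeechSegment` structure are hypothetical and do not describe the actual format of the annotation files distributed with the corpus.

```python
from dataclasses import dataclass
from enum import Enum


class Gender(Enum):
    FEMALE = "female"
    MALE = "male"
    IDK = "idk"  # "I Don't Know"


class AgeGroup(Enum):
    YOUNG = "young"      # prepubescent
    ADULT = "adult"
    ELDERLY = "elderly"  # senior


class SpeechQuality(Enum):
    STANDARD = "standard"
    INTERJECTION = "interjection"
    ATYPICAL = "atypical"  # crying, shouting, distorted voices, ...


class NonSpeechEvent(Enum):
    APPLAUSE = "applause"
    ENVIRONMENTAL_NOISE = "environmental_noise"
    HUBBUB = "hubbub"
    JINGLE = "jingle"
    FOREGROUND_MUSIC = "foreground_music"
    BACKGROUND_MUSIC = "background_music"
    RESPIRATION = "respiration"
    NON_INTELLIGIBLE_LAUGHTER = "non_intelligible_laughter"
    OTHER = "other"
    EMPTY = "empty"


@dataclass(frozen=True)
class SpeechSegment:
    """Hypothetical container for one annotated speech segment (times in seconds)."""
    start: float
    stop: float
    gender: Gender
    age: AgeGroup
    quality: SpeechQuality

    @property
    def duration(self) -> float:
        return self.stop - self.start


# Example: a 2.5 second segment of standard speech by an adult female speaker
segment = SpeechSegment(0.0, 2.5, Gender.FEMALE, AgeGroup.ADULT, SpeechQuality.STANDARD)
```

Enumerations keep the label sets closed, so a typo in a label raises an error instead of silently creating a new category.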

The entire inaGVAD package, including corpus, annotations, evaluation scripts, and baseline training code, is made freely accessible to foster future advances in the domain.

Statement of need

Over the past few years, a growing number of digital humanities studies, as well as reports from French audiovisual regulation authorities, have used automatic Voice Activity Detection (VAD) and Speaker Gender Segmentation (SGS) to estimate women's and men's speaking time in massive amounts of audiovisual media (Dou18, ARC24). Although these studies have a high social impact and receive wide media coverage, the lack of appropriate annotated speech resources makes it difficult to estimate the reliability of SGS systems across the diversity of audiovisual materials.

Speech corpora designed for ASR (ESTER, REPERE) tend to favor the quantity of lexical terms over the accurate timing of non-speech events. Their programs are mostly composed of news or debates, excluding documentaries, movies, cartoons, musical programs and advertisements.
Speech resources suited to VAD (AVA-Speech, DI-HARD 2, RATS, LibriParty) do provide more accurate timings but lack speaker trait (gender, age), speech quality (timbre, elocution) and non-speech event annotations.
Speaker recognition corpora (Voxceleb, INA speaker dictionary, INA diachronic speaker dictionary) provide isolated speaker segments that do not allow speaker changes to be evaluated. They are generally obtained from interviews using automatic methods (diarization, VAD, active speaker detection), which discard atypical vocal performances and noisy conditions.

InaGVAD is aimed at closing the gap between ASR, VAD and speaker corpora and provides:

- fine-grained time-coded speech and non-speech event annotations
- speaker trait (gender, age) and speech quality annotations
- materials representing the diversity of contents that can be found in French TV and radio
- a freely available corpus and evaluation code for training and evaluating models

A Voice Activity Detection benchmark based on 6 open-source systems (inaSpeechSegmenter, LIUM_SpkDiarization, Pyannote, Rvad, Silero, SpeechBrain) shows that the InaGVAD generalist TV and music radio categories are more challenging than the AMI, VoxConverse and DIHARD 3 VAD corpora, judging by the performance estimates obtained on them. A baseline X-vector transfer learning strategy, trained on the 1h inaGVAD development set, shows that models trained on a single, but diverse, hour of data can achieve competitive SGS results.
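
To make concrete what scoring a VAD system against reference annotations can involve, here is a minimal frame-level sketch. It is not the evaluation code shipped with the corpus, and the metrics reported in the paper may be defined differently. The function names, the `(start, stop)` segment representation in seconds and the 10 ms frame step are all assumptions made for illustration.

```python
def to_frames(segments, duration, step=0.01):
    """Convert (start, stop) speech segments in seconds to a list of booleans, one per frame."""
    n_frames = int(round(duration / step))
    frames = [False] * n_frames
    for start, stop in segments:
        first = max(0, int(round(start / step)))
        last = min(n_frames, int(round(stop / step)))
        for i in range(first, last):
            frames[i] = True
    return frames


def vad_scores(reference, hypothesis, duration, step=0.01):
    """Frame-level accuracy, precision and recall of the speech class."""
    ref = to_frames(reference, duration, step)
    hyp = to_frames(hypothesis, duration, step)
    tp = sum(r and h for r, h in zip(ref, hyp))          # speech detected as speech
    fp = sum(h and not r for r, h in zip(ref, hyp))      # non-speech detected as speech
    fn = sum(r and not h for r, h in zip(ref, hyp))      # missed speech
    tn = sum(not r and not h for r, h in zip(ref, hyp))  # non-speech correctly rejected
    return {
        "accuracy": (tp + tn) / len(ref) if ref else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Example on a 1-minute recording: the reference has speech from 0 s to 30 s,
# while the hypothetical system output has speech from 10 s to 40 s.
scores = vad_scores([(0.0, 30.0)], [(10.0, 40.0)], duration=60.0)
```

Scoring on fixed-size frames weights errors by their duration, so a system that misses many short segments is penalized in proportion to the speech time it actually loses.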

Downloading Audio files

Downloading inaGVAD audio files requires accepting the general terms and conditions of use (GCU) and filling in the form available at https://www.ina.fr/institut-national-audiovisuel/research/dataset-project

Installation

Installing dependencies:

pip install .

Evaluating Voice Activity Detection (VAD) systems

Evaluating Speaker Gender Segmentation (SGS) systems

Training a new system

Citing

inaGVAD is fully described in a paper accepted at LREC-COLING 2024, to be published on May 20th 2024. If you use this corpus in your research, please cite the following study.

@inproceedings{inagvad2024,
title={InaGVAD : a Challenging French TV and Radio Corpus annotated for Speech Activity Detection and Speaker Gender Segmentation},
author={David Doukhan and Christine Maertens and William {Le Personnic} and Ludovic Speroni and Reda Dehak},
booktitle={Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
year={2024},
}

CREDITS

Audiovisual archives were provided with the support of the French National Audiovisual Institute (INA). This work has been partially funded by the French National Research Agency (project GEM: Gender Equality Monitor, ANR-19-CE38-0012).
