
Plda Scoring #548

Open · wants to merge 65 commits into master

Conversation

@sadrasabouri (Contributor) commented May 22, 2021

This was initially raised as an issue by @gooran and is mainly inspired by a similar task in lid_kaldi.

Merging this pull request will introduce the following changes:

ADDED:

  • KaldiRecognizer::PldaScoring added to kaldi_recognizer.cc
  • plda added to spk_model.h
  • plda_config added to spk_model.h
  • plda_rxfilename added to spk_model.h
  • vad_opts added to spk_model.h
  • num_utts added to spk_model.h
  • train_ivectors added to spk_model.h
  • train_ivector_rspecifier added to spk_model.h
  • num_utts_rspecifier added to spk_model.h
  • sorted_scores method added to test_speaker.py (see the sketch after this list)
  • spk_sig vector changed to match the model dimension (dim=128)
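
As a rough illustration, a sorted_scores-style helper (a hedged Python sketch, not the actual code added to test_speaker.py) could order the new scores field like this:

def sorted_scores(result):
    # `result` is the dict parsed from KaldiRecognizer.Result().
    # Returns the PLDA scores sorted best match first.
    return sorted(result.get("scores", []), key=lambda s: s["score"], reverse=True)

# Example:
# sorted_scores({"scores": [{"speaker": "spk0", "score": -3.5},
#                           {"speaker": "spk1", "score": 8.1}]})
# -> [{"speaker": "spk1", "score": 8.1}, {"speaker": "spk0", "score": -3.5}]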

After this PR, each utterance goes through a PLDA scoring step after x-vector extraction, which scores the likelihood that the test utterance and each of the training x-vectors belong to the same speaker. This happens automatically for every utterance, and the JSON result contains a new field called scores, which looks like this:

[
...
"scores" :
    [
    {"speaker": "spk0", "score": -3.518532},
    {"speaker": "spk1", "score": 8.106313},
    ...
    ]
]

There is also a small edit to the spk field, which returns the x-vector of the latest utterance. After this PR, this field is filled by the PldaScoring method as part of its PLDA computation.
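
For context, here is a hedged sketch of how a client could consume the new scores and spk fields through the Python API (the model directories "model" and "model-spk" and the file "test.wav" are placeholder assumptions, not part of this PR):

import json
import wave

from vosk import Model, KaldiRecognizer, SpkModel

model = Model("model")             # placeholder ASR model path
spk_model = SpkModel("model-spk")  # placeholder speaker model path

wf = wave.open("test.wav", "rb")   # placeholder audio file
rec = KaldiRecognizer(model, wf.getframerate(), spk_model)

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        res = json.loads(rec.Result())
        # "scores" is the field added by this PR; "spk" holds the
        # utterance x-vector filled in by PldaScoring.
        if "scores" in res:
            best = max(res["scores"], key=lambda s: s["score"])
            print("best match:", best["speaker"], best["score"])

res = json.loads(rec.FinalResult())
if "scores" in res:
    print([(s["speaker"], s["score"]) for s in res["scores"]])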


I've tested this feature with the model files below (prepared using the following Kaldi SRE16 recipe), and everything seems normal:

#!/usr/bin/env bash
# Copyright      2017   David Snyder
#                2017   Johns Hopkins University (Author: Daniel Garcia-Romero)
#                2017   Johns Hopkins University (Author: Daniel Povey)
# Apache 2.0.
#
# See README.txt for more info on data required.
# Results (mostly EERs) are inline in comments below.
#
# This example demonstrates a "bare bones" NIST SRE 2016 recipe using xvectors.
# It is closely based on "X-vectors: Robust DNN Embeddings for Speaker
# Recognition" by Snyder et al.  In the future, we will add score-normalization
# and a more effective form of PLDA domain adaptation.
#
# Pretrained models are available for this recipe.  See
# http://kaldi-asr.org/models.html and
# https://david-ryan-snyder.github.io/2017/10/04/model_sre16_v2.html
# for details.

. ./cmd.sh
. ./path.sh
set -e
mfccdir=`pwd`/mfcc
vaddir=`pwd`/mfcc


sre16_trials=data/sre16_eval_test/trials
nnet_dir=exp/xvector_nnet_1a

stage=9

if [ $stage -le 1 ]; then
  # Make MFCCs and compute the energy-based VAD for each dataset
  for name in sre16_major sre16_eval_test sre16_eval_enroll; do
    steps/make_mfcc.sh --write-utt2num-frames true --mfcc-config conf/mfcc.conf --nj 20 --cmd "$train_cmd" \
      data/${name} exp/make_mfcc $mfccdir
    utils/fix_data_dir.sh data/${name}
    sid/compute_vad_decision.sh --nj 20 --cmd "$train_cmd" \
      data/${name} exp/make_vad $vaddir
    utils/fix_data_dir.sh data/${name}
  done
fi


if [ $stage -le 7 ]; then
  # The SRE16 major is an unlabeled dataset consisting of Cantonese and
  # Tagalog.  This is useful for things like centering, whitening and
  # score normalization.
  sid/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd --mem 6G" --nj 20 \
    $nnet_dir data/sre16_major \
    exp/xvectors_sre16_major

  # The SRE16 test data
  sid/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd --mem 6G" --nj 20 \
    $nnet_dir data/sre16_eval_test \
    exp/xvectors_sre16_eval_test

  # The SRE16 enroll data
  sid/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd --mem 6G" --nj 20 \
    $nnet_dir data/sre16_eval_enroll \
    exp/xvectors_sre16_eval_enroll
fi

if [ $stage -le 9 ]; then
  # Get results using the out-of-domain PLDA model.
  $train_cmd exp/scores/log/sre16_eval_scoring.log \
    ivector-plda-scoring --normalize-length=true \
    --num-utts=ark:exp/xvectors_sre16_eval_enroll/num_utts.ark \
    "ivector-copy-plda --smoothing=0.0 exp/xvectors_sre_combined/plda - |" \
    "ark:ivector-mean ark:data/sre16_eval_enroll/spk2utt scp:exp/xvectors_sre16_eval_enroll/xvector.scp ark:- | ivector-subtract-global-mean exp/xvectors_sre16_major/mean.vec ark:- ark:- | transform-vec exp/xvectors_sre_combined/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "ark:ivector-subtract-global-mean exp/xvectors_sre16_major/mean.vec scp:exp/xvectors_sre16_eval_test/xvector.scp ark:- | transform-vec exp/xvectors_sre_combined/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "cat '$sre16_trials' | cut -d\  --fields=1,2 |" exp/scores/sre16_eval_scores || exit 1;

  pooled_eer=$(paste $sre16_trials exp/scores/sre16_eval_scores | awk '{print $6, $3}' | compute-eer - 2>/dev/null)

  echo "Using Out-of-Domain PLDA, EER: Pooled ${pooled_eer}%"
  # EER: Pooled 11.73%, Tagalog 15.96%, Cantonese 7.52%
  # For reference, here's the ivector system from ../v1:
  # EER: Pooled 13.65%, Tagalog 17.73%, Cantonese 9.61%
fi

if [ $stage -le 10 ]; then
  $train_cmd copy_plda.log ivector-copy-plda --smoothing=0.0 exp/xvectors_sre16_major/plda_adapt exp/xvectors_sre16_major/plda_adapt.smooth0.1
  $train_cmd log.1.log ivector-mean ark:data/sre16_eval_enroll/spk2utt scp:exp/xvectors_sre16_eval_enroll/xvector.scp ark:exp/xvectors_sre16_eval_test/xvector.55.scp ark:exp/xvectors_sre16_eval_enroll/num_utts_.ark
  $train_cmd log.2.log ivector-subtract-global-mean exp/xvectors_sre16_major/mean.vec ark:exp/xvectors_sre16_eval_test/xvector.55.scp ark:exp/xvectors_sre16_eval_test/xvector.66.scp
  $train_cmd log.3.log transform-vec exp/xvectors_sre_combined/transform.mat ark:exp/xvectors_sre16_eval_test/xvector.66.scp ark:exp/xvectors_sre16_eval_test/xvector.77.scp
  $train_cmd log.4.log ivector-normalize-length ark:exp/xvectors_sre16_eval_test/xvector.77.scp  ark:exp/xvectors_sre16_eval_test/xvector.final.train.scp
  # Get results using the adapted PLDA model.
  $train_cmd exp/scores/log/sre16_eval_scoring_adapt.log \
    ivector-plda-scoring --normalize-length=true \
    --num-utts=ark:exp/xvectors_sre16_eval_enroll/num_utts.ark \
    "ivector-copy-plda --smoothing=0.0 exp/xvectors_sre16_major/plda_adapt - |" \
    "ark:ivector-mean ark:data/sre16_eval_enroll/spk2utt scp:exp/xvectors_sre16_eval_enroll/xvector.scp ark:- | ivector-subtract-global-mean exp/xvectors_sre16_major/mean.vec ark:- ark:- | transform-vec exp/xvectors_sre_combined/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "ark:ivector-subtract-global-mean exp/xvectors_sre16_major/mean.vec scp:exp/xvectors_sre16_eval_test/xvector.scp ark:- | transform-vec exp/xvectors_sre_combined/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "cat '$sre16_trials' | cut -d\  --fields=1,2 |" exp/scores/sre16_eval_scores_adapt || exit 1;

  pooled_eer=$(paste $sre16_trials exp/scores/sre16_eval_scores_adapt | awk '{print $6, $3}' | compute-eer - 2>/dev/null)
  echo "Using Adapted PLDA, EER: Pooled ${pooled_eer}%"
  # EER: Pooled 8.57%, Tagalog 12.29%, Cantonese 4.89%
  # For reference, here's the ivector system from ../v1:
  # EER: Pooled 12.98%, Tagalog 17.8%, Cantonese 8.35%
  #
  # Using the official SRE16 scoring software, we obtain the following equalized results:
  #
  # -- Pooled --
  #  EER:          8.66
  #  min_Cprimary: 0.61
  #  act_Cprimary: 0.62
  #
  # -- Cantonese --
  # EER:           4.69
  # min_Cprimary:  0.42
  # act_Cprimary:  0.43
  #
  # -- Tagalog --
  # EER:          12.63
  # min_Cprimary:  0.76
  # act_Cprimary:  0.81
fi
  1. final.ext.raw : extracted version of the sre16 model, produced with the command below:
     nnet3-copy --nnet-config=extract.config final.raw final.ext.raw
  2. mfcc.conf : MFCC config file
  3. plda_adapt.smooth0.1 : smoothed version of the PLDA model
  4. spk_xvectors.ark : trained speakers' x-vector archive file
  5. vad.conf : VAD config file
  6. mean.vec : mean vector
  7. num_utts.ark : number of utterances associated with each speaker
  8. README.md : README file
  9. transform.mat : transformation matrix
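
As a small hedged sketch (the directory name "model-spk" is a placeholder), one could verify that a speaker-model directory contains the files listed above before loading it:

import os

# Files listed above that the PLDA-based speaker model directory is expected to contain.
EXPECTED_FILES = [
    "final.ext.raw", "mfcc.conf", "plda_adapt.smooth0.1",
    "spk_xvectors.ark", "vad.conf", "mean.vec",
    "num_utts.ark", "README.md", "transform.mat",
]

def check_spk_model_dir(path="model-spk"):  # placeholder path
    missing = [f for f in EXPECTED_FILES
               if not os.path.exists(os.path.join(path, f))]
    if missing:
        print("missing model files:", ", ".join(missing))
    return not missing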

Any comments and enhancements are happily welcome.

@sadrasabouri sadrasabouri marked this pull request as ready for review June 9, 2021 19:15
@sadrasabouri sadrasabouri marked this pull request as draft June 9, 2021 19:17
@sadrasabouri sadrasabouri marked this pull request as ready for review July 12, 2021 19:00
@sadrasabouri sadrasabouri changed the title from "Plda Scoring and VAD added" to "Plda Scoring" on Jul 12, 2021

@gooran left a comment


Well done!

@@ -397,7 +397,8 @@ bool KaldiRecognizer::GetSpkVector(Vector<BaseFloat> &out_xvector, int *num_spk_
// xvector_result is filled with xvector for PldaScoring process
xvector_result = xvector;
// out_xvector will be filled by PldaScoring method from utterance
// xvector after transformation
// xvector before transformation so that it can be used for new
// users enrollment
PldaScoring(out_xvector);

Why do you get this out_xvector? Only for enrollment? I think it may be better to add a specific method for this task.

Contributor Author


out_xvector is passed by reference to this function and is filled so that it can be sent to the user as the spk field.
Yes, it can be used for enrollment; in that case it should be added to spk_xvectors.ark in ark format.
Defining a new function would introduce redundant computation of the x-vector, while PldaScoring already computes these values once.


It may be best to review the entire speaker enrollment and scoring scenario once more. Using one function for two purposes is not ideal.


@gooran left a comment


I think it is better to distinguish between the speaker recognition and speech recognition tasks in the general structure. It may be best to have two separate recognizer modules for these two tasks.
