
Spanish Gigaword text-based POCOLM and RNNLM training recipe #3136

Open: wants to merge 58 commits into master
Conversation

@saikiranvalluri (Contributor) commented Mar 18, 2019

We introduce the following features into the existing fisher_spanish recipe:

  • Optional text-processing scripts for the Spanish Gigaword text corpus.
  • Training of a 3-gram pocolm language model on the Fisher train and Gigaword texts.
  • Derivation of the pocolm wordlist from the relative frequency of words in each corpus, weighted by the metaparameter weight of each text corpus.
  • OOVs from the pocolm wordlist are added to the ASR lexicon and the RNNLM wordlist using a Transformer-based seq2seq G2P model: https://github.com/cmusphinx/g2p-seq2seq
  • Optional RNNLM training for 5 epochs, using the two text corpora as training sets and the Fisher dev2 partition as the dev set.
  • At the test-set decoding stage, after the chain model is trained, the decoding graph is built from the extended ASR lexicon above, and the lattices are rescored with the trained Gigaword RNNLM (see the sketch after this list).
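
For concreteness, here is a minimal bash sketch of the pipeline those stages implement. Every path, directory name, and option value below is a placeholder rather than the recipe's actual invocation, and the g2p-seq2seq flags may differ between versions:

```bash
#!/usr/bin/env bash
# Sketch only: all paths and option values are hypothetical; check the
# recipe scripts for the exact invocations.

# 1) Train a 3-gram pocolm on the pooled Fisher + Gigaword text;
#    pocolm estimates per-corpus metaparameter weights on the dev data.
train_lm.py --num-words=100000 --num-splits=10 \
  data/pocolm/text 3 data/pocolm/work data/pocolm/lm

# 2) Collect pocolm wordlist entries missing from the ASR lexicon and
#    generate pronunciations for them with g2p-seq2seq.
awk 'NR==FNR {seen[$1]=1; next} !seen[$1]' \
  data/local/dict/lexicon.txt data/pocolm/wordlist > data/local/oovs.txt
g2p-seq2seq --decode data/local/oovs.txt --model_dir exp/g2p \
  --output data/local/oov_lexicon.txt

# 3) Train the Kaldi RNNLM for 5 epochs (the directory is assumed to
#    have been prepared with the combined training text, dev2 as dev set).
rnnlm/train_rnnlm.sh --num-epochs 5 exp/rnnlm_gigaword

# 4) Rescore the chain model's first-pass lattices with the RNNLM.
rnnlm/lmrescore_pruned.sh data/lang_test exp/rnnlm_gigaword \
  data/test exp/chain/tdnn/decode_test exp/chain/tdnn/decode_test_rnnlm
```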

With rescoring by the RNNLM trained on the Gigaword text, applied on top of lattices decoded with the baseline 3-gram LM, we achieve 20.84% WER on the Fisher Spanish test partition and 24.67% WER on the dev partition.

@danpovey (Contributor) commented:

Is there a reason it doesn't make sense to just replace the current example with this?
I doubt too many people were using the old example.

Are you using a graphemic or phonemic lexicon? A graphemic lexicon might be a reasonable choice in Spanish, for simplification.

@saikiranvalluri (Contributor, Author) commented Mar 24, 2019

> Is there a reason it doesn't make sense to just replace the current example with this?
> I doubt too many people were using the old example.

I included the end-to-end process, from processing the downloaded Spanish Gigaword corpus through training the RNNLM on that data, in stages 0-1 of run.sh. Also, adding the Spanish Gigaword text to the RNNLM training data yields more than 0.4% absolute WER improvement on the test partitions.
The Gigaword-based RNNLM may prove even more valuable for WER in the extended-lexicon scenario and on more general test sets.
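
As a rough illustration of invoking those optional stages (only the stage numbering comes from the comment above; the --gigaword-dir option name is hypothetical):

```bash
# Hypothetical invocation: run the optional Gigaword text processing
# (stage 0) and RNNLM training (stage 1) before the rest of the recipe.
./run.sh --stage 0 --gigaword-dir /path/to/spanish_gigaword
```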

> Are you using a graphemic or phonemic lexicon? A graphemic lexicon might be a reasonable choice in Spanish, for simplification.

I am using the same rule-based Callhome Spanish lexicon, simplified to 36 phones by removing accented letters and digits from the non-silence phone list, so it is similar to a graphemic lexicon.
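
For illustration, an accent-folding step of the kind described might look like this sketch (the file names are hypothetical, and the recipe's actual rule-based cleanup may differ, e.g. in how it treats ñ):

```bash
# Sketch only: fold accented vowels (and ü) to their base graphemes and
# drop digits, keeping the non-silence phone inventory small.
perl -CSD -Mutf8 -pe 'tr/áéíóúüÁÉÍÓÚÜ/aeiouuAEIOUU/; s/[0-9]//g' \
  lexicon_raw.txt > lexicon_simplified.txt
```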

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Jun 19, 2020
stale bot commented Jul 19, 2020

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.

stale bot closed this on Jul 19, 2020
@kkm000 reopened this on Jul 19, 2020
stale bot removed the stale label on Jul 19, 2020
stale bot commented Sep 17, 2020

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale bot added the stale label on Sep 17, 2020
@kkm000 (Contributor) commented Sep 22, 2021

@saikiranvalluri, where are we on this?

stale bot removed the stale label on Sep 22, 2021
@kkm000 self-assigned this on Sep 22, 2021
@kkm000 added the waiting-for-feedback label on Sep 22, 2021
stale bot commented Nov 22, 2021

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale bot added the stale label on Nov 22, 2021