
Spanish Gigaword text-based POCOLM and RNNLM training recipe #3136

Open: wants to merge 58 commits into master
Conversation

@saikiranvalluri (Contributor) commented Mar 18, 2019

We introduce the following features into the existing fisher_spanish recipe:

  • Optional text-processing scripts for the Spanish Gigaword text corpus.
  • Training of a 3-gram pocolm language model on the Fisher train and Gigaword texts.
  • Derivation of the pocolm wordlist from the relative frequency of words in each corpus, weighted by the metaparameter weight of each text corpus.
  • OOVs from the pocolm wordlist are added to the ASR lexicon and the RNNLM wordlist using a Transformer-based seq2seq G2P model: https://github.com/cmusphinx/g2p-seq2seq
  • Optional RNNLM training for 5 epochs, using the two text corpora as training sets and the Fisher dev2 partition as the dev set.
  • At the test-set decoding stage, after the chain model is trained, the decoding graph is built from the extended ASR lexicon above, and the lattices are rescored with the trained Gigaword RNNLM (see the sketch after this list).
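
For concreteness, here is a minimal bash sketch of the pipeline those stages implement. Every path, directory name, and option value below is a placeholder rather than the recipe's actual invocation, and the g2p-seq2seq flags may differ between versions:

```bash
#!/usr/bin/env bash
# Sketch only: all paths and option values are hypothetical; check the
# recipe scripts for the exact invocations.

# 1) Train a 3-gram pocolm on the pooled Fisher + Gigaword text;
#    pocolm estimates per-corpus metaparameter weights on the dev data.
train_lm.py --num-words=100000 --num-splits=10 \
  data/pocolm/text 3 data/pocolm/work data/pocolm/lm

# 2) Collect pocolm wordlist entries missing from the ASR lexicon and
#    generate pronunciations for them with g2p-seq2seq.
awk 'NR==FNR {seen[$1]=1; next} !seen[$1]' \
  data/local/dict/lexicon.txt data/pocolm/wordlist > data/local/oovs.txt
g2p-seq2seq --decode data/local/oovs.txt --model_dir exp/g2p \
  --output data/local/oov_lexicon.txt

# 3) Train the Kaldi RNNLM for 5 epochs (the directory is assumed to
#    have been prepared with the combined training text, dev2 as dev set).
rnnlm/train_rnnlm.sh --num-epochs 5 exp/rnnlm_gigaword

# 4) Rescore the chain model's first-pass lattices with the RNNLM.
rnnlm/lmrescore_pruned.sh data/lang_test exp/rnnlm_gigaword \
  data/test exp/chain/tdnn/decode_test exp/chain/tdnn/decode_test_rnnlm
```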

With rescoring by the RNNLM trained on the Gigaword text, applied on top of lattices decoded with the baseline 3-gram LM, we achieve 20.84% WER on the Fisher Spanish test partition and 24.67% WER on the dev partition.

@danpovey (Contributor) commented:

Is there a reason it doesn't make sense to just replace the current example with this?
I doubt too many people were using the old example.

Are you using a graphemic or phonemic lexicon? A graphemic lexicon might be a reasonable choice in Spanish, for simplification.

@saikiranvalluri (Contributor, Author) commented Mar 24, 2019

> Is there a reason it doesn't make sense to just replace the current example with this?
> I doubt too many people were using the old example.

I included the end-to-end process, from processing the downloaded Spanish Gigaword corpus through training the RNNLM on that data, in stages 0-1 of run.sh. Also, adding the Spanish Gigaword text to the RNNLM training data yields more than 0.4% absolute WER improvement on the test partitions.
The Gigaword-based RNNLM may prove even more valuable for WER in the extended-lexicon scenario and on more general test sets.
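
As a rough illustration of invoking those optional stages (only the stage numbering comes from the comment above; the --gigaword-dir option name is hypothetical):

```bash
# Hypothetical invocation: run the optional Gigaword text processing
# (stage 0) and RNNLM training (stage 1) before the rest of the recipe.
./run.sh --stage 0 --gigaword-dir /path/to/spanish_gigaword
```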

> Are you using a graphemic or phonemic lexicon? A graphemic lexicon might be a reasonable choice in Spanish, for simplification.

I am using the same rule-based Callhome Spanish lexicon, simplified to 36 phones by removing accented letters and digits from the non-silence phone list, so it is similar to a graphemic lexicon.
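
For illustration, an accent-folding step of the kind described might look like this sketch (the file names are hypothetical, and the recipe's actual rule-based cleanup may differ, e.g. in how it treats ñ):

```bash
# Sketch only: fold accented vowels (and ü) to their base graphemes and
# drop digits, keeping the non-silence phone inventory small.
perl -CSD -Mutf8 -pe 'tr/áéíóúüÁÉÍÓÚÜ/aeiouuAEIOUU/; s/[0-9]//g' \
  lexicon_raw.txt > lexicon_simplified.txt
```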

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Jun 19, 2020
stale bot commented Jul 19, 2020

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.

stale bot closed this on Jul 19, 2020
@kkm000 reopened this on Jul 19, 2020
stale bot removed the stale label on Jul 19, 2020
stale bot commented Sep 17, 2020

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale bot added the stale label on Sep 17, 2020
@kkm000 (Contributor) commented Sep 22, 2021

@saikiranvalluri, where are we on this?

stale bot removed the stale label on Sep 22, 2021
@kkm000 self-assigned this on Sep 22, 2021
@kkm000 added the waiting-for-feedback label on Sep 22, 2021
stale bot commented Nov 22, 2021

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale bot added the stale label on Nov 22, 2021