
[scripts] Add layer for attention with bypass #3694

Draft
wants to merge 2 commits into base: master
Conversation

danpovey (Contributor)

No description provided.

@danpovey (Contributor, Author)

Note, this is just a draft, pending experiments.

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale (Stale bot on the loose) label Jun 19, 2020
kkm000 added the in progress (Issue has been taken and is being worked on) and stale-exclude (Stale bot ignore this issue) labels Jul 15, 2020
stale bot removed the stale (Stale bot on the loose) label Jul 15, 2020
kkm000 marked this pull request as draft Jul 15, 2020 09:58
@kkm000 (Contributor) commented Aug 29, 2021

@danpovey, I don't want to lose this; I'd take it over. What did you mean by "not working well": WER, convergence, or something else? If you can remember, of course... :)


tdnn_opts="l2-regularize=0.02 dropout-proportion=0.0 dropout-per-dim-continuous=true"
tdnnf_opts="l2-regularize=0.02 dropout-proportion=0.0 bypass-scale=0.8"
attention_opts="l2-regularize=0.02 num-heads=2 num-left-inputs=5 num-left-inputs-required=1 num-right-inputs=2 num-right-inputs-required=1 dropout-proportion=0.0 bypass-scale=0.8"

@danpovey (Contributor, Author)

These num-left-inputs and num-right-inputs values were likely way too small; something much larger, like 20 to 40, might be closer to optimal. I don't know how efficient that would be... hopefully OK; I don't remember much of the internal implementation.
It might also make sense to combine this with residual layers that have no acoustic context, i.e. frame-by-frame residual layers, either instead of the tdnn-f layers or perhaps in addition to them. I think this could likely be accomplished simply by using time-stride=0 in the tdnnf layers (a sketch follows below).
You can share the log files with me if you want, especially the progress log and/or the detailed progress log printed every 10 epochs, where it invokes nnet3-info. Now that I have worked with attention setups in PyTorch, I may have better intuitions. I see now that my intuitions at the time were likely wrong: I expected most of the benefit, and most of the attention, to come from very limited/immediate context, which isn't true; it's much more spread out. I also expected the attention map to be much "peakier"; in reality it tends to be quite spread out.
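For concreteness, a minimal, untested sketch of those two suggestions, reusing the opts strings quoted above; the 30/30 context, the layer name, and the dims are illustrative placeholders, not values from this PR:

attention_opts="l2-regularize=0.02 num-heads=2 num-left-inputs=30 num-left-inputs-required=1 num-right-inputs=30 num-right-inputs-required=1 dropout-proportion=0.0 bypass-scale=0.8"
# a frame-by-frame residual layer: a plain tdnnf-layer with no acoustic context
tdnnf-layer name=tdnnf-framewise $tdnnf_opts dim=1536 bottleneck-dim=160 time-stride=0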

@danpovey (Contributor, Author)

Incidentally, there is another reason, I now realize, why our attempts to use attention in Kaldi were not generally that successful. The Kaldi recipes are in an optimization regime where we optimize very fast, in relative terms, and this is enabled by aggressive l2 regularization. This aggressive l2 works because the models' structure is carefully designed not to have problems in this regime; for example, that's why the tdnnf-layer has one of its projections constrained to be orthogonal (otherwise we could "lose" certain subspaces in the bottleneck dim of the tdnnf-layer; they would decay to zero).
[The l2 and learning rate are related: you can actually figure out an "effective" learning rate, applicable only to layers followed by batchnorm, from an equation involving the l2 and the learning rate; I think it's the product of the two, or something like that, that matters.]
The problem with attention layers is that the key and query matrices are effectively multiplied together before any nonlinearity, i.e. if there is some direction in key/query space where both are close to zero, the derivatives get close to zero, get overwhelmed by the l2 term, and the corresponding parameters can disappear. So we need to be careful with l2 in this case. In Icefall I am actually working with a modified version of l2 that solves this problem, in a spirit similar to the natural gradient implementation in Kaldi, but it would be quite tiresome to implement in Kaldi because the optimizer isn't so easily separable from the layers.
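To spell that argument out, as a rough sketch rather than anything from the original comment: for a single query vector $q$ and key vector $k$ the attention logit is $s = q^\top k$, so for any unit direction $u$ in that space

$$\frac{\partial s}{\partial (u^\top q)} = u^\top k, \qquad \frac{\partial s}{\partial (u^\top k)} = u^\top q.$$

When both projections are near zero, the data gradient along $u$ is near zero for both, while the l2 update keeps subtracting terms proportional to $\lambda\, u^\top q$ and $\lambda\, u^\top k$, so that subspace decays toward zero with nothing to pull it back.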
Anyway, it's possible that what might work is to have very little l2, e.g. 1.0e-06 (which might be effectively equivalent to zero), except for the output layer, where we could have, say, half the l2, at 0.0005, and maybe double the final-effective-lrate, because otherwise the parameters will get larger and larger as the model is trained, meaning the effective changes in the parameters get smaller and smaller.
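A sketch of how that suggestion might translate back into the opts strings quoted above; the l2 values are the ones suggested here, everything else is copied from the excerpt, and the output_opts variable and the final-effective-lrate flag are the usual chain-recipe conventions, assumed here rather than taken from this PR:

tdnn_opts="l2-regularize=1.0e-06 dropout-proportion=0.0 dropout-per-dim-continuous=true"
tdnnf_opts="l2-regularize=1.0e-06 dropout-proportion=0.0 bypass-scale=0.8"
attention_opts="l2-regularize=1.0e-06 num-heads=2 num-left-inputs=5 num-left-inputs-required=1 num-right-inputs=2 num-right-inputs-required=1 dropout-proportion=0.0 bypass-scale=0.8"
output_opts="l2-regularize=0.0005"
# and roughly double the final learning rate passed to steps/nnet3/chain/train.py, e.g.
#   --trainer.optimization.final-effective-lrate <about 2x whatever the baseline uses>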

@kkm000 (Contributor)

Yes, I vaguely know where the LR vs. L2 equation is. I remember tweaking it; I do not remember why, though :( I think it was because I used a high dropout proportion, even above the theoretically best 0.5, because it would have been boring otherwise.

Do you suggest reducing L2 on all layers below batchnorm, or only the attention layers?

Maybe dropout would be more effective? At the least, it may hide parameters from the L2 attack, so they survive longer on average.

And I totally missed nearly all the new development; I had to look up what Icefall was, and that's wrong too. It looks like this endless project I got stuck in neck-deep is finally coming to a close.

I never paid attention to the implementation of L2 for natural gradient descent in Kaldi (I should!), but the Fisher manifold is generally non-flat and asymmetric. Interestingly, I was wondering just a couple of days ago whether the Mahalanobis distance induces the Fisher metric, but got lost trying to derive it. (I finally got a bargain copy of Misner, Thorne, and Wheeler's Gravitation, and it incites weirdly unrelated thoughts in me :) )

Come to think of it, I do not even understand anymore why regularization uses a norm: anything that is convex everywhere, preferably without large flat hyperplanes or too many sharp hypercorners (for lack of a better term), should do, given that the lambda is small. Hyperballs are nice and round and easily differentiable, but that's about it. In a high-dimensional space anything has a lot of unexpected symmetries anyway.

I'm setting up a new cloud cluster, basically because the current one has software that is too old and Slurm problems that were at least partly addressed in newer versions. I'll need to run something on it, and I do not want to just repeat the old stuff; it could be a good opportunity to explore something new.
