Clarifications Needed on KVCache Compression and Matrix Operations in MLA KVCache #10

Open
hxer7963 opened this issue May 8, 2024 · 1 comment

Comments

hxer7963 commented May 8, 2024

In MLA, the KV cache compresses the hidden state $h_t$ into $c_t^{KV} = W^{DKV} h_t \in \mathbb{R}^{d_c}$, and, to circumvent the incompatibility of low-rank KV compression with RoPE, it additionally caches a decoupled key $k_t^R = \mathrm{RoPE}(W^{KR} h_t) \in \mathbb{R}^{d_h^R}$ that is concatenated onto each head's key.

However, according to equation (17), $k_{t,i} = [k_{t,i}^C; k_t^R]$, so during the attention computation the decompressed key $k_t^C = W^{UK} c_t^{KV} \in \mathbb{R}^{d_h n_h}$ is used rather than $c_t^{KV}$ itself.
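
To make the shapes concrete, here is a minimal NumPy sketch of the cache layout as I understand it (all dimensions below are made-up placeholders, not the paper's actual values):

```python
import numpy as np

# Toy placeholder dimensions (not the paper's actual values)
d = 64             # model hidden size
d_c = 8            # KV compression dimension
n_h, d_h = 4, 16   # number of heads, per-head dimension
d_h_R = 8          # decoupled RoPE key dimension

h_t = np.random.randn(d)

W_DKV = np.random.randn(d_c, d)           # down-projection
W_UK  = np.random.randn(d_h * n_h, d_c)   # up-projection for keys
W_KR  = np.random.randn(d_h_R, d)         # decoupled RoPE key projection

c_t = W_DKV @ h_t   # c_t^{KV} in R^{d_c}       -> this is what gets cached
k_R = W_KR @ h_t    # k_t^R in R^{d_h^R}        -> also cached (RoPE rotation omitted here)
k_C = W_UK @ c_t    # k_t^C in R^{d_h * n_h}    -> reconstructed at attention time

print(c_t.shape, k_R.shape, k_C.shape)  # (8,) (8,) (64,)
```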

Appendix B mentions that, by the associative law of matrix multiplication, $W^{UK}$ can be absorbed into $W^Q$: $W^Q\left[W^{UK}\left(W^{DKV} h_t\right)\right] = \left(W^Q W^{UK}\right)\left(W^{DKV} h_t\right) = \left(W^Q W^{UK}\right) c_t^{KV}$, i.e. the product $W^Q W^{UK}$ plays the role of an effective $W^{UQ}$.
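
Read literally, though, that product does not type-check with the stated shapes; the contraction that does go through is the transposed one appearing in the per-head attention score. A toy NumPy check (made-up dimensions chosen so that $d \neq d_h n_h$, single head shown for simplicity):

```python
import numpy as np

# Toy sizes chosen so that d != d_h * n_h, to make the mismatch visible
d, d_c = 48, 8
n_h, d_h = 4, 16   # d_h * n_h = 64

W_Q  = np.random.randn(d_h * n_h, d)     # (64, 48)
W_UK = np.random.randn(d_h * n_h, d_c)   # (64, 8)

try:
    W_UQ = W_Q @ W_UK   # the product in Appendix B, read literally
except ValueError as e:
    print("shape mismatch:", e)   # (64, 48) @ (64, 8) cannot be contracted

# The contraction that does type-check is the transposed one that shows up
# in the attention score q_{t,i}^T k_{t,i}^C (one head i for simplicity):
h_t, c_t = np.random.randn(d), np.random.randn(d_c)
i = 0
Wq_i, Wuk_i = W_Q[i*d_h:(i+1)*d_h], W_UK[i*d_h:(i+1)*d_h]
score = (Wq_i @ h_t) @ (Wuk_i @ c_t)             # q_{t,i}^T k_{t,i}^C
score_absorbed = h_t @ (Wq_i.T @ Wuk_i) @ c_t    # same value via associativity
assert np.isclose(score, score_absorbed)
```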

Questions:

  1. Given that $W^Q \in \mathbb{R}^{d_h n_h \times d}$ and $W^{UK} \in \mathbb{R}^{d_h n_h \times d_c}$, how can these two matrices be multiplied to derive $W^{UQ}$? Their inner dimensions do not match.
  2. How are the values of the matrices $W^{DKV}$, $W^{UK}$, and $W^{KR}$ obtained? Appendix B seems to suggest they are computed offline once rather than being learned during training as low-rank factors.

Any insights or detailed explanations regarding these points would be highly appreciated.

luofuli (Member) commented May 14, 2024

Here's a recommended blog for you: https://spaces.ac.cn/archives/10091 @hxer7963
