Clarifications Needed on KVCache Compression and Matrix Operations in MLA KVCache #10

Open
hxer7963 opened this issue May 8, 2024 · 1 comment

Comments

hxer7963 commented May 8, 2024

In MLA, the KV cache compresses the hidden state $h_t$ into $c_t^{KV} = W^{DKV} h_t \in \mathbb{R}^{d_c}$, and, to circumvent the incompatibility of low-rank KV compression with RoPE, it additionally caches a decoupled key $k_t^R = \mathrm{RoPE}(W^{KR} h_t) \in \mathbb{R}^{d_h^R}$ that is concatenated onto each head's key.

However, according to equation (17), $k_{t,i} = [k_{t,i}^C; k_t^R]$, so during the attention computation the decompressed key $k_t^C = W^{UK} c_t^{KV} \in \mathbb{R}^{d_h n_h}$ is used rather than $c_t^{KV}$ itself.
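
To make the shapes concrete, here is a minimal NumPy sketch of the cache layout as I understand it (all dimensions below are made-up placeholders, not the paper's actual values):

```python
import numpy as np

# Toy placeholder dimensions (not the paper's actual values)
d = 64             # model hidden size
d_c = 8            # KV compression dimension
n_h, d_h = 4, 16   # number of heads, per-head dimension
d_h_R = 8          # decoupled RoPE key dimension

h_t = np.random.randn(d)

W_DKV = np.random.randn(d_c, d)           # down-projection
W_UK  = np.random.randn(d_h * n_h, d_c)   # up-projection for keys
W_KR  = np.random.randn(d_h_R, d)         # decoupled RoPE key projection

c_t = W_DKV @ h_t   # c_t^{KV} in R^{d_c}       -> this is what gets cached
k_R = W_KR @ h_t    # k_t^R in R^{d_h^R}        -> also cached (RoPE rotation omitted here)
k_C = W_UK @ c_t    # k_t^C in R^{d_h * n_h}    -> reconstructed at attention time

print(c_t.shape, k_R.shape, k_C.shape)  # (8,) (8,) (64,)
```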

Appendix B mentions that, by the associative law of matrix multiplication, $W^{UK}$ can be absorbed into $W^Q$: $W^Q\left[W^{UK}\left(W^{DKV} h_t\right)\right] = \left(W^Q W^{UK}\right)\left(W^{DKV} h_t\right) = \left(W^Q W^{UK}\right) c_t^{KV}$, i.e. the product $W^Q W^{UK}$ plays the role of an effective $W^{UQ}$.
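
Read literally, though, that product does not type-check with the stated shapes; the contraction that does go through is the transposed one appearing in the per-head attention score. A toy NumPy check (made-up dimensions chosen so that $d \neq d_h n_h$, single head shown for simplicity):

```python
import numpy as np

# Toy sizes chosen so that d != d_h * n_h, to make the mismatch visible
d, d_c = 48, 8
n_h, d_h = 4, 16   # d_h * n_h = 64

W_Q  = np.random.randn(d_h * n_h, d)     # (64, 48)
W_UK = np.random.randn(d_h * n_h, d_c)   # (64, 8)

try:
    W_UQ = W_Q @ W_UK   # the product in Appendix B, read literally
except ValueError as e:
    print("shape mismatch:", e)   # (64, 48) @ (64, 8) cannot be contracted

# The contraction that does type-check is the transposed one that shows up
# in the attention score q_{t,i}^T k_{t,i}^C (one head i for simplicity):
h_t, c_t = np.random.randn(d), np.random.randn(d_c)
i = 0
Wq_i, Wuk_i = W_Q[i*d_h:(i+1)*d_h], W_UK[i*d_h:(i+1)*d_h]
score = (Wq_i @ h_t) @ (Wuk_i @ c_t)             # q_{t,i}^T k_{t,i}^C
score_absorbed = h_t @ (Wq_i.T @ Wuk_i) @ c_t    # same value via associativity
assert np.isclose(score, score_absorbed)
```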

Questions:

  1. Given that $W^Q \in \mathbb{R}^{d_h n_h \times d}$ and $W^{UK} \in \mathbb{R}^{d_h n_h \times d_c}$, how can these two matrices be multiplied to derive $W^{UQ}$? Their inner dimensions do not match.
  2. How are the values of the matrices $W^{DKV}$, $W^{UK}$, and $W^{KR}$ obtained? Appendix B seems to suggest they are computed offline once rather than being learned during training as low-rank factors.

Any insights or detailed explanations regarding these points would be highly appreciated.

luofuli (Member) commented May 14, 2024

Here's a recommended blog for you: https://spaces.ac.cn/archives/10091 @hxer7963
