token dropping and MoD #5

This token dropping method, as indicated by the citation, is based on the V-MoE method. How does it differ from the recent MoD? They look like very similar techniques.
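For reference, a minimal sketch of MoD-style routing, assuming a single transformer block; `router`, `block`, and `capacity_ratio` are illustrative names, not from either paper:

```python
import torch

def mod_forward(block, router, x, capacity_ratio=0.125):
    """Mixture-of-Depths-style routing sketch: a router scores every token,
    only the top-k tokens per sequence pass through the block, and the
    rest skip it entirely via the residual path."""
    # x: [batch, seq, dim]; router maps dim -> 1 score per token
    scores = router(x).squeeze(-1)                  # [batch, seq]
    k = max(1, int(capacity_ratio * x.shape[1]))
    top = torch.topk(scores, k, dim=1).indices      # tokens that enter the block
    out = x.clone()
    for b in range(x.shape[0]):
        sel = top[b]
        out[b, sel] = block(x[b, sel].unsqueeze(0)).squeeze(0)
    return out
```

The distinction discussed below is that MoD's router decides whether a token passes through a layer at all, while MoE token dropping decides which overflow tokens an expert (or device) refuses once its capacity is full.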
Comments
Our token-dropping strategy is simply token-wise dropping with respect to the routing probability. It is more like the token dropping in conventional MoE models such as Switch Transformer, and it is totally different from MoD, so I do not understand your question. Can you give me more information about your understanding of our token-dropping strategy and of MoD? Maybe we can find out what was misunderstood.
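To make "token-wise dropping w.r.t. the routing probability" concrete, here is a minimal sketch of a Switch-Transformer-style capacity limit for a single expert, keeping tokens by routing probability as described above; the function name and `capacity` are illustrative:

```python
import torch

def keep_mask_for_expert(gate_probs, capacity):
    """gate_probs: [num_tokens] routing probabilities of the tokens sent to
    one expert. If more tokens arrive than the expert's capacity, keep the
    highest-probability tokens and drop the rest (dropped tokens bypass
    the expert through the residual connection)."""
    n = gate_probs.numel()
    if n <= capacity:
        return torch.ones(n, dtype=torch.bool)
    keep = torch.topk(gate_probs, capacity).indices
    mask = torch.zeros(n, dtype=torch.bool)
    mask[keep] = True
    return mask
```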
@DeepSeekDDM @luofuli
Adding another question: how should I understand the statement from the paper that "we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped"? Is there a specific strategy implemented during token dropping to enforce this?
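The thread does not spell out how this 10% guarantee is enforced; one plausible reading is a per-sequence exemption mask that overrides capacity-based dropping. A sketch under that assumption (`seq_ids` and the modulo rule are hypothetical):

```python
import torch

def final_keep_mask(capacity_keep, seq_ids, exempt_fraction=0.10):
    """capacity_keep: [num_tokens] bool mask from capacity-based dropping.
    seq_ids: [num_tokens] id of the training sequence each token belongs to.
    Tokens from an exempt sequence are kept regardless of capacity, so
    roughly `exempt_fraction` of sequences never lose tokens.
    (The paper does not specify the actual mechanism; this is only an
    illustration.)"""
    period = round(1.0 / exempt_fraction)
    exempt = (seq_ids % period) == 0  # deterministic ~10% of sequences
    return capacity_keep | exempt
```

In a real system, the exempted tokens would presumably have to be budgeted into the capacity so that the guarantee cannot overflow a device.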
A to Q1: Mainly on the device dimension.
@DeepSeekDDM Just to confirm: DeepSeek-V2 implements token dropping along the device dimension?
Yes. The actual dropping strategy is a little complex, but the main idea is what you just described.
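A minimal sketch of what dropping "on the device dimension" could look like under expert parallelism, where each device hosts several experts and the budget is enforced per device rather than per expert; all names are illustrative, and per the reply above the real implementation adds further efficiency tricks:

```python
import torch

def device_level_drop(probs, device_ids, num_devices, capacity_factor=1.0):
    """probs: [num_tokens] routing probability of each token for its chosen
    expert; device_ids: [num_tokens] the device hosting that expert.
    Capacity is budgeted per device: each device keeps at most
    capacity_factor * num_tokens / num_devices tokens, preferring the
    highest routing probabilities, and drops the overflow."""
    budget = int(capacity_factor * probs.numel() / num_devices)
    keep = torch.zeros_like(probs, dtype=torch.bool)
    for d in range(num_devices):
        idx = (device_ids == d).nonzero(as_tuple=True)[0]
        if idx.numel() <= budget:
            keep[idx] = True
        else:
            top = torch.topk(probs[idx], budget).indices
            keep[idx[top]] = True
    return keep
```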
@DeepSeekDDM Could you briefly explain the "actual dropping strategy"? I am curious.
Just some additional tricks to ensure computational efficiency. It is not the key technique of DeepSeekMoE, and the details will not prevent you from reproducing DeepSeekMoE.