
V-MoE token dropping and MoD #5

Open
liyucheng09 opened this issue May 7, 2024 · 8 comments

Comments

@liyucheng09

This token dropping method, as indicated by the citation, is based on the V-MoE method.

How is this different from the recent MoD? They look like very similar techniques.

@DeepSeekDDM
Collaborator

Our token-dropping strategy is just token-wise dropping w.r.t. the routing probability. It is more like the token dropping in conventional MoE models like Switch Transformer, and is totally different from MoD, so I do not understand your question. Can you give me more information about your understanding of our token-dropping strategy and MoD? Maybe we can find out where the misunderstanding is.
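For reference, a minimal sketch of this kind of Switch-Transformer-style token-wise dropping (not DeepSeek's actual code; the function and variable names are illustrative): each expert keeps at most `capacity` tokens and drops the assignments with the lowest routing probability.

```python
import math
import torch

def drop_tokens_per_expert(router_probs, expert_ids, num_experts, capacity_factor=1.0):
    """router_probs: (num_tokens,) top-1 routing probability of each token.
    expert_ids:   (num_tokens,) chosen expert index of each token.
    Returns a boolean mask over tokens: True = kept, False = dropped."""
    num_tokens = router_probs.shape[0]
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    keep = torch.zeros(num_tokens, dtype=torch.bool)
    for e in range(num_experts):
        idx = (expert_ids == e).nonzero(as_tuple=True)[0]
        if idx.numel() > capacity:
            # keep only the `capacity` highest-probability tokens for this expert
            idx = idx[router_probs[idx].topk(capacity).indices]
        keep[idx] = True
    return keep
```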

@Richie-yan

Richie-yan commented May 29, 2024

@DeepSeekDDM @luofuli
Is the capacity in the token-dropping strategy based on the expert dimension or the device dimension?
If it's on the expert dimension, then the capacity is calculated as capacity = math.ceil(num_tokens * topk / num_experts * capacity_factor), and each expert processes its own tokens, dropping the lowest-scored tokens if the token count exceeds the capacity and padding if it falls short.
If it's on the device dimension, is the capacity calculated as capacity = math.ceil(num_tokens * topk / num_groups * capacity_factor)? How is token dropping executed in that case?
Because the paper mentions device-level token dropping, I have the above confusion.

@Richie-yan

Richie-yan commented May 29, 2024

Adding another question: How should I understand the statement from the paper that "we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped"? Is there a specific strategy implemented during token dropping to enforce this?
@DeepSeekDDM @luofuli

@DeepSeekDDM
Collaborator

@DeepSeekDDM @luofuli Is the capacity in the token-dropping strategy based on the expert dimension or the device dimension? If it's on the expert dimension, then the capacity is calculated as capacity = math.ceil(num_tokens * topk / num_experts * capacity_factor), and each expert processes its own tokens, dropping the lowest-scored tokens if the token count exceeds the capacity and padding if it falls short. If it's on the device dimension, is the capacity calculated as capacity = math.ceil(num_tokens * topk / num_groups * capacity_factor)? How is token dropping executed in that case? Because the paper mentions device-level token dropping, I have the above confusion.

A to Q1: Mainly on the device dimension.
A to Q2: Yes.
A to Q3: Also drop tokens with the lowest prob.
A to Q4 & Q5: Yes, we implement a specific strategy to ensure this.
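Putting these answers together, here is a hedged sketch of what device-level dropping might look like; this is an assumption-based illustration, not DeepSeek's released implementation. `num_groups` is the assumed number of device groups, and the `protected` mask (realized here by giving those assignments infinite priority) is one hypothetical way to guarantee that tokens from ~10% of training sequences are never dropped; the thread does not reveal the actual mechanism.

```python
import math
import torch

def drop_tokens_per_device(probs, device_ids, num_tokens, topk,
                           num_groups, capacity_factor=1.0, protected=None):
    """probs:      (num_assignments,) routing prob of each token->expert assignment.
    device_ids: (num_assignments,) device group hosting the chosen expert.
    protected:  optional bool mask of assignments that must never be dropped.
    Returns a boolean mask over assignments: True = kept, False = dropped."""
    capacity = math.ceil(num_tokens * topk / num_groups * capacity_factor)
    # protected assignments (e.g. from ~10% of training sequences) sort first
    priority = probs.clone()
    if protected is not None:
        priority[protected] = float("inf")
    keep = torch.zeros_like(probs, dtype=torch.bool)
    for g in range(num_groups):
        idx = (device_ids == g).nonzero(as_tuple=True)[0]
        if idx.numel() > capacity:
            # one unified sort per device: keep the `capacity` highest priorities
            idx = idx[priority[idx].topk(capacity).indices]
        keep[idx] = True
    return keep
```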

@Richie-yan

@DeepSeekDDM To confirm: DeepSeek-V2 implements token dropping at the device level.
For device-level dropping, do you sort the scores of all experts on the current device together and then drop?

@DeepSeekDDM
Collaborator

@DeepSeekDDM To confirm: DeepSeek-V2 implements token dropping at the device level. For device-level dropping, do you sort the scores of all experts on the current device together and then drop?

Yes. The actual dropping strategy is a little complex, but the main idea is what you described just now.

@Richie-yan

@DeepSeekDDM Could you briefly explain the actual dropping strategy? I'm quite curious.

@DeepSeekDDM
Collaborator

@DeepSeekDDM Could you briefly explain the actual dropping strategy? I'm quite curious.

Just some additional tricks to ensure computational efficiency. They are not the key technique of DeepSeekMoE, and not knowing the details will not prevent you from reproducing DeepSeekMoE.
