
V-MoE token dropping and MoD #5

Open
liyucheng09 opened this issue May 7, 2024 · 8 comments

Comments

@liyucheng09

This token dropping method, as indicated by the citation, is based on the V-MoE method.

How is this different from the recent MoD? They look like very similar techniques.

@DeepSeekDDM
Collaborator

Our token-dropping strategy is just token-wise dropping w.r.t. the routing probability. It is more like the token dropping in conventional MoE models like Switch Transformer, and is totally different from MoD, so I do not understand your question. Can you give me more information about your understanding of our token-dropping strategy and MoD? Maybe we can find out where the misunderstanding is.
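For reference, a minimal sketch of this kind of Switch-Transformer-style token-wise dropping (not DeepSeek's actual code; the function and variable names are illustrative): each expert keeps at most `capacity` tokens and drops the assignments with the lowest routing probability.

```python
import math
import torch

def drop_tokens_per_expert(router_probs, expert_ids, num_experts, capacity_factor=1.0):
    """router_probs: (num_tokens,) top-1 routing probability of each token.
    expert_ids:   (num_tokens,) chosen expert index of each token.
    Returns a boolean mask over tokens: True = kept, False = dropped."""
    num_tokens = router_probs.shape[0]
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    keep = torch.zeros(num_tokens, dtype=torch.bool)
    for e in range(num_experts):
        idx = (expert_ids == e).nonzero(as_tuple=True)[0]
        if idx.numel() > capacity:
            # keep only the `capacity` highest-probability tokens for this expert
            idx = idx[router_probs[idx].topk(capacity).indices]
        keep[idx] = True
    return keep
```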

@Richie-yan

Richie-yan commented May 29, 2024

@DeepSeekDDM @luofuli
Is the capacity in the token-dropping strategy based on the expert dimension or the device dimension?
If it's on the expert dimension, then the capacity is calculated as capacity = math.ceil(num_tokens * topk / num_experts * capacity_factor), and each expert processes its own tokens, dropping the lowest-scored tokens if the token count exceeds the capacity and padding if it falls short.
If it's on the device dimension, is the capacity calculated as capacity = math.ceil(num_tokens * topk / num_groups * capacity_factor)? How is token dropping executed in that case?
Because the paper mentions device-level token dropping, I have the above confusion.

@Richie-yan

Richie-yan commented May 29, 2024

Adding another question: How should I understand the statement from the paper that "we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped"? Is there a specific strategy implemented during token dropping to enforce this?
@DeepSeekDDM @luofuli

@DeepSeekDDM
Collaborator

@DeepSeekDDM @luofuli Is the capacity in the token-dropping strategy based on the expert dimension or the device dimension? If it's on the expert dimension, then the capacity is calculated as capacity = math.ceil(num_tokens * topk / num_experts * capacity_factor), and each expert processes its own tokens, dropping the lowest-scored tokens if the token count exceeds the capacity and padding if it falls short. If it's on the device dimension, is the capacity calculated as capacity = math.ceil(num_tokens * topk / num_groups * capacity_factor)? How is token dropping executed in that case? Because the paper mentions device-level token dropping, I have the above confusion.

A to Q1: Mainly on the device dimension.
A to Q2: Yes.
A to Q3: Also drop tokens with the lowest prob.
A to Q4 & Q5: Yes, we implement a specific strategy to ensure this.
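Putting these answers together, here is a hedged sketch of what device-level dropping might look like; this is an assumption-based illustration, not DeepSeek's released implementation. `num_groups` is the assumed number of device groups, and the `protected` mask (realized here by giving those assignments infinite priority) is one hypothetical way to guarantee that tokens from ~10% of training sequences are never dropped; the thread does not reveal the actual mechanism.

```python
import math
import torch

def drop_tokens_per_device(probs, device_ids, num_tokens, topk,
                           num_groups, capacity_factor=1.0, protected=None):
    """probs:      (num_assignments,) routing prob of each token->expert assignment.
    device_ids: (num_assignments,) device group hosting the chosen expert.
    protected:  optional bool mask of assignments that must never be dropped.
    Returns a boolean mask over assignments: True = kept, False = dropped."""
    capacity = math.ceil(num_tokens * topk / num_groups * capacity_factor)
    # protected assignments (e.g. from ~10% of training sequences) sort first
    priority = probs.clone()
    if protected is not None:
        priority[protected] = float("inf")
    keep = torch.zeros_like(probs, dtype=torch.bool)
    for g in range(num_groups):
        idx = (device_ids == g).nonzero(as_tuple=True)[0]
        if idx.numel() > capacity:
            # one unified sort per device: keep the `capacity` highest priorities
            idx = idx[priority[idx].topk(capacity).indices]
        keep[idx] = True
    return keep
```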

@Richie-yan

@DeepSeekDDM To confirm: DeepSeek-V2 implements token dropping at the device level.
For device-level dropping, do you sort the scores of all experts on the current device together and then drop?

@DeepSeekDDM
Collaborator

@DeepSeekDDM To confirm: DeepSeek-V2 implements token dropping at the device level. For device-level dropping, do you sort the scores of all experts on the current device together and then drop?

Yes. The actual dropping strategy is a little complex, but the main idea is what you described just now.

@Richie-yan

@DeepSeekDDM Could you briefly explain the actual dropping strategy? I'm quite curious.

@DeepSeekDDM
Collaborator

@DeepSeekDDM Could you briefly explain the actual dropping strategy? I'm quite curious.

Just some additional tricks to ensure computational efficiency. They are not the key technique of DeepSeekMoE, and not knowing the details will not prevent you from reproducing DeepSeekMoE.
