WIP: Multihead arch cutlass int8 #1975

Open
wants to merge 88 commits into master
Conversation

almaudoh
Contributor

Implementation of INT8 quantization using CUTLASS for transformer networks.
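For context, the INT8 path runs the transformer's dense layers through CUTLASS integer GEMMs. Below is a minimal sketch of such a GEMM with INT32 accumulation on tensor cores; the type alias, function name, and the exact template configuration (tile shapes, epilogue) are assumptions for illustration, not the PR's actual kernel setup.

```cpp
#include "cutlass/gemm/device/gemm.h"

// Sketch of an INT8 GEMM with INT32 accumulation on integer tensor cores.
// INT8 tensor-op GEMMs use row-major A, column-major B; the default
// configuration requires 16-element alignment, so K should be a multiple
// of 16 and the pointers 16-byte aligned.
using GemmInt8 = cutlass::gemm::device::Gemm<
    int8_t, cutlass::layout::RowMajor,     // A: quantized activations
    int8_t, cutlass::layout::ColumnMajor,  // B: quantized weights
    int32_t, cutlass::layout::RowMajor,    // C/D: int32 accumulators
    int32_t,                               // accumulator element type
    cutlass::arch::OpClassTensorOp,        // use tensor cores
    cutlass::arch::Sm80>;                  // Ampere or newer

cutlass::Status runInt8Gemm(int M, int N, int K, const int8_t* A,
                            const int8_t* B, int32_t* D,
                            cudaStream_t stream) {
  GemmInt8 gemm_op;
  GemmInt8::Arguments args({M, N, K},
                           {A, K},   // lda = K (row-major A)
                           {B, K},   // ldb = K (column-major B)
                           {D, N},   // C (unused since beta == 0)
                           {D, N},   // D: output
                           {1, 0});  // epilogue: alpha = 1, beta = 0
  return gemm_op(args, /*workspace=*/nullptr, stream);
}
```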

ankan-ban and others added 30 commits March 22, 2022 22:28
- Skip-connection add before layer norm now has a scaling factor (alpha).
- Replace the conv layer of the value and MLH heads with an embedding layer when the attention body is used.
- Will be removed once it's fixed.
- Also fix the scratch space calculation; a factor of sizeof(DataType) was missing.
- To handle bigger/wider networks.
1.3% improvement in BT2 on an RTX 4090.
15.6% improvement on a test BT3 network with 64 heads.
- Only tries doing the KQV dense layers in INT8.
- Accuracy seems reasonable.
- Right now quantization isn't fused, and de-quantization is done with the bias add (see the sketch after this list).
- Both of the above can possibly be fused with more work.
- Also need to attempt INT8 for the other dense layers (MHA dense, FFN1 and FFN2).
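The quantize and de-quantize steps described above are simple elementwise passes. A hypothetical sketch follows, assuming per-tensor scales for activations and weights and fp16 activations; the kernel names and scale handling are assumptions, not the PR's actual code.

```cpp
#include <cuda_fp16.h>

// Hypothetical per-tensor quantization: fp16 -> int8 with a given scale.
__global__ void quantizeKernel(int8_t* out, const half* in, float scale,
                               int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = __half2float(in[i]) / scale;
    v = fminf(fmaxf(v, -128.0f), 127.0f);  // clip to the int8 range
    out[i] = static_cast<int8_t>(lrintf(v));
  }
}

// De-quantization folded into the bias add, as the commit notes describe:
// rescale the int32 accumulator by both quantization scales, then add bias.
__global__ void dequantizeBiasAddKernel(half* out, const int32_t* acc,
                                        const half* bias, float scaleIn,
                                        float scaleWt, int n, int cols) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = static_cast<float>(acc[i]) * scaleIn * scaleWt;
    out[i] = __float2half(v + __half2float(bias[i % cols]));
  }
}
```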
almaudoh-1 and others added 23 commits March 2, 2024 17:50
…rnels for clipping of inputs for non-int8 inference.
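The clipping for non-INT8 inference is presumably an elementwise clamp of activations to the calibrated quantization range, so the fp16 path matches the quantized behaviour. A minimal sketch, with the kernel name and range parameter assumed:

```cpp
#include <cuda_fp16.h>

// Hypothetical in-place clip of fp16 activations to a calibrated range.
__global__ void clipActivationsKernel(half* data, float clipMax, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = __half2float(data[i]);
    data[i] = __float2half(fminf(fmaxf(v, -clipMax), clipMax));
  }
}
```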
@almaudoh almaudoh marked this pull request as ready for review May 13, 2024 01:30