WIP: Multihead arch cutlass int8 #1975

Open
wants to merge 88 commits into master
Conversation

almaudoh
Contributor

Implementation of INT8 quantization using CUTLASS for transformer networks.
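For context, the INT8 path runs the transformer's dense layers through CUTLASS integer GEMMs. Below is a minimal sketch of such a GEMM with INT32 accumulation on tensor cores; the type alias, function name, and the exact template configuration (tile shapes, epilogue) are assumptions for illustration, not the PR's actual kernel setup.

```cpp
#include "cutlass/gemm/device/gemm.h"

// Sketch of an INT8 GEMM with INT32 accumulation on integer tensor cores.
// INT8 tensor-op GEMMs use row-major A, column-major B; the default
// configuration requires 16-element alignment, so K should be a multiple
// of 16 and the pointers 16-byte aligned.
using GemmInt8 = cutlass::gemm::device::Gemm<
    int8_t, cutlass::layout::RowMajor,     // A: quantized activations
    int8_t, cutlass::layout::ColumnMajor,  // B: quantized weights
    int32_t, cutlass::layout::RowMajor,    // C/D: int32 accumulators
    int32_t,                               // accumulator element type
    cutlass::arch::OpClassTensorOp,        // use tensor cores
    cutlass::arch::Sm80>;                  // Ampere or newer

cutlass::Status runInt8Gemm(int M, int N, int K, const int8_t* A,
                            const int8_t* B, int32_t* D,
                            cudaStream_t stream) {
  GemmInt8 gemm_op;
  GemmInt8::Arguments args({M, N, K},
                           {A, K},   // lda = K (row-major A)
                           {B, K},   // ldb = K (column-major B)
                           {D, N},   // C (unused since beta == 0)
                           {D, N},   // D: output
                           {1, 0});  // epilogue: alpha = 1, beta = 0
  return gemm_op(args, /*workspace=*/nullptr, stream);
}
```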

ankan-ban and others added 30 commits March 22, 2022 22:28
- Skip-connection add before layer norm now has a scaling factor (alpha).
- Replace the conv layer of the value and MLH heads with an embedding layer when the attention body is used.
- Will be removed once it's fixed.
- Also fix the scratch space calculation; a factor of sizeof(DataType) was missing.
- To handle bigger/wider networks.
1.3% improvement in BT2 on an RTX 4090.
15.6% improvement on a test BT3 network with 64 heads.
- Only tries doing the KQV dense layers in INT8.
- Accuracy seems reasonable.
- Right now quantization isn't fused, and de-quantization is done with the bias add (see the sketch after this list).
- Both of the above can possibly be fused with more work.
- Also need to attempt INT8 for the other dense layers (MHA dense, FFN1 and FFN2).
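The quantize and de-quantize steps described above are simple elementwise passes. A hypothetical sketch follows, assuming per-tensor scales for activations and weights and fp16 activations; the kernel names and scale handling are assumptions, not the PR's actual code.

```cpp
#include <cuda_fp16.h>

// Hypothetical per-tensor quantization: fp16 -> int8 with a given scale.
__global__ void quantizeKernel(int8_t* out, const half* in, float scale,
                               int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = __half2float(in[i]) / scale;
    v = fminf(fmaxf(v, -128.0f), 127.0f);  // clip to the int8 range
    out[i] = static_cast<int8_t>(lrintf(v));
  }
}

// De-quantization folded into the bias add, as the commit notes describe:
// rescale the int32 accumulator by both quantization scales, then add bias.
__global__ void dequantizeBiasAddKernel(half* out, const int32_t* acc,
                                        const half* bias, float scaleIn,
                                        float scaleWt, int n, int cols) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = static_cast<float>(acc[i]) * scaleIn * scaleWt;
    out[i] = __float2half(v + __half2float(bias[i % cols]));
  }
}
```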
almaudoh-1 and others added 23 commits March 2, 2024 17:50
…rnels for clipping of inputs for non-int8 inference.
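The clipping for non-INT8 inference is presumably an elementwise clamp of activations to the calibrated quantization range, so the fp16 path matches the quantized behaviour. A minimal sketch, with the kernel name and range parameter assumed:

```cpp
#include <cuda_fp16.h>

// Hypothetical in-place clip of fp16 activations to a calibrated range.
__global__ void clipActivationsKernel(half* data, float clipMax, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = __half2float(data[i]);
    data[i] = __float2half(fminf(fmaxf(v, -clipMax), clipMax));
  }
}
```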
@almaudoh almaudoh marked this pull request as ready for review May 13, 2024 01:30