
Bad fp16 performance on "GTX 1650" cards #1670

Open
gcp opened this issue Dec 13, 2021 · 16 comments
Comments

@gcp commented Dec 13, 2021

lc0-v0.28.2 built from git

CUDA Runtime version: 11.3.0
Cudnn version: 8.2.1
Latest version of CUDA supported by the driver: 11.4.0
GPU: NVIDIA GeForce GTX 1650
GPU memory: 3.81671 GiB
GPU clock frequency: 1755 MHz
GPU compute capability: 7.5

Network hanse-69722-vf2

go
Found pb network file: ./hanse-69722-vf2.pb.gz
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
info depth 1 seldepth 2 time 3428 nodes 2 score cp 5 nps 6 tbhits 0 pv c2c4 e7e5
info depth 2 seldepth 3 time 4041 nodes 14 score cp 4 nps 15 tbhits 0 pv c2c4 c7c5 g2g3
info depth 2 seldepth 4 time 4694 nodes 33 score cp 4 nps 20 tbhits 0 pv c2c4 c7c5 g1f3 g8f6
info depth 3 seldepth 4 time 4963 nodes 57 score cp 5 nps 30 tbhits 0 pv c2c4 c7c5 g2g3 g7g6

If I force the normal (fp32, not fp16) backend:

Found pb network file: ./hanse-69722-vf2.pb.gz
Creating backend [cudnn]...
CUDA Runtime version: 11.3.0
Cudnn version: 8.2.1
Latest version of CUDA supported by the driver: 11.4.0
GPU: NVIDIA GeForce GTX 1650
GPU memory: 3.81671 GiB
GPU clock frequency: 1755 MHz
GPU compute capability: 7.5
WARNING: you will probably get better performance from the cudnn-fp16 backend.
info depth 1 seldepth 2 time 3342 nodes 2 score cp 5 nps 41 tbhits 0 pv c2c4 c7c5
info depth 2 seldepth 3 time 3441 nodes 14 score cp 4 nps 95 tbhits 0 pv c2c4 c7c5 g2g3
info depth 3 seldepth 4 time 3607 nodes 53 score cp 5 nps 169 tbhits 0 pv c2c4 c7c5 g2g3 g7g6
info depth 3 seldepth 4 time 3658 nodes 70 score cp 4 nps 192 tbhits 0 pv g2g3 d7d5 g1f3 g7g6
info depth 3 seldepth 4 time 3706 nodes 71 score cp 5 nps 172 tbhits 0 pv c2c4 g8f6 g2g3 e7e5
info depth 3 seldepth 5 time 3716 nodes 105 score cp 5 nps 248 tbhits 0 pv c2c4 g8f6 g2g3 e7e5 f1g2
info depth 4 seldepth 6 time 3961 nodes 300 score cp 5 nps 449 tbhits 0 pv c2c4 c7c5 g2g3 g7g6 g1f3 b8c6
info depth 4 seldepth 7 time 4288 nodes 578 score cp 5 nps 581 tbhits 0 pv c2c4 c7c5 g2g3 g7g6 g1f3 b8c6

Notice the performance warning: in reality it's the other way around, by a factor of roughly 10x. IIRC lc0 had GTX 16x0-specific optimizations at some point; they must have been broken.

The same behavior happens with cuda/cuda-fp16 backends.

@gcp commented Dec 13, 2021

Dumping the backend status flags from network_cudnn.cc doesn't show anything suspicious, i.e.

fp16: 1
nhwc: 0
custom wino: 0
use_res_block_winograd_fuse: 0

@gcp commented Dec 13, 2021

I went back to lc0 v0.21.4, whose changelog mentions a bunch of GTX 16x0 bugfixes, and it still performs far worse in fp16 mode.

That made me suspect a driver regression, but then I found something else that is quite suspicious and alarming.

The chip on this card, according to lspci, is a TU106. This is not the chip the GTX 1650 originally shipped with (that would be the TU117); the TU106 is the chip used in the RTX 2060 and RTX 2070.

This makes me wonder if this is a cut-down card where they disabled the Tensor Cores via the driver or fused them off, but then forgot that the GTX 1650 and 1660 are supposed to have a bunch of additional FP16 units (to make up for the missing Tensor Cores) that the RTX cards don't have.

Something else pointing in this direction: the card draws a whopping 22 W at idle, while a real GTX 1650 is supposed to draw 7 or 8 W at most.

@gcp commented Dec 14, 2021

The same card, using the lc0 release binaries and the Windows (rather than Linux) drivers, gets about the same performance in fp16 as in fp32 (cuda/cuda-fp16 and cudnn/cudnn-fp16 are almost identical), slightly slower than a GTX 1060. The large difference from the Linux results confirms my suspicion that the driver is trying to "emulate" the right kind of performance because the chip has Tensor Cores instead of the extra fp16 units.

Note that this is still only about half of what a GTX 1650 is supposed to do in fp16 mode. I'll be returning this card, as I obviously feel pretty scammed by this "fake" GTX 1650.

@gcp commented Dec 14, 2021

The [dx12] backend is also broken with this card (I was curious whether the driver would "forget" that it is supposed to have no Tensor Cores, but it's broken differently): it just hangs during backend creation.

Edit: Oops, it did work eventually, and is now running at about 1/20th of the expected speed.

gcp changed the title from Likely fp16 performance regression on GTX 16x0 cards to Bad fp16 performance on "GTX 1650" cards on Dec 14, 2021
@gcp commented Dec 15, 2021

The vendor confirmed there are two nearly identical-looking versions of this card:
https://www.techpowerup.com/gpu-specs/asus-tuf-gtx-1650-gaming-gddr6.b8403
https://www.techpowerup.com/gpu-specs/asus-tuf-gtx-1650-gaming-gddr6.b7616

Obviously this line from the spec sheet

FP16 (half) performance   5.699 TFLOPS (2:1)

is only true for the latter card!
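As a sanity check independent of lc0, that rated number can be approximated by timing a large half-precision GEMM through cuBLAS. Below is a minimal sketch, not lc0 code; the 4096 matrix size and 20 iterations are arbitrary choices of mine, zero-filled inputs are fine for a pure timing run, and it builds with something like nvcc -O2 hgemm_bench.cu -lcublas:

#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
  const int n = 4096;   // one GEMM is 2*n^3, about 137 GFLOP at this size
  const int iters = 20;
  const size_t bytes = sizeof(__half) * n * n;

  __half *a, *b, *c;
  cudaMalloc(reinterpret_cast<void**>(&a), bytes);
  cudaMalloc(reinterpret_cast<void**>(&b), bytes);
  cudaMalloc(reinterpret_cast<void**>(&c), bytes);
  cudaMemset(a, 0, bytes);  // contents don't matter for a throughput test
  cudaMemset(b, 0, bytes);

  cublasHandle_t handle;
  cublasCreate(&handle);
  __half alpha = __float2half(1.0f), beta = __float2half(0.0f);

  // Warm-up call so one-time initialization doesn't end up in the measurement.
  cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, a, n, b, n,
              &beta, c, n);
  cudaDeviceSynchronize();

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) {
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, a, n, b, n,
                &beta, c, n);
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  double tflops = 2.0 * n * n * double(n) * iters / (ms * 1e-3) / 1e12;
  std::printf("half-precision GEMM throughput: %.2f TFLOPS\n", tflops);

  cublasDestroy(handle);
  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  return 0;
}

On a card that really has the full fp16 rate I'd expect this to land somewhere near the rated figure; a result far below it would point at the fp16 path being emulated or throttled.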

@shizukachan commented

The TU106 has Tensor Cores; they're just throttled when used in a TU11x configuration.

Try hacking https://github.com/LeelaChessZero/lc0/blob/master/src/neural/cuda/network_cudnn.cc#L202 to see if anything helps; if not, file a bug with the hardware vendor.
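For concreteness, the kind of edit meant here is just dropping the name check around that line so the Tensor Core path is always taken. This is a sketch of the experiment (it mirrors the code quoted further down this thread), not a proposed fix:

--- a/src/neural/cuda/network_cudnn.cc
+++ b/src/neural/cuda/network_cudnn.cc
@@ -199,7 +199,7 @@ class CudnnNetwork : public Network {
         // Some GPUs (GTX 16xx) are SM 7.5 but don't have tensor cores
         // enabling TENSOR_OP_MATH or nhwc_ layout for them works but is
         // very very slow (likely because the system emulates it).
-        if (!strstr(deviceProp.name, "GTX 16")) {
+        {  // experiment: force-enable regardless of the reported GPU name
           hasTensorCores = true;
           nhwc_ = true;
         }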

@borg323 commented Dec 15, 2021

We have had a report of atrocious performance on a GTX 16x0 card with CUDA 11.5 and cuDNN 8.2.4, while CUDA 10.2 with cuDNN 7.4.2 (the ones we package) was working fine. Is it possible to try older versions?

@gcp commented Dec 15, 2021

Is it possible to try older versions?

I tested the release packages as indicated above (on Windows, where it's easier to swap out the DLLs). See #1670 (comment).

I'm still seeing only about half the expected performance.

@gcp commented Dec 15, 2021

These are the default results:

Driver Version: 495.44       CUDA Version: 11.5 
       _
|   _ | |
|_ |_ |_| v0.28.2 built Dec 15 2021
Found pb network file: ./42850.net
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 11.5.0
Cudnn version: 8.0.4
WARNING: CUDNN Runtime version mismatch, was compiled with version 8.0.0
Latest version of CUDA supported by the driver: 11.5.0
GPU: NVIDIA GeForce GTX 1650
GPU memory: 3.81671 GiB
GPU clock frequency: 1755 MHz
GPU compute capability: 7.5
Benchmark batch size 1 with inference average time 37.2439ms - throughput 26.85 nps.
...
Benchmark batch size 8 with inference average time 38.184ms - throughput 209.512 nps.
...
Benchmark batch size 16 with inference average time 38.6451ms - throughput 414.024 nps.
...
Benchmark batch size 32 with inference average time 73.3726ms - throughput 436.13 nps.

Forcing fp32

       _
|   _ | |
|_ |_ |_| v0.28.2 built Dec 15 2021
Found pb network file: ./42850.net
Creating backend [cudnn]...
CUDA Runtime version: 11.5.0
Cudnn version: 8.0.4
WARNING: CUDNN Runtime version mismatch, was compiled with version 8.0.0
Latest version of CUDA supported by the driver: 11.5.0
GPU: NVIDIA GeForce GTX 1650
GPU memory: 3.81671 GiB
GPU clock frequency: 1755 MHz
GPU compute capability: 7.5
WARNING: you will probably get better performance from the cudnn-fp16 backend.
Benchmark batch size 1 with inference average time 4.30386ms - throughput 232.35 nps.
...
Benchmark batch size 8 with inference average time 6.75452ms - throughput 1184.39 nps.
...
Benchmark batch size 16 with inference average time 8.41624ms - throughput 1901.09 nps.
...
Benchmark batch size 32 with inference average time 19.9259ms - throughput 1605.95 nps.

With the proposed change to force-enable Tensor Cores:

       _
|   _ | |
|_ |_ |_| v0.28.2+git.dirty built Dec 15 2021
Found pb network file: ./42850.net
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 11.5.0
Cudnn version: 8.0.4
WARNING: CUDNN Runtime version mismatch, was compiled with version 8.0.0
Latest version of CUDA supported by the driver: 11.5.0
GPU: NVIDIA GeForce GTX 1650
GPU memory: 3.81671 GiB
GPU clock frequency: 1755 MHz
GPU compute capability: 7.5
Benchmark batch size 1 with inference average time 4.02966ms - throughput 248.16 nps.
...
Benchmark batch size 8 with inference average time 33.6778ms - throughput 237.545 nps.
...
Benchmark batch size 16 with inference average time 34.6671ms - throughput 461.533 nps.
...
Benchmark batch size 32 with inference average time 69.3711ms - throughput 461.288 nps.

@gcp commented Dec 15, 2021

if not, file a bug with the hardware vendor

@shizukachan Do you mean NVIDIA or ASUS here? If it's the latter, does this mean that e.g. they made some mistake in the BIOS and forgot to enable the Tensor Cores?

@gcp commented Dec 15, 2021

We have had a report of atrocious performance on a GTX 16x0 card with CUDA 11.5 and cuDNN 8.2.4, while CUDA 10.2 with cuDNN 7.4.2 (the ones we package) was working fine. Is it possible to try older versions?

I have now tested this on Linux too, and indeed it's the same as on Windows: stepping back to cuDNN 7.6.5 "improves" the fp16 performance to slightly slower than fp32, which is still about half of what it should be.
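With several cuDNN versions installed side by side it's easy to accidentally test the wrong one, so here is a tiny standalone check (nothing lc0-specific, just the cuDNN API, built with the usual -lcudnn) of which library a process actually loads:

#include <cstdio>
#include <cudnn.h>

int main() {
  // cudnnGetVersion() reports the cuDNN library actually loaded at run time;
  // the macros show what this check program itself was compiled against.
  std::printf("cuDNN loaded at runtime: %zu\n", cudnnGetVersion());
  std::printf("compiled against:        %d.%d.%d\n", CUDNN_MAJOR, CUDNN_MINOR,
              CUDNN_PATCHLEVEL);
  return 0;
}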

I wonder whether, when it was reported to be "working fine", that just meant "not atrociously bad" rather than "where it should be" :-)

@borg323 commented Dec 15, 2021

I wonder whether, when it was reported to be "working fine", that just meant "not atrociously bad" rather than "where it should be" :-)

I checked: it was with a TU117, so probably "where it should be". When we did that test we were not troubleshooting a performance issue, just checking whether the new versions helped.

@borg323 commented Dec 18, 2021

If you haven't returned the card yet, can you check whether the following enables the Tensor Cores?

--- a/src/neural/cuda/network_cudnn.cc
+++ b/src/neural/cuda/network_cudnn.cc
@@ -199,7 +199,8 @@ class CudnnNetwork : public Network {
         // Some GPUs (GTX 16xx) are SM 7.5 but don't have tensor cores
         // enabling TENSOR_OP_MATH or nhwc_ layout for them works but is
         // very very slow (likely because the system emulates it).
-        if (!strstr(deviceProp.name, "GTX 16")) {
+        if (!strstr(deviceProp.name, "GTX 16") ||
+            (deviceProp.pciDeviceID & ~0x7f) == 0x1f00) {
           hasTensorCores = true;
           nhwc_ = true;
         }
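For anyone with such a card who wants to see what this condition would evaluate to on their system, here is a small standalone program (CUDA runtime API only, built with nvcc) that simply prints the fields the diff reads, without interpreting them:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    std::fprintf(stderr, "no CUDA device found\n");
    return 1;
  }
  // Print the device name plus the pciDeviceID field that the proposed
  // condition tests, both raw and with the same mask applied as above.
  std::printf("name:                %s\n", prop.name);
  std::printf("compute capability:  %d.%d\n", prop.major, prop.minor);
  std::printf("pciDeviceID:         0x%04x\n", prop.pciDeviceID);
  std::printf("pciDeviceID & ~0x7f: 0x%04x\n", prop.pciDeviceID & ~0x7f);
  return 0;
}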

@gcp commented Dec 20, 2021

I tried this trick (unconditionally) above, where it says "With the proposed change to force-enable Tensor Cores", and as you can see performance is still atrocious. That was with a later cuDNN, but at this point I want to get rid of the card ASAP and have already requested the return.

@borg323 commented Dec 20, 2021

We may have a workaround/fix in #1675 if you can still try it.

@gcp commented Dec 20, 2021

I can't, and the card was a TU106 anyway (not TU11x); it was only at half the normal speed even with the old CUDA 10.2 / cuDNN 7.4 (release package on Windows), as described above.
