Some network architectures broken in 0.28 with NaNs in the calculations #1620

rooklift · 2021-08-05T14:30:27Z

This is the Tiny Gyal 8 network in dx12 0.28-rc1

Even though the net is weak, it's not this weak...

rooklift · 2021-08-05T14:33:06Z

It looks like you might have to actually play out the moves to get the bad behaviour: if you simply start from this position, the output looks reasonable, but if you run the engine while playing the moves, things get dodgy.

rooklift · 2021-08-05T14:55:50Z

Example repro steps (tested in dx12, edit the path to your copy of the weights)...

uci
setoption name WeightsFile value C:\Users\Owner\Documents\Misc\Chess\Lc0_Networks\tinygyal-8.pb.gz
setoption name VerboseMoveStats value true
ucinewgame
position startpos
go nodes 1000000

    <wait for bestmove>

position startpos moves e2e4
go nodes 1000000

Output:

info string g7g5  (378 ) N:     291 (+ 0) (P:  0.71%) (WL: -0.25873) (D: 0.000) (M:  3.8) (Q: -0.25873) (U: 0.26795) (S:  0.00922) (V:  -.----)
info string b7b5  (234 ) N:     382 (+ 0) (P:  0.39%) (WL: -0.11978) (D: 0.000) (M:  4.4) (Q: -0.11978) (U: 0.11256) (S: -0.00722) (V:  -.----)
info string f7f6  (346 ) N:     404 (+ 0) (P:  0.72%) (WL: -0.19686) (D: 0.000) (M:  4.1) (Q: -0.19686) (U: 0.19810) (S:  0.00125) (V:  -.----)
info string g8h6  (161 ) N:     567 (+ 0) (P:  0.71%) (WL: -0.14161) (D: 0.000) (M:  4.1) (Q: -0.14161) (U: 0.13831) (S: -0.00330) (V:  -.----)
info string b8a6  (34  ) N:     591 (+ 0) (P:  0.73%) (WL: -0.14055) (D: 0.000) (M:  4.7) (Q: -0.14055) (U: 0.13631) (S: -0.00424) (V:  -.----)
info string e7e5  (322 ) N:     648 (+ 0) (P: 20.42%) (WL:      nan) (D: 0.000) (M:  5.1) (Q:      nan) (U: 3.49159) (S:      nan) (V:  -.----)
info string a7a5  (207 ) N:     864 (+ 0) (P:  0.65%) (WL: -0.09471) (D: 0.000) (M:  5.2) (Q: -0.09471) (U: 0.08390) (S: -0.01082) (V:  -.----)
info string f7f5  (351 ) N:    1000 (+ 0) (P:  0.57%) (WL: -0.07996) (D: 0.000) (M:  5.1) (Q: -0.07996) (U: 0.06328) (S: -0.01668) (V:  -.----)
info string h7h6  (400 ) N:    1762 (+ 0) (P:  2.34%) (WL: -0.14931) (D: 0.000) (M:  5.2) (Q: -0.14931) (U: 0.14703) (S: -0.00228) (V:  -.----)
info string b7b6  (230 ) N:    2549 (+ 0) (P:  0.98%) (WL: -0.06257) (D: 0.000) (M:  5.3) (Q: -0.06257) (U: 0.04273) (S: -0.01984) (V:  -.----)
info string b8c6  (36  ) N:    2636 (+ 0) (P:  3.58%) (WL: -0.15253) (D: 0.000) (M:  6.3) (Q: -0.15253) (U: 0.15050) (S: -0.00203) (V:  -.----)
info string a7a6  (204 ) N:    3325 (+ 0) (P:  3.28%) (WL: -0.11613) (D: 0.000) (M:  5.4) (Q: -0.11613) (U: 0.10955) (S: -0.00658) (V:  -.----)
info string g7g6  (374 ) N:    3685 (+ 0) (P:  1.31%) (WL: -0.06062) (D: 0.000) (M:  5.9) (Q: -0.06062) (U: 0.03928) (S: -0.02133) (V:  -.----)
info string g8f6  (159 ) N:    4990 (+ 0) (P:  1.13%) (WL: -0.05032) (D: 0.000) (M:  6.8) (Q: -0.05032) (U: 0.02517) (S: -0.02515) (V:  -.----)
info string c7c6  (259 ) N:    6088 (+ 0) (P:  5.68%) (WL: -0.11082) (D: 0.000) (M:  7.2) (Q: -0.11082) (U: 0.10355) (S: -0.00727) (V:  -.----)
info string d7d6  (288 ) N:    7882 (+ 0) (P:  7.59%) (WL: -0.11371) (D: 0.000) (M:  6.2) (Q: -0.11371) (U: 0.10687) (S: -0.00683) (V:  -.----)
info string e7e6  (317 ) N:   12690 (+ 0) (P: 21.54%) (WL: -0.18663) (D: 0.000) (M:  7.3) (Q: -0.18663) (U: 0.18832) (S:  0.00169) (V:  -.----)
info string d7d5  (293 ) N:   90376 (+ 0) (P: 10.35%) (WL: -0.04425) (D: 0.001) (M:  8.3) (Q: -0.04425) (U: 0.01270) (S: -0.03155) (V:  -.----)
info string c7c5  (264 ) N:  170214 (+ 0) (P: 16.54%) (WL: -0.04273) (D: 0.000) (M:  9.2) (Q: -0.04273) (U: 0.01078) (S: -0.03195) (V:  -.----)
info string h7h5  (403 ) N:  392984 (+205) (P:  0.78%) (WL: -0.01160) (D: 0.001) (M:  9.2) (Q: -0.01160) (U: 0.00022) (S: -0.01138) (V:  -.----)
info string node  (  20) N:  703929 (+205) (P: 99.99%) (WL:     -nan) (D: 0.001) (M:  9.9) (Q:     -nan) (V:  -.----)
bestmove h7h5 ponder f1d3

Of course h7h5 is a terrible move. Also, what are these NaNs doing in the output?
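To spot the affected entries without reading the whole dump, the VerboseMoveStats lines can be scanned for NaN fields. This is only a minimal sketch: the helper names `parse_stats_line` and `nan_entries` are made up for illustration, and the regular expressions assume the exact line layout shown in the output above.

```python
import math
import re

# Layout assumed from the output above: "info string <move> (...) N: ... (KEY: value) ..."
MOVE_RE = re.compile(r"^info string (\S+)")
FIELD_RE = re.compile(r"\((\w+):\s*([^)]*?)\s*\)")


def parse_stats_line(line):
    """Return (move, fields) for a verbose stats line, or None if it is not one."""
    m = MOVE_RE.match(line)
    if m is None:
        return None
    fields = {}
    for key, raw in FIELD_RE.findall(line):
        try:
            # "nan" and "-nan" both parse to a float NaN; "20.42%" loses its percent sign.
            fields[key] = float(raw.rstrip("%"))
        except ValueError:
            # Placeholders such as "-.----" are kept as None.
            fields[key] = None
    return m.group(1), fields


def nan_entries(lines):
    """List (move, [field names]) for every stats line that has at least one NaN field."""
    found = []
    for line in lines:
        parsed = parse_stats_line(line)
        if parsed is None:
            continue
        move, fields = parsed
        bad = sorted(k for k, v in fields.items() if v is not None and math.isnan(v))
        if bad:
            found.append((move, bad))
    return found
```

Fed the output above, this would flag `e7e5` (WL, Q and S) and the summary `node` line (WL and Q), and nothing else.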

rooklift · 2021-08-05T15:17:06Z

Hmm, with CUDA, by contrast, I don't get NaNs in the output, and the results seem marginally better but still weird compared to 0.27.

Edit: Actually, I think cuda-fp16 shows broken results, while cuda (fp32) seems OK.

rooklift · 2021-08-05T19:33:32Z

Comparing cuda with cuda-fp16 on startpos (again, still Tiny Gyal 8):

cuda:        e4    P = 42.54%
cuda-fp16:   e4    P = 9.92%

borg323 · 2021-08-24T08:18:03Z

Is this still an issue with rc2?

rooklift · 2021-08-24T13:31:49Z

Yeah, for example I got this with 0.28-rc2 cuda:

mooskagh · 2021-10-28T09:02:54Z

Still an issue?

rooklift · 2021-10-28T15:08:03Z

I just downloaded a recent AppVeyor build for dx12, and there is definitely still some weirdness happening with tinygyal, for instance:

With lines like:

info depth 10 seldepth 20 time 2366 nodes 429846 score cp -2147483648 wdl 0 1000 0 nps 74184 tbhits 0 multipv 2 pv d1h5 f8b4 b1c3 d8h4 f1b5 h4g3 h2g3 c7c6 b5a4 g8e7 h5e5 b4c3 e5c3 b7b5 g1f3 b5a4 b2b3 d7d5 b3a4

info string d1h5  (93  ) N:   75512 (+ 0) (P:  0.56%) (WL:      nan) (D: 0.000) (M: 10.1) (Q:      nan) (U: 0.00055) (S:      nan) (V:  -.----)
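The `score cp -2147483648` in the info line above is exactly the minimum 32-bit signed integer (-2^31). That is plausibly what a NaN evaluation turns into when converted to an integer centipawn score, although nothing in this thread confirms the mechanism. A small check for that sentinel, as a sketch (the function name `has_int_min_score` is hypothetical):

```python
import re

INT32_MIN = -2**31  # -2147483648, the value seen in the "score cp" field

SCORE_RE = re.compile(r"\bscore cp (-?\d+)")


def has_int_min_score(info_line):
    """True if a UCI info line reports a centipawn score equal to INT32_MIN."""
    m = SCORE_RE.search(info_line)
    return m is not None and int(m.group(1)) == INT32_MIN
```

Running this over the engine's `info depth ...` lines gives a quick signal that a search has been contaminated, even when VerboseMoveStats is off.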

This was without the --backend-opts=enable-gemm-metacommand=false workaround though.

borg323 · 2021-10-28T15:57:45Z

Is the workaround effective?

rooklift · 2021-10-28T16:46:15Z

For dx12, yes, I think the workaround works.

For CUDA, I think everything is fine now even without it, as far as I can see.
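For anyone who wants to re-test with and without the workaround, the repro from earlier in the thread can be scripted. The sketch below only builds the command line and the UCI input; it does not launch the engine. The binary name `lc0` and the helper names are assumptions, `--backend` is the usual lc0 backend flag as far as I know, and the `--backend-opts` value is copied from the comment above.

```python
def lc0_argv(binary="lc0", backend="dx12", gemm_metacommand=False):
    """Command line for the engine; the binary name and location are assumptions."""
    argv = [binary, f"--backend={backend}"]
    if not gemm_metacommand:
        # Workaround quoted in this thread.
        argv.append("--backend-opts=enable-gemm-metacommand=false")
    return argv


def repro_phases(weights_path, nodes=1000000):
    """UCI input in two phases; wait for "bestmove" after each phase before continuing."""
    first = [
        "uci",
        f"setoption name WeightsFile value {weights_path}",
        "setoption name VerboseMoveStats value true",
        "ucinewgame",
        "position startpos",
        f"go nodes {nodes}",
    ]
    second = [
        "position startpos moves e2e4",
        f"go nodes {nodes}",
    ]
    return [first, second]
```

The two phases matter because the bad output only showed up after the first search had finished and the move had been played on the same engine instance.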

rooklift changed the title ~~Some network architectures broken in 0.28?~~ Some network architectures broken in 0.28 with NaNs in the calculations Aug 5, 2021
