
Search on a single thread is non-deterministic in 0.29-rc0 with T78 nets #1769

diceydust opened this issue Aug 8, 2022 · 13 comments

@diceydust

Even with the number of threads set to 1, lc0's search is non-deterministic: every time I analyse the same position, the output differs in the number of visits etc. For example, I have done many runs of 0.29-rc0 (784010 net) from the starting position with the following options:

--threads=1
--task-workers=0
--syzygy-paths=C:\Tablebases\Syzygy
--logfile=log.txt
--log-live-stats=true
--smart-pruning-factor=0

After 'go nodes 100000' the engine most often returns d2d4 as the best move, but there were runs where e2e4 was chosen instead. I attach two log files, two runs from the same position returning different best moves; the full session is sketched below for reference.

I'm pretty sure this is not expected behavior, therefore reporting this as a bug.
1.txt
2.txt
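
A minimal sketch of the full session, assuming the 784010 net is passed via --weights (the file name is illustrative; adjust to however the net is actually loaded):

```
lc0 --threads=1 --task-workers=0 --syzygy-paths=C:\Tablebases\Syzygy --logfile=log.txt --log-live-stats=true --smart-pruning-factor=0 --weights=784010.pb.gz
position startpos
go nodes 100000
```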

@mooskagh (Member) commented Aug 8, 2022

If you (or someone else) can build Lc0 from source, it would be interesting to git bisect to find the exact commit where this started to happen.
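
A sketch of such a bisect session; the two tags below are placeholders for whichever builds are known to misbehave and behave:

```
git bisect start
git bisect bad v0.29.0-rc0     # first version seen to be non-deterministic (placeholder tag)
git bisect good v0.28.2        # assumed-deterministic older release (placeholder tag)
# at each step: build, run the reproduction above several times, then mark
#   git bisect good   - output identical across runs
#   git bisect bad    - output differs across runs
git bisect reset
```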

@borg323 (Member) commented Aug 10, 2022

Is it repeatable with a T80 net?

@diceydust (Author)

> Is it repeatable with a T80 net?

Actually it is not! With a T80 net everything is fine: I get the same output every time. Can someone confirm this strange behaviour with T78?

@borg323 (Member) commented Aug 21, 2022

Thanks, this is very useful information. T78 nets use a new part of the cuda backend, so we now know where to look.

borg323 changed the title from "Search on a single thread is non-deterministic in 0.29-rc0" to "Search on a single thread is non-deterministic in 0.29-rc0 with T78 nets" on Aug 21, 2022
@Kovax007

This is possibly related: if we create a transformer out of encoders only, the search goes totally wrong, and differently each time. The issue is mitigated with mbs=1, which suggests it is related to the memory allocations made when encoder layers are evaluated. It also seems to be solved in the Ceres backend by this commit: dje-dev/Ceres@da03904, but that solution does not translate to lc0, imo mostly because there CUDA graphs solve the issue by allocating the tensor memory automatically.
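
If mbs refers to lc0's minibatch size (my reading of the abbreviation, not confirmed above), the mitigation would look like:

```
# assuming "mbs=1" means the --minibatch-size option; forcing a batch
# size of 1 is slow but appears to avoid the non-determinism
lc0 --threads=1 --minibatch-size=1
```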

@borg323 (Member) commented Sep 14, 2022

@diceydust can you check whether #1773 makes T78 results deterministic?

@borg323 (Member) commented Sep 15, 2022

Since #1773 is now merged, a test can also be done using master.
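
For anyone wanting to test without waiting for a binary, building master locally is roughly the following (see the lc0 README for the authoritative steps):

```
git clone --recurse-submodules https://github.com/LeelaChessZero/lc0.git
cd lc0
./build.sh        # Linux/macOS; on Windows use build.cmd
```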

@diceydust (Author)

Guys, I have no access to my home PC these days. Perhaps someone could provide Lc0 master binaries? Or maybe releasing 0.29-rc1 is an idea.

@borg323 (Member) commented Sep 23, 2022

You can find the current master lc0 binary with the cuda backend at https://ci.appveyor.com/api/buildjobs/l4tti40aktxcp0g6/artifacts/build%2Flc0.exe

@diceydust (Author)

> @diceydust can you check whether #1773 makes T78 results deterministic?

I have checked it, and the issue is still there. Attaching log files.
1.txt
2.txt

@marcin-rzeznicki

Just my 2 cents: it's definitely a thing. Yesterday I left an unfinished analysis in my files; today I went back to the same position, but by a different "path" through the alternatives (because some of them I had already analyzed), and when I reached the exact same position, the top move was different this time. This is not great, I'd say.

@borg323 (Member) commented May 14, 2023

@diceydust can you check again now that v0.30.0-rc1 is out? The related code was significantly revamped for the release.

@diceydust (Author) commented Jul 27, 2023

Just reporting that the issue is still there (in v0.30.0). I've checked the latest binary with the t1-768x15x24h-swa-4000000.pb.gz net and done two runs (go depth 100000) from the initial position.

After the first run I got:
info depth 9 seldepth 44 time 120839 nodes 68865 score cp 16 nps 616 tbhits 0 pv d2d4 d7d5 c2c4 c7c6 b1c3 g8f6 g1f3 e7e6 e2e3 a7a6 b2b3 f8b4 c1d2 b8d7 f1d3 d8e7 e1g1 e8g8 h2h3

And after the second run:

info depth 9 seldepth 43 time 114826 nodes 69349 score cp 16 nps 615 tbhits 0 pv d2d4 d7d5 c2c4 c7c6 c4d5 c6d5 c1f4 a7a6 e2e3 b8c6 f1d3 g8f6 g1f3 c8g4 b1d2 e7e6 d1b3 c6a5 b3d1
