
Extracting parts of Lc0's search into classes would help future development #1734

Open
Naphthalin opened this issue Apr 20, 2022 · 5 comments

Comments

@Naphthalin
Contributor

As of now, Lc0's search still uses AlphaZero's original PUCT, with a few modifications: WDL output plus draw score, some certainty propagation, and the general batching strategy. This works decently well, but one reason progress on the search has stalled is that search changes are not only hard to test (especially if they affect scaling behaviour) but also hard to implement in the first place. This proposal aims to change the latter.

Currently, an experiment in the search code involves

  • the functional changes to search.cc and search.h themselves,
  • possibly changes to node.cc and node.h if a new node output is needed,
  • adding any new parameters and their accessors in params.cc and params.h,
  • adding these new parameters in several places in search.cc, some of them rather counterintuitive, e.g. in move sorting.

This effectively makes casual experiments infeasible and, on top of that, requires major effort to keep branches mergeable with master, which raises the bar for contributing to Leela's search.

I have identified the following aspects of Lc0's search code which IMO would profit from being extracted into their own classes, similar to how timemgr already is. Experiments in the respective areas would then only require creating the associated class file and registering the new class in the respective file, without touching any of the files above or creating merge complications.

  1. Selection of move to play. Currently we use two strategies: most visits, and with temperature. There are other strategies of interest, like sampling according to the NN policy, which are yet to be implemented.
  2. Selection of move to explore. Currently this uses PUCT, with the addition of draw score and some batching logic in multigather and collision scaling. Here, a lot of alternatives are possible (and explored in the literature).
  3. Result propagation. Currently this follows the general MCTS principle of averaging all outcomes and calculating these averages incrementally. This would need some adaptation at least for the MCGS implementation, and in general when moving away from pure averaging.
  4. Node value recalculation. This doesn't exist yet; most MCTS variants (including PUCT and MCGS) stick to averaging, but if we instead calculate the value of a node from the values of its child nodes, as e.g. RENTS or my own Beta-MCTS: a soft hybrid of MCTS and Minimax (wip, not for merge) #963 do, this would need to be added, and multiple strategies are possible.

I will create a post for each of the four points, (a) collecting the functions such classes would need, (b) spelling them out at least in pseudo-code (or copying from the existing code where available), and (c) listing all parts of the code which need replacement. A rough interface sketch follows below.
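To make the shape of this concrete, here is a purely hypothetical sketch of what the four strategy interfaces could look like. Class names and signatures are placeholders for illustration, not existing Lc0 API; Node, Move and SearchParams stand in for the corresponding Lc0 types.

```cpp
// Hypothetical interfaces only; each strategy would live in its own file and
// be registered by name, similar in spirit to how timemgr is separated out.
class Node;
class Move;
class SearchParams;

class PlayedMoveSelector {   // 1. selection of move to play
 public:
  virtual ~PlayedMoveSelector() = default;
  virtual Move ChooseMove(const Node* root, const SearchParams& params) = 0;
};

class ExplorationPolicy {    // 2. selection of move to explore
 public:
  virtual ~ExplorationPolicy() = default;
  virtual Node* PickChildToVisit(Node* parent, const SearchParams& params) = 0;
};

class ResultPropagator {     // 3. result propagation
 public:
  virtual ~ResultPropagator() = default;
  virtual void Backpropagate(Node* leaf, float v, float d, float m) = 0;
};

class NodeRecalculator {     // 4. node value recalculation
 public:
  virtual ~NodeRecalculator() = default;
  virtual void Recalculate(Node* node) = 0;
};
```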

@Naphthalin
Contributor Author

Naphthalin commented Apr 20, 2022

1. Selection of move to play.

Current strategies:

  • move with max visits. This is done through Search::GetBestChildNoTemperature(), which itself calls GetBestChildrenNoTemperature(), which returns an ordered list of edges. The same order is also used for multiPV, verbosemovestats and other outputs; returning the best move just piggybacks on these functions.
  • move with temperature. This is done through Search::GetBestChildWithTemperature(), which acts if temperature is set and takes several additional temperature parameters into account. This is used for creating training games, so backwards compatibility becomes an issue.

Additional strategies:

  • move with highest LCB. This is likely best done in the move ordering, as LCB is a generalization of Q and N based selection strategies, and affects not only selected move, but also move ordering in verbosemovestats etc.
  • move according to policy. There is a recent PR ((WIP) Functionality for choosing a random move weighted by P outputs. #1732) implementing this, though an additional parameter specifying a lower bound on the chosen policy would likely be nice.

The class(es) would need to provide

  • a function returning a single move.

This could be done either by looping over the root's child nodes, or by handing over an array with all relevant information; a sketch follows after the parameter list below.
A few parameters would need to be handed over from search:

  • drawscore
  • fpu
  • pst (maybe)
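For illustration, here is a minimal sketch of one such strategy under the "array with all relevant information" variant. Types and names are hypothetical, not existing Lc0 code: pick the root child with the most visits, breaking ties by Q. Strategies that need drawscore/fpu would receive them the same way.

```cpp
#include <cstdint>
#include <vector>

struct ChildInfo {     // the "array with all relevant information"
  uint32_t visits;
  float q;             // value from the side to move's perspective
  float policy;        // NN prior
  int move_index;      // index into the root's edge list
};

class MostVisitsSelector {
 public:
  // Returns the index of the chosen child, or -1 if there are none.
  int ChooseMove(const std::vector<ChildInfo>& children) const {
    int best = -1;
    for (int i = 0; i < static_cast<int>(children.size()); ++i) {
      if (best < 0 || children[i].visits > children[best].visits ||
          (children[i].visits == children[best].visits &&
           children[i].q > children[best].q)) {
        best = i;
      }
    }
    return best;
  }
};
```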

@Naphthalin
Contributor Author

Naphthalin commented Apr 21, 2022

2. Selection of move to explore.

This is probably the hardest of the four, but also the most interesting one for future development.

Current strategies:

  • PUCT. This defines a score function as value plus exploration term, S = Q + U.
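For reference, a minimal sketch of the plain AlphaZero-style score; Lc0's real formula additionally grows cpuct with the parent's visit count and applies FPU and draw-score adjustments, all omitted here.

```cpp
#include <cmath>
#include <cstdint>

// S = Q + U with U = cpuct * P * sqrt(N_parent) / (1 + N_child); plain
// AlphaZero form, without Lc0's cpuct growth, FPU or draw-score handling.
float PuctScore(float q, float policy, uint32_t parent_n, uint32_t child_n,
                float cpuct) {
  const float u = cpuct * policy * std::sqrt(static_cast<float>(parent_n)) /
                  (1.0f + static_cast<float>(child_n));
  return q + u;
}
```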

Additional strategies:

General remarks:

  • we made the multigather path the default in Make multi-gather default. #1550, so ideally we only consider multigather (i.e. collecting a batch of nodes to visit "at once" instead of one by one) in the current format
  • the above strategies can be divided into quasi-deterministic ones (each node has an exploration score) and randomized ones (each node has an exploration probability)
  • luckily, conversion in both directions is possible through the Gumbel-Max trick, and it can be efficiently extended to gathering batches by using "draws with and without replacement" strategies (a small sketch follows below)
  • this means that the current way of defining an exploration score for each node, taking certain search parameters into account, is sufficient to cover both types of strategies, and works with multigather.
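To illustrate the conversion mentioned above, a minimal sketch of the Gumbel-Top-k variant of the Gumbel-Max trick (hypothetical helper, not Lc0 code): perturbing each child's score with independent Gumbel noise and taking the k largest yields k draws without replacement from the softmax of those scores, i.e. a randomized batch pick obtained from deterministic scores.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Gumbel-Top-k: add Gumbel noise to each (log-)score and take the k largest;
// this samples k children without replacement proportionally to
// softmax(scores).
std::vector<int> GumbelTopK(const std::vector<float>& scores, int k,
                            std::mt19937& rng) {
  std::uniform_real_distribution<float> uniform(1e-7f, 1.0f);
  std::vector<float> perturbed(scores.size());
  for (size_t i = 0; i < scores.size(); ++i) {
    const float gumbel = -std::log(-std::log(uniform(rng)));
    perturbed[i] = scores[i] + gumbel;
  }
  std::vector<int> idx(scores.size());
  std::iota(idx.begin(), idx.end(), 0);
  k = std::min(k, static_cast<int>(idx.size()));
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](int a, int b) { return perturbed[a] > perturbed[b]; });
  idx.resize(k);
  return idx;
}
```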

The class(es) would need to provide the following functions:

  • exploration score of all child nodes of a node, including unvisited ones
  • a way of caching "once per parent node" calculations
  • a version of GetVisitsToReachU() to speed up multigather
  • heuristic-dependent multiPV/verbosemovestats outputs
  • some way of hiding strategy specific fields in the Node struct at compile time, or mirroring/replacing the Node structure to allow for memory efficient builds
  • mixed strategy, e.g. based on visits (my betamcts-rents branch with the BetaTS option uses PUCT for <100 nodes and then uses Thompson Sampling)

@Naphthalin
Contributor Author

3. Result propagation

Current strategy:

  • incremental updates for all nodes along the root -> new leaf node path, such that Q/D/M of a node is the average over all NN evals of its subtree
  • plus some terminal logic for proving wins/losses, which also acts incrementally
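For reference, the incremental form of that running average (a one-liner sketch; the real code also tracks D and M and handles batched multi-visit updates):

```cpp
#include <cstdint>

// After n evaluations with running mean q, folding in a new evaluation v
// yields the mean over all n + 1 evaluations.
float UpdatedQ(float q, uint32_t n, float v) {
  return q + (v - q) / static_cast<float>(n + 1);
}
```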

Other strategies:

  • remove the V of a node when extending it (might be better for mate lines), so Q/D/M is only the average over the current leaf nodes
  • minimax update of all nodes along the root -> new leaf node path
  • "soft minimax", where each child node's Q gets a weight on top of its visits; pure averaging would be "all weights = 1", minimax would be "best Q gets weight 1, all others weight 0"
  • MCGS (when encountering a discrepancy between nodes in the tree, send an amplified correction signal instead of a true Q/D/M)
  • in principle, during the update process a node recalculation could be triggered as well

The class(es) would need to provide the following functions:

  • provided with the old values plus weight, and the "new result" from the explored child, return the new Q/D/M/N values for the parent (a rough signature sketch follows after this list)
  • some way of blocking strategies based on the necessary Node fields for optimized builds
  • how to deal with new terminals (though that's likely the same for all possible strategies, and is already implemented)
  • if in a graph search implementation a node has more than one parent, decide whether to update other parents than the traversed one (I personally think that's a bad idea anyway, so probably nothing to worry about)
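A rough signature sketch for the first point (hypothetical names and types, not existing code; sign flipping between plies and terminal handling are omitted), shown here with plain averaging as the default strategy:

```cpp
#include <cstdint>

// Hypothetical propagation hook: given the parent's current running values
// and the result arriving from the traversed child, return the parent's new
// values. "multivisit" covers batched updates resolving several visits at once.
struct NodeValues {
  float q = 0.0f;   // expected score
  float d = 0.0f;   // draw probability
  float m = 0.0f;   // moves-left estimate
  uint32_t n = 0;   // visits
};

class AveragingPropagator {
 public:
  NodeValues Update(const NodeValues& parent, const NodeValues& child_result,
                    uint32_t multivisit) const {
    NodeValues out = parent;
    out.n += multivisit;
    const float w =
        static_cast<float>(multivisit) / static_cast<float>(out.n);
    out.q += (child_result.q - out.q) * w;
    out.d += (child_result.d - out.d) * w;
    out.m += (child_result.m - out.m) * w;
    return out;
  }
};
```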

@Naphthalin
Contributor Author

4. Node value recalculation

This doesn't exist yet, with the small exception of reverting twofold-draw visits to restore balance to the tree, though there are known side effects at fast TCs (#1466, fix attempt in #1513).

A node recalculation would be necessary in all strategies which can't be fully represented as incremental updates. In general, it means calculating Q/D/M of a node from its children, which effectively allows pruning branches that turned out badly, and hence also allows exploration strategies which don't converge on mostly exploring the best moves. #963 and the mentioned betamcts-rents branch explore a version of "soft minimax recalculation", which interpolates between pure averaging at low node counts and pure minimax in the limit of infinite nodes, and allows nearly arbitrary forced exploration, as happens in forward/backward analysis.

However, a big caveat is that this can lead to non-incremental changes in Q, which interacts badly with PUCT, as the latter relies on incremental updates in order for the PUCT score to (semi-)reliably spread the visits across moves. This means that likely neither a recalculation loop nor a replacement for PUCT will lead to an improvement without the other, and they need to be tested together.

The class(es) would need to provide the following functions (a minimal sketch follows after the list):

  • when to be triggered: during the incremental update, or just before exploration
  • triggered at each visit, or gated by a bool Recalculate(int visits)
  • return new node values after reading all child nodes
  • some way of blocking strategies based on necessary Node fields missing in optimized builds
  • mixed strategy, e.g. based on visits (my betamcts branch currently uses pure averaging for <100 nodes and only then recalculates nodes)
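A minimal sketch of such a recalculation (hypothetical; the visit threshold and blend schedule are placeholders, not the betamcts-rents formulas): recompute the parent's Q from its children, blending the visit-weighted average with the best child's Q more strongly as the node accumulates visits.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct ChildValues {
  float q;      // child Q from the parent's perspective
  uint32_t n;   // child visits
};

// Only recalculate once a node has enough visits (cf. the "<100 nodes" rule).
bool ShouldRecalculate(uint32_t visits) { return visits >= 100; }

// Soft-minimax recalculation: interpolate between the visit-weighted average
// of the children and the best child's Q. The blend factor is a placeholder;
// any schedule that tends to 1 as visits grow yields pure minimax in the limit.
float RecalculateQ(const std::vector<ChildValues>& children, uint32_t visits) {
  if (children.empty()) return 0.0f;
  float sum_q = 0.0f, sum_n = 0.0f, best_q = children[0].q;
  for (const auto& c : children) {
    sum_q += c.q * static_cast<float>(c.n);
    sum_n += static_cast<float>(c.n);
    best_q = std::max(best_q, c.q);
  }
  const float avg_q = sum_n > 0.0f ? sum_q / sum_n : best_q;
  const float blend = static_cast<float>(visits) / (visits + 1000.0f);
  return (1.0f - blend) * avg_q + blend * best_q;
}
```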

@danegraphics
Contributor

danegraphics commented May 10, 2022

There are definitely a large number of different paradigms that would be nice to explore and test. Making the code more modular in this way would definitely make it much easier to do so.

More specifically, being able to experiment more easily with the search, value calculation, and the integration of the two would help massively.

Currently, by the nature of MCTS, the search is somewhat bound to the assumption that any move worth more exploration is also more worth playing, and vice versa, even though this isn't necessarily true, especially in chess. So the ability to experiment more easily with functions aimed at the same goals as quiescence, pruning, minimax, and more, without having to overwrite the code for existing strategies in the process, is definitely something we should aim for.

I very much support this proposal.

AadityaSalgarkar added a commit to AadityaSalgarkar/lc0 that referenced this issue Aug 19, 2023
The idea is to create a class SearchMethod which will make the
process described in LeelaChessZero#1734 modular. The user can just specify
1. Strategy for the selection of the move to play,
2. Strategy for the selection of the move to explore,
3. Strategy for the result propagation,
4. Strategy for the recalculation of node value.

It will allow wider experimentation with search by, for example,
using NN for some of the choices made.
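For illustration, such a SearchMethod might simply bundle one strategy per aspect; a sketch reusing the hypothetical interface names from the first comment, nothing of which exists in the codebase:

```cpp
#include <memory>

// Hypothetical composition type; the four interfaces are forward-declared
// here and match the earlier sketch. The search core would be written against
// these interfaces rather than against any concrete strategy.
class PlayedMoveSelector;
class ExplorationPolicy;
class ResultPropagator;
class NodeRecalculator;

struct SearchMethod {
  std::unique_ptr<PlayedMoveSelector> move_to_play;
  std::unique_ptr<ExplorationPolicy> move_to_explore;
  std::unique_ptr<ResultPropagator> propagator;
  std::unique_ptr<NodeRecalculator> recalculator;
};
```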