Ordering of transfers and the impact of forced_ownership_handoff threshold #1013

Open
martinsumner opened this issue Oct 18, 2023 · 0 comments


martinsumner commented Oct 18, 2023

During an ownership change (i.e. join, leave, replace or location change), once transfers commence, and assuming vnodes are sufficiently busy to prevent the vnode_inactivity_timeout from triggering, all transfers will be prompted by the vnode manager's scheduled tick (default every 10s).

If there are pending transfers, the tick takes the list of transfers from riak_core_ring:pending_changes/1 and applies the forced_ownership_handoff threshold (default 8), so that only the first forced_ownership_handoff items in the list are examined.

The vnode_manager looks for its own node (as a sender) in that sublist, and if there is a valid transfer candidate matching, the transfer will then be attempted (and may or may not succeed, depending on the per-node handoff_concurrency transfer limit).
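The selection described above can be modelled roughly as follows. This is a minimal Python sketch, not riak_core code: the `Change` tuple, the `candidates_on_tick` name and the `concurrency` default of 2 are assumptions made for illustration, and the real handoff_concurrency check is made when the transfer is attempted rather than by slicing a list.

```python
from typing import List, NamedTuple


class Change(NamedTuple):
    index: int       # partition index, i.e. position in the ring
    owner: str       # current owner, which acts as the sender
    next_owner: str  # node that will receive the partition


def candidates_on_tick(pending: List[Change], node: str,
                       threshold: int = 8,
                       concurrency: int = 2) -> List[Change]:
    """Model one scheduled tick on `node` (a simplification)."""
    # Only the first `threshold` pending changes are examined at all.
    window = pending[:threshold]
    # The node only acts on changes for which it is the sender.
    mine = [c for c in window if c.owner == node]
    # The per-node limit (assumed here to be a simple cap) applies last.
    return mine[:concurrency]
```

In this model a node whose first pending change sits at position 9 or later in the list does nothing on the tick, however much spare handoff_concurrency it has.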

There are aspects of this which are potentially confusing:

  • The net effect of forced_ownership_handoff is to act as a cluster-wide limit on the number of concurrent transfers. However, this is not documented, nor is it referred to in the documentation of the handoff_concurrency transfer limit. So an operator following the configuration guidance may see unexpected behaviour, in particular a much reduced level of concurrency on transfers. Note, though, that the expected concurrency would be achieved should the cluster be sufficiently quiet to trigger vnode inactivity (see Refactor riak_core vnode management (part 2) #120 (comment)).
  • The pending transfers from riak_core_ring:pending_changes(Ring) are integer-ordered by Partition (i.e. position in the ring). It is not clear that this is a deliberate decision, nor that the combination of forced_ownership_handoff and a partition-ordered list of ring changes is intended to lead to transitions happening in partition order. There have been cases where this situation has led to nodes gaining a large number of partitions via transfer before they release any (leading to risks of disk-space growth). The use of location awareness makes it more likely that joins to address capacity issues may also require intra-cluster shuffling.
  • When using controls such as riak admin handoff disable inbound, it is possible to block all progress across the ring if the first forced_ownership_handoff pending changes in the list are all blocked.
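The first and third points can be illustrated with a small model. This is a hedged Python sketch under stated assumptions: changes are simplified to `(index, owner, next_owner)` tuples, `startable` and `inbound_disabled` are hypothetical names, and per-node handoff_concurrency is ignored so that only the effect of the threshold is visible.

```python
from typing import FrozenSet, List, Tuple

# Simplified pending change: (partition index, sender, receiver).
Change = Tuple[int, str, str]


def startable(pending: List[Change], threshold: int = 8,
              inbound_disabled: FrozenSet[str] = frozenset()) -> List[Change]:
    """Changes that could start anywhere in the cluster on one tick."""
    # Every node examines the same leading sublist, so the threshold
    # bounds the whole cluster, not each node.
    window = pending[:threshold]
    # A change whose receiver refuses inbound handoff cannot start.
    return [c for c in window if c[2] not in inbound_disabled]


# 12 pending changes from six senders; the first 8 all target "new1".
pending = ([(i, f"n{i % 6}", "new1") for i in range(8)] +
           [(i, f"n{i % 6}", "new2") for i in range(8, 12)])

print(len(startable(pending)))
print(startable(pending, inbound_disabled=frozenset({"new1"})))
```

With the default threshold, at most 8 changes are eligible regardless of how many senders there are, and disabling inbound handoff on new1 leaves nothing eligible even though the four changes to new2 are not blocked themselves.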

As a minimum there is a need to update the riak_core_schema to enable the configuration of forced_ownership_handoff, and also to clarify in the transfer limit configuration stanza that the limit may not be reached without also re-configuring forced_ownership_handoff.

It may also be worth considering whether using partition order in riak_core_ring:pending_changes/1 is correct. Should the order be set deterministically so that nodes receiving transfers are evenly distributed through the list, so that they don't over-grow on an interim basis by first receiving and then offloading? However, there may be some data safety in the existing ring order: keeping changes within the same preflist close together in the order reduces any window in which that preflist may be temporarily in an unsafe state (i.e. if the partition in position M is moving to node X, and partition M + 1 is moving from X, and these two moves happened at the beginning and end of the list of changes, the window of unsafety when target_n_val is not met is unnecessarily extended).

A possible re-ordering of the output of riak_core_ring:pending_changes/1 from within riak_core_vnode_manager:trigger_ownership_handoff/4 might be to place all pending changes for joining nodes at the top of the list, and leave the remainder ordered by ring position. This would minimise the risk of interim states during joins overloading disk space on existing nodes, by first maximising use of fresh capacity.
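A minimal sketch of that re-ordering, in Python rather than Erlang and under assumptions: changes are simplified to `(index, owner, next_owner)` tuples, `reorder_joining_first` is a hypothetical name, the set of joining nodes is assumed to be known to the caller, and changes to joining nodes are themselves kept in ring order (the proposal does not specify this).

```python
from typing import List, Set, Tuple

# Simplified pending change: (partition index, sender, receiver).
Change = Tuple[int, str, str]


def reorder_joining_first(pending: List[Change],
                          joining: Set[str]) -> List[Change]:
    """Put changes whose receiver is a joining node first."""
    # False sorts before True, so changes to joining nodes lead;
    # within each group the order is by ring position.
    return sorted(pending, key=lambda c: (c[2] not in joining, c[0]))


pending = [(0, "a", "b"), (1, "b", "new"), (2, "c", "a"), (3, "a", "new")]
print(reorder_joining_first(pending, {"new"}))
# [(1, 'b', 'new'), (3, 'a', 'new'), (0, 'a', 'b'), (2, 'c', 'a')]
```

A sketch like this does not address the preflist-safety concern raised above, since moves to joining nodes are pulled away from their ring neighbours.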

@martinsumner martinsumner changed the title Ordering of transfers and the impact of force_ownership_handoff threshold Ordering of transfers and the impact of forced_ownership_handoff threshold Oct 18, 2023