running tictac_aae in non-native mode leads to eventual failure of node #1765

Open · jagguli opened this issue Jun 29, 2020 · 15 comments

jagguli commented Jun 29, 2020

crasher:
    initial call: poolboy:init/1
    pid: <0.6187.0>
    registered_name: []
    exception exit: {{{noproc,{gen_server,call,[<0.6185.0>,close]}},{gen_server,call,[<0.5307.0>,close,30000]}},[{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,622}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
    ancestors: [<0.6186.0>,<0.3441.0>,riak_core_vnode_sup,riak_core_sup,<0.201.0>]
    messages: []
    links: [<0.6194.0>,<0.6196.0>,<0.6197.0>,<0.6198.0>,<0.6195.0>,<0.6190.0>,<0.6192.0>,<0.6193.0>,<0.6191.0>,<0.6188.0>,<0.6189.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 376
    stack_size: 27
    reductions: 648
  neighbours:
    neighbour: [{pid,<0.6189.0>},{registered_name,[]},{initial_call,{riak_core_vnode_worker,init,['Argument__1']}},{current_function,{gen_server,loop,6}},{ancestors,[<0.6188.0>,<0.6187.0>,<0.6186.0>,<0.3441.0>,riak_core_vnode_sup,riak_core_
sup,<0.201.0>]},{messages,[]},{links,[<0.6188.0>,<0.6187.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,233},{stack_size,9},{reductions,27}]
    neighbour: [{pid,<0.6191.0>},{registered_name,[]},{initial_call,{riak_core_vnode_worker,init,['Argument__1']}},{current_function,{gen_server,loop,6}},{ancestors,[<0.6188.0>,<0.6187.0>,<0.6186.0>,<0.3441.0>,riak_core_vnode_sup,riak_core_
sup,<0.201.0>]},{messages,[]},{links,[<0.6188.0>,<0.6187.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,233},{stack_size,9},{reductions,27}]
    neighbour: [{pid,<0.6193.0>},{registered_name,[]},{initial_call,{riak_core_vnode_worker,init,['Argument__1']}},{current_function,{gen_server,loop,6}},{ancestors,[<0.6188.0>,<0.6187.0>,<0.6186.0>,<0.3441.0>,riak_core_vnode_sup,riak_core_
sup,<0.201.0>]},{messages,[]},{links,[<0.6188.0>,<0.6187.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,233},{stack_size,9},{reductions,27}]
    neighbour: [{pid,<0.6192.0>},{registered_name,[]},{initial_call,{riak_core_vnode_worker,init,['Argument__1']}},{current_function,{gen_server,loop,6}},{ancestors,[<0.6188.0>,<0.6187.0>,<0.6186.0>,<0.3441.0>,riak_core_vnode_sup,riak_core_
sup,<0.201.0>]},{messages,[]},{links,[<0.6188.0>,<0.6187.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,233},{stack_size,9},{reductions,27}]
    neighbour: [{pid,<0.6190.0>},{registered_name,[]},{initial_call,{riak_core_vnode_worker,init,['Argument__1']}},{current_function,{gen_server,loop,6}},{ancestors,[<0.6188.0>,<0.6187.0>,<0.6186.0>,<0.3441.0>,riak_core_vnode_sup,riak_core_
sup,<0.201.0>]},{messages,[]},{links,[<0.6188.0>,<0.6187.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,233},{stack_size,9},{reductions,27}]
    neighbour: [{pid,<0.6195.0>},{registered_name,[]},{initial_call,{riak_core_vnode_worker,init,['Argument__1']}},{current_function,{gen_server,loop,6}},{ancestors,[<0.6188.0>,<0.6187.0>,<0.6186.0>,<0.3441.0>,riak_core_vnode_sup,riak_core_
sup,<0.201.0>]},{messages,[]},{links,[<0.6188.0>,<0.6187.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,233},{stack_size,9},{reductions,27}]
    neighbour: [{pid,<0.6198.0>},{registered_name,[]},{initial_call,{riak_core_vnode_worker,init,['Argument__1']}},{current_function,{gen_server,loop,6}},{ancestors,[<0.6188.0>,<0.6187.0>,<0.6186.0>,<0.3441.0>,riak_core_vnode_sup,riak_core_
sup,<0.201.0>]},{messages,[]},{links,[<0.6188.0>,<0.6187.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,233},{stack_size,9},{reductions,27}]
    neighbour: [{pid,<0.6197.0>},{registered_name,[]},{initial_call,{riak_core_vnode_worker,init,['Argument__1']}},{current_function,{gen_server,loop,6}},{ancestors,[<0.6188.0>,<0.6187.0>,<0.6186.0>,<0.3441.0>,riak_core_vnode_sup,riak_core_
sup,<0.201.0>]},{messages,[]},{links,[<0.6188.0>,<0.6187.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,233},{stack_size,9},{reductions,27}]
    neighbour: [{pid,<0.6196.0>},{registered_name,[]},{initial_call,{riak_core_vnode_worker,init,['Argument__1']}},{current_function,{gen_server,loop,6}},{ancestors,[<0.6188.0>,<0.6187.0>,<0.6186.0>,<0.3441.0>,riak_core_vnode_sup,riak_core_
sup,<0.201.0>]},{messages,[]},{links,[<0.6188.0>,<0.6187.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,233},{stack_size,9},{reductions,27}]
    neighbour: [{pid,<0.6194.0>},{registered_name,[]},{initial_call,{riak_core_vnode_worker,init,['Argument__1']}},{current_function,{gen_server,loop,6}},{ancestors,[<0.6188.0>,<0.6187.0>,<0.6186.0>,<0.3441.0>,riak_core_vnode_sup,riak_core_
sup,<0.201.0>]},{messages,[]},{links,[<0.6188.0>,<0.6187.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,233},{stack_size,9},{reductions,27}]
2020-06-28 22:09:18 =SUPERVISOR REPORT====
     Supervisor: {<0.6188.0>,poolboy_sup}
     Context:    shutdown_error
     Reason:     {{noproc,{gen_server,call,[<0.6185.0>,close]}},{gen_server,call,[<0.5307.0>,close,30000]}}
     Offender:   [{nb_children,10},{name,riak_core_vnode_worker},{mfargs,{riak_core_vnode_worker,start_link,[[{worker_module,riak_core_vnode_worker},{worker_args,[137015778499772148581595453067151533092743675904,[],worker_props,<0.6186.0>]}
,{worker_callback_mod,riak_kv_worker},{size,10},{max_overflow,0}]]}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]

masleeds (Slack, 3:21 AM):

I mean tictac_aae in parallel mode - so leveldb or bitcask as the storage backend, but leveled as the AAE backend. The other user had problems when they had a cluster with busy vnodes and they switched tictac_aae on - and the rebuild/repair cycle that got triggered as tictac_aae was trying to get up to speed went into a downward spiral.

https://postriak.slack.com/archives/C6R0LPH4N/p1593349532100600

@martinsumner (Contributor)

I'm going to have a look at this from tomorrow. Most of the volume testing I did was in native mode, not parallel, so the first aim is to look at what happens if I build up vnodes of >> 100GB of data with an eleveldb backend and then switch on tictacaae.

I'm hoping to keep building up the volume until I hit issues with aae store rebuilds timing out, aae cache rebuilds timing out, and maybe fetch_clocks queries.


martinsumner commented Jul 6, 2020

Tested this with 200GB per vnode (running a leveldb backend, but with a small ring size), which had been built with AAE enabled but not tictac_aae. The intention was to test what happens when we then restart all the nodes at the same time with tictac_aae enabled (without the standard transition precaution of increasing the exchange tick).

No crashes were observed in the aftermath of this event; however, there were warning signs of potential trouble ahead.

When the restart occurs, each vnode/aae_controller spots a state mismatch between the vnode (non-empty) and the parallel aae store (empty). This prompts a rebuild of the aae keystore to resolve the mismatch. This is as expected.

There is minimal spacing of these rebuilds, so all nodes become busy concurrently doing all their rebuilds. The busyness is primarily write activity: in this environment there was no noticeable impact on read_await, but a significant impact on write_await. It is assumed this is because the vnode backend is by default greedy at allocating itself memory, so most of the reads as the rebuild folds over the leveldb store are served from memory (or from the page cache reading ahead). The high volume of writes is an unavoidable consequence of the write amplification necessary to build an LSM store.

In this environment, the rebuilds of the aae store took about 1 hour. However, there were only 4 vnodes per node, so in a normal environment this could have taken much longer - as disk utilisation was at 80% with 4 concurrent rebuilds.

Once the aae store rebuilds are complete, an aae treecache rebuild is prompted for each vnode. This requires a fold over the new aae_keystore in order to build the aae treecache. These treecache rebuilds are prompted almost concurrently, but the work is scheduled using the vnode's "Best Effort" queue - and by default only one item of work from this queue is allowed to run concurrently per node.
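
For illustration only, a minimal sketch of this style of single-concurrency queue (this is not the riak_core implementation; names are invented): work items wait in a queue, and only one runs at a time, so each queued treecache rebuild waits behind the one in progress.

    %% Illustrative only: a minimal single-concurrency work queue, loosely
    %% modelling the behaviour described above (not the riak_core code).
    -module(be_queue_sketch).
    -behaviour(gen_server).

    -export([start_link/0, submit/1]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% Submit a 0-arity fun representing an item of work (e.g. a treecache rebuild).
    submit(WorkFun) when is_function(WorkFun, 0) ->
        gen_server:cast(?MODULE, {submit, WorkFun}).

    init([]) ->
        {ok, #{queue => queue:new(), busy => false}}.

    handle_cast({submit, WorkFun}, State = #{queue := Q, busy := Busy}) ->
        case Busy of
            false ->
                run(WorkFun),
                {noreply, State#{busy := true}};
            true ->
                %% Only one item runs at a time; everything else waits in the queue.
                {noreply, State#{queue := queue:in(WorkFun, Q)}}
        end.

    handle_info({'DOWN', _Ref, process, _Pid, _Reason}, State = #{queue := Q}) ->
        %% The previous work item finished (or crashed) - pick up the next one.
        case queue:out(Q) of
            {{value, Next}, Q1} ->
                run(Next),
                {noreply, State#{queue := Q1, busy := true}};
            {empty, Q1} ->
                {noreply, State#{queue := Q1, busy := false}}
        end.

    handle_call(_Msg, _From, State) ->
        {reply, ok, State}.

    run(WorkFun) ->
        %% Monitor rather than link, so a crashing work item does not crash the queue.
        {_Pid, _Ref} = spawn_monitor(fun() -> WorkFun() end).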

At this stage, exchanges are occurring, and completing with no deltas - as all treecaches are empty (as none have been rebuilt).

After 20-30 minutes, the first treecaches are built, and the next treecache build for the node is picked up off the queue. At this stage, some aae exchanges begin to see false deltas when comparing trees - where they are comparing a completed tree with a still-empty tree.

These false deltas require aae_folds over the whole store (albeit an efficient fold which skips over slots based on the aae segment index). This adds more read activity to the keystore, and seems to further slow the remaining treecache rebuilds. Some of the aae_exchanges begin to time out, after waiting 10 minutes for the fold_clocks query to complete.

As the treecaches get built across the cluster, the number of false delta hunts first increases, then decreases - as does the number of exchange timeouts. Eventually (after about 4 hours), the treecaches are all built, and false delta hunts stop altogether.

The cluster at this stage is behaving as expected.


martinsumner commented Jul 6, 2020

There are some signs of trouble:

  • As observed in the production environment, not all rebuilds in this test environment had observable stop/start logs. Logging (or perhaps log indexing) was strangely intermittent for these rebuilds, although it is clear from other logs that all required rebuilds did occur.

  • The kv_index_tictactree requires snapshots of the aae keystore for treecache rebuilds to be taken when the folder function is created, not when the fold is initiated. So there is a real potential for folds at the back of the BE queue to time out, and therefore for rebuilds to crash, as snapshots are only guaranteed a lock for 12 hours.

  • The riak_kv_vnode tictacaae_exchangepoke is prompted every tick, regardless of how long exchanges are actually taking to complete. So when exchanges are leading to false delta hunts, and those hunts are taking a long time (or even timing out), the exchange poke could keep launching parallel exchanges which lead to parallel delta hunts ... and these would further slow the delta hunts, and also slow the treecache rebuilds, extending the time over which false delta hunts can occur. There is potential for a tictacaae store to be overwhelmed with load and eventually crash, crashing the vnode.

All of this could be mitigated with configuration for this transition scenario: a temporary increase in the exchange tick time, and also an increase in the size of the BE queue.
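
For reference, a sketch of what that transition configuration might look like as riak_kv application variables in an advanced.config. tictacaae_exchangetick is named in this thread; the value shown, and the assumption that it is expressed in milliseconds, are illustrative and should be checked against the schema for the release in use. I have deliberately not named a setting for the BE queue size, as I am not certain of it.

    %% Illustrative advanced.config fragment - values are examples only,
    %% and the millisecond unit is an assumption to verify for your release.
    [
     {riak_kv, [
       %% Temporarily slow the exchange cycle during the transition
       %% (the default is roughly 2m30s per the discussion above).
       {tictacaae_exchangetick, 600000}   %% 10 minutes, in milliseconds (assumed unit)
     ]}
    ].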

However, the problem of overlapping exchanges timing out could occur outside of this scenario in a high-entropy cluster.

@martinsumner (Contributor)

There are obvious improvements to make:

  • Make the tictacaae_exchangetick responsive to the time it is taking to complete exchanges (so we don't rely on the operator to configure for the transition, and this scenario is also handled outside of transition). A rough sketch of this kind of adjustment follows this list.

  • Increase the scan timeout within the aae_exchange, potentially making this configurable, or even responsive. The current default is ten minutes.
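
As a rough illustration of the first item above (this is not the code on any branch; the clamping bounds and back-off factor are invented for the sketch), the next tick could be scaled from the duration of the most recent exchange:

    %% Illustrative only: scale the next exchange tick from the observed
    %% duration of the last exchange, clamped to sensible bounds.
    -module(adaptive_tick_sketch).
    -export([next_tick/2]).

    -define(MIN_TICK_MS, 150 * 1000).        %% the configured tick (~2m30s default)
    -define(MAX_TICK_MS, 30 * 60 * 1000).    %% hypothetical upper bound

    %% LastExchangeMS is how long the previous exchange took to complete.
    %% BackoffFactor might be e.g. 4 - keep the tick well above the exchange time.
    next_tick(LastExchangeMS, BackoffFactor) ->
        Candidate = LastExchangeMS * BackoffFactor,
        max(?MIN_TICK_MS, min(?MAX_TICK_MS, Candidate)).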

@martinsumner (Contributor)

The problem of rebuild scheduling is hard. Not queueing rebuilds may increase pressure, and cause other issues. Scheduling them further apart would increase the window for false positives.

Perhaps false positives caused by initial rebuilds could be handled by comparing rebuild state at startup.

There are potential solutions here, but none without notable additional complexity or alternative risks.

@martinsumner (Contributor)

I don't know what is happening wrt logging.

@martinsumner (Contributor)

Some thoughts on an algorithm through which the system could back off on exchanges, should exchanges start to take a long time (a rough sketch of the monitoring side follows the list):

  • There is a conscious decision to start, not start_link, an exchange from a vnode. Exchanges may crash, and the vnode can ignore this. So we can't keep a strict "no new exchanges until old ones complete" policy without changing this, as some exchanges may never return.

  • The default exchange tick is 2 minutes 30s, and the exchange timeout is approximately 10 minutes (although an exchange doesn't cancel its work on timeout). We need to consider what happens when the ReturnFun crashes, meaning that cache_replacedirtysegments/3 is never called: https://github.com/martinsumner/kv_index_tictactree/blob/master/src/aae_controller.erl#L666-L674.

  • There may be limits (e.g. very high object counts per vnode) at which tictacaae in parallel mode is just not appropriate. In that case, warnings need to be emitted and aae_exchanges need to end gracefully rather than crash. If aae exchanges slow to the extent that aae is barely running, is this more acceptable than vnode crashes?

  • In the bad scenarios we're seeing, there are both fast exchanges and slow ones. So if the exchange tick is bumped, but then rapidly re-reduced on fast exchanges, the outcome may not be as protective as desired.
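
As a loose sketch of the monitoring piece referred to above (not riak_kv code; module and function names are invented), an exchange can be started with a monitor rather than a link, so the vnode can measure how long each exchange takes and spot ones that never return, without being taken down by a crashing exchange:

    %% Illustrative only - start an exchange monitored (not linked), record its
    %% duration, and treat exchanges that outlive a deadline as timed out.
    -module(exchange_watch_sketch).
    -export([start_exchange/2, handle_down/3]).

    %% ExchangeFun is a 0-arity fun performing the exchange.
    %% TimeoutMS is how long we are prepared to wait before regarding it as lost.
    start_exchange(ExchangeFun, TimeoutMS) ->
        StartTS = erlang:monotonic_time(millisecond),
        {Pid, MonRef} = spawn_monitor(fun() -> ExchangeFun() end),
        %% A crash or a normal exit both arrive as a 'DOWN' message - no link,
        %% so the caller is never taken down with the exchange.
        TimerRef = erlang:send_after(TimeoutMS, self(), {exchange_timeout, MonRef}),
        {Pid, MonRef, TimerRef, StartTS}.

    %% To be called from the owner's handle_info on the 'DOWN' message.
    handle_down(MonRef, TimerRef, StartTS) ->
        erlang:cancel_timer(TimerRef),
        ElapsedMS = erlang:monotonic_time(millisecond) - StartTS,
        %% ElapsedMS can now feed an adaptive tick, as in the earlier sketch.
        {exchange_complete, MonRef, ElapsedMS}.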

@martinsumner (Contributor)

Never calling cache_replacedirtysegments/3 is fine. Eventually the dirty segments will be removed from the list by alterations, or scrubbed by a new active fold.

@martinsumner (Contributor)

Proposal is:

This should resolve the problem of cascading load on slow exchanges.

The issue of queueing treecache rebuilds and snapshot timeouts still needs to be resolved.

@martinsumner (Contributor)

https://github.com/basho/riak_kv/tree/mas-i1765-parallel

Presently this only addresses the issue of needing to back off as the time to complete exchanges increases. This works well so far: as the exchange time goes up, at the peak where 50% of vnodes have rebuilt but 50% have not, the frequency of exchanges is reduced by just 30%. With this reduction, the maximum time to complete an exchange drops by more than 50%, with no exchange getting within 3 minutes of the old timeout threshold.

@martinsumner (Contributor)

The branch has now been extended so that the taking of the rebuild snapshot is deferred until the rebuild work item is consumed from the queue - so as long as the rebuild of the trees completes within 12 hours, there will be no issues with the snapshot timing out during the rebuild.

The scenario above was then re-tested with equivalent (although slightly improved) results.
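
For illustration only (not the branch code; take_snapshot/0 and rebuild_trees/1 are hypothetical stand-ins), the difference is roughly between taking the snapshot when the work is queued and taking it when the work item is executed:

    %% Illustrative only - queue a fun that takes the snapshot when the work
    %% is consumed, rather than taking the snapshot at enqueue time.
    -module(deferred_snapshot_sketch).
    -export([queue_rebuild_eager/1, queue_rebuild_deferred/1]).

    %% Hypothetical stand-ins for the real snapshot/rebuild operations.
    take_snapshot() -> snapshot_ref.
    rebuild_trees(_Snapshot) -> ok.

    queue_rebuild_eager(Queue) ->
        %% Snapshot (and its 12-hour lock) starts at enqueue time - it may
        %% expire while the work item waits at the back of the queue.
        Snapshot = take_snapshot(),
        queue:in(fun() -> rebuild_trees(Snapshot) end, Queue).

    queue_rebuild_deferred(Queue) ->
        %% Snapshot is taken only when the work item is consumed from the
        %% queue, so the lock window covers just the rebuild itself.
        queue:in(fun() ->
                         Snapshot = take_snapshot(),
                         rebuild_trees(Snapshot)
                 end, Queue).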


jagguli commented Jul 9, 2020

Hi Martin, thanks for detailing the issue.

Would allowing additional capacity to accommodate the initial treecache build alleviate this issue?

Could you also please share the specs of the nodes used for testing?

Yeah, logging in general for riak could use some work; I'll create an issue if one doesn't exist already.

@martinsumner (Contributor)

The test rig is 8 servers, each with 8 large HDDs (RAID 10) and 12 CPU cores. There is an FBWC on the RAID controller for improved write latency.

Ran a more challenging version of the test, with 1.4TB of data per node and a load of about 4K ops per second (90% writes) during the test. This test also included a server failure mid-test (not deliberate, but a kernel panic). Everything went OK with the new code until I tried recovering the failed server. Then the combination of hinted handoffs and the backlog of rebuilds on the server recreated the memory-leak problem.

This exposed an issue with the aae_runner in kv_index_tictactree having an unbounded queue. The problem of slow fetch_clocks queries is related more to queue time at the aae_runner than to query time.

A further fix is needed to bound the queue on the aae_runner, so that snapshots of the store are not taken only for the snapshot-related work to sit at the end of a long queue.
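
A minimal sketch of that kind of bounding (not the kv_index_tictactree implementation; the limit and the message shape are invented): check the backlog before accepting the work, and reject it before any snapshot is taken:

    %% Illustrative only - refuse new query work (before any snapshot is taken)
    %% once the runner's backlog exceeds a limit.
    -module(bounded_runner_sketch).
    -export([maybe_accept/2]).

    -define(MAX_QUEUE_LEN, 16).   %% hypothetical limit

    %% RunnerPid is the worker process holding the backlog; WorkFun is the query.
    maybe_accept(RunnerPid, WorkFun) ->
        case erlang:process_info(RunnerPid, message_queue_len) of
            {message_queue_len, Len} when Len < ?MAX_QUEUE_LEN ->
                RunnerPid ! {work, WorkFun},   %% the snapshot would be taken by the runner
                accepted;
            _ ->
                %% Too much queued - do not take a snapshot or add to the backlog.
                {rejected, queue_full}
        end.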

@martinsumner (Contributor)

It should be noted that enabling tictacaae for the first time, on a large store and under heavy write load, will consume significant amounts of memory.

Each aae_keystore keeps a "change_queue" of keys/metadata for updates received since the rebuild of the store started. Since all rebuilds run concurrently when tictacaae is enabled for the first time, this can require a lot of memory. In the above test, an additional 4GB per node was required by the end of the 3 to 4 hour period of keystore rebuilds.
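
As a rough, purely illustrative estimate (the numbers here are hypothetical, not measured in the test): if a node sustains around 500 writes per second for the 4 hours of concurrent rebuilds, and each change_queue entry holds around 500 bytes of key and metadata, that is 500 x 14,400 ≈ 7.2M entries, or roughly 3.6GB per node - the same order of magnitude as the ~4GB observed above.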

To reduce the RAM required, enabling tictacaae must happen away from periods of heavy write load.
