Hinted handoffs do not happen when using multiple bitcask backends #1789

Open

zigge-zagge opened this issue May 12, 2021 · 2 comments

@zigge-zagge

(As posted on the Riak Slack with some modifications.)

I think I have found an interesting issue where hinted handoffs do not happen if you have several bitcask backends. (Assuming default configuration for the relevant settings below.)

Hinted handoffs are triggered when the riak_core_vnode (a gen_fsm implementation) / riak_kv_vnode (the callback module we're interested in here) has been inactive for a period of 60 to 120 seconds [1]. It then sends an inactive event to the riak_core_vnode_manager, which triggers the handoffs [2].

Thus, if the riak_core_vnode/riak_kv_vnode receives an event at least once per minute, the hinted handoffs are never triggered.
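
To make the reset behaviour concrete, here is a tiny, self-contained illustration (not Riak code; riak_core_vnode is a gen_fsm, but gen_statem's event timeout has the same property that any incoming event cancels it, and the module name and interval here are made up for the demo):

-module(inactivity_demo).
-behaviour(gen_statem).

-export([start_link/0, ping/0]).
-export([init/1, callback_mode/0, active/3]).

-define(INACTIVITY_TIMEOUT, 5000).  %% 5 s for the demo; Riak uses 60-120 s

start_link() ->
    gen_statem:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Any call to ping/0 stands in for e.g. a merge_check backend callback.
ping() ->
    gen_statem:cast(?MODULE, ping).

callback_mode() -> state_functions.

init([]) ->
    %% An event timeout only fires if *no* event at all arrives within the
    %% given time; any event cancels it.
    {ok, active, #{}, [{timeout, ?INACTIVITY_TIMEOUT, inactive}]}.

active(timeout, inactive, Data) ->
    io:format("no events for a while -> vnode would be reported inactive~n"),
    {keep_state, Data, [{timeout, ?INACTIVITY_TIMEOUT, inactive}]};
active(cast, ping, Data) ->
    %% The pending timeout was cancelled by this event; re-arm it, much like
    %% riak_core_vnode does on every request it handles.
    {keep_state, Data, [{timeout, ?INACTIVITY_TIMEOUT, inactive}]}.

Calling inactivity_demo:ping() at least once every 5 seconds keeps the timeout from ever firing, which is exactly what the merge_check callbacks do to the real vnodes at the 60-120 second scale.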

At first I thought this was related to AAE activity (we're not using Tictac AAE, though) or to our Telegraf plugin that extracts Riak stats every 10 seconds (after reading this issue: basho/riak_core#715). However, when I inspected a process that I expected to do a hinted handoff with sys:log/2, I found something interesting.

About every ten seconds it received a riak_vnode_req_v1 message with {backend_callback, reference(), merge_check} as the request [3]. That merge_check event is scheduled by riak_kv_bitcask_backend with an interval of 180 seconds +/- 30% [4], and we currently have 15 bitcask backends (different backends for different TTLs etc.). That means an event like this is handled every 12 seconds on average (see the quick calculation below), so the riak_core_vnode/riak_kv_vnode never reaches the inactivity timeout needed to trigger the handoffs.
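
A quick back-of-the-envelope check of those numbers in an Erlang shell (using the default bitcask_merge_check_interval of 180 000 ms from [4] and our 15 backends):

1> MeanGap = 180000 / 15.   % mean ms between merge_check events hitting the vnode
12000.0
2> 60000 / MeanGap.         % expected merge_check events within even the shortest
5.0                         % possible inactivity window (60 s)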

This issue has been brought up before by @bipthelin, but we did not know the reason back then (https://postriak.slack.com/archives/C6R0LPH4N/p1570609054000900). @martinsumner linked this issue (#1706) as a good explanation of how handoffs are triggered, but that issue itself was related to Tictac AAE (https://postriak.slack.com/archives/C6R0LPH4N/p1570618826012100).

We're currently running Riak 2.9.3, but my findings seem to be relevant for the latest version as well.

Is there something I'm missing here? As I see it, this is perhaps not technically a bug, but it is still an issue for what I guess is not an uncommon setup.

An easy workaround for us is to set vnode_inactivity_timeout to 5000 (giving a 5 to 10 second inactivity timeout, see the sketch below), but that sidesteps the idea of only doing handoffs when the vnode is not busy. Perhaps one should use the background manager (riak_core_bg_manager) for handoffs to achieve that kind of safety, but I have not looked further into that. In any case, the background manager is not used for handoffs by default.
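
For reference, a minimal advanced.config sketch of that workaround, assuming only the riak_core application env that riak_core_vnode reads in [1] (the value is our choice, not a recommendation):

%% advanced.config (sketch)
[
 {riak_core, [
   %% read by riak_core_vnode; the effective timeout is randomised to
   %% between this value and twice this value (here 5-10 s)
   {vnode_inactivity_timeout, 5000}
 ]}
].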


I received a reply from @martinsumner containing the following:

I suspect your analysis is correct though, I don't think the consequences of having this many backends on things like the inactivity timeout have been considered. I would be worried about reducing the inactivity timeout; I think it would be better to change the config of the bitcask_merge_check_interval. This should be changeable now in riak.conf (although it is currently hidden, so that isn't obvious - https://github.com/basho/riak_kv/blob/develop-3.0/priv/riak_kv.schema#L749-L762)
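
For completeness, a sketch of that alternative via advanced.config, assuming the riak_kv application env names that appear in the snippet in [4] (the 30-minute value is only an example, not a recommendation from this thread):

%% advanced.config (sketch)
[
 {riak_kv, [
   %% with many bitcask backends, space the merge checks out so that the
   %% combined stream of merge_check callbacks stays well below one event
   %% per vnode_inactivity_timeout
   {bitcask_merge_check_interval, 1800000},   %% 30 minutes, in ms
   {bitcask_merge_check_jitter, 0.3}
 ]}
].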

He also asked about the reasons for having so many backends and whether it's just for the TTLs. Here's my reply:

This is an old setup that stretches back a few years and I can't justify all the reasons, but I can see that we have been using separate backends for specific buckets for full flexibility. Some example TTLs are 1 h, 3 h, 24 h and 4 w; we don't even use all of them today. On top of those we have added generic ones like expiry_1h and expiry_24h. Our default backend is LevelDB (we only have one of those).

If there were an easy way to have either bucket- or object-specific TTLs with a single backend, there would be nothing stopping us from using that except some time and effort to migrate. I can understand the complexity of testing these multi-backend combinations, so what you're writing there sounds like good news.

References

[1] riak_core_vnode inactivity timeout

riak_core_vnode will send an inactive event to the riak_core_vnode_manager if it has not received any event for a period of 60 to 120 seconds.

Snippets from riak_core_vnode.erl:

-define(DEFAULT_TIMEOUT, 60000).

%% The effective timeout is randomised to between Timeout and 2 * Timeout,
%% i.e. 60-120 s with the default.
Timeout = app_helper:get_env(riak_core, vnode_inactivity_timeout, ?DEFAULT_TIMEOUT),
Timeout2 = Timeout + rand:uniform(Timeout),
State#state{... inactivity_timeout=Timeout2 ...}

%% Handled events return through continue/1, re-arming the timeout.
continue(State) ->
    {next_state, active, State, State#state.inactivity_timeout}.

%% Only if no event arrives within the timeout is the vnode manager told
%% that the vnode is inactive.
active(timeout, State=#state{mod=Mod, index=Idx}) ->
    riak_core_vnode_manager:vnode_event(Mod, Idx, self(), inactive),
    continue(State);

[2] riak_core_vnode_manager

Snippets from riak_core_vnode_manager.erl:

handle_vnode_event(inactive, Mod, Idx, Pid, State) ->
    maybe_trigger_handoff(Mod, Idx, Pid, State),
    {noreply, State};
    
maybe_trigger_handoff(Mod, Idx, Pid, _State=#state{handoff=HO}) ->
    case dict:find({Mod, Idx}, HO) of
        {ok, {resize, _Direction, '$resize'}} ->
            ...
        {ok, {delete, local, '$delete'}} ->
            ...
        {ok, {_Type, _Direction, TargetNode}} ->
            riak_core_vnode:trigger_handoff(Pid, TargetNode),
            ok;
        error ->
            ok
    end.

[3] Using sys:log/2 to see messages received by a vnode process

> riak_core_vnode_manager:all_vnodes().
[{riak_kv_vnode,867766597165223607683437869425293042920709947392,
                <0.1059.0>},
 {riak_kv_vnode,1301649895747835411525156804137939564381064921088,
                <0.1060.0>},
 {riak_kv_vnode,1278813932664540053428224228626747642198940975104,
                <0.6896.3>},
 {riak_pipe_vnode,867766597165223607683437869425293042920709947392,
                  <0.3588.0>},
 {riak_pipe_vnode,1301649895747835411525156804137939564381064921088,
                  <0.3597.0>}]

I expected the riak_kv_vnode with index 1278813932664540053428224228626747642198940975104 to do its handoff and stop.

> Pid = pid(0, 6896, 3).
<0.6896.3>
> sys:log(Pid, true).
ok

% After 30 seconds or so

> sys:log(Pid, get).
{ok,[

     {{in,{timeout,#Ref<0.0.1345.165403>,
                   {'$gen_event',{riak_vnode_req_v1,undefined,ignore,
                                                    {backend_callback,#Ref<0.0.2.217209>,merge_check}}}}},
      {<0.6896.3>,active},
      #Fun<gen_fsm.0.131903270>},
     {return,{<0.6896.3>,active},#Fun<gen_fsm.2.131903270>},
     
     
     {{in,{timeout,#Ref<0.0.1363.129046>,
                   {'$gen_event',{riak_vnode_req_v1,undefined,ignore,
                                                    {backend_callback,#Ref<0.0.2.217575>,merge_check}}}}},
      {<0.6896.3>,active},
      #Fun<gen_fsm.0.131903270>},
     {return,{<0.6896.3>,active},#Fun<gen_fsm.2.131903270>},
     

     {{in,{timeout,#Ref<0.0.1370.78858>,
                   {'$gen_event',{riak_vnode_req_v1,undefined,ignore,
                                                    {backend_callback,#Ref<0.0.2.215724>,merge_check}}}}},
      {<0.6896.3>,active},
      #Fun<gen_fsm.0.131903270>},
     {return,{<0.6896.3>,active},#Fun<gen_fsm.2.131903270>}

    ]}

[4] Scheduling of the merge_check event

Snippets from riak_kv_bitcask_backend.erl:

-define(MERGE_CHECK_INTERVAL, timer:minutes(3)).
-define(MERGE_CHECK_JITTER, 0.3).

schedule_merge(Ref) when is_reference(Ref) ->
    Interval = app_helper:get_env(riak_kv, bitcask_merge_check_interval,
                                  ?MERGE_CHECK_INTERVAL),
    JitterPerc = app_helper:get_env(riak_kv, bitcask_merge_check_jitter,
                                    ?MERGE_CHECK_JITTER),
    Jitter = Interval * JitterPerc,
    %% FinalInterval = Interval +/- Jitter; with the defaults above that is
    %% 180 s +/- 54 s, i.e. roughly 126-234 s per backend
    FinalInterval = Interval + trunc(2 * rand:uniform() * Jitter - Jitter),
    lager:debug("Scheduling Bitcask merge check in ~pms", [FinalInterval]),
    riak_kv_backend:callback_after(FinalInterval, Ref, merge_check).

Snippet from riak_kv_backend.erl:

%% Queue a callback for the backend after Time ms.
-spec callback_after(integer(), reference(), term()) -> reference().
callback_after(Time, Ref, Msg) when is_integer(Time), is_reference(Ref) ->
    riak_core_vnode:send_command_after(Time, {backend_callback, Ref, Msg}).

@martinsumner
Contributor

I think everything stated here is correct:

  1. Handoff is dependent on inactivity of a vnode
  2. It is assumed that vnodes that are ready to handoff will naturally go inactive at some point (as requests have migrated away from it)
  3. However, there may be regular scheduled messages to a vnode - e.g. polling for tictac AAE, backend callbacks - and these disturb and reset the timeout, potentially meaning the timeout will never occur even when application traffic is diverted from the vnode
  4. vnodes can then be stuck awaiting handoff

As there are no limits on the number of backends, or on the operator reconfiguring timeouts, Riak has no awareness that it has been configured and set up in such a state that handoffs will not occur.

An interim fix for this particular case would be to reconfigure the bitcask_merge_check_interval. To make this more obvious, perhaps in the next release this should be made a visible setting (it is currently hidden by default), with an additional comment warning about the impact on the vnode_inactivity_timeout.

Is a more robust fix possible though?

@martinsumner
Contributor

The only other thought is that perhaps there should be an infrequent poll which always prompts a maybe_trigger_handoff request. For most clusters without timeout issues, a 60 s inactivity timeout will be sufficient to trigger handoff, whereas if the timeout is never being reached, the infrequent poll will eventually trigger it.

But this opens up the possibility of triggering handoff earlier in some cases, before the vnode is actually inactive. Could this lead to strange bugs?
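
A minimal sketch of the infrequent-poll idea, purely illustrative and not part of riak_core: a small gen_server that periodically calls riak_core_vnode_manager:force_handoffs() (the helper usually run from the console to kick off pending transfers) as a stand-in for the per-vnode maybe_trigger_handoff poll described above. The module name and interval are made up for the example:

%% handoff_poller.erl - hypothetical sketch, not part of riak_core.
%% Periodically nudges the vnode manager to re-check pending handoffs,
%% independently of vnode inactivity.
-module(handoff_poller).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(POLL_INTERVAL, timer:minutes(10)).  %% made-up value for the sketch

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    erlang:send_after(?POLL_INTERVAL, self(), poll),
    {ok, #{}}.

handle_info(poll, State) ->
    %% Ask the vnode manager to re-evaluate handoff for all vnodes, even if
    %% none of them has hit its inactivity timeout.
    _ = riak_core_vnode_manager:force_handoffs(),
    erlang:send_after(?POLL_INTERVAL, self(), poll),
    {noreply, State};
handle_info(_Msg, State) ->
    {noreply, State}.

handle_call(_Req, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

This of course triggers handoff regardless of vnode activity, which is exactly the trade-off raised above.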
