Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

riak_core_handoff_sender {badrecord,ho_acc} #1010

Open
Konstantin74R opened this issue Sep 13, 2023 · 4 comments
Open

riak_core_handoff_sender {badrecord,ho_acc} #1010

Konstantin74R opened this issue Sep 13, 2023 · 4 comments

Comments

@Konstantin74R
Copy link

Konstantin74R commented Sep 13, 2023

Hi, please help. I can't find reson fo broken handoff between old node riak@172.16.3.92 to new node riak@172.16.3.63.

In log I see error "failed because of error:{badrecord,ho_acc}" and which node has the problem is unclear.

versions
riak_kv 2.1.1
n_val = 3,
nodes is 40

console.log
2023-09-13 10:48:55.880 [info] <0.25322.2>@riak_core_handoff_sender:start_fold:153 Starting ownership transfer of riak_kv_vnode from 'riak@172.16.3.92' 52808164630 1205155991565808696313200461616250880 to 'riak@172.16.3.63' 528081646301205155991565808696313200461616250880 2023-09-13 10:57:13.707 [info] <0.1490.1>@riak_kv_index_hashtree:clear_tree:842 Clearing AAE tree: 625134489405210427903529254618878869735643021312 2023-09-13 10:57:14.659 [info] <0.30711.2>@riak_kv_index_hashtree:build_or_rehash:880 Starting AAE tree build: 625134489405210427903529254618878869735643021312 2023-09-13 11:48:20.321 [error] <0.4371.0> gen_server <0.4371.0> terminated with reason: no function clause matching riak_core_handoff_sender:visit_item({<<48,98,5 8,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,...>>}, <<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89, 118,233,214,0,0,43,103,104,...>>, {error,eio}) line 296 2023-09-13 11:48:20.321 [error] <0.4371.0> CRASH REPORT Process <0.4371.0> with 0 neighbours exited with reason: no function clause matching riak_core_handoff_send er:visit_item({<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,...>>}, <<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0, 0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,...>>, {error,eio}) line 296 in gen_server:terminate/6 line 744 2023-09-13 11:48:20.322 [error] <0.4368.0> Supervisor {<0.4368.0>,poolboy_sup} had child riak_core_vnode_worker started with {riak_core_vnode_worker,start_link,undefined} at <0.4371.0> exit with reason no function clause matching riak_core_handoff_sender:visit_item({<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,...>>}, <<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,...>>, {error,eio}) line 296 in context child_terminated 2023-09-13 11:48:20.323 [error] <0.25322.2>@riak_core_handoff_sender:start_fold:282 ownership transfer of riak_kv_vnode from 'riak@172.16.3.92' 528081646301205155991565808696313200461616250880 to 'riak@172.16.3.63' 528081646301205155991565808696313200461616250880 failed because of error:{badrecord,ho_acc} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,204}]}]
error.log

2023-09-13 10:48:48.097 [error] <0.6963.0>@riak_core_handoff_sender:start_fold:282 ownership transfer of riak_kv_vnode from 'riak@172.16.3.92' 528081646301205155991565808696313200461616250880 to 'riak@172.16.3.63' 528081646301205155991565808696313200461616250880 failed because of error:{badrecord,ho_acc} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,204}]}] 2023-09-13 11:48:20.321 [error] <0.4371.0> gen_server <0.4371.0> terminated with reason: no function clause matching riak_core_handoff_sender:visit_item({<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,...>>}, <<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,...>>, {error,eio}) line 296 2023-09-13 11:48:20.321 [error] <0.4371.0> CRASH REPORT Process <0.4371.0> with 0 neighbours exited with reason: no function clause matching riak_core_handoff_sender:visit_item({<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,...>>}, <<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,...>>, {error,eio}) line 296 in gen_server:terminate/6 line 744 2023-09-13 11:48:20.322 [error] <0.4368.0> Supervisor {<0.4368.0>,poolboy_sup} had child riak_core_vnode_worker started with {riak_core_vnode_worker,start_link,undefined} at <0.4371.0> exit with reason no function clause matching riak_core_handoff_sender:visit_item({<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,...>>}, <<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,...>>, {error,eio}) line 296 in context child_terminated 2023-09-13 11:48:20.323 [error] <0.25322.2>@riak_core_handoff_sender:start_fold:282 ownership transfer of riak_kv_vnode from 'riak@172.16.3.92' 528081646301205155991565808696313200461616250880 to 'riak@172.16.3.63' 528081646301205155991565808696313200461616250880 failed because of error:{badrecord,ho_acc} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,204}]}] 2023-09-13 12:16:23.882 [error] <0.170.0> Supervisor riak_core_handoff_sender_sup had child riak_core_handoff_sender started with {riak_core_handoff_sender,start_link,undefined} at <0.2844.4> exit with reason max_concurrency in context child_terminated
crash.log

2023-09-13 11:48:20 =CRASH REPORT==== crasher: initial call: riak_core_vnode_worker:init/1 pid: <0.4371.0> registered_name: [] exception exit: {{function_clause,[{riak_core_handoff_sender,visit_item,[{<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,1 41,135,102,76,129,178,206,178,15,200,127,187,40,0,0,0,0>>},<<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,2,97,1,11 ,48,51,46,50,48,50,50,45,...>>,...],...},...]},...} ancestors: [<0.4368.0>,<0.4367.0>,<0.4366.0>,<0.1819.0>,riak_core_vnode_sup,riak_core_sup,<0.161.0>] messages: [] links: [<0.4367.0>,<0.4368.0>,<0.93.0>] dictionary: [{bitcask_file_mod,bitcask_file},{bitcask_time_fudge,no_testing},{is_visit_item_timer_set,true}] trap_exit: false status: running heap_size: 318187 stack_size: 27 reductions: 37404224106 neighbours: 2023-09-13 11:48:20 =SUPERVISOR REPORT==== Supervisor: {<0.4368.0>,poolboy_sup} Context: child_terminated Reason: {function_clause,[{riak_core_handoff_sender,visit_item,[{<<48,98,58,181,35,123,106,120,146,164,42,74,122,96,253,251,232,40,185>>,<<175,18,82,141,135,102,76,129,178,206,178,15,200,127,187,40,0,0,0,0>>},<<53,1,0,0,0,38,131,108,0,0,0,1,104,2,109,0,0,0,12,100,4,162,222,89,118,233,214,0,0,43,103,104,2,97,1,110,5, 51,46,50,48,50,50,45,...>>,...],...},...]} Offender: [{pid,<0.4371.0>},{name,riak_core_vnode_worker},{mfargs,{riak_core_vnode_worker,start_link,undefined}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]

@martinsumner
Copy link
Contributor

Hard to know - but the code is running a fold over the vnode and sending to the remote vnode at the time ... but in the midst of the logs is a reference to {error,eio} - a generic UNIX socket error? Is this a recurring issue, or might it have been a one-off HW or OS issue?

@Konstantin74R
Copy link
Author

CentOS Linux release 7.9.2009 (Core) 3.10.0-1160.81.1.el7.x86_64 #1 SMP Fri Dec 16 17:29:43 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

HW and OS problems not detected.

It's a recurring problem only for one node and only for one partition. Temporary resolve - I was delete this partition from backends on this node to use replication for restore data. Joining new node to cluster was finished. No active transfers.

@martinsumner
Copy link
Contributor

Perhaps there was a corrupted block on the disk within that vnode backend causing an IO error when the fold tried to read a particular part of the keyspace? Deleting the partition would then allow the handoff to complete.

You should be able to run partition repair to restore data with relative efficiency - https://docs.riak.com/riak/kv/latest/using/repair-recovery/repairs/index.html#repairing-a-single-partition.

@Konstantin74R
Copy link
Author

I tried to start repair, but console return message to me "change ownership" and no effect.
Maybe was. I not found a problems with disk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants