Fixing some remaining riak-admin relics #1886

Open
wants to merge 17 commits into base: develop

Conversation

JMercerGit

Making changes to some outdated riak-admin outputs so that they are consistent with the new riak admin format.

Andrei Zavada and others added 17 commits April 28, 2022 16:39
The PR has now been accepted into redbug, so return to the master branch. Changes to pass eqc tests on OTP 24; all eunit tests also passing on OTP 25.1.1.
Add cmake to allow build of snappy
* Bound reap/erase queues

Also don't log queues on a crash, to avoid over-sized crash dumps.

* Create generic behaviour for eraser/reaper

Have the queue backed by disk, so that beyond a certain size it overflows from memory to disk (and once the on-disk part of the queue has been consumed, the files are removed). A sketch of this overflow pattern follows at the end of this list.

* Stop cleaning folder

Risk of misconfiguration leading to wiping of the wrong data. Also, starting a job may lead to the main process having its disk_log wiped.

* Setup folders correctly

Avoid enoent errors

* Correct log

* Further log correction

* Correct cleaning of directories for test

* Switch to action/2

* Update eqc tests for refactoring of reaper/eraser

* Improve comments/API

* Pause reaper on overload

Check for a soft overload on any vnode before reaping. This will add some delay - but reap has been shown to potentially overload a cluster ... availability is more important than reap speed.

There is no catch for {error, mailbox_overload} should it occur - it should not occur, as the mailbox check should prevent it. If it does, the reaper will crash (and restart without any reaps), returning to a known safe position. A sketch of this check-before-reap flow also follows at the end of this list.

* Adjustments following review

The queue files generated are still in UUID format, but the id now incorporates the creation date. This should make it easier to detect and clean any garbage that might be accrued.

One use of '++' has also been removed (although a lists:flatten/1 was still required at the end in this case).
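A minimal sketch of the memory-to-disk overflow pattern described under "Create generic behaviour for eraser/reaper" above, assuming a plain disk_log for the on-disk portion. The module, record and function names are illustrative only, not the actual riak_kv_overflow_queue API.

```erlang
%% Sketch only - names are illustrative, not the riak_kv_overflow_queue API.
-module(overflowq_sketch).
-export([new/3, add/2, close/1]).

-record(oflq, {mem, mem_size = 0, mem_limit, log}).

new(Name, FileName, MemLimit) ->
    %% The on-disk overflow is a disk_log; it exists only to cap memory
    %% use, not to persist the queue across restarts.
    {ok, Log} = disk_log:open([{name, Name}, {file, FileName}]),
    #oflq{mem = queue:new(), mem_limit = MemLimit, log = Log}.

%% Under the in-memory limit, keep the item in the in-memory queue ...
add(Item, Q = #oflq{mem_size = N, mem_limit = Max}) when N < Max ->
    Q#oflq{mem = queue:in(Item, Q#oflq.mem), mem_size = N + 1};
%% ... beyond the limit, append it to the disk_log instead.
add(Item, Q = #oflq{log = Log}) ->
    ok = disk_log:log(Log, Item),
    Q.

%% Once the on-disk part has been consumed (or on shutdown) the file is
%% removed, so nothing carries over a restart.
close(#oflq{log = Log}) ->
    FileName = proplists:get_value(file, disk_log:info(Log)),
    ok = disk_log:close(Log),
    file:delete(FileName).
```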
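Similarly, a sketch of the check-before-reap flow from "Pause reaper on overload": the overload check and the reap call are passed in as funs, standing in for the real vnode mailbox check and reap request, so this is not the riak_kv_reaper API.

```erlang
%% Sketch only - OverloadCheckFun and ReapFun stand in for the real vnode
%% soft-overload check and reap request.
-module(reaper_pause_sketch).
-export([maybe_reap/3]).

-define(PAUSE_MS, 1000).

maybe_reap(ReapRef, OverloadCheckFun, ReapFun) ->
    case OverloadCheckFun() of
        true ->
            %% A vnode reports a soft overload: delay this reap rather than
            %% add to its mailbox - availability over reap speed.
            erlang:send_after(?PAUSE_MS, self(), {requeue_reap, ReapRef}),
            paused;
        false ->
            %% No overload detected, so reap now.  There is deliberately no
            %% catch here: an unexpected {error, mailbox_overload} crashes
            %% the reaper, which restarts empty - a known safe position.
            ok = ReapFun(ReapRef),
            reaped
    end.
```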
# Conflicts:
#	priv/riak_kv.schema
#	rebar.config
#	src/riak_kv_overflow_queue.erl
#	src/riak_kv_reaper.erl
#	src/riak_kv_replrtq_src.erl
#	src/riak_kv_test_util.erl
See basho#1804

The heart of the problem is how to avoid needing configuration changes on sink clusters when source clusters are being changed. This allows new nodes to be discovered automatically, starting from the configured nodes.

Default behaviour is to always fall back to the configured behaviour.

Worker Counts and Per Peer Limits need to be set based on an understanding of whether this will be enabled. If the per-peer limit is left at its default, the consequence is that the worker count will be evenly distributed (independently by each node). Note that if Worker Count mod (Src Node Count) =/= 0, there will be no balancing of the excess workers across the sink nodes - see the sketch below.
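Illustrative arithmetic only (not code from riak_kv_replrtq_snk): how a worker count divides across discovered source nodes when the per-peer limit is left at default.

```erlang
%% split(8, 3) -> {2, 2}: two workers per source node, with two excess
%% workers that are not balanced further across the sink nodes.
-module(snk_worker_split_sketch).
-export([split/2]).

split(WorkerCount, SrcNodeCount) when SrcNodeCount > 0 ->
    {WorkerCount div SrcNodeCount, WorkerCount rem SrcNodeCount}.
```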
# Conflicts:
#	rebar.config
#	src/riak_kv_replrtq_peer.erl
#	src/riak_kv_replrtq_snk.erl
#	src/riak_kv_replrtq_src.erl
To simplify the configuration, rather than have the operator select all_check, day_check, range_check etc., there is now a default strategy of auto_check which tries to do a sensible thing (see the sketch below):

- do range_check when a range is set, otherwise
- do all_check if it is out of hours (in the all_check window), otherwise
- do day_check

Some stats have been added to help with monitoring; the detail is also still in the console logs.
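A simplified sketch of the auto_check ordering above; the real decision sits in riak_kv_ttaaefs_manager, and the two boolean inputs ("a range is set", "within the all_check window") are assumptions made for illustration.

```erlang
%% Sketch only - not the real riak_kv_ttaaefs_manager logic.
-module(auto_check_sketch).
-export([choose/2]).

choose(RangeIsSet, InAllCheckWindow) ->
    case {RangeIsSet, InAllCheckWindow} of
        {true, _}      -> range_check; %% a range has been set
        {false, true}  -> all_check;   %% out of hours, within the window
        {false, false} -> day_check    %% otherwise
    end.
```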

See basho#1815
# Conflicts:
#	src/riak_kv_stat.erl
#	src/riak_kv_ttaaefs_manager.erl
Expand on use of riak_kv_overflow_queue so that it is used by the riak_kv_replrtq_src, as well as riak_kv_reaper and riak_kv_eraser.

This means that larger queue sizes can be supported for riak_kv_replrtq_src without having to worry about compromising the memory of the node. This should allow repl_keys_range AAE folds to generate very large replication sets without the fold having to pause (so that real-time replication can keep up) and thereby clog the node worker pool.

The overflow queues are deleted on shutdown (if there is a queue on disk). The feature exists to allow for larger queues without memory exhaustion; persistence is not used to carry queues across restarts.

Overflow Queues extended to include a 'reader' queue which may be used for read_repairs. Currently this queue is only used for the repair_keys_range query and the read-repair trigger.
# Conflicts:
#	priv/riak_kv.schema
#	src/riak_kv_overflow_queue.erl
#	src/riak_kv_replrtq_src.erl
Introduces a reader overflowq for doing read repair operations.  Initially this is used for:

- repair_keys_range aae_fold - avoids the pausing of the fold that would block the worker pool;
- repair on key_amnesia - triggers the required repair rather than causing an AAE delta;
- repair on finding a corrupted object when folding to rebuild aae_store - previously the fold would crash, and the AAE store would therefore never be rebuilt.  [This PR](martinsumner/kv_index_tictactree#106) is required to make this consistent in both AAE solutions.
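A sketch of the pattern shared by the three triggers above: each one queues a read of the affected key (which drives the read repair) rather than repairing inline. The trigger atoms and request_read/1 are placeholders, not the actual reader-queue API.

```erlang
%% Sketch only - trigger names and request_read/1 are illustrative.
-module(reader_trigger_sketch).
-export([on_trigger/2]).

on_trigger(Trigger, {Bucket, Key})
        when Trigger == repair_keys_range;
             Trigger == key_amnesia;
             Trigger == aae_rebuild_corruption ->
    %% Queue the key for an asynchronous read (and so a read repair)
    %% rather than doing the work inline in the fold or vnode.
    request_read({Bucket, Key}).

request_read(BKey) ->
    %% Placeholder for pushing onto the reader overflow queue.
    {queued, BKey}.
```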
# Conflicts:
#	priv/riak_kv.schema
#	rebar.config
#	src/riak_kv_clusteraae_fsm.erl
#	src/riak_kv_index_hashtree.erl
#	src/riak_kv_vnode.erl
The merges had removed the stat updates for ttaae full-sync (detected by riak_test).

A log had been introduced in riak_kv_replrtq_peer that could crash (detected by riak_test).

The safety change to avoid coordination in full-sync (setting the time for the first work item from the beginning of the next hour) makes sense with 24 slices (one per hour) ... but less sense with different values. riak_test, which uses a very high slice_count to avoid delays, then failed.

# Conflicts:
#	src/riak_kv_replrtq_peer.erl
#	src/riak_kv_ttaaefs_manager.erl
# Conflicts:
#	rebar.config
Fixing some remaining riak-admin relics that should be `riak admin`
Fixing some remaining riak-admin relics