benchmark tool for Riak KV #1863

Open
richamishra006 opened this issue Jun 20, 2023 · 23 comments

Comments

@richamishra006

Hi, can someone please suggest a benchmarking tool for Riak that can generate load and run a load test on the machine?

@martinsumner
Contributor

The develop-3.0 branch of basho_bench is the main one used for pre-release testing of Riak - https://github.com/basho/basho_bench/tree/develop-3.0 with this configuration.

There is an alternative branch which is used for testing the same thing via the HTTP API.

@richamishra006
Author

Thanks for your reply. I installed rebar3, but running the make all command gives an error:

root@68d36c17d1d9:/basho_bench# make all
/basho_bench/rebar get-deps
make: /basho_bench/rebar: Command not found
make: *** [Makefile:22: deps] Error 127

Please let me know if I am missing anything.

@richamishra006
Author

OK, I resolved that error by updating the Makefile and replacing rebar with rebar3.

@richamishra006
Author

I have installed Erlang 25 and rebar3, but I get this error when running make all:

root@68d36c17d1d9:/basho_bench# make all
/basho_bench/rebar3 get-deps
=ERROR REPORT==== 21-Jun-2023::15:30:10.069489 ===
beam/beam_load.c(551): Error loading function rebar3:run_aux/2: op put_tuple u x:
  please re-compile this module with an Erlang/OTP 25 compiler


escript: exception error: undefined function rebar3:main/1
  in function  escript:run/2 (escript.erl, line 750)
  in call from escript:start/1 (escript.erl, line 277)
  in call from init:start_em/1 
  in call from init:do_boot/3 
make: *** [Makefile:22: deps] Error 127

@martinsumner
Contributor

Try with OTP 22.

Run ./rebar3 escriptize from the basho_bench folder.

That should generate a basho_bench executable in _build/default/bin, which can then be used like this: nohup _build/default/bin/basho_bench examples/riakc_nhs_general.config & (or whatever config file). This will start generating output in tests/current/. You will need to update the config file to include your IP addresses (note the use of commas, not periods, in the addresses).

Sorry, the readme instructions are lagging behind the changes made. There's a bit of fiddling normally required to get this working.

The generation of charts using R is probably broken, so you may need to do your own work on the CSV outputs to chart any results.

@richamishra006
Author

Thank you for the response; it is working with OTP 22. Now probably one last question: I have three Riak nodes, riak0.local, riak1.local and riak2.local.
In which files do I need to update these names, or can I use IP addresses instead? There are lots of files in the examples directory.
Basically, the end goal is to generate load. If I want to increase the load on the cluster, is there a parameter I need to update?

@richamishra006
Author

Hey @martinsumner, could you please help me out? I am just about to finish my load test.

@martinsumner
Contributor

martinsumner commented Jun 22, 2023

For basho_bench you have a configuration file which you can set up to control your test. You can pick one of the examples as a starting point; a good one might be riakc_nhs_general.config, which is what is used for Riak release testing.

Here is an annotated version of that config file to explain what it means:

{mode, max}.

{duration, 1920}.

{report_interval, 10}.

{node_name, testnode1}.

{concurrent, 100}.

The first few elements define the throughput for the test.

  • The mode in this case is {mode, max} - hit the cluster as hard as you can, each worker will try a new piece of work once the last has been completed.
  • The {concurrent, 100} means have 100 workers each generating and sending requests concurrently.
  • The {duration, 1920} just sets the test to run for 1920 minutes.

So in this case you would increase the throughput by increasing the {concurrent, 100} value.

{driver, basho_bench_driver_nhs}.

{record_bucket, "recordBucket"}.
{document_bucket, "documentBucket"}.
{record_sync, "one"}.
{document_sync, "backend"}.
{node_confirms, 2}.

{postcode_indexcount, 6}.

%% Ignored by alwaysget and unique operations
{key_generator, {eightytwenty_int, 100000000}}.

{value_generator, {semi_compressible, 10000, 2000, 10, 0.1}}.

%% For alwaysget operations what is:
%% - the maximum number of keys per worker (max number of keys = this * concurrent)
%% - whether the inserts should be in key_order
{alwaysget, {2000000, 700000, skew_order}}.
{unique, {6000, key_order}}.

  • The next key bit of configuration is to define the driver (the Erlang file which defines the individual test commands). In this case this is basho_bench_driver_nhs.
  • All the rest of this section is the configuration for that driver: things like object size (mean, variation), size of the key space, how many indexes a record will have, and bucket names to generate keys in. There's a lot going on here.

{pb_ips, [{127,0,0,1}]}.
{http_ips, [{127,0,0,1}]}.

  • This is where the IP addresses should go. You can point at a single address (e.g. a load balancer), or a list of addresses. I think only IP addresses will work, and they need to be comma-separated inside a tuple, e.g. [{192, 168, 3, 1}, {192, 168, 3, 2}]. Normally the pb IPs and http IPs will be the same list.

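
The tuple format described above is easy to get wrong by hand, so here is a minimal Python sketch that turns dotted IPv4 strings into pb_ips / http_ips terms. The helper names (erlang_ip_tuple, erlang_ip_setting) are hypothetical and the addresses are placeholders; this is only a convenience for editing the config file, not part of basho_bench.

```python
import ipaddress
from typing import Iterable


def erlang_ip_tuple(address: str) -> str:
    """Render a dotted IPv4 address as an Erlang tuple, e.g. {192,168,3,1}."""
    # IPv4Address validates the input and rejects hostnames or malformed strings.
    ip = ipaddress.IPv4Address(address)
    return "{" + ",".join(str(octet) for octet in ip.packed) + "}"


def erlang_ip_setting(key: str, addresses: Iterable[str]) -> str:
    """Build a config line such as {pb_ips, [{192,168,3,1}, {192,168,3,2}]}."""
    tuples = ", ".join(erlang_ip_tuple(a) for a in addresses)
    return "{" + key + ", [" + tuples + "]}."


if __name__ == "__main__":
    nodes = ["192.168.3.1", "192.168.3.2"]  # placeholder addresses
    print(erlang_ip_setting("pb_ips", nodes))
    print(erlang_ip_setting("http_ips", nodes))
```

Hostnames such as riak0.local are rejected on purpose here, in line with the remark that probably only IP addresses will work.
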
{operations, [{alwaysget_pb, 620}, {alwaysget_updatewith2i, 130}, 
                {put_unique, 90}, {get_unique, 130}, {delete_unique, 25},
                {postcodequery_http, 2}, {dobquery_http, 3}]}.

  • This defines the distribution of operations. In this case the weights add up to 1000, so 62% of operations will be of the form alwaysget_pb (an operation that uses the PB API to fetch an object that has been added as part of the test, so it never gets a not found). {postcodequery_http, 2} means that 0.2% of test requests will be an HTTP 2i query of the postcode index.
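
To see the whole operation mix at a glance, the arithmetic above can be sketched in Python. The weights are copied from the operations term in the config; treating them as relative weights that are normalised by their sum follows the explanation above, and the function name operation_shares is only illustrative.

```python
# Weights copied from the {operations, [...]} term in the config above.
OPERATIONS = {
    "alwaysget_pb": 620,
    "alwaysget_updatewith2i": 130,
    "put_unique": 90,
    "get_unique": 130,
    "delete_unique": 25,
    "postcodequery_http": 2,
    "dobquery_http": 3,
}


def operation_shares(weights):
    """Return each operation's share of requests as a percentage."""
    total = sum(weights.values())
    return {name: 100.0 * weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    for name, pct in operation_shares(OPERATIONS).items():
        print(f"{name:<24} {pct:5.1f}%")
```

Changing a weight changes every other operation's share too, so it is worth recomputing the mix after editing the list.
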

@martinsumner
Contributor

This isn't easy. There's a lot going on in this particular config file to generate various test scenarios. This particular test scenario runs in an upload mode until a certain threshold is reached, and then switches to a load which has more GETs than PUTs once the database is of sufficient size to be worth testing.

There are much simpler test configs available - https://github.com/basho/basho_bench/blob/mas-nhs-httponly/examples/riakc_pb.config is a good example. The simpler test scenarios tend to give unrealistic tests - e.g. most of the test runs against a small database, with lots of not_found responses, and there's no testing of 2i etc.

@martinsumner
Contributor

econnrefused normally means either there is no listener on the TCP port, or some sort of firewall is blocking it. On the Riak node 172.22.0.212, if you run netstat -an | grep LISTEN | grep 8087, is it listening on that TCP port? Can you telnet to that port/IP from the basho_bench server?
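
If telnet is not installed on the basho_bench server, the same reachability check can be sketched with Python's standard socket module. The host and port are the ones mentioned above; the function name can_connect is only illustrative. A True result only shows that something accepts TCP connections on that port, not that Riak itself is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Address and port taken from the discussion above; adjust for your cluster.
    host, port = "172.22.0.212", 8087
    state = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
```
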

I'm not sure why riakc_pb.config would work though. The driver has an info message just before it connects, so it may be worth confirming the details reported in the console log when it hits this line:

https://github.com/basho/basho_bench/blob/develop-3.0/src/basho_bench_driver_riakc_pb.erl#L130

@richamishra006
Author

Yes, I found out that the Riak node keeps going down. When I checked the logs, I found this error:

Supervisor riak_core_sup had child riak_core_vnode_manager started with riak_core_vnode_manager:start_link() at <0.298.0> exit with reason {{function_clause,[{riak_kv_vnode,terminate,[{bad_return_value,{stop,{{badmatch,{error,{{badmatch,{error,{{badmatch,{error,emfile}},[{leveled_pmanifest,open_manifest,1,[{file,"/root/riak/rel/pkg/out/riak-3.0.10-OTP22.3/_build/default/lib/leveled/src/leveled_pmanifest.erl"},{line,128}]},{leveled_penciller,start_from_file,1,[{file,"/root/riak/rel/pkg/out/riak-3.0.10-OTP22.3/_build/default/lib/leveled/src/leveled_penciller.erl"},{line,1231}]},{gen_server,init_it,2,[{file,"gen_server.erl"},...]},...]}}},...}}},...}}},...],...},...]},...} in context child_terminated

I increased the limit and fs.file-max, but I am still getting this error.

@martinsumner
Contributor

This looks like a standard ulimit issue. Guidance for setting ulimit here
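
The emfile in the error above is the "too many open files" error, so it helps to confirm what limit is actually in effect. Here is a minimal Python sketch using the standard resource module (Unix only). The target of 65536 is an assumption for illustration; check the Riak guidance for the value that applies to your setup. Note that this reports the limit for the shell/user it is run from; the limit of an already running Riak process can differ (on Linux, /proc/<pid>/limits shows it).

```python
import resource

# Assumption: 65536 is an illustrative target, not a value taken from this thread.
TARGET_NOFILE = 65536


def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_sufficient(soft: int, target: int = TARGET_NOFILE) -> bool:
    """True if the soft limit is unlimited or at least the target."""
    return soft == resource.RLIM_INFINITY or soft >= target


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"open files: soft={soft} hard={hard}")
    if not limit_is_sufficient(soft):
        print(f"soft limit is below {TARGET_NOFILE}; emfile errors may occur under load")
```
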

@richamishra006
Author

Thank you @martinsumner. I updated the limits by following the doc you shared, but the service still keeps dropping. So I removed the ring and decided to test with a single node (where I am not facing issues). I used the riakc_nhs_general.config file and ran the test.
I also installed R using sudo apt-get install r-base, as mentioned in this doc https://docs.riak.com/riak/kv/latest/using/performance/benchmarking/index.html but when I run priv/summary.r -i tests/current, summary.png is not created, and I get this output:

root@application-node01:~/basho_bench# priv/summary.r -i tests/current
[1] "plyr"
Loading required package: plyr
[1] "grid"
Loading required package: grid
[1] "getopt"
Loading required package: getopt
[1] "proto"
Loading required package: proto
[1] "ggplot2"
Loading required package: ggplot2
[1] 0
[1] -Inf
Warning message:
In max(summary$elapsed) : no non-missing arguments to max; returning -Inf
Error: No latency information available to analyze in tests/current
Execution halted

Can you please help me generate the graph for this test?

@martinsumner
Contributor

I can't help, I'm afraid. Personally, I load results into a spreadsheet and then manipulate and chart them there. This makes it easier (for me) to chart comparisons between different runs of basho_bench etc. and tidy up the presentation.

I'm not sufficiently familiar with R/ggplot to troubleshoot this code.
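
As an alternative to a spreadsheet, the CSV output can be summarised with a few lines of Python. This is only a sketch: it assumes summary.csv has columns named elapsed, window and successful, with window being the length of the reporting interval in seconds (only the elapsed column is confirmed by the R output earlier in this thread), and the sample rows are made up.

```python
import csv
import io


def throughput_per_window(csv_file):
    """Yield (elapsed_seconds, ops_per_second) for each reporting window.

    Assumes columns named 'elapsed', 'window' and 'successful'; adjust the
    names if your summary.csv header differs.
    """
    # skipinitialspace drops the space after each comma in header and values.
    reader = csv.DictReader(csv_file, skipinitialspace=True)
    for row in reader:
        window = float(row["window"])
        if window > 0:
            yield float(row["elapsed"]), float(row["successful"]) / window


if __name__ == "__main__":
    # Made-up sample rows standing in for tests/current/summary.csv.
    sample = io.StringIO(
        "elapsed, window, total, successful, failed\n"
        "10, 10, 5000, 5000, 0\n"
        "20, 10, 6200, 6150, 50\n"
    )
    for elapsed, rate in throughput_per_window(sample):
        print(f"t={elapsed:.0f}s  {rate:.1f} ops/s")
```

To use it on a real run, pass an open file instead of the sample, e.g. open("tests/current/summary.csv", newline="").
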

@richamishra006
Author

Can you please explain how you do it? I mean, which file do you load into the spreadsheet? It would be really helpful for me.

@richamishra006
Author

Hi @martinsumner, I am new to basho_bench, so my questions might sound silly to you. I am thankful for your help so far. I want to test the performance of Riak running on a node with 256GB of memory and a 72-core CPU; I want to generate load and see if it breaks.
I am using riakc_nhs_general.config. Apart from increasing the concurrent value, is there anything I can fine-tune to increase the load? I tried running the tests with a concurrent value of 300, but it didn't make any difference to the utilisation.

@martinsumner
Contributor

At some stage you may hit limits on the basho_bench node itself, or with the network connection. The riakc_nhs_general.config uses fairly large objects so you can hit bandwidth limits quite easily. On the Riak side, the first limit tends to be write throughput to disk (as the test initially sends a write-heavy load to build up the database).

When testing Riak, the crucial questions are:

  • how many different objects and index entries do you expect to have in your cluster?
  • what is the approximate distribution of object sizes?
  • will those object values be compressible?
  • what is the expected balance of transactions - GET vs PUT vs 2i query?
  • how will read requests be distributed (i.e. will a proportion of the key space be hotter than the rest)?

There are then tuning options in basho_bench to reflect this. The answers to these questions will have a huge impact on the throughput (in terms of transactions per second) that can be achieved.

Then you need to set up Riak to accept the load. How many nodes do you expect to have in your Riak cluster? Normally Riak in production will run on at least six nodes, so it doesn't make sense to test with fewer than that. Then there are things to consider such as what anti-entropy configuration to use, and which Riak backend you intend to use (this makes a big difference for performance). The ring_size also needs to be set correctly to reflect the size of the cluster (you generally need to make sure that ring_size > total count of vCPUs in the cluster).
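
The ring_size rule of thumb above can be turned into a quick calculation. Riak ring sizes are powers of two, so this Python sketch picks the smallest power of two strictly greater than the total vCPU count. The function name and the node counts are illustrative only, and the result is a lower bound implied by the rule of thumb, not sizing advice.

```python
def smallest_ring_size(total_vcpus: int) -> int:
    """Smallest power of two strictly greater than the cluster's vCPU count."""
    if total_vcpus < 1:
        raise ValueError("total_vcpus must be positive")
    size = 1
    while size <= total_vcpus:
        size *= 2
    return size


if __name__ == "__main__":
    # Example figures only: a 72-vCPU node on its own, and six such nodes.
    for nodes, vcpus_per_node in [(1, 72), (6, 72)]:
        total = nodes * vcpus_per_node
        print(f"{nodes} node(s) x {vcpus_per_node} vCPU -> ring_size {smallest_ring_size(total)}")
```
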

There's not much point running the test without a certain investment in observability. So generally you need to make sure all the Riak logs, the Riak metrics and your general OS metrics are being indexed in something Splunk-like so that you can then determine where limits are and tune accordingly.

Doing worthwhile database testing needs quite a bit of preparation.

@richamishra006
Author

Thanks for the detailed explanation.
How can I install a Riak exporter to collect some metrics? I have installed Grafana and Prometheus. I also observed that the files in the tests/current dir don't contain any data; when I cat these files, I get no data. Attaching the screenshots.
image

@martinsumner
Contributor

Then the test must not be starting. Is there anything in the crash.log?
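
A quick way to confirm whether a run produced any results is to count the data rows in each CSV file under tests/current. The Python sketch below assumes the result files use a .csv extension and start with a header row; the function name data_row_counts is only illustrative.

```python
import csv
from pathlib import Path


def data_row_counts(results_dir):
    """Map each CSV file name in results_dir to its number of data rows (header excluded)."""
    counts = {}
    for path in sorted(Path(results_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            # Ignore completely blank lines so they are not counted as data.
            rows = [row for row in csv.reader(handle) if row]
        counts[path.name] = max(len(rows) - 1, 0)
    return counts


if __name__ == "__main__":
    results = Path("tests/current")
    if results.is_dir():
        for name, count in data_row_counts(results).items():
            flag = "" if count else "  <-- header only, no data"
            print(f"{name}: {count} data rows{flag}")
    else:
        print(f"{results} not found; run this from the basho_bench folder")
```

If every file reports zero data rows, the workers most likely never completed an operation, which fits the suggestion to look at crash.log.
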

@richamishra006
Author

This is the crash.log output. Also, the test fails when using the riakc_nhs_general.config file, but when using riakc_pb.config I am able to see data in the latencies.csv file.

2023-07-03 05:50:29 =ERROR REPORT====
** Generic server <0.737.0> terminating 
** Last message in was {tcp_closed,#Port<0.299>}
** When Server state == {state,{172,22,0,214},8087,false,false,undefined,false,gen_tcp,undefined,{[],[]},1,[],infinity,undefined,undefined,undefined,undefined,[],100,false,{false,0}}
** Reason for termination ==
** disconnected
2023-07-03 05:50:29 =CRASH REPORT====
  crasher:
    initial call: riakc_pb_socket:init/1
    pid: <0.737.0>
    registered_name: []
    exception exit: {disconnected,[{gen_server,handle_common_reply,8,[{file,"gen_server.erl"},{line,751}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
    ancestors: [<0.736.0>]
    message_queue_len: 0
    messages: []
    links: [<0.736.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 610
    stack_size: 27
    reductions: 1130
  neighbours:

@martinsumner
Contributor

Assuming 172.22.0.214:8087 is the correct IP/port for a Riak instance, there are no real clues here as to why it is being disconnected, especially given things work fine with riakc_pb.config, which also uses the PB Riak Erlang client in the same way.

You may be able to run the Erlang client from a rebar3 shell on your basho_bench box, and see whether this works:

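%% Connect to the Riak PB port (adjust the IP/port to your node).
{ok, Pid} = riakc_pb_socket:start("172.22.0.214", 8087).
MyBucket = <<"test">>.
Val1 = 1.
%% Build an object and write it.
Obj1 = riakc_obj:new(MyBucket, <<"one">>, Val1).
riakc_pb_socket:put(Pid, Obj1).

%% Read the object back.
{ok, Fetched1} = riakc_pb_socket:get(Pid, MyBucket, <<"one">>).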

@richamishra006
Author

I got this as output; it seems something is wrong with the drivers.
image

@richamishra006
Author

Also, I get this error when I run cat nohup.out:

12:57:21.584 [debug] Driver basho_bench_driver_riakc_pb crashed: {undef,[{base64,encode,[<<1,63,146,202>>],[]},{basho_bench_keygen,'-new/2-fun-11-',2,[{file,"/root/basho_bench/src/basho_bench_keygen.erl"},{line,82}]},{basho_bench_driver_riakc_pb,run,4,[{file,"/root/basho_bench/src/basho_bench_driver_riakc_pb.erl"},{line,319}]},{basho_bench_worker,worker_next_op2,2,[{file,"/root/basho_bench/src/basho_bench_worker.erl"},{line,252}]},{basho_bench_worker,worker_next_op,1,[{file,"/root/basho_bench/src/basho_bench_worker.erl"},{line,258}]},{basho_bench_worker,max_worker_run_loop,1,[{file,"/root/basho_bench/src/basho_bench_worker.erl"},{line,338}]}]}

I googled it and then ran this command; attaching the screenshot.
image
