Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fault injection: get() incorrectly returns not_found #107

Open
slfritchie opened this issue Mar 27, 2014 · 17 comments
Open

Fault injection: get() incorrectly returns not_found #107

slfritchie opened this issue Mar 27, 2014 · 17 comments

Comments

@slfritchie
Copy link
Contributor

Fault injection parameters:

  • ftruncate(2): 7% chance (triggered once)
  • pread(2): 7% chance (triggered once)
  • other intercepted system calls: also 7% chance, but they are triggered 0 times in this test case.

Failing test case:

5> C1.
[{[{set,{var,1},
        {call,generic_qc_fsm,set_keys,
              [[<<9,217,162,128,149,197,176,236,10,163,247,68,141>>]]}},
   {set,{var,2},
        {call,generic_qc_fsm,open,
              ["/tmp/generic.qc",
               [read_write,
                {open_timeout,0},
                {max_file_size,100},
                {sync_strategy,none}]]}},
   {set,{var,15},
        {call,generic_qc_fsm,put,[{var,2},<<"k">>,<<>>]}},
   {set,{var,17},
        {call,generic_qc_fsm,put,
              [{var,2},
               <<9,217,162,128,149,197,176,236,10,163,247,68,141>>,
               <<>>]}},
   {set,{var,27},{call,generic_qc_fsm,close,[{var,2}]}},
   {set,{var,28},
        {call,generic_qc_fsm,open,
              ["/tmp/generic.qc",
               [read_write,
                {open_timeout,0},
                {max_file_size,100},
                {sync_strategy,none}]]}},
   {set,{var,29},{call,generic_qc_fsm,get,[{var,28},<<"k">>]}},
   {set,{var,30},
        {call,generic_qc_fsm,put,[{var,28},<<"k">>,<<>>]}},
   {set,{var,31},{call,generic_qc_fsm,close,[{var,28}]}},
   {set,{var,32},
        {call,generic_qc_fsm,open,
              ["/tmp/generic.qc",
               [read_write,
                {open_timeout,0},
                {max_file_size,100},
                {sync_strategy,none}]]}},
   {set,{var,33},
        {call,generic_qc_fsm,delete,[{var,32},<<"k">>]}},
   {set,{var,34},
        {call,generic_qc_fsm,get,
              [{var,32},<<9,217,162,128,149,197,...>>]}}],
  7354},
 [{verify_trace,[]}]]

The successful fault injection triggers are:

  1. An ftruncate(2) call returns -1, errno ENOSPC
  2. A pread(2) call returns -1, errno EIO

Test case trace of results:

Trace: [{set_keys,[<<9,217,162,128,149,197,176,236,10,163,247,68,141>>]},
        open,
        {put,yes,<<"k">>,<<>>},
        {put,yes,<<9,217,162,128,149,197,176,236,10,163,247,68,141>>,<<>>},
        close,open,
        {get,<<"k">>,<<>>},
        {put,yes,<<"k">>,<<>>},
        close,open,
        {delete,yes,<<"k">>},
        {get,<<9,217,162,128,149,197,176,236,10,163,247,68,141>>,not_found},
        close]
verify_trace: {get,expected,[<<>>],got,not_found}
verify_trace: failed

All of the DB mutations have status yes, meaning that the eleveldb API responded affirmatively to each call: i.e. there were no errors returned by the API.

This is probably not a truly minimal failing test case: the nature how the random number generator is used means that a certain number of commands needed to happen "in the middle" before a lucky 7% event happened at a crucial time. The two close & open cycles in the middle of the test case: they might be truly necessary, but they might not.

To reproduce, cut-and-paste to execute these commands. Tested with OS X. It ought to work with Linux, but honestly I haven't tested the faulterl library very much on Linux.

mkdir /tmp/e
cd /tmp/e
git clone git@github.com:basho/eleveldb.git
cd eleveldb
git checkout -f 391b885abb2005e4b3f4f96a16822c9b1f31601f
make ; rebar skip_deps=true eunit suites=NONE ; erlc -o deps/faulterl/ebin -I deps/faulterl/include priv/scenario/*erl ; deps/faulterl/ebin/make_intercept_c.escript trigger_commonpaths yo ; env `deps/faulterl/ebin/example_environment.sh $PWD/yo` erl -pz .eunit deps/*/ebin

You now have an Erlang shell running with a fault-injectable VM. Cut-and-paste to execute the following three commands.

C1 = [{[{set,{var,1},{call,generic_qc_fsm,set_keys,[[<<9,217,162,128,149,197,176,236,10,163,247,68,141>>]]}},{set,{var,2},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,15},{call,generic_qc_fsm,put,[{var,2},<<107>>,<<>>]}},{set,{var,17},{call,generic_qc_fsm,put,[{var,2},<<9,217,162,128,149,197,176,236,10,163,247,68,141>>,<<>>]}},{set,{var,27},{call,generic_qc_fsm,close,[{var,2}]}},{set,{var,28},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,29},{call,generic_qc_fsm,get,[{var,28},<<107>>]}},{set,{var,30},{call,generic_qc_fsm,put,[{var,28},<<107>>,<<>>]}},{set,{var,31},{call,generic_qc_fsm,close,[{var,28}]}},{set,{var,32},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,33},{call,generic_qc_fsm,delete,[{var,32},<<107>>]}},{set,{var,34},{call,generic_qc_fsm,get,[{var,32},<<9,217,162,128,149,197,176,236,10,163,247,68,141>>]}}],7354},[{verify_trace,[]}]].
eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(false, false)), C1).
eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(true, false)), C1).
@slfritchie
Copy link
Contributor Author

This one is a bit more interesting. It still has a close & open cycle in the middle (as well as a failed open attempt), but it only uses a single key.

C3 = [{[{set,{var,1},{call,generic_qc_fsm,set_keys,[[]]}},{set,{var,2},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,7},{call,generic_qc_fsm,put,[{var,2},<<107>>,<<>>]}},{set,{var,9},{call,generic_qc_fsm,close,[{var,2}]}},{set,{var,10},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,15},{call,generic_qc_fsm,close,[{var,10}]}},{set,{var,16},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,40},{call,generic_qc_fsm,close,[{var,16}]}},{set,{var,41},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,42},{call,generic_qc_fsm,get,[{var,41},<<107>>]}}],8318},[{verify_trace,[]}]].
eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(true, false)), C3).

The diagnostic output:

Trace: [{set_keys,[]},
        open,
        {put,yes,<<"k">>,<<>>},
        close,open,close,
        {open,{error,{db_open,"IO error: lock /tmp/generic.qc/LOCK: Resource deadlock avoided"}}},
        open,
        {get,<<"k">>,not_found},
        close]
verify_trace: {get,expected,[<<>>],got,not_found}
verify_trace: failed

@slfritchie slfritchie added the Bug label Mar 27, 2014
@slfritchie slfritchie added this to the 2.0-RC milestone Mar 27, 2014
@slfritchie
Copy link
Contributor Author

I've seen over a dozen variations of this failure. (I have yet to have a 5 minute run succeed.) Each failing case has a close-and-reopen cycle in the middle.

@matthewvon
Copy link
Contributor

If you are failing reads, that seems to make sense that <<"k">> is sometimes not found. What am I missing?

@slfritchie
Copy link
Contributor Author

The trace shows the order of operations: open, put <<"k">>, close, open, close, open (fails), open, get <<"k">> -> not_found. That last operation is definitely bogus. The fault injection isn't causing pread(2) to return invalid data: the fault injection is causing a return of -1 + valid errno value. I don't believe that eleveldb should be returning {ok, ...} in such a case.

@slfritchie
Copy link
Contributor Author

Hi. I have another case. I'm not sure if it's different or not. The case frequently passes, which is annoying. But it might have something to do with compaction, so I'm wondering if I might open another ticket -- if I run the test case 25 times it a row, it will always fail before the end of the 25th run.

{[{set,{var,1},
       {call,generic_qc_fsm,set_keys,
             [[<<116,116,0,0,0,0>>],{var,parameter_test_dir}]}},
  {set,{var,2},
       {call,generic_qc_fsm,open,
             [{var,parameter_test_dir},
              [read_write,
               {open_timeout,0},
               {max_file_size,100},
               {sync_strategy,none},
               {create_if_missing,true},
               {limited_developer_mem,true},
               {compression,false},
               {write_buffer_size,16000000}]]}},
  {set,{var,6},
       {call,generic_qc_fsm,put,[{var,2},<<116,116,0,0,0,0>>,<<"Eà">>]}},
  {set,{var,8},
       {call,generic_qc_fsm,put_filler,[{var,2},{3085,<<137,112>>},50170]}},
  {set,{var,15},{call,generic_qc_fsm,get,[{var,2},<<116,116,0,0,0,0>>]}}],
 95149}
Trace: [{set_keys,[<<116,116,0,0,0,0>>]},
        open,
        {put,yes,<<116,116,0,0,0,0>>,<<"Eà">>},
        {get,get,<<116,116,0,0,0,0>>,not_found},
        close]
verify_trace: {get,get,<<116,116,0,0,0,0>>,expected,[<<"Eà">>],got,not_found}

The put_filler step is writing 3,085 keys with a prefix of <<137,112>> and value blobs of 50,170 bytes each (for about approximately 150MBytes of data).

@slfritchie
Copy link
Contributor Author

It's also worth noting that the case that I mention from approx. 14 hours ago does not have close & reopen steps in the middle.

@slfritchie
Copy link
Contributor Author

I don't know if there's one or more separate bug/trigger/thingies in here, but FWIW I observed a failure last night where get didn't find a key but fold did. Here's an abbreviated trace of events.

Trace: [{open,ok}, 
        ...
        {close,ok},
        {open,ok},
        {put,yes,<<228,240,133,109,0,0,0,0>>,<<"\ey½K-">>}, 
        {get,fold,<<228,240,133,109,0,0,0,0>>,<<"\ey½K-">>},
        ...
        {get,fold,<<228,240,133,109,0,0,0,0>>,<<"\ey½K-">>},
        ...
        {get,get,<<228,240,133,109,0,0,0,0>>,<<"\ey½K-">>}, 
        ...
        {get,fold,<<228,240,133,109,0,0,0,0>>,<<"\ey½K-">>},
        {get,get,<<228,240,133,109,0,0,0,0>>,not_found},
        ...
        {get,fold,<<228,240,133,109,0,0,0,0>>,<<"\ey½K-">>},
        ...
        {delete,yes,<<228,240,133,109,0,0,0,0>>},
        ...
        {close,ok}]
verify_trace: {get,get,
                   <<228,240,133,109,0,0,0,0>>,
                   expected,
                   [<<"\ey½K-">>],
                   got,not_found}
final_close_ok: true  
postconditions: passed
verify_trace: failed  

@slfritchie
Copy link
Contributor Author

Here's more background info on the original failing test case.

I commented out the really_delete_dir() call on line 49 of test/generic_qc_fsm.erl to preserve the data dir after the failing test runs and fails. I've put a tarball of the /tmp/generic_qc data dir and put it at http://www.snookles.com/scotttmp/generic.qc.tar.gz

The failing test case triggers one message to syslogd:

Apr 10 13:44:05 sbb3.local beam.smp[31744]: Close ftruncate failed [28, No space left on device]

@matthewvon
Copy link
Contributor

@slfritchie My theory of the moment is that this is "functions as designed". Set "paranoid_checks" to true in eleveldb … likely advanced.config in your Riak 2.0 environment. Instead of the key being not found, the open operation will abort … forever. If this expected behavior is confirmed, then we do not want to change the code.

@slfritchie
Copy link
Contributor Author

Hrm, well, there's no Riak involved in this case.

IMHO a series of API-says-it-was-successful operations followed by another API-says-it-was-successful operation that doesn't return the expected value is a data loss bug. If paranoid_checks=true makes the problem disappear, then IMHO it should be impossible for Riak to run when paranoid_checks=false.

@slfritchie
Copy link
Contributor Author

Current status: still seeing not_found for keys that indeed exist, even when paranoid_checks=true.

eleveldb commit: 35f681e of branch slf-faulterl-introduction3 (which is a rebased branch from slf-faulterl-introduction2), which is in turn based upon today's develop branch at 9ab8ed6
leveldb commit: f4cf10d2ba1b999f10805ac35f22145e80932544 of today's develop branch
faulterl commit: 88bd8c8b377f8cdae9683218f229706f4f79ea8e of today's master branch

To recreate, in two cut-and-paste steps. Tested on OS X 10.8.5 and Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-54-generic x86_64). First:

mkdir /tmp/e
cd /tmp/e
git clone git@github.com:basho/eleveldb.git
cd eleveldb
git checkout -f 35f681e6e936a7f036e661b24fc73e7fe6f3e18f
./rebar get-deps ; (cd deps/faulterl ; git checkout -f 88bd8c8b377f8cdae9683218f229706f4f79ea8e) ; make ; rm priv/*.so ; (cd c_src/leveldb ; git checkout -f f4cf10d2ba1b999f10805ac35f22145e80932544) ; make ; ./rebar skip_deps=true eunit suites=NONE ; erlc -o deps/faulterl/ebin -I deps/faulterl/include priv/scenario/*erl ; deps/faulterl/ebin/make_intercept_c.escript trigger_commonpaths yo ; env `deps/faulterl/ebin/example_environment.sh $PWD/yo` erl -pz .eunit deps/*/ebin

And second. This one fails for OS X (Ubu case part 2 is below):

C4 = [{[{set,{var,1},{call,generic_qc_fsm,set_keys,[[<<173,0,0,0,0>>],{var,parameter_test_dir}]}},{set,{var,2},{call,generic_qc_fsm,open,[{var,parameter_test_dir},[read_write,{paranoid_checks,true},{open_timeout,0},{max_file_size,100},{sync_strategy,none},{create_if_missing,true},{limited_developer_mem,true},{compression,false},{write_buffer_size,16000000}]]}},{set,{var,3},{call,generic_qc_fsm,put,[{var,2},<<173,0,0,0,0>>,<<>>]}},{set,{var,4},{call,generic_qc_fsm,put_filler,[{var,2},{2245,<<22>>},54214]}},{set,{var,6},{call,generic_qc_fsm,fold_all,[{var,2}]}},{set,{var,7},{call,generic_qc_fsm,close,[{var,2}]}},{set,{var,8},{call,generic_qc_fsm,open,[{var,parameter_test_dir},[read_write,{paranoid_checks,true},{open_timeout,0},{max_file_size,100},{sync_strategy,none},{create_if_missing,true},{limited_developer_mem,true},{compression,false},{write_buffer_size,16000000}]]}},{set,{var,9},{call,generic_qc_fsm,get,[{var,8},<<173,0,0,0,0>>]}},{set,{var,10},{call,generic_qc_fsm,fold_all,[{var,8}]}}],1023},[{verify_trace,[]}]].
eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(false, false)), C4).
[true = eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(true, false)), C4) || _ <- lists:seq(1,30)].

The Ubuntu 12 Linux part 2 is:

C4c = [{[{set,{var,1},{call,generic_qc_fsm,set_keys,[[<<173,0,0,0,0>>],{var,parameter_test_dir}]}},{set,{var,2},{call,generic_qc_fsm,open,[{var,parameter_test_dir},[read_write,{paranoid_checks,true},{open_timeout,0},{max_file_size,100},{sync_strategy,none},{create_if_missing,true},{limited_developer_mem,true},{compression,false},{write_buffer_size,16000000}]]}},{set,{var,3},{call,generic_qc_fsm,put,[{var,2},<<173,0,0,0,0>>,<<>>]}},{set,{var,4},{call,generic_qc_fsm,put_filler,[{var,2},{2245,<<22>>},54214]}},{set,{var,6},{call,generic_qc_fsm,fold_all,[{var,2}]}},{set,{var,7},{call,generic_qc_fsm,close,[{var,2}]}},{set,{var,8},{call,generic_qc_fsm,open,[{var,parameter_test_dir},[read_write,{paranoid_checks,true},{open_timeout,0},{max_file_size,100},{sync_strategy,none},{create_if_missing,true},{limited_developer_mem,true},{compression,false},{write_buffer_size,16000000}]]}},{set,{var,9},{call,generic_qc_fsm,get,[{var,8},<<173,0,0,0,0>>]}},{set,{var,10},{call,generic_qc_fsm,fold_all,[{var,8}]}}],9023},[{verify_trace,[]}]].
eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(false, false)), C4c).
[true = eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(true, false)), C4c) || _ <- lists:seq(1,30)].

Here's an easier-to-read-for-humans formatting of the test case and failing diagnostic output:

{[{set,{var,1},
       {call,generic_qc_fsm,set_keys,
             [[<<173,0,0,0,0>>],{var,parameter_test_dir}]}},
  {set,{var,2},
       {call,generic_qc_fsm,open,
             [{var,parameter_test_dir},
              [read_write,
               {paranoid_checks,true},
               {open_timeout,0},
               {max_file_size,100},
               {sync_strategy,none},
               {create_if_missing,true},
               {limited_developer_mem,true},
               {compression,false},
               {write_buffer_size,16000000}]]}},
  {set,{var,3},{call,generic_qc_fsm,put,[{var,2},<<173,0,0,0,0>>,<<>>]}},
  {set,{var,4},{call,generic_qc_fsm,put_filler,[{var,2},{2245,<<22>>},54214]}},
  {set,{var,6},{call,generic_qc_fsm,fold_all,[{var,2}]}},
  {set,{var,7},{call,generic_qc_fsm,close,[{var,2}]}},
  {set,{var,8},
       {call,generic_qc_fsm,open,
             [{var,parameter_test_dir},
              [read_write,
               {paranoid_checks,true},
               {open_timeout,0},
               {max_file_size,100},
               {sync_strategy,none},
               {create_if_missing,true},
               {limited_developer_mem,true},
               {compression,false},
               {write_buffer_size,16000000}]]}},
  {set,{var,9},{call,generic_qc_fsm,get,[{var,8},<<173,0,0,0,0>>]}},
  {set,{var,10},{call,generic_qc_fsm,fold_all,[{var,8}]}}],
 1023}
Trace: [{open,ok},
        {put,yes,<<173,0,0,0,0>>,<<>>},
        {get,fold,<<173,0,0,0,0>>,<<>>},
        {close,ok},
        {open,ok},
        {get,get,<<173,0,0,0,0>>,not_found},
        {close,ok}]
verify_trace: {get,get,<<173,0,0,0,0>>,expected,[<<>>],got,not_found}
final_close_ok: true
verify_trace: failed

@matthewvon
Copy link
Contributor

I still have neither built a leveldb unit test reproducer nor taken time to set up eqc.

That said, the paranoid_check=true and the syslog message point to Recover() log processing eating the error. Specifically the failed ftruncate call did not send an error return. Also, there is the possibility that even if the error was returned that it would be overwritten. Branch mv-tuning7 in basho/leveldb contains the changes (and a throttle change in version_set.cc that is unrelated).

@slfritchie
Copy link
Contributor Author

Branch mv-tuning7 in basho/leveldb contains the changes [...]

I shouldn't be asking until tomorrow morning, but I'll ask now anyway ... why isn't mv-tuning7 merged to develop yet?

@matthewvon
Copy link
Contributor

No code review. And there may be other minor bugs to fix. "tuning" branches are just that, minor stuff.

@slfritchie
Copy link
Contributor Author

@matthewvon Nop, sorry, mv-tuning7 doesn't fix the bug: I still see test case C4 failing on OS X within 30 attempts, and typically within 3 attempts.

@slfritchie
Copy link
Contributor Author

Update: Source drift/version control is a pain, but I think I put all of the moving parts in the proper place. Today's mv-tuning7 branches from both eleveldb and leveldb repos appear to fix the C1 bug above: it passes 5K iterations. The original failure case always/almost always fails the first time.

@cmeiklejohn
Copy link
Contributor

Moving to 2.0.1.

@cmeiklejohn cmeiklejohn modified the milestones: 2.0.1, 2.0-RC Jun 3, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants