
sql/logictest: TestParallel failed #123965

Open
cockroach-teamcity opened this issue May 10, 2024 · 4 comments · May be fixed by #124066
Labels
branch-master Failures on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-1 Issues/test failures with a fix SLA of 1 month release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team

cockroach-teamcity commented May 10, 2024

sql/logictest.TestParallel failed with artifacts on master @ fd581e5f2458cf3585fefbf74a90c77d6710ea44:

=== RUN   TestParallel
--- FAIL: TestParallel (24.37s)
=== RUN   TestParallel/create_stats
    test_log_scope.go:170: test logs captured to: /artifacts/tmp/_tmp/273be200bf142dd38d3748aad57ea94a/logTestParallel_create_stats586570266
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:157: automatically injected an external process virtual cluster under test; see comment at top of test_server_shim.go for details.
[18:38:49] --- done: testdata/parallel_test/create_stats/setup with config : 2 tests, 0 failures
[18:38:50] --- done: testdata/parallel_test/create_stats/create_stats with config : 0 tests, 0 failures
    parallel_test.go:85: 
        testdata/parallel_test/create_stats/create_stats:2: 
        expected success, but found
        (40003) job 967625704145649667: could not mark as succeeded: result is ambiguous: error=can't combine batch responses with errors, have errors <nil> and [NotLeaseHolderError] proxy failed, update client information desc: r71:/Tenant/1{0/Table/11-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7], lease: repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING != et:desc:r72:/Tenant/10/Table/{12-25} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=9], lease:repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0; r72: replica not lease holder; current lease is repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0 [propagate] (last error: key range /Tenant/10/Table/15/1/967625704145649667/0-/Tenant/10/Table/54/1/967625704145649667/"legacy_progress"/1915-08-24T05:21:08.625059999Z/0/NULL outside of bounds of range /Tenant/10/Table/11-/Tenant/10/Table/12; suggested ranges: [desc: r71:/Tenant/10/Table/1{1-2} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=8], lease: repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING desc: r72:/Tenant/10/Table/{12-25} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=9], lease: repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING desc: r73:/Tenant/1{0/Table/25-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=9], lease: repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING])
        dist_sender.go:3124: in sendToReplicas()
[18:38:53] --- progress: testdata/parallel_test/create_stats/create_stats: 1 statements
[18:38:53] --- done: testdata/parallel_test/create_stats/create_stats with config : 1 tests, 0 failures
[18:38:54] --- progress: testdata/parallel_test/create_stats/create_stats: 1 statements
[18:38:54] --- done: testdata/parallel_test/create_stats/create_stats with config : 1 tests, 0 failures
    parallel_test.go:254: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/273be200bf142dd38d3748aad57ea94a/logTestParallel_create_stats586570266
    --- FAIL: TestParallel/create_stats (12.97s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash

Jira issue: CRDB-38630

@cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-queries SQL Queries Team labels May 10, 2024
@yuzefovich added T-kv KV Team and removed T-sql-queries SQL Queries Team labels May 12, 2024
@blathers-crl bot added this to Incoming in KV May 12, 2024
@nvanbenschoten (Member) commented:

(40003) job 967625704145649667: could not mark as succeeded: result is ambiguous: error=can't combine batch responses with errors, have errors <nil> and [NotLeaseHolderError] proxy failed, update client information ... [full error quoted from the failure output above]

@andrewbaptist do we expect this form of ambiguous error?

@nvanbenschoten added P-1 Issues/test failures with a fix SLA of 1 month C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels May 13, 2024
@andrewbaptist (Collaborator) commented:

I don't think we do. Let me take a closer look at where this "failure to combine" error is coming from.

@andrewbaptist (Collaborator) commented:

I understand how this can happen now, and I was worried about this exact scenario when I was fixing #123146. Specifically, the scenario where the "client", the "proxy node", and "meta2" hold three different views of the world can cause this problem. I had thought about a reasonably easy fix but didn't implement it as part of that PR because it was a little larger than I would have liked.

The correct solution is to provide 2 guarantees:

  1. Immediately reject a request if the proxy node detects that the client's routing information is stale compared to what the proxy node has.
  2. Do not allow the proxy node to update its routing information while it is evaluating the request.

The way the code is structured, we almost have both of these guarantees, but they are not airtight right now. Hitting these scenarios should be very rare, but the fact that this test is failing means the guarantees need to be stronger.

Let me take a look at how easy it will be to tighten this up.
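For illustration only (an editor's sketch, not code from this repository or from #124066): guarantee (1) amounts to a staleness check on the proxy node before any sending happens. All type and field names below are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// RangeInfo is a stand-in for the routing information a client attaches to a
// proxied batch (descriptor generation and lease sequence). The names are
// illustrative, not CockroachDB's actual types.
type RangeInfo struct {
	DescGeneration int64
	LeaseSequence  int64
}

var errStaleProxyInfo = errors.New("proxy: client routing information is stale")

// checkProxyRouting models guarantee (1): the proxy node compares the
// client's routing info against its own cached view and rejects the request
// outright if the client is behind, rather than silently re-routing it.
func checkProxyRouting(clientInfo, proxyCached RangeInfo) error {
	if clientInfo.DescGeneration < proxyCached.DescGeneration ||
		clientInfo.LeaseSequence < proxyCached.LeaseSequence {
		return errStaleProxyInfo
	}
	return nil
}

func main() {
	// The client believes the range descriptor is at generation 7, but the
	// proxy's cache has already seen generation 9 (e.g. after a split).
	client := RangeInfo{DescGeneration: 7, LeaseSequence: 1}
	proxy := RangeInfo{DescGeneration: 9, LeaseSequence: 1}
	if err := checkProxyRouting(client, proxy); err != nil {
		fmt.Println(err) // the client should refresh its cache and retry itself
	}
}
```

Failing fast here pushes the retry back to the originating client, which is the only node that can safely refresh its own range cache.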


andrewbaptist commented May 13, 2024

@arulajmani - We should chat about this bug.

As I’m looking through this issue, I’m realizing I might have made a mistake in how proxying is done on the proxy node. Specifically, there are 3 parts to the proxy code, and the third is where the issue lies:

  1. On the sender side deciding if it should proxy and to whom (this is OK)
  2. On the proxy node, evaluating locally and then determining if it should proxy (this is also OK)
  3. On the proxy node, sending the proxy request through DistSender (this is the problem)

For the DistSender proxy piece, I had originally conceived of the proxy node being able to take advantage of having more up-to-date information than the client, using its local DistSender cache to route the request. Having gone through the last couple of bugs related to this, I’m realizing that was a bad decision. Proxy requests should just use the Transport directly rather than the RangeCache at all.

So there are a couple options:

  1. Add a new method to DistSender (something like SendProxyRequest) that validates the request against its local cache; if the request carries stale information, it returns an error immediately, otherwise it creates a Transport and sends the request without all the retries in sendPartialBatch/sendToReplicas. (A rough sketch of this option follows below.)
  2. Attempt to use the existing methods, but make sure that certain operations (like splitting a batch) can’t happen.

I prefer the first, but I need to be careful to handle things like context cancellation and the other errors that can happen. I don’t think the refactor will be very big, and it would help reduce some of the complexity of the already complex sendPartialBatch/sendToReplicas paths.
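For illustration only (an editor's sketch of what option 1 might look like; the SendProxyRequest name comes from the comment above, but every type, field, and signature below is hypothetical and not taken from #124066):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Minimal stand-ins so the sketch compiles; these are not CockroachDB's real
// BatchRequest/Transport definitions.
type BatchRequest struct {
	RangeID       int64
	ClientDescGen int64 // routing info attached by the originating client
}

type BatchResponse struct{ OK bool }

type Transport interface {
	SendNext(ctx context.Context, ba *BatchRequest) (*BatchResponse, error)
}

type distSenderSketch struct {
	transport Transport
	// cachedDescGen maps range ID to the descriptor generation held in this
	// node's range cache.
	cachedDescGen map[int64]int64
}

var errStaleProxyInfo = errors.New("proxy: client routing information is stale")

// SendProxyRequest sketches option (1): validate the client's routing info
// against the local cache, fail fast if it is stale, and otherwise send via
// the transport directly, skipping the sendPartialBatch/sendToReplicas
// retry and batch-splitting machinery.
func (ds *distSenderSketch) SendProxyRequest(
	ctx context.Context, ba *BatchRequest,
) (*BatchResponse, error) {
	if gen, ok := ds.cachedDescGen[ba.RangeID]; ok && ba.ClientDescGen < gen {
		return nil, errStaleProxyInfo
	}
	// Respect context cancellation before dialing the target replica.
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return ds.transport.SendNext(ctx, ba)
}

type fakeTransport struct{}

func (fakeTransport) SendNext(context.Context, *BatchRequest) (*BatchResponse, error) {
	return &BatchResponse{OK: true}, nil
}

func main() {
	ds := &distSenderSketch{
		transport:     fakeTransport{},
		cachedDescGen: map[int64]int64{72: 9},
	}
	// The client's routing info (generation 7) is behind the proxy's cache
	// (generation 9), so the request is rejected immediately.
	_, err := ds.SendProxyRequest(context.Background(), &BatchRequest{RangeID: 72, ClientDescGen: 7})
	fmt.Println(err)
}
```

Rejecting up front and bypassing the DistSender retry loop would avoid the proxy-side re-routing and batch splitting that produced the "can't combine batch responses" error seen in the failure above.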

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 13, 2024
Previously a proxy request was handled using the same logic within
sendPartialBatch and sendToReplicas. There were short circuits to handle
the different retry cases, but this was still incorrect in some places.
Instead, this change intercepts proxy requests at the start of Send,
creates a transport, and sends to it directly. This greatly simplifies
the code for proxy requests and additionally fixes complex cases where
the routing information changes.

Epic: none
Fixes: cockroachdb#123965
Informs: cockroachdb#123146

Release note: None
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 13, 2024
@andrewbaptist linked a pull request May 13, 2024 that will close this issue
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 13, 2024
Previously in DistSender, a proxy request was handled using the same
logic within sendPartialBatch and sendToReplicas. There were short
circuits to handle the different retry cases, but with changing range
boundaries this could return the wrong error. This change intercepts
proxy requests at the start of DistSender.Send and sends them to the
transport, bypassing the rest of the DistSender retry logic. This
simplifies the code for proxy requests.

Epic: none
Fixes: cockroachdb#123965
Informs: cockroachdb#123146

Release note: None
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024