New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: spatial.cKDTree: do not slide the median #20606
base: main
Are you sure you want to change the base?
BUG: spatial.cKDTree: do not slide the median #20606
Conversation
Never slide the pivot when using median as pivot. Only slide the pivot when pivot is the midpoint. That’s what sliding midpoint means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a regression test that can be added so this doesn't get accidentally broken again or is the change buried too deep?
It is buried as deep as it gets. I am not quite sure how a regression test for it should be constructed. Perhaps if we can construct a data set that is degenerate enough, we might be able to inspect the kd-tree and see that nodes are not pivoted by the median. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While looking for a regression test case, I seem to have found one that crashes on this PR branch and doesn't crash on main
:
--- a/scipy/spatial/tests/test_kdtree.py
+++ b/scipy/spatial/tests/test_kdtree.py
@@ -1531,3 +1531,13 @@ def test_gh_18800(incantation):
tree = incantation(points, 10)
tree.query(arr_like, 1)
tree.query_ball_point(arr_like, 200)
+
+
+def test_gh_20605():
+ # this currently crashes in gh-20606
+ # but not on main
+ data = np.full((100, 2), 4)
+ data[:50, :] = 5
+ data[52:60, 1] = 8
+ balanced_tree = KDTree(data=data,
+ balanced_tree=True)
I suppose that makes me a little nervous.
It seems when splitting on the median, we can get a situation where all points end up in the same branch. This will then trigger an infinite recursion, resulting in a stack overflow. |
I also find this strange code in the kd-tree construction: if (CKDTREE_UNLIKELY(p == start_idx || p == end_idx)) {
// All children are equal in this dimension, try again with new bounds
assert(!_compact);
self->tree_buffer->pop_back();
std::vector<double> tmp_bounds(2 * m);
double* tmp_mins = &tmp_bounds[0];
std::copy_n(mins, m, tmp_mins);
double* tmp_maxes = &tmp_bounds[m];
std::copy_n(maxes, m, tmp_maxes);
const auto fixed_val = data[indices[start_idx]*m + d];
tmp_mins[d] = fixed_val;
tmp_maxes[d] = fixed_val;
return build(self, start_idx, end_idx, tmp_maxes, tmp_mins, _median, _compact);
} It tries to solve the issue of empty branches by pretending the spread along a dimension is zero. However, it propagates a false max value into the kd-tree, which can result in bogus query results later on. It came in with d13a48c and is a logic error. |
Median pivot (aka "standard split") results in stability issues when we have a lot of ties.
Thanks for the test. It revealed two more bugs in the new C++ code. There are also stability issues with using the median as pivot I have not seen before. This makes me a bit nervous too. |
Never slide the pivot when using median as pivot. Only slide the pivot when pivot is the midpoint. That’s what sliding midpoint means.
Sliding median is not a thing. A median is always the median.
Closes #20605