Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: spatial.cKDTree: do not slide the median #20606

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

sturlamolden
Copy link
Contributor

Never slide the pivot when using median as pivot. Only slide the pivot when pivot is the midpoint. That’s what sliding midpoint means.

Sliding median is not a thing. A median is always the median.

Closes #20605

Never slide the pivot when using median as pivot. Only slide the pivot when pivot is the midpoint. That’s what sliding midpoint means.
@github-actions github-actions bot added scipy.spatial C/C++ Items related to the internal C/C++ code base labels Apr 29, 2024
@sturlamolden sturlamolden changed the title Do not slide the median in cKDTree BUG: Do not slide the median in cKDTree Apr 29, 2024
Copy link
Member

@j-bowhay j-bowhay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a regression test that can be added so this doesn't get accidentally broken again or is the change buried too deep?

@sturlamolden
Copy link
Contributor Author

Is there a regression test that can be added so this doesn't get accidentally broken again or is the change buried too deep?

It is buried as deep as it gets. I am not quite sure how a regression test for it should be constructed.

Perhaps if we can construct a data set that is degenerate enough, we might be able to inspect the kd-tree and see that nodes are not pivoted by the median.

@tylerjereddy tylerjereddy added this to the 1.14.0 milestone Apr 29, 2024
Copy link
Contributor

@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While looking for a regression test case, I seem to have found one that crashes on this PR branch and doesn't crash on main:

--- a/scipy/spatial/tests/test_kdtree.py
+++ b/scipy/spatial/tests/test_kdtree.py
@@ -1531,3 +1531,13 @@ def test_gh_18800(incantation):
     tree = incantation(points, 10)
     tree.query(arr_like, 1)
     tree.query_ball_point(arr_like, 200)
+
+
+def test_gh_20605():
+    # this currently crashes in gh-20606
+    # but not on main
+    data = np.full((100, 2), 4)
+    data[:50, :] = 5
+    data[52:60, 1] = 8
+    balanced_tree = KDTree(data=data,
+                           balanced_tree=True)

I suppose that makes me a little nervous.

@sturlamolden
Copy link
Contributor Author

It seems when splitting on the median, we can get a situation where all points end up in the same branch. This will then trigger an infinite recursion, resulting in a stack overflow.

@sturlamolden
Copy link
Contributor Author

I also find this strange code in the kd-tree construction:

if (CKDTREE_UNLIKELY(p == start_idx || p == end_idx)) {
            // All children are equal in this dimension, try again with new bounds
            assert(!_compact);
            self->tree_buffer->pop_back();
            std::vector<double> tmp_bounds(2 * m);
            double* tmp_mins = &tmp_bounds[0];
            std::copy_n(mins, m, tmp_mins);
            double* tmp_maxes = &tmp_bounds[m];
            std::copy_n(maxes, m, tmp_maxes);

            const auto fixed_val = data[indices[start_idx]*m + d];
            tmp_mins[d] = fixed_val;
            tmp_maxes[d] = fixed_val;

            return build(self, start_idx, end_idx, tmp_maxes, tmp_mins, _median, _compact);
        }

It tries to solve the issue of empty branches by pretending the spread along a dimension is zero. However, it propagates a false max value into the kd-tree, which can result in bogus query results later on. It came in with d13a48c and is a logic error.

@sturlamolden
Copy link
Contributor Author

While looking for a regression test case, I seem to have found one that crashes on this PR branch and doesn't crash on main:

--- a/scipy/spatial/tests/test_kdtree.py
+++ b/scipy/spatial/tests/test_kdtree.py
@@ -1531,3 +1531,13 @@ def test_gh_18800(incantation):
     tree = incantation(points, 10)
     tree.query(arr_like, 1)
     tree.query_ball_point(arr_like, 200)
+
+
+def test_gh_20605():
+    # this currently crashes in gh-20606
+    # but not on main
+    data = np.full((100, 2), 4)
+    data[:50, :] = 5
+    data[52:60, 1] = 8
+    balanced_tree = KDTree(data=data,
+                           balanced_tree=True)

I suppose that makes me a little nervous.

Thanks for the test. It revealed two more bugs in the new C++ code.

There are also stability issues with using the median as pivot I have not seen before.

This makes me a bit nervous too.

@lucascolley lucascolley added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label May 17, 2024
@lucascolley lucascolley changed the title BUG: Do not slide the median in cKDTree BUG: spatial: Do not slide the median in cKDTree May 17, 2024
@lucascolley lucascolley changed the title BUG: spatial: Do not slide the median in cKDTree BUG: spatial.cKDTree: do not slide the median May 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C/C++ Items related to the internal C/C++ code base defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.spatial
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Sliding is applied when using median as pivot in cKDTree, should only be used for midpoint
4 participants