
pypi version throws ValueError #607

Open
FinnHuelsbusch opened this issue Aug 1, 2023 · 27 comments

Comments

@FinnHuelsbusch

FinnHuelsbusch commented Aug 1, 2023

To reproduce the bug:

  1. Create a new python 3.11.x environment (tested with python 3.11.4)
  2. Install the following dependencies:
  • scipy 1.11.1
  • scikit-learn 1.3.0
  • cython 0.29.36
  • hdbscan 0.8.33
  3. Create a minimal example:
from sklearn.datasets import make_blobs
import hdbscan
blobs, labels = make_blobs(n_samples=2000, n_features=10)
clusterer = hdbscan.HDBSCAN()
clusterer.fit(blobs)
print(clusterer.labels_)
  4. Execute it and get the following error:
Traceback (most recent call last):
File "/home/***/Desktop/hdbscan_test.py", line 5, in <module>
    clusterer.fit(blobs)
  File "/home/***/micromamba/envs/hdbscan3/lib/python3.11/site-packages/hdbscan/hdbscan_.py", line 1205, in fit
    ) = hdbscan(clean_data, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/micromamba/envs/hdbscan3/lib/python3.11/site-packages/hdbscan/hdbscan_.py", line 884, in hdbscan
    _tree_to_labels(
  File "/home/***/micromamba/envs/hdbscan3/lib/python3.11/site-packages/hdbscan/hdbscan_.py", line 80, in _tree_to_labels
    labels, probabilities, stabilities = get_clusters(
                                         ^^^^^^^^^^^^^
  File "hdbscan/_hdbscan_tree.pyx", line 659, in hdbscan._hdbscan_tree.get_clusters
  File "hdbscan/_hdbscan_tree.pyx", line 733, in hdbscan._hdbscan_tree.get_clusters
TypeError: 'numpy.float64' object cannot be interpreted as an integer
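The failing behaviour can be reproduced in isolation, independent of hdbscan: passing a numpy.float64 where Python expects an integer raises exactly this TypeError, because np.float64 (like Python's float) defines no __index__. A minimal sketch, not hdbscan's actual code path:

```python
import numpy as np

# A cluster key that has been promoted to float, as in the stability dict.
key = np.float64(384.0)

try:
    # Any int-expecting call (range, indexing, etc.) rejects np.float64.
    range(key)
except TypeError as e:
    print(e)  # 'numpy.float64' object cannot be interpreted as an integer
```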

Workaround:

  1. Clone the repo
  2. Uninstall hdbscan from the environment
  3. Execute python setup.py install while the environment is active
  4. Execute the minimal example again.
  5. It works.

This was also tested with commit 813636b (the commit for version 0.8.33).

It would be nice to get instructions on how to fix this (if the error is on my side), or a general fix.

Tested on Windows and Linux. This error only occurs under python 3.11.x.

@FinnHuelsbusch
Author

The error message seems similar to an error mentioned in the comments of #600 and its fix in #602, though both concern the condense_tree function.

@empowerVictor

I have the same error with both 0.8.29 and 0.8.33.

@LoveFishoO

Absolutely, my version of Python is also 3.11.x. I have the same error, but after trying this method I get another error: ModuleNotFoundError: No module named 'hdbscan._hdbscan_linkage'.

Replacing python setup.py install with python setup.py develop solved this problem for me.

@FinnHuelsbusch
Author

Maybe #606 helps with this error.

@jkmackie

jkmackie commented Aug 10, 2023

I also replicated the bug on Windows. Packages were installed from PyPI, in a base virtual environment created with miniconda.

Bug occurs:

  • Python 3.11.x
  • scikit-learn 1.3.0
  • hdbscan 0.8.33
  • numpy 1.24.4
from sklearn.datasets import make_blobs
import hdbscan
blobs, labels = make_blobs(n_samples=2000, n_features=10)
clusterer = hdbscan.HDBSCAN()
clusterer.fit(blobs)
print(clusterer.labels_)

Error:

File hdbscan\\_hdbscan_tree.pyx:733, in hdbscan._hdbscan_tree.get_clusters()

TypeError: 'numpy.float64' object cannot be interpreted as an integer

Avoid the bug by switching to slower Python 3.10.x and downgrading scikit-learn. Keep the hdbscan and numpy versions.

No errors:

  • Python 3.10.x
  • scikit-learn 1.2.1
  • hdbscan 0.8.33
  • numpy 1.24.4

Revised 15 August, 2023

@RichieHakim

I am also getting this error on windows builds. This seems like a pretty urgent issue. @lmcinnes or @gclendenning, forgive the @, but you may want to take a look at this.

@johnlees

So this line:
https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/_hdbscan_tree.pyx#L733

is_cluster = {cluster: True for cluster in node_list}

node_list is constructed above:

    if allow_single_cluster:
        node_list = sorted(stability.keys(), reverse=True)
    else:
        node_list = sorted(stability.keys(), reverse=True)[:-1]
        # (exclude root)

and stability is from https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/_hdbscan_tree.pyx#L164, see return https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/_hdbscan_tree.pyx#L237-L241

    result_pre_dict = np.vstack((np.arange(smallest_cluster,
                                           condensed_tree['parent'].max() + 1),
                                 result_arr)).T

    return dict(result_pre_dict)

np.arange should have an integer dtype I think; result_arr has type dtype=np.double.

I am not sure whether the np.vstack might be casting the integer keys to floats due to the result_arr type (I might check this later); I can't see anything obvious in numpy that would have changed this behaviour.
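That suspicion is easy to check outside hdbscan: np.vstack promotes the integer np.arange row to float64 to match result_arr, so the dict built from it ends up with numpy.float64 keys. A minimal sketch with made-up cluster ids and stability values:

```python
import numpy as np

smallest_cluster = 378                  # hypothetical smallest cluster id
result_arr = np.array([1.5, 0.7, 2.1])  # hypothetical stabilities, dtype float64

# Mirrors the return at the end of compute_stability():
result_pre_dict = np.vstack((np.arange(smallest_cluster, smallest_cluster + 3),
                             result_arr)).T

stability = dict(result_pre_dict)

print(result_pre_dict.dtype)  # float64 -- the integer keys were promoted
# The dict keys are now numpy.float64 (378.0, 379.0, 380.0), not ints.
print(sorted(float(k) for k in stability.keys()))
```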

@JanElbertMDavid

@jkmackie thanks for the solution mate! appreciate it.

@lmcinnes
Collaborator

At least some of the issues seem to be related to the wheel built for windows (and python 3.11). I have deleted that from PyPI. The downside is that installing on windows will require you to build from source; the upside is that hopefully installing from PyPI might work now.

@johnlees

Just to confirm, I am also seeing this on an Ubuntu 22.04 CI with:

  • hdbscan 0.8.33
  • python 3.10.12
  • scikit-learn 1.3.0
  • numpy 1.22.4

@johnlees

b .../lib/python3.10/site-packages/hdbscan/hdbscan_.py:80
p stability_dict.keys()
dict_keys([378.0, 379.0, 380.0, 381.0, 382.0, 383.0, 384.0, 385.0, 386.0, 387.0, 388.0, 389.0, 390.0, 391.0, 392.0, 393.0, 394.0])

Not sure whether those keys being floats is the problem here.

@jkmackie

jkmackie commented Aug 16, 2023

@johnlees I suspect downgrading scikit-learn below 1.3 would fix on Ubuntu. Numpy 1.22.4 is used in the successful Windows configuration below:

#Successful configuration - Windows 10.

(myvirtualenv) 
me@mypc MINGW64 ~/embedding_clustering
$ conda list | grep -w '^python\s\|scikit\|hdbscan\|numpy'
hdbscan                   0.8.33                   pypi_0    pypi
numpy                     1.24.4                   pypi_0    pypi
python                    3.10.9          h4de0772_0_cpython    conda-forge
scikit-learn              1.2.1                    pypi_0    pypi

Note hdbscan is imported separately from scikit-learn. I wonder why it isn't imported as a module like KMeans?

#from package.subpackage import module
from sklearn.cluster import KMeans

#in contrast, hdbscan cluster algo is imported directly
import hdbscan

@johnlees

Same issue with scikit-learn 1.2.2 and 1.2.1, and other packages as above.
I'm guessing this is a cython issue with the pyx files?

@lmcinnes
Collaborator

This is really quirky, and I am having a great deal of trouble reproducing it in a way that I can actually debug it myself.

@RichieHakim

Removing the pre-built wheel for windows on pypi was sufficient to get it working on my github actions windows runners.

If it is helpful, here is an example of when it was failing: https://github.com/RichieHakim/ROICaT/actions/runs/5861440405/job/15891513454

Thank you for the quick fix.

@alxfgh

alxfgh commented Aug 16, 2023

Removing the pre-built wheels and building from source didn't solve the bug for me

@jkmackie

Removing the pre-built wheels and building from source didn't solve the bug for me

Did you try a fresh environment?

conda create -n testenv python=3.11

pip install hdbscan==0.8.33 numpy==1.24.4 notebook==7.0.2 scikit-learn==1.3.0

Cython should be something like 0.29.36, not 3.0.

If there's a hdbscan error, try:

pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan

@johnlees

This is really quirky, and I am having a great deal of trouble reproducing it in a way that I can actually debug it myself.

Likewise – doing the install from source (rebuilding the cython-generated .so libraries) makes the issue go away. I have floats in the line reported by the backtrace, and am not sure that's the correct erroring line anyway. I might try rebuilding the conda-forge version and see if that helps.

@lmcinnes
Collaborator

We have a new azure-pipelines CI system that will automatically build wheels and publish them to PyPI, thanks to @gclendenning, so hopefully things will work a little better the next time we make a release. It is definitely just quirks in exactly how things build on different platforms etc., but the fine details of that are ... hard to sort out.

@johnlees

Ah, maybe I should have been clearer: I am having issues with the conda version, not PyPI.
Unfortunately, the rebuild on conda-forge didn't sort out the CI issue; still the same error.

@lmcinnes
Collaborator

The conda forge recipe might need to be changed. Potentially adding a version restriction to Cython in the recipe itself (since it may not use the build isolation that pip install does) might help.

chasemc added a commit to KwanLab/Autometa that referenced this issue Aug 23, 2023
@johnlees

The conda forge recipe might need to be changed. Potentially adding a version restriction to Cython in the recipe itself (since it may not use the build isolation that pip install does) might help.

Thanks for the pointer, this seems to have fixed it! It looks like we can pin cython<3 at build time while leaving the version unconstrained at run time, and it works. I also added a run test to the recipe, which I hope will flag such an issue in future releases.

@Gr4dient

Hi all, having trouble understanding what to do here (I installed HDBSCAN 2 days ago through Conda and I'm currently experiencing this issue). Can I remove and reinstall HDBSCAN through Conda at this point to solve the problem? If so, do I also need to remove and reinstall anything else? Cython? Thank you.

@johnlees

@Gr4dient I would reinstall HDBSCAN in that environment, or even just try a fresh conda environment. I hope to have fixed it in the 0.8.33 _3 builds (when you run conda list, the hdbscan version should end in _3).

@Gr4dient

Hi John, thanks for clarifying - it took several hours for Conda to find a solution to remove Cython and HDBSCAN from my NLP environment last night... not sure why it got so hung up. I'm not seeing '_3' on conda-forge; will that be available at some point soon? Thanks

@johnlees

The new builds are on conda forge, e.g. in my working environment conda list shows:

hdbscan                    0.8.33        py310h1f7b6fc_3          conda-forge

If you are having trouble with time taken to resolve environments I would recommend using mamba instead of conda, or just starting over with a new environment, or both.

@benmwebb

I can also reproduce this with a from-source build on Fedora 39:

# dnf install python3-devel python3-Cython python3-numpy python3-scipy python3-scikit-learn python3-setuptools gcc
# curl -LO https://files.pythonhosted.org/packages/44/2c/b6bb84999f1c82cf0abd28595ff8aff2e495e18f8718b6b18bb11a012de4/hdbscan-0.8.33.tar.gz
# tar -xvzf hdbscan-0.8.33.tar.gz 
# (cd hdbscan-0.8.33 && python3 setup.py build -j8)
# cat <<END > test.py
import hdbscan
from sklearn.datasets import make_blobs
data, _ = make_blobs(1000)
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)
assert len(cluster_labels) == 1000
END
# PYTHONPATH=hdbscan-0.8.33/build/lib.linux-x86_64-cpython-312/ python3 test.py
...
  File "//hdbscan-0.8.33/build/lib.linux-x86_64-cpython-312/hdbscan/hdbscan_.py", line 80, in _tree_to_labels
    labels, probabilities, stabilities = get_clusters(
                                         ^^^^^^^^^^^^^
  File "hdbscan/_hdbscan_tree.pyx", line 659, in hdbscan._hdbscan_tree.get_clusters
  File "hdbscan/_hdbscan_tree.pyx", line 733, in hdbscan._hdbscan_tree.get_clusters
TypeError: 'numpy.float64' object cannot be interpreted as an integer

A hacky fix which works for me is to replace https://github.com/scikit-learn-contrib/hdbscan/blob/0.8.33/hdbscan/_hdbscan_tree.pyx#L726-L729 with

    if allow_single_cluster:
        node_list = sorted([int(x) for x in stability.keys()], reverse=True)
    else:
        node_list = sorted([int(x) for x in stability.keys()], reverse=True)[:-1]
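As a sanity check of that cast, using made-up float keys like those observed in the debugger earlier in this thread:

```python
import numpy as np

# Stability dict with float keys, as produced by dict(result_pre_dict).
stability = {np.float64(378.0): 0.9, np.float64(379.0): 1.2, np.float64(380.0): 0.4}

# Patched node_list construction (allow_single_cluster=False branch):
# sorting in reverse puts the root (the smallest id) last, and [:-1] drops it.
node_list = sorted([int(x) for x in stability.keys()], reverse=True)[:-1]

print(node_list)  # [380, 379] -- plain Python ints, root (378) excluded
is_cluster = {cluster: True for cluster in node_list}
```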
