How to set cluster_selection_epsilon when using cosine distances? #627

Open
ma9o opened this issue Feb 22, 2024 · 0 comments

Hi, I am using HDBSCAN to cluster text embeddings.

The data is unbalanced in favor of one category of embeddings, so I get too many sub-clusters of that category, which I would like to merge. I have found that data points with a cosine distance < 0.7 should belong to the same cluster, and if I understand correctly I should set cluster_selection_epsilon=0.7 to achieve this.

This doesn't seem to be working, as all the data points end up in the same cluster (is the value too high?).
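
One possible reading of this result, sketched under simplifying assumptions: a rule of the form "points closer than 0.7 belong together" is transitive, so a chain of close neighbours can join points that are far apart from each other. The toy example below is not HDBSCAN (which, as far as I understand, applies the threshold to mutual reachability distances in its cluster tree rather than to raw pairwise distances); it only illustrates the chaining effect with plain thresholded linkage. The helper names `cosine_distance` and `threshold_components` and the toy points are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


def threshold_components(points, eps):
    """Give two points the same label if a chain of pairs closer than eps links them."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if cosine_distance(points[i], points[j]) < eps:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


# Toy data: 19 unit vectors spread from 0 to 180 degrees in 10-degree steps.
angles = [math.radians(10 * step) for step in range(19)]
points = [(math.cos(t), math.sin(t)) for t in angles]

neighbour_gap = cosine_distance(points[0], points[1])  # adjacent vectors
end_to_end = cosine_distance(points[0], points[-1])    # opposite vectors
labels = threshold_components(points, eps=0.7)

print("adjacent distance:", round(neighbour_gap, 4))
print("end-to-end distance:", round(end_to_end, 4))
print("number of groups at eps=0.7:", len(set(labels)))
```

In this toy case the two end points are much further apart than 0.7, yet they share a group because every adjacent pair is well under the threshold. If the real embeddings are similarly dense, a threshold of 0.7 could plausibly merge everything, which would match the single label observed.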

My current code:

from cuml.metrics import pairwise_distances
from hdbscan import HDBSCAN
import numpy as np
import cupy as cp  
import cuml

embeddings_gpu = cp.asarray(embeddings)

umap_model = cuml.UMAP(n_neighbors=15,
                       n_components=100, 
                       metric='cosine')
reduced_data_gpu = umap_model.fit_transform(embeddings_gpu)

cosine_dist = pairwise_distances(reduced_data_gpu, metric='cosine')

clusterer = HDBSCAN(min_cluster_size=5, 
                    gen_min_span_tree=True,
                    metric="precomputed",
                    cluster_selection_epsilon=0.7) 
cluster_labels = clusterer.fit_predict(cosine_dist.astype(np.float64).get())

cluster_labels:

Shape: 9533
array([0, 0, 0, ..., 0, 0, 0])

cosine_dist:

Shape: (9533, 9533)
array([[5.9604645e-07, 1.6956329e-02, 5.4422319e-02, ..., 1.0555809e+00,
        1.1026136e+00, 1.3615031e+00],
       ...,
       [1.3615031e+00, 1.4514638e+00, 1.3940278e+00, ..., 3.1383842e-01,
        7.0653200e-02, 5.9604645e-07]], dtype=float32) 

Is this the correct use of cluster_selection_epsilon? Thanks
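
A rough way to judge whether 0.7 is too high for a given matrix is to look at how far each point is from its k-th nearest neighbour. The sketch below assumes a square precomputed distance matrix with a zero diagonal, like `cosine_dist` above. The helper name `kth_neighbor_distances`, the choice k=5 (mirroring min_cluster_size=5) and the random toy data are illustrative assumptions, not part of hdbscan or cuML.

```python
import numpy as np


def kth_neighbor_distances(dist, k):
    """Distance from each point to its k-th nearest neighbour, self excluded.

    Assumes dist is square with a zero diagonal and non-negative entries,
    so the smallest value in every row is the point's distance to itself.
    """
    dist = np.asarray(dist, dtype=np.float64)
    # After partitioning, column k holds the (k+1)-th smallest value of each
    # row, i.e. the k-th nearest neighbour once the self-distance is skipped.
    return np.partition(dist, k, axis=1)[:, k]


# Toy stand-in for cosine_dist: random unit vectors, not real embeddings.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 10))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
toy_dist = np.clip(1.0 - vectors @ vectors.T, 0.0, 2.0)
np.fill_diagonal(toy_dist, 0.0)

knn = kth_neighbor_distances(toy_dist, k=5)
for q in (50, 90, 99):
    print(f"{q}th percentile of 5-NN distance: {np.percentile(knn, q):.3f}")
print("fraction of points with 5-NN distance below 0.7:", float(np.mean(knn < 0.7)))
```

If nearly every point in the real matrix has its k-th neighbour well below 0.7, then 0.7 is larger than the typical local spacing, and a value nearer the upper percentiles of these distances may be a more useful starting point for cluster_selection_epsilon. This is a heuristic, not a rule taken from the hdbscan documentation.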
