HDBScan performance issue when choosing Best algorithm #630

Closed
divya-agrawal3103 opened this issue Apr 22, 2024 · 6 comments

@divya-agrawal3103

Hi,

I am attempting to execute a stream that uses the HDBSCAN clustering algorithm on a set of input data to generate a model.
When I select the algorithm Best and randomly pass in 10% of the total input data (the input is a CSV file with 15 columns and ~169,379 rows), the stream executes and never finishes; I tracked it for 5 hrs 9 mins and then had to stop.

This is the piece of code from the Python script that is used to build the model; it runs forever and accounts for virtually all of the time.

hdb = hdbscan.hdbscan_.HDBSCAN(min_cluster_size=param['min_cluster_size'],
                                       min_samples=param['min_samples'],
                                       metric=param['metric'],
                                       alpha=param['alpha'],
                                       p=param['p'],
                                       algorithm=param['algorithm'],
                                       leaf_size=param['leaf_size'],
                                       approx_min_span_tree=param['approx_min_span_tree'],
                                       cluster_selection_method=param['cluster_selection_method'],
                                       allow_single_cluster=param['allow_single_cluster'],
                                       gen_min_span_tree=param['gen_min_span_tree'],
                                       prediction_data=True).fit(X)

Below are the inputs we are passing in (assembled into the param dict in the sketch after this list):
min_cluster_size = 50
min_samples = 5
metric = euclidean
alpha = 1.0
p = 1.5
algorithm = best
leaf_size = 30
approx_min_span_tree = True
cluster_selection_method = eom
allow_single_cluster = False
gen_min_span_tree = True
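
For reference, here is a minimal sketch of these inputs assembled into the param dict that the snippet above consumes (the dict shape is an assumption inferred from the call; note that p only applies to the minkowski metric, so it should be ignored here since metric is euclidean):

param = {
    'min_cluster_size': 50,
    'min_samples': 5,
    'metric': 'euclidean',
    'alpha': 1.0,                       # distance scaling parameter
    'p': 1.5,                           # only used by the minkowski metric; ignored for euclidean
    'algorithm': 'best',                # lets hdbscan choose the implementation
    'leaf_size': 30,                    # leaf size for the underlying kd-tree/ball-tree
    'approx_min_span_tree': True,       # accept an approximate minimum spanning tree
    'cluster_selection_method': 'eom',  # excess of mass
    'allow_single_cluster': False,
    'gen_min_span_tree': True,          # keep the spanning tree for later inspection
}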

Can you help us with this? 5+ hours seems like a lot of time, and we need to optimise it.
Note: this happens only when we choose Best as the algorithm and pass 10% of the input data. With the other algorithms it finishes in a reasonable time, and if we choose Best but pass only 8% of the input data, it finishes within 2 minutes.
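
One experiment worth trying (a sketch, not a confirmed fix; X_sample below is a hypothetical stand-in for the 10% sample as a numeric array): pin algorithm to a specific space-tree implementation instead of best, so hdbscan cannot fall back to a slower code path at this data size.

import hdbscan

# Sketch: force a kd-tree Boruvka build rather than letting 'best' choose.
# 'boruvka_kdtree' is typically fast for euclidean data of moderate
# dimensionality; whether it helps on this dataset is an assumption.
hdb = hdbscan.HDBSCAN(min_cluster_size=50,
                      min_samples=5,
                      metric='euclidean',
                      algorithm='boruvka_kdtree',
                      prediction_data=True).fit(X_sample)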

@jc-healy
Collaborator

jc-healy commented Apr 22, 2024 via email

@divya-agrawal3103
Author

Hi @jc-healy Thanks for the swift response.
I am attaching the input data file here; as far as I can see, it comprises categorical columns (Gender, Marital Status) among others.
Could you please try using this input to test?
Really appreciate your time!
sample.zip

@divya-agrawal3103
Author

Hi @jc-healy We are stuck and really looking forward to any input from your side to resolve the problem.
Thank you in advance.

@jc-healy
Collaborator

jc-healy commented Apr 25, 2024

Hi there, I grabbed your data and filtered out the categorical columns (and your customer ID column) before hitting it with hdbscan, and it took 3 to 5 minutes for me to cluster the 198,000 records.

Looking at your data, I have two recommendations for clustering. First, your numeric values are on vastly different scales, so Euclidean distance over this data will be dominated by your Income column, which is on a vastly different scale than "Members Within Household". To fix that, I'd use something like a RobustScaler from sklearn.preprocessing to normalize your numeric columns. You can do fancier things, but that's a pretty solid first thing to try.

I'd also one-hot encode your two categorical fields to convert them to numeric. I'd do this in a pipeline using sklearn's OneHotEncoder. Again, you can get fancier, but this is a good start.

As general good practice, I'd suggest wrapping your preprocessing in a ColumnTransformer. It keeps track of your column transformations so they can be applied consistently to future data. Not necessary here, but still a good habit.

Here is some sample code to get you started:

import hdbscan
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

data = pd.read_csv('sample.csv')
# Drop identifier/target columns that shouldn't influence the clustering.
cluster_data = data.drop(['Customer ID', 'Campaign ID', 'Response'], axis=1)

categorical_features = ['Gender', 'Marital Status']
numeric_features = list(set(cluster_data.columns) - set(categorical_features))

# One-hot encode the categoricals and robust-scale the numerics so that no
# single column dominates the euclidean distances.
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), categorical_features),
    ('numeric', RobustScaler(), numeric_features)
], remainder='passthrough')
normalized = preprocessor.fit_transform(cluster_data)
normalized_df = pd.DataFrame(normalized, columns=preprocessor.get_feature_names_out())

model = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5).fit(normalized_df)
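
If you also need to assign new records to these clusters later (your original call set prediction_data=True), here is a minimal sketch under the same assumptions, using hdbscan's approximate_predict; new_data below is a hypothetical DataFrame of unseen rows with the same columns:

# Refit with prediction data retained so new points can be scored later.
model = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5,
                        prediction_data=True).fit(normalized_df)

# Transform unseen rows with the already-fitted preprocessor, then predict.
new_normalized = preprocessor.transform(new_data)
labels, strengths = hdbscan.approximate_predict(model, new_normalized)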

Cheers,
John

@divya-agrawal3103
Author

Hi @jc-healy
Thanks a lot for the detailed analysis.
Will try to incorporate the suggestions.
Appreciate your time.

@jc-healy
Collaborator

Closing this for now
