HDBScan performance issue when choosing Best algorithm #630

Closed
divya-agrawal3103 opened this issue Apr 22, 2024 · 6 comments

@divya-agrawal3103

Hi,

I am attempting to execute a stream that uses the HDBSCAN clustering algorithm on a set of input data to generate a model.
When I select the algorithm Best and randomly pass in 10% of the total input data (the input is a CSV file with 15 columns and ~169,379 rows), the stream executes and never finishes; I tracked it for 5 hrs 9 mins and then had to stop.

This is the piece of code from the Python script that is used to build the model; it runs forever and accounts for virtually all of the time.

hdb = hdbscan.hdbscan_.HDBSCAN(min_cluster_size=param['min_cluster_size'],
                                       min_samples=param['min_samples'],
                                       metric=param['metric'],
                                       alpha=param['alpha'],
                                       p=param['p'],
                                       algorithm=param['algorithm'],
                                       leaf_size=param['leaf_size'],
                                       approx_min_span_tree=param['approx_min_span_tree'],
                                       cluster_selection_method=param['cluster_selection_method'],
                                       allow_single_cluster=param['allow_single_cluster'],
                                       gen_min_span_tree=param['gen_min_span_tree'],
                                       prediction_data=True).fit(X)

Below are the inputs we are passing in (assembled into the param dict in the sketch after this list):
min_cluster_size = 50
min_samples = 5
metric = euclidean
alpha = 1.0
p = 1.5
algorithm = best
leaf_size = 30
approx_min_span_tree = True
cluster_selection_method = eom
allow_single_cluster = False
gen_min_span_tree = True
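
For reference, here is a minimal sketch of these inputs assembled into the param dict that the snippet above consumes (the dict shape is an assumption inferred from the call; note that p only applies to the minkowski metric, so it should be ignored here since metric is euclidean):

param = {
    'min_cluster_size': 50,
    'min_samples': 5,
    'metric': 'euclidean',
    'alpha': 1.0,                       # distance scaling parameter
    'p': 1.5,                           # only used by the minkowski metric; ignored for euclidean
    'algorithm': 'best',                # lets hdbscan choose the implementation
    'leaf_size': 30,                    # leaf size for the underlying kd-tree/ball-tree
    'approx_min_span_tree': True,       # accept an approximate minimum spanning tree
    'cluster_selection_method': 'eom',  # excess of mass
    'allow_single_cluster': False,
    'gen_min_span_tree': True,          # keep the spanning tree for later inspection
}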

Can you help us with this? 5+ hours seems like a lot of time, and we need to optimise it.
Note: this happens only when we choose Best as the algorithm and pass 10% of the input data. With the other algorithms it finishes in a reasonable time, and if we choose Best but pass only 8% of the input data, it finishes within 2 minutes.
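
One experiment worth trying (a sketch, not a confirmed fix; X_sample below is a hypothetical stand-in for the 10% sample as a numeric array): pin algorithm to a specific space-tree implementation instead of best, so hdbscan cannot fall back to a slower code path at this data size.

import hdbscan

# Sketch: force a kd-tree Boruvka build rather than letting 'best' choose.
# 'boruvka_kdtree' is typically fast for euclidean data of moderate
# dimensionality; whether it helps on this dataset is an assumption.
hdb = hdbscan.HDBSCAN(min_cluster_size=50,
                      min_samples=5,
                      metric='euclidean',
                      algorithm='boruvka_kdtree',
                      prediction_data=True).fit(X_sample)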

@jc-healy
Collaborator

jc-healy commented Apr 22, 2024 via email

@divya-agrawal3103
Author

Hi @jc-healy Thanks for the swift response.
I am attaching the input data file here; as far as I can see, it comprises categorical columns (Gender, Marital Status) among others.
Could you please try using this input to test?
Really appreciate your time!
sample.zip

@divya-agrawal3103
Author

Hi @jc-healy We are stuck and really looking forward to any input from your side to resolve the problem.
Thank you in advance.

@jc-healy
Collaborator

jc-healy commented Apr 25, 2024

Hi there, I grabbed your data and filtered out the categorical columns (and your customer ID column) before hitting it with hdbscan, and it took 3 to 5 minutes for me to cluster the 198,000 records.

Looking at your data, I have two recommendations for clustering. First, your numeric values are on vastly different scales, so Euclidean distance over this data will be dominated by your Income column, which is on a vastly different scale than "Members Within Household". To fix that, I'd use something like a RobustScaler from sklearn.preprocessing to normalize your numeric columns. You can do fancier things, but that's a pretty solid first thing to try.

I'd also one-hot encode your two categorical fields to convert them to numeric. I'd do this in a pipeline using sklearn's OneHotEncoder. Again, you can get fancier, but this is a good start.

As general good practice, I'd suggest wrapping your preprocessing in a ColumnTransformer. It keeps track of your column transformations so they can be applied consistently to future data. Not necessary here, but still a good habit.

Here is some sample code to get you started:

import hdbscan
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

data = pd.read_csv('sample.csv')
# Drop identifier/target columns that shouldn't influence the clustering.
cluster_data = data.drop(['Customer ID', 'Campaign ID', 'Response'], axis=1)

categorical_features = ['Gender', 'Marital Status']
numeric_features = list(set(cluster_data.columns) - set(categorical_features))

# One-hot encode the categoricals and robust-scale the numerics so that no
# single column dominates the euclidean distances.
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), categorical_features),
    ('numeric', RobustScaler(), numeric_features)
], remainder='passthrough')
normalized = preprocessor.fit_transform(cluster_data)
normalized_df = pd.DataFrame(normalized, columns=preprocessor.get_feature_names_out())

model = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5).fit(normalized_df)
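
If you also need to assign new records to these clusters later (your original call set prediction_data=True), here is a minimal sketch under the same assumptions, using hdbscan's approximate_predict; new_data below is a hypothetical DataFrame of unseen rows with the same columns:

# Refit with prediction data retained so new points can be scored later.
model = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5,
                        prediction_data=True).fit(normalized_df)

# Transform unseen rows with the already-fitted preprocessor, then predict.
new_normalized = preprocessor.transform(new_data)
labels, strengths = hdbscan.approximate_predict(model, new_normalized)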

Cheers,
John

@divya-agrawal3103
Author

Hi @jc-healy
Thanks a lot for the detailed analysis.
Will try to incorporate the suggestions.
Appreciate your time.

@jc-healy
Collaborator

Closing this for now
