HDBScan performance issue when choosing Best algorithm #630
Comments
That is indeed strange behaviour. By 10% of input data I presume you mean
clustering ~16,000 points that are 15-dimensional. If so, 5+ hours is
remarkably slow, and even two minutes is a bit slow. I can cluster 16,000
15-dimensional points with your parameters in about 4 seconds (TruncatedSVD
to 15 dimensions on top of MNIST). For scaling context, I can handle 70,000
15-dimensional points in about 30 seconds.
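For reference, a minimal sketch of that kind of benchmark (assuming MNIST is fetched via OpenML; the exact pipeline behind the timing above is not shown in the thread):

import hdbscan
from sklearn.datasets import fetch_openml
from sklearn.decomposition import TruncatedSVD

# Fetch the 70,000 MNIST digits as a 70000 x 784 numeric array.
X, _ = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Reduce to 15 dimensions, matching the dimensionality discussed above.
X15 = TruncatedSVD(n_components=15).fit_transform(X)

# Cluster with the headline parameters from the issue.
model = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5,
                        metric='euclidean').fit(X15)
print(model.labels_.max() + 1)  # number of clusters found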
My best guess is that there is something strange going on with your data
being loaded from your CSV. Is it properly numeric data? Or do you have 15
string columns that are being loaded as categorical values and being
transformed via a one-hot encoder or some such thing? Have you loaded it
into a numpy array?
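A quick way to check this (a sketch; 'data.csv' is a placeholder, since the thread does not name the actual file):

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')  # placeholder path
print(df.dtypes)              # every column used for clustering should be numeric

# This raises a ValueError if any non-numeric values slipped in.
X = df.to_numpy(dtype=np.float64)
print(X.shape)                # expect (n_rows, 15)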
As an aside, I think your parameter of p=1.5 is being ignored. It is a
parameter for the Minkowski distance, so it has no effect when your
metric='euclidean'.
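To illustrate (a minimal sketch, not from the thread): the fractional p only takes effect if the Minkowski metric is selected explicitly:

import hdbscan

# p is only consulted for the Minkowski metric ...
clusterer = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5,
                            metric='minkowski', p=1.5)

# ... whereas with metric='euclidean' the p argument is silently unused.
clusterer_euc = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5,
                                metric='euclidean')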
On Mon, Apr 22, 2024 at 5:30 AM, divya-agrawal3103 wrote:
Hi,
I am attempting to execute a stream that uses the HDBScan clustering
algorithm on a set of input data to generate a model.
When I select the algorithm as Best and randomly pass 10% of the total
input data (the input data is a csv file with 15 columns and ~169,379
rows), the stream never finishes; I tracked it to 5 hrs 9 mins and then
had to stop.
This is the piece of code from the python script that is being used to
build the model, and it runs forever, taking all the time:
hdb = hdbscan.hdbscan_.HDBSCAN(min_cluster_size=param['min_cluster_size'],
                               min_samples=param['min_samples'],
                               metric=param['metric'],
                               alpha=param['alpha'],
                               p=param['p'],
                               algorithm=param['algorithm'],
                               leaf_size=param['leaf_size'],
                               approx_min_span_tree=param['approx_min_span_tree'],
                               cluster_selection_method=param['cluster_selection_method'],
                               allow_single_cluster=param['allow_single_cluster'],
                               gen_min_span_tree=param['gen_min_span_tree'],
                               prediction_data=True).fit(X)
Below are the inputs we are feeding:
min_cluster_size = 50
min_samples = 5
metric = euclidean
alpha = 1.0
p = 1.5
algorithm = best
leaf_size = 30
approx_min_span_tree = True
cluster_selection_method = eom
allow_single_cluster = False
gen_min_span_tree = True
Can you help us with this? 5+ hours seems to be a lot of time; we need to
optimise it.
Note: this happens only when we choose Best as the algorithm and pass 10%
of the input data. With the other algorithms it finishes in reasonable
time, and if we choose Best but pass only 8% of the input data, it
finishes within 2 minutes.
Hi @jc-healy, thanks for the swift response.
Hi @jc-healy, we are stuck and really looking forward to any inputs from your side to resolve the problem.
Hi there, I grabbed your data and filtered out the categorical columns (and your customer ID column) before hitting it with hdbscan, and it took 3 to 5 minutes for me to cluster the 198,000 records.

Looking at your data, I have two recommendations for clustering. First, your numeric values are on vastly different scales, so Euclidean distance over this data will be dominated by your Income column, which is on a vastly different scale than "Members Within Household". To fix that, I'd use something like a RobustScaler from sklearn.preprocessing to normalize your numeric columns. You can do fancier things, but that's a pretty solid first thing to try.

I'd also one-hot encode your two categorical fields to convert them to numeric. I'd do this in a pipeline using sklearn's OneHotEncoder. Again, you can get fancier, but this is a good start.

As general good practice, I'd suggest wrapping your preprocessing in a ColumnTransformer. That is a good habit for keeping track of your column transformations so they can be consistently applied to future data. Not necessary here, but still worthwhile.

Here is some sample code to get you started:

import hdbscan
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

# Load the data and drop the identifier/target columns that should not
# influence the clustering.
data = pd.read_csv('sample.csv')
cluster_data = data.drop(['Customer ID', 'Campaign ID', 'Response'], axis=1)

categorical_features = ['Gender', 'Marital Status']
numeric_features = list(set(cluster_data.columns) - set(categorical_features))

# One-hot encode the categorical fields and robustly scale the numeric ones.
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), categorical_features),
    ('numeric', RobustScaler(), numeric_features)
], remainder='passthrough')

normalized = preprocessor.fit_transform(cluster_data)
normalized_df = pd.DataFrame(normalized,
                             columns=preprocessor.get_feature_names_out())

model = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5).fit(normalized_df)

Cheers,
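As a possible follow-up to the original report (an illustrative sketch, not something suggested in the thread): since the hang only appeared with algorithm='best', pinning one of hdbscan's documented algorithm options explicitly is one way to keep the runtime predictable while investigating; 'boruvka_kdtree' is a reasonable choice for low-dimensional euclidean data:

import hdbscan

# Reusing normalized_df from the snippet above. algorithm='best' asks
# hdbscan to pick a backend heuristically; naming one explicitly avoids
# that choice and makes run times easier to compare across subsamples.
model = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5,
                        metric='euclidean',
                        algorithm='boruvka_kdtree').fit(normalized_df)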
Hi @jc-healy
Closing this for now |