Can I force approximate_predict to assign every embedding to an existing cluster? #599
I think you want to try the soft clustering options to do that.
On Wed, Jul 5, 2023 at 8:17 AM mirix wrote:
Hello,
Let me see if I am understanding things correctly.
I am reducing dimensionality with UMAP:
clusterable_embedding_large = umap.UMAP(
n_neighbors=n_neighbors,
min_dist=.0,
n_components=comp,
random_state=31416,
metric='cosine'
).fit_transform(df_dist)
Then I split the UMAP embeddings according to predefined indexes (between long and short sentences):
cel_long = clusterable_embedding_large[long_seg]
cel_shor = clusterable_embedding_large[shor_seg]
Then I cluster the long sentences only:
clusterer = hdbscan.HDBSCAN(
min_samples=1,
min_cluster_size=cluster_size,
#cluster_selection_method='eom',
cluster_selection_method='leaf',
cluster_selection_epsilon=5,
gen_min_span_tree=True,
prediction_data=True
).fit(cel_long)
Next I would like to assign each of the short sentences to one of the pre-existing clusters:
labels = list(clusterer.labels_)
labels_short, strengths = hdbscan.approximate_predict(clusterer, cel_shor)
labels_short = list(labels_short)
print(labels)
[0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
print(labels_short)
[1, -1, 0, 1, -1, -1, 0, -1, 0, -1, -1, 1, -1, 0, 1, 0, -1, 0, -1, -1, 0, -1, 2, 0, 0, 0, -1, 0, 0, -1, 0, 0, 0, 0, -1, -1, 0, 0, -1, -1, -1, -1]
However, I face two issues:
1. Some points are not assigned (label -1).
2. Some points are assigned to a new cluster which did not exist in the original clustering (label 2).
The first issue I believe I understand, but I would like to avoid it, if possible. Is it possible to force approximate_predict to assign a data point to the nearest cluster no matter what?
On the other hand, I believe that the second issue was not possible. From the docs:
With that done you can run [approximate_predict()](https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict) with the model and any new data points you wish to predict. Note that this differs from re-running HDBSCAN with the new points added since no new clusters will be considered – instead the new points will be labelled according to the clusters already labelled by the model.
Can this also be avoided?
Best,
Ed
Thanks, it seems promising. I will look into that. In the meantime, I have found a workaround: I cluster all the points together as usual. Then, for each short sentence, I compute the average distance from each cluster (excluding short sentences) and reassign if required. This seems to solve the problem on the current dataset.
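That workaround can be sketched in a few lines of NumPy. All names and data below are hypothetical: `emb` stands for the UMAP embeddings of all sentences, `labels` for a joint clustering of all points, and `is_short` marks the short sentences, which are excluded from the per-cluster distance averages.

```python
import numpy as np

# Hypothetical inputs (synthetic stand-ins for the thread's data).
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(0, 0.3, (20, 3)), rng.normal(4, 0.3, (20, 3))])
labels = np.array([0] * 20 + [1] * 20)   # labels from clustering all points
is_short = np.zeros(40, dtype=bool)      # mask of short sentences
is_short[[3, 25]] = True

cluster_ids = np.unique(labels[~is_short])
for i in np.where(is_short)[0]:
    # Average distance from this short sentence to each cluster,
    # counting only the long (non-short) members of that cluster.
    avg = [np.mean(np.linalg.norm(emb[(labels == c) & ~is_short] - emb[i],
                                  axis=1))
           for c in cluster_ids]
    labels[i] = cluster_ids[int(np.argmin(avg))]
```

Mean distance to all cluster members is one reasonable choice; minimum distance to the nearest member, or distance to a cluster medoid, would be cheaper variants of the same idea.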
In case you are interested, HDBSCAN works wonderfully for clustering speakers in a diarisation project: https://github.com/mirix/approaches-to-diarisation I am really impressed. The challenge now would be to come up with some heuristics or ML to guess the optimal parameters automatically.