
Should we reduce the dimensionality of topic_model.topic_embeddings_? #1959

Open

Batchounet opened this issue Apr 30, 2024 · 2 comments

@Batchounet
Dear creator of the amazing BERTopic,

I want to compute the cosine similarity of the topic embeddings to a list of labels. I found this to perform better than zero-shot (and faster!) for my use case. However, the embeddings in topic_model.topic_embeddings_ are 384-dimensional vectors, i.e. their dimensionality is not reduced using HDBSCAN. To my understanding, the cosine similarity could suffer from the curse of dimensionality because of that. In fact, plotting the maximum cosine similarity to my list of labels might suggest that, with most topics ending up with a cosine similarity of around 0.55 to my labels:
[screenshot: max cosine similarity scores per topic]
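For reference, this is roughly what I am doing (a minimal sketch; the label list is a placeholder and topic_model is my already fitted BERTopic model):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

labels = ["finance", "sports", "politics"]          # placeholder label list
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings
label_embeddings = encoder.encode(labels)

# topic_model.topic_embeddings_ has shape (n_topics, 384)
scores = cosine_similarity(topic_model.topic_embeddings_, label_embeddings)
best_label = scores.argmax(axis=1)  # index of the most similar label per topic
best_score = scores.max(axis=1)     # max cosine similarity per topic
```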

Should I add a dimensionality reduction step? Would it be possible to use the reduced embeddings directly for the topic_model?

Again, thanks for your work.

@MaartenGr
Owner

To my understanding, the cosine similarity could suffer from the curse of dimensionality because of that.

Actually, that's not entirely the case. The curse of dimensionality does have some influence, but generally much less than with other distance measures, like Euclidean distance. There's a reason you see cosine similarity (alongside dot product) used in embedding-based computations: these measures simply work well in that setting.

For the "highest precision" I would advise not reducing the dimensionality of the embeddings when using plain cosine similarity.

However, the embeddings in topic_model.topic_embeddings_ are 384-dimensional vectors, i.e. their dimensionality is not reduced using HDBSCAN.

Note that it's UMAP reducing the embeddings, not HDBSCAN.
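For context, a rough sketch of how the pieces fit together (the parameter values and example corpus below are just illustrative):

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

# Example corpus
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

umap_model = UMAP(n_components=5, metric="cosine")  # reduces the document embeddings
hdbscan_model = HDBSCAN(min_cluster_size=10)        # clusters the *reduced* embeddings
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)

topics, probs = topic_model.fit_transform(docs)
# topic_model.topic_embeddings_ stays in the original embedding space
# (384 dimensions here), not in the 5-dimensional UMAP space.
```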

@Batchounet
Author

Thank you very much. Yes, I meant UMAP.
