
Can this work with cluster made by top2vec ? #20

Open
behrica opened this issue May 25, 2021 · 17 comments

Comments

@behrica

behrica commented May 25, 2021

Thanks for your interesting package.

Do you think Clustergram could work with top2vec ?
https://github.com/ddangelov/Top2Vec

I saw that there is the option to create a clustergram from a DataFrame.

In top2vec, each "document" to be clustered is represented as an embedding of a certain dimension, for example 256.

So I could indeed generate a data frame like this:

x0   x1   ...  x255  topic
0.5  0.2  ...  -0.2  2
0.7  0.2  ...  -0.1  2
0.5  0.2  ...  -0.2  3

Does Clustergram assume anything about the rows of this data frame?
I saw that the from_data method takes either "mean" or "median" as the method to calculate the cluster centers.

With word vectors, we typically use the cosine distance to calculate distances between the vectors. Does this have any influence?

top2vec also calculates the "topic vectors" as a mean of the "document vectors", I believe.
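To make the setup concrete, here is a minimal sketch of such a data frame, with the "topic vectors" computed as per-topic means of the document vectors (toy dimensions and random values; the variable names are purely illustrative, and whether this frame can be passed straight to Clustergram.from_data should be checked against the clustergram documentation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for top2vec document embeddings: 6 documents, 4 dimensions
# (256 in the real case), plus the topic each document was assigned to.
docs = pd.DataFrame(rng.normal(size=(6, 4)), columns=[f"x{i}" for i in range(4)])
docs["topic"] = [2, 2, 3, 3, 2, 3]

# top2vec's "topic vectors": the mean of the document vectors per topic.
topic_vectors = docs.groupby("topic").mean()
print(topic_vectors.shape)  # (2, 4): one row per topic, one column per dimension
```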

@martinfleis
Owner

If I understand correctly, columns x0 ... x255 are the input data while topic is the resulting cluster label? Then you should be able to use Clustergram.from_data.

Assuming you have different versions of the topic result, you need to create a df with as many topic columns as you have results (ideally sorted).

With word vectors, we typically use the cosine distance to calculate distances between the vectors. Does this have any influence?

If that means that the resulting cluster centers are not the mean/median of the values, then yes. If you know them, you can use the Clustergram.from_centers method instead to pass them directly.

If you can provide some minimal example, I can try to work it out.

Also note that both from_data and from_centers may be buggy :). Worth playing with them to catch and fix bugs, though.

@behrica
Author

behrica commented May 26, 2021

I indeed have the cluster centers and am trying to use the from_centers method.

I think I could easily construct the cluster_centers dictionary, but I have no idea what the labels data frame should contain.

Let's assume I have cluster centers with 10 dimensions, for 1, 2, and 3 clusters.

So the cluster centers dictionary should be

{
    1: [[0, 0, 1, 3, 0, 5, 3, 2, 7, 8]],
    2: [[1, 0, 1, 3, 0, 5, 3, 2, 3, 8],
        [4, 0, 5, 3, 7, 5, 3, 2, 9, 8]],
    3: [[0, 0, 1, 3, 0, 5, 3, 2, 7, 8],
        [7, 1, 1, 3, 0, 5, 3, 2, 0, 8],
        [0, 0, 5, 3, 0, 5, 3, 2, 7, 8]]
}

correct ?

@behrica
Author

behrica commented May 26, 2021

But I cannot see what the labels dataframe should be in this case.
Do we still need the original data, as depicted above, as input to from_centers in some way?
(I use 10 dimensions now, but the same applies to 256 dimensions.)

@martinfleis
Owner

The labels dataframe contains the labelling of individual observations from different clustering options. So in the most typical case of K-Means run for between 2 and 5 clusters, the first column contains labels for k=2, the second for k=3, the third for k=4, and the fourth for k=5.

              k=2 k=3 k=4 k=5
observation_1   1   1   3   0
observation_2   0   0   1   4
observation_3   1   2   2   2

Assuming you have a similar option in top2vec, the first column will contain labels for result A, the second for result B... From quickly looking at the code, I guess that your options will be based on different values of min_count?
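The table above can be built directly as a pandas DataFrame (a sketch; the index and column names are purely illustrative):

```python
import pandas as pd

# One row per observation, one column per clustering option; each value is
# the cluster the observation falls into under that option.
labels = pd.DataFrame(
    {
        "k=2": [1, 0, 1],
        "k=3": [1, 0, 2],
        "k=4": [3, 1, 2],
        "k=5": [0, 4, 2],
    },
    index=["observation_1", "observation_2", "observation_3"],
)
print(labels)
```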

@martinfleis
Owner

The cluster centers dict above looks alright. You may just need to wrap each entry into a numpy array to get something like this:

centers = {
             1: np.array([[0, 0]]),
             2: np.array([[-1, -1], [1, 1]]),
             3: np.array([[-1, -1], [1, 1], [0, 0]]),
          }
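Putting the two pieces together, a quick consistency check can catch mismatches between the centers dict and the labels table early. The final call is left as a comment, since the exact from_centers signature should be verified against the clustergram documentation, and the label values here are made up:

```python
import numpy as np
import pandas as pd

# key = number of clusters, value = (k, n_dims) array of cluster centers
centers = {
    1: np.array([[0, 0]]),
    2: np.array([[-1, -1], [1, 1]]),
    3: np.array([[-1, -1], [1, 1], [0, 0]]),
}

# Matching labels: one row per observation, one column per clustering option.
labels = pd.DataFrame({1: [0, 0, 0, 0], 2: [0, 1, 1, 0], 3: [0, 2, 2, 1]})

# Every label used in column k must index a row of centers[k].
for k, arr in centers.items():
    assert arr.shape[0] == k, f"expected {k} centers, got {arr.shape[0]}"
    assert labels[k].max() < k, f"label out of range for k={k}"

# Hypothetical call, to be checked against the actual clustergram API:
# cgram = Clustergram.from_centers(centers, labels)
```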

@behrica
Author

behrica commented May 26, 2021

The labels dataframe contains the labelling of individual observations from different clustering options. So in the most typical case of K-Means run for between 2 and 5 clusters, the first column contains labels for k=2, the second for k=3, the third for k=4, and the fourth for k=5.

              k=2 k=3 k=4 k=5
observation_1   1   1   3   0
observation_2   0   0   1   4
observation_3   1   2   2   2

Assuming you have a similar option in top2vec, the first column will contain labels for result A, the second for result B... From quickly looking at the code, I guess that your options will be based on different values of min_count?

By "label" you mean "which cluster" ?
So I read the table above as:

"In the situation of 2 clusters, observation_1 was in cluster 1, observation_2 in cluster 0, and observation_3 in cluster 1."
...
"In the situation of 5 clusters, observation_1 was in cluster 0, observation_2 in cluster 4, and observation_3 in cluster 2."

So the table has one row for each observation, correct ?

@martinfleis
Owner

Yes, precisely.

@behrica
Author

behrica commented May 26, 2021

Ok, I will give it a try.

I use top2vec to cluster 55000 documents.

The initial run of top2vec created 401 clusters, which I can "reduce to any size", which I would then do step by step, going from 401 down to 0.

So my labels table would be big...

55000 × 401

Do you think it makes any sense to create a clustergram as big as this?
Our final goal is obviously to find the "best cluster size" from the clustergram...
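For a rough sense of scale (a back-of-the-envelope sketch only, not a statement about clustergram's actual memory use): a 55000 × 401 table of integer labels is large but still quite tractable in memory.

```python
import numpy as np

n_docs, n_options = 55_000, 401  # 55k documents, 401 clustering options

# int32 comfortably holds labels for up to 401 clusters.
labels = np.zeros((n_docs, n_options), dtype=np.int32)

print(labels.nbytes / 1e6)  # 88.22 (MB)
```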

@martinfleis
Owner

Clustergram itself should handle it, but keep in mind that you'll need to be able to interpret it. The new interactive exploration can help you with that, but it is still a lot of options to look at. Is there really no assumption about the data? I.e., you normally know whether you're looking for 5, 25, or 150 clusters.

@behrica
Author

behrica commented May 26, 2021

As we deal with text (55000 scientific paper abstracts) and word/paragraph vectors, any mathematical assumptions are very difficult.
The vector representation of the text is so far removed from the text itself that this is very tricky.
The notion of "how many topics are present in a given text corpus" is not well defined; it is a continuum.

So frankly, we don't have a clue how many clusters to expect.

top2vec does something sensible and chooses a certain number of topics automatically by some internal criteria.
That's one reason why we like the top2vec approach.

@martinfleis
Owner

In that case, I'd suggest trying to get the maximum out of the bokeh() visualisation of the clustergram, so you can explore different parts of it.

@martinfleis
Owner

@behrica did you manage to make it work, by any chance?

@doubianimehdi

@behrica I have the same goal as you, but I'm using BERTopic ... I would be interested in seeing what you did, if you managed to use it.

@martinfleis
Owner

@doubianimehdi Can you share a reproducible example of your problem, so I can try playing with it and figure out a solution?

@doubianimehdi

I have an hdbscan method with the cluster information, but I don't know how to use it in clustergram ...

@martinfleis
Owner

If you can share the code and some sample data so I can reproduce what you're doing, I can have a look at a way of using the result within a clustergram. You can check this guide on how to prepare such an example - https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@behrica
Author

behrica commented Nov 19, 2021

@martinfleis @doubianimehdi
Our overall goal was to do (automatic) hyperparameter optimisation with top2vec.
The top2vec code does not come with an implementation of a metric, so I was exploring some other forms of "cluster evaluation" and landed here.

In the meanwhile we found an implementation of a metric, so I did not explore the usage of clustergram for top2vec any further.


3 participants