
Clustering similar words based on word embeddings

@valearna valearna released this 18 Aug 22:04
· 17 commits to master since this release

This release adds some new features to the wormicloud interface:

  • Clustering similar words: builds a similarity graph of words using a pre-trained word embedding model. Each node is a word, and each edge connecting a pair of words is weighted by their similarity. A minimum similarity value can be specified to avoid a fully connected graph. A standard network clustering algorithm is then applied to the graph to identify groups of similar words, and the words in the word cloud are colored according to their cluster membership. An additional option lets the user show only the most important words in each cluster (based on their counts) and hide the others. The new "download clustering info" button generates a file listing the words in each cluster. To keep the word graph small and the computation of edges fast, only the top 500 words by count are clustered. The remaining words are not clustered and appear under the 'not clustered' label in the clustering info file.
  • Show number of curated objects per paper: uses the WormBase API to retrieve the number of objects already curated for each paper returned by a search and displays the count in the reference list. This option is disabled by default and can be enabled through the 'advanced options' menu.
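
The clustering feature described above can be illustrated with a minimal Python sketch. This is not the Wormicloud implementation: the function name `cluster_words`, its parameters, and the toy two-dimensional embeddings are hypothetical, and connected components stand in for the "standard network clustering" step, which the release notes do not name. Similarity is assumed to be cosine similarity between embedding vectors.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(counts, embeddings, min_similarity=0.8, top_n=500):
    """Group similar words; returns (clusters, not_clustered).

    counts: word -> count in the word cloud (hypothetical input shape)
    embeddings: word -> vector from a pre-trained model (toy vectors here)
    """
    # Only the top_n most frequent words that have an embedding are clustered.
    top_words = [w for w, _ in Counter(counts).most_common(top_n)
                 if w in embeddings]
    selected = set(top_words)
    not_clustered = [w for w in counts if w not in selected]

    # Similarity graph: keep an edge only if it reaches the minimum similarity,
    # which avoids a fully connected graph.
    adjacency = {w: set() for w in top_words}
    for i, a in enumerate(top_words):
        for b in top_words[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= min_similarity:
                adjacency[a].add(b)
                adjacency[b].add(a)

    # Stand-in clustering step (assumption): connected components via DFS.
    clusters, seen = [], set()
    for word in top_words:
        if word in seen:
            continue
        seen.add(word)
        stack, component = [word], []
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        # Most frequent words first, so a cluster can be truncated for display.
        clusters.append(sorted(component, key=lambda w: -counts[w]))
    return clusters, not_clustered


# Toy data, invented purely for illustration.
toy_counts = {"gene": 10, "protein": 8, "neuron": 6, "axon": 5, "rare": 1}
toy_embeddings = {
    "gene": [1.0, 0.1], "protein": [0.9, 0.2],
    "neuron": [0.1, 1.0], "axon": [0.2, 0.9], "rare": [0.7, 0.7],
}
clusters, not_clustered = cluster_words(toy_counts, toy_embeddings, top_n=4)
```

With `top_n=4`, the least frequent word falls outside the clustered set and ends up in `not_clustered`, mirroring the 'not clustered' label in the clustering info file. Because each cluster is sorted by count, showing only the most important words in a cluster amounts to taking a prefix of its list.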

In addition, the code has been cleaned up by switching from React classes to React hooks, and components are now better organized.