vliu15/elmo-kmeans

ELMo Embeddings for Clustering

Clusters text datasets by topic using ELMo sentence embeddings. Benchmarking with NVIDIA RAPIDS cuML/pyGDF for performance is currently in progress.

Requirements:

  • Python3 (>=3.6 for AllenNLP)
  • AllenNLP
  • TensorFlow
  • NumPy
  • scikit-learn (for clustering)
  • Torch (for AllenNLP)
  • SciPy (for scikit-learn stop words)

Install

To install the necessary dependencies:

apt-get install python3-pip
pip3 install allennlp
pip3 install tensorflow-gpu
pip3 install numpy
pip3 install scikit-learn
pip3 install torch
# pip3 install scipy

Usage

To generate sentence embeddings, make sure the sentences.txt file contains one sentence (or transcription) per line:

<sentence/transcription>
<sentence/transcription>
# and so on...

Run python3 main.py with the following options:

  • --mode embed to embed the sentence file
  • --mode sif to enhance sentence embeddings with SIF
  • --mode cluster to cluster embeddings
  • --mode project to reduce dimensionality for visualization
  • --mode metadata to write metadata file
  • --mode tensorboard to create TensorBoard files
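The mode names above map naturally onto a single command-line flag. The following is a minimal sketch of how such a flag could be parsed with argparse; the function name `parse_args` and the overall structure are assumptions for illustration and are not taken from main.py.

```python
import argparse

# Mode names come from the list above; everything else is a hypothetical sketch.
MODES = ["embed", "sif", "cluster", "project", "metadata", "tensorboard"]


def parse_args(argv=None):
    """Parse the --mode flag, rejecting any value not in MODES."""
    parser = argparse.ArgumentParser(description="ELMo embeddings for clustering")
    parser.add_argument(
        "--mode",
        choices=MODES,
        required=True,
        help="pipeline stage to run",
    )
    return parser.parse_args(argv)


# Example: equivalent to running `python3 main.py --mode cluster`.
args = parse_args(["--mode", "cluster"])
print(f"Running stage: {args.mode}")
```

Restricting the flag with `choices` makes a mistyped mode fail immediately with a usage message instead of silently doing nothing.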

Adjust runtime flags in main.py.

An auxiliary script is also provided:

  • Run sh clean.sh to convert transcriptions to lower case and remove stop words
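For reference, the same cleaning step can be sketched in Python. This is not the contents of clean.sh: the function name `clean_sentence` and the tiny stop-word set are assumptions for illustration, and the real script may use a different list.

```python
# A small illustrative stop-word set (assumption); clean.sh may use another list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on"}


def clean_sentence(sentence, stop_words=STOP_WORDS):
    """Lowercase a sentence and drop stop words, preserving word order."""
    tokens = sentence.lower().split()
    return " ".join(token for token in tokens if token not in stop_words)


print(clean_sentence("The cat is on the mat"))  # prints "cat mat"
```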

Run

To run inside a Docker container:

docker build -t elmo-embeddings .
docker run -it elmo-embeddings /bin/bash
python3.6 main.py

Outputs

An output folder will be created in the current directory containing:

  • embeddings.npy: a NumPy array of sentence embeddings, saved in binary .npy format
  • embeddings_sif.npy: a NumPy array of sentence embeddings after SIF
  • embeddings_pc.npy: a NumPy array of sentence embeddings after PCA
  • embeddings_ts.npy: a NumPy array of sentence embeddings after t-SNE
  • km_labels.json: a list of cluster labels generated by KMeans
  • metadata.tsv: metadata of sentence labels for visualization
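To inspect the clustering result, the labels can be paired back with the sentences. The sketch below assumes that km_labels.json holds a flat JSON list with one label per line of sentences.txt, in the same order; the helper names and the file paths in the usage comment are hypothetical.

```python
import json
from collections import defaultdict


def group_by_cluster(sentences, labels):
    """Return a dict mapping each cluster label to its sentences."""
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    clusters = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        clusters[label].append(sentence)
    return dict(clusters)


def load_clusters(sentences_path, labels_path):
    """Read a sentence file and a JSON label list, then group them."""
    with open(sentences_path, encoding="utf-8") as f:
        sentences = [line.rstrip("\n") for line in f]
    with open(labels_path, encoding="utf-8") as f:
        labels = json.load(f)
    return group_by_cluster(sentences, labels)


# Hypothetical usage, assuming these paths exist:
# clusters = load_clusters("sentences.txt", "output/km_labels.json")
```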

Other nested output folders:

  • tensorboard: for TensorBoard output logs
  • kmeans: for sentences clustered with KMeans
  • trimmed: for another copy of embeddings/sentences with specified clusters removed
  • hierarchy: for sentences clustered with hierarchical KMeans

GPU-acceleration:

  • Run ELMo embedding on the GPU (update: speedup of as much as 4x)
  • Use NVIDIA RAPIDS cuML to GPU-accelerate clustering (speedup of ~10x)

TO-DO

  • Preprocess transcriptions
  • Embed each sentence with ELMo
  • Enhance embeddings with SIF
  • Cluster using SKLearn KMeans (optional: hierarchically)
  • Find optimal k using elbow method and silhouette scores (optional)
  • Reduce dimensionality for visualization (PCA, t-SNE)
  • Run in TensorBoard
  • Conclusions
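
The SIF step in the list above is usually described as removing the component shared by all sentence embeddings, that is, each embedding's projection onto the first singular vector of the embedding matrix. The following is a minimal NumPy sketch of that one operation, under the assumption that the embeddings form a 2-D array with one row per sentence. The function name is hypothetical and this is not the code in main.py.

```python
import numpy as np


def remove_first_component(embeddings):
    """Subtract each row's projection onto the first right singular vector."""
    # Rows of vt are the right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    first = vt[0]
    # (embeddings @ first) holds each row's coefficient along `first`.
    return embeddings - np.outer(embeddings @ first, first)


# Example with random data standing in for real sentence embeddings.
rng = np.random.default_rng(0)
demo = rng.normal(size=(5, 4)) + 3.0
print(remove_first_component(demo).shape)  # prints (5, 4)
```

The output has the same shape as the input, so it can be passed to the clustering step in place of the raw embeddings.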