Skip to content

b7leung/Chat-Log-Statistical-Linguistic-Analysis

Repository files navigation

Introduction

With the advent of deep learning, new significant advances are being made at a rapid pace in the natural language processing (NLP) literature. However, for smaller companies, it can be difficult to integrate these new technologies without spending substantial resources on research and development. In our project we developed a Chat Log Language Analysis Suite to enable companies to readily understand the linguistic patterns of their user base. Many applications are integrated with social chatting systems, including video games, dating apps, and social media. With our product, even verbose chatters with thousands of logged chat messages can be summarized, evaluated, and compared with other users at a glance. Then, downstream use cases include flagging/suspending/banning toxic users, recommending advertisements or posts to users, and even using their “virtual” chatbot counterpart to predict their behavior to new inputs.

A presentation for this project can be found here.

Documentation & Testing Framework

Our documentation is located under "Docs/_build/html". Note that contents under nlp_suite/chatbot/style_transfer_paraphrase are neither tested nor documented, as we did not develop that code. For more information, please visit that project.

All tests can be run by:
pytest test_chat_log_suite.py

If you don't have a GPU, you exclude the tests which require one:
pytest -m "not chatbot_gpu" test_chat_log_suite.py

When running coverage with all tests, using coverage run --source=. -m pytest test_chat_log_suite.py we currently achieve a 84% code coverage:
image

One can also run coverage without the GPU tests with
coverage run --source=. -m pytest -m "not chatbot_gpu" test_chat_log_suite.py,
but this will result in a lower coverage.

Data

Full Dataset

The full dataset that we used is located at:
https://www.kaggle.com/jef1056/discord-data

This dataset was generated from scraping chat logs from many public discord servers, and contains about 110 million chat messages.

Example Inputs

The following are some example inputs containing chat logs that can be uploaded to our NLP suite dashboard.

Cached Data

The following are pretrained style transfer transformer weights for the chatbot. They can be placed in the "cached_user_data" directory.

Deployment Setup

With Chatbot

The NLP suite, with chatbot functionality, requires at least 6 GB of GPU memory. It can be set up is as follows.

  • Launch an AWS EC2 instance (we used a p2.xlarge instance with the "Ubuntu Server 18.04 LTS (HVM), SSD Volume Type" AMI), and SSH into it.
  • Install NVIDIA drivers
  • Install Anaconda
  • Clone this repo, and create a conda environment with the required dependencies:
    conda env create --file deployment/nlp_suite_conda_env.yaml
  • Under the created conda environment, run a Jupyter Notebook, or alternatively, Voila:
    jupyter notebook --no-browser --port=5666 --NotebookApp.token='' --NotebookApp.password='' --ip='0.0.0.0' --allow-root
    or
    voila dashboard.ipynb --no-browser --port 5666
  • Port forward from your EC2 instance to your local PC, in a linux or WSL terminal:
    ssh -i path/to/private_key.pem -N -L 8081:localhost:5666 ubuntu@instance_ip_address
  • Now, you can view the NLP suite at http://localhost:8081/ on your local PC.

Without Chatbot

The NLP suite can also be run on an instance without a GPU (say, t2.micro) but there will be no chatbot functionality. To do this, simply follow the instructions in the "with chatbot" case above, but omit installing NVIDIA drivers.

About

Interactive dashboard to analyze chat logs with NLP transformer models, providing sentiment analysis, clustering, style transfer, & generation.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published