Skip to content

mitodl/semantic-mitopen

 
 

Repository files navigation

MIT OpenCourseware GPT

AI-powered search and chat for MIT OpenCourseware.

How It Works

MIT OpenCourseware GPT provides 2 things:

  1. A search interface.
  2. A chat interface.

Search

Search was created with OpenAI Embeddings (text-embedding-ada-002).

First, we looped over OCW content (primarily PDF's) and generated embeddings for each chunk of text in the page, saving them to a postgres database.

Then in the app we take the user's search query, generate an embedding, and use the result to find the pages that contain similar content

The comparison is done using cosine similarity across our database of vectors.

Results are then ranked by similarity score and returned to the user.

Chat

Chat builds on top of search. It uses search results to create a prompt that is fed into GPT-3.5-turbo.

This allows for a chat-like experience where the user can ask questions and get answers based on information from OCW courses.

Running Locally

Here's a quick overview of how to run it locally.

Requirements

  1. Set up OpenAI

You'll need an OpenAI API key to generate embeddings (locally).

  1. Set up a local image of PostgreSQL (I recommend the pgvector docker image)

There is a setup.sql file in the root of the repo that you can use to set up the database.

Run that in a SQL editor.

Note: Or, connect to any PostgreSQL server using the env variables defined below

Repo Setup

  1. Clone repo
git clone https://github.com/mitodl/semantic-mitopen.git
  1. Set up environment variables

Create a .env file in the root of the repo folder with the following variables:

OPENAI_API_KEY = # Your OpenAI API key
# Connection info for the Postgres database that will store text chunks and embeddings
POSTGRES_HOST =
POSTGRES_DB_NAME =
POSTGRES_USERNAME =
POSTGRES_TABLE_NAME = #if you used setup.sql, this should be "mit_open_chunks"
POSTGRES_SEARCH_FUNCTION = #if you used setup.sql, this should be "mit_open_gpt_search"
POSTGRES_PASSWORD =
# Connection info for the Postgres database from which MIT Open content will be retrieved
OPEN_POSTGRES_HOST=
OPEN_POSTGRES_DB_NAME=
OPEN_POSTGRES_USERNAME=
OPEN_POSTGRES_PASSWORD=

Dataset

  1. Run parsing script
docker-compose run --rm web python3 data/ocw-upload.py

App

  1. Run entire app
docker-compose up

Credits

Thanks to Alex Sima who developed the project on which this is heavily based.

About

Codebase for semantic search of MIT Open Courseware (AI-powered Search and Chat for MIT Open Courseware)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • TypeScript 64.3%
  • Python 26.1%
  • JavaScript 3.2%
  • PLpgSQL 2.9%
  • HTML 1.6%
  • CSS 1.5%
  • Dockerfile 0.4%