tokenizer
A grammar describes the syntax of a programming language, and might be defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes tokens and builds a data structure such as an abstract syntax tree (AST). The parser is concerned with structure: does the sequence of tokens fit the grammar? A compiler's front end combines a lexer and a parser built for a specific grammar; later stages then translate the resulting tree into target code.
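To make the lexer/parser split concrete, here is a minimal lexer sketch in Python (all names are illustrative, not taken from any repository on this page). It turns an arithmetic expression into (kind, text) tokens; a parser would then check that this token sequence fits the grammar and build an AST from it.

```python
import re

# Token kinds and the regex that matches each; order matters for ties.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),       # integer literals
    ("OP",     r"[+\-*/]"),   # arithmetic operators
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),       # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in TOKEN_SPEC))

def tokenize(text):
    """Lexical analysis: turn raw text into a list of (kind, lexeme) tokens."""
    tokens = []
    for match in MASTER.finditer(text):
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(tokenize("12 + 3 * 4"))
# [('NUMBER', '12'), ('OP', '+'), ('NUMBER', '3'), ('OP', '*'), ('NUMBER', '4')]
```

Note that the lexer knows nothing about operator precedence or balanced parentheses; that structural knowledge belongs to the parser.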
Here are 1,086 public repositories matching this topic...
- DOM-aware tokenization for Hugging Face language models (HTML, updated Jun 1, 2024)
- Sentiment analysis models using NLP, other NLP fundamentals including subwords, and a song lyric generator (Jupyter Notebook, updated Jun 1, 2024)
- ⛄ Possibly the smallest Lua compiler ever (Lua, updated May 31, 2024)
- Tools and resources for the computational processing of Nheengatu (Modern Tupi) (Python, updated May 31, 2024)
- Simple multilingual lemmatizer for Python, built for speed and efficiency (Python, updated May 31, 2024)
- Web tool to count LLM tokens (GPT, Claude, Llama, ...) (TypeScript, updated May 31, 2024)
- Oxide is a hybrid database and streaming messaging system (think Kafka + MySQL), supporting data access via REST and SQL (Rust, updated May 31, 2024)
- Retro-style tokenization for language models (Python, updated May 30, 2024)
- [READ ONLY] Locate available classes by parent, interface, or trait. Subtree split of the Spiral Tokenizer component (see spiral/framework) (PHP, updated May 30, 2024)
- Byte-Pair Encoding tokenizer for large language models (Python, updated May 30, 2024)
- 🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer (Rust, updated May 31, 2024)
- 🎤 vibrato: Viterbi-based accelerated tokenizer (Rust, updated May 30, 2024)