SOTA Weight-only Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs".
A Python package that extends the official PyTorch with optimizations for extra performance on Intel platforms
Unify Efficient Fine-Tuning of 100+ LLMs
Self-Created Tools to convert ONNX files (NCHW) to TensorFlow/TFLite/Keras format (NHWC). The purpose of this tool is to solve the massive Transpose extrapolation problem in onnx-tensorflow (onnx-tf). I don't need a Star, but give me a pull request.
🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
🤗 Optimum Intel: Accelerate inference with Intel optimization tools
Dataflow compiler for QNN inference on FPGAs
Neural Network Compression Framework for enhanced OpenVINO™ inference
Model Compression Toolkit (MCT) is an open-source project for optimizing neural network models for efficient deployment on constrained hardware. It provides researchers, developers, and engineers with advanced quantization and compression tools for deploying state-of-the-art neural networks.
Official implementation of Half-Quadratic Quantization (HQQ)
Fast inference engine for Transformer models
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
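The simplest of the listed schemes, symmetric per-tensor INT8, can be sketched in pure Python; the function names below are illustrative, and real toolkits add per-channel scales, calibration, and packed storage:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization (minimal sketch).

    The scale maps the largest-magnitude weight to 127; each weight is
    then rounded to the nearest integer and clamped to [-127, 127].
    """
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from INT8 codes."""
    return [qi * scale for qi in q]
```

Lower-bit formats (INT4/NF4) follow the same quantize/dequantize pattern with fewer levels and, for NF4, a non-uniform codebook.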
Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
TinyChatEngine: On-Device LLM Inference Library
List of papers related to neural network quantization in recent AI conferences and journals.
Brevitas: neural network quantization in PyTorch
Faster Whisper transcription with CTranslate2
Lightweight Python interface to PIL, libimagequant, and pngquant, with automatic library look-up.
Extremely fast color quantization. Reduce color information of a 24-bit RGB bitmap down to 8-bit.
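Reducing 24-bit RGB to 8-bit can be sketched with a fixed 3-3-2 bit split (3 bits red, 3 green, 2 blue); this is a minimal illustration only, whereas the libraries above compute adaptive palettes that fit the image far better:

```python
def quantize_332(r, g, b):
    """Map a 24-bit RGB color onto one byte: 3 bits red, 3 green, 2 blue."""
    return (r >> 5) << 5 | (g >> 5) << 2 | (b >> 6)

def dequantize_332(c):
    """Expand an 8-bit 3-3-2 color back to approximate 24-bit RGB."""
    r = (c >> 5) & 0b111
    g = (c >> 2) & 0b111
    b = c & 0b11
    # rescale each field to the full 0..255 range
    return (r * 255 // 7, g * 255 // 7, b * 255 // 3)
```

Black and white survive the round trip exactly; intermediate colors land on the nearest of the 256 fixed palette entries.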