A high-throughput and memory-efficient inference and serving engine for LLMs (Python · 65.2k stars · 11.9k forks)
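As a quick illustration of the engine's offline API, here is a minimal sketch using vLLM's documented `LLM` and `SamplingParams` entry points; the model name is just an example:

```python
from vllm import LLM, SamplingParams

# Load a model and generate with simple sampling settings.
llm = LLM(model="facebook/opt-125m")  # example model; any supported HF model works
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
for output in outputs:
    print(output.outputs[0].text)
```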
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM (Python · 2.4k stars · 315 forks)
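A minimal sketch of the library's one-shot quantization flow, assuming recent import paths (these have shifted between releases, so check current docs); the model, dataset, and output directory are illustrative:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Apply 4-bit-weight (W4A16) GPTQ quantization in a single calibration pass,
# keeping the lm_head in full precision.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # example model
    dataset="open_platypus",                     # example calibration dataset
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    output_dir="TinyLlama-1.1B-W4A16",           # resulting checkpoint loads in vLLM
    max_seq_length=2048,
    num_calibration_samples=512,
)
```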
Common recipes to run vLLM (Jupyter Notebook · 275 stars · 100 forks)
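Recipes of this kind typically pair a `vllm serve <model>` command with client code. A sketch like the one below, using the `openai` package against vLLM's OpenAI-compatible endpoint on its default local port, is the common pattern (the model name is illustrative):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is a placeholder when auth is disabled.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model being served
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(resp.choices[0].message.content)
```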
A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM (Python · 152 stars · 20 forks)
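Algorithms built with this library ultimately plug into vLLM's speculative decoding support. As a rough sketch of how draft-model speculation is switched on in vLLM itself (argument names have changed across versions, so treat the exact keys as assumptions to verify against current docs):

```python
from vllm import LLM, SamplingParams

# Draft-model speculation: a small draft model proposes tokens that the
# target model verifies in parallel. Key names follow recent vLLM docs
# but may differ by version.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",        # example target model
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # example draft model
        "num_speculative_tokens": 5,
    },
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```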
Intelligent Router for Mixture-of-Models (Go · 2.4k stars · 304 forks)
Hosts the code for vLLM's CI and performance benchmarking infrastructure.
TPU inference for vLLM, with unified JAX and PyTorch support.
Community-maintained hardware plugin for vLLM on Intel Gaudi
Community-maintained hardware plugin for vLLM on Ascend
A framework for efficient model inference with omni-modality models
The vLLM XPU kernels for Intel GPUs
Cost-efficient and pluggable infrastructure components for GenAI inference