Metadata-Version: 2.4
Name: paroquant
Version: 0.1.6
Summary: ParoQuant — Pairwise Rotation Quantization for LLMs
Author: Z Lab
License-Expression: MIT
Project-URL: Homepage, https://paroquant.z-lab.ai
Project-URL: Paper, https://arxiv.org/abs/2511.10645
Project-URL: Models, https://huggingface.co/collections/z-lab/paroquant
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rich
Provides-Extra: transformers
Requires-Dist: torch>=2.8; extra == "transformers"
Requires-Dist: torchvision; extra == "transformers"
Requires-Dist: transformers>=4.55; extra == "transformers"
Requires-Dist: autoawq; extra == "transformers"
Provides-Extra: vllm
Requires-Dist: vllm>=0.15; extra == "vllm"
Requires-Dist: accelerate; extra == "vllm"
Provides-Extra: mlx
Requires-Dist: mlx; extra == "mlx"
Requires-Dist: mlx-lm; extra == "mlx"
Requires-Dist: mlx-vlm; extra == "mlx"
Provides-Extra: optim
Requires-Dist: paroquant[transformers]; extra == "optim"
Requires-Dist: datasets; extra == "optim"
Requires-Dist: simple_parsing; extra == "optim"
Requires-Dist: tqdm; extra == "optim"
Provides-Extra: eval
Requires-Dist: lm_eval; extra == "eval"
Requires-Dist: zstandard; extra == "eval"
Provides-Extra: agent
Requires-Dist: qwen-agent; extra == "agent"
Requires-Dist: mcp; extra == "agent"
Requires-Dist: soundfile; extra == "agent"
Requires-Dist: uv; extra == "agent"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Dynamic: license-file

# ParoQuant

**Pairwise Rotation Quantization for Efficient Reasoning LLM Inference**

<p align="center">
  <a href="https://arxiv.org/abs/2511.10645"><img src="https://img.shields.io/badge/arXiv-2511.10645-b31b1b.svg" alt="Paper"></a>
  <a href="https://paroquant.z-lab.ai"><img src="https://img.shields.io/badge/Blog-ParoQuant-blue" alt="Blog"></a>
  <a href="https://huggingface.co/collections/z-lab/paroquant"><img src="https://img.shields.io/badge/%F0%9F%A4%97-Models-yellow" alt="Models"></a>
  <a href="https://pypi.org/project/paroquant/"><img src="https://img.shields.io/pypi/v/paroquant" alt="PyPI"></a>
</p>

State-of-the-art INT4 quantization for LLMs. ParoQuant uses learned pairwise rotations to suppress weight outliers, closing the accuracy gap with FP16 while running at near-AWQ speed. Supports NVIDIA GPUs (vLLM, Transformers) and Apple Silicon (MLX).

<p align="center">
  <a href="https://youtu.be/fISG4CkizLM">
    <img src="https://img.youtube.com/vi/fISG4CkizLM/maxresdefault.jpg" width="80%">
  </a>
</p>
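The intuition behind the pairwise rotations can be shown with a toy example (a sketch of the idea only, not the actual ParoQuant algorithm): a single Givens rotation mixes a channel pair so that an outlier's magnitude is shared across both channels, which shrinks the quantization scale and with it the INT4 rounding error.

```python
import numpy as np

# Toy illustration of pairwise rotation: one channel pair where a
# single outlier would otherwise dominate the quantization scale.
w = np.array([8.0, 0.5])

# Givens rotation that maps w onto the diagonal, equalizing magnitudes.
theta = np.arctan2(w[1], w[0]) - np.pi / 4
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s], [s, c]])
w_rot = R.T @ w  # components are now roughly equal: [~5.67, ~5.67]

def quantize(x, bits=4):
    """Symmetric round-to-nearest quantization."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

err_plain = np.abs(quantize(w) - w).max()        # outlier inflates the scale
err_rot = np.abs(R @ quantize(w_rot) - w).max()  # rotate back, then compare
print(err_plain, err_rot)  # the rotated version has far lower error
```

In the real method the rotation angles are learned per pair rather than derived in closed form, and the rotations are fused into inference kernels so the overhead stays small.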

## Quick Start

### Installation

```bash
# NVIDIA GPU
pip install "paroquant[vllm]"

# Apple Silicon
pip install "paroquant[mlx]"
```

Pick a model from our [Hugging Face collection](https://huggingface.co/collections/z-lab/paroquant):

```bash
export MODEL=z-lab/Qwen3.5-4B-PARO
```

### Interactive Chat

```bash
python -m paroquant.cli.chat --model $MODEL
```

### OpenAI-Compatible API Server

```bash
python -m paroquant.cli.serve --model $MODEL --port 8000
```

Pass `--llm-only` to skip loading the VLM components and serve the text-only model.
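Once the server is running, any OpenAI-compatible client can talk to it. Below is a minimal sketch using only the standard library; it assumes the default port from the command above and the standard chat-completions route, and the model name must match what you passed to `--model`:

```python
import json
import urllib.request

# Assumes the ParoQuant API server is running locally on port 8000.
body = json.dumps({
    "model": "z-lab/Qwen3.5-4B-PARO",
    "messages": [{"role": "user", "content": "Hello!"}],
}).encode()

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

print(reply["choices"][0]["message"]["content"])
```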

### Agent with Tool Calling

Start the API server first, then install the agent dependencies and run:

```bash
pip install "paroquant[agent]"
python -m paroquant.cli.agent --model $MODEL
```

Tool use (web fetch, filesystem, time) requires [Node.js](https://nodejs.org/en/download).

### Docker (NVIDIA GPU)

```bash
# Interactive chat
docker run --pull=always --rm -it --gpus all --ipc=host \
  ghcr.io/z-lab/paroquant:chat --model $MODEL

# API server (port 8000)
docker run --pull=always --rm -it --gpus all --ipc=host -p 8000:8000 \
  ghcr.io/z-lab/paroquant:serve --model $MODEL
```

## Models

All models are available on [Hugging Face](https://huggingface.co/collections/z-lab/paroquant). Swap the model name in the commands above to try any of them.

**Qwen3.5**

| Model | Checkpoint |
|---|---|
| Qwen3.5-0.8B | [`z-lab/Qwen3.5-0.8B-PARO`](https://huggingface.co/z-lab/Qwen3.5-0.8B-PARO) |
| Qwen3.5-2B | [`z-lab/Qwen3.5-2B-PARO`](https://huggingface.co/z-lab/Qwen3.5-2B-PARO) |
| Qwen3.5-4B | [`z-lab/Qwen3.5-4B-PARO`](https://huggingface.co/z-lab/Qwen3.5-4B-PARO) |
| Qwen3.5-9B | [`z-lab/Qwen3.5-9B-PARO`](https://huggingface.co/z-lab/Qwen3.5-9B-PARO) |

**Qwen3**

| Model | Checkpoint |
|---|---|
| Qwen3-0.6B | [`z-lab/Qwen3-0.6B-PARO`](https://huggingface.co/z-lab/Qwen3-0.6B-PARO) |
| Qwen3-1.7B | [`z-lab/Qwen3-1.7B-PARO`](https://huggingface.co/z-lab/Qwen3-1.7B-PARO) |
| Qwen3-4B | [`z-lab/Qwen3-4B-PARO`](https://huggingface.co/z-lab/Qwen3-4B-PARO) |
| Qwen3-8B | [`z-lab/Qwen3-8B-PARO`](https://huggingface.co/z-lab/Qwen3-8B-PARO) |
| Qwen3-14B | [`z-lab/Qwen3-14B-PARO`](https://huggingface.co/z-lab/Qwen3-14B-PARO) |

**Llama**

| Model | Checkpoint |
|---|---|
| Llama-2-7B | [`z-lab/Llama-2-7b-hf-PARO`](https://huggingface.co/z-lab/Llama-2-7b-hf-PARO) |
| Llama-3-8B | [`z-lab/Meta-Llama-3-8B-PARO`](https://huggingface.co/z-lab/Meta-Llama-3-8B-PARO) |
| Llama-3.1-8B-Instruct | [`z-lab/Llama-3.1-8B-Instruct-PARO`](https://huggingface.co/z-lab/Llama-3.1-8B-Instruct-PARO) |

Want a model that's not listed? [Open an issue](https://github.com/z-lab/paroquant/issues/new) and let us know.

## Reproduction

> [!NOTE]
> The main branch of this repository is under active development, and reproducibility is not guaranteed.
> Please use the [`legacy`](https://github.com/z-lab/paroquant/tree/legacy) branch to reproduce results from the paper.

## Quantize Your Own Model

```bash
git clone https://github.com/z-lab/paroquant && cd paroquant
pip install -e ".[optim,eval]"

# 1. Optimize rotation parameters
experiments/optimize/4bit.sh Qwen/Qwen3-8B

# 2. Export to an HF checkpoint (--mode real for INT4, --mode pseudo for FP16)
python -m paroquant.cli.convert \
  --model Qwen/Qwen3-8B \
  --result-dir output/Qwen3-8B \
  --output-path models/Qwen3-8B-PARO \
  --mode real
```

## Docker Images

| Image | Purpose |
|---|---|
| `ghcr.io/z-lab/paroquant:chat` | Interactive chat |
| `ghcr.io/z-lab/paroquant:chat-cu129` | Interactive chat (CUDA 12.9) |
| `ghcr.io/z-lab/paroquant:serve` | OpenAI-compatible API server |
| `ghcr.io/z-lab/paroquant:latest` | Optimization & evaluation |
| `ghcr.io/z-lab/paroquant:eval` | Reasoning task evaluation |

## Citation

```bibtex
@inproceedings{liang2026paroquant,
  title     = {{ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference}},
  author    = {Liang, Yesheng and Chen, Haisheng and Zhang, Zihan and Han, Song and Liu, Zhijian},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
```
