On CPU, set dtype=float32. Using bfloat16/float16 on CPU will cause errors.
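As a minimal sketch, a CPU launch that pins the dtype explicitly might look like this (the model name is just the example used later in this guide):
vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype float32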
Enable MCP to use tools from configured servers
Configure MCP Servers
No MCP servers configured
Add MCP Server
Retrieval-Augmented Generation is under development.
Upload documents and enhance responses with relevant context.
Performance Benchmarking
Install GuideLLM to enable advanced benchmarking features:
pip install "vllm-playground[benchmark]"
or
pip install guidellm
Built-in benchmarking is still available without GuideLLM.
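Once installed, a GuideLLM run against a local vLLM server looks roughly like the sketch below; the exact flags vary by GuideLLM version, so treat them as assumptions and check guidellm --help:
guidellm benchmark --target http://localhost:8000 --rate-type sweep --max-seconds 30 --data "prompt_tokens=256,output_tokens=128"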
No benchmark data available
Start the vLLM server and click "Run Benchmark" to test performance
Configure Model Context Protocol servers to extend LLM capabilities with external tools
Install the MCP package to enable this feature:
pip install vllm-playground[mcp]
or
pip install mcp
No MCP servers configured
Add a server to get started, or choose from presets below
npx (Node.js) - Required for Filesystem server
uvx (uv) - Required for Git, Fetch, Time servers
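For reference, these presets typically resolve to launch commands along the following lines; the package names come from the Model Context Protocol reference servers, and the filesystem path is a placeholder you must set yourself:
npx -y @modelcontextprotocol/server-filesystem /path/to/allowed/dir
uvx mcp-server-git
uvx mcp-server-fetch
uvx mcp-server-time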
AI-powered coding assistant connected to your vLLM server
Install ttyd to enable the Claude Code terminal:
brew install ttyd (macOS)
sudo apt install ttyd (Ubuntu/Debian)
Note: You also need to install the Claude Code CLI. See the official setup guide for installation options.
Or run Claude Code manually in your terminal with the vLLM environment variables shown in the manual setup section below.
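To sanity-check ttyd on its own, a minimal sketch (the port and shell are arbitrary choices; on ttyd 1.7+ the -W flag is needed to make the terminal writable):
ttyd -p 7681 -W bash
Then open http://localhost:7681 in a browser.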
Install Claude Code CLI to use this feature.
Recommended (native install):
curl -fsSL https://claude.ai/install.sh | bash
See official setup guide for all installation options (Homebrew, WinGet, npm).
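After installing, you can verify the CLI is on your PATH:
claude --version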
Start a vLLM server first to use Claude Code.
Model: meta-llama/Llama-3.1-8B-Instruct
Max model length: 65536 (Claude Code needs 32K+ for responses)
Tool call parser: llama3_json
Served model name: my-code
Requires vLLM v0.12.0+ for Anthropic API support. See setup tips below the terminal for the GPU memory guide.
Claude Code cannot use model names containing "/".
Set a Served Model Name (e.g., "my-model") in the vLLM Server configuration panel, then restart the server.
Configure vLLM Server
Claude Code requires tool calling to be enabled.
Enable Tool Calling in the vLLM Server configuration panel, then restart the server.
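As a sketch, both of the fixes above map onto vLLM launch flags like these (the model and alias names follow the examples elsewhere in this guide):
vllm serve meta-llama/Llama-3.1-8B-Instruct --served-model-name my-model --enable-auto-tool-choice --tool-call-parser llama3_json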
Configure vLLM Server
http://localhost:8000
Model: -
💡 Claude Code is running with your vLLM server as the backend. Type your coding questions or requests above.
Press Ctrl+C to interrupt, Ctrl+D to exit.
Claude Code connects through vLLM's Anthropic-compatible /v1/messages endpoint (vLLM v0.12.0+).
Meta Llama 3.1 8B Instruct - Best balance of context, tool calling, and GPU memory:
Model: meta-llama/Llama-3.1-8B-Instruct
Args: --max-model-len 65536 --enable-auto-tool-choice --tool-call-parser llama3_json --gpu-memory-utilization 0.9
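Putting the recommendation together, a full launch command might look like this sketch (the served-model-name alias is the example value from the configuration panel above):
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 65536 --enable-auto-tool-choice --tool-call-parser llama3_json --gpu-memory-utilization 0.9 --served-model-name my-code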
| GPU VRAM | Recommended Model | Max Context |
|---|---|---|
| 24GB (L4) | Llama-3.1-8B-Instruct | 32K-48K |
| 40-48GB (A6000) | Llama-3.1-8B-Instruct | 65K |
| 80GB (A100/H100) | Llama-3.1-70B-Instruct | 128K |
If Claude Code runs out of context, increase --max-model-len or use /clear to reset the context.
If the GPU runs out of memory, reduce --max-model-len or use a quantized model (AWQ/GPTQ).
You can always run Claude Code directly in your terminal with these environment variables:
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=vllm-playground
claude --model your-model-name
Replace the port and model name with your vLLM server settings.
No resources available
No prompts available
Optimized configurations from the vLLM Recipes Repository. Select a recipe to auto-configure the playground.
Are you sure?
Choose export format:
Are you sure you want to clear all messages? This action cannot be undone.