Metadata-Version: 2.4
Name: shenron
Version: 0.20.9
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
License-File: LICENSE
Summary: Generate Shenron docker-compose deployments from model config files
Author: doubleword.ai
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/doublewordai/shenron
Project-URL: Repository, https://github.com/doublewordai/shenron

# Shenron

Shenron is a config-driven toolkit for deploying production LLM inference stacks. It supports two deployment modes:

1. **Helm chart** (recommended) — deploy on any Kubernetes cluster (k3s, microk8s, GKE, EKS, …)
2. **Docker Compose** (legacy) — single-node docker-compose deployments

---

## Helm Deployment (Recommended)

### Architecture

All external traffic enters through a single **Caddy** reverse proxy (the only `LoadBalancer` service). Caddy routes requests by path prefix to internal `ClusterIP` services:

```
Internet → Caddy (LoadBalancer :80/:443)
             ├── /llm/*       → onwards       (OpenAI-compatible API gateway)
             ├── /replica/*   → replica-manager (scaling API)
             └── /metrics/*   → prometheus     (metrics)
```

Caddy uses `handle_path` directives, which both match and strip the prefix. This means:
- `GET /llm/v1/models` → forwards as `GET /v1/models` to onwards
- `GET /metrics` → forwards as `GET /` to prometheus (which serves its UI at `/`)
- `POST /replica/v1/models/Qwen%2FQwen3-0.6B/replicas` → forwards to replica-manager
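
For illustration, a minimal Caddyfile sketch of this layout (not the chart's actual Caddyfile; the replica-manager and prometheus service names are assumptions following the `shenron-<component>` pattern, and ports are taken from the values reference below):

```
:80 {
    # handle_path both matches and strips the prefix
    handle_path /llm/* {
        reverse_proxy shenron-onwards:3000
    }
    handle_path /replica/* {
        reverse_proxy shenron-replica-manager:8081   # assumed service name
    }
    handle_path /metrics/* {
        reverse_proxy shenron-prometheus:9090        # assumed service name
    }
}
```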

Behind onwards, per-model components are deployed:

```
onwards → router (per model, cache-aware load balancing)
            └── model pod(s) (SGLang/vLLM, GPU workloads)
```

### Prerequisites

**Kubernetes cluster** with:
- GPU nodes with NVIDIA drivers installed
- NVIDIA device plugin (`nvidia.com/gpu` resource available)
- A `RuntimeClass` named `nvidia`
- A default StorageClass (for Caddy certificate persistence)

#### Node Setup (microk8s)

```bash
# 1. Disable the built-in ingress addon (it binds to host ports 80/443 and
#    intercepts all traffic before Caddy can serve ACME challenges)
sudo microk8s disable ingress

# 2. Enable metallb with the node's PUBLIC IP
#    Replace <PUBLIC_IP> with your node's actual public IP address.
#    Using a private IP (e.g. 10.x.x.x) will prevent Let's Encrypt from
#    reaching Caddy for HTTP-01 challenges.
sudo microk8s enable metallb:<PUBLIC_IP>-<PUBLIC_IP>

# 3. Enable hostpath-storage (provides the default StorageClass for Caddy PVC)
sudo microk8s enable hostpath-storage

# 4. Enable GPU support
sudo microk8s enable gpu

# 5. Create the RuntimeClass
sudo microk8s kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# 6. Verify GPUs are visible
sudo microk8s kubectl describe node | grep -A5 nvidia.com/gpu
```

<details>
<summary><strong>k3s-specific setup</strong> (click to expand)</summary>

```bash
# 1. Disable Traefik (frees ports 80/443 for Caddy)
#    Add to /etc/rancher/k3s/config.yaml:
#      disable:
#        - traefik
#    Then: systemctl restart k3s

# 2. Configure containerd for NVIDIA runtime
cat > /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl << 'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
EOF

systemctl restart k3s

# 3. Create the RuntimeClass
kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# 4. Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
```

> **Note**: k3s does not need `caddy.hostNetwork: true` — klipper-lb handles
> ACME challenges correctly without it. The default StorageClass `local-path`
> is available out of the box.

</details>

### Quick Start

```bash
# 1. Install shenron CLI
uv pip install shenron

# 2. Create a DNS record for your node (optional — skip for HTTP-only)
shenron endpoint setup \
  --subdomain my-node \
  --public-ip <NODE_PUBLIC_IP> \
  --cloudflare-api-token $CF_TOKEN \
  --cloudflare-zone-id $CF_ZONE_ID

# 3. Download the Helm chart
shenron get --helm

# 4. Create required secrets
sudo microk8s kubectl create namespace shenron

sudo microk8s kubectl create secret generic system-api-key \
  -n shenron \
  --from-literal=SYSTEM_API_KEY='your-api-key'

sudo microk8s kubectl create secret generic replica-manager-auth \
  -n shenron \
  --from-literal=token='your-replica-manager-token'

# 5. Edit `shenron-helm/node-piccolo.yaml` or `shenron-helm/node-chiaotzu.yaml`, plus `shenron-helm/replicas.yaml`:
#    - image.tag: latest-sglang-cu130  (match your CUDA version)
#    - caddy.fqdn: "my-node.nodes.doubleword.ai"
#    - caddy.hostNetwork: true
#    - cluster.total_gpus: <number of GPUs on this node>
#    - Define model specs in the node-specific file
#    - Enable models with replicas > 0 in replicas.yaml

# 6. Deploy
cd shenron-helm
helmfile -f helmfile-piccolo apply
```
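
Once the pods are running, a quick smoke test through Caddy (this assumes onwards accepts the step-4 API key as a Bearer token, the usual convention for OpenAI-compatible gateways):

```bash
# /llm/* is stripped by Caddy, so this hits onwards' /v1/models
curl -H "Authorization: Bearer your-api-key" \
  http://<NODE_PUBLIC_IP>/llm/v1/models
```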

### Choosing the Right Image Tag

The image tag in the node-specific values file must match your GPU's CUDA compute capability:

| GPU Family | Compute Capability | Image Tag Suffix |
|---|---|---|
| Ampere (A100, A10G) | sm_80, sm_86 | `cu126` |
| Hopper (H100, H200) | sm_90 | `cu126` or `cu128` |
| Blackwell (B200, RTX PRO 6000) | sm_100, sm_120 | `cu130` |

Format: `latest-{engine}-{cuda}` — e.g. `latest-sglang-cu130`, `latest-vllm-cu126`.
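
If you are unsure which row applies, recent NVIDIA drivers can report the compute capability directly:

```bash
# Prints e.g. "NVIDIA H100 80GB HBM3, 9.0" (requires a reasonably recent driver)
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```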

### Caddy Routing & TLS

**HTTP-only mode** (default): When `caddy.fqdn` is empty, Caddy serves on `:80` with no TLS.

**HTTPS mode**: Set `caddy.fqdn` to a domain pointing at your node's public IP. Caddy automatically obtains a Let's Encrypt certificate.

```yaml
caddy:
  fqdn: my-node.nodes.doubleword.ai
```

Use `shenron endpoint setup` to create a Cloudflare DNS record under `*.nodes.doubleword.ai` and get the FQDN value.

**Certificate persistence**: Enabled by default. Caddy stores certificates in a PVC so they survive pod restarts (important — Let's Encrypt rate-limits to 5 duplicate certs per week). The PVC uses whatever default StorageClass the cluster provides.

**`hostNetwork` (microk8s)**: On microk8s, `caddy.hostNetwork: true` is required for ACME HTTP-01 challenges to work. metallb uses kube-proxy DNAT rules that interfere with challenge handling — `hostNetwork` makes Caddy bind directly to the node's ports 80/443, bypassing kube-proxy. k3s does not need this (klipper-lb handles it automatically).
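
A microk8s node serving HTTPS therefore combines both settings:

```yaml
caddy:
  fqdn: my-node.nodes.doubleword.ai
  hostNetwork: true
```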

### Extra Caddy Routes

Add custom backend routes via `caddy.extraRoutes`:

```yaml
caddy:
  extraRoutes:
    - path: "/custom/*"
      service: my-backend-service
      port: 8080
```
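
Assuming extra routes get the same `handle_path` treatment as the built-in ones, `GET /custom/foo` would reach `my-backend-service:8080` as `GET /foo`.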

### Values Reference

#### Top-Level

| Key | Default | Description |
|---|---|---|
| `image.repository` | `doublewordai/shenron` | Model container image |
| `image.tag` | `latest-vllm-cu130` | Image tag (engine + CUDA version) |
| `port` | `3000` | Container port for all model pods |
| `gpu.runtimeClassName` | `nvidia` | Kubernetes RuntimeClass for GPU pods |
| `cluster.total_gpus` | `0` | Total GPU budget for replica-manager scaling |

#### Caddy

| Key | Default | Description |
|---|---|---|
| `caddy.enabled` | `true` | Deploy Caddy reverse proxy |
| `caddy.fqdn` | `""` | FQDN for automatic TLS (empty = HTTP-only on :80) |
| `caddy.service.type` | `LoadBalancer` | Only public-facing service |
| `caddy.hostNetwork` | `false` | Bind directly to host ports 80/443 (required for microk8s) |
| `caddy.persistence.enabled` | `true` | PVC for Let's Encrypt cert storage |
| `caddy.persistence.storageClass` | `""` | Empty = cluster default StorageClass |
| `caddy.extraRoutes` | `[]` | Additional `[{path, service, port}]` routes |

#### Onwards (API Gateway)

| Key | Default | Description |
|---|---|---|
| `onwards.enabled` | `true` | Deploy onwards |
| `onwards.port` | `3000` | Onwards listen port |
| `onwards.service.type` | `ClusterIP` | ClusterIP when behind Caddy |
| `onwards.systemApiKeySecret.name` | `system-api-key` | Secret with API key |

#### Replica Manager

| Key | Default | Description |
|---|---|---|
| `replicaManager.enabled` | `true` | Deploy replica-manager |
| `replicaManager.port` | `8081` | Listen port |
| `replicaManager.service.type` | `ClusterIP` | ClusterIP when behind Caddy |
| `replicaManager.auth.tokenSecret.name` | `replica-manager-auth` | Auth token secret |
| `replicaManager.helm.chartPath` | `/opt/shenron/helm` | Chart path used by replica-manager Helm upgrades |
| `replicaManager.helm.chartMount.enabled` | `false` | Mount a node hostPath chart dir into replica-manager |
| `replicaManager.helm.chartMount.hostPath` | `""` | Node path to mount when chartMount is enabled |
| `replicaManager.helm.chartMount.hostPathType` | `Directory` | Kubernetes hostPath type for the chart mount |

#### Prometheus

| Key | Default | Description |
|---|---|---|
| `prometheus.enabled` | `false` | Deploy Prometheus |
| `prometheus.port` | `9090` | Prometheus port |
| `prometheus.service.type` | `ClusterIP` | ClusterIP when behind Caddy |

#### Scouter Reporter

| Key | Default | Description |
|---|---|---|
| `scouterReporter.enabled` | `false` | Deploy scouter reporters |
| `scouterReporter.collector.instanceSecret.name` | `scouter-reporter` | Collector instance secret |
| `scouterReporter.collector.apiKeySecret.name` | `scouter-reporter` | Ingest API key secret |

When enabled, create the secret:
```bash
sudo microk8s kubectl create secret generic scouter-reporter \
  -n shenron \
  --from-literal=collector-instance='your-collector-host' \
  --from-literal=ingest-api-key='your-ingest-key'
```

#### Models

Models are defined in `models:` as a map keyed by model name:

```yaml
models:
  "Qwen/Qwen3-0.6B":
    replicas: 1          # 0 = disabled
    num_gpus: 1          # GPUs per replica
    command:
      - "python"
      - "-m"
      - "sglang.launch_server"
      - "--model-path"
      - "Qwen/Qwen3-0.6B"
      - "--host"
      - "0.0.0.0"
      - "--enable-metrics"
    shm:
      enabled: true
      sizeLimit: 4Gi
    resources:
      requests:
        cpu: "8"
        memory: "8Gi"
      limits:
        cpu: "8"
        memory: "8Gi"
```

> **Note**: Do not include `--port` in `command` — the chart injects it from the top-level `port` value.

Each model with `replicas > 0` gets:
- A `Deployment` with GPU resources and the HuggingFace cache volume
- A headless `Service`
- An SGLang router `Deployment` + `Service` (when `router.enabled`)
- A scouter reporter `Deployment` (when `scouterReporter.enabled`)
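
To inspect the resources generated for one model (assuming the Services carry the same `shenron.ai/model-id` label used for pod log selection in the debugging section):

```bash
sudo microk8s kubectl get deploy,svc -n shenron -l shenron.ai/model-id=<model-id>
```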

### Replica Manager API

The replica manager provides a REST API for dynamic scaling:

```bash
# Health check
curl http://<host>/replica/healthz

# List models and current replicas
curl -H "Authorization: Bearer <token>" http://<host>/replica/v1/models

# Scale a model
curl -X POST -H "Authorization: Bearer <token>" \
  -d '{"replicas": 2}' \
  http://<host>/replica/v1/models/Qwen%2FQwen3-0.6B/replicas
```

Scaling is GPU-budget-aware: it validates the sum of `num_gpus × replicas` across all models against `cluster.total_gpus`.
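
A worked example with illustrative numbers:

```bash
# Suppose cluster.total_gpus = 8 and two models are deployed:
#   Qwen/Qwen3-0.6B     num_gpus=1, replicas=2  -> 2 GPUs
#   Qwen/Qwen3-30B-A3B  num_gpus=2, replicas=2  -> 4 GPUs   (6 of 8 used)
# Scaling the 30B model to 3 replicas needs 2 + 2*3 = 8 <= 8, so it is accepted:
curl -X POST -H "Authorization: Bearer <token>" \
  -d '{"replicas": 3}' \
  http://<host>/replica/v1/models/Qwen%2FQwen3-30B-A3B/replicas
# 4 replicas would need 2 + 2*4 = 10 > 8 and would be rejected.
```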

### Upgrading

```bash
# After editing a node-specific values file or replicas.yaml:
cd shenron-helm
helmfile -f helmfile-piccolo apply

# Quick override without editing files:
sudo microk8s helm3 upgrade shenron ./shenron-helm -n shenron \
  --values ./shenron-helm/node-piccolo.yaml \
  --values ./shenron-helm/replicas.yaml \
  --set image.tag=latest-sglang-cu130
```

### Debugging

```bash
# Check pod status
sudo microk8s kubectl get pods -n shenron

# Core component logs (Caddy, onwards)
sudo microk8s kubectl logs -n shenron deploy/shenron-caddy
sudo microk8s kubectl logs -n shenron deploy/shenron-onwards

# Follow logs for a model pod
sudo microk8s kubectl logs -f -n shenron -l shenron.ai/model-id=<model-id>

# Verify Caddyfile
sudo microk8s kubectl get configmap shenron-caddy-config -n shenron \
  -o jsonpath='{.data.Caddyfile}'

# Verify Onwards config
sudo microk8s kubectl get configmap shenron-onwards-config -n shenron \
  -o jsonpath='{.data.onwards_config\.json}' | python3 -m json.tool

# Test from inside the cluster
sudo microk8s kubectl run -n shenron curl --rm -it --image=curlimages/curl -- \
  curl -s http://shenron-onwards:3000/v1/models

# GPU visibility
sudo microk8s kubectl describe node | grep -A5 nvidia.com/gpu

# Check Caddy has a valid TLS certificate
curl -sv https://my-node.nodes.doubleword.ai/llm/v1/models 2>&1 | grep 'subject:'
```

---

## Docker Compose (Legacy)

> The docker-compose path is maintained for backward compatibility but is not recommended for new deployments. Use the Helm chart instead.

`shenron` reads a model config YAML and generates:
- `docker-compose.yml`
- `.generated/Caddyfile`
- `.generated/prometheus.yml`
- `.generated/scouter_reporter.env`
- `.generated/engine_start.sh`
- `.generated/engine_start_N.sh` + `.generated/sglangmux_start.sh` when `models:` has 2+ entries

### Quick Start

```bash
uv pip install shenron
shenron get
docker compose up -d
```

`shenron get` reads a per-release config index asset, shows available configs with arrow-key selection, downloads the chosen config, and generates deployment artifacts in the current directory. Using `--release latest` also rewrites `shenron_version` in the downloaded config to `latest`. You can also override config values on download with:
- `--api-key` (writes `api_key`)
- `--scouter-api-key` (writes `scouter_ingest_api_key`)
- `--scouter-collector-instance` (writes `scouter_collector_instance`)
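
For example, a hypothetical invocation combining these flags:

```bash
shenron get --release latest \
  --api-key 'your-api-key' \
  --scouter-api-key 'your-ingest-key' \
  --scouter-collector-instance 'your-collector-host'
```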

`shenron .` expects exactly one config YAML (`*.yml` or `*.yaml`) in the current directory; alternatively, pass a config file path directly (e.g. `shenron configs/Qwen06B-cu126-TP1.yml`).

### Engine Configuration

- `engine`: `vllm` or `sglang` (default: `vllm`)
- `engine_args`: engine CLI args appended after core settings.
- `engine_env`: top-level default engine environment variables as alternating `KEY, VALUE` entries.
- `models[*].engine_envs`: per-model engine environment variables as alternating `KEY, VALUE` entries.
- `engine_port`, `engine_host`: engine bind settings used for generated scripts and targets.
- `engine_use_cuda_ipc_transport`: when `true`, exports `SGLANG_USE_CUDA_IPC_TRANSPORT=1` before launching SGLang.
- `models`: optional per-model engine config. With 1 entry, Shenron generates a single `engine_start.sh`. With 2+ entries, Shenron starts `sglangmux` (requires `engine: sglang`).
- `sglangmux_listen_port`, `sglangmux_host`, `sglangmux_upstream_timeout_secs`, `sglangmux_model_ready_timeout_secs`, `sglangmux_model_switch_timeout_secs`, `sglangmux_log_dir`: optional sglangmux settings.

`engine_args`, `engine_env`, and `models[*].engine_envs` values accept YAML scalars (string/number/bool). If you need to pass a structured value (like `--override-generation-config`), provide a YAML mapping and it will be JSON-encoded.
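
A sketch of the structured-value case (the `temperature` and `top_p` keys are illustrative, not required):

```yaml
engine_args:
  - --override-generation-config
  # A mapping entry is JSON-encoded, so the engine receives:
  #   --override-generation-config '{"temperature": 0.6, "top_p": 0.95}'
  - temperature: 0.6
    top_p: 0.95
```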

Legacy keys (`vllm_args`, `sglang_args`, `vllm_port`, `vllm_host`, `sglang_env`, `sglang_use_cuda_ipc_transport`) are still accepted as aliases.

### Multi-Model (sglangmux) Example

```yaml
engine: sglang
sglangmux_listen_port: 8100
models:
- model_name: Qwen/Qwen3-0.6B
  engine_port: 8001
  engine_args: [--tp, 1]
- model_name: Qwen/Qwen3-30B-A3B
  engine_port: 8002
  engine_args: [--tp, 2]
```

Rules:
- 2+ models requires `engine: sglang`
- Each `models[*].model_name` and `engine_port` must be unique
- `sglangmux_listen_port` must differ from all model ports

---

## Endpoint Setup (Cloudflare DNS)

The `shenron endpoint setup` command creates a DNS record under `*.nodes.doubleword.ai` via the Cloudflare API:

```bash
shenron endpoint setup \
  --subdomain my-node \
  --public-ip 1.2.3.4 \
  --cloudflare-api-token $CF_TOKEN \
  --cloudflare-zone-id $CF_ZONE_ID
```

This writes `.generated/node_endpoint.json` with the FQDN and Cloudflare record metadata. The FQDN can then be used in `caddy.fqdn` (Helm) or is automatically picked up by `shenron generate` (docker-compose).

> **Security**: All DNS operations are hard-restricted to `*.nodes.doubleword.ai`. This constraint is compiled into the binary and cannot be overridden at runtime.

---

## Configs

Starter configs for docker-compose mode are in `configs/`:

- `configs/Qwen06B-cu126-TP1.yml` / `cu129` / `cu130`
- `configs/Qwen30B-A3B-cu126-TP1.yml` / `cu129-TP1` / `cu129-TP2` / `cu130-TP2`
- `configs/Qwen235-A22B-cu129-TP2.yml` / `cu129-TP4` / `cu130-TP2`
- `configs/GPT-OSS-20B-cu126-TP1.yml` / `cu129-TP1`
- `configs/Qwen35-397B-A17B-cu130-TP8-sglang.yml`

## Development

```bash
# Run tests (Rust + CLI + compose checks)
./scripts/ci.sh

# Install local package for manual testing
python3 -m pip install -e .

# Generate from repo config (docker-compose mode)
shenron configs/Qwen06B-cu126-TP1.yml --output-dir /tmp/shenron-test

# Lint the Helm chart
helm lint helm/ --values helm/node-piccolo.yaml --values helm/replicas.yaml
```

## Release Automation

- `release-assets.yaml` publishes stamped config files (`*.yml`) as release assets.
- `release-assets.yaml` also publishes `configs-index.txt`, which powers `shenron get`.
- `release-assets.yaml` packages Helm chart assets as `shenron-<version>.tgz` + `index.yaml` (Helm repository format).
- `release-assets.yaml` mirrors `*.yml`, `configs-index.txt`, `shenron-*.tgz`, and `index.yaml` into `${OWNER}/shenron-configs` under the same tag as the main `shenron` release.
- Set `CONFIGS_REPO_TOKEN` (or reuse `RELEASE_PLEASE_TOKEN`) with write access to the configs repo release assets.
- `python-release.yaml` builds/publishes the `shenron` package to PyPI on release tags.
- Docker image build/push via Depot remains in `ci.yaml`.

## License

MIT, see `LICENSE`.

