Metadata-Version: 2.4
Name: llm-runtime-metrics
Version: 0.0.6
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Typing :: Typed
Requires-Dist: prometheus-client>=0.20
Summary: Rust-backed performance metrics and request tracing
Keywords: metrics,observability,prometheus,llm
Home-Page: https://github.com/basetenlabs/performance-metrics
Author: Baseten
License: Apache-2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/basetenlabs/performance-metrics
Project-URL: Issues, https://github.com/basetenlabs/performance-metrics/issues
Project-URL: Repository, https://github.com/basetenlabs/performance-metrics

# llm-runtime-metrics (Python)

Python bindings for request metrics and Prometheus export.

Install from PyPI with `pip install llm-runtime-metrics`.

Import in Python as:

```python
import llm_runtime_metrics
```

The supported top-level API focuses on the request-metrics workflow.

## Add LLM Metrics To An Existing Prometheus Server

```python
from prometheus_client import CollectorRegistry, start_http_server
from llm_runtime_metrics import (
    REQUEST_FEATURE_IMAGE,
    REQUEST_FEATURE_TOOLS,
    RequestMetricsCollector,
    RequestMetricsFactory,
)

# Reuse your existing registry if you already have one.
registry = CollectorRegistry()

factory = RequestMetricsFactory(
    request_log_enabled=False,           # no per-request log output
    metric_prefix="llm_runtime",         # prefix for exported metric names
    metrics_window_seconds=60.0,         # sliding window for aggregated stats
    metrics_quantiles=[0.5, 0.9, 0.99],  # p50/p90/p99 quantiles to export
)

# Registers a custom collector that pulls fresh samples from `factory` at scrape time.
RequestMetricsCollector(
    factory,
    base_labels={"service": "text-generation", "engine": "vllm"},
    registry=registry,
)

# If your app already exposes /metrics, wire this into that server instead.
start_http_server(8000, registry=registry)


# Example lifecycle hooks in your inference code:
def on_request_start(prompt_token_ids: list[int]):
    features = REQUEST_FEATURE_TOOLS | REQUEST_FEATURE_IMAGE
    return factory.new_request(prompt_token_ids, features=features)


def on_stream_step(req_metrics, full_output_token_ids: list[int], cached_tokens: int | None):
    # Use `is_diff=False` when passing cumulative token ids.
    req_metrics.record_tokens(full_output_token_ids, cached_tokens=cached_tokens, is_diff=False)


def on_request_success(req_metrics):
    req_metrics.success()


def on_request_cancel(req_metrics):
    req_metrics.cancel()
```
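
For context, here is a minimal sketch of driving those hooks over one streamed request. The token ids below are placeholders for illustration, not real vocabulary ids:

```python
# Hypothetical driver for the hooks defined above.
req = on_request_start([101, 2023, 2003])  # placeholder prompt token ids

output_token_ids: list[int] = []
for token_id in (7592, 2088, 999):  # placeholder tokens as they stream back
    output_token_ids.append(token_id)
    # Pass the cumulative output so far, matching is_diff=False above.
    on_stream_step(req, output_token_ids, cached_tokens=None)

on_request_success(req)
```

If the client disconnects mid-stream, call `on_request_cancel(req)` instead of `on_request_success(req)`.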

Available request feature bits (combinable with bitwise OR, as sketched after this list):

- `REQUEST_FEATURE_NONE`
- `REQUEST_FEATURE_XGRAMMAR`
- `REQUEST_FEATURE_TOOLS`
- `REQUEST_FEATURE_IMAGE`
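
A small sketch of composing the bits; `REQUEST_FEATURE_NONE` is assumed here to mark a request with no special features:

```python
from llm_runtime_metrics import (
    REQUEST_FEATURE_NONE,
    REQUEST_FEATURE_TOOLS,
    REQUEST_FEATURE_XGRAMMAR,
)

# A grammar-guided request that also uses tools.
features = REQUEST_FEATURE_XGRAMMAR | REQUEST_FEATURE_TOOLS

# A plain request with no special features.
plain_features = REQUEST_FEATURE_NONE
```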

If you need the metrics as Prometheus text-format output instead of registering a collector, call:

```python
text = factory.prometheus_strfmt({"service": "text-generation", "engine": "vllm"})
```
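
If your application serves `/metrics` itself, a minimal standard-library sketch (reusing the `factory` from the example above) might look like this:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        # `factory` is the RequestMetricsFactory created earlier.
        body = factory.prometheus_strfmt(
            {"service": "text-generation", "engine": "vllm"}
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("0.0.0.0", 9090), MetricsHandler).serve_forever()
```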

