Metadata-Version: 2.4
Name: backend.ai-agent
Version: 25.15.8
Summary: Backend.AI Agent
Home-page: https://github.com/lablup/backend.ai
Author: Lablup Inc. and contributors
License: LGPLv3
Project-URL: Documentation, https://docs.backend.ai/
Project-URL: Source, https://github.com/lablup/backend.ai
Classifier: Intended Audience :: Developers
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Environment :: No Input/Output (Daemon)
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development
Classifier: Development Status :: 5 - Production/Stable
Classifier: Programming Language :: Python :: 3.13
Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 or later (LGPLv3+)
Requires-Python: >=3.13,<3.14
Description-Content-Type: text/markdown
Requires-Dist: aiodocker==0.24.0
Requires-Dist: aiofiles~=24.1.0
Requires-Dist: aiohttp_cors~=0.8.1
Requires-Dist: aiohttp~=3.13.0
Requires-Dist: aiomonitor~=0.7.0
Requires-Dist: aiotools~=1.9.0
Requires-Dist: async_timeout~=4.0
Requires-Dist: attrs>=25.3
Requires-Dist: backend.ai-cli==25.15.8
Requires-Dist: backend.ai-common==25.15.8
Requires-Dist: backend.ai-kernel-binary==25.15.8
Requires-Dist: backend.ai-kernel-helper==25.15.8
Requires-Dist: backend.ai-kernel==25.15.8
Requires-Dist: backend.ai-krunner-static-gnu==4.4.0
Requires-Dist: backend.ai-logging==25.15.8
Requires-Dist: backend.ai-plugin==25.15.8
Requires-Dist: cachetools~=5.5.0
Requires-Dist: callosum~=1.0.3
Requires-Dist: cattrs~=24.1.1
Requires-Dist: click~=8.1.7
Requires-Dist: etcd-client-py~=0.4.1
Requires-Dist: janus~=2.0
Requires-Dist: kubernetes-asyncio~=33.3.0
Requires-Dist: kubernetes~=33.1.0
Requires-Dist: more-itertools~=10.5.0
Requires-Dist: networkx~=3.3.0
Requires-Dist: prometheus-client~=0.21.1
Requires-Dist: psutil~=7.0
Requires-Dist: pydantic[email]~=2.11.3
Requires-Dist: pyzmq~=26.4
Requires-Dist: ruamel.yaml~=0.18.10
Requires-Dist: setproctitle~=1.3.5
Requires-Dist: setuptools~=80.0.0
Requires-Dist: tenacity>=9.0
Requires-Dist: tomlkit~=0.13.2
Requires-Dist: trafaret~=2.1
Requires-Dist: types-aiofiles
Requires-Dist: types-cachetools
Requires-Dist: types-psutil
Requires-Dist: types-setuptools
Requires-Dist: typing_extensions~=4.11
Requires-Dist: uvloop~=0.21; sys_platform != "Windows"
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: project-url
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Backend.AI Agent

The Backend.AI Agent is a small daemon that:

* Reports the status and available resource slots of a worker to the manager
* Routes code execution requests to the designated kernel container
* Manages the lifecycle of kernel containers (create/monitor/destroy them)

## Package Structure

* `ai.backend`
  - `agent`: The agent package
    - `docker`: A docker-based backend implementation for the kernel lifecycle interface.
    - `server`: The agent daemon which communicates with the manager and the Docker daemon
    - `watcher`: A side-by-side daemon which provides a separate HTTP endpoint for accessing the agent
      daemon's status information and for manipulating the agent's systemd service
  - `helpers`: A utility package that is available as `ai.backend.helpers` *inside* Python-based containers
  - `kernel`: Language-specific runtimes (mostly ipykernel client adaptor) which run *inside* containers
  - `runner`: Auxiliary components (usually self-contained binaries) mounted *inside* containers


## Installation

Please visit [the installation guides](https://github.com/lablup/backend.ai/wiki).


### Kernel/system configuration

#### Recommended kernel parameters in the bootloader (e.g., Grub):

```
cgroup_enable=memory swapaccount=1
```

#### Recommended resource limits:

**`/etc/security/limits.conf`**
```
root hard nofile 512000
root soft nofile 512000
root hard nproc 65536
root soft nproc 65536
user hard nofile 512000
user soft nofile 512000
user hard nproc 65536
user soft nproc 65536
```

**sysctl**
```
fs.file-max=2048000
fs.inotify.max_user_watches=524288
net.core.somaxconn=1024
net.ipv4.tcp_max_syn_backlog=1024
net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_window_scaling=1
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_early_retrans=1
net.ipv4.ip_local_port_range=40000 65000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem=4096 12582912 16777216
net.ipv4.tcp_wmem=4096 12582912 16777216
net.netfilter.nf_conntrack_max=10485760
net.netfilter.nf_conntrack_tcp_timeout_established=432000
net.netfilter.nf_conntrack_tcp_timeout_close_wait=10
net.netfilter.nf_conntrack_tcp_timeout_fin_wait=10
net.netfilter.nf_conntrack_tcp_timeout_time_wait=10
```

The `ip_local_port_range` should not overlap with the container port range pool
(default: 30000 to 31000).
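
The overlap constraint above can be sanity-checked with a few lines of Python (the helper name is illustrative, not part of the agent):

```python
def ranges_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """Return True if two inclusive port ranges [lo, hi] overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

# Recommended ip_local_port_range vs. the default container port pool.
local_ports = (40000, 65000)
container_pool = (30000, 31000)
print(ranges_overlap(local_ports, container_pool))  # → False (no conflict)
```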

To apply the netfilter settings at boot time, you may need to add `nf_conntrack` to `/etc/modules`
so that `sysctl` can set the `net.netfilter.nf_conntrack_*` values.


### For development

#### Prerequisites

* Python 3.13 with [pyenv](https://github.com/pyenv/pyenv)
and [pyenv-virtualenv](https://github.com/pyenv/pyenv-virtualenv) (optional but recommended)
* Docker 18.03 or later with docker-compose (18.09 or later is recommended)

First, you need **a working manager installation**.
For detailed instructions on installing the manager, please refer to
[the manager's README](https://github.com/lablup/backend.ai-manager/blob/master/README.md)
and then come back here.

#### Preparing working copy

Install and activate [`git-lfs`](https://git-lfs.github.com/) to work with pre-built binaries in
`src/ai/backend/runner`.

### CPU Monitoring
- Track per-container CPU usage via cgroups
- Measure CPU time in user and system modes
- Calculate CPU utilization percentages
- Enforce CPU quotas and limits
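
As a sketch of what the underlying cgroup accounting data looks like, a cgroup v2 `cpu.stat` file exposes cumulative user/system CPU time in microseconds. The sample content below is made up for illustration; on a real host it would be read from `/sys/fs/cgroup/<container-scope>/cpu.stat`:

```python
# Parse cgroup v2 cpu.stat content (sample text is fabricated for illustration).
sample = """\
usage_usec 1500000
user_usec 1000000
system_usec 500000
"""

stats = {}
for line in sample.splitlines():
    key, value = line.split()
    stats[key] = int(value)

# CPU utilization over a window is the delta of usage_usec divided by wall time.
print(stats["user_usec"] + stats["system_usec"] == stats["usage_usec"])  # → True
```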

### Memory Monitoring
- Track RSS (Resident Set Size) per container
- Measure cache and swap usage
- Detect OOM (Out-of-Memory) conditions
- Enforce memory limits via cgroups

### Shared Memory (shmem)
Containers can request shared memory (`/dev/shm`) for inter-process communication.

**Docker Memory Architecture**:
- shm (tmpfs) and app memory share the Memory cgroup space
- shm has an additional ShmSize limit (tmpfs maximum size)
- Effective shm limit = `min(ShmSize, Memory cgroup available space)`

**OOM Conditions**:
| Signal | Exit Code | Condition |
|--------|-----------|-----------|
| SIGKILL | 137 | shm + app > Memory cgroup limit |
| SIGBUS | 135 | shm > ShmSize |

**Configuration**:
- Set via `resource_opts.shmem` in session specification
- Docker HostConfig: `ShmSize` parameter

**References**:
- [Linux Kernel cgroup v1 Memory](https://docs.kernel.org/admin-guide/cgroup-v1/memory.html) - tmpfs/shm charged to cgroup
- [Linux Kernel cgroup v2](https://docs.kernel.org/admin-guide/cgroup-v2.html) - shmem in memory.stat
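
The exit codes in the table above follow the usual 128 + signal-number convention (SIGKILL = 9 → 137; SIGBUS = 7 on Linux → 135). A small illustrative classifier, with hypothetical helper and message names:

```python
# 128 + signal-number convention (signal numbers as on Linux x86-64).
OOM_SIGNALS = {
    137: ("SIGKILL", "shm + app memory exceeded the Memory cgroup limit"),
    135: ("SIGBUS", "a /dev/shm write exceeded ShmSize"),
}

def classify_exit(code: int) -> str:
    """Interpret a container exit code (hypothetical helper, not the agent's API)."""
    if code in OOM_SIGNALS:
        name, cause = OOM_SIGNALS[code]
        return f"{name}: likely {cause}"
    return "normal exit" if code == 0 else f"exit code {code}"

print(classify_exit(137))  # → SIGKILL: likely shm + app memory exceeded the Memory cgroup limit
print(classify_exit(135))  # → SIGBUS: likely a /dev/shm write exceeded ShmSize
```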

### GPU Monitoring
- Query NVIDIA GPUs via NVML (nvidia-ml-py)
- Query AMD GPUs via ROCm SMI
- Track GPU utilization and memory usage
- Measure GPU temperature and power consumption

### Disk I/O Monitoring
- Track read/write operations per container
- Measure I/O bandwidth usage
- Monitor disk space consumption
- Enforce I/O throttling when configured

## Plugin System

The agent uses a plugin system for accelerator support:

### CUDA Plugin
- Detect NVIDIA GPUs via `nvidia-smi`
- Allocate GPU devices to containers
- Set `CUDA_VISIBLE_DEVICES` environment variable
- Monitor GPU metrics via NVML

### ROCm Plugin
- Detect AMD GPUs via `rocm-smi`
- Allocate GPU devices to containers
- Set `HIP_VISIBLE_DEVICES` environment variable
- Monitor GPU metrics via ROCm

### TPU Plugin
- Detect Google TPUs
- Configure TPU access for TensorFlow
- Monitor TPU utilization

## Communication Protocols

### Manager → Agent (ZeroMQ RPC)
- **Port**: 6011 (default)
- **Protocol**: ZeroMQ request-response
- **Operations**:
  - `create_kernel`: Create new container
  - `destroy_kernel`: Terminate container
  - `restart_kernel`: Restart container
  - `execute_code`: Execute code in container
  - `get_status`: Query agent and kernel status
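
The operation names above map naturally onto a dispatch table. A minimal stdlib-only sketch with hypothetical handler stubs (the real agent dispatches RPC calls through the Callosum library, which appears in this package's dependencies, over ZeroMQ):

```python
# Hypothetical handler stubs; real handlers create/destroy containers, etc.
def create_kernel(payload):
    return {"ok": True, "op": "create_kernel", "kernel_id": payload["kernel_id"]}

def get_status(payload):
    return {"ok": True, "op": "get_status", "status": "alive"}

HANDLERS = {
    "create_kernel": create_kernel,
    "get_status": get_status,
}

def dispatch(request: dict) -> dict:
    """Route an RPC-style request dict to its handler."""
    handler = HANDLERS.get(request["op"])
    if handler is None:
        return {"ok": False, "error": f"unknown op: {request['op']}"}
    return handler(request.get("payload", {}))

print(dispatch({"op": "get_status"})["status"])  # → alive
```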

### Agent → Manager (HTTP Watcher API)
- **Port**: 6009 (default)
- **Protocol**: HTTP
- **Operations**:
  - Heartbeat signals
  - Resource usage reporting
  - Kernel status updates
  - Error notifications

### Agent → Storage Proxy
- **Protocol**: HTTP
- **Operations**:
  - Mount vfolder
  - Unmount vfolder
  - Query vfolder metadata

## Container Execution Flow

```
1. Manager sends create_kernel RPC
   ↓
2. Agent validates resource availability
   ↓
3. Agent pulls container image (if needed)
   ↓
4. Agent creates scratch directory
   ↓
5. Agent mounts vfolders via Storage Proxy
   ↓
6. Agent creates container with resources
   ↓
7. Agent starts container and runs init script
   ↓
8. Agent registers service ports
   ↓
9. Agent reports kernel status to Manager
   ↓
10. Container runs until termination
   ↓
11. Agent cleans up resources upon termination
```
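
One property worth noting in the flow above is that the cleanup in step 11 must run even if an earlier step fails. A minimal sketch of that guarantee, with hypothetical placeholder step functions:

```python
def run_kernel_lifecycle(steps, cleanup):
    """Run setup/execution steps in order; always run cleanup (step 11)."""
    completed = []
    try:
        for name, step in steps:
            step()
            completed.append(name)
    finally:
        cleanup()  # runs on success and on failure alike
    return completed

def failing_pull():
    raise RuntimeError("image pull failed")

log = []
steps = [
    ("validate_resources", lambda: None),
    ("pull_image", failing_pull),
]
try:
    run_kernel_lifecycle(steps, cleanup=lambda: log.append("cleaned up"))
except RuntimeError:
    pass
print(log)  # → ['cleaned up']
```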

Next, prepare the source clone of the agent and install from it as follows.
`pyenv` is just a recommendation; you may use other virtualenv management tools.

```console
$ git clone https://github.com/lablup/backend.ai-agent agent
$ cd agent
$ pyenv virtualenv venv-agent
$ pyenv local venv-agent
$ pip install -U pip setuptools
$ pip install -U -r requirements/dev.txt
```

### Linting

We use `flake8` and `mypy` to statically check our code styles and type consistency.
Enable those linters in your favorite IDE or editor.

### Halfstack (single-node development & testing)

With halfstack, you can run the agent with minimal setup.
Note that you need a working manager already running in halfstack mode!

#### Recommended directory structure

* `backend.ai-dev`
  - `manager` (git clone from [the manager repo](https://github.com/lablup/backend.ai-manager))
  - `agent` (git clone from here)
  - `common` (git clone from [the common repo](https://github.com/lablup/backend.ai-common))

Install `backend.ai-common` as an editable package in the agent (and the manager) virtualenvs
to keep the codebase up-to-date.

```console
$ cd agent
$ pip install -U -e ../common
```

#### Steps

```console
$ mkdir -p "./scratches"
$ cp config/halfstack.toml ./agent.toml
```

If you are running the agent on Linux, make sure you have set the appropriate iptables rules
before starting the agent. This can be done by executing the `scripts/update-metadata-iptables.sh`
script before each agent start.

Then, run it (for debugging, append a `--debug` flag):

```console
$ python -m ai.backend.agent.server
```

To run the agent-watcher:

```console
$ python -m ai.backend.agent.watcher
```

The watcher shares the same configuration TOML file with the agent.
Note that the watcher is only meaningful if the agent is installed as a systemd service
named `backendai-agent.service`.

To run tests:

```console
$ python -m flake8 src tests
$ python -m pytest -m 'not integration' tests
```


## Deployment

### Configuration

Put a TOML-formatted agent configuration (see the sample in `config/sample.toml`)
in one of the following locations:

 * `agent.toml` (current working directory)
 * `~/.config/backend.ai/agent.toml` (user-config directory)
 * `/etc/backend.ai/agent.toml` (system-config directory)

Only the first one found is used by the daemon.
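
The first-match lookup can be sketched with `pathlib` (illustrative only; the daemon's actual resolution logic lives in its config loader):

```python
from pathlib import Path

def find_config(candidates):
    """Return the first existing config file path, or None (first match wins)."""
    for path in candidates:
        if Path(path).is_file():
            return Path(path)
    return None

candidates = [
    "agent.toml",                                    # current working directory
    Path.home() / ".config/backend.ai/agent.toml",   # user-config directory
    "/etc/backend.ai/agent.toml",                    # system-config directory
]
config_path = find_config(candidates)
```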

The agent reads most other configurations from the etcd v3 server where the cluster
administrator or the Backend.AI manager stores all the necessary settings.

The etcd address and namespace must match those of the manager for the agent to be
paired and activated.
By specifying distinct namespaces, you may share a single etcd cluster among multiple
separate Backend.AI clusters.

By default the agent uses the `/var/cache/scratches` directory for creating temporary
home directories used by kernel containers (mounted as the `/home/work` volume inside
containers).  Note that the directory must exist in advance and the user running the
agent must own it.  You can change the location via the `scratch-root` option in
`agent.toml`.

### Running from a command line

The minimal command to execute:

```sh
python -m ai.backend.agent.server
python -m ai.backend.agent.watcher
```

For more arguments and options, run the commands with the `--help` option.

### Example config for systemd

`/etc/systemd/system/backendai-agent.service`:

```dosini
[Unit]
Description=Backend.AI Agent
Requires=docker.service
After=network.target remote-fs.target docker.service

[Service]
Type=simple
User=root
Group=root
Environment=HOME=/home/user
ExecStart=/home/user/backend.ai/agent/run-agent.sh
WorkingDirectory=/home/user/backend.ai/agent
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`/home/user/backend.ai/agent/run-agent.sh`:

```sh
#! /bin/sh
if [ -z "$PYENV_ROOT" ]; then
  export PYENV_ROOT="$HOME/.pyenv"
  export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

cd /home/user/backend.ai/agent
if [ "$#" -eq 0 ]; then
  sh /home/user/backend.ai/agent/scripts/update-metadata-iptables.sh
  exec python -m ai.backend.agent.server
else
  exec "$@"
fi
```

### Networking

The manager and agent should run in the same local network or different
networks reachable via VPNs, whereas the manager's API service must be exposed to
the public network or another private network that users have access to.

The manager must be able to access TCP ports 6001, 6009, and 30000-31000 of the agents
in the default configuration.  You can of course change those port numbers and ranges in the configuration.

| Manager-to-Agent TCP Ports | Usage |
|:--------------------------:|-------|
| 6001                       | ZeroMQ-based RPC calls from managers to agents |
| 6009                       | HTTP watcher API |
| 30000-31000                | Port pool for in-container services |

The operation of the agent itself does not require incoming or outgoing access to
the public Internet, but if the user's computation programs need the Internet, the Docker containers
should be able to access the public Internet (possibly via corporate firewalls or proxies).

| Agent-to-X TCP Ports     | Usage |
|:------------------------:|-------|
| manager:5002             | ZeroMQ-based event push from agents to the manager |
| etcd:2379                | etcd API access |
| redis:6379               | Redis API access |
| docker-registry:{80,443} | Docker image registry access (image pulls) |
| (Other hosts)            | Depending on user program requirements |


## Licenses

* [GNU Lesser General Public License](https://github.com/lablup/backend.ai-agent/blob/master/LICENSE)
* [Dependencies](https://github.com/lablup/backend.ai-agent/blob/master/DEPENDENCIES.md)
