Metadata-Version: 2.4
Name: afrilink-sdk
Version: 0.5.6
Summary: AfriLink SDK — One-line access to GPUs, models and datasets from your notebook
Home-page: https://github.com/dataspires/afrilink-sdk
Author: DataSpires
Author-email: DataSpires <info@dataspires.com>
License-Expression: MIT
Project-URL: Homepage, https://dataspires.com
Project-URL: Documentation, https://www.dataspires.com/#About-Us
Project-URL: Repository, https://github.com/DataSpires/afrilink-sdk
Project-URL: Bug Tracker, https://github.com/DataSpires/afrilink-sdk/issues
Keywords: hpc,high-performance-computing,finetuning,llm,lora,notebook,gpu,slurm,afrilink
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: System :: Distributed Computing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: full
Requires-Dist: requests>=2.28.0; extra == "full"
Requires-Dist: psutil>=5.9.0; extra == "full"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# AfriLink SDK

**Version:** 0.5.6

**Last Updated:** March 21, 2026

**Finetune LLMs on HPC from your notebook**

AfriLink SDK gives you one-line access to GPUs, models, and datasets, all ready to use directly from your notebook. Authenticate, submit LoRA finetune jobs, download trained weights, and run inference without ever leaving your notebook interface.

```bash
pip install afrilink-sdk
```

---

## Quick Start

```python
from afrilink import AfriLinkClient

# 1. Authenticate (prompts for DataSpires email/password, then auto-handles HPC)
client = AfriLinkClient()
client.authenticate()

# 2. Prepare your dataset (pandas DataFrame with "text" column)
import pandas as pd
data = pd.DataFrame({"text": [
    "Below is an instruction...\n\n### Response:\nHere is the answer..."
]})

# 3. Submit a finetune job
job = client.finetune(
    model="qwen2.5-0.5b",    # The model you choose to finetune
    training_mode="low",      # How much training: "low", "medium", or "high"
    data=data,                # Your dataset (DataFrame, HF Dataset, or file path)
    gpus=1,                   # Number of A100 GPUs to use
    time_limit="01:00:00",    # Maximum time your job should run for
    backend="cineca",         # HPC backend: "cineca" (default), "eversetech", "agh", or "acf"
)
result = job.run(wait=True)   # blocks until SLURM job finishes

# 4. Download the trained adapter (only if job succeeded)
if result["status"] == "completed":
    client.download_model(result["job_id"], "./my-model")

    # 5. Load & run inference
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
    model = PeftModel.from_pretrained(base, "./my-model")
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

    out = model.generate(**tokenizer("Hello!", return_tensors="pt"), max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
else:
    print(f"Job failed with status: {result['status']}")
    print("Check logs with job.get_logs()")
```

---

## Installation

```bash
pip install afrilink-sdk
```

The package has **zero required dependencies** — heavy libraries (requests, torch, transformers, peft) are only needed at the point you actually use them and are pre-installed in most notebook environments.
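The package metadata also declares two optional extras for users who want the networking/monitoring helpers or the development toolchain installed up front:

```bash
# Optional extras declared in the package metadata:
pip install "afrilink-sdk[full]"   # requests, psutil
pip install "afrilink-sdk[dev]"    # pytest, black, mypy
```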

---

## Authentication

AfriLink uses a two-phase auth flow. Both phases happen inside a single `client.authenticate()` call:

| Phase | What happens | User action |
|-------|-------------|-------------|
| **1. DataSpires** | Validates your DataSpires account for billing/telemetry | Enter email + password when prompted |
| **2. HPC** | Headless Selenium browser automation gets SSH certificates via Smallstep | Fully automatic (org credentials auto-provisioned) |

```python
from afrilink import AfriLinkClient

client = AfriLinkClient()
client.authenticate()   # prompts for DataSpires creds, then auto-handles HPC

# Or pass credentials explicitly:
client.authenticate(
    dataspires_email="you@example.com",
    dataspires_password="...",
)
```

After authentication you get:
- SSH certificate valid for ~12 hours (the SDK warns you before it expires — see [Session Recovery](#session-recovery))
- SLURM job manager ready to submit jobs
- SCP transfer manager ready to move files
- Telemetry tracker logging GPU-minutes to your DataSpires account

---

## API Reference

### `AfriLinkClient`

Main entry point. Created once per notebook session.

| Method | Description |
|--------|-------------|
| `authenticate()` | Full auth flow (DataSpires + HPC) |
| `finetune(model, training_mode, data, gpus, ...)` | Create a `FinetuneJob` |
| `download_model(job_id, local_dir)` | Download trained adapter weights |
| `upload_dataset(local_path, dataset_name)` | Upload dataset to HPC |
| `list_available_models(size=None)` | List models in the registry |
| `list_available_datasets()` | List datasets in the registry |
| `get_model_requirements(model, training_mode)` | GPU/memory recommendations |
| `list_jobs()` | List SLURM queue |
| `recover_session(download_dir=None)` | Re-authenticate + check/download tracked jobs |
| `cancel_job(job_id)` | Cancel a running job |
| `run_command(command)` | Run arbitrary shell command on HPC login node |
| `get_queue_status()` | SLURM partition info |
| `cert_minutes_remaining` | Minutes until SSH certificate expires |
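The `cert_minutes_remaining` property reports how long the ~12-hour SSH certificate has left. A hedged sketch of the arithmetic it implies (not SDK code; the helper name is hypothetical):

```python
from datetime import datetime, timedelta, timezone

def minutes_remaining(expiry, now=None):
    """Whole minutes until an SSH certificate expires, floored at zero
    (a hypothetical sketch of what cert_minutes_remaining reports)."""
    now = now or datetime.now(timezone.utc)
    return max(0, int((expiry - now).total_seconds() // 60))

# A freshly issued certificate is valid for roughly 12 hours:
issued = datetime.now(timezone.utc)
print(minutes_remaining(issued + timedelta(hours=12)))
```

When this value approaches zero, call `recover_session()` (see [Session Recovery](#session-recovery)).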

### `client.finetune()`

```python
job = client.finetune(
    model="qwen2.5-0.5b",       # model ID from registry
    training_mode="low",          # "low" | "medium" | "high"
    data=my_dataframe,            # pandas DataFrame, HF Dataset, or file path
    gpus=1,                       # number of A100 GPUs
    time_limit="01:00:00",        # max wallclock (HH:MM:SS)
    backend="cineca",             # HPC backend cluster
    output_dir=None,              # default: $WORK/finetune_outputs
)
```
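The `time_limit` string follows SLURM's `HH:MM:SS` wallclock convention. A minimal sketch of client-side validation (a hypothetical helper, not part of the SDK):

```python
import re

def validate_time_limit(value):
    """Validate a SLURM-style HH:MM:SS wallclock string
    (hypothetical helper, not an SDK function)."""
    if not re.fullmatch(r"\d{2}:\d{2}:\d{2}", value):
        raise ValueError(f"time_limit must be HH:MM:SS, got {value!r}")
    _, minutes, seconds = (int(p) for p in value.split(":"))
    if minutes > 59 or seconds > 59:
        raise ValueError(f"minutes/seconds out of range in {value!r}")
    return value

print(validate_time_limit("01:00:00"))  # → 01:00:00
```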

**HPC Backends:**

| Backend | Provider | Region | Status |
|---------|----------|--------|--------|
| `cineca` | CINECA Leonardo (EuroHPC) | Bologna, Italy | Available (default) |
| `eversetech` | EverseTech | Variable | Coming soon |
| `agh` | AGH | Variable | Coming soon |
| `acf` | ACF | Variable | Coming soon |

**Training modes:**

| Mode | Strategy | Quantization | Typical GPUs |
|------|----------|-------------|--------------|
| `low` | QLoRA (rank 8) | 4-bit | 1 |
| `medium` | LoRA (rank 16) | 8-bit / none | 1-2 |
| `high` | LoRA (rank 64) + DDP/FSDP | none | 2-4+ |

### `FinetuneJob`

Returned by `client.finetune()`.

| Method / Property | Description |
|-------------------|-------------|
| `run(wait=True)` | Submit to SLURM. `wait=True` polls until done. |
| `cancel()` | Cancel the SLURM job |
| `get_logs(tail=100)` | Fetch recent log lines |
| `status` | Current status string |
| `job_id` | AfriLink job ID (8-char UUID prefix) |
| `slurm_job_id` | SLURM numeric job ID (set after `run()`) |

`run()` returns a dict:

```python
{
    "job_id": "a1b2c3d4",
    "slurm_job_id": "12345678",
    "status": "completed",        # or "submitted" if wait=False
    "output_dir": "/path/...",
    "model_path": "/path/...",
}
```

### Session Recovery

SSH certificates expire after ~12 hours. The SDK monitors this automatically and warns you before expiry. When you see the warning — or when you return to a notebook after being away — call `recover_session()` to re-authenticate and pick up where you left off:

```python
# Re-authenticate and check on all tracked jobs
recovery = client.recover_session("./recovered-models")

# recovery.re_authenticated  — True if fresh SSH cert was obtained
# recovery.jobs               — status of each tracked SLURM job
# recovery.files_retrieved    — list of model dirs downloaded for completed jobs
```

What `recover_session()` does:

1. **Re-authenticates with CINECA** — gets a fresh SSH certificate without re-entering credentials
2. **Checks all tracked SLURM jobs** — reports status of every job submitted in this session
3. **Downloads completed models** — if you pass a `download_dir`, finished adapters are pulled automatically
4. **Registers email notification** — for jobs still running, you'll get an email when they finish

Your SLURM jobs keep running on the cluster even after your certificate expires — you just need fresh credentials to check on them or download results.

```python
# Minimal usage (just re-auth, no download)
client.recover_session()

# With download directory for completed jobs
client.recover_session("./my-models")
```

---

### `client.download_model()`

```python
client.download_model(result["job_id"], "./my-model")
```

Downloads adapter files (`adapter_config.json`, `adapter_model.safetensors`, tokenizer files) flat into the target directory — ready for `PeftModel.from_pretrained()`.

### Model & Dataset Registry

```python
# List all models
client.list_available_models()

# Filter by size
client.list_available_models(size="tiny")   # tiny | small | medium | large

# List datasets
client.list_available_datasets()

# Resource requirements
client.get_model_requirements("qwen2.5-0.5b", "low")
```

**Available models:**

| ID | Name | Type | Params | Min VRAM |
|----|------|------|--------|----------|
| `qwen2.5-0.5b` | Qwen 2.5 0.5B | text | 0.5B | 4 GB |
| `gemma-3-270m` | Gemma 3 270M | text | 0.27B | 2 GB |
| `llama-3.2-1b` | Llama 3.2 1B | text | 1.0B | 4 GB |
| `deepseek-r1-1.5b` | DeepSeek R1 1.5B | text | 1.5B | 6 GB |
| `ministral-3b` | Ministral 3B | text | 3.3B | 8 GB |
| `florence-2-base` | Florence 2 Base | vision | 0.23B | 4 GB |
| `smolvlm-256m` | SmolVLM 256M | vision | 0.26B | 2 GB |
| `moondream2` | Moondream 2 | vision | 1.9B | 8 GB |
| `internvl2-1b` | InternVL2 1B | vision | 1.0B | 4 GB |
| `llava-1.5-7b` | LLaVA 1.5 7B | vision | 7.0B | 16 GB |
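The `size` filter accepted by `list_available_models(size=...)` can be pictured as a simple predicate over registry entries. A toy sketch (registry rows and size labels here are illustrative stand-ins, not the real registry):

```python
# Toy sketch of the size filter behind list_available_models(size=...).
REGISTRY = [
    {"id": "qwen2.5-0.5b", "type": "text",   "size": "tiny"},
    {"id": "ministral-3b", "type": "text",   "size": "small"},
    {"id": "llava-1.5-7b", "type": "vision", "size": "large"},
]

def list_models(size=None):
    """Return model IDs, optionally filtered by size bucket."""
    return [m["id"] for m in REGISTRY if size is None or m["size"] == size]

print(list_models(size="tiny"))  # → ['qwen2.5-0.5b']
```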

### Data Transfer

```python
# Upload a dataset
client.upload_dataset("./train.jsonl", dataset_name="my-data")

# Download model weights
client.download_model("a1b2c3d4", "./my-model")

# List remote files
client.transfer.list_remote_files("$WORK/finetune_outputs/")

# Run shell commands on HPC
client.run_command("squeue -u $USER")
```

### Dataset Formats

`client.finetune(data=...)` accepts:

| Type | How it's handled |
|------|-----------------|
| `pandas.DataFrame` | Serialised to JSONL, uploaded via SCP |
| `datasets.Dataset` | Saved to disk, uploaded via SCP |
| `str` (local path) | Uploaded via SCP |
| `str` (starts with `$`) | Treated as remote HPC path (no upload) |

Your DataFrame should have a `text` column with the full prompt+response formatted as a single string (Alpaca-style or chat template).
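Under the hood a DataFrame is serialised to JSONL before upload. The standard-library sketch below shows the Alpaca-style `text` convention and the resulting JSONL shape (the helper name and prompt template are illustrative):

```python
import json

def to_alpaca_text(instruction, response):
    """Format one training example as a single Alpaca-style string
    (hypothetical helper matching the `text` column convention)."""
    return (
        "Below is an instruction that describes a task.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{response}"
    )

# One JSON object per line, each with a "text" field:
rows = [{"text": to_alpaca_text("Say hello.", "Hello!")}]
jsonl = "\n".join(json.dumps(r) for r in rows)
print(jsonl)
```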

---

## Architecture

```
Notebook Interface                      High Performance Compute
+--------------+      SSH/SCP          +------------------+
| AfriLink SDK | ------------------->  |  Login Node      |
|              |  (Smallstep certs)    |  +- SLURM sbatch |
| DataSpires   |                       |  +- $WORK/       |
| (billing)    |                       |  |  +- containers|
|              |                       |  |  +- datasets  |
+--------------+                       |  |  +- finetune_ |
                                       |  |     outputs/  |
                                       |  |     +- {jobid}|
                                       |  +- Singularity  |
                                       |     container    |
                                       |     (A100 GPUs)  |
                                       +------------------+
```

---

## Publishing to PyPI

For maintainers:

```bash
cd afrilink-sdk
pip install build twine

# Build wheel + sdist
python -m build

# Upload to PyPI (requires PyPI API token)
twine upload dist/*
```

You'll need a PyPI account at https://pypi.org and an API token configured in `~/.pypirc` or passed via `--username __token__ --password pypi-...`.

---

## License

MIT
