Metadata-Version: 2.4
Name: sdc-agents
Version: 4.3.0
Summary: Purpose-scoped ADK agents for SDC4 data operations
Project-URL: Repository, https://github.com/SemanticDataCharter/SDC_Agents
Project-URL: Documentation, https://github.com/SemanticDataCharter/SDC_Agents/blob/main/docs/dev/SDC_AGENTS_PRD.md
Project-URL: Issues, https://github.com/SemanticDataCharter/SDC_Agents/issues
Project-URL: Changelog, https://github.com/SemanticDataCharter/SDC_Agents/blob/main/CHANGELOG.md
Author-email: "Timothy W. Cook" <tim@semanticdatacharter.org>
Maintainer-email: "Axius SDC, Inc." <contact@axius-sdc.com>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: adk,agents,data-modeling,rdf,sdc4,semantic-data,sparql,xml
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: click>=8
Requires-Dist: google-adk>=1.25
Requires-Dist: httpx>=0.27
Requires-Dist: jsonpath-ng>=1.6
Requires-Dist: motor>=3.6
Requires-Dist: pydantic>=2
Requires-Dist: pyyaml>=6
Requires-Dist: sqlalchemy>=2
Provides-Extra: bigquery
Requires-Dist: google-cloud-bigquery>=3; extra == 'bigquery'
Provides-Extra: dev
Requires-Dist: aiosqlite>=0.20; extra == 'dev'
Requires-Dist: black>=24; extra == 'dev'
Requires-Dist: mongomock>=4; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24; extra == 'dev'
Requires-Dist: pytest-cov>=5; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Provides-Extra: knowledge
Requires-Dist: chromadb>=0.5; extra == 'knowledge'
Requires-Dist: pymupdf>=1.24; extra == 'knowledge'
Requires-Dist: python-docx>=1.1; extra == 'knowledge'
Provides-Extra: toolbox
Requires-Dist: toolbox-adk>=0.6; extra == 'toolbox'
Provides-Extra: vertex-ai-search
Requires-Dist: google-cloud-aiplatform>=1.52; extra == 'vertex-ai-search'
Description-Content-Type: text/markdown

# SDC Agents

**Purpose-scoped ADK agents for producing SDC4-compliant data artifacts from existing datastores.**

[![PyPI](https://img.shields.io/pypi/v/sdc-agents.svg)](https://pypi.org/project/sdc-agents/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![SDC4](https://img.shields.io/badge/SDC-Generation_4-green.svg)](https://github.com/SemanticDataCharter/SDCRM)

![SDC Agents — The key to SDCStudio's self-assembling semantic infrastructure](SDC_Agents_Key.png)

---

## What is SDC Agents?

SDC Agents is an open-source suite of **nine purpose-scoped agents** built on Google's [Agent Development Kit (ADK)](https://google.github.io/adk-docs/) that transform data from SQL databases, CSV files, and JSON sources into validated, multi-format SDC4 artifacts — without requiring the user to write XML, RDF, or GQL by hand.

Each agent is an ADK `LlmAgent` with a narrowly scoped `BaseToolset`, auditable activity, and enforced isolation boundaries. No single agent can reach across scope boundaries, so a compromised or misbehaving agent has a blast radius limited to its purpose.

**MCP compatibility**: Each toolset can also be exported as an MCP server for framework-agnostic integration with non-ADK clients.

![From Craftsman to Factory: Traditional RAG/ETL vs Axius SDC pipeline](docs/images/craftsman-to-factory.png)

---

## Architecture: Nine Agents

| Agent | Purpose | Network | Datasource Access |
|---|---|---|---|
| **Catalog Agent** | Discover published SDC4 schemas and download artifacts from SDCStudio | HTTPS (optional token auth) | None |
| **Introspect Agent** | Examine customer datasources and extract structure (read-only) | None | Read-only |
| **Mapping Agent** | Suggest and manage column-to-component mappings | None | None |
| **Generator Agent** | Produce SDC4 XML instances from mapped data | None | Read-only |
| **Validation Agent** | Validate and sign XML instances via VaaS API | HTTPS (token auth) | None |
| **Distribution Agent** | Route artifact packages to customer-local destinations | Customer-local only | None |
| **Knowledge Agent** | Ingest customer context (CSV, JSON, TTL, Markdown, PDF, DOCX) into vector store | None | Read-only (files) |
| **Assembly Agent** | Discover components, propose hierarchies, assemble published models | HTTPS (Assembly API) | None |
| **Semantic Discovery Agent** | Search Vertex AI Search for SDC4 resources (ADK-only) | GCP (Vertex AI Search) | None |

### Security Principles

1. **No agent has both datasource access and network access**
2. **Read-only datasource access** — no agent can write to customer data
3. **Tools are declarative Python functions** — ADK derives schemas from type hints and docstrings
4. **Structured audit log** — every tool call logged with agent, tool, inputs, outputs, timestamp
5. **No credential sharing** — each `BaseToolset` receives only its own credential scope
6. **Fail closed** — errors are returned, never retried with escalated privileges
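
Principle 3 can be illustrated with the standard library alone. The sketch below shows how a tool description could be derived from a function's type hints and docstring. It is not ADK's actual mechanism; `describe_tool` and `example_list_schemas` are hypothetical names used only for illustration.

```python
import inspect
from typing import get_type_hints


def example_list_schemas(query: str, limit: int = 10) -> list[str]:
    """Return titles of schemas whose name contains the query (hypothetical tool)."""
    catalog = ["Patient Vitals", "Lab Result", "Patient Demographics"]
    return [title for title in catalog if query.lower() in title.lower()][:limit]


def describe_tool(func) -> dict:
    """Build a simple tool description from a function's signature and docstring."""
    hints = get_type_hints(func)
    parameters = {}
    for name, param in inspect.signature(func).parameters.items():
        hint = hints.get(name)
        parameters[name] = {
            "type": getattr(hint, "__name__", str(hint)),
            # A parameter without a default value must be supplied by the caller.
            "required": param.default is inspect.Parameter.empty,
        }
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func),
        "parameters": parameters,
    }


description = describe_tool(example_list_schemas)
print(description["parameters"])
```

Because the description comes from the function itself, the declared interface and the implementation cannot drift apart.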

### IEEE 7000-2021 Alignment

SDC Agents is designed to be consistent with the [IEEE 7000-2021](https://standards.ieee.org/ieee/7000/6781/) Value-based Engineering principles for ethical autonomous system design:

- **Transparency** — append-only structured audit log records every tool invocation with agent, tool, inputs, outputs, timestamp, and duration
- **Traceability** — all inter-agent handoffs are inspectable files on disk (`.sdc-cache/`), not opaque in-memory calls
- **Harm minimization** — purpose-scoped isolation ensures no single agent can access both customer datasources and external networks; blast radius is confined to each agent's scope
- **Stakeholder value preservation** — SDC4's curated, constraint-based semantic model (`xsd:restriction` only, immutable schemas) encodes data integrity and endurance as system-level guarantees, not optional features
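
As a rough illustration of the transparency point, an append-only JSONL audit record with credential redaction might look like the sketch below. The field names, the `SENSITIVE_KEYS` set, and the `audited_call` helper are assumptions for this example, not the package's actual logger.

```python
import json
import tempfile
import time
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical set of keys whose values must never reach the log.
SENSITIVE_KEYS = {"api_key", "token", "password", "connection_string", "auth"}


def redact(value):
    """Recursively replace the values of sensitive keys with a fixed marker."""
    if isinstance(value, dict):
        return {
            key: "***REDACTED***" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def audited_call(log_path: Path, agent: str, tool_name: str, tool, inputs: dict):
    """Run a tool and append one JSON line describing the call to the audit log."""
    started = time.perf_counter()
    outputs = tool(**inputs)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "tool": tool_name,
        "inputs": redact(inputs),
        "outputs": redact(outputs),
        "duration_ms": round((time.perf_counter() - started) * 1000, 3),
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only appends; existing records are never rewritten.
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return outputs


# Demonstration against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / ".sdc-cache" / "audit.jsonl"
    audited_call(
        log_file, "catalog", "echo",
        lambda **kw: {"echo": kw["q"]},
        {"q": "vitals", "api_key": "secret"},
    )
    logged = [json.loads(line) for line in log_file.read_text(encoding="utf-8").splitlines()]

print(logged[0]["inputs"])
```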

### Data Flow

Agents communicate through **files on disk**, not direct calls. Every handoff is an inspectable, version-controllable artifact:

```
Catalog Agent → .sdc-cache/schemas/     ─┐
Introspect Agent → .sdc-cache/introspections/ ─┤
                                               ▼
                                    Mapping Agent → .sdc-cache/mappings/
                                               ▼
                                    Generator Agent → ./sdc-output/*.xml
                                               ▼
                                    Validation Agent → ./sdc-output/*.pkg.zip
                                               ▼
                                    Distribution Agent → customer destinations
```

---

## SDCStudio API Dependencies

SDC Agents consumes two sets of endpoints from [SDCStudio](https://github.com/Axius-SDC/SDCStudio):

- **Catalog API** (public, optional token auth) — schema discovery, component trees, skeleton templates, schema-level RDF, reference ontologies
- **VaaS API** (token auth) — XML validation, signing, artifact package generation

> **Authenticated Catalog Lookups**: When an API key is provided, catalog search results are filtered according to the Modeler's project preferences configured in SDCStudio. If the Modeler's `prj_filter` setting is enabled (the default), results are scoped to their default project. Without an API key, the catalog returns all published public schemas. This means the same `catalog_list_schemas` tool returns personalized results for authenticated users and broad results for anonymous browsing, with no change to the tool interface.
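
The "same tool, different scope" behavior can be sketched as a request builder whose only variation is an optional header. The `Authorization: Token <key>` scheme and the `build_catalog_headers` helper are assumptions for illustration; consult the PRD for the actual contract.

```python
from typing import Optional


def build_catalog_headers(api_key: Optional[str] = None) -> dict:
    """Return request headers; only authenticated lookups carry credentials."""
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumption: token auth is sent as "Authorization: Token <key>".
        headers["Authorization"] = f"Token {api_key}"
    return headers


anonymous_headers = build_catalog_headers()
authenticated_headers = build_catalog_headers("example-key")
print(sorted(anonymous_headers), sorted(authenticated_headers))
```

The caller's code path is identical in both cases; any filtering by project preference happens on the server side.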

See [docs/dev/SDC_AGENTS_PRD.md](docs/dev/SDC_AGENTS_PRD.md) for the full API contract and agent specifications.

---

## Quick Start

### Prerequisites

- Python 3.11+
- Google ADK 1.25+ (`pip install google-adk`)

### Installation

```bash
# From PyPI
pip install sdc-agents

# From a clone of the repository, with development extras
pip install -e ".[dev]"
```

### Configuration

Copy `sdc-agents.example.yaml` to `sdc-agents.yaml` and fill in values:

```yaml
sdcstudio:
  base_url: "https://sdcstudio.example.com"
  api_key: "${SDC_API_KEY}"          # Token auth (Catalog preferences + VaaS validation)

cache:
  root: ".sdc-cache"
  ttl_hours: 24

audit:
  path: ".sdc-cache/audit.jsonl"
  log_level: "standard"    # "standard" summarizes outputs; "verbose" logs full payloads

datasources:
  my_database:
    type: sql
    connection_string: "${DB_CONNECTION}"   # env var substitution
  my_csv:
    type: csv
    path: "/data/exports/records.csv"

output:
  directory: "./output"
  formats:
    - "xml"

destinations:
  triplestore:
    type: fuseki
    endpoint: "${FUSEKI_URL}"
    auth: "${FUSEKI_AUTH}"
  graph_database:
    type: neo4j
    endpoint: "${NEO4J_URL}"
    database: "sdc4"
  archive:
    type: filesystem
    path: "./archive/{ct_id}/{instance_id}/"
    create_directories: true
```

Environment variables use `${VAR}` syntax. Missing variables cause an immediate `KeyError` (fail closed).
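
A minimal sketch of fail-closed `${VAR}` substitution, assuming a simple regular-expression approach (the `substitute_env` helper is hypothetical and may differ from the package's config loader):

```python
import os
import re

_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value: str, environ=None) -> str:
    """Replace every ${VAR} in value; raise KeyError if a variable is unset."""
    env = os.environ if environ is None else environ

    def lookup(match: re.Match) -> str:
        # A plain mapping lookup raises KeyError for a missing name,
        # so there is no silent fallback to an empty string.
        return env[match.group(1)]

    return _VAR_PATTERN.sub(lookup, value)


example_env = {"FUSEKI_URL": "https://fuseki.example.com/sdc4"}
print(substitute_env("${FUSEKI_URL}", example_env))

try:
    substitute_env("${FUSEKI_AUTH}", example_env)
except KeyError as missing:
    print(f"missing variable: {missing}")
```

Failing at load time means a misconfigured deployment stops before any agent runs with partial credentials.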

### Usage (ADK — Primary)

```python
from sdc_agents.common.config import load_config
from sdc_agents.agents.catalog import create_catalog_agent
from sdc_agents.agents.introspect import create_introspect_agent
from sdc_agents.agents.mapping import create_mapping_agent
from sdc_agents.agents.generator import create_generator_agent
from sdc_agents.agents.validation import create_validation_agent
from sdc_agents.agents.distribution import create_distribution_agent

config = load_config("sdc-agents.yaml")

# Each factory returns an LlmAgent with its scoped BaseToolset
catalog_agent = create_catalog_agent(config)
introspect_agent = create_introspect_agent(config)
mapping_agent = create_mapping_agent(config)
generator_agent = create_generator_agent(config)
validation_agent = create_validation_agent(config)
distribution_agent = create_distribution_agent(config)
```

Or construct agents directly with toolsets:

```python
from sdc_agents.common.config import load_config
from sdc_agents.toolsets.catalog import CatalogToolset
from google.adk.agents import LlmAgent

config = load_config("sdc-agents.yaml")

catalog_agent = LlmAgent(
    name="catalog",
    model="gemini-2.0-flash",
    description="Discovers SDC4 schemas from SDCStudio Catalog API.",
    instruction="Discover published SDC4 schemas and download artifacts.",
    tools=[CatalogToolset(config=config)],
)
```

### Usage (MCP — Secondary)

Each agent can be served as an MCP stdio server for non-ADK clients:

```bash
# Start the Catalog Agent as an MCP server
sdc-agents serve --mcp catalog

# Start the Introspect Agent as an MCP server
sdc-agents serve --mcp introspect

# Any of the 8 MCP agents: assembly, catalog, distribution, generator, introspect, knowledge, mapping, validation
sdc-agents serve --mcp validation
```

### CLI Commands

```bash
# Show configuration summary and agent inventory
sdc-agents info
sdc-agents info --config path/to/sdc-agents.yaml

# Validate a config file (useful in CI)
sdc-agents validate-config
sdc-agents validate-config --config path/to/sdc-agents.yaml

# Inspect the audit log
sdc-agents audit show                        # last 50 records
sdc-agents audit show --agent catalog        # filter by agent
sdc-agents audit show --last 24h --limit 20  # recent records
sdc-agents audit show --audit-path ./logs/audit.jsonl  # custom path
```

### Docker

A single image serves all 8 MCP-servable agents. Select the agent at runtime with `SDC_AGENT`:

```bash
# Serve a single agent as an MCP server
docker run -v ./sdc-agents.yaml:/home/sdc/sdc-agents.yaml:ro \
  -e SDC_AGENT=catalog \
  ghcr.io/semanticdatacharter/sdc-agents

# Run any CLI command
docker run -v ./sdc-agents.yaml:/home/sdc/sdc-agents.yaml:ro \
  ghcr.io/semanticdatacharter/sdc-agents info

docker run -v ./sdc-agents.yaml:/home/sdc/sdc-agents.yaml:ro \
  ghcr.io/semanticdatacharter/sdc-agents validate-config
```

Build locally:

```bash
docker build -t sdc-agents .
docker run sdc-agents  # prints usage hint
```

### CI/CD

- **CI** (`.github/workflows/ci.yml`): Runs on push to `dev` and PRs to `main`. Lints with ruff, checks formatting with black, runs pytest with coverage across Python 3.11/3.12/3.13.
- **Docker** (`.github/workflows/docker.yml`): Builds and pushes to GHCR on push to `main` and `v*` tags.
- **PyPI** (`.github/workflows/release.yml`): Publishes to PyPI on `v*` tags via OIDC trusted publisher (no API tokens).

**One-time setup** (maintainer):
1. Configure [PyPI trusted publisher](https://docs.pypi.org/trusted-publishers/) — owner: `SemanticDataCharter`, repo: `SDC_Agents`, workflow: `release.yml`, environment: `pypi`
2. Create a `pypi` environment in GitHub repo settings (Settings > Environments)

### Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=sdc_agents

# Run specific test modules
pytest tests/toolsets/test_catalog.py
pytest tests/security/
```

---

## Documentation

- **[User Documentation](docs/user/index.md)** — configuration, tool reference, MCP integration, workflow guides
- **[Product Requirements](docs/dev/SDC_AGENTS_PRD.md)** — full agent specifications, tools, security model, type mapping tables
- **[Contributing](CONTRIBUTING.md)** — development setup, coding standards, PR workflow
- **[Security Policy](SECURITY.md)** — vulnerability reporting, agent isolation model
- **[Changelog](CHANGELOG.md)** — release history

---

## Implementation Phases

| Phase | Goal | Status |
|---|---|---|
| **Phase 1** | Catalog, Introspect, and Mapping agents with shared infra | **Complete** |
| **Phase 2** | Generator and Validation agents, Introspect extensions | **Complete** |
| **Phase 3** | Distribution Agent with multi-destination delivery | **Complete** |
| **Phase 4** | Production hardening: CLI, Docker, CI/CD, MCP export, documentation | **Complete** |
| **Phase 5** | Knowledge Agent + Component Assembly Agent | **Complete** |
| **Phase 5.5** | PDF/DOCX Knowledge Sources + Semantic Discovery Agent | **Complete** |
| **Phase 6** | ADK ecosystem contributions (`adk-sparql-tools`, Integration Page) | **Complete** |

### What's Implemented (Phases 1–3)

**Common infrastructure**:
- Pydantic config with `${VAR}` substitution (fail closed), append-only JSONL audit logger with credential redaction, cache manager with path helpers

**CatalogToolset** (5 tools): `catalog_list_schemas`, `catalog_get_schema`, `catalog_download_schema_rdf`, `catalog_download_skeleton`, `catalog_download_ontologies` — httpx async, cache-first for immutable schemas, optional token auth for Modeler-scoped results

**IntrospectToolset** (5 tools): `introspect_sql` (SELECT-only enforcement), `introspect_csv` (type inference for 10 types), `introspect_json` (JSONPath extraction), `introspect_mongodb` (BSON-to-SDC4 type mapping), `introspect_bigquery` (BigQuery schema extraction via `asyncio.to_thread`)
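
To give a feel for SELECT-only enforcement, here is a deliberately conservative guard. It is a sketch under simplifying assumptions, not the toolset's implementation: `ensure_select_only` is a hypothetical helper, and it will reject some valid read queries (for example, ones with a semicolon or a keyword such as `delete` inside a string literal).

```python
import re

# Keywords that indicate a write or schema change anywhere in the statement.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|into)\b",
    re.IGNORECASE,
)


def ensure_select_only(sql: str) -> str:
    """Return the statement if it looks like a single SELECT, else raise ValueError."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"select\b", statement, re.IGNORECASE):
        raise ValueError("only SELECT statements are allowed")
    if _FORBIDDEN.search(statement):
        raise ValueError("statement contains a write or DDL keyword")
    return statement


print(ensure_select_only("SELECT id, updated_at FROM patients;"))
```

A keyword filter like this is best treated as one layer; pairing it with a read-only database role gives a second, independent control.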

**MappingToolset** (3 tools): `mapping_suggest` (type compatibility + name similarity), `mapping_confirm`, `mapping_list`
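
The idea behind combining type compatibility with name similarity can be sketched with `difflib` from the standard library. The compatibility table, the type labels, the `suggest_mappings` helper, and the 0.5 threshold are all illustrative assumptions, not the actual SDC4 type mapping or scoring used by `mapping_suggest`.

```python
from difflib import SequenceMatcher

# Hypothetical table: inferred column type -> acceptable component types.
COMPATIBLE_TYPES = {
    "integer": {"count", "quantity"},
    "decimal": {"quantity"},
    "string": {"text"},
    "boolean": {"boolean"},
}


def normalize(name: str) -> str:
    """Lower-case a name and treat underscores and hyphens as spaces."""
    return name.lower().replace("_", " ").replace("-", " ").strip()


def suggest_mappings(columns, components, threshold=0.5):
    """Pair each column with its best-scoring type-compatible component, if any."""
    suggestions = []
    for column in columns:
        allowed = COMPATIBLE_TYPES.get(column["type"], set())
        best = None
        for component in components:
            if component["type"] not in allowed:
                continue  # incompatible types are never suggested
            score = SequenceMatcher(
                None, normalize(column["name"]), normalize(component["label"])
            ).ratio()
            if score >= threshold and (best is None or score > best["score"]):
                best = {
                    "column": column["name"],
                    "component": component["label"],
                    "score": round(score, 2),
                }
        if best:
            suggestions.append(best)
    return suggestions


columns = [
    {"name": "heart_rate", "type": "integer"},
    {"name": "notes", "type": "string"},
]
components = [
    {"label": "Heart Rate", "type": "count"},
    {"label": "Clinical Notes", "type": "text"},
    {"label": "Heart Rhythm", "type": "text"},
]
suggestions = suggest_mappings(columns, components)
for suggestion in suggestions:
    print(suggestion)
```

Suggestions remain proposals: in the workflow described here, a mapping only takes effect once it is confirmed with `mapping_confirm`.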

**GeneratorToolset** (3 tools): `generate_instance`, `generate_batch`, `generate_preview` — skeleton-based XML generation with placeholder substitution and optional element pruning
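
Placeholder substitution with pruning of optional elements can be sketched with `xml.etree.ElementTree`. The `{{name}}` placeholder syntax, the `optional="true"` marker, and the `fill_skeleton` helper are assumptions made for this example; real SDC4 skeletons may use a different convention.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a placeholder is an element whose whole text is {{key}}.
PLACEHOLDER = re.compile(r"^\{\{(\w+)\}\}$")

SKELETON = """<record>
  <name>{{name}}</name>
  <pulse optional="true">{{pulse}}</pulse>
  <note optional="true">{{note}}</note>
</record>"""


def fill_skeleton(skeleton: str, values: dict) -> str:
    """Fill placeholders from values and drop optional elements that have no data."""
    root = ET.fromstring(skeleton)
    # Snapshot the element lists so removal during the loop is safe.
    for parent in list(root.iter()):
        for child in list(parent):
            match = PLACEHOLDER.match((child.text or "").strip())
            if not match:
                continue
            key = match.group(1)
            if key in values:
                # ElementTree escapes &, <, and > when serializing text.
                child.text = str(values[key])
                child.attrib.pop("optional", None)
            elif child.get("optional") == "true":
                parent.remove(child)
            else:
                raise KeyError(f"no value for required placeholder {key!r}")
    return ET.tostring(root, encoding="unicode")


instance_xml = fill_skeleton(SKELETON, {"name": "A & B Clinic", "pulse": 72})
print(instance_xml)
```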

**ValidationToolset** (3 tools): `validate_instance`, `sign_instance`, `validate_batch` — VaaS API with path confinement, token auth, artifact package (.pkg.zip) support
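
Path confinement generally means resolving a requested path and refusing it if the result falls outside an allowed directory. A minimal sketch with `pathlib` follows (the `confine` helper is hypothetical; `Path.is_relative_to` requires Python 3.9 or later):

```python
from pathlib import Path


def confine(base_dir: Path, candidate: str) -> Path:
    """Resolve candidate under base_dir, rejecting anything that escapes it."""
    base = base_dir.resolve()
    # Joining with an absolute candidate discards base, so absolute paths
    # are checked by the same rule as relative ones.
    resolved = (base / candidate).resolve()
    if not resolved.is_relative_to(base):
        raise PermissionError(f"{candidate!r} resolves outside {base}")
    return resolved


output_dir = Path("sdc-output")
print(confine(output_dir, "batch-1/instance.xml"))

try:
    confine(output_dir, "../secrets.txt")
except PermissionError as error:
    print(f"rejected: {error}")
```

Resolving before comparing is the important step: a plain string prefix check would be fooled by `..` segments or symlinks.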

**DistributionToolset** (5 tools): `inspect_package`, `list_destinations`, `distribute_package`, `distribute_batch`, `bootstrap_triplestore` — httpx-only connectors for SPARQL Graph Store, Neo4j HTTP, REST API, and filesystem

**Agent factories**: `create_catalog_agent()`, `create_introspect_agent()`, `create_mapping_agent()`, `create_generator_agent()`, `create_validation_agent()`, `create_distribution_agent()`

**176+ tests, 82% coverage** — 9 toolsets with 32 disjoint tools, security isolation tests (SQL write rejection, datasource name enforcement, path confinement, credential redaction, no cross-scope tool leakage)

**Consumer-first**: all tests use `httpx.MockTransport` — zero live SDCStudio, Fuseki, or Neo4j dependency

---

## Related Projects

- **[SDCStudio](https://github.com/Axius-SDC/SDCStudio)** — SDC4 data model creation and management platform (provides Catalog and VaaS APIs)
- **[SDCRM](https://github.com/SemanticDataCharter/SDCRM)** — SDC4 Reference Model specification
- **[Form2SDCTemplate](https://github.com/SemanticDataCharter/Form2SDCTemplate)** — PDF/DOCX to SDC template conversion
- **[Google ADK](https://google.github.io/adk-docs/)** — Agent Development Kit (agent framework)

---

## License & Ownership

Copyright 2025-2026 [Axius SDC, Inc.](https://axius-sdc.github.io)

Licensed under the Apache License 2.0 — see [LICENSE](LICENSE) for details.

SDC Agents is controlled and maintained by Axius SDC, Inc. The [SemanticDataCharter](https://github.com/SemanticDataCharter) GitHub organization hosts the open-source SDC4 ecosystem on behalf of Axius SDC, Inc.
