Metadata-Version: 2.2
Name: cf-datahive
Version: 0.1.6
Summary: Canonical result and measurement data storage APIs for Cogniflow
Requires-Python: >=3.11
Requires-Dist: pyarrow>=12
Provides-Extra: pandas
Requires-Dist: pandas>=2.0; extra == "pandas"
Provides-Extra: test
Requires-Dist: pytest>=8.0; extra == "test"
Requires-Dist: pandas>=2.0; extra == "test"
Description-Content-Type: text/markdown

# cf_datahive

`cf_datahive` is the Data Hive package boundary: the Python-facing APIs and tooling around the canonical data hive root (`workspace/data_hive`).

## Boundary (Current Phase)

- Python package role (`sandcastle/cf_datahive`): read-oriented API/tooling/validation for pipeline-facing workflows.
- Native role (`sandcastle/cf_datahive/src/cf_datahive/cpp`): write gatekeeper and only allowed writer under `workspace/data_hive`.
- Step packages must stay thin wrappers and call the native gatekeeper instead of implementing filesystem/parquet helpers.
- Downstream first-party native consumers must discover the packaged gatekeeper consumer surface through the owner package API instead of repo-relative path reach-in.

## Development workflow

- Current development mode is source-first via `scripts/fresh_install.ps1`.
- The package can now be built and published independently without changing the read/write ownership boundary above.

## Canonical layout

```
workspace/
  data_hive/
    <pipeline_id>/
      runs/
        <run_id>/
          manifest.json
          tables/
            <table_name>/
              part-0000.parquet
              part-0001.parquet
          artifacts/
            <artifact_name>
      latest.txt
```

- `latest.txt` stores the committed `run_id` and is updated atomically.
- `manifest.json` is the source of truth for run metadata, table metadata, file hashes, and artifact hashes.
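The atomic `latest.txt` update can be sketched with the usual write-temp-then-rename pattern. This is an illustration only, not the gatekeeper's actual implementation; the helper name is hypothetical:

```python
import os
import tempfile
from pathlib import Path


def commit_latest(pipeline_dir: Path, run_id: str) -> None:
    """Hypothetical sketch: atomically point latest.txt at a committed run_id.

    Write to a temp file in the same directory, then os.replace() it into
    place so readers never observe a partially written file.
    """
    fd, tmp_path = tempfile.mkstemp(dir=pipeline_dir, prefix="latest.")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(run_id + "\n")
        # os.replace is atomic on both POSIX and Windows.
        os.replace(tmp_path, pipeline_dir / "latest.txt")
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Writing to a sibling temp file (rather than to `latest.txt` directly) is what keeps concurrent readers from ever seeing a truncated `run_id`.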

## Usage

```python
from pathlib import Path

from cf_datahive import (
    DataHiveClient,
    cf_datahive_cpp_consumer_cmake_path,
    cf_datahive_cpp_import_library_path,
    cf_datahive_cpp_include_path,
)

workspace_root = Path("workspace")
client = DataHiveClient(str(workspace_root))

runs = client.list_runs("opcua_fifo_avg")
if runs:
    latest = runs[0].run_id
    manifest = client.load_manifest("opcua_fifo_avg", latest)
    table = client.read_table("opcua_fifo_avg", latest, "measurements")
    print(manifest.status, table.num_rows)
    print(cf_datahive_cpp_include_path())
    print(cf_datahive_cpp_import_library_path())
    print(cf_datahive_cpp_consumer_cmake_path())
```

Native owner API:

- `cf_datahive_cpp_include_path()` returns the packaged include root for the native gatekeeper.
- `cf_datahive_cpp_library_path()` returns the packaged runtime library path.
- `cf_datahive_cpp_import_library_path()` returns the packaged link artifact path that first-party native consumers link against.
- `cf_datahive_cpp_runtime_dir()` returns the packaged runtime directory to stage alongside native consumers.
- `cf_datahive_cpp_consumer_cmake_path()` returns the owner-provided CMake helper for downstream native consumers that need target import plus runtime staging without re-encoding backend policy.

## Native consumer ownership

`cf_datahive` owns the backend-specific native build, packaging, and runtime policy for `cf_datahive_cpp`.
First-party native consumers should link against that packaged owner surface instead of embedding `cf_datahive_cpp` sources or carrying their own DuckDB rules.

Typical consumer pattern:

```cmake
execute_process(
  COMMAND ${Python3_EXECUTABLE} -c "import cf_datahive as d; print(d.cf_datahive_cpp_include_path())"
  OUTPUT_VARIABLE CF_DATAHIVE_CPP_INCLUDE_DIR
  OUTPUT_STRIP_TRAILING_WHITESPACE
)

execute_process(
  COMMAND ${Python3_EXECUTABLE} -c "import cf_datahive as d; print(d.cf_datahive_cpp_library_path())"
  OUTPUT_VARIABLE CF_DATAHIVE_CPP_LIBRARY_PATH
  OUTPUT_STRIP_TRAILING_WHITESPACE
)

execute_process(
  COMMAND ${Python3_EXECUTABLE} -c "import cf_datahive as d; print(d.cf_datahive_cpp_import_library_path())"
  OUTPUT_VARIABLE CF_DATAHIVE_CPP_IMPORT_LIBRARY_PATH
  OUTPUT_STRIP_TRAILING_WHITESPACE
)

include("${Python3_SITEARCH}/cf_datahive/native/cmake/cf_datahive_consumer.cmake")

cf_datahive_import_cpp_target(
  TARGET cf_datahive_cpp
  INCLUDE_DIR "${CF_DATAHIVE_CPP_INCLUDE_DIR}"
  LIBRARY_PATH "${CF_DATAHIVE_CPP_LIBRARY_PATH}"
  IMPORT_LIBRARY_PATH "${CF_DATAHIVE_CPP_IMPORT_LIBRARY_PATH}"
)

cf_datahive_stage_consumer_runtime(
  TARGET my_step_plugin
  RUNTIME_DIR "${Python3_SITEARCH}/cf_datahive/native/bin"
  DESTINATIONS "${CMAKE_CURRENT_SOURCE_DIR}/../bin" "${SKBUILD_PLATLIB_DIR}/my_step_package/bin"
)
```

DuckDB configuration remains owner-controlled under `cf_datahive` and moves out of consumer workflows:

- default mode is `static`
- shared mode can be selected with `CF_DATAHIVE_CPP_DUCKDB_LINKAGE=shared`
- owner-supported override vars are `CF_DATAHIVE_CPP_DUCKDB_INCLUDE`, `CF_DATAHIVE_CPP_DUCKDB_LIB`, `CF_DATAHIVE_CPP_DUCKDB_SOURCE`, and on Windows `CF_DATAHIVE_CPP_DUCKDB_DLL`
- the `cf_datahive` build/publish workflow is responsible for staging those owner dependencies before packaging the native consumer surface
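For example, an owner-side build that selects shared DuckDB linkage might set the override variables like this (illustrative PowerShell only; all paths are placeholders, not real defaults):

```powershell
# Illustrative only: select shared linkage and point the owner build at a
# pre-staged DuckDB. Paths below are placeholders.
$env:CF_DATAHIVE_CPP_DUCKDB_LINKAGE = "shared"
$env:CF_DATAHIVE_CPP_DUCKDB_INCLUDE = "C:\deps\duckdb\include"
$env:CF_DATAHIVE_CPP_DUCKDB_LIB = "C:\deps\duckdb\lib\duckdb.lib"
$env:CF_DATAHIVE_CPP_DUCKDB_DLL = "C:\deps\duckdb\bin\duckdb.dll"

pip install -e sandcastle/cf_datahive
```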

## Manifest details

Each run stores a `RunManifest` (`schema_version="1.0"`) with:

- run lifecycle fields (`status`: `staged|committed|aborted`)
- table entries (`parquet`, schema fingerprint, row/file counts, optional file hashes)
- artifact entries (sha256, media type, size)
- optional `semantic_refs` placeholder map for future ontology links
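A committed manifest might therefore look roughly like the following. Field names beyond those listed above (`schema_version`, `status`, `semantic_refs`, the sha256/media type/size artifact fields) are hypothetical illustrations, not the actual schema:

```json
{
  "schema_version": "1.0",
  "status": "committed",
  "tables": {
    "measurements": {
      "format": "parquet",
      "schema_fingerprint": "…",
      "row_count": 1024,
      "file_count": 2
    }
  },
  "artifacts": {
    "report.html": {
      "sha256": "…",
      "media_type": "text/html",
      "size": 4096
    }
  },
  "semantic_refs": {}
}
```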

The schema fingerprint is the sha256 of the Arrow schema's serialized bytes.

## Guardrails

Run the repository guardrail check:

```bash
python tools/check_datahive_guardrails.py
```

The script scans C++/header files and step packages and hard-fails when they:

- use canonical `workspace/data_hive` literals outside the native gatekeeper location
- violate the thin-steps rule in `sandcastle/cf_basic_steps/*/src/*/cpp`
- reintroduce backend-specific ownership in `cf_basic_sinks` package surfaces

## Testing

Install test dependencies and run:

```bash
pip install -e "sandcastle/cf_datahive[test]"
pytest -q sandcastle/cf_datahive/tests
```

Published distribution name:

```bash
pip install cf-datahive
```

## Publishing

`cf_datahive` is published with the dedicated Windows workflow and now owns the packaged native consumer boundary that `cf-pipeline-engine` links against:

- Workflow: `.github/workflows/cf_datahive_windows_publish.yml`
- Package directory: `sandcastle/cf_datahive`
- PyPI tag: `cf-datahive-v<version>`
- TestPyPI tag: `cf-datahive-v<version>-test`

Local preflight:

```powershell
powershell -ExecutionPolicy Bypass -File scripts/mimic_windows_python_publish_workflow.ps1 `
  -WorkflowFile .github/workflows/cf_datahive_windows_publish.yml `
  -PackageDir sandcastle/cf_datahive `
  -PythonExe py `
  -PythonVersion 3.13
```

Queue a dry-run dispatch:

```powershell
powershell -ExecutionPolicy Bypass -File scripts/queue_windows_python_publish_workflow.ps1 `
  -WorkflowFile .github/workflows/cf_datahive_windows_publish.yml `
  -PackageDir sandcastle/cf_datahive `
  -PublishTarget testpypi `
  -Ref main `
  -RequireLocalPass `
  -DryRun
```

## Do / Don't

- Do: use `DataHiveClient` read APIs (`list_runs`, `load_manifest`, `read_table`, `open_artifact`) for inspection and validation.
- Do: route pipeline write ownership through `cf_datahive_cpp` in the sink path.
- Don't: write parquet files or artifacts directly into the canonical data hive root from pipeline steps.
- Don't: bypass manifest updates.
