Metadata-Version: 2.4
Name: worai
Version: 6.11.1
Summary: AI-powered CLI for WordLift knowledge graph and SEO workflows.
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: copier<10.0.0,>=9.7.1
Requires-Dist: jinja2>=3.1.0
Requires-Dist: morph-kgc>=2.7.0
Requires-Dist: playwright>=1.48.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyshacl>=0.26.0
Requires-Dist: typer>=0.12.5
Requires-Dist: wordlift-sdk<7.0.0,>=6.9.0
Provides-Extra: dev
Requires-Dist: pytest>=8.3.4; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"

# worai

Command-line toolkit for WordLift operations and SEO checks.
Pronunciation: "waw-RYE"

Docs: https://docs.wordlift.io/worai/

## Install

- `pipx install worai`
- `pip install worai`

Runtime dependency note:
- `wordlift-sdk>=6.9.0,<7.0.0` (installed automatically by pip)
- `copier` (required by `worai graph sync create`, installed automatically by pip)

If you plan to run `seocheck`, install Playwright browsers:
- `playwright install chromium`

## Quick Start

- `worai --help`
- `worai seocheck https://example.com/sitemap.xml`
- `worai google-search-console --site sc-domain:example.com --client-secrets ./client_secrets.json`
- `worai canonicals dedupe --input pages_with_titles.csv --site sc-domain:example.com --service-account ./service-account.json`
- `worai <command> --help`

## Configuration

Config file (TOML) discovery order:
- `--config`
- `WORAI_CONFIG`
- nearest `worai.toml` from current directory upward (for example `./worai.toml`, `../worai.toml`, `../../worai.toml`)
- `~/.config/worai/config.toml`
- `~/.worai.toml`
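
The discovery order above can be read as a first-match search. The following is a minimal sketch of that idea, not worai's actual loader: `discover_config` and its parameters are hypothetical names, and it assumes that an explicit `--config` or `WORAI_CONFIG` value is returned without checking that the file exists.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional


def discover_config(
    cli_config: Optional[str],
    env: Mapping[str, str],
    cwd: Path,
    home: Path,
    exists: Callable[[Path], bool] = Path.exists,
) -> Optional[Path]:
    """Return the first config path in the documented order, or None."""
    # 1. An explicit --config value wins (assumption: no existence check here).
    if cli_config:
        return Path(cli_config)
    # 2. The WORAI_CONFIG environment variable.
    if env.get("WORAI_CONFIG"):
        return Path(env["WORAI_CONFIG"])
    # 3. Nearest worai.toml, walking from cwd up to the filesystem root.
    for directory in (cwd, *cwd.parents):
        candidate = directory / "worai.toml"
        if exists(candidate):
            return candidate
    # 4 and 5. Per-user fallbacks, in the listed order.
    for candidate in (
        home / ".config" / "worai" / "config.toml",
        home / ".worai.toml",
    ):
        if exists(candidate):
            return candidate
    return None
```

The `exists` parameter is only there so the lookup can be exercised without touching the filesystem; in normal use, `discover_config(None, os.environ, Path.cwd(), Path.home())` would be the equivalent call.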

Profiles:
- `[profiles.<name>]` with `--profile` or `WORAI_PROFILE`
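
As a rough illustration of profile selection, the sketch below picks a profile name and looks up its table in an already-parsed config dictionary. The `select_profile` helper is hypothetical; the fallback to `default` mirrors the order documented for `graph export` and `graph sync run`, and raising `KeyError` for a missing profile is an assumption.

```python
from typing import Any, Mapping, Optional, Tuple


def select_profile(
    config: Mapping[str, Any],
    cli_profile: Optional[str],
    env: Mapping[str, str],
) -> Tuple[str, Mapping[str, Any]]:
    """Pick a profile name, then return it with its [profiles.<name>] table."""
    # --profile first, then WORAI_PROFILE, then the "default" profile.
    name = cli_profile or env.get("WORAI_PROFILE") or "default"
    profiles = config.get("profiles", {})
    if name not in profiles:
        # Assumption: a missing profile is treated as an error.
        raise KeyError(f"profile {name!r} not found under [profiles]")
    return name, profiles[name]
```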

Common keys:
- `profiles.<name>.api_key`
- `log_level` (global default logging level: `debug|info|warning|error`)
- `profiles.<name>.log_level` (profile-specific override for root logging level)
- `profiles._base.log_level` (shared profile fallback when selected profile has no `log_level`)
- `profiles.<name>.mapping` (SDK profile contract)
- `profiles.<name>.gsc_site_id` (GSC property for commands that query Search Console)
- `profiles.<name>.oauth.client_secrets` (OAuth Desktop app client file)
- `profiles.<name>.oauth.token` (shared OAuth token file path)
- `profiles.<name>.oauth.service_account` (service account credential as file path or inline JSON)
- `profiles.<name>.ga_property_id` (preferred GA4 property key for analytics; `ga.id` remains supported)
- `profiles.<name>.canonicals.output` (supports `{profile}`, `{date}`, `{seq}` interpolation)
- `profiles.<name>.canonicals.interval`
- `profiles.<name>.canonicals.concurrency`
- `profiles.<name>.canonicals.request_timeout_sec`
- one source per profile (`urls`, `sitemap_url`, or `sheets_url` + `sheets_name` + `sheets_service_account`) for SDK profile validity
- `postprocessor_runtime` (graph sync runtime: `oneshot` or `persistent`; profile override supported)
- `ingest.source` (`auto|urls|sitemap|sheets|local`)
- `ingest.loader` (`auto|simple|proxy|playwright|premium_scraper|web_scrape_api|passthrough`)
- `ingest.passthrough_when_html` (default: `true`)
- `ingest.timeout_ms` (default: `30000`)
- `ingest.playwright_wait_until` (default: `domcontentloaded`)
- command-specific OAuth/GSC/GA options should be passed via CLI flags or environment variables.
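
To make the `{profile}`, `{date}`, `{seq}` placeholders in `canonicals.output` concrete, here is a small sketch using Python's `str.format`. The `render_output_path` helper is hypothetical, and the `yyyyMMdd` date format is an assumption borrowed from the `graph export` default filename; the real interpolation rules may differ.

```python
from datetime import date


def render_output_path(template: str, profile: str, today: date, seq: int) -> str:
    """Fill the three documented placeholders in an output path template."""
    # Assumption: {date} renders as yyyyMMdd and {seq} as a plain integer.
    return template.format(
        profile=profile,
        date=today.strftime("%Y%m%d"),
        seq=seq,
    )
```

For example, `render_output_path("out/{profile}_{date}_{seq}.csv", "acme", date(2025, 1, 31), 1)` returns `out/acme_20250131_1.csv`.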

Supported environment variables:
- `WORAI_CONFIG` — path to a config TOML file (overrides discovery order).
- `WORAI_PROFILE` — profile name under `[profiles.<name>]`.
- `WORAI_LOG_LEVEL` — default log level (`debug|info|warning|error`).
- `WORAI_LOG_FORMAT` — default log format (`text|json`).
- `WORDLIFT_API_KEY` — WordLift API key for entity operations.
- `GSC_CLIENT_SECRETS` — path to OAuth client secrets JSON for GSC.
- `GSC_ID` — GSC property URL.
- `OAUTH_TOKEN` — path to store the shared OAuth token (GSC + GA).
- `GSC_OUTPUT` — default output CSV path for GSC export.
- `GA_ID` — GA4 property ID for Analytics sections.
- `GSC_TOKEN` / `GA_TOKEN` — legacy aliases for `OAUTH_TOKEN` (must point to the same file if used).
- `WORAI_DISABLE_UPDATE_CHECK` — set to `1|true|yes|on` to disable startup update checks.
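
Two of the rules above are easy to get wrong in wrapper scripts: the truthy values for `WORAI_DISABLE_UPDATE_CHECK`, and the requirement that the legacy token aliases agree with `OAUTH_TOKEN`. The sketch below shows one way to check both. The helper names are hypothetical, and case-insensitive matching and comparing paths as plain strings are assumptions.

```python
from typing import Mapping, Optional

_TRUTHY = frozenset({"1", "true", "yes", "on"})


def update_check_disabled(env: Mapping[str, str]) -> bool:
    """True when WORAI_DISABLE_UPDATE_CHECK holds one of the documented values."""
    # Assumption: surrounding whitespace and letter case are ignored.
    return env.get("WORAI_DISABLE_UPDATE_CHECK", "").strip().lower() in _TRUTHY


def oauth_token_path(env: Mapping[str, str]) -> Optional[str]:
    """Return the shared token path, rejecting conflicting legacy aliases."""
    names = ("OAUTH_TOKEN", "GSC_TOKEN", "GA_TOKEN")
    values = {env[name] for name in names if env.get(name)}
    if len(values) > 1:
        raise ValueError("OAUTH_TOKEN, GSC_TOKEN and GA_TOKEN must point to the same file")
    return next(iter(values), None)
```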

`.env` support:
- `worai` loads `.env` at startup, looking in the current working directory and then in its parent directories.
- values from `.env` are treated as environment variables.
- existing environment variables take precedence over `.env` values.
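
The precedence rule is the part worth illustrating: a value in `.env` only fills a gap and never replaces a variable that is already set. The following standard-library sketch shows that behavior with a deliberately simplified parser. `parse_dotenv` and `apply_dotenv` are hypothetical helpers, not worai's loader, and real `.env` files support more syntax than this handles.

```python
from typing import Dict, Iterable, Mapping, MutableMapping


def parse_dotenv(lines: Iterable[str]) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop one layer of surrounding quotes, if present.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_dotenv(env: MutableMapping[str, str], dotenv_values: Mapping[str, str]) -> None:
    """Copy .env values into env without overriding existing variables."""
    for key, value in dotenv_values.items():
        env.setdefault(key, value)
```

With `env = {"WORAI_PROFILE": "prod"}` and a `.env` line `WORAI_PROFILE=dev`, the profile stays `prod` after `apply_dotenv`.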

Logging level precedence:
- `--log-level` (highest)
- `WORAI_LOG_LEVEL`
- `profiles.<name>.log_level` in `worai.toml` (when a profile is selected)
- `profiles._base.log_level` in `worai.toml` (when a profile is selected and no profile-specific value is set)
- global `log_level` in `worai.toml`
- `info` (default)
- The selected level is enforced on both the root logger and its active handlers, so `INFO` logs from dependencies are suppressed when using `warning` or `error`.
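
The precedence chain above can be sketched as a first-non-empty lookup over an already-parsed config dictionary, followed by applying the result to the root logger and its handlers. This is an illustrative sketch only: `resolve_log_level` and `apply_log_level` are hypothetical names, not worai functions.

```python
import logging
from typing import Any, Mapping, Optional

_LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}


def resolve_log_level(
    cli_level: Optional[str],
    env: Mapping[str, str],
    config: Mapping[str, Any],
    profile: Optional[str],
) -> str:
    """Return the first log level set, following the documented order."""
    profiles = config.get("profiles", {})
    candidates = [cli_level, env.get("WORAI_LOG_LEVEL")]
    if profile is not None:
        # Profile-specific value first, then the shared _base fallback.
        candidates.append(profiles.get(profile, {}).get("log_level"))
        candidates.append(profiles.get("_base", {}).get("log_level"))
    candidates.append(config.get("log_level"))
    for value in candidates:
        if value:
            return value.lower()
    return "info"


def apply_log_level(level_name: str) -> None:
    """Set the level on the root logger and on each of its handlers."""
    level = _LEVELS[level_name]
    root = logging.getLogger()
    root.setLevel(level)
    for handler in root.handlers:
        handler.setLevel(level)
```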

Example environment setup:
```
export WORDLIFT_API_KEY="wl_..."
export WORAI_CONFIG="$HOME/worai.toml"
export WORAI_PROFILE="dev"
export GSC_CLIENT_SECRETS="$HOME/client_secrets.json"
export OAUTH_TOKEN="$HOME/oauth_token.json"
```

Example `worai.toml`:
```toml
[profiles.default]
api_key = "${WORDLIFT_API_KEY}"
mapping = "default.yarrrml"
sitemap_url = "https://example.com/sitemap.xml"
ingest_loader = "web_scrape_api"
```

Ingestion profile examples:
```toml
[profiles.inventory_local]
api_key = "${WORDLIFT_API_KEY}"
mapping = "default.yarrrml"
urls = ["https://example.com/page"]
ingest_source = "local"
ingest_loader = "passthrough"

[profiles.inventory_remote]
api_key = "${WORDLIFT_API_KEY}"
mapping = "default.yarrrml"
sitemap_url = "https://example.com/sitemap.xml"
ingest_source = "sitemap"
ingest_loader = "web_scrape_api"

[profiles.graph_sync_proxy]
api_key = "${WORDLIFT_API_KEY}"
mapping = "default.yarrrml"
urls = ["https://example.com/a", "https://example.com/b"]
ingest_source = "urls"
ingest_loader = "proxy"
ingest_timeout_ms = 30000
playwright_wait_until = "domcontentloaded"
```
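
Each profile above declares exactly one source (`urls`, `sitemap_url`, or `sheets_url` with `sheets_name`). A pre-flight check for that rule might look like the sketch below. The `source_mode` helper is hypothetical and operates on a profile table that has already been parsed into a dictionary; the SDK's own validation may be stricter (for example, it may also require `sheets_service_account`).

```python
from typing import Any, List, Mapping


def source_mode(profile: Mapping[str, Any]) -> str:
    """Return the single configured source key, or raise ValueError."""
    modes: List[str] = []
    if profile.get("urls"):
        modes.append("urls")
    if profile.get("sitemap_url"):
        modes.append("sitemap_url")
    if profile.get("sheets_url"):
        # A sheet source needs the sheet name as well.
        if not profile.get("sheets_name"):
            raise ValueError("sheets_url requires sheets_name")
        modes.append("sheets_url")
    if len(modes) != 1:
        found = ", ".join(modes) or "none"
        raise ValueError(f"expected exactly one source, found: {found}")
    return modes[0]
```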

## Commands

Full docs: https://docs.wordlift.io/worai/

- `seocheck` — run SEO checks for sitemap URLs and URL lists.
- `google-search-console` — export GSC page metrics as CSV.
- `canonicals dedupe` — dedupe canonical URLs by title using GSC impressions.
- `dedupe` — deduplicate WordLift entities by schema:url.
- `canonicalize-duplicate-pages` — select canonical URLs using GSC KPIs.
- `delete-entities-from-csv` — delete entities listed in a CSV.
- `find-faq-page-wrong-type` — find and patch FAQPage typing issues.
- `find-missing-names` — find entities missing schema:name/headline.
- `find-url-by-type` — list schema:url values by type from RDF.
- `graph` — run graph-specific workflows.
- `link-groups` — build or apply LinkGroup data from CSV.
- `patch` — patch entities from RDF.
- `structured-data` — generate JSON-LD/YARRRML mappings or materialize RDF from YARRRML.
- `agent` — launch codex/claude/gemini with worai MCP + skill guidance.
- `web-pages` — run ingestion-backed web page workflows.
- `validate` — deprecated JSON-LD validator command (use `graph validate` for RDF files/URLs; use `structured-data validate page` for webpage URLs).
- `self update` — check for new worai versions and optionally run the upgrade command.
- `upload-entities-from-turtle` — upload .ttl files with resume.
- `dil-import` - upload DILs from a CSV file.

Command help:
- `worai <command> --help`

Autocompletion:
- `worai --install-completion`
- `worai --show-completion`

Updates:
- `worai` checks for new versions periodically and prints a non-blocking notice when an update is available.
- run `worai self update` to check manually and see/apply the suggested upgrade command.

## Examples

seocheck
- `worai seocheck https://example.com/sitemap.xml`
- `worai seocheck https://example.com/sitemap.xml --output-dir ./seocheck-report --save-html`
- `worai seocheck https://example.com/sitemap.xml --output-dir ./seocheck-report --no-open-report`
- `worai seocheck https://example.com/sitemap.xml --user-agent "Mozilla/5.0 ..."`
- `worai seocheck https://example.com/sitemap.xml --sitemap-fetch-mode browser`
- `worai seocheck https://example.com/sitemap.xml --no-report-ui`
- `worai seocheck https://example.com/sitemap.xml --recheck-failed --recheck-from ./seocheck-report`

google-search-console
- `worai google-search-console --site sc-domain:example.com --client-secrets ./client_secrets.json`
  - Uses OAuth redirect port 8080 by default.

canonicals dedupe
- `worai canonicals dedupe --input pages_with_titles.csv --site sc-domain:example.com --service-account ./service-account.json`
- `worai canonicals dedupe --input pages_with_titles.csv --site sc-domain:example.com --token oauth_token.json`

seoreport (with Analytics)
- `worai seoreport --site sc-domain:example.com --ga-id 123456789 --format html`

canonicalize-duplicate-pages
- `worai canonicalize-duplicate-pages --input gsc_pages.csv --output canonical_targets.csv --kpi-window 28d --kpi-metric clicks`
- `worai canonicalize-duplicate-pages --input gsc_pages.csv --entity-type Product`

dedupe
- `worai dedupe --dry-run`

find-faq-page-wrong-type
- `worai find-faq-page-wrong-type ./data.ttl --dry-run --replace-type`
- `worai find-faq-page-wrong-type ./data.ttl --patch --replace-type`

find-missing-names
- `worai find-missing-names ./data.ttl`

find-url-by-type
- `worai find-url-by-type ./data.ttl schema:Service schema:Product`

link-groups
- `worai link-groups ./links.csv --format turtle`
- `worai link-groups ./links.csv --apply --dry-run --concurrency 4`

graph
- `worai --config ./worai.toml --profile acme graph sync run`
- `worai --profile acme graph sync run --debug`
- `worai graph sync create ./acme-graph`
- `worai graph sync create ./acme-graph --template ./graph-sync-template --defaults`
- `worai graph sync create ./acme-graph --data-file ./answers.yml --non-interactive`
- `worai graph sync create ./acme-graph --vcs-ref v1.2.3`
- `worai graph export`
- `worai --profile acme graph export`
- `worai --profile acme graph export ./acme-export.jsonld`
- `worai graph export ./acme-export.ttl --validate`
- `worai graph validate ./graph.ttl ./graph.jsonld --builtin-shape google-required --level warning --format text`
- `worai graph property delete seovoc:html --dry-run`
- `worai graph property delete https://w3id.org/seovoc/html --yes --workers 4`
  - `graph export` reads the API key from the selected `worai.toml` profile (root `--profile`, then `WORAI_PROFILE`, then `default`) and calls `/dataset/export`.
  - `graph export` output format is inferred from extension: `.ttl`, `.nt`, `.nq`, `.rdf`/`.xml`, `.jsonld`/`.json`.
  - `graph export` default filename: `export_<profile>_<yyyyMMdd>_<seq>.ttl` (sequence starts at `1`).
  - `graph export --validate` runs SHACL validation on the exported file and fails on SHACL errors/warnings.
  - `graph validate` accepts one or more local files or URLs and supports shape composition with:
    - `--builtin-shape <name>`
    - `--exclude-builtin-shape <name>`
    - `--shape <file-or-url>`
  - `graph validate --level warning|error` controls failure threshold; `--format text|json` controls output.
  - `graph property delete` sends `X-include-Private: true` by default for both GraphQL match discovery and entity PATCH requests.
  - `graph sync create` runs Copier in trusted mode by default so template `_tasks` execute.
  - `graph sync run` profile resolution is: root `--profile`, then `WORAI_PROFILE`, then `default`.
  - Mapping docs (for `[profiles.<name>]`): `docs/graph-sync-mappings-reference.md`, `docs/graph-sync-mappings-guide.md`, `docs/graph-sync-mappings-examples.md`
  - Internal template-agent workflow docs: `specs/graph-sync/AGENTS.md`, `specs/graph-sync/INDEX.md`, `specs/graph-sync/developer-agent-workflow.md`
  - Profile loading standard for non-sync commands: `specs/profile-loading-standard.md`
  - Configure exactly one source mode per run: `urls`, `sitemap_url` (+ optional pattern), or `sheets_url` + `sheets_name`.
  - Playwright-backed ingestion defaults to `ingest_timeout_ms = 30000` and `ingest.playwright_wait_until = "domcontentloaded"`.
  - `web_page_import_timeout` remains supported for `graph sync` as a legacy seconds-based alias.
  - SDK 6 defaults to the persistent postprocessor runtime.
  - Set `postprocessor_runtime = "oneshot"` in `worai.toml` to keep the old one-process-per-callback behavior.
  - SDK `wordlift-sdk` 5.1.1+ postprocessor context migration:
    - `context.settings` -> `context.profile` (for example `context.profile["settings"]["api_url"]`)
    - `context.account.key` -> `context.account_key`
    - `context.account` remains the clean `/me` account object
  - SDK 6 ingestion uses explicit keys:
    - `INGEST_SOURCE` (`urls|sitemap|sheets|local|auto`)
    - `INGEST_LOADER` (`web_scrape_api|proxy|premium_scraper|playwright|simple|passthrough|auto`)
    - `INGEST_TIMEOUT_MS` (milliseconds)
    - `PLAYWRIGHT_WAIT_UNTIL` (`domcontentloaded|load|networkidle`)
  - SDK 6 migration deprecates integration use of `WEB_PAGE_IMPORT_MODE` and `WEB_PAGE_IMPORT_TIMEOUT`.
  - `graph sync run` uses `run_cloud_workflow` and emits per-graph progress and final KPI summaries through CLI logs (`on_info`, `on_progress`, `on_kpi`).
  - `graph sync run --debug` writes SDK callback artifacts under `output/debug_cloud/<profile>/` from the current working directory:
    - `static_templates.ttl`
    - `cloud_<sha256(url)>.ttl` for each callback URL.
  - SHACL validation settings mapping for SDK 6.2+:
    - use `shacl_validate_mode = "warn"|"fail"|"off"`
    - use `shacl_builtin_shapes`, `shacl_exclude_builtin_shapes`, `shacl_extra_shapes`
    - `shacl_validate_sync` and `shacl_shape_specs` are no longer supported
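
Three of the naming rules in the notes above (format inference from the extension, the default export filename, and the `--debug` artifact name) can be sketched as follows. All three helpers are hypothetical. The format labels are illustrative, and the sketch assumes that `<seq>` is unpadded and increments past names that already exist, and that `sha256(url)` means the hex digest of the UTF-8 encoded URL.

```python
import hashlib
from datetime import date
from pathlib import Path
from typing import Iterable

# Extension to format label; the labels themselves are illustrative.
_FORMATS = {
    ".ttl": "turtle",
    ".nt": "nt",
    ".nq": "nquads",
    ".rdf": "xml",
    ".xml": "xml",
    ".jsonld": "json-ld",
    ".json": "json-ld",
}


def infer_export_format(filename: str) -> str:
    """Map an output filename to a format label via its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in _FORMATS:
        raise ValueError(f"unsupported export extension: {suffix!r}")
    return _FORMATS[suffix]


def default_export_name(profile: str, today: date, existing: Iterable[str] = ()) -> str:
    """Build export_<profile>_<yyyyMMdd>_<seq>.ttl, with seq starting at 1."""
    taken = set(existing)
    seq = 1
    while True:
        name = f"export_{profile}_{today:%Y%m%d}_{seq}.ttl"
        if name not in taken:
            return name
        seq += 1  # Assumption: bump seq until the name is free.


def debug_artifact_name(url: str) -> str:
    """Build the cloud_<sha256(url)>.ttl name for a callback URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"cloud_{digest}.ttl"
```

For example, `default_export_name("acme", date(2025, 1, 31))` returns `export_acme_20250131_1.ttl`.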

patch
- `worai patch ./data.ttl --dry-run --add-types`

structured-data
- `worai structured-data create https://example.com/article Review --output-dir ./structured-data`
- `worai structured-data create https://example.com/article --type Review --output-dir ./structured-data`
- `worai structured-data create https://example.com/article --type Review --debug`
- `worai structured-data create https://example.com/article --type Review --max-xhtml-chars 40000 --max-nesting-depth 2`
- `worai structured-data generate https://example.com/sitemap.xml --yarrrml ./mapping.yarrrml --output-dir ./out`
- `worai structured-data generate https://example.com/page --yarrrml ./mapping.yarrrml --format jsonld`
- `worai structured-data inventory https://example.com/sitemap.xml --output ./structured-data-inventory.csv`
- `worai structured-data inventory ./urls.txt --output ./structured-data-inventory.csv`
- `worai structured-data inventory https://docs.google.com/spreadsheets/d/<id>/edit --sheet-name URLs_US --output ./structured-data-inventory.csv`
- `worai structured-data inventory https://example.com/sitemap.xml --destination-sheet-id <spreadsheet_id> --destination-sheet-name Inventory`
- `worai structured-data inventory https://example.com/sitemap.xml --output ./structured-data-inventory.csv --concurrency auto`
- `worai structured-data inventory https://example.com/sitemap.xml --url-regex "/blog/" --output ./structured-data-inventory.csv`
- `worai structured-data inventory /path/to/debug_cloud/us --source-type debug-cloud --output ./structured-data-inventory.csv`
- `worai structured-data inventory /path/to/debug_cloud/us --ingest-source local --ingest-loader passthrough --output ./structured-data-inventory.csv`
- `worai structured-data inventory https://example.com/sitemap.xml --ingest-loader web_scrape_api --output ./structured-data-inventory.csv`

agent
- `worai agent --agent-cli codex`
- `worai agent --agent-cli codex -- --yolo --search`
- `worai agent --agent-cli claude --profile acme`
- `worai agent --agent-cli gemini --config ./worai.toml --profile acme`
- `worai agent mcp serve --profile acme`

web-pages
- `worai web-pages classify-types https://example.com/sitemap.xml --ingest-source sitemap --ingest-loader playwright --url-regex "/blog/" --output ./types.csv`
- `worai web-pages classify-types ./urls.txt --ingest-source urls --output ./types.csv`
- `worai web-pages classify-types https://docs.google.com/spreadsheets/d/<id>/edit --ingest-source sheets --sheet-name URLs --service-account ./service-account.json --output ./types.csv`
- `worai web-pages classify-types https://example.com/sitemap.xml --ingest-source sitemap --output ./types.csv --yes` (skip credit-consumption confirmation)

validate
- `worai graph validate ./data.jsonld --builtin-shape review-snippet --shape ./custom.ttl --level warning --format json`
- `worai validate jsonld --shape review-snippet --shape schema-review ./data.jsonld`
- `worai validate jsonld --format raw https://api.wordlift.io/data/example.jsonld`
- `worai structured-data validate page https://example.com/article --shape review-snippet`

self update
- `worai self update --check-only`
- `worai self update --yes`

upload-entities-from-turtle
- `worai upload-entities-from-turtle ./entities --recursive --limit 50`

dil-import
- `worai dil-import <wordlift_key> <path_to_csv_file>`

## Troubleshooting

- Playwright missing browsers:
  - `playwright install chromium`
- YARRRML conversion:
  - `npm install -g @rmlio/yarrrml-parser`
- RML execution:
  - `morph-kgc` is included in project dependencies
- Dependency notes:
  - Common runtime libs (e.g., `requests`, `rdflib`, `tqdm`, `advertools`, Google auth helpers) are provided transitively by `wordlift-sdk`.
- OAuth token issues:
  - Remove the token file and re-run `worai google-search-console` or `worai canonicals dedupe`.
  - If you are prompted to re-auth every run, delete the token file to force a new consent flow that includes a refresh token.
