Metadata-Version: 2.4
Name: theodolite-scanner
Version: 2.0.0
Summary: Scan cloud storage for sensitive data (PII, PHI, financial, credentials)
Project-URL: Homepage, https://theodolite.io
Project-URL: Documentation, https://theodolite.io
Project-URL: Repository, https://github.com/ndcarlson/sentinel
Author-email: Theodolite Security <support@theodolite.io>
License-Expression: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Information Technology
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Security
Requires-Python: >=3.9
Requires-Dist: click>=8.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: all
Requires-Dist: azure-identity>=1.0; extra == 'all'
Requires-Dist: azure-mgmt-resource<25.0,>=23.0; extra == 'all'
Requires-Dist: azure-mgmt-storage<25.0,>=21.0; extra == 'all'
Requires-Dist: azure-storage-blob>=12.0; extra == 'all'
Requires-Dist: boto3>=1.35.0; extra == 'all'
Requires-Dist: gliner>=0.2.0; extra == 'all'
Requires-Dist: google-cloud-storage>=2.0; extra == 'all'
Requires-Dist: openpyxl>=3.0; extra == 'all'
Requires-Dist: pypdf>=4.0; extra == 'all'
Requires-Dist: python-docx>=1.0; extra == 'all'
Requires-Dist: python-pptx>=0.6; extra == 'all'
Requires-Dist: spacy>=3.0; extra == 'all'
Requires-Dist: torch>=2.0; extra == 'all'
Requires-Dist: transformers>=4.35.0; extra == 'all'
Provides-Extra: aws
Requires-Dist: boto3>=1.35.0; extra == 'aws'
Provides-Extra: azure
Requires-Dist: azure-identity>=1.0; extra == 'azure'
Requires-Dist: azure-mgmt-resource<25.0,>=23.0; extra == 'azure'
Requires-Dist: azure-mgmt-storage<25.0,>=21.0; extra == 'azure'
Requires-Dist: azure-storage-blob>=12.0; extra == 'azure'
Requires-Dist: spacy>=3.0; extra == 'azure'
Provides-Extra: gcp
Requires-Dist: google-cloud-storage>=2.0; extra == 'gcp'
Requires-Dist: spacy>=3.0; extra == 'gcp'
Provides-Extra: gliner
Requires-Dist: gliner>=0.2.0; extra == 'gliner'
Requires-Dist: torch>=2.0; extra == 'gliner'
Requires-Dist: transformers>=4.35.0; extra == 'gliner'
Provides-Extra: ner
Requires-Dist: spacy>=3.0; extra == 'ner'
Provides-Extra: office
Requires-Dist: openpyxl>=3.0; extra == 'office'
Requires-Dist: python-docx>=1.0; extra == 'office'
Requires-Dist: python-pptx>=0.6; extra == 'office'
Provides-Extra: pdf
Requires-Dist: pypdf>=4.0; extra == 'pdf'
Description-Content-Type: text/markdown

# Theodolite Scanner

Standalone CLI scanner that runs **inside** a customer's cloud environment to detect sensitive data (PII, PHI, financial, credentials) in cloud storage. Files are downloaded and scanned within the customer's cloud, which avoids data-egress charges. Only a small findings JSON is sent back to Theodolite.

## Installation

```bash
# AWS
pip install "theodolite-scanner[aws]"

# Azure
pip install "theodolite-scanner[azure]"

# GCP
pip install "theodolite-scanner[gcp]"

# With PDF support
pip install "theodolite-scanner[azure,pdf]"

# With Office document support (docx, xlsx, pptx)
pip install "theodolite-scanner[azure,office]"

# Everything
pip install "theodolite-scanner[all]"
```

## Usage

```bash
# Scan an Azure Blob container
theodolite-scan --provider azure \
  --container mycontainer \
  --account-url https://myaccount.blob.core.windows.net \
  --output findings.json

# Quick scan (5% sample)
theodolite-scan --provider azure \
  --container mycontainer \
  --account-url https://myaccount.blob.core.windows.net \
  --scan-type quick --output findings.json

# Filter by category
theodolite-scan --provider azure \
  --container mycontainer \
  --account-url https://myaccount.blob.core.windows.net \
  --categories pii,phi --output findings.json

# Upload directly to Theodolite
theodolite-scan --provider azure \
  --container mycontainer \
  --account-url https://myaccount.blob.core.windows.net \
  --upload https://api.theodolite.io --token <api-key>
```
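
For scheduled or scripted scans it can be handy to assemble the command in Python. The following is a minimal sketch that uses only the flags shown above; the helper name `build_scan_command` and its defaults are hypothetical and not part of the package.

```python
import shlex
from typing import List, Optional, Sequence


def build_scan_command(
    container: str,
    account_url: str,
    output: Optional[str] = "findings.json",
    scan_type: Optional[str] = None,
    categories: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build an argv list for theodolite-scan (hypothetical helper)."""
    cmd = [
        "theodolite-scan",
        "--provider", "azure",
        "--container", container,
        "--account-url", account_url,
    ]
    if scan_type:
        cmd += ["--scan-type", scan_type]
    if categories:
        # The CLI examples pass categories as one comma-separated value.
        cmd += ["--categories", ",".join(categories)]
    if output:
        cmd += ["--output", output]
    return cmd


cmd = build_scan_command(
    "mycontainer",
    "https://myaccount.blob.core.windows.net",
    scan_type="quick",
    categories=["pii", "phi"],
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` runs the scan without going through a shell, so container names and URLs need no extra quoting.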

## Authentication

Azure scans authenticate with `DefaultAzureCredential`, which automatically picks up a managed identity, Azure CLI credentials, or environment variables.
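
When relying on environment variables, a quick preflight check can catch a misconfigured job before the scan starts. This sketch assumes a service principal with a client secret; the three variable names are the ones `azure-identity` documents for that case, while the helper name is hypothetical.

```python
import os
from typing import List, Mapping

# Variables azure-identity reads for a service principal with a client secret.
SERVICE_PRINCIPAL_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def missing_service_principal_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the service-principal variables that are unset or empty."""
    return [name for name in SERVICE_PRINCIPAL_VARS if not env.get(name)]


missing = missing_service_principal_vars()
if missing:
    print("Not set:", ", ".join(missing))
```

A non-empty result is not necessarily an error: a managed identity or an Azure CLI login may still provide credentials.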

## Output Format

```json
{
  "scanner_version": "2.0.0",
  "scan_timestamp": "2026-02-28T12:00:00Z",
  "provider": "azure",
  "source": "https://myaccount.blob.core.windows.net/mycontainer",
  "scan_type": "full",
  "total_assets_discovered": 5000,
  "assets_scanned": 5000,
  "total_findings": 342,
  "findings_by_category": {"pii": 120, "phi": 80, "financial": 92, "credentials": 50},
  "findings_by_type": {"ssn": 50, "email": 70, "credit_card": 42},
  "findings_by_severity": {"critical": 10, "high": 45, "medium": 65, "low": 222},
  "estimated_breach_cost": 18250,
  "findings": [
    {
      "data_type": "ssn",
      "category": "pii",
      "severity": "high",
      "confidence": 0.95,
      "location": "mycontainer/data/users.csv",
      "line_number": 42,
      "pattern_name": "ssn_us",
      "estimated_breach_cost": 166
    }
  ]
}
```
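
A findings file in this shape is straightforward to post-process. The sketch below pulls out the higher-severity findings and checks that the severity buckets add up to `total_findings`. Two assumptions: the four severity names are the only ones emitted, and the buckets always sum to the total (true of the sample above, not confirmed as a guarantee). The function name `summarize` is hypothetical.

```python
import json
from collections import Counter
from typing import Any, Dict

# Assumed ordering, most to least severe, based on the sample output.
SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def summarize(report: Dict[str, Any], min_severity: str = "high") -> Dict[str, Any]:
    """Summarize findings at or above min_severity."""
    wanted = SEVERITY_ORDER[: SEVERITY_ORDER.index(min_severity) + 1]
    urgent = [f for f in report.get("findings", []) if f.get("severity") in wanted]
    by_severity = report.get("findings_by_severity", {})
    return {
        "totals_consistent": sum(by_severity.values()) == report.get("total_findings"),
        "urgent_locations": sorted({f["location"] for f in urgent}),
        "urgent_by_type": dict(Counter(f["data_type"] for f in urgent)),
    }


# Trimmed copy of the sample output, inlined so the example runs on its own.
sample = json.loads("""
{
  "total_findings": 342,
  "findings_by_severity": {"critical": 10, "high": 45, "medium": 65, "low": 222},
  "findings": [
    {"data_type": "ssn", "category": "pii", "severity": "high",
     "location": "mycontainer/data/users.csv", "line_number": 42}
  ]
}
""")
print(summarize(sample))
```

For a real scan, replace the inline sample with `json.load(open("findings.json"))`.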

## Detection Patterns

- **PII**: SSN, Driver's License, Passport, Email, Phone, Address, Date of Birth (US, UK, EU, Canada)
- **PHI**: Medical Records, Medicare/Medicaid IDs, ICD-10/CPT codes, NPI, Lab Results
- **Financial**: Credit Cards (Visa/MC/Amex/Discover), Bank Accounts, IBAN, SWIFT/BIC, Crypto Addresses
- **Credentials**: API Keys (AWS/GCP/GitHub/Slack/Stripe/OpenAI), Private Keys, Passwords, JWTs
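
To give a sense of what pattern-based detection involves, here is an illustrative sketch of two checks: a regular expression for the US SSN shape and a Luhn checksum to filter card-number candidates. These are generic, well-known techniques and not the scanner's actual patterns; all names in the block are hypothetical.

```python
import re
from typing import List, Tuple

# US SSN shape AAA-GG-SSSS, excluding area/group/serial values that are never issued.
SSN_RE = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def find_candidates(text: str) -> List[Tuple[str, str]]:
    """Return (data_type, matched_text) pairs found in `text`."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    hits += [
        ("credit_card", m.group())
        for m in CARD_RE.finditer(text)
        if luhn_valid(m.group())
    ]
    return hits


line = "ssn=123-45-6789 card=4111 1111 1111 1111 bad=4111 1111 1111 1112"
print(find_candidates(line))
```

The checksum step is what keeps arbitrary 16-digit strings, such as order numbers, from being reported as cards; a regex alone would flag both numbers in the sample line.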

## Supported File Formats

- **Text**: txt, csv, json, xml, yaml, log, sql, source code files
- **PDF**: With `[pdf]` extra (pypdf)
- **Office**: docx, xlsx, pptx with `[office]` extra
- **RTF**: With striprtf. The `[office]` extra in this release's metadata does not list it, so install it separately (`pip install striprtf`)
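
Before scanning a bucket with mixed file types, it can help to check which optional parsers are installed. This sketch maps extensions to the extras listed above and probes for the corresponding import names (`pypdf`, `docx`, `openpyxl`, `pptx`). The table and the helper `missing_extra` are hypothetical, and how the scanner itself handles a missing parser is not covered here.

```python
import importlib.util
from pathlib import Path
from typing import Optional

# Extension -> (extra name, importable module), from the format list above.
OPTIONAL_FORMATS = {
    ".pdf": ("pdf", "pypdf"),
    ".docx": ("office", "docx"),      # python-docx imports as "docx"
    ".xlsx": ("office", "openpyxl"),
    ".pptx": ("office", "pptx"),      # python-pptx imports as "pptx"
}


def missing_extra(filename: str) -> Optional[str]:
    """Return the extra to install if the parser for this file is absent, else None."""
    entry = OPTIONAL_FORMATS.get(Path(filename).suffix.lower())
    if entry is None:
        return None  # plain-text formats need no extra
    extra, module = entry
    return None if importlib.util.find_spec(module) else extra


for name in ["users.csv", "report.pdf", "budget.xlsx"]:
    extra = missing_extra(name)
    if extra:
        print(f'{name}: pip install "theodolite-scanner[{extra}]"')
```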
