Metadata-Version: 2.4
Name: neon-vla
Version: 0.1.1
Summary: Open-source G1 humanoid VLA with video foundation model backbone
Project-URL: Homepage, https://github.com/cagataycali/neon
Project-URL: Repository, https://github.com/cagataycali/neon
Author: Cagatay Cali
License: MIT
License-File: LICENSE
Keywords: g1,humanoid,robotics,vision-language-action,vla
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: datasets>=3.0.0
Requires-Dist: einops>=0.7.0
Requires-Dist: huggingface-hub>=0.23.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: torch>=2.2.0
Requires-Dist: transformers<5.3.0,>=4.48.0
Provides-Extra: agent
Requires-Dist: strands-agents>=0.1.0; extra == 'agent'
Provides-Extra: all
Requires-Dist: accelerate>=1.2.0; extra == 'all'
Requires-Dist: bitsandbytes>=0.45.0; extra == 'all'
Requires-Dist: boto3>=1.34.0; extra == 'all'
Requires-Dist: lerobot>=0.5.0; extra == 'all'
Requires-Dist: mujoco-mjx>=3.0.0; extra == 'all'
Requires-Dist: mujoco>=3.0.0; extra == 'all'
Requires-Dist: mypy>=1.0; extra == 'all'
Requires-Dist: peft>=0.14.0; extra == 'all'
Requires-Dist: pyarrow>=14.0.0; extra == 'all'
Requires-Dist: pyaudio>=0.2.13; extra == 'all'
Requires-Dist: pytest>=7.0; extra == 'all'
Requires-Dist: ruff>=0.3.0; extra == 'all'
Requires-Dist: sagemaker>=2.232.0; extra == 'all'
Requires-Dist: sagemaker[huggingface]>=2.232.0; extra == 'all'
Requires-Dist: segment-anything>=1.0; extra == 'all'
Requires-Dist: sounddevice>=0.4.6; extra == 'all'
Requires-Dist: strands-agents>=0.1.0; extra == 'all'
Requires-Dist: strands-cosmos>=0.1.0; extra == 'all'
Requires-Dist: trl>=0.15.0; extra == 'all'
Requires-Dist: ultralytics>=8.0.0; extra == 'all'
Requires-Dist: wandb>=0.16.0; extra == 'all'
Provides-Extra: collect
Requires-Dist: lerobot>=0.5.0; extra == 'collect'
Requires-Dist: pyarrow>=14.0.0; extra == 'collect'
Requires-Dist: pyaudio>=0.2.13; extra == 'collect'
Requires-Dist: segment-anything>=1.0; extra == 'collect'
Requires-Dist: sounddevice>=0.4.6; extra == 'collect'
Requires-Dist: ultralytics>=8.0.0; extra == 'collect'
Provides-Extra: cosmos
Requires-Dist: strands-cosmos>=0.1.0; extra == 'cosmos'
Provides-Extra: dev
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.3.0; extra == 'dev'
Provides-Extra: isaac
Requires-Dist: isaacsim>=4.0.0; extra == 'isaac'
Provides-Extra: kimodo
Requires-Dist: kimodo>=0.1.0; extra == 'kimodo'
Provides-Extra: lerobot
Requires-Dist: lerobot>=0.5.0; extra == 'lerobot'
Provides-Extra: newton
Requires-Dist: newton-sim>=0.5.0; extra == 'newton'
Requires-Dist: warp-lang>=1.0.0; extra == 'newton'
Provides-Extra: sagemaker
Requires-Dist: accelerate>=1.2.0; extra == 'sagemaker'
Requires-Dist: boto3>=1.34.0; extra == 'sagemaker'
Requires-Dist: peft>=0.14.0; extra == 'sagemaker'
Requires-Dist: sagemaker>=2.232.0; extra == 'sagemaker'
Requires-Dist: sagemaker[huggingface]>=2.232.0; extra == 'sagemaker'
Provides-Extra: sim
Requires-Dist: mujoco-mjx>=3.0.0; extra == 'sim'
Requires-Dist: mujoco>=3.0.0; extra == 'sim'
Provides-Extra: train
Requires-Dist: accelerate>=1.2.0; extra == 'train'
Requires-Dist: bitsandbytes>=0.45.0; extra == 'train'
Requires-Dist: peft>=0.14.0; extra == 'train'
Requires-Dist: trl>=0.15.0; extra == 'train'
Requires-Dist: wandb>=0.16.0; extra == 'train'
Description-Content-Type: text/markdown

<div align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="docs/assets/neon-banner.svg">
    <source media="(prefers-color-scheme: light)" srcset="docs/assets/neon-banner.svg">
    <img src="docs/assets/neon-banner.svg" alt="Neon — Teaching robots to see time" width="960"/>
  </picture>
  
  [![PyPI](https://img.shields.io/pypi/v/neon-vla?color=00e5ff&label=neon-vla)](https://pypi.org/project/neon-vla/)
  [![License: MIT](https://img.shields.io/badge/License-MIT-00bfa5.svg)](LICENSE)
  [![Tests](https://github.com/cagataycali/neon/actions/workflows/ci.yml/badge.svg)](https://github.com/cagataycali/neon/actions)
  [![Python 3.10+](https://img.shields.io/badge/python-3.10+-d500f9.svg)](https://www.python.org/downloads/)
  [![Docs](https://img.shields.io/badge/docs-live-00e5ff.svg)](https://cagataycali.github.io/neon)
</div>

<div align="center">
  
  **[▶️ Watch the explainer video](docs/assets/neon-explainer.mp4)**
  
</div>

---

## The Idea

A child watches a ball roll off a table and reaches out to catch it. She doesn't look at a photograph — she sees the *motion*. The arc, the acceleration, the moment it leaves the edge. She predicts the future from the flow of time.

**Every robot today is blind to this.** State-of-the-art Vision-Language-Action models look at the world through frozen snapshots. They see *where* things are, but not *where things are going*. It's like trying to catch that ball with your eyes closed between blinks.

**Neon's insight is one sentence:**

> Video foundation models already understand motion — we just connect them to robot bodies.

Models like Qwen2.5-Omni and Cosmos-Reason2 have watched millions of hours of video. They've learned that cups fall when pushed, that doors swing on hinges, that hands reach before they grasp. This temporal understanding — physics, dynamics, cause and effect — is exactly what a robot needs. It's sitting there, pre-trained, waiting.

So we do something radical in its simplicity. We take a **7-billion-parameter video model**, **freeze it entirely**, and train a *tiny* action decoder on top — just **6 million parameters, 0.08% of the total** — that translates the video model's rich temporal understanding into 29 joint commands for a humanoid body, 16 timesteps into the future.

**The video model sees. The decoder acts.**

```bash
pip install neon-vla
```

---

## How It Works

```mermaid
graph LR
    CAM["📹 Camera"] --> VB["Video Backbone<br/><b>7B frozen</b><br/>Qwen2.5-Omni / Cosmos"]
    MIC["🎤 Voice"] --> VB
    PROP["🦾 Joints"] --> PE["Proprio Encoder"]
    LIDAR["📡 LiDAR"] --> LE["PointCloud Encoder"]
    EEF["🤲 EEF State"] --> EE["EEF Encoder"]
    VB --> FUS["Feature Fusion"]
    PE --> FUS
    LE --> FUS
    EE --> FUS
    FUS --> AH["Action Heads<br/><b>~6M trainable</b><br/>Parameter Golf v2"]
    AH --> ACT["🤖 29 DoF × 16 steps"]
    AH --> SPEECH["🔊 Speech Out"]
    
    style VB fill:#0097a7,color:#fff,stroke:#0097a7
    style AH fill:#e65100,color:#fff,stroke:#e65100
    style FUS fill:#333,color:#fff
```

<details>
<summary><b>Full architecture diagram — all 6 sensor modalities plus language</b></summary>

```mermaid
graph TD
    subgraph "Inputs (6 modalities)"
        CAM["📹 Camera Frames"]
        VID["🎬 Video Frames"]
        MIC["🎤 Audio (16kHz)"]
        TXT["📝 Language"]
        PROP["🦾 Joint States (29 DoF)"]
        LID["📡 LiDAR Point Cloud (N×4)"]
        EEF["🤲 EEF State (14 DoF)"]
    end

    subgraph "Neon VLA"
        VB["Video Backbone<br/>Qwen2.5-Omni / Cosmos-Reason2<br/><i>frozen, 3-7B</i>"]
        AE["Whisper Audio Encoder<br/><i>frozen, 39M</i>"]
        PE["Proprio Encoder<br/><i>trainable MLP</i>"]
        LE["PointCloud Encoder<br/><i>trainable PointNet-style</i>"]
        EE["EEF Encoder<br/><i>trainable MLP</i>"]
        FUS["Feature Fusion<br/>Linear + ReLU²"]
        AH["Action Heads<br/>Parameter Golf v2<br/><i>trainable, ~6M</i>"]
        SH["Speech Head<br/>PersonaPlex TTS"]
    end

    subgraph "Outputs"
        ARM["Arms (14 DoF)"]
        LOCO["Locomotion (vx, vy, ω)"]
        HEAD["Head (2 DoF)"]
        VOICE["🔊 Voice"]
    end

    CAM --> VB
    VID --> VB
    TXT --> VB
    MIC --> AE
    PROP --> PE
    LID --> LE
    EEF --> EE
    VB --> FUS
    AE --> FUS
    PE --> FUS
    LE --> FUS
    EE --> FUS
    FUS --> AH
    FUS --> SH
    AH --> ARM
    AH --> LOCO
    AH --> HEAD
    SH --> VOICE

    style VB fill:#0097a7,color:#fff
    style AH fill:#e65100,color:#fff
    style FUS fill:#333,color:#fff
```
</details>
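
The fusion node is deliberately simple: concatenate every encoder's features and project them through a single Linear + ReLU² layer. A minimal sketch, assuming illustrative dimensions and module names (the real module lives in `neon/model/neon_vla.py`):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenate per-modality features, then Linear + ReLU² (sizes illustrative)."""

    def __init__(self, modality_dims: dict[str, int], d_model: int = 1024):
        super().__init__()
        self.keys = sorted(modality_dims)  # fixed concatenation order
        self.proj = nn.Linear(sum(modality_dims.values()), d_model)

    def forward(self, feats: dict[str, torch.Tensor]) -> torch.Tensor:
        x = torch.cat([feats[k] for k in self.keys], dim=-1)  # (B, sum of dims)
        return torch.relu(self.proj(x)) ** 2                  # ReLU²: max(0, x)²
```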

### Why video models, not image models?

| | Traditional VLAs | Neon |
|---|---|---|
| **Vision** | Single frame (photograph) | Temporal sequence (**video**) |
| **Physics** | None — must learn from scratch | Cosmos-Reason2 — **pre-trained on physical world** |
| **Prediction** | 1 action at a time | **16-step action chunking** (anticipates the future) |
| **Audio** | Separate pipeline | **Native** — Qwen2.5-Omni hears and speaks |
| **Spatial** | No depth | **LiDAR point clouds** → PointNet encoder |
| **Trainable params** | Billions | **~6M** (0.08% of total) |

---

## Quick Start

### As a VLA model

```python
from neon.model.neon_vla import NeonVLA, NeonConfig

model = NeonVLA(NeonConfig(control_mode="arms_only"))
model.load_backbone()

# Full omni-modal prediction
output = model.predict(
    image=camera_frame,                     # 📹 what the robot sees
    instruction="Pick up the red cup",      # 📝 what you want
    proprioception=joint_states,            # 🦾 where the robot is
    audio=voice_waveform,                   # 🎤 spoken command (16kHz)
    lidar=point_cloud,                      # 📡 spatial awareness (N×4)
    eef_state=ee_positions,                 # 🤲 hand positions (14 DoF)
    speak=True,                             # 🔊 robot narrates its action
)

output.actions      # → (16, 17) — 16 timesteps × 17 joints
output.upper_body   # → (16, 14) — arm positions
output.locomotion   # → (16, 3)  — velocity commands (vx, vy, ω)
output.speech_path  # → "/tmp/neon_speech_xyz.wav"
```

### As a strands-robots policy (plug-and-play)

```python
# Direct usage
from neon import NeonPolicy
policy = NeonPolicy(host="192.168.123.10", port=8300)
actions = policy.get_actions_sync(obs, "pick up the red cup")

# Via strands-robots (auto-discovered on install)
from strands_robots.policies import create_policy
policy = create_policy("neon", host="robot-ip", port=8300)

# Smart resolution from HuggingFace model ID
policy = create_policy("cagataydev/neon-g1-v1-dev")
```

### Run the inference server

```bash
# On the robot (Jetson Orin / any CUDA machine)
neon-serve --model cagataydev/neon-g1-v1-dev --port 8300

# The server accepts ALL modalities via HTTP:
curl -X POST http://robot:8300/predict \
  -H "Content-Type: application/json" \
  -d '{"image_base64": "...", "instruction": "pick up the cup", "proprioception": [...]}'

# Check what modalities the model supports:
curl http://robot:8300/health
# → {"modalities": {"camera": true, "audio": true, "lidar": false, ...}}
```
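
The same protocol from Python. A minimal client sketch using `requests` (an assumption, not a declared dependency); field names follow the curl example, and the proprioception vector is an illustrative 17-DoF `arms_only` state:

```python
import base64
import requests

HOST = "http://robot:8300"  # hypothetical address, as in the curl example

# Ask the server which modalities its checkpoint accepts
caps = requests.get(f"{HOST}/health", timeout=5).json()
print(caps["modalities"])

# POST a base64-encoded camera frame plus instruction and joint state
with open("frame.jpg", "rb") as f:
    payload = {
        "image_base64": base64.b64encode(f.read()).decode("ascii"),
        "instruction": "pick up the cup",
        "proprioception": [0.0] * 17,  # illustrative arms_only joint state
    }
actions = requests.post(f"{HOST}/predict", json=payload, timeout=30).json()
```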

---

## The Action Decoder — Parameter Golf v2

Our decoder heads come from a competition to build the smallest working language model. Every trick matters when your entire trainable model fits in **25 megabytes** (a code sketch follows the table):

| Technique | What | Why it matters |
|---|---|---|
| **ReLU²** | `max(0, x)²` | Smoother than ReLU, cheaper than GELU or SiLU |
| **RMSNorm** | `x / √(mean(x²))` | Half the cost of LayerNorm |
| **Soft-Capping** | `c · tanh(x/c)` | Never kills gradients at boundaries |
| **Residual Scales** | `h + α·h_skip` | Learned α — network decides backbone trust |
| **U-Net Skip** | Layer 0 → last layer | Gradient highway through deep decoders |
| **β₁ = 0.85** | Lower Adam momentum | Faster adaptation to shifting distributions |
| **Grad Clip 0.3** | Tight clipping | Prevents divergence in small heads |
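
A minimal sketch of one decoder block combining these ingredients. Module names, sizes, and the learning rate are illustrative, not the actual `neon.model.action_heads` API:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """x / √(mean(x²)) with a learned gain, roughly half the work of LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class GolfBlock(nn.Module):
    """One decoder block: RMSNorm → MLP with ReLU² → soft-cap → scaled residual."""
    def __init__(self, dim: int = 512, cap: float = 30.0):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.up = nn.Linear(dim, 4 * dim)
        self.down = nn.Linear(4 * dim, dim)
        self.alpha = nn.Parameter(torch.ones(1))  # learned residual scale α
        self.cap = cap

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.up(self.norm(h))) ** 2  # ReLU² = max(0, x)²
        x = self.down(x)
        x = self.cap * torch.tanh(x / self.cap)     # soft-cap: c·tanh(x/c)
        return h + self.alpha * x                   # h + α·h_skip

block = GolfBlock()
opt = torch.optim.AdamW(block.parameters(), lr=3e-4, betas=(0.85, 0.999))  # β₁ = 0.85
loss = block(torch.randn(2, 512)).pow(2).mean()  # dummy loss to exercise the block
loss.backward()
torch.nn.utils.clip_grad_norm_(block.parameters(), 0.3)  # grad clip 0.3
opt.step()
```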

---

## Data Soup — A Thousand Bodies, Unified

A robot built for a specific body usually needs data from that body. We break this with **relative actions** — displacements in the gripper's local frame. The same reaching motion produces the same numbers whether performed by a Franka, an SO-100, or our G1 humanoid.

```python
# Same physical motion = same numbers, any robot.
# prev_rotm / curr_rotm are 3×3 end-effector rotations; *_xyz are world-frame positions.
rel_xyz = prev_rotm.T @ (curr_xyz - prev_xyz)   # position delta in the gripper's local frame
rel_rot = rotm2euler(prev_rotm.T @ curr_rotm)   # rotation delta as Euler angles
action  = [rel_xyz, rel_rot, gripper_state]     # cross-embodiment action
```
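
The same math as a self-contained function, with `scipy` (an assumption, not a declared dependency) standing in for the repo's `rotm2euler` helper; the `xyz` Euler convention is likewise illustrative:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def relative_ee_action(prev_xyz, prev_rotm, curr_xyz, curr_rotm, gripper_state):
    """Relative end-effector action: deltas expressed in the previous gripper frame."""
    rel_xyz = prev_rotm.T @ (curr_xyz - prev_xyz)
    rel_rot = Rotation.from_matrix(prev_rotm.T @ curr_rotm).as_euler("xyz")
    return np.concatenate([rel_xyz, rel_rot, [gripper_state]])  # 7-D action

# A 5 cm translation along the gripper's x-axis yields the same action
# no matter which robot produced the trajectory:
eye3 = np.eye(3)
a = relative_ee_action(np.zeros(3), eye3, np.array([0.05, 0.0, 0.0]), eye3, 1.0)
assert np.allclose(a[:3], [0.05, 0.0, 0.0])
```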

Seven source types, one training stream:

```mermaid
graph LR
    LR["🤖 LeRobot<br/>Bridge + DROID"] --> MIX
    AG["🦾 Agibot-World<br/>Bimanual 1M+"] --> MIX
    COS["🌌 Cosmos DreamGen<br/>Synthetic"] --> MIX
    S4D["📸 Stereo4D<br/>Kitchen depth"] --> MIX
    VC["🗣️ Voice Commands<br/>50K instructions"] --> MIX
    TEL["🎮 G1 Teleoperation<br/>LiDAR + EEF + Audio"] --> MIX
    DR["💭 GR00T-Dreams<br/>Humanoid demos"] --> MIX

    MIX["Data Soup 🥣"] --> TRAIN["NeonTrainer<br/>All 6 modalities"]

    style MIX fill:#e65100,color:#fff,stroke:#e65100
```
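
Conceptually, the soup is weighted interleaving of these streams. A sketch with 🤗 `datasets` (a declared dependency); the dataset names and weights are placeholders, not the actual recipe:

```python
from datasets import interleave_datasets, load_dataset

# Placeholder sources standing in for the seven streams above
bridge = load_dataset("some-org/bridge-episodes", split="train", streaming=True)
dreams = load_dataset("some-org/groot-dreams", split="train", streaming=True)

soup = interleave_datasets(
    [bridge, dreams],
    probabilities=[0.7, 0.3],           # sampling weight per source
    seed=42,
    stopping_strategy="all_exhausted",  # keep sampling until every source is drained
)
for episode in soup.take(4):
    ...  # each episode flows into the trainer's omni-modal collation
```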

---

## G1 Humanoid — 29 Degrees of Freedom

```mermaid
graph TD
    G1["Unitree G1<br/>29 DoF"] --> LA["Left Arm · 7"]
    G1 --> RA["Right Arm · 7"]
    G1 --> T["Torso · 1"]
    G1 --> H["Head · 2"]
    G1 --> LL["Left Leg · 6"]
    G1 --> RL["Right Leg · 6"]

    style G1 fill:#e65100,color:#fff,stroke:#e65100
    style LA fill:#00695c,color:#fff
    style RA fill:#00695c,color:#fff
    style LL fill:#0097a7,color:#fff
    style RL fill:#0097a7,color:#fff
```

| Mode | Joints | Use Case |
|---|---|---|
| `arms_only` | 14 arms + 3 loco = **17** | Tabletop manipulation |
| `upper_body` | + 3 head/torso = **20** | Manipulation + gaze tracking |
| `whole_body` | All **29** | Full locomotion + manipulation |
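
In code, a control mode is just a view over the 29-vector. The index layout below follows the diagram's ordering and is an assumption; the canonical mapping lives in `neon/data/action_space.py`:

```python
import numpy as np

# Illustrative G1 joint layout (ordering assumed, not the canonical one)
G1_GROUPS = {
    "left_arm":  slice(0, 7),
    "right_arm": slice(7, 14),
    "torso":     slice(14, 15),
    "head":      slice(15, 17),
    "left_leg":  slice(17, 23),
    "right_leg": slice(23, 29),
}
assert sum(s.stop - s.start for s in G1_GROUPS.values()) == 29

q = np.zeros(29, dtype=np.float32)  # whole_body joint command
arms = np.concatenate([q[G1_GROUPS["left_arm"]], q[G1_GROUPS["right_arm"]]])  # 14 DoF
```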

---

## Training

Eight presets, from a laptop GPU to a cloud A100:

| Config | Backbone | Mode | GPU | Notes |
|---|---|---|---|---|
| `edge_3b` | Qwen2.5-Omni-**3B** | arms | RTX 3090 / L4 | Edge deployment |
| `default_arms_only` | Qwen2.5-Omni-**7B** | arms | A100 40GB | Standard |
| `default_wholebody` | Qwen2.5-Omni-**7B** | whole | A100 80GB | Full body |
| `cosmos_physics` | Cosmos-Reason2-**8B** | arms | A100 40GB | Physics-heavy |
| `large_arms` | Qwen2.5-Omni-**7B** | arms | A100 40GB | ~44M heads (GR00T-scale) |
| `large_cosmos` | Cosmos-Reason2-**8B** | arms | A100 40GB | Physics + large heads |
| `large_wholebody` | Qwen2.5-Omni-**7B** | whole | A100 80GB | 29 DoF + large heads |
| `g1_omnimodal` | Qwen2.5-Omni-**7B** | whole | A100 40GB+ | **All sensors**: LiDAR + EEF + audio |

```bash
# Train on HuggingFace Jobs (recommended)
hf jobs uv run --flavor a100-large --secrets HF_TOKEN --timeout 8h -- \
    python -m neon.training.train --backbone Qwen/Qwen2.5-Omni-7B --mode arms_only
```

```python
# Or locally
from neon.training.config import default_arms_only_config
from neon.training.train import train

train(default_arms_only_config())
```

---

## By The Numbers

| Metric | Value |
|---|---|
| **Backbone** | 3-8B params (frozen) |
| **Decoder** | ~6M params (trainable) — **0.08%** |
| **Action Space** | 29 DoF, 16-step chunking |
| **Input Modalities** | 6 (camera, video, audio, LiDAR, EEF, proprioception) |
| **Latency** | 50ms on Jetson Orin |
| **Data Sources** | 7 types, cross-embodiment |
| **Training Presets** | 8 configs (edge to omni-modal) |
| **Tests** | 168 passing (CPU, no GPU needed) |
| **License** | MIT |

---

## Project Structure

```
neon/
├── neon/
│   ├── model/
│   │   ├── neon_vla.py          # Complete VLA pipeline + PointCloudEncoder + EEFEncoder
│   │   ├── action_heads.py      # Parameter Golf v2 decoders (ReLU², soft-cap)
│   │   ├── video_backbone.py    # Qwen / Cosmos adapter (3B-8B)
│   │   └── audio.py             # Whisper encoder + PersonaPlex TTS
│   ├── data/
│   │   ├── action_space.py      # G1 29-DoF joint definitions + normalization
│   │   ├── data_soup.py         # 7-source data mixing (NeonEpisode w/ all modalities)
│   │   └── relative_actions.py  # Cosmos-style relative EE actions
│   ├── training/
│   │   ├── config.py            # TrainConfig + 8 presets (incl. g1_omnimodal)
│   │   └── train.py             # NeonTrainer (omni-modal collation + loss)
│   ├── inference/
│   │   ├── server.py            # HTTP inference server (all 6 modalities)
│   │   └── g1_controller.py     # Unitree SDK interface
│   ├── streams/
│   │   ├── channels.py          # Typed data channels (Camera, Joint, LiDAR, Audio, Text, ToolCall)
│   │   ├── recorder.py          # StreamRecorder → LeRobot dataset
│   │   └── session.py           # StreamSession — full robot loop
│   ├── dashboard/
│   │   └── bridge.py            # WebSocket dashboard (camera, joints, LiDAR viz)
│   └── policy.py                # NeonPolicy — strands-robots integration (HTTP/ZMQ)
├── tests/                       # 168 tests (all CPU, no GPU needed)
├── paper/                       # LaTeX white paper + soul manifesto
├── video/                       # Remotion explainer video source
└── docs/                        # MkDocs Material site
```

---

## strands-robots Integration

Neon ships as a first-class [strands-robots](https://github.com/strands-labs/robots) policy. On `pip install neon-vla`, it auto-registers and is immediately discoverable:

```python
from strands_robots.policies import create_policy

# The NeonPolicy bridges VLA inference (5-10 Hz) to robot control (50 Hz)
# via RTC action queue with temporal blending
policy = create_policy("neon", host="192.168.123.10", port=8300)

# Full omni-modal observation
obs = {
    "observation.images.front": camera_frame,   # (H, W, 3) uint8
    "observation.state": joint_positions,        # (17,) float32
    "observation.audio": voice_waveform,         # (16000,) float32
    "observation.lidar": point_cloud,            # (4096, 4) float32
    "observation.eef_state": ee_state,           # (14,) float32
}
actions = policy.get_actions_sync(obs, "pick up the red cup")
```

Three blend schedules: **linear**, **step**, **exponential**. The policy auto-discovers server capabilities via `/health`.
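
Concretely: the VLA emits overlapping 16-step chunks at 5-10 Hz while the controller plays actions at 50 Hz, so the overlapping steps get cross-faded. A sketch of the linear schedule (the idea only, not NeonPolicy's exact code; step and exponential swap the weight curve):

```python
import numpy as np

def blend_chunks(old: np.ndarray, new: np.ndarray, overlap: int) -> np.ndarray:
    """Cross-fade the last `overlap` steps of the old (16, D) chunk
    into the first `overlap` steps of the new one."""
    w = np.linspace(0.0, 1.0, overlap)[:, None]           # weight ramps 0 → 1
    blended = (1 - w) * old[-overlap:] + w * new[:overlap]
    return np.concatenate([blended, new[overlap:]])       # (16, D), ready to play out

old_chunk = np.zeros((16, 17), dtype=np.float32)  # tail of the previous prediction
new_chunk = np.ones((16, 17), dtype=np.float32)   # fresh prediction
queue = blend_chunks(old_chunk, new_chunk, overlap=8)
```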

---

## Video

The explainer video is built with [Remotion](https://remotion.dev) — code-as-video, version-controlled, reproducible:

```bash
cd video
npm install
npx remotion studio          # Preview in browser
npx remotion render src/index.ts NeonExplainer out/neon-explainer.mp4
```

---

## Papers

| Document | Pages | What |
|---|---|---|
| [`paper/neon.tex`](paper/) | 6 | Full technical report — math, proofs, pseudocode |
| [`paper/soul.tex`](paper/) | 1 | The Soul of Neon — *"Teaching Robots to See Time"* |

PDFs attached to every [GitHub release](https://github.com/cagataycali/neon/releases).

---

## Related Work

- [GR00T N1](https://github.com/nvidia/isaac-gr00t) — Architecture reference for humanoid VLA
- [GR00T-WholeBodyControl](https://github.com/NVlabs/GR00T-WholeBodyControl) — RL whole-body policies (lessons adopted in Neon)
- [Cosmos-Predict2.5](https://github.com/nvidia/cosmos-predict2) — Relative actions, world model reasoning
- [OmniVLA](https://github.com/cagataycali/OmniVLA) — Omni-modal VLA reference
- [MicroGPT Parameter Golf](https://github.com/cagataycali/strands-microgpt) — Source of action head optimizations
- [Strands Agents](https://strandsagents.com) — Agent framework for robot integration
- [strands-robots](https://github.com/strands-labs/robots) — Robot SDK (NeonPolicy integrates via entry-point)

---

## Citation

```bibtex
@software{neon2026,
  title   = {Neon: Open-Source Vision-Language-Action Model for Humanoid Whole-Body Control},
  author  = {Cali, Cagatay},
  year    = {2026},
  url     = {https://github.com/cagataycali/neon},
  license = {MIT}
}
```

---

<div align="center">
  <br/>
  <em>"The difference between seeing a photograph and watching a video<br/>is the difference between knowing and understanding."</em>
  <br/><br/>
  <strong>One idea. An invitation.</strong>
  <br/><br/>
  <a href="https://cagataycali.github.io/neon">📖 Docs</a> · <a href="https://pypi.org/project/neon-vla/">📦 PyPI</a> · <a href="https://github.com/cagataycali/neon/releases">📄 Papers</a>
</div>
