WheelWright AI Framework
Version 2.0.160 · Generated 2026-04-10 · Source: llms-full.md
Complete documentation for the WheelWright AI Framework — the hub-and-spoke knowledge operating system.
For the canonical version, see the source on GitHub. Ensure you’re working against a tagged release.

Philosophy
WheelWright treats AI context as a compounding asset. The framework provides two universal primitives — Skills (executable capabilities) and Lugs (actionable knowledge records) — connected by a single execution contract: Perceive / Execute / Verify (PEV).
Architecture
The hub-and-spoke model: the Hub is the analytical clearinghouse and shared memory. Each Spoke is a project with its own Ozi orchestration agent, Advisors, and local Lug store. Advisors navigate by folder convention — directory position defines scope.
Lug Schema Specification
Version 1.1.0
Lugs are WAI’s universal communication primitive — actionable records, not summaries. Every Lug represents decomposed, meaningful work with full traceability.
Lug Types
| Type | Purpose | Example |
|---|---|---|
| task | Work to be done | “Migrate auth from JWT to sessions” |
| diagnosis | Problem identified | “SQL injection in auth handler” |
| prescription | Recommended fix | “Parameterize query at line 47” |
| decision | Judgment call | “Accepted risk on X because Y” |
| observation | Pattern recorded | “Coverage dropped 82% → 73%” |
| preference | Workflow preference | “User prefers terse confirmations” |
| signal | High-impact (impact ≥ 8) | “Architecture change affects API” |
| update | Framework/template version notification | “Template v3 available, running v2” |
| session | Session summary | “Security review ran, 2 issues found” |
A signal is any Lug with impact ≥ 8. The type field describes what kind of information a Lug carries; impact determines who sees it.

Required Fields

```yaml
id: "lug-2026-02-11-001"        # Unique: lug-{date}-{sequence}
type: "diagnosis"               # See Lug Types table
title: "SQL injection in auth handler"
status: "published"             # draft | published | acknowledged | in_progress | resolved
impact: 9                       # 1-10. >=8 = signal (visible to other nodes)
created_at: "2026-02-11T14:30:00Z"
created_by: "security-reviewer" # Agent/skill that created this Lug
node: "ownersshare/cto"         # Node path where this Lug lives
```

Traceability Fields
```yaml
# Git linkage
repo_version: "a3f7b2c"             # Commit hash where this work landed
branch: "main"
changelog_note: "Fixed SQL injection per Lug SEC-047"

# Lug lineage
parent_id: "lug-2026-02-10-015"     # Parent Lug if decomposed
source_id: "hub:lug-2026-02-09-003" # External Lug ID from another node
source_node: "hub"
source_acknowledged: true

# Decision tracing (type: decision)
alternatives_considered:
  - option: "Migrate to sessions"
    chosen: true
    reasoning: "Simpler state management"
  - option: "Keep JWT with refresh tokens"
    chosen: false
    reasoning: "Added complexity, edge cases"
```

Diagnosis & Prescription Fields
```yaml
# For type: diagnosis
severity: "critical"              # critical | high | medium | low
category: "security"
evidence: "Line 47 uses string concatenation in SQL query"
affected_files:
  - "src/auth/handler.js"

# For type: prescription (always linked to a diagnosis)
diagnosis_id: "lug-2026-02-11-001"
prescription: "Replace concatenation with parameterized query"
estimated_effort: "15 minutes"
auto_applicable: false
```

Calibration & Preference Fields
```yaml
# Calibration (applied when resolved)
resolution: "accepted"            # accepted | deferred | dismissed | modified
resolution_reason: "Applied as prescribed"
resolved_at: "2026-02-11T15:00:00Z"
resolved_by: "main-agent"

# Preference (type: preference)
category: "communication"         # communication | workflow | tooling
observation: "User prefers terse confirmations"
guidance: "Keep verifications to numbered list, <10 lines"
applies_to: "all"                 # all | hub | spoke | specific path
```

Outbound Monitoring Fields
```yaml
# Cross-node signal delivery tracking (spoke → Hub)
outbound_submitted_to: "hub/intake"
outbound_submitted_at: "2026-02-11T15:00:00Z"
outbound_acknowledged: false      # Flips true when Hub processes
outbound_acknowledged_at: null
```

PEV Fields (Perceive / Execute / Verify)
Optional structured execution context. When present, the agent follows these instead of interpreting the title. Simple Lugs don’t need PEV. Complex Lugs where precision matters should carry it.
```yaml
perceive:
  look_at:
    - "src/auth/handler.js"
    - "tests/auth/handler.test.js"
  current_state: "Email field accepts any string, no validation"
  success_state: "Email field validates against RFC 5322"
  context: "Phone validator in src/validators/phone.ts uses same pattern"
execute:
  approach: "Add email validation using the pattern in phone.ts"
  constraints:
    - "Do not modify the existing phone validator"
    - "Follow project convention for error messages"
  avoid:
    - "Do not use regex-only validation — use the validator library"
  reference_patterns:
    - "src/validators/phone.ts"
verify:
  commands:
    - "npm test -- --grep 'email validation'"
    - "npm run lint"
  expected_output: "All tests pass, no lint errors"
  manual_check: "Try submitting with invalid email — should show error"
```

Lifecycle
```
draft → published → acknowledged → in_progress → resolved
                                                    ↓
                                        (calibration applied)
```

| State | Meaning |
|---|---|
| draft | Created but not yet ready for action |
| published | Active and visible to relevant agents |
| acknowledged | Another node has seen this (cross-node Lugs) |
| in_progress | Work has started |
| resolved | Completed — calibration fields populated |
Special cases: Sub-agent Lugs (diagnosis, prescription) are created as published — no draft needed. Session Lugs are created as resolved — they are retrospective records.
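The lifecycle and its special cases can be sketched as a small transition table (a sketch; state and type names come from this spec, while the function names and the assumption that local Lugs may skip acknowledged are illustrative):

```python
# Allowed forward transitions in the Lug lifecycle.
TRANSITIONS = {
    "draft": {"published"},
    # Assumption: only cross-node Lugs pass through "acknowledged",
    # so "published" may also move straight to "in_progress".
    "published": {"acknowledged", "in_progress"},
    "acknowledged": {"in_progress"},
    "in_progress": {"resolved"},
    "resolved": set(),
}

# Special cases from this spec: some types skip ahead on creation.
INITIAL_STATE = {
    "diagnosis": "published",     # sub-agent Lugs start published
    "prescription": "published",
    "session": "resolved",        # retrospective records
}

def initial_state(lug_type: str) -> str:
    """Return the state a newly created Lug of this type starts in."""
    return INITIAL_STATE.get(lug_type, "draft")

def can_transition(current: str, target: str) -> bool:
    """Check whether a lifecycle transition is allowed."""
    return target in TRANSITIONS.get(current, set())
```
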
Impact Scoring
| Score | Visibility | Example |
|---|---|---|
| 1–3 | Local only | “Refactored helper” |
| 4–7 | Project-wide | “API contract changed” |
| 8–10 | Wheel-wide signal → Hub | “Architecture pattern across projects” |
Heuristics:
- Does this affect other extensions in this project? → 4+
- Does this affect other projects? → 8+
- Does this change a shared interface or contract? → 7+
- Is this a framework-level learning? → 9+
- Is this a policy or security concern? → 8+
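The score-to-visibility mapping from the table above can be sketched as (a sketch; the function name and tier labels are illustrative):

```python
def visibility(impact: int) -> str:
    """Map a Lug's impact score (1-10) to its visibility tier,
    per the Impact Scoring table: 1-3 local, 4-7 project-wide,
    8-10 wheel-wide signal surfaced to the Hub."""
    if not 1 <= impact <= 10:
        raise ValueError("impact must be between 1 and 10")
    if impact >= 8:
        return "wheel-wide"   # signal: visible to other nodes via the Hub
    if impact >= 4:
        return "project-wide"
    return "local"
```
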
Cross-Node Communication
Outbound (Spoke → Hub): When a Lug with impact ≥ 8 is created: (1) write to local WAI-Lugs.jsonl, (2) copy to hub/intake/{node-path}/{lug-id}.yaml, (3) record in local manifest as pending. On next wakeup, hub-watcher checks acknowledgment — surfaces to user if still pending.
Inbound (Hub → Spoke): On wakeup, spoke reads Hub Lugs newer than its hub_lug_cursor, creates local Lugs with source_id, decomposes into actionable work, updates cursor.
Hub Processing: Hub reads all items in hub/intake/, evaluates wheel-wide patterns, creates Hub-level Lugs, moves processed items to hub/intake/processed/, aggregates patterns across spokes into observation Lugs.
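The three outbound steps can be sketched as follows (a sketch under assumptions: the manifest filename and directory layout details are illustrative, and JSON is used for the intake file since every JSON document is also valid YAML):

```python
import json
from pathlib import Path

def submit_signal(lug: dict, wheel_root: Path) -> Path:
    """Outbound delivery for a signal Lug (impact >= 8):
    (1) append to the local Lug store, (2) copy to hub/intake,
    (3) record as pending in the local manifest."""
    spoke = wheel_root / lug["node"]
    spoke.mkdir(parents=True, exist_ok=True)

    # 1. Write to the local WAI-Lugs.jsonl store.
    with open(spoke / "WAI-Lugs.jsonl", "a") as f:
        f.write(json.dumps(lug) + "\n")

    # 2. Copy to hub/intake/{node-path}/{lug-id}.yaml.
    intake = wheel_root / "hub" / "intake" / lug["node"]
    intake.mkdir(parents=True, exist_ok=True)
    target = intake / f"{lug['id']}.yaml"
    target.write_text(json.dumps(lug, indent=2))

    # 3. Record as pending in the local manifest (filename assumed).
    with open(spoke / "WAI-Manifest.jsonl", "a") as f:
        f.write(json.dumps({"lug_id": lug["id"], "status": "pending"}) + "\n")
    return target
```

On the next wakeup, hub-watcher would check whether the pending entry has been acknowledged and surface it to the user if not.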
Decision Records
Decision Lugs are the apprenticeship engine. They capture reasoning and alternatives, teaching agents the conductor’s judgment over time.
Session Ledger
WAI-Ledger.jsonl — append-only log of commitments and their resolutions. Ensures commitments survive context loss and agent crashes.
| Type | Creator | Meaning |
|---|---|---|
| request | conductor | “I want this done” |
| agreement | agent | “I will do this, here’s how” |
| clarification | either | “Do you mean X or Y?” / “I mean X” |
| amendment | either | “Let’s change the approach” |
| delivery | agent | “Done” + commit hash |
| verification | conductor | “Confirmed” or “Doesn’t match” |
| rejection | conductor | “Doesn’t fulfill the agreement, because...” |
```jsonl
{"id":"led-2026-02-12-001","type":"request","content":"Migrate Lug schema to v2","source":"conductor","status":"open"}
{"id":"led-2026-02-12-002","type":"agreement","content":"Will add PEV fields","source":"agent","references":"led-2026-02-12-001","status":"open"}
{"id":"led-2026-02-12-003","type":"delivery","content":"350 Lugs upgraded","source":"agent","references":"led-2026-02-12-001","commit":"73112e9","status":"fulfilled"}
```

Integration:
- Wakeup — read ledger, surface open commitments.
- Closeout — session-observer flags unfulfilled commitments.
- Resume — a new agent reads the ledger and compares it against codebase state.
- Integrity — WAI-Ledger.jsonl is append-only (declared in WAI-Integrity.md).
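The wakeup step ("read ledger, surface open commitments") can be sketched as below. This is a simplified reading in which a fulfilled entry closes only the commitment it references; the real resolution rules may differ:

```python
import json

def open_commitments(ledger_lines):
    """Return ledger entries still awaiting resolution.
    An open entry is surfaced unless a later fulfilled entry
    references its id (a simplifying assumption)."""
    entries = [json.loads(line) for line in ledger_lines if line.strip()]
    fulfilled = {e.get("references") for e in entries
                 if e.get("status") == "fulfilled"}
    return [e for e in entries
            if e.get("status") == "open" and e["id"] not in fulfilled]
```

Run against the three example lines above, the request is closed by the delivery that references it, leaving the agreement as the only open commitment.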
Skill Contract Specification
Version 1.1.0
Skills are executable capabilities — sub-agents with defined scope, cost profile, and output contract.
Skill Types
| Type | Purpose | Tier | Write Access |
|---|---|---|---|
| reviewer | Analyze, produce diagnoses | lightweight | Lugs only |
| watcher | Monitor state changes | lightweight | Lugs only |
| guardian | Enforce policies, block | standard | Lugs + block |
| worker | Implement tasks | advanced | Code + Lugs |
| advisor | BRIEF alignment | standard | Lugs only |
| orchestrator | Reconcile, plan | advanced | Lugs + plans |
Contract Schema
```yaml
skill: security-review
version: 1.2.0
type: reviewer
model:
  tier: lightweight
  min_context: 32000
trigger:
  event: on_load
  frequency: per_session
scope:
  reads: ["src/**", "WAI-Lugs.jsonl"]
  writes: ["WAI-Lugs.jsonl"]
  never: ["src/**", ".env*"]
```

Trigger Configuration
| Event | Fires When |
|---|---|
| on_load | Wakeup sequence |
| on_commit | After git commit |
| on_content_change | Source files modified |
| on_demand | Explicitly requested |
| pre_refactor | Before structural changes |
Scope & Permissions
The never list overrides writes. Only worker Skills write source code. Scope violations are logged as Lugs.
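The write-permission rule can be sketched with glob matching (a sketch: the function name is illustrative, and glob semantics are an assumption — the framework may match scope patterns differently):

```python
from fnmatch import fnmatch

def may_write(path: str, scope: dict) -> bool:
    """Check a Skill's write permission against its contract scope.
    Patterns in `never` take precedence over patterns in `writes`."""
    if any(fnmatch(path, pat) for pat in scope.get("never", [])):
        return False
    return any(fnmatch(path, pat) for pat in scope.get("writes", []))
```

Using the security-review contract above, the Skill may append to WAI-Lugs.jsonl but never touch source files or env files.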
Tests & Use Cases
Every Skill MUST include use_cases — documentation, agent context, and institutional memory of why the Skill exists.
safe-refactor (Guardian)
Git checkpoint before structural changes. Cannot be skipped. Origin: A rogue agent destroyed a Hub folder on 2026-02-10 with no recovery.
qc-check (Reviewer)
Runs tests, verifies startup, diagnoses failures. Agents fix mechanical problems autonomously — never asks the user to debug.
hub-watcher (Watcher)
Checks Hub for signals, updates, and pending acknowledgments. Priority 1 in wakeup sequence.
framework-updater (Worker)
Applies template updates. Categorizes changes as safe/review/breaking. Auto-applies safe, creates Lugs for the rest. Depends on safe-refactor.
brief-advisor (Advisor)
Reviews BRIEF against Lug patterns. Detects contradictions between policy and practice. The apprenticeship engine.
Idempotency Rules
- Before creating a Lug, check if an equivalent already exists (same type, title pattern, affected scope).
- If found and still open: update the existing Lug (bump priority if recurring).
- If found and resolved: create a new Lug referencing the original as a regression.
- Lug ID is the idempotency key for cross-node references.
- Sub-agents running the same check twice should produce the same findings unless the codebase changed.
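The rules above can be sketched as an upsert (a sketch: the equivalence key of type, title, and affected scope follows the first rule, but the helper name and the regression linkage via parent_id are assumptions):

```python
def upsert_finding(new_lug: dict, existing_lugs: list) -> dict:
    """Idempotent Lug creation: update an equivalent open Lug,
    create a regression Lug if the equivalent was resolved,
    otherwise append the new Lug."""
    key = lambda l: (l["type"], l["title"], tuple(l.get("affected_files", [])))
    for lug in existing_lugs:
        if key(lug) == key(new_lug):
            if lug["status"] != "resolved":
                # Still open: bump impact on recurrence, don't duplicate.
                lug["impact"] = max(lug["impact"], new_lug["impact"])
                return lug
            # Previously resolved: new Lug references the original
            # as a regression (linkage field assumed).
            new_lug["parent_id"] = lug["id"]
            existing_lugs.append(new_lug)
            return new_lug
    existing_lugs.append(new_lug)
    return new_lug
```
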
BRIEF Integrity Checking
A Hub-level or spoke-level Skill that compares Lug patterns against BRIEF policies — surfaces contradictions between policy and practice.
- BRIEF states “maintain 80% test coverage” but 3 QC Lugs about declining coverage were dismissed → surface contradiction
- BRIEF states “security findings resolved within 48 hours” but a critical diagnosis Lug is 72 hours old → surface alert
- BRIEF states “no direct database queries in API handlers” but a diagnosis Lug found one → confirm policy still holds
This is awareness, not enforcement. The conductor decides whether to update the policy or address the violation.
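The first example above (a coverage policy contradicted by dismissed QC Lugs) can be sketched like this. The keyword-matching heuristic, the threshold, and the function name are all assumptions; the real Skill would presumably match policies to Lugs more carefully:

```python
def coverage_contradiction(brief_policies, lugs, threshold=3):
    """Surface a contradiction Lug when the BRIEF has a coverage
    policy but repeated coverage findings were dismissed."""
    has_policy = any("coverage" in p.lower() for p in brief_policies)
    dismissed = [l for l in lugs
                 if "coverage" in l.get("title", "").lower()
                 and l.get("resolution") == "dismissed"]
    if has_policy and len(dismissed) >= threshold:
        return {"type": "observation",
                "title": (f"BRIEF coverage policy contradicted by "
                          f"{len(dismissed)} dismissed Lugs")}
    return None  # awareness only: no contradiction to surface
```

The result is an observation Lug for the conductor, not an enforcement action.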
Bench Test
Feature v1.0.0 · Internal
Bench Test is the reception-side evaluator for WAI Tracks — a prompt laboratory, synchronization library, and scoring dashboard built inside WheelWright Vault.
The runtime prompt captures a session. Bench Test receives the output, scores it, compares it to prior runs, and generates grounded improvement suggestions. It turns prompt evolution from guesswork into a repeatable engineering practice.
Use Cases
| Use Case | What You Do | What You Get |
|---|---|---|
| Baseline a prompt version | Upload a track from a fresh prompt | Objective 0–10 score across 6 categories |
| Detect regressions | Upload track after prompt edit, compare to baseline | Per-category delta — improved / unchanged / regressed |
| Evidence-grounded iteration | Review generated suggestions | Specific findings from your actual track, not generic tips |
| Manage a change queue | Adopt, defer, or reject each suggestion | Curated list of prompt changes for next iteration |
| Build a prompt history | Keep running Bench Test across versions | Full audit trail of why the prompt changed over time |
Workflow
Step 1 — Prepare your Track
Export the WAI Track JSONL from your session. The standard path is:
```
WAI-Spoke/sessions/track_session-YYYYMMDD-HHMM.jsonl
```

Each line must be a valid JSON object. The evaluator records parse failures as warnings — it does not reject a track because of a few malformed lines.
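That lenient behavior looks roughly like this (a sketch of the described behavior; the function name and warning format are illustrative, not the evaluator's actual implementation):

```python
import json

def parse_track(jsonl_text: str):
    """Parse a WAI Track: collect valid JSON lines as turns and
    record malformed lines as warnings rather than rejecting
    the whole track."""
    turns, warnings = [], []
    for n, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            turns.append(json.loads(line))
        except json.JSONDecodeError as err:
            warnings.append(f"line {n}: {err.msg}")
    return turns, warnings
```
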
Step 2 — Create a Run
Go to /dashboard/bench-test and fill in the form. Required fields: Project, Prompt Version, and the Track JSONL (upload file or paste). All other fields are optional metadata that help you filter and compare runs later.
Or POST directly via the API:
```shell
curl -X POST https://wheelwright.ai/api/bench-test/runs \
  -H 'X-API-Key: <your-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "project": "WAIWeb",
    "promptVersion": "v2.0.18",
    "trackContent": "{\"turn\":1,...}\n{\"turn\":2,...}",
    "model": "claude-sonnet-4-6",
    "sessionCodename": "session-20260323-0844"
  }'
```

Step 3 — Review the Evaluation
The run detail page shows the overall score, a category breakdown with bar indicators, critical issues, and strengths. Everything on this page traces back to specific findings in your track — not boilerplate.
Step 4 — Attach Supporting Artifacts (optional)
Add a chat transcript, reviewer notes, or a review document to the same run:
```shell
curl -X POST https://wheelwright.ai/api/bench-test/runs/{runId}/artifacts \
  -H 'X-API-Key: <your-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "artifactType": "chat_transcript",
    "content": "<raw transcript text>"
  }'
```

Valid artifact types: chat_transcript, review, notes, derived
Step 5 — Compare to a Prior Run
If a prior run exists for the same project, a Compare vs Previous button appears on the detail page. The comparison view shows side-by-side category scores, per-category deltas, and a badge summary. Link directly:
```
/dashboard/bench-test/compare?a={priorRunId}&b={thisRunId}
```

Step 6 — Work the Suggestions
Bench Test generates a list of suggested prompt improvements grounded in the evaluation findings. Each is classified:
| Classification | Meaning |
|---|---|
| critical | Active failure — address before next run |
| structural | Architectural gap — worth a dedicated prompt change |
| optional | Nice-to-have — consider when trimming later |
Mark each suggestion Adopt, Defer, or Reject. Adopted suggestions form your change list for the next iteration.
Scoring
Each category scores 0–10. The overall score is a weighted average.
| Category | Weight | What It Checks |
|---|---|---|
| Integrity | 25% | Sequential turns, no duplicates, no gaps, clean parse |
| Schema | 20% | Required fields present, no key drift (e vs type) |
| Signal Capture | 20% | Decisions, insights, thinking, open threads populated |
| Drift Handling | 10% | Evolution field present, phase transitions documented |
| Readability | 15% | Focus and action field length and substance |
| Export Reliability | 10% | Closing phase, parse error rate, truncation signals |
| Score | Grade |
|---|---|
| 9–10 | Excellent — production-quality prompt output |
| 7–8 | Good — minor gaps, nothing structural |
| 5–6 | Acceptable — real issues present, addressable |
| 3–4 | Needs Work — structural problems affecting signal value |
| 0–2 | Poor — fundamental capture failure |
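The weighted average can be sketched with the weights from the category table (a sketch; the snake_case category keys are illustrative, not the evaluator's actual field names):

```python
# Category weights from the scoring table above (sum to 1.0).
WEIGHTS = {
    "integrity": 0.25,
    "schema": 0.20,
    "signal_capture": 0.20,
    "drift_handling": 0.10,
    "readability": 0.15,
    "export_reliability": 0.10,
}

def overall_score(categories: dict) -> float:
    """Weighted average of per-category 0-10 scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return round(sum(categories[c] * w for c, w in WEIGHTS.items()), 2)
```

For example, strong capture with weak drift handling (drift 4, schema 6, integrity 8, the rest 10) still lands in the "Good" band.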
API Reference
| Endpoint | Method | Purpose |
|---|---|---|
| /api/bench-test/runs | POST | Create run, ingest JSONL, run evaluation |
| /api/bench-test/runs | GET | List all runs for the authenticated user |
| /api/bench-test/runs/:id/artifacts | POST | Attach transcript / review / notes |
| /api/bench-test/runs/:id/artifacts | GET | List artifacts for a run |
| /api/bench-test/suggestions/:id | PATCH | Set adoption status (adopted / deferred / rejected) |
All endpoints require auth: GitHub session cookie or X-API-Key header (generate from the Vault dashboard).
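A suggestion-adoption call might be assembled like this. The endpoint and X-API-Key header come from the reference above, but the JSON body shape ({"status": ...}) is an assumption — check the dashboard or API docs for the actual field name before relying on it:

```python
import json

def adoption_request(suggestion_id: str, status: str, api_key: str) -> dict:
    """Build a PATCH request (method, URL, headers, body) that sets a
    suggestion's adoption status. Body field name is assumed."""
    if status not in ("adopted", "deferred", "rejected"):
        raise ValueError(f"invalid status: {status}")
    return {
        "method": "PATCH",
        "url": f"https://wheelwright.ai/api/bench-test/suggestions/{suggestion_id}",
        "headers": {"X-API-Key": api_key, "Content-Type": "application/json"},
        "body": json.dumps({"status": status}),
    }
```

Send the result with any HTTP client; only the three adoption states from Step 6 are accepted.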
Submission Tips
| Category | How to score higher |
|---|---|
| Integrity | Export the complete JSONL — no partial sessions |
| Schema | Use consistent field names throughout the session |
| Signal Capture | Prompt explicitly for decisions, insights, thinking, open threads every turn |
| Drift Handling | Populate evolution on every turn after turn 1 |
| Readability | Keep focus to 15–80 chars; make action a substantive sentence |
| Export Reliability | End the session with phase: "review" or phase: "closeout" |
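Putting the tips together, a closing track line might look like the following. The field names are inferred from the scoring categories and tips above (turn, focus, action, evolution, phase, decisions, insights, open threads); treat this as an illustrative shape, not the authoritative track schema:

```jsonl
{"turn": 42, "type": "closeout", "phase": "closeout", "focus": "Session closeout and track export", "action": "Exported the full JSONL and verified sequential turns before ending the session.", "evolution": "Shifted from implementation to review at turn 38.", "decisions": ["Adopted parameterized queries"], "insights": ["Coverage dip traced to skipped suites"], "open_threads": ["Migrate remaining handlers"]}
```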