Kernel and pods
Alpha is a thin orchestrator. Pods are Claude Code instances charged with an isolable mission. This page describes the spawn contract, the lifecycle, and the swarm guardrails.
Alpha: a thin orchestrator, not a worker
The Alpha pod is the Claude Code instance that receives the initial user prompt. Its role is not to do the work itself but to:
- Classify the intent through the router (on_user_prompt.py).
- Decompose if needed into a PROJ → JOB → TASK → SUB hierarchy.
- Spawn specialized worker pods (researcher, coder, devops, analyst).
- Observe their progress through the IPC bus and the entity feed.
- Synthesize the results without polluting its own context.
This separation exists for a simple reason: Alpha’s context window is precious. Loading 200 kB of scraping logs into Alpha to extract a single URL wastes a budget we want to reserve for decision-making and synthesis.
The pod: double recursivity
A pod is another Claude Code instance loaded with its own primitives. This creates what we call double recursivity: Claude Code orchestrating Claude Code.
Each pod has:
- its own context window (empty at start);
- its own CLAUDE.md (often a targeted subset);
- its own hooks (isolated lifecycle);
- its own tool / MCP server set;
- its own tmux for human observability.
The backend (Kubernetes pod, local tmux, VPS) is an implementation detail. The invariant is: one pod = one Claude Code instance = one mission.
When to spawn a pod
Spawning a pod has a cost: initialization, boot tokens, orchestration overhead. It is therefore not the default choice. Spawn if at least one of the conditions below is true:
- The scope is isolable — clear objective, defined output contract.
- The tool profile differs — the pod needs tools/MCP Alpha doesn’t use.
- The context window benefits from a reset — task independent of Alpha’s context.
- Parallelization — multiple independent tasks can run at the same time.
- Useful recursivity — the pod must itself orchestrate sub-tasks.
Conversely, do not spawn if:
- The task is still ambiguous — clarify first.
- The output contract is undefined (expected format, validation criteria).
- The orchestration cost exceeds the task complexity.
- Two pods would pull the same context — consolidate into one.
- The task is trivial (< 5 inline tools).
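The two lists above can be read as a decision function. The sketch below is one possible encoding, not the project's actual logic: `TaskProfile`, its field names, and `should_spawn` are hypothetical, the "< 5 inline tools" rule is interpreted as fewer than five tool calls, and checking the vetoes before the positive conditions is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass

# Threshold from the "trivial" rule, read here as a number of tool calls (assumption).
TRIVIAL_TOOL_CALLS = 5


@dataclass
class TaskProfile:
    """Hypothetical summary of a candidate task; every field name is illustrative."""

    isolable_scope: bool = False
    different_tool_profile: bool = False
    independent_of_alpha_context: bool = False
    parallelizable: bool = False
    needs_sub_orchestration: bool = False
    ambiguous: bool = False
    output_contract_defined: bool = True
    overlaps_sibling_context: bool = False
    orchestration_cost_exceeds_complexity: bool = False
    estimated_tool_calls: int = 0


def should_spawn(task: TaskProfile) -> tuple[bool, str]:
    """Return (decision, reason). Vetoes are evaluated before spawn conditions."""
    vetoes = [
        (task.ambiguous, "ambiguous: clarify first"),
        (not task.output_contract_defined, "output contract undefined"),
        (task.orchestration_cost_exceeds_complexity, "orchestration cost too high"),
        (task.overlaps_sibling_context, "overlaps a sibling: consolidate"),
        (task.estimated_tool_calls < TRIVIAL_TOOL_CALLS, "trivial: run inline"),
    ]
    for triggered, reason in vetoes:
        if triggered:
            return False, reason

    conditions = [
        (task.isolable_scope, "isolable scope"),
        (task.different_tool_profile, "different tool profile"),
        (task.independent_of_alpha_context, "context reset helps"),
        (task.parallelizable, "parallelization"),
        (task.needs_sub_orchestration, "useful recursivity"),
    ]
    for triggered, reason in conditions:
        if triggered:
            return True, reason
    return False, "no spawn condition met"
```

Returning the reason alongside the boolean keeps the decision auditable: the orchestrator can log why a task stayed inline.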
The spawn contract
Each pod must receive, at spawn time, seven elements:
| Field | Content |
|---|---|
| Mission | Task description in one to three sentences |
| Boundaries | Scope, constraints, what the pod must not do |
| Minimal context | Parent lineage + sibling results via context_builder |
| Tools / MCP | Set inferred from the agent type: researcher, coder, devops, analyst, orchestrator |
| Output contract | JSON format, bus channel, valid statuses |
| Failure policy | Retry count, escalation, timeout |
| Autonomy mode | Always execution — the pod never asks for permission |
A mission that cannot be formalized into these seven fields is not yet ready to be spawned. That’s a signal the orchestrator should clarify the request, not delegate its ambiguity.
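As an illustration, the seven fields could be carried by a small typed record that refuses incomplete contracts. This is a sketch under assumptions: `SpawnContract`, `validate`, and the keys inside `output_contract` and `failure_policy` are hypothetical names, not the project's real schema. Only the agent types and the fixed autonomy mode come from the table above.

```python
from __future__ import annotations

from dataclasses import dataclass, fields

# Agent types listed in the contract table.
AGENT_TYPES = {"researcher", "coder", "devops", "analyst", "orchestrator"}


@dataclass(frozen=True)
class SpawnContract:
    """Hypothetical container for the seven spawn-time fields."""

    mission: str            # one to three sentences
    boundaries: str         # scope, constraints, forbidden actions
    minimal_context: str    # parent lineage + sibling results
    agent_type: str         # determines the tool / MCP set
    output_contract: dict   # e.g. format, bus channel, statuses (keys are illustrative)
    failure_policy: dict    # e.g. retries, escalation, timeout (keys are illustrative)
    autonomy_mode: str = "execution"


def validate(contract: SpawnContract) -> list[str]:
    """Return a list of problems; an empty list means the contract is spawnable."""
    problems = [f"{f.name} is empty" for f in fields(contract)
                if not getattr(contract, f.name)]
    if contract.agent_type and contract.agent_type not in AGENT_TYPES:
        problems.append(f"unknown agent type: {contract.agent_type}")
    if contract.autonomy_mode != "execution":
        problems.append("autonomy mode must be 'execution'")
    return problems
```

A non-empty result from `validate` corresponds to the case described above: the request goes back for clarification instead of being delegated.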
The spawn primitive
The only canonical entry point to spawn a pod is the script spawn_pod_deterministic.sh. It enforces, in order:
- Semaphore acquisition (Redis): hard gate, exit 2 if full.
- Tmux spawn: dedicated window named after the pod.
- /remote-control hard-verify: 3 retries × 3 seconds; exit 1 and release semaphore on failure.
- created event published on the Redis stream nika:feed:entities, visible to the cg:worker consumer group.
No tmux send-keys "claude ..." is allowed outside this script. This discipline guarantees that all pods go through the same guardrails, are visible in the fleet, and clean up their resources properly.
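The ordering and exit codes of the spawn sequence can be modeled in a few lines. The real primitive is a shell script whose contents are not shown here, so the following Python is only a model of the control flow: every callable passed to `spawn_pod` is a hypothetical stand-in for the Redis, tmux, and /remote-control steps, and "3 retries × 3 seconds" is read as three attempts, each followed by a 3-second wait on failure.

```python
from __future__ import annotations

import time
from typing import Callable

EXIT_OK = 0
EXIT_VERIFY_FAILED = 1   # verification failed; semaphore slot released
EXIT_SEMAPHORE_FULL = 2  # hard gate: no slot available


def spawn_pod(
    pod_name: str,
    acquire: Callable[[str], bool],          # stand-in for the Redis semaphore
    release: Callable[[str], None],
    spawn_window: Callable[[str], None],     # stand-in for the tmux spawn
    verify: Callable[[str], bool],           # stand-in for the /remote-control check
    publish_created: Callable[[str], None],  # stand-in for the stream publish
    attempts: int = 3,
    delay_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the four steps in order and return the exit code."""
    if not acquire(pod_name):
        return EXIT_SEMAPHORE_FULL

    spawn_window(pod_name)

    for _ in range(attempts):
        if verify(pod_name):
            break
        sleep(delay_s)
    else:
        # No attempt succeeded: give the slot back before failing.
        release(pod_name)
        return EXIT_VERIFY_FAILED

    publish_created(pod_name)
    return EXIT_OK
```

Injecting the side effects as callables keeps the ordering testable without Redis or tmux; the `created` event is published only after verification succeeds, so a pod that never came up does not appear in the feed.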
Pod lifecycle
The standard pod lifecycle has five verbs:
spawn → invoke → readjust → observe → kill
- spawn: instance creation, SessionStart hooks boot, context injection.
- invoke: mission execution. Lifetime governed by the contract.
- readjust: on partial failure, mutate mutable primitives (never the kernel) and retry.
- observe: Alpha reads the Redis stream, the JSONL bus, and the YAML hierarchy state.
- kill: SubagentStop → publish pod_done + semaphore release + summary ingested to Qdrant.
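One way to make the five verbs concrete is a small transition table. The page only gives the linear chain, so the table below is an interpretation: it assumes readjust can loop back to invoke (the "retry"), and that invoke can go straight to observe when nothing fails. `Phase`, `ALLOWED`, and `advance` are hypothetical names.

```python
from enum import Enum


class Phase(Enum):
    SPAWN = "spawn"
    INVOKE = "invoke"
    READJUST = "readjust"
    OBSERVE = "observe"
    KILL = "kill"


# Assumed transitions; only the linear chain is stated in the text.
ALLOWED = {
    Phase.SPAWN: {Phase.INVOKE},
    Phase.INVOKE: {Phase.READJUST, Phase.OBSERVE},
    Phase.READJUST: {Phase.INVOKE, Phase.OBSERVE},
    Phase.OBSERVE: {Phase.KILL},
    Phase.KILL: set(),
}


def advance(current: Phase, target: Phase) -> Phase:
    """Return the target phase, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Under this reading, kill is terminal: a finished pod is never re-invoked, and a new mission requires a new spawn.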
Swarm guardrails
- Max concurrency: 6 simultaneous pods (Redis semaphore).
- Max recursion depth: 4 levels (PROJ → JOB → TASK → SUB).
- TTL backstop: 1-hour auto-release if a pod hangs without heartbeat.
- Stale pod policy: task_watcher.py scans every 2 minutes.
- Handoff audit: pod_handoff.audit_and_decide() produces a verdict DONE / PARTIAL / STUCK / DORMANT at every pod end.
These limits are not suggestions. They exist because an unbounded swarm ends up saturating its own resources before the operator notices.
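To show how the concurrency, depth, and TTL limits interact, here is an in-memory stand-in for the semaphore. It is a sketch, not the Redis implementation: `PodSlots` and its methods are hypothetical, depth is assumed to be counted from 1 (PROJ) to 4 (SUB), and a slot is treated as stale once an hour has passed since its last heartbeat.

```python
from __future__ import annotations

import time
from typing import Callable

MAX_PODS = 6          # max simultaneous pods
MAX_DEPTH = 4         # PROJ -> JOB -> TASK -> SUB
TTL_SECONDS = 3600.0  # 1-hour backstop without heartbeat


class PodSlots:
    """In-memory model of a bounded semaphore with TTL-based auto-release."""

    def __init__(self, limit: int = MAX_PODS, ttl: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._ttl = ttl
        self._clock = clock
        self._last_seen: dict[str, float] = {}

    def _expire_stale(self) -> None:
        """Drop slots whose last heartbeat is older than the TTL."""
        now = self._clock()
        for pod, seen in list(self._last_seen.items()):
            if now - seen >= self._ttl:
                del self._last_seen[pod]

    def acquire(self, pod: str, depth: int) -> bool:
        """Reserve a slot; refuse if too deep, already held, or the swarm is full."""
        if depth > MAX_DEPTH:
            return False
        self._expire_stale()
        if pod in self._last_seen or len(self._last_seen) >= self._limit:
            return False
        self._last_seen[pod] = self._clock()
        return True

    def heartbeat(self, pod: str) -> None:
        """Refresh the TTL of a live pod; unknown pods are ignored."""
        if pod in self._last_seen:
            self._last_seen[pod] = self._clock()

    def release(self, pod: str) -> None:
        """Free the slot explicitly, as the kill step would."""
        self._last_seen.pop(pod, None)
```

The TTL only acts as a backstop: in the normal path the slot is released explicitly at kill, and expiry covers pods that hang without ever reaching that step.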