Kernel and pods

The kernel is a thin orchestrator. Pods are CLI-agent instances, each charged with one isolable mission. This page describes the spawn contract, the pod lifecycle, and the swarm guardrails.

Updated · 8 July 2026

The kernel: a thin orchestrator, not a worker

The kernel agent is the CLI-agent instance that receives the initial user prompt. Its role is not to do the work itself but to:

Classify the intent through the router (on_user_prompt.py).
Decompose if needed into a PROJ → JOB → TASK → SUB hierarchy.
Spawn specialized worker pods (researcher, coder, devops, analyst).
Observe their progress through the IPC bus and the entity feed.
Synthesize the results without polluting its own context.

This separation exists for a simple reason: the kernel’s context window is precious. Loading 200 kB of scraping logs into the kernel to extract a single URL wastes a budget we want to reserve for decision-making and synthesis.

The pod: double recursivity

A pod is another CLI-agent instance loaded with its own primitives. This creates what we call double recursivity: a CLI agent orchestrating other instances of itself.

Each pod has:

its own context window (empty at start);
its own kernel instruction file (often a targeted subset);
its own hooks (isolated lifecycle);
its own tool / MCP server set;
its own tmux for human observability.

The backend (Kubernetes pod, local tmux, VPS) is an implementation detail. The invariant is: one pod = one CLI-agent instance = one mission.

When to spawn a pod

Spawning a pod has a cost: initialization, boot tokens, orchestration overhead. Spawning is therefore not the default. Spawn only if at least one of the conditions below is true:

The scope is isolable — clear objective, defined output contract.
The tool profile differs — the pod needs tools/MCP the kernel doesn’t use.
The context window benefits from a reset — task independent of the kernel’s context.
Parallelization — multiple independent tasks can run at the same time.
Useful recursivity — the pod must itself orchestrate sub-tasks.

Conversely, do not spawn if:

The task is still ambiguous — clarify first.
The output contract is undefined (expected format, validation criteria).
The orchestration cost exceeds the task complexity.
Two pods would pull the same context — consolidate into one.
The task is trivial (fewer than 5 inline tool calls).
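The two lists above can be read as a decision function: vetoes first, then at least one positive criterion. The following is a minimal sketch under that reading, not the project's actual router logic. `TaskProfile`, its field names, and `should_spawn` are hypothetical, and the "orchestration cost exceeds the task complexity" veto is approximated here by the triviality threshold alone.

```python
from __future__ import annotations

from dataclasses import dataclass

TRIVIAL_TOOL_CALLS = 5  # below this, the task is treated as trivial


@dataclass
class TaskProfile:
    """Hypothetical summary of a candidate task; every field name is illustrative."""
    # Vetoes: any one of them blocks the spawn.
    ambiguous: bool = False
    output_contract_defined: bool = True
    shares_context_with_sibling: bool = False
    estimated_tool_calls: int = 0
    # Positive criteria: any one of them is enough to justify a pod.
    isolable_scope: bool = False
    different_tool_profile: bool = False
    benefits_from_context_reset: bool = False
    parallelizable: bool = False
    needs_sub_orchestration: bool = False


POSITIVE_CRITERIA = (
    "isolable_scope",
    "different_tool_profile",
    "benefits_from_context_reset",
    "parallelizable",
    "needs_sub_orchestration",
)


def should_spawn(task: TaskProfile) -> tuple[bool, str]:
    """Return (decision, reason). Vetoes are checked before positive criteria."""
    if task.ambiguous:
        return False, "clarify first"
    if not task.output_contract_defined:
        return False, "define the output contract"
    if task.shares_context_with_sibling:
        return False, "consolidate into one pod"
    if task.estimated_tool_calls < TRIVIAL_TOOL_CALLS:
        return False, "trivial: run inline"
    reasons = [name for name in POSITIVE_CRITERIA if getattr(task, name)]
    if not reasons:
        return False, "no spawn criterion met"
    return True, ", ".join(reasons)
```

Checking vetoes first mirrors the text: an ambiguous task is never spawned, even when its scope looks isolable.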

The spawn contract

Each pod must receive, at spawn time, seven elements:

Field	Content
Mission	Task description in one to three sentences
Boundaries	Scope, constraints, what the pod must not do
Minimal context	Parent lineage + sibling results via context_builder
Tools / MCP	Set inferred from the agent type: researcher, coder, devops, analyst, orchestrator
Output contract	JSON format, bus channel, valid statuses
Failure policy	Retry count, escalation, timeout
Autonomy mode	Always `execution` — the pod never asks for permission

A mission that cannot be formalized into these seven fields is not yet ready to be spawned. That’s a signal the orchestrator should clarify the request, not delegate its ambiguity.
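To make the "seven fields or not ready" rule concrete, here is a minimal sketch of the contract as a typed record with a validation pass. The class name, attribute names, and the dictionary keys used in the test values are assumptions made for illustration. Only the seven fields, the five agent types, and the fixed `execution` mode come from the table above.

```python
from __future__ import annotations

from dataclasses import dataclass, fields

AGENT_TYPES = {"researcher", "coder", "devops", "analyst", "orchestrator"}


@dataclass(frozen=True)
class SpawnContract:
    """Hypothetical record mirroring the seven fields of the spawn contract."""
    mission: str            # one to three sentences
    boundaries: str         # scope, constraints, what the pod must not do
    minimal_context: str    # parent lineage + sibling results
    agent_type: str         # determines the tool / MCP set
    output_contract: dict   # e.g. format, bus channel, valid statuses
    failure_policy: dict    # e.g. retry count, escalation, timeout
    autonomy_mode: str = "execution"


def validate(contract: SpawnContract) -> list[str]:
    """Return a list of problems; an empty list means the mission is spawnable."""
    problems = [
        f"{f.name} is empty" for f in fields(contract) if not getattr(contract, f.name)
    ]
    if contract.agent_type and contract.agent_type not in AGENT_TYPES:
        problems.append(f"unknown agent type: {contract.agent_type}")
    if contract.autonomy_mode != "execution":
        problems.append("autonomy mode must be 'execution'")
    return problems
```

A non-empty result is the signal described above: the orchestrator clarifies the request instead of spawning.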

DAG-aware decomposition: target latency is the critical path

The PROJ → JOB → TASK decomposition is not just a tree: each unit declares its input dependencies (which other units’ outputs it needs). From these declarations the orchestrator derives a DAG (directed acyclic graph) that dictates the execution plan:

Units with no mutual dependency run in parallel (scatter-gather): their execution times overlap instead of adding up.
Only a true data dependency forces serial execution. “It’s simpler to do one after the other” is not a dependency.

A project’s target latency is therefore the length of its critical path, never the sum of its tasks. An orchestrator that serializes out of convenience turns a one-hour project into a one-day project — with exactly the same work.
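The difference between critical path and sum of tasks can be checked on a small example. The sketch below assumes each unit declares a duration and a list of input dependencies. The function name and the sample units are hypothetical. It relies on the standard-library `graphlib.TopologicalSorter` (Python 3.9+), which raises `CycleError` if the declarations do not form a DAG.

```python
from __future__ import annotations

from graphlib import TopologicalSorter


def critical_path_length(durations: dict[str, float], deps: dict[str, list[str]]) -> float:
    """Length of the longest dependency chain, in the unit used by `durations`.

    `deps[u]` lists the units whose output `u` needs before it can start.
    """
    graph = {unit: deps.get(unit, ()) for unit in durations}
    finish: dict[str, float] = {}
    # static_order() yields every unit after all of its dependencies.
    for unit in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[unit]), default=0.0)
        finish[unit] = start + durations[unit]
    return max(finish.values(), default=0.0)


# Three independent fetches feeding one merge step (durations in minutes).
durations = {"fetch_a": 20, "fetch_b": 30, "fetch_c": 25, "merge": 10}
deps = {"merge": ["fetch_a", "fetch_b", "fetch_c"]}

parallel_latency = critical_path_length(durations, deps)  # longest fetch + merge
serial_latency = sum(durations.values())                  # everything in series
```

With these sample numbers the critical path is 40 minutes (the 30-minute fetch plus the 10-minute merge), against 85 minutes when the same work is serialized.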

Auto-restart on DoD: autonomy that respects states

Every level of the hierarchy carries an explicit Definition of Done. An autonomy driver periodically walks the hierarchy and relaunches work whose DoD is not met — this is what lets a multi-day project survive session endings, context compactions, and restarts, without a human having to say “continue”.

Two guardrails make this relaunch safe:

The blocked / paused states are respected. A job suspended by the operator, or blocked awaiting a human decision, is never relaunched automatically. Resumption belongs to whoever set the block. Without this rule, autonomy produces pointless resumes that redo the same work in a loop.
Relaunch goes through the meta-curator’s claim verification: a job marked “done” without evidence is demoted and relaunched; a genuinely finished job is not re-executed.
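One pass of such a driver, with both guardrails applied, might look like the sketch below. It is an illustration, not the actual driver: the `State` enum, the `Unit` record, and the use of a non-empty `evidence` list as a stand-in for the meta-curator's claim verification are all assumptions. Units already running are left alone here, which is also an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    BLOCKED = "blocked"
    PAUSED = "paused"
    DONE = "done"


@dataclass
class Unit:
    """Hypothetical hierarchy node (PROJ, JOB, TASK or SUB)."""
    name: str
    state: State
    evidence: list[str] = field(default_factory=list)  # artifacts backing a "done" claim


def units_to_relaunch(units: list[Unit]) -> list[str]:
    """Return the names of units the driver should relaunch on this pass."""
    relaunch = []
    for unit in units:
        # Guardrail 1: blocked and paused units belong to whoever set the block.
        if unit.state in (State.BLOCKED, State.PAUSED):
            continue
        # Guardrail 2: a "done" claim without evidence is demoted.
        if unit.state is State.DONE and not unit.evidence:
            unit.state = State.PENDING
        if unit.state is State.PENDING:
            relaunch.append(unit.name)
    return relaunch
```

A unit that is done with evidence falls through both checks and is never re-executed.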

The spawn primitive

The only canonical entry point to spawn a pod is the script spawn_pod_deterministic.sh. It enforces four steps, in order:

Semaphore acquisition (Redis) — hard gate, exit 2 if full.
Tmux spawn — dedicated window named after the pod.
Readiness hard-verify — 3 retries × 3 seconds; exit 1 and release the semaphore on failure.
A created event is published on the Redis stream nika:feed:entities, visible to the cg:worker consumer group.

No tmux send-keys "<agentic-cli> ..." is allowed outside this script. This discipline guarantees that all pods go through the same guardrails, are visible in the fleet, and clean up their resources properly.
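The ordering and exit codes of the four steps can be modeled in a few lines. This is a Python sketch of the control flow only, not a transcription of spawn_pod_deterministic.sh: the Redis semaphore is replaced by an in-memory stand-in, the tmux spawn and readiness probe are injected callables, and the event payload shape is an assumption. The stream name, the retry count, the delay, and the exit codes come from the list above.

```python
from __future__ import annotations

import time
from typing import Callable

EXIT_OK, EXIT_NOT_READY, EXIT_SEMAPHORE_FULL = 0, 1, 2
FEED_STREAM = "nika:feed:entities"


class InMemorySemaphore:
    """Stand-in for the Redis semaphore, for illustration only."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.holders: set[str] = set()

    def acquire(self, pod: str) -> bool:
        if len(self.holders) >= self.capacity:
            return False
        self.holders.add(pod)
        return True

    def release(self, pod: str) -> None:
        self.holders.discard(pod)


def spawn_pod(
    pod: str,
    semaphore: InMemorySemaphore,
    start_window: Callable[[str], None],   # e.g. opens the tmux window
    is_ready: Callable[[str], bool],       # readiness probe
    publish: Callable[[str, dict], None],  # e.g. appends to the Redis stream
    retries: int = 3,
    delay_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    # 1. Hard gate: no slot, no pod.
    if not semaphore.acquire(pod):
        return EXIT_SEMAPHORE_FULL
    # 2. Dedicated window named after the pod.
    start_window(pod)
    # 3. Readiness hard-verify.
    for _ in range(retries):
        if is_ready(pod):
            # 4. Announce the pod to the fleet.
            publish(FEED_STREAM, {"event": "created", "pod": pod})
            return EXIT_OK
        sleep(delay_s)
    # Never keep a slot for a pod that did not come up.
    semaphore.release(pod)
    return EXIT_NOT_READY
```

Injecting `sleep` keeps the sketch testable without waiting nine seconds. Releasing the semaphore on the failure path is the property that matters most: a failed spawn must not consume a slot.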

Pod lifecycle

The standard pod lifecycle has five verbs:

spawn → invoke → readjust → observe → kill

spawn: instance creation, SessionStart hooks boot, context injection.
invoke: mission execution. Lifetime governed by the contract.
readjust: on partial failure, mutate mutable primitives (never the kernel) and retry.
observe: the kernel reads the Redis stream, the JSONL bus, and the YAML hierarchy state.
kill: on SubagentStop, pod_done is published, the semaphore is released, and the pod summary is ingested into Qdrant.

Swarm guardrails

Max concurrency: 6 simultaneous pods (Redis semaphore).
Max recursion depth: 4 levels (PROJ → JOB → TASK → SUB).
TTL backstop: 1-hour auto-release if a pod hangs without heartbeat.
Stale pod policy: task_watcher.py scans every 2 minutes.
Handoff audit: pod_handoff.audit_and_decide() produces a verdict DONE / PARTIAL / STUCK / DORMANT at every pod end.

These limits are not suggestions. They exist because an unbounded swarm ends up saturating its own resources before the operator notices.
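As an illustration, the numeric limits above can be gathered in one immutable record and checked before every spawn. The names `SwarmLimits`, `can_spawn`, and `is_expired` are hypothetical, and depth is counted from 1 (PROJ) to 4 (SUB) by assumption. Only the values 6, 4, 1 hour, and 2 minutes come from the list.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SwarmLimits:
    max_concurrency: int = 6          # simultaneous pods
    max_depth: int = 4                # PROJ=1, JOB=2, TASK=3, SUB=4
    ttl_seconds: int = 3600           # auto-release after 1 hour without heartbeat
    scan_interval_seconds: int = 120  # stale-pod scan period


def can_spawn(active_pods: int, child_depth: int, limits: SwarmLimits = SwarmLimits()) -> tuple[bool, str]:
    """Check the concurrency and recursion limits before spawning a child pod."""
    if active_pods >= limits.max_concurrency:
        return False, "concurrency limit reached"
    if child_depth > limits.max_depth:
        return False, "recursion depth exceeded"
    return True, "ok"


def is_expired(last_heartbeat: float, now: float, limits: SwarmLimits = SwarmLimits()) -> bool:
    """True when a pod has been silent for longer than the TTL backstop."""
    return now - last_heartbeat > limits.ttl_seconds
```

Freezing the dataclass reflects the closing point: the limits are constants of the system, not knobs a pod can adjust at runtime.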