Observability and controllers

How Nika OS observes its own pods without polluting its context, and the online controller that pilots hyperparameters from observed signals.

Observability as a primitive

A non-observable agentic system is a system you cannot pilot. Nika OS therefore exposes observability as a first-class primitive, not as an add-on.

Three signal sources coexist:

The Redis stream per pod — every tool call produces an XADD.
The JSONL bus — append-only, permanent audit trail.
The fleet state — Redis dictionary nika:fleet:state maintained by the fleet_consumer.py daemon.
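
As a rough illustration of how one tool call could feed the first two sources, the sketch below builds a single event and writes it both to a per-pod stream and to a JSONL file. The key pattern `nika:pod:<id>:stream`, the field names, and the `record_tool_call` helper are assumptions made for this example, not the actual Nika OS schema. `client` is any object exposing an `xadd(name, fields)` method, such as a redis-py client.

```python
import json
import time
from pathlib import Path


def record_tool_call(client, bus_path, pod_id, tool, status):
    """Emit one tool-call event to a per-pod stream and to the JSONL bus."""
    event = {
        "pod": pod_id,
        "tool": tool,
        "status": status,
        "ts": f"{time.time():.3f}",
    }
    # Hypothetical key pattern: the real stream name is not given in the text.
    client.xadd(f"nika:pod:{pod_id}:stream", event)
    # Append-only: the file is opened in "a" mode and never rewritten.
    with Path(bus_path).open("a", encoding="utf-8") as bus:
        bus.write(json.dumps(event, sort_keys=True) + "\n")
    return event
```

Passing the client in as a parameter keeps the helper testable with a stub and avoids tying the sketch to a particular connection setup.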

The fleet daemon

fleet_consumer.py runs permanently in a dedicated tmux (nika-os:FLEET-CONSUMER). It:

consumes the cg:worker stream of the entity feed;
maintains an aggregated state per pod (status, last tool, error count, loss score);
exposes this state via nika:fleet:state (Redis HASH read by fleet_status.py).
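
The aggregation step can be pictured as a fold over incoming events. The following is a minimal sketch, assuming each event is a dictionary with a `pod` key and optional `status`, `tool`, `error`, and `loss` keys; these field names and both helper functions are hypothetical and not taken from fleet_consumer.py.

```python
import json


def apply_event(fleet, event):
    """Fold one feed event into the per-pod aggregate and return that aggregate."""
    pod = fleet.setdefault(
        event["pod"],
        {"status": "unknown", "last_tool": None, "errors": 0, "loss": 0.0},
    )
    pod["status"] = event.get("status", pod["status"])
    pod["last_tool"] = event.get("tool", pod["last_tool"])
    if event.get("error"):
        pod["errors"] += 1
    if "loss" in event:
        pod["loss"] = float(event["loss"])
    return pod


def to_hash_fields(fleet):
    """Serialize the aggregate as {pod_id: JSON string}, one hash field per pod."""
    return {pod_id: json.dumps(state, sort_keys=True) for pod_id, state in fleet.items()}
```

With redis-py, the resulting mapping could then be written in one call such as `client.hset("nika:fleet:state", mapping=to_hash_fields(fleet))`. Storing one JSON value per pod is an assumption about the hash layout.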

Without this daemon, pods are blind to each other. Before any action likely to cause an inter-pod conflict (shared file edit, deployment, schema change), a pod must call:

python3 scripts/fleet_status.py --active

This command returns the list of active pods within milliseconds, along with each pod’s entity, last action, and loss score. Nothing has to be read into the context window.
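
To make the pre-action check concrete, here is a sketch of the filtering such a command might perform on the fleet state. It assumes the hash has already been fetched as a `{pod_id: JSON string}` dictionary (for example with redis-py `hgetall` on a client created with `decode_responses=True`) and that each value carries `status` and `entity` fields. That layout and both function names are assumptions, not the documented behavior of fleet_status.py.

```python
import json


def active_pods(raw_state):
    """Decode {pod_id: JSON string} and keep only pods whose status is 'active'."""
    pods = {pod_id: json.loads(blob) for pod_id, blob in raw_state.items()}
    return {pod_id: s for pod_id, s in pods.items() if s.get("status") == "active"}


def conflicts(raw_state, my_pod, entity):
    """Return the other active pods working on the same entity."""
    return sorted(
        pod_id
        for pod_id, s in active_pods(raw_state).items()
        if pod_id != my_pod and s.get("entity") == entity
    )
```

A pod would proceed with a shared edit only when `conflicts(...)` comes back empty.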

Pod streaming without context pollution

A classic mistake in multi-agent orchestration is to ask the orchestrator to read each pod’s transcript to understand what the pod is doing. This fills the orchestrator’s context window with low-priority content (detailed logs) and saturates it quickly.

Nika OS draws the following separation:

Significant events (creation, completion, failure, escalation) reach the kernel through the entity feed.
Detailed logs stay local on the pod (logs/) and in Redis Streams. The kernel reads them on demand, never by default.
A dedicated skill (pod-observe) lets the kernel do a semantic search on a specific pod’s tool calls, without loading its transcript.
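
The separation above amounts to a routing rule applied to every event. A minimal sketch follows, assuming each event carries a `kind` field whose value names the event type. The field name and the two destination labels are illustrative assumptions.

```python
# Event kinds that the text lists as significant enough to reach the kernel.
SIGNIFICANT = {"creation", "completion", "failure", "escalation"}


def route(event):
    """Return 'feed' for kernel-visible events and 'local' for detailed logs."""
    return "feed" if event.get("kind") in SIGNIFICANT else "local"


def split_events(events):
    """Partition events into (kernel feed, local logs), preserving order."""
    feed = [e for e in events if route(e) == "feed"]
    local = [e for e in events if route(e) == "local"]
    return feed, local
```

Only the first list would ever enter the kernel's context by default; the second stays available for on-demand lookup.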

This discipline lets the kernel coexist with 6 active worker pods in parallel without saturating its own context window.

The online controller

Observing is not enough: you have to act on what you observe. For hyperparameters that have a reward measurable at decision time (model temperature, pod retry count, concurrency, number of RAG chunks), Nika OS uses an online controller.

At each measurement, the controller combines three ingredients:

an estimate of whether the observed deviation is noise or a real drift;
the cost of the deviation from target — a loss that grows non-linearly with distance, so corrections can be prioritized: a close gap costs little, a distant gap costs a lot;
an allocation of the intervention that grows with confidence — intervene little when the signal is uncertain, firmly when it is clear.
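
The three ingredients can be sketched numerically. The choices below are assumptions made for illustration, since the text does not specify them: Gaussian measurement noise with a known standard deviation (`noise_std`, which must be positive), a quadratic loss as the non-linear cost, and an allocation that is simply proportional to the drift probability.

```python
import math


def drift_probability(y, target, noise_std):
    """1 minus the two-sided Gaussian p-value of the deviation (assumed noise model)."""
    z = abs(y - target) / noise_std
    return math.erf(z / math.sqrt(2))


def deviation_cost(y, target, scale=1.0):
    """Quadratic loss: doubling the distance quadruples the cost."""
    return ((y - target) / scale) ** 2


def allocation(p_drift, max_gain=1.0):
    """Fraction of the gap to correct, proportional to confidence."""
    return max_gain * p_drift


def correction(y, target, noise_std, max_gain=1.0):
    """Signed adjustment that moves the measurement back toward the target."""
    return -allocation(drift_probability(y, target, noise_std), max_gain) * (y - target)
```

Under these assumptions, a reading one standard deviation from target is corrected by roughly two thirds of the gap, while a reading three standard deviations away is corrected almost in full.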

flowchart LR
    M["Measurement y(t)"] --> W["Noise or drift?<br/>P(real drift)"]
    M --> T["Cost of deviation<br/>grows with distance"]
    W --> K["Allocation<br/>∝ confidence"]
    K --> D{"Decision"}
    T --> D
    D -->|"low P"| N["Noise<br/>do nothing"]
    D -->|"intermediate P"| S["Doubt<br/>soft correction<br/>+ wait for confirmation"]
    D -->|"high P + strong cost"| F["Drift<br/>firm correction<br/>+ immediate"]

    classDef input fill:#F5F1E8,color:#2C3E42,stroke:#7DB5A5,stroke-width:2px;
    classDef calc fill:#F5F1E8,color:#2C3E42,stroke:#A86640,stroke-width:1.5px;
    classDef decide fill:#2C3E42,color:#F5F1E8,stroke:#1A262A,stroke-width:2px;
    classDef noop fill:#7DB5A5,color:#F5F1E8,stroke:#5E9384;
    classDef soft fill:#E99971,color:#F5F1E8,stroke:#C97A55;
    classDef firm fill:#C97A55,color:#F5F1E8,stroke:#A86640;
    class M input;
    class W,T,K calc;
    class D decide;
    class N noop;
    class S soft;
    class F firm;

Three distinct zones emerge:

Noise (low drift probability) — do nothing.
Doubt (intermediate zone) — soft correction, wait for confirmation.
Drift (high probability and strong cost) — firm and immediate correction.
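
The three zones can be expressed as a small classifier over the drift probability and the deviation cost. The threshold values below (`p_low`, `p_high`, `cost_high`) are placeholders chosen for the example; the text does not give the real ones. A high probability with a low cost falls into the doubt zone here, which is one reading of the diagram.

```python
from enum import Enum


class Zone(Enum):
    NOISE = "do nothing"
    DOUBT = "soft correction, wait for confirmation"
    DRIFT = "firm, immediate correction"


def classify(p_drift, cost, p_low=0.5, p_high=0.95, cost_high=1.0):
    """Map drift probability and deviation cost to one of the three zones."""
    if p_drift < p_low:
        return Zone.NOISE
    if p_drift >= p_high and cost >= cost_high:
        return Zone.DRIFT
    return Zone.DOUBT
```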

This behavior makes the system antifragile in the Taleb sense: instead of oscillating with every noisy reading, it waits until it has enough statistical evidence before intervening, and it learns from each observed drift.

Application to agent orchestration

The online controller is used to pilot pods. The observed metric is a loss score computed on the pod’s tool calls and outputs.

When a pod’s loss crosses a threshold, the system responds in stages:

At first, it does not kill the pod (the equivalent of “do nothing” in the noise zone).
If the loss keeps growing past the intermediate threshold, it mutates the mutable primitives (temperature, retry policy, prompts).
If the loss stays high after mutation, it escalates: WhatsApp alert → JSONL bus → review request to the human operator.
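
The staged response can be read as a small decision ladder. In the sketch below, the two threshold values and the action labels are hypothetical, and "stays high after mutation" is interpreted as "still above the intermediate threshold once a mutation has been applied".

```python
def next_action(loss, already_mutated, first_threshold=0.5, intermediate_threshold=0.8):
    """Return the step of the ladder that applies to the latest loss reading."""
    if loss < first_threshold:
        return "continue"
    if loss < intermediate_threshold:
        return "watch"  # first threshold crossed: keep the pod, keep observing
    if not already_mutated:
        return "mutate"  # adjust temperature, retry policy, prompts
    return "escalate"  # alert, bus entry, human review request
```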

The pod alerting system (pod-alerts skill) reads the nika:kernel:alerts stream filled by the pod stream embedder. At the beginning of a triage session, the operator can ask “any alerts?” and receive the list of pods in trouble with their observed loss.

Going further

The online controller is only one of the system’s three evolution mechanisms. The Meta-curator and evolution page describes how the harness — prompts, skills, hooks, tools and success measures — evolves per task type, and how continuous supervision and nightly consolidation complement each other.

Judging a deliverable: the difficulty of “Definition of Done”

Observing a pod is easy (status, latency, exit code). Judging whether it really accomplished its task is significantly harder. This difficulty is at the heart of Nika OS R&D.

Why DoD is tricky

For a classic job (build, tests, deployment), DoD is binary: exit 0 or not, green tests or not. For an agentic job, DoD is multi-dimensional:

Did the pod produce the right files (consistency with the brief)?
Do the files have the expected quality (readability, completeness)?
Are the business invariants respected (no PII, no false claims, no irreversible commitment)?
Does the style match (professional tone, consistent wording)?
Were there undocumented side effects (unlisted commit, file modification out of scope)?

None of these criteria is observable through a simple exit code. All require a semantic judgment.

LLM-as-judge — the evaluation layer

Our pod_handoff primitive applies a structured judgment when each pod finishes (SubagentStop hook). The verdict is one of:

Verdict	Meaning	Action
DONE	DoD met, deliverables produced, invariants respected	kill (no-op)
PARTIAL	Part of the deliverables missing or with minor defect	relaunch with enriched brief
STUCK	Pod stayed stuck without producing	relaunch retry_count++
DORMANT	Long-duration STUCK without transcript or session — multi-day project, not an error	archive without escalation
NO_DOD	Brief without explicit Definition of Done	human escalation

Why an LLM (and not only rules)

Several criteria are implemented as deterministic rules (regex, file counting, frontmatter reading). Others require a language model able to understand the brief, compare it against the outputs, and emit a nuanced score. This layer is necessarily subjective. To reduce bias:

Cross-CLI judge — the same output is judged by 2–3 different CLI agents (e.g. Codex, Gemini CLI, or any MCP-compatible agentic CLI); we take the median or the consensus.
Algorithmic check before LLM — we start with deterministic checks (factuality against RAG, JSON schema, length, etc.); the LLM is called only on the residual uncertainty.
Mandatory audit log — every LLM verdict is traced in JSONL with the prompt, the output, the estimated confidence, and the human verdict a posteriori when there is one. The GEPA tournament uses these gaps to evolve the judgment prompt itself.
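
The three safeguards can be combined in a few lines. The sketch below makes several assumptions: each deterministic check is a function returning a boolean, each judge is a callable returning a score between 0 and 1 (standing in for a real CLI agent), a failed rule rejects the output without calling any judge, and the audit record uses illustrative field names.

```python
import json
import statistics


def judge(output, deterministic_checks, llm_judges):
    """Run rule checks first; call the LLM judges only if every rule passes."""
    failed = [name for name, check in deterministic_checks.items() if not check(output)]
    if failed:
        return {"score": 0.0, "source": "rules", "failed": failed, "scores": []}
    scores = [judge_fn(output) for judge_fn in llm_judges]
    return {"score": statistics.median(scores), "source": "llm", "failed": [], "scores": scores}


def audit_line(prompt, output, result, confidence=None, human_verdict=None):
    """One JSONL record per verdict; human_verdict is filled in later if available."""
    record = {
        "prompt": prompt,
        "output": output,
        "confidence": confidence,
        "human_verdict": human_verdict,
        **result,
    }
    return json.dumps(record, sort_keys=True)
```

Taking the median rather than the mean means a single outlying judge cannot move the final score on its own.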

Exploration as first-class citizen

Direct consequence: you cannot optimize what you cannot measure. Nika OS therefore accepts an irreducible share of exploration:

Several CLIs answer the same prompt in parallel (multi-pod tournament) to generate output diversity.
A deterministic meta-pod combines the outputs through classic algorithms (consensus voting, outlier detection, factual check) before engaging the LLM-as-judge.
The controller weights the judgment confidence: if dispersion is high (low agreement among judges), we decrease the allocation and intervene less strongly. If agreement is high, we act with confidence.
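
One way to turn judge dispersion into a confidence weight is sketched below. It assumes scores lie between 0 and 1, so the population standard deviation cannot exceed 0.5, and it scales the controller's allocation linearly with agreement. Both the dispersion measure and the linear scaling are illustrative choices, not the documented mechanism.

```python
import statistics


def agreement(scores, max_spread=0.5):
    """1.0 when judges agree exactly, 0.0 once their spread reaches max_spread."""
    if len(scores) < 2:
        return 0.0  # a single judge gives no evidence of agreement
    spread = statistics.pstdev(scores)
    return max(0.0, 1.0 - spread / max_spread)


def weighted_allocation(base_allocation, scores, max_spread=0.5):
    """Shrink the controller's intervention when judges disagree."""
    return base_allocation * agreement(scores, max_spread)
```

For example, judges scoring 0.7 and 0.8 leave most of the allocation intact, whereas judges scoring 0.2 and 0.9 cut it sharply.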

This philosophy — benefit from variability rather than fight it — is the operational expression of the antifragility doctrine described above. An individual pod can be wrong; a pod swarm coupled to a multi-perspective evaluator is harder to fool.