Observability and controllers
How Nika OS observes its own pods without polluting its context. The KTW (Kelly–Taguchi–Weibull) controller at the heart of probabilistic piloting.
Observability as a primitive
A non-observable agentic system is a system you cannot pilot. Nika OS therefore exposes observability as a first-class primitive, not as an add-on.
Three signal sources coexist:
- The Redis stream per pod: every tool call produces an XADD.
- The JSONL bus: append-only, permanent audit trail.
- The fleet state: Redis dictionary nika:fleet:state, maintained by the fleet_consumer.py daemon.
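As an illustration of the second source, here is a minimal sketch of an append-only JSONL trail. The file layout and the field names (ts, pod, tool, status) are assumptions made for the example; the actual Nika OS bus schema is not described on this page.

```python
import json
import os
import tempfile
import time


def append_event(path, pod, tool, status="ok"):
    """Append one tool-call event as a single JSON line (existing lines are never rewritten)."""
    event = {"ts": time.time(), "pod": pod, "tool": tool, "status": status}
    # Mode "a" only ever adds to the end of the file: the trail stays append-only.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return event


def read_events(path):
    """Replay the audit trail in write order."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    bus = os.path.join(tempfile.mkdtemp(), "bus.jsonl")  # hypothetical location
    append_event(bus, pod="pod-1", tool="Edit")
    append_event(bus, pod="pod-2", tool="Bash", status="error")
    for event in read_events(bus):
        print(event["pod"], event["tool"], event["status"])
```

One JSON object per line keeps each write atomic at the line level and lets any consumer replay the trail with a plain line reader.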
The fleet daemon
fleet_consumer.py runs permanently in a dedicated tmux (nika-os:FLEET-CONSUMER). It:
- consumes the cg:worker stream of the entity feed;
- maintains an aggregated state per pod (status, last tool, error count, KTW loss);
- exposes this state via nika:fleet:state (Redis HASH read by fleet_status.py).
Without this daemon, pods are blind to each other. Before any action likely to cause an inter-pod conflict (shared file edit, deployment, schema change), a pod must call:
python3 scripts/fleet_status.py --active
This command returns, within milliseconds, the list of active pods with their entity, last action, and loss score. No context-window read is required.
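The aggregation the daemon performs can be sketched in memory as follows. This is not the fleet_consumer.py source: a plain dictionary stands in for the nika:fleet:state hash, and the event fields (pod, entity, status, tool, loss) are hypothetical names chosen for the example.

```python
from collections import defaultdict


def new_pod_state():
    """Empty aggregate for a pod seen for the first time (hypothetical fields)."""
    return {"status": "unknown", "entity": None, "last_tool": None,
            "error_count": 0, "ktw_loss": 0.0}


def apply_event(fleet, event):
    """Fold one feed event into the aggregate of the pod that emitted it."""
    state = fleet[event["pod"]]
    for key in ("status", "entity"):
        if key in event:
            state[key] = event[key]
    if "tool" in event:
        state["last_tool"] = event["tool"]
    if event.get("status") == "error":
        state["error_count"] += 1
    if "loss" in event:
        state["ktw_loss"] = event["loss"]
    return state


def active_pods(fleet):
    """Names of pods whose latest status is 'active', in sorted order."""
    return sorted(name for name, s in fleet.items() if s["status"] == "active")


fleet = defaultdict(new_pod_state)
for event in [
    {"pod": "pod-1", "entity": "docs", "status": "active", "tool": "Edit"},
    {"pod": "pod-2", "entity": "infra", "status": "error", "tool": "Bash", "loss": 0.7},
    {"pod": "pod-2", "status": "active", "tool": "Read"},
    {"pod": "pod-3", "status": "done"},
]:
    apply_event(fleet, event)

print(active_pods(fleet))  # ['pod-1', 'pod-2']
```

Because the state is a small aggregate rather than a transcript, a pod can query it before a risky action without spending any context on another pod's logs.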
Pod streaming without context pollution
A classic mistake in multi-agent orchestration is to ask the orchestrator to read each pod’s transcripts to understand what they are doing. This loads the orchestrator’s context window with low-priority content (detailed logs) and saturates it quickly.
Nika OS draws the following separation:
- Significant events (creation, completion, failure, escalation) reach Alpha through the entity feed.
- Detailed logs stay local on the pod (logs/) and in Redis Streams. Alpha reads them on demand, never by default.
- A dedicated skill (pod-observe) lets Alpha do a semantic search on a specific pod’s tool calls, without loading its transcript.
This discipline lets Alpha coexist with 6 active worker pods in parallel without saturating its own context window.
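The separation above can be pictured as a small router. This is a sketch under assumptions: the event shape and the "kind" field are hypothetical, and two Python lists stand in for the entity feed and the pod-local log.

```python
# Event kinds that the text lists as significant enough to reach Alpha.
SIGNIFICANT_KINDS = {"creation", "completion", "failure", "escalation"}


def route(event, entity_feed, local_log):
    """Keep every event locally; forward only significant ones to the feed."""
    local_log.append(event)
    if event["kind"] in SIGNIFICANT_KINDS:
        entity_feed.append(event)


entity_feed, local_log = [], []
for kind in ("creation", "tool_call", "tool_call", "tool_call", "completion"):
    route({"pod": "pod-1", "kind": kind}, entity_feed, local_log)

print(len(entity_feed), len(local_log))  # 2 5
```

The orchestrator's input grows with the number of lifecycle events, not with the number of tool calls, which is what keeps its context window small.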
The KTW controller
At the heart of BCUB3’s industrial observability system — and at the origin of two INPI patents filed in January 2026 — lives the KTW controller, for Kelly–Taguchi–Weibull. This controller serves two purposes:
- Pilot a physical process (bottle filler, oven, press).
- Pilot an agent’s hyperparameters (e.g. model temperature, pod retry count) based on an observed metric.
Kelly: how much to intervene
The Kelly formula (1956) computes the optimal fraction of capital to bet on an event whose probability is known. Transposed to industrial piloting, it gives the optimal fraction of a correction to apply to a machine setpoint, given an estimated probability that the deviation is a real drift.
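For reference, the classic Kelly fraction for a binary bet is f* = p - (1 - p) / b, where p is the win probability and b the net odds. The sketch below applies it with p read as the probability that the drift is real. The payoff ratio b and the floor at zero are assumptions of this example; the page does not specify the payoff model used by the controller.

```python
def kelly_fraction(p, b=1.0):
    """Kelly fraction f* = p - (1 - p) / b, floored at zero.

    p: probability of winning (here: probability that the drift is real).
    b: net odds, i.e. gain per unit staked on a win (assumed 1.0 by default).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if b <= 0.0:
        raise ValueError("b must be positive")
    # A negative Kelly value means "do not bet": clamp it to zero.
    return max(0.0, p - (1.0 - p) / b)


print(kelly_fraction(0.75))  # 0.5
print(kelly_fraction(0.40))  # 0.0
```

At even odds the fraction is 2p - 1, so nothing is corrected until the drift is judged more likely than not.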
Taguchi: which loss to minimize
The Taguchi loss translates the gap between a measurement and its target into cost. Instead of a binary function (“good” or “bad”), Taguchi posits a continuous quadratic function: the further from target, the higher the loss. This lets the controller prioritize its corrections: a gap close to the target costs little, a distant gap costs a lot.
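A minimal sketch of the quadratic loss L = k·(y - target)² follows. The helper that derives k from a tolerance limit uses the usual Taguchi convention (k equals the cost at the limit divided by the squared tolerance); the numeric values are illustrative only.

```python
def loss_coefficient(cost_at_limit, tolerance):
    """k = A / d**2: a deviation of `tolerance` then costs exactly `cost_at_limit`."""
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    return cost_at_limit / tolerance ** 2


def taguchi_loss(y, target, k=1.0):
    """Quadratic loss L = k * (y - target)**2."""
    return k * (y - target) ** 2


k = loss_coefficient(cost_at_limit=8.0, tolerance=2.0)  # k = 2.0
print(taguchi_loss(500.5, 500.0, k))  # 0.5
print(taguchi_loss(502.0, 500.0, k))  # 8.0
```

A deviation four times larger costs sixteen times more, which is what lets the controller rank corrections by urgency.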
Weibull: which probability of drift
To estimate whether a deviation is a real drift or measurement noise, we fit a Weibull distribution on the last 50 measurements. The shape parameter of this distribution tells whether we observe a stable distribution (noise) or a heavy tail (drift in progress).
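The fit itself can be sketched with the standard library only. The function below estimates the Weibull shape by maximum likelihood, solving the score equation by bisection, over a sliding window of the last 50 values. It assumes strictly positive measurements. How Nika OS converts the fitted shape into a drift probability is not described here, so the sketch stops at the shape; as a general property of the distribution, a shape below 1 corresponds to a heavier-than-exponential tail.

```python
import math
import random
from collections import deque


def weibull_shape_mle(samples, tol=1e-9):
    """Maximum-likelihood estimate of the Weibull shape k for positive samples."""
    if len(samples) < 2 or min(samples) <= 0:
        raise ValueError("need at least two strictly positive samples")
    top = max(samples)
    ys = [x / top for x in samples]      # rescale to (0, 1]; the shape is scale-invariant
    logs = [math.log(y) for y in ys]
    mean_log = sum(logs) / len(logs)
    if mean_log == 0.0:
        raise ValueError("all samples are identical; the shape is undefined")

    def score(k):
        # Likelihood equation for k; this function is strictly increasing in k.
        weights = [y ** k for y in ys]
        weighted = sum(w * l for w, l in zip(weights, logs)) / sum(weights)
        return weighted - 1.0 / k - mean_log

    lo, hi = 1e-6, 1.0
    while score(hi) < 0.0:               # grow the bracket until it contains the root
        lo, hi = hi, hi * 2.0
        if hi > 1e6:
            raise ValueError("shape estimate diverges (samples nearly constant)")
    while hi - lo > tol * hi:            # plain bisection on the bracketed root
        mid = 0.5 * (lo + hi)
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(0)
window = deque(maxlen=50)                # keeps only the last 50 measurements
for _ in range(200):
    window.append(rng.weibullvariate(3.0, 2.0))  # synthetic data: scale 3, shape 2
print(round(weibull_shape_mle(list(window)), 2))
```

With only 50 points the estimate is noisy, so a controller built on it would reasonably compare the shape against a band rather than a single cut-off.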
Combining the three
The KTW controller takes the following decision at each measurement:
correction = kelly_fraction(Weibull_probability) × Taguchi_loss
flowchart LR
M["Measurement y(t)"] --> W["Weibull<br/>P(real drift)"]
M --> T["Taguchi<br/>L = k·(y-target)²"]
W --> K["Kelly fraction<br/>f(P)"]
K --> D{"Decision"}
T --> D
D -->|"low P"| N["Noise<br/>do nothing"]
D -->|"intermediate P"| S["Doubt<br/>soft correction<br/>+ wait for confirmation"]
D -->|"high P + strong L"| F["Drift<br/>firm correction<br/>+ immediate"]
classDef input fill:#F5F1E8,color:#2C3E42,stroke:#7DB5A5,stroke-width:2px;
classDef calc fill:#F5F1E8,color:#2C3E42,stroke:#A86640,stroke-width:1.5px;
classDef decide fill:#2C3E42,color:#F5F1E8,stroke:#1A262A,stroke-width:2px;
classDef noop fill:#7DB5A5,color:#F5F1E8,stroke:#5E9384;
classDef soft fill:#E99971,color:#F5F1E8,stroke:#C97A55;
classDef firm fill:#C97A55,color:#F5F1E8,stroke:#A86640;
class M input;
class W,T,K calc;
class D decide;
class N noop;
class S soft;
class F firm;
Three distinct zones emerge:
- Noise (low Weibull probability) — do nothing.
- Doubt (intermediate zone) — soft correction, wait for confirmation.
- Drift (high Weibull probability and strong Taguchi loss) — firm and immediate correction.
This behavior makes the system antifragile in Taleb’s sense: instead of oscillating with every noisy reading, it waits until it has enough statistical evidence before intervening, and it learns from each observed drift.
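The three zones can be sketched as a single decision function. The thresholds (p_low, p_high, loss_strong) and the even-odds Kelly fraction are assumptions of this example, not values from the patents. The returned correction is the magnitude given by the formula above (Kelly fraction times Taguchi loss); its sign and its conversion into setpoint units are left out because the page does not specify them.

```python
from typing import NamedTuple


class Decision(NamedTuple):
    zone: str
    correction: float
    wait_for_confirmation: bool


def ktw_decide(y, target, p_drift, k=1.0, p_low=0.55, p_high=0.80, loss_strong=1.0):
    """Classify one measurement as noise, doubt, or drift (hypothetical thresholds)."""
    loss = k * (y - target) ** 2                 # Taguchi loss
    fraction = max(0.0, 2.0 * p_drift - 1.0)     # Kelly fraction at even odds
    if p_drift < p_low:
        return Decision("noise", 0.0, False)     # do nothing
    correction = fraction * loss                 # Kelly x Taguchi, as in the formula
    if p_drift >= p_high and loss >= loss_strong:
        return Decision("drift", correction, False)  # firm and immediate
    return Decision("doubt", correction, True)       # soft, then wait for confirmation


print(ktw_decide(500.2, 500.0, p_drift=0.30).zone)  # noise
print(ktw_decide(500.5, 500.0, p_drift=0.70).zone)  # doubt
print(ktw_decide(502.0, 500.0, p_drift=0.90).zone)  # drift
```

In the doubt zone the correction is softer simply because the Kelly fraction is smaller at intermediate probabilities; no separate damping factor is needed in this sketch.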
Application to agent orchestration
The same KTW controller is used to pilot pods. The observed metric is no longer a physical measurement but a loss score computed on the pod’s tool calls and outputs.
When a pod’s loss crosses a threshold, the system responds in stages:
- It does not kill the pod immediately (equivalent to “do nothing” in the noise zone).
- If the loss keeps growing past the intermediate threshold, it mutates the mutable primitives (temperature, retry policy, prompts).
- If the loss stays high after mutation, it escalates: WhatsApp alert → JSONL bus → review request to the human operator.
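The staged response can be sketched as a pure function of the pod's loss. The threshold names, the action labels, and the already_mutated flag are hypothetical; only the order of the stages and the escalation chain come from the text.

```python
# Escalation steps, in the order given in the text.
ESCALATION_CHAIN = ("whatsapp_alert", "jsonl_bus", "human_review_request")


def pod_response(loss, first_threshold, intermediate_threshold, already_mutated):
    """Map a pod's loss to (action, details) following the staged response."""
    if loss < first_threshold:
        return "ok", ()
    if loss < intermediate_threshold:
        return "watch", ()  # threshold crossed, but the pod is not killed
    if not already_mutated:
        return "mutate", ("temperature", "retry_policy", "prompts")
    return "escalate", ESCALATION_CHAIN


print(pod_response(0.2, 0.3, 0.6, already_mutated=False)[0])  # ok
print(pod_response(0.4, 0.3, 0.6, already_mutated=False)[0])  # watch
print(pod_response(0.7, 0.3, 0.6, already_mutated=False)[0])  # mutate
print(pod_response(0.7, 0.3, 0.6, already_mutated=True)[0])   # escalate
```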
The pod alerting system (pod-alerts skill) reads the nika:alpha:alerts stream filled by the pod stream embedder. At the beginning of a triage session, the operator can ask “any alerts?” and receive the list of pods in trouble with their observed loss.
Going further
The KTW B1 and B2 patents are described in detail on the main page of the BCUB3 Lab. The sWELU neural activation function (INPI patent FR2513029) is complementary: it improves the training of small models used for Weibull edge estimators.
Judging a deliverable: the difficulty of “Definition of Done”
Observing a pod is easy (status, latency, exit code). Judging whether it really accomplished its task is significantly harder. This difficulty is at the heart of Nika OS R&D.
Why DoD is tricky
For a classic job (build, tests, deployment), DoD is binary: exit 0 or not, green tests or not. For an agentic job, DoD is multi-dimensional:
- Did the pod produce the right files (consistency with the brief)?
- Do the files have the expected quality (readability, completeness)?
- Are the business invariants respected (no PII, no false claims, no irreversible commitment)?
- Does the style match (professional tone, consistent wording)?
- Were there undocumented side effects (unlisted commit, file modification out of scope)?
None of these criteria is observable through a simple exit code. All require a semantic judgment.
LLM-as-judge — the evaluation layer
Our pod_handoff primitive applies a structured judgment at the end of each pod’s run (SubagentStop hook). The verdict is one of:
| Verdict | Meaning | Action |
|---|---|---|
| DONE | DoD met, deliverables produced, invariants respected | kill (no-op) |
| PARTIAL | Part of the deliverables missing or with minor defect | relaunch with enriched brief |
| STUCK | Pod stayed stuck without producing anything | relaunch with retry_count incremented |
| DORMANT | Long-duration STUCK without transcript or session — multi-day project, not an error | archive without escalation |
| NO_DOD | Brief without explicit Definition of Done | human escalation |
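The table maps naturally onto an enum plus a lookup. This is a sketch only: the action labels are hypothetical names for the actions in the table, and the real pod_handoff interface is not shown on this page.

```python
from enum import Enum


class Verdict(Enum):
    DONE = "DONE"
    PARTIAL = "PARTIAL"
    STUCK = "STUCK"
    DORMANT = "DORMANT"
    NO_DOD = "NO_DOD"


# One action per verdict, mirroring the "Action" column of the table.
ACTIONS = {
    Verdict.DONE: "kill",
    Verdict.PARTIAL: "relaunch_with_enriched_brief",
    Verdict.STUCK: "relaunch",
    Verdict.DORMANT: "archive",
    Verdict.NO_DOD: "human_escalation",
}


def dispatch(verdict, retry_count=0):
    """Return (action, retry_count); only STUCK increments the retry counter."""
    if verdict is Verdict.STUCK:
        retry_count += 1
    return ACTIONS[verdict], retry_count


print(dispatch(Verdict("STUCK"), retry_count=1))  # ('relaunch', 2)
print(dispatch(Verdict.DORMANT))                  # ('archive', 0)
```

Using an enum makes an unknown verdict string fail loudly at parse time instead of silently falling through to a default action.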
Why an LLM (and not only rules)
Several criteria are implemented as deterministic rules (regex, file counting, frontmatter reading). Others require a language model able to understand the brief, compare it against the outputs, and emit a nuanced score. This layer is necessarily subjective. To reduce bias:
- Cross-CLI judge — the same output is judged by 2–3 different LLMs (Claude, Gemini, Mistral); we take the median or the consensus.
- Algorithmic check before LLM — we start with deterministic checks (factuality against RAG, JSON schema, length, etc.); the LLM is called only on the residual uncertainty.
- Mandatory audit log — every LLM verdict is traced in JSONL with the prompt, the output, the estimated confidence, and the human verdict a posteriori when there is one. The GEPA tournament uses these gaps to evolve the judgment prompt itself.
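The first two safeguards can be sketched together: deterministic checks run first, and the LLM judges are consulted only if every check passes, with the median as the aggregate. The judges below are stand-in functions returning fixed scores, not calls to any real model API, and the check names and the zero score on rule failure are assumptions of the example.

```python
import statistics


def judge(output, checks, llm_judges):
    """Run deterministic checks first, then take the median of the LLM scores."""
    failed = [name for name, check in checks.items() if not check(output)]
    if failed:
        # The rules already settle the case: no LLM call is spent.
        return {"score": 0.0, "decided_by": "rules", "failed": failed, "scores": []}
    scores = [score_fn(output) for score_fn in llm_judges]
    return {"score": statistics.median(scores), "decided_by": "llm",
            "failed": [], "scores": scores}


checks = {
    "non_empty": lambda text: bool(text.strip()),
    "max_length": lambda text: len(text) <= 2000,
}
# Stand-ins for real judges: each returns a fixed score in [0, 1].
llm_judges = [lambda text: 0.9, lambda text: 0.7, lambda text: 0.8]

print(judge("report body", checks, llm_judges)["score"])  # 0.8
print(judge("   ", checks, llm_judges)["decided_by"])     # rules
```

The median is robust to one outlier judge; the individual scores are kept in the result so they can be written to the audit log alongside the verdict.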
Exploration as first-class citizen
Direct consequence: you cannot optimize what you cannot measure. Nika OS therefore accepts an irreducible share of exploration:
- Several CLIs answer in parallel to the same prompt (multi-pod tournament) to generate output diversity.
- A deterministic meta-pod combines the outputs through classic algorithms (consensus voting, outlier detection, factual check) before engaging the LLM-as-judge.
- The KTW algorithm weights the judgment confidence: if dispersion is high (low agreement among judges), we decrease the Kelly fraction and intervene less strongly. If agreement is high, we act with confidence.
This philosophy — benefit from variability rather than fight it — is the operational expression of the antifragility doctrine described above. An individual pod can be wrong; a pod swarm coupled to a multi-perspective evaluator is harder to fool.
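One way to picture the confidence weighting is to scale the Kelly fraction by an agreement factor derived from the judges' dispersion. The linear mapping used below (1 minus twice the population standard deviation, for scores in [0, 1]) is an assumption chosen for the example; the page does not give the actual weighting.

```python
import statistics


def agreement(scores):
    """1.0 for identical scores, 0.0 at maximal disagreement (scores in [0, 1])."""
    if len(scores) < 2:
        raise ValueError("need at least two judge scores")
    # The population standard deviation of values in [0, 1] is at most 0.5.
    return max(0.0, 1.0 - statistics.pstdev(scores) / 0.5)


def weighted_fraction(kelly_fraction, scores):
    """Shrink the Kelly fraction when the judges disagree."""
    return kelly_fraction * agreement(scores)


print(weighted_fraction(0.5, [0.8, 0.8, 0.8]))  # 0.5
print(weighted_fraction(0.5, [0.0, 1.0]))       # 0.0
```

With full agreement the controller intervenes at the plain Kelly fraction; as the judges diverge, the intervention shrinks toward zero.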