IPC and bus

The nervous system that connects pods together: Redis Streams for real-time signaling, a JSONL bus for the audit trail, and consumer groups for coordination.

Four communication layers

The IPC subsystem of Nika OS combines four layers, each with a distinct role. All are fail-open: if Redis goes down, pods keep running in degraded mode instead of triggering a cascade of failures.

Layer            Redis pattern                         TTL    Usage
Signaling        nika:ipc:{job_id} (STREAM)            2 h    Events: result, signal, request, error
Working memory   nika:wm:{job_id}:{artifact} (HASH)    1 h    Intermediate results < 512 kB
Contracts        nika:contract:{job_id} (HASH)         2 h    Multi-pod pipeline: steps, roles, I/O
Entity feed      nika:feed:entities (STREAM)           24 h   All PROJ/JOB/TASK changes
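
The table can be captured as a small lookup structure so that key names and TTLs live in one place. The following is only a sketch: the names Layer, LAYERS, key_for and fits_working_memory are hypothetical, and reading "512 kB" as 512,000 bytes is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    pattern: str      # Redis key pattern, copied from the table
    redis_type: str   # "stream" or "hash"
    ttl_seconds: int  # TTL from the table, converted to seconds


# One entry per row of the table above (dictionary keys are hypothetical names).
LAYERS = {
    "signaling": Layer("nika:ipc:{job_id}", "stream", 2 * 3600),
    "working_memory": Layer("nika:wm:{job_id}:{artifact}", "hash", 1 * 3600),
    "contracts": Layer("nika:contract:{job_id}", "hash", 2 * 3600),
    "entity_feed": Layer("nika:feed:entities", "stream", 24 * 3600),
}

# Assumption: "512 kB" is read as 512,000 bytes.
MAX_ARTIFACT_BYTES = 512 * 1000


def key_for(layer: str, **parts: str) -> str:
    """Fill the placeholders of a layer's key pattern."""
    return LAYERS[layer].pattern.format(**parts)


def fits_working_memory(payload: bytes) -> bool:
    """True if the payload is under the working-memory size limit."""
    return len(payload) < MAX_ARTIFACT_BYTES
```

For example, key_for("working_memory", job_id="JOB-1", artifact="plan") yields nika:wm:JOB-1:plan.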

Consumer groups

The nika:feed:entities stream is read by three consumer groups. Every group sees the full stream; what differs is how each group distributes events among its own consumers:

  • cg:orchestrator — the Alpha pod receives all events. It uses this to observe the swarm without having to query each pod individually.
  • cg:worker — worker pods are load-balanced in this group. When a new job appears, a single worker is notified, avoiding races.
  • cg:monitor — the dashboard (Grafana, K3s monitoring pods) also receives all events for real-time metrics.
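
The group setup and a worker-side read loop could look roughly like the sketch below. It assumes a redis-py style client is passed in (the calls xgroup_create, xreadgroup and xack exist in redis-py) and that replies use redis-py's default list shape. The function names ensure_groups and read_batch, the consumer name, and the handler callback are hypothetical.

```python
FEED = "nika:feed:entities"
GROUPS = ("cg:orchestrator", "cg:worker", "cg:monitor")


def ensure_groups(client, stream=FEED, groups=GROUPS):
    """Create each consumer group if it is missing; return the ones created."""
    created = []
    for group in groups:
        try:
            # id="$" means the group starts at new entries only;
            # mkstream=True creates the stream if it does not exist yet.
            client.xgroup_create(stream, group, id="$", mkstream=True)
            created.append(group)
        except Exception as exc:
            # Redis replies with a BUSYGROUP error when the group already exists.
            if "BUSYGROUP" not in str(exc):
                raise
    return created


def read_batch(client, group, consumer, handler,
               stream=FEED, count=10, block_ms=1000):
    """Read entries not yet delivered to this group and ack each one
    after the handler returns. Returns the number of entries handled."""
    reply = client.xreadgroup(group, consumer, {stream: ">"},
                              count=count, block=block_ms)
    handled = 0
    for _stream_name, entries in reply or []:
        for entry_id, fields in entries:
            handler(entry_id, fields)
            client.xack(stream, group, entry_id)
            handled += 1
    return handled
```

Inside cg:worker, several pods calling read_batch with different consumer names each receive a disjoint share of the entries, which is the load-balancing behavior described above. A group with a single consumer, such as cg:orchestrator, receives every entry.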

This separation is what lets Alpha keep its context window clean. Alpha does not have to read the logs of every pod: it subscribes to the entity feed and receives only significant transitions (created, completed, failed). If Alpha wants detail, it reads the JSONL bus or queries Qdrant.

Dual-write: Redis + Qdrant

Significant events (completed, failed, created for root hierarchies) are dual-written:

  • to Redis Streams for real-time latency;
  • to Qdrant for long-term persistence and semantic recall.
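
A minimal sketch of the dual-write rule follows. The two sinks are passed in as plain callables so that no particular Redis or Qdrant client API is assumed. The event field names transition and is_root, and the function names, are hypothetical.

```python
def is_significant(event: dict) -> bool:
    """completed and failed always qualify; created only for root hierarchies."""
    transition = event.get("transition")
    if transition in ("completed", "failed"):
        return True
    return transition == "created" and bool(event.get("is_root"))


def dual_write(event: dict, write_stream, write_store) -> dict:
    """Send a significant event to both sinks independently.

    write_stream and write_store are caller-supplied callables
    (for example thin wrappers around a Redis XADD and a Qdrant upsert).
    A failure in one sink does not prevent the write to the other.
    """
    if not is_significant(event):
        return {}
    results = {}
    for name, sink in (("redis", write_stream), ("qdrant", write_store)):
        try:
            sink(event)
            results[name] = True
        except Exception:
            results[name] = False
    return results
```

Returning a per-sink status instead of raising keeps the caller fail-open while still letting it record which write was lost.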

The Qdrant event carries the complete NIKA_META envelope. Six months later, one can query “how many domain=hooks jobs failed with intent=refactor in April 2026?” and get an instant answer.

JSONL bus: the audit trail

In parallel with Redis, each orchestration message is appended to an append-only JSONL file:

  • _bus/alpha_bus.jsonl — Alpha’s global channel.
  • _bus/channels/{entity_id}.jsonl — one file per PROJ/JOB/TASK entity.
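
Appending to the bus needs nothing beyond the standard library. The sketch below assumes one JSON object per line and adds a ts timestamp field; the field name and the helper names bus_path and append_message are hypothetical.

```python
import json
import time
from pathlib import Path

BUS_ROOT = Path("_bus")


def bus_path(entity_id=None, root=BUS_ROOT) -> Path:
    """Global Alpha channel when no entity is given, else the entity's channel file."""
    if entity_id is None:
        return Path(root) / "alpha_bus.jsonl"
    return Path(root) / "channels" / f"{entity_id}.jsonl"


def append_message(message: dict, entity_id=None, root=BUS_ROOT) -> Path:
    """Append one message as a single JSON line; existing lines are never rewritten."""
    path = bus_path(entity_id, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"ts": time.time(), **message}  # "ts" is a hypothetical field
    line = json.dumps(record, ensure_ascii=False, separators=(",", ":"))
    with path.open("a", encoding="utf-8") as fh:  # mode "a": append-only
        fh.write(line + "\n")
    return path
```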

Why a flat file when Redis does the job? Three reasons:

  1. Survives Redis outages — pods can still log.
  2. Offline audit — an operator can grep six months of history without touching a database.
  3. Forensics — when something goes wrong, the JSONL bus is the last source of truth, independent of other systems.
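
The offline audit case can be illustrated with a small scanner that walks every .jsonl file under the bus directory. This is a sketch only: the function name scan_bus is hypothetical, and silently skipping unparseable lines (for instance a line cut short by a crash) is an assumption about the desired behavior.

```python
import json
from pathlib import Path


def scan_bus(root, predicate):
    """Yield (path, record) for every JSON object under root that matches predicate."""
    for path in sorted(Path(root).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for raw in fh:
                raw = raw.strip()
                if not raw:
                    continue
                try:
                    record = json.loads(raw)
                except json.JSONDecodeError:
                    continue  # skip a torn or corrupted line
                if isinstance(record, dict) and predicate(record):
                    yield path, record
```

For example, list(scan_bus("_bus", lambda r: r.get("type") == "job_completed")) collects every completion message without touching Redis or Qdrant.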

Standard message types:

session_start         — pod or Alpha startup
session_end_summary   — termination with enriched summary
subagent_completed    — a pod finished, summary ingested to Qdrant
review_request        — the parent pod must validate a deliverable
autonomous_dispatch   — dispatch fired by autonomy_engine (3% probabilistic)
job_completed         — hierarchical entity completed
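
The type names listed above can be pinned down in an enumeration so that a typo in a message type fails early. Only the six type strings come from the list; the class name, the make_message helper, and the sender and payload fields are hypothetical.

```python
from enum import Enum


class BusMessageType(str, Enum):
    SESSION_START = "session_start"
    SESSION_END_SUMMARY = "session_end_summary"
    SUBAGENT_COMPLETED = "subagent_completed"
    REVIEW_REQUEST = "review_request"
    AUTONOMOUS_DISPATCH = "autonomous_dispatch"
    JOB_COMPLETED = "job_completed"


def make_message(msg_type: str, sender: str, **payload) -> dict:
    """Build a bus message; raises ValueError for an unknown type."""
    kind = BusMessageType(msg_type)
    return {"type": kind.value, "sender": sender, "payload": payload}
```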

Redis Streams telemetry

Each tool call in a pod triggers an XADD on the stream agent:events:{session_id} (maxlen=5000, TTL 24 h). In parallel, the PostToolUse hook updates:

  • agent:state:{session_id} (HSET heartbeat, TTL 2 h) — is the pod still alive?
  • agent:metrics:{date} (HINCRBY, TTL 7 d) — daily counters per tool.
  • nika:ipc:metrics:{date} (HINCRBY, TTL 7 d) — IPC-specific counters.

The pod observes without blocking. If the Redis connection flickers, the tool call still succeeds — telemetry is best-effort, not critical for execution.
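
A best-effort telemetry hook along these lines might look like the sketch below. It assumes a redis-py style client (pipeline, xadd, expire, hset and hincrby are redis-py calls). The function name record_tool_call, the hash field names heartbeat and last_tool, and the tool_calls counter in the IPC metrics hash are hypothetical.

```python
import time


def record_tool_call(client, session_id: str, tool: str, now=None) -> bool:
    """Push telemetry for one tool call. Returns False instead of raising
    when Redis is unreachable, so the tool call itself is never affected."""
    now = time.time() if now is None else now
    date = time.strftime("%Y-%m-%d", time.gmtime(now))
    events = f"agent:events:{session_id}"
    state = f"agent:state:{session_id}"
    metrics = f"agent:metrics:{date}"
    ipc_metrics = f"nika:ipc:metrics:{date}"
    try:
        pipe = client.pipeline(transaction=False)
        # Capped stream of events, trimmed to roughly 5000 entries.
        pipe.xadd(events, {"tool": tool, "ts": str(now)},
                  maxlen=5000, approximate=True)
        pipe.expire(events, 24 * 3600)
        # Heartbeat: the key expires after 2 h if the pod stops reporting.
        pipe.hset(state, mapping={"heartbeat": str(now), "last_tool": tool})
        pipe.expire(state, 2 * 3600)
        # Daily counters, kept for 7 days.
        pipe.hincrby(metrics, tool, 1)
        pipe.expire(metrics, 7 * 24 * 3600)
        pipe.hincrby(ipc_metrics, "tool_calls", 1)  # hypothetical field name
        pipe.expire(ipc_metrics, 7 * 24 * 3600)
        pipe.execute()
        return True
    except Exception:
        return False  # best-effort: swallow the error, never block the pod
```

Batching the commands in one pipeline keeps the hook to a single round trip, which limits the time a tool call spends on telemetry.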

The fail-open pattern

The whole IPC subsystem follows a strict pattern: fail-open with TTL everywhere. A pod that cannot reach Redis:

  • continues local execution;
  • logs the error locally (file in logs/);
  • keeps writing JSONL bus messages during the outage, since the bus is a flat file independent of Redis;
  • resumes normal acks when Redis recovers.

No pod should ever block because an ancillary system is slow or unavailable. This is a conservative design decision, justified by the fact that pods often run in long loops (hours), and any silent hang ends up costing dearly in debugging.
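
One way to apply this pattern uniformly is a small decorator around every call to an ancillary system. The sketch below is an illustration, not the actual Nika OS implementation: the decorator name fail_open and the logger name nika.ipc are hypothetical.

```python
import functools
import logging

logger = logging.getLogger("nika.ipc")  # hypothetical logger name


def fail_open(default=None):
    """Wrap a call to an ancillary system: on any error, log it locally
    and return `default` instead of propagating the exception."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                logger.warning("ancillary call %s failed: %s", fn.__name__, exc)
                return default
        return wrapper
    return decorator
```

A decorator like this covers errors but not slowness. To avoid silent hangs, the client itself also needs short timeouts, for example the socket_timeout and socket_connect_timeout parameters that redis-py accepts when constructing a client. Attaching a file handler that writes under logs/ to the logger would cover the local error log mentioned above.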

In one sentence

Nika OS IPC is designed so that dozens of pods can coexist without stepping on each other, without saturating Alpha’s memory, and without cascading failures when a component flickers. Redis carries real-time traffic; Qdrant carries memory; JSONL carries proof.