Architecture

How Mnemix is actually built: the worker, the storage tiers, the gate stages, and where the latency budget goes. Every number on this page is an internal engineering budget from the locked spec, not a marketing claim; the only public latency phrase is "designed for sub-300ms voice recall".

Status labels matter on this page: live = serving production traffic today; in development = spec-locked, being built; target = locked architecture direction.

One worker, one domain

All public traffic lands on a single Cloudflare Worker behind mcp.mnemix.ai: the REST v1 routes and the MCP transport (/mcp) share the same host. There are no other public hostnames. Edge placement means each request is served from a point of presence (PoP) near the caller, not from a fixed region.

            caller / agent / MCP client
                        │
                        ▼
        ┌──────────────────────────────────┐
        │  Main Worker — mcp.mnemix.ai     │   live
        │  · REST v1 (recall, calls/end,   │
        │    caller)                       │
        │  · MCP transport at /mcp         │
        │  · auth (hashed API keys)        │
        │  · token-bucket rate limiting    │
        │  · validation gate (Stage 1+2)   │   in development
        └──────┬─────────┬─────────┬───────┘
               │         │         │
        Hyperdrive     Redis     QStash
               │      (cache +  (async jobs:
               ▼      ratelimit)  enrichment,
          Postgres                summaries)
        (+ pgvector,
         RLS, ranges)
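
As a rough illustration of the single-host layout in the diagram, the sketch below classifies a request path as either the MCP transport or one of the REST v1 routes. The `/v1/...` path shape, the `Route` type, and the `classify` function are assumptions made for this example; the page itself only names the routes (recall, calls/end, caller) and the `/mcp` transport.

```typescript
// Hypothetical dispatch for one Worker serving REST v1 and MCP on one host.
// The "/v1/..." prefix is an assumption; only "/mcp" appears in the text.
type Route =
  | { kind: "mcp" }
  | { kind: "rest"; name: "recall" | "calls/end" | "caller" }
  | { kind: "not_found" };

const REST_ROUTES = ["recall", "calls/end", "caller"] as const;

function classify(pathname: string): Route {
  // Drop trailing slashes so "/mcp/" and "/mcp" are treated the same.
  const path = pathname.length > 1 ? pathname.replace(/\/+$/, "") : pathname;
  if (path === "/mcp") return { kind: "mcp" };
  for (const name of REST_ROUTES) {
    if (path === `/v1/${name}`) return { kind: "rest", name };
  }
  return { kind: "not_found" };
}
```

Keeping the classification in one pure function means auth, rate limiting, and the gate can run once, ahead of either transport.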

Storage tiers

Hyperdrive matters more than it sounds: Workers can't hold long-lived database pools, and Hyperdrive's edge pooling removes the multi-round-trip connection handshake that kills cold serverless queries.

Tenant isolation

Every tenant-scoped table carries row-level security, enforced at the database layer under a non-bypassing application role — not just application-code WHERE clauses. Personal caller data never crosses tenants; only public business data is cached cross-tenant. High-volume audit tables are partitioned annually so retention policies stay enforceable.
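
One common way to get this kind of database-level isolation over a pooled connection is to set a transaction-local tenant setting and let a row-level security policy filter on it. The sketch below only builds the statement sequence; the table name `callers`, the setting name `app.tenant_id`, and the `tenantScoped` helper are hypothetical and are not taken from the Mnemix schema.

```typescript
// Sketch only: table, column, and setting names are hypothetical.
// Assumed one-time DDL on the database side (standard Postgres RLS syntax):
//   ALTER TABLE callers ENABLE ROW LEVEL SECURITY;
//   CREATE POLICY tenant_isolation ON callers
//     USING (tenant_id = current_setting('app.tenant_id')::uuid);
// The application role is assumed to lack BYPASSRLS and not to own the table.

interface Statement {
  sql: string;
  params: string[];
}

function tenantScoped(tenantId: string, query: Statement): Statement[] {
  // Loose shape check so an obviously malformed tenant id never reaches the DB.
  if (!/^[0-9a-f-]{36}$/i.test(tenantId)) {
    throw new Error("tenantId must be a UUID string");
  }
  return [
    { sql: "BEGIN", params: [] },
    // is_local = true: the setting is discarded at COMMIT or ROLLBACK,
    // so it cannot leak to the next user of a pooled connection.
    { sql: "SELECT set_config('app.tenant_id', $1, true)", params: [tenantId] },
    query,
    { sql: "COMMIT", params: [] },
  ];
}
```

Note that the wrapped query needs no tenant WHERE clause of its own: under these assumptions the policy filters rows even if application code forgets to.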

The bi-temporal substrate

evidence_refs and locked_facts carry generated tstzrange columns (valid_range, tx_range), GiST-indexed so as-of reconstruction is an index scan, not an archaeology project. Writes go through supersession-safe procedures (assert_fact, evolve_fact) — never raw updates. Memory objects are deliberately not bi-temporal (locked decision D3); they're the recall layer the substrate governs.
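
To make "as-of reconstruction" concrete, here is a small in-memory model of the two time axes. It is a sketch under stated assumptions: the `FactVersion` shape, the numeric timestamps, and the `asOf` helper are invented for illustration, and the real tables use tstzrange columns where the equivalent filter would be a range-containment test on valid_range and tx_range.

```typescript
// In-memory model of the two time axes; field names are hypothetical.
// Ranges are half-open [from, to); to = null means "unbounded".
interface FactVersion {
  key: string;
  value: string;
  validFrom: number;
  validTo: number | null; // when the fact was true in the world
  txFrom: number;
  txTo: number | null; // when the system held this belief
}

function contains(from: number, to: number | null, t: number): boolean {
  return from <= t && (to === null || t < to);
}

// "What did we believe at txAt about the state of the world at validAt?"
function asOf(
  rows: FactVersion[],
  key: string,
  validAt: number,
  txAt: number,
): FactVersion | undefined {
  return rows.filter(
    (r) =>
      r.key === key &&
      contains(r.validFrom, r.validTo, validAt) &&
      contains(r.txFrom, r.txTo, txAt),
  )[0];
}

// At tx 110 we record plan = basic from valid time 100 onward.
// At tx 200 we learn the plan became pro at valid time 150: the old belief is
// closed on the tx axis (never updated in place) and two new rows are written.
const history: FactVersion[] = [
  { key: "plan", value: "basic", validFrom: 100, validTo: null, txFrom: 110, txTo: 200 },
  { key: "plan", value: "basic", validFrom: 100, validTo: 150, txFrom: 200, txTo: null },
  { key: "plan", value: "pro", validFrom: 150, validTo: null, txFrom: 200, txTo: null },
];
```

Because superseded rows are closed rather than overwritten, a query pinned to an earlier transaction time still returns the belief that was current then, which is what makes past decisions reconstructible.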

The gate, inside the worker

The validation gate is two stages in one process boundary — no separate gate service, no extra hop:

Stage 1 — deterministic evaluator (no LLM): runs the bundle's invariant / assertion / forbidden rules as compiled expectation suites. Budget: ≤50ms P99. Fully replayable — a past verdict re-runs byte-identical.
Stage 2 — isolated grader (qualitative rules only): a role-assigned lightweight model (default Haiku 4.5) scored against a golden set, process-isolated from the acting path so raw model output never crosses the boundary — only the band (PASS / CHATTER / REJECT) does. Budgets: ≤4ms cached, ≤80ms uncached P99. The model is swappable by operators, but only with a fresh acceptance report proving the role contract.
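
The replayability claim for Stage 1 follows from the rules being pure functions of the input. A minimal sketch, assuming a hypothetical rule shape (the real bundle format is not shown on this page) and assuming that invariant and assertion rules must hold while forbidden rules must not match:

```typescript
// Hypothetical rule shapes; the real bundle format is not shown in the text.
type RuleKind = "invariant" | "assertion" | "forbidden";

interface Rule {
  id: string;
  kind: RuleKind;
  // Must be pure: no clock, no randomness, no I/O.
  test: (input: Record<string, unknown>) => boolean;
}

interface Stage1Result {
  passed: boolean;
  failed: string[];
}

function evaluate(rules: Rule[], input: Record<string, unknown>): Stage1Result {
  const failed: string[] = [];
  for (const rule of rules) {
    const hit = rule.test(input);
    // Assumed semantics: invariant/assertion must hold; forbidden must not match.
    const ok = rule.kind === "forbidden" ? !hit : hit;
    if (!ok) failed.push(rule.id);
  }
  failed.sort(); // same output regardless of the order rules were listed in
  return { passed: failed.length === 0, failed };
}

const rules: Rule[] = [
  { id: "no_ssn", kind: "forbidden", test: (i) => String(i["text"]).indexOf("ssn") !== -1 },
  { id: "has_caller", kind: "assertion", test: (i) => typeof i["caller_id"] === "string" },
];
```

Given the same rules and the same stored input, serializing the result of `evaluate` twice yields identical strings, which is the property a replay check would compare.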

External verdicts are allowed / denied / needs_human; the grader's bands map into them (PASS → allowed, REJECT → denied, CHATTER → one retry → needs_human).
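
The band-to-verdict mapping can be sketched as a small function. This reads "CHATTER → one retry → needs_human" as: retry once, and escalate only if the retry is also CHATTER. That reading, the `toVerdict` name, and the synchronous `grade` callback (a stand-in for the isolated grader, which in practice would be asynchronous) are all assumptions.

```typescript
type Band = "PASS" | "CHATTER" | "REJECT";
type Verdict = "allowed" | "denied" | "needs_human";

// `grade` is a hypothetical stand-in for the isolated grader.
// Only the band crosses the boundary, never raw model output.
function toVerdict(grade: () => Band): Verdict {
  const first = grade();
  if (first === "PASS") return "allowed";
  if (first === "REJECT") return "denied";

  // First result was CHATTER: exactly one retry.
  const second = grade();
  if (second === "PASS") return "allowed";
  if (second === "REJECT") return "denied";

  // Still CHATTER after the retry: hand off to a person.
  return "needs_human";
}
```

Capping the retry at one keeps the worst case to two grader calls, so the latency of an inconclusive request stays bounded.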

The hot path, end to end

A voice recall has a conversational budget, so everything on the path is parallel, cached, or deferred:

Auth + rate limit — hashed-key lookup and a Redis token bucket (fails open: if Redis is down, traffic serves rather than blocks).
Memory — read-through cache first; Postgres (via Hyperdrive) only on a miss.
Enrichment — Trestle, Twilio Lookup, and Baylio race in parallel against a hard deadline; whatever resolves in budget ships in the response, the rest completes asynchronously via QStash jobs and writes back for next time.
Write-back — calls/end accepts and returns; summarization happens in background jobs, never in front of the response.
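
The first and third steps above can be sketched as follows. This is an illustration under assumptions, not the production code: an in-memory bucket stands in for Redis, plain promises stand in for the enrichment providers, and every name here (`takeToken`, `allowRequest`, `raceAgainstDeadline`) is hypothetical. Treating a failed lookup as deferred is also an assumption.

```typescript
interface Bucket {
  tokens: number;
  updatedMs: number;
}

// Classic token bucket: refill by elapsed time, then try to spend one token.
function takeToken(b: Bucket, nowMs: number, capacity: number, refillPerSec: number): boolean {
  const elapsedSec = Math.max(0, nowMs - b.updatedMs) / 1000;
  b.tokens = Math.min(capacity, b.tokens + elapsedSec * refillPerSec);
  b.updatedMs = nowMs;
  if (b.tokens < 1) return false;
  b.tokens -= 1;
  return true;
}

// Fail open: if the limiter backend throws, serve the request anyway.
function allowRequest(check: () => boolean): boolean {
  try {
    return check();
  } catch {
    return true;
  }
}

interface Lookup<T> {
  name: string;
  run: () => Promise<T>;
}

interface EnrichmentResult<T> {
  inBudget: Record<string, T>;
  deferred: string[];
}

// Start every lookup at once; keep whatever settles before the deadline and
// report the rest so the caller can hand them to a background job.
async function raceAgainstDeadline<T>(
  lookups: Lookup<T>[],
  deadlineMs: number,
): Promise<EnrichmentResult<T>> {
  const inBudget: Record<string, T> = {};
  const pending = new Set<string>(lookups.map((l) => l.name));
  const runs = lookups.map(({ name, run }) =>
    run().then(
      (value) => {
        inBudget[name] = value;
        pending.delete(name);
      },
      () => {
        /* a failed lookup stays pending and is treated as deferred */
      },
    ),
  );
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<void>((resolve) => {
    timer = setTimeout(resolve, deadlineMs);
  });
  await Promise.race([Promise.all(runs), deadline]);
  clearTimeout(timer);
  // Snapshot, so lookups that finish later cannot mutate the response.
  return { inBudget: { ...inBudget }, deferred: Array.from(pending) };
}
```

In this sketch the names in `deferred` are what a worker would enqueue for asynchronous completion; the late lookups are not cancelled, they simply no longer affect the response.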

Every response carries measured timing_ms — the architecture is judged by real per-request numbers, not published averages.

Where it's heading

The locked direction: the COLD Qdrant tier for large-scale episodic recall, the gate contract graduating to the public API, and retrieval fusion (signal-weighted reranking across tiers) — all spec-locked, all labeled here as they ship.