Architecture
How Mnemix is actually built: the worker, the storage tiers, the gate stages, and where the latency budget goes. Every number on this page is an internal engineering budget from the locked spec, not a marketing claim; the only public latency phrase is "designed for sub-300ms voice recall."
Status labels matter on this page: live = serving production traffic today; in development = spec-locked, being built; target = locked architecture direction.
One worker, one domain
All public traffic lands on a single Cloudflare Worker behind mcp.mnemix.ai — REST v1 routes and the MCP transport (/mcp) on the same host. There are no other public hostnames. Edge placement means the request is served from a PoP near the caller, not a fixed region.
        caller / agent / MCP client
                     │
                     ▼
    ┌──────────────────────────────────┐
    │ Main Worker (mcp.mnemix.ai)      │  live
    │ · REST v1 (recall, calls/end,    │
    │   caller)                        │
    │ · MCP transport at /mcp          │
    │ · auth (hashed API keys)         │
    │ · token-bucket rate limiting     │
    │ · validation gate (Stage 1+2)    │  in development
    └──────┬─────────┬─────────┬───────┘
           │         │         │
      Hyperdrive   Redis     QStash
           │     (cache +   (async jobs:
           ▼     ratelimit)  enrichment,
       Postgres              summaries)
     (+ pgvector,
      RLS, ranges)
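To make the single-host layout concrete, here is a minimal sketch of how one handler could tell the MCP transport apart from the REST v1 routes. Only the route names (recall, calls/end, caller) and the /mcp path come from this page; the `/v1/` prefix, the `RouteKind` type, and the `classifyRoute` function are hypothetical and exist only for illustration.

```typescript
// Hypothetical route classifier for a single-host worker.
// Assumption: REST v1 routes live under a "/v1/" prefix (not stated in the source).
type RouteKind =
  | "mcp"
  | "rest:recall"
  | "rest:calls/end"
  | "rest:caller"
  | "not_found";

const REST_V1_PREFIX = "/v1/"; // assumed prefix

function classifyRoute(pathname: string): RouteKind {
  // The MCP transport shares the host with the REST routes.
  if (pathname === "/mcp" || pathname.startsWith("/mcp/")) return "mcp";
  if (!pathname.startsWith(REST_V1_PREFIX)) return "not_found";

  // Strip the prefix and any trailing slashes before matching.
  const rest = pathname.slice(REST_V1_PREFIX.length).replace(/\/+$/, "");
  switch (rest) {
    case "recall":
      return "rest:recall";
    case "calls/end":
      return "rest:calls/end";
    case "caller":
      return "rest:caller";
    default:
      return "not_found";
  }
}
```

In a real Worker this would sit inside the `fetch` handler, after auth and rate limiting, with `new URL(request.url).pathname` as the input.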
Storage tiers
| Tier | Technology | Role | Status |
| :--- | :--- | :--- | :--- |
| HOT | Upstash Redis (multi-region) | Read-through contact cache, token buckets, dispatch claims | live |
| WARM | Supabase Postgres via Cloudflare Hyperdrive | Source of truth: contacts, interactions, memory, the bi-temporal substrate (evidence_refs, locked_facts), audit | live |
| WARM (vector) | pgvector in Postgres (HNSW) | In-Postgres semantic slice for tenant-scale retrieval | live |
| COLD | Qdrant cluster | Episodic ANN at large scale (payload-filtered, quantized) | target |
| Assets | Cloudflare R2 | Brand + artifact storage | live |
Hyperdrive matters more than it sounds: Workers can't hold long-lived database pools, and Hyperdrive's edge pooling removes the multi-round-trip connection handshake that kills cold serverless queries.
Tenant isolation
Every tenant-scoped table carries row-level security, enforced at the database layer under a non-bypassing application role — not just application-code WHERE clauses. Personal caller data never crosses tenants; only public business data is cached cross-tenant. High-volume audit tables are partitioned annually so retention policies stay enforceable.
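One common way to pair row-level security with a pooled connection is to set the tenant identity per transaction, so that policies can read it through `current_setting`. The sketch below assumes that pattern; the source does not say how Mnemix passes tenant identity to Postgres. The `app.tenant_id` setting name, the `SqlClient` interface, and the `withTenant` helper are all hypothetical. `set_config(name, value, true)` is the standard Postgres function for a transaction-local setting.

```typescript
// Minimal client shape so the sketch stays driver-agnostic (hypothetical interface).
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

// Runs `work` inside a transaction whose tenant identity is visible to RLS policies.
// Assumption: policies compare a tenant column to current_setting('app.tenant_id').
async function withTenant<T>(
  db: SqlClient,
  tenantId: string,
  work: (db: SqlClient) => Promise<T>,
): Promise<T> {
  await db.query("BEGIN");
  try {
    // Third argument `true` scopes the setting to this transaction only,
    // so it cannot leak to the next user of a pooled connection.
    await db.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(db);
    await db.query("COMMIT");
    return result;
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}
```

Because the application role cannot bypass RLS, a query that forgets its WHERE clause still returns only the current tenant's rows.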
The bi-temporal substrate
evidence_refs and locked_facts carry generated tstzrange columns (valid_range, tx_range), GiST-indexed so as-of reconstruction is an index scan, not an archaeology project. Writes go through supersession-safe procedures (assert_fact, evolve_fact) — never raw updates. Memory objects are deliberately not bi-temporal (locked decision D3); they're the recall layer the substrate governs.
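As-of reconstruction comes down to two range-containment checks: which versions were valid at a given moment, according to what the system had recorded at another moment. The in-memory sketch below illustrates that logic only. The `FactVersion` shape, the numeric timestamps, and the half-open `[from, to)` bounds are assumptions, not the actual schema. In Postgres the equivalent predicate would use the range containment operator `@>` on both columns, which GiST indexes support.

```typescript
// Half-open range [from, to); `to === null` means unbounded (assumed convention).
interface TimeRange {
  from: number;
  to: number | null;
}

// Hypothetical in-memory stand-in for a bi-temporal row.
interface FactVersion {
  key: string;
  value: string;
  valid: TimeRange; // when the fact holds in the world (valid_range)
  tx: TimeRange;    // when this version was the recorded belief (tx_range)
}

function contains(r: TimeRange, t: number): boolean {
  return t >= r.from && (r.to === null || t < r.to);
}

// Versions that were valid at `validAt`, as the system knew them at `txAt`.
function asOf(versions: FactVersion[], validAt: number, txAt: number): FactVersion[] {
  return versions.filter((v) => contains(v.valid, validAt) && contains(v.tx, txAt));
}

// Example: "plan" recorded as basic at tx=100, then corrected at tx=200
// to say it became pro from valid time 50 onward.
const history: FactVersion[] = [
  { key: "plan", value: "basic", valid: { from: 0, to: null }, tx: { from: 100, to: 200 } },
  { key: "plan", value: "basic", valid: { from: 0, to: 50 }, tx: { from: 200, to: null } },
  { key: "plan", value: "pro", valid: { from: 50, to: null }, tx: { from: 200, to: null } },
];
```

Supersession never deletes the first row; it closes its `tx` range, which is why a query at `txAt = 150` still reproduces the belief held at the time.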
The gate, inside the worker
The validation gate is two stages in one process boundary — no separate gate service, no extra hop:
- Stage 1: deterministic evaluator (no LLM). Runs the bundle's invariant / assertion / forbidden rules as compiled expectation suites. Budget: ≤50ms P99. Fully replayable: a past verdict re-runs byte-identical.
- Stage 2: isolated grader (qualitative rules only). A role-assigned lightweight model (default Haiku 4.5) scored against a golden set, process-isolated from the acting path so raw model output never crosses the boundary; only the band (PASS / CHATTER / REJECT) does. Budgets: ≤4ms cached, ≤80ms uncached P99. The model is swappable by operators, but only with a fresh acceptance report proving the role contract.
External verdicts are allowed / denied / needs_human; the grader's bands map into them (PASS → allowed, REJECT → denied, CHATTER → one retry → needs_human).
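The band-to-verdict mapping above can be sketched as a small function. This is one reading of "CHATTER → one retry → needs_human": a CHATTER band is re-graded once, a second CHATTER escalates to needs_human, and a retry that lands on PASS or REJECT maps normally. The `resolveVerdict` name and the synchronous `grade` callback are illustrative assumptions; a real grader call would be asynchronous.

```typescript
type Band = "PASS" | "CHATTER" | "REJECT";
type Verdict = "allowed" | "denied" | "needs_human";

// Maps grader bands to external verdicts, retrying CHATTER up to `maxRetries` times.
// Assumption: a retry that yields PASS or REJECT is mapped like a first-attempt band.
function resolveVerdict(grade: () => Band, maxRetries = 1): Verdict {
  let band = grade();
  for (let attempt = 0; attempt < maxRetries && band === "CHATTER"; attempt++) {
    band = grade();
  }
  if (band === "PASS") return "allowed";
  if (band === "REJECT") return "denied";
  return "needs_human"; // still CHATTER after the retry budget
}
```

Only the band crosses the process boundary, so the caller of this function never sees raw model output.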
The hot path, end to end
A voice recall has a conversational budget, so everything on the path is parallel, cached, or deferred:
- Auth + rate limit — hashed-key lookup and a Redis token bucket (fails open: if Redis is down, traffic serves rather than blocks).
- Memory — read-through cache first; Postgres (via Hyperdrive) only on a miss.
- Enrichment — Trestle, Twilio Lookup, and Baylio race in parallel against a hard deadline; whatever resolves in budget ships in the response, the rest completes asynchronously via QStash jobs and writes back for next time.
- Write-back: calls/end accepts and returns; summarization happens in background jobs, never in front of the response.
Every response carries measured timing_ms — the architecture is judged by real per-request numbers, not published averages.
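The enrichment step, where lookups race in parallel against a hard deadline, can be sketched as follows. This is a minimal illustration, not the production code. The `Source` shape, the `raceAgainstDeadline` name, and the result fields are assumptions; only the `timing_ms` field name comes from this page. Sources that miss the deadline or fail are reported in `deferred`, which is where a background job (QStash, per this page) would pick them up.

```typescript
// Hypothetical description of one enrichment provider.
interface Source<T> {
  name: string;
  run: () => Promise<T>;
}

interface RaceResult<T> {
  resolved: Record<string, T>; // results that arrived within the budget
  deferred: string[];          // sources to finish asynchronously
  timing_ms: number;           // measured wall-clock time for this step
}

async function raceAgainstDeadline<T>(
  sources: Source<T>[],
  deadlineMs: number,
): Promise<RaceResult<T>> {
  const started = Date.now();
  const live: Record<string, T> = {};

  // Start every source at once; a failing source is simply left unresolved.
  const tasks = sources.map((s) =>
    Promise.resolve()
      .then(() => s.run())
      .then(
        (value) => {
          live[s.name] = value;
        },
        () => undefined,
      ),
  );

  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<void>((resolve) => {
    timer = setTimeout(resolve, deadlineMs);
  });

  // Return as soon as everything settles or the deadline fires, whichever is first.
  await Promise.race([Promise.all(tasks).then(() => undefined), deadline]);
  if (timer !== undefined) clearTimeout(timer);

  // Snapshot so late arrivals cannot mutate the response after it ships.
  const resolved = { ...live };
  const deferred = sources.map((s) => s.name).filter((n) => !(n in resolved));
  return { resolved, deferred, timing_ms: Date.now() - started };
}
```

The snapshot matters: a slow provider may still resolve after the response is built, and its result should go to the write-back path rather than into a response that has already been sent.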
Where it's heading
The locked direction: the COLD Qdrant tier for large-scale episodic recall, the gate contract graduating to the public API, and retrieval fusion (signal-weighted reranking across tiers) — all spec-locked, all labeled here as they ship.