Text-based memory re-feeds the same retrieved context into the model on every turn, so serving cost climbs with the conversation. Atelya OS keeps the relevant working set as reused KV: compute it once, reuse it. Same answers, a fraction of the prefill.
The shaded gap is the saving: reusing KV instead of re-prefilling memory runs about 6–54× cheaper over a session, at parity answer quality — roughly 6× by default, up to ~54× for a stable working set.
Driving the real engine (CacheBlend KV reuse + residency) and scored by an independent LLM judge — the same judge on both arms. Every figure reproduces from the repo.
Fidelity
Answer-for-answer agreement with a cold full re-prefill of the same context (n=200, LLM-judged). Reusing KV instead of recomputing it does not change the answer.
Prefill cost
Less prefill compute per query. ~6× is the conservative default (recall + blend); ~54× when a stable working set is reused across a session. Break-even at ~1 query.
Quality
Head-to-head vs Mem0 at a matched budget: 60% vs 55% answer correctness (n=20, within noise). A cost result at comparable quality — not a recall-accuracy win.
Reproduces on a single RTX 4070 in minutes. Full methodology, per-query data, and honest limits → BENCHMARK.md. The absolute accuracy is retrieval-bound and identical across both arms; the claim here is fidelity + cost, not recall SOTA.
Text memory lives outside the model and gets re-prefilled. Atelya is the bridge: it recalls the relevant set, then serves its precomputed KV — so the model attends to memory it never has to recompute.
Embeddings pull the chunks relevant to this turn. Atelya composes with your recall layer — bring your own retriever.
Each chunk is served from its precomputed attention KV — position-independent, ~15% selective recompute — instead of a full re-prefill.
A value-model policy (relevance × recency × reuse − size) keeps the hottest memory resident on GPU and tiers the rest to CPU/disk.
The re-prefill tax grows with memory size × turns. Atelya pays off most where both are large — long sessions over a stable, sizeable working set.
Repo context, file trees, and prior edits reused across dozens of turns. Re-prefilling the codebase on every step is the dominant input cost.
Long per-user history plus a product knowledge base injected on every reply — the same context, paid for again and again.
Many questions against one large corpus. Recall the relevant chunks once, then reuse their KV across the whole investigation.
Multi-step tasks where accumulated state grows through the session. The longer it runs, the more the flat curve saves.
vLLM or SGLang on a CUDA GPU. You own inference, so Atelya injects reused KV — the working set is prefilled once and never recomputed. This is the flat cost curve at the top of the page.
You don't control the model, but you still cut the bill. Atelya's drop-in proxy does two things: it recalls only the memory relevant to the turn instead of re-sending everything, and it keeps that memory in the provider's prompt cache — so repeat context is billed at cache-read rates (roughly a tenth on Anthropic, about half on OpenAI), not full price.
Not the flat curve — you can't reuse KV in a model you don't host, and the saving depends on how stable your context is. But it's a 5-line swap: point base_url at amem proxy, keep your model.
Mem0, Zep, and EverOS lead public recall benchmarks (LoCoMo / LongMemEval). Atelya composes with a recall layer; its edge is the cost of serving memory at parity fidelity.
CacheBlend reuses per-token attention KV. SSM / Mamba-hybrid models keep a compressed recurrent state — there's no per-token KV to blend, so the moat doesn't apply to them.
KV is ~1000× the size of the text it represents; Atelya tiers it across GPU / CPU / disk. You buy lower compute with more storage.
Tell us about your stack — model, serving setup, and how much memory your agents carry. We're looking for a few teams self-hosting vLLM / SGLang to run the cost curve on real workloads.
hello@atelyaos.com