KV-native memory · self-hosted agents

Stop re-prefilling your agent's memory every turn.

Text-based memory re-feeds the same retrieved context into the model on every turn, so serving cost climbs with the conversation. Atelya OS keeps the relevant working set as reused KV: compute it once, reuse it. Same answers, a fraction of the prefill.

$pip install atelyacopy
Talk to us →
Cumulative prefill cost over a session
schematic · ratio measured (n=200 · RTX 4070)
turn 1 turn 30 → cumulative tokens prefilled
Text memory — re-prefilled every turn (linear) Atelya — KV reused (one-time prefill, then ~flat)

The shaded gap is the saving: reusing KV instead of re-prefilling memory runs about 6–54× cheaper over a session, at parity answer quality — roughly 6× by default, up to ~54× for a stable working set.

01
The evidence

Measured, not estimated.

Driving the real engine (CacheBlend KV reuse + residency) and scored by an independent LLM judge — the same judge on both arms. Every figure reproduces from the repo.

Fidelity

97.5%

Answer-for-answer agreement with a cold full re-prefill of the same context (n=200, LLM-judged). Reusing KV instead of recomputing it does not change the answer.

Prefill cost

~6–54×

Less prefill compute per query. ~6× is the conservative default (recall + blend); ~54× when a stable working set is reused across a session. Break-even at ~1 query.

Quality

Parity

Head-to-head vs Mem0 at a matched budget: 60% vs 55% answer correctness (n=20, within noise). A cost result at comparable quality — not a recall-accuracy win.

Reproduces on a single RTX 4070 in minutes. Full methodology, per-query data, and honest limits → BENCHMARK.md. The absolute accuracy is retrieval-bound and identical across both arms; the claim here is fidelity + cost, not recall SOTA.

02
The bridge

Memory moves from text into the model.

Text memory lives outside the model and gets re-prefilled. Atelya is the bridge: it recalls the relevant set, then serves its precomputed KV — so the model attends to memory it never has to recompute.

Recall / bge

Embeddings pull the chunks relevant to this turn. Atelya composes with your recall layer — bring your own retriever.

Reuse / CacheBlend

Each chunk is served from its precomputed attention KV — position-independent, ~15% selective recompute — instead of a full re-prefill.

Reside / KvPolicy

A value-model policy (relevance × recency × reuse − size) keeps the hottest memory resident on GPU and tiers the rest to CPU/disk.

03
Where it pays off

For agents that carry a lot of context.

The re-prefill tax grows with memory size × turns. Atelya pays off most where both are large — long sessions over a stable, sizeable working set.

Coding agents

Repo context, file trees, and prior edits reused across dozens of turns. Re-prefilling the codebase on every step is the dominant input cost.

Support & conversational agents

Long per-user history plus a product knowledge base injected on every reply — the same context, paid for again and again.

Research & document agents

Many questions against one large corpus. Recall the relevant chunks once, then reuse their KV across the whole investigation.

Long-running / autonomous agents

Multi-step tasks where accumulated state grows through the session. The longer it runs, the more the flat curve saves.

04
However you serve the model

Two ways to cut the memory bill.

You self-host flat curve · 6–54×

vLLM or SGLang on a CUDA GPU. You own inference, so Atelya injects reused KV — the working set is prefilled once and never recomputed. This is the flat cost curve at the top of the page.

Drop-in KV-native server: amem kv-serve — CacheBlend reuse + residency.

Closed API — Claude / GPT cache-read pricing

You don't control the model, but you still cut the bill. Atelya's drop-in proxy does two things: it recalls only the memory relevant to the turn instead of re-sending everything, and it keeps that memory in the provider's prompt cache — so repeat context is billed at cache-read rates (roughly a tenth on Anthropic, about half on OpenAI), not full price.

Not the flat curve — you can't reuse KV in a model you don't host, and the saving depends on how stable your context is. But it's a 5-line swap: point base_url at amem proxy, keep your model.

05
Straight about the edges

What Atelya is not.

Not a recall-accuracy leaderboard.

Mem0, Zep, and EverOS lead public recall benchmarks (LoCoMo / LongMemEval). Atelya composes with a recall layer; its edge is the cost of serving memory at parity fidelity.

Transformer-only.

CacheBlend reuses per-token attention KV. SSM / Mamba-hybrid models keep a compressed recurrent state — there's no per-token KV to blend, so the moat doesn't apply to them.

A storage trade.

KV is ~1000× the size of the text it represents; Atelya tiers it across GPU / CPU / disk. You buy lower compute with more storage.

Design partners

Running open-model agents that remember a lot?

Tell us about your stack — model, serving setup, and how much memory your agents carry. We're looking for a few teams self-hosting vLLM / SGLang to run the cost curve on real workloads.

hello@atelyaos.com