Stop Replaying Your Entire Chat History. It’s Not “Context”, It’s Just Expensive Noise.

I also posted this article here Post | LinkedIn on May 11th, 2026.

Most “chat app” tutorials still teach the same pattern:
1. User asks Q1
2. You send Q1 → get A1
3. User asks Q2
4. You resend Q1 + A1 + Q2 → get A2
5. User asks Q3
6. You resend Q1 + A1 + Q2 + A2 + Q3 → get A3

It works fine if your ambition stops at “toy chatbot”.

If you’re building a chat‑like system that needs to be reliable, multilingual, entity‑aware, or cost‑efficient, replaying the entire transcript every turn is unnecessary. It burns tokens, increases noise, and raises the chance the model will drift.

And to head off the *pedants*: I’m talking specifically about chat‑like applications here — not every possible LLM workflow.

The replay pattern exists because LLMs are stateless; your application doesn’t have to be.

Here’s my architecture, so it isn’t just “theory”

User: “When was grandpa Fred born?”
App: Plans, fetches the data → “12 May 1954, London.”

User: “How many grandchildren did he have?”
App: Plans, resolves “he”, fetches the data → 15.

No replay of the entire conversation, sending Fred’s biography again.

By tracking state server‑side and running deterministic discourse resolution, I can rewrite the plan as if the user had asked: “How many grandchildren did grandpa Fred have?”

I send the model only what it cannot infer from the latest message: No noise. No drift. No hallucinating uncles.

The principle is simple:
1. Only send the model what it cannot know from the user’s latest message.
2. Everything else belongs in your application logic.

Of course, it’s harder than replaying the transcript. But if you care about correctness, cost, or long‑running conversations, it’s an approach that scales.

If you want to build serious chat‑like systems, start here:
– State management: the only non‑negotiable.
– Referent resolution: essential once you have multiple entities or pronouns.
– Structured planning: not required for simple bots, but foundational for deterministic, tool‑using agents.

These aren’t buzzwords.
They’re the difference between a chatbot and an engineered system.

Here’s my architecture, so it isn’t just “theory”

Related Posts