This post is less humour, more plain tech speak. I’ll be back with more laughter in the next post.

What I learned building a genealogy AI through five architectural generations

Most AI demos work once. Production systems need to work on question 47 of a multi-turn conversation in Polish, where the user asked about “her” and meant someone from three turns ago.

This is a write-up of the non-obvious decisions that made the difference — including the ones that came from getting things wrong first.

The breaking point: embeddings for kinship terms

In an earlier architecture, the planner worked via tool calling. Each genealogy operation was a named function with a natural-language description. The model would read the question, reason over the tool descriptions, and select the right calls in the right order.

It worked well enough until kinship terms started causing subtle misfires.

“Sister” and “aunt” live close together in embedding space. Both female, both relational, both family. The model would occasionally wire up the wrong operation — not reliably wrong, which would be easy to catch, but *occasionally* wrong in ways that produced plausible-sounding but incorrect answers. The same pattern showed up across other near-synonym pairs in the kinship vocabulary.

You can’t fix that by rewriting the descriptions. The failure is structural: semantic similarity is the wrong selection mechanism for precise operations over structured data. Tool calling is designed for fuzzy natural-language intent matching. What I needed was something typed and compositional.

That’s what pushed me toward operator algebra.

The algebra approach

Instead of calling functions, the planner now emits a semantic operator graph in a custom algebra over genealogy concepts. The LLM never sees raw data and never executes anything — it only writes a complete, self-contained plan. A deterministic runtime then compiles and executes it.
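To make the idea of a typed, compositional plan concrete, here is a minimal sketch in Python. It is not the actual algebra: the operator names (`person`, `parents`, `siblings`, `count`), the type names, and the linear-chain plan shape are all assumptions made for illustration. It only shows how a plan can be checked structurally before anything runs.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical operator signatures: name -> (input type, output type).
# An input type of None marks a source operator that starts a plan.
SIGNATURES = {
    "person": (None, "PersonSet"),
    "parents": ("PersonSet", "PersonSet"),
    "siblings": ("PersonSet", "PersonSet"),
    "count": ("PersonSet", "Number"),
}


class PlanError(ValueError):
    """Raised at parse time, before any query touches the data."""


@dataclass(frozen=True)
class Step:
    op: str
    arg: Optional[str] = None


def validate(plan: List[Step]) -> str:
    """Type-check a linear plan and return the type of its result."""
    if not plan:
        raise PlanError("empty plan")
    current = None  # type flowing out of the previous step
    for index, step in enumerate(plan):
        if step.op not in SIGNATURES:
            raise PlanError(f"step {index}: unknown operator {step.op!r}")
        expected, produced = SIGNATURES[step.op]
        if expected != current:
            raise PlanError(
                f"step {index}: {step.op} expects {expected}, got {current}"
            )
        current = produced
    return current


# "How many siblings do Anna's parents have?" as a typed chain.
plan = [Step("person", "Anna"), Step("parents"), Step("siblings"), Step("count")]
result_type = validate(plan)
```

In this toy version, "sister" and "aunt" stop being neighbours in a similarity space: one is `siblings` applied to a person, the other is `siblings` applied to `parents`. A plan that wires them in an impossible order fails with a `PlanError` instead of returning a plausible wrong answer.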

This separation pays off in several ways:

Parse-time validation. The plan is structurally checked before any query runs. Malformed plans fail fast with typed errors, not mid-execution surprises.

Full observability. Every plan is a text artefact — loggable, diffable across prompt versions, visualisable in a debugger. No black box.

Trivial caching. A plan is a pure function of (normalised question, prompt version). The cache key writes itself. On a hit: zero LLM tokens, sub-millisecond lookup. The cache auto-invalidates when the extractor prompt changes — version IDs are embedded in the key.

Privacy by design. The planner reasons over abstract algebra, not over the user’s family data. The GEDCOM file never touches the model layer.
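The caching point can be sketched in a few lines. The normalisation steps, the SHA-256 key, and the `PlanCache` class below are assumptions for illustration; the post only states that the key combines the normalised question with a prompt version.

```python
import hashlib
import unicodedata


def normalise(question: str) -> str:
    """Assumed normalisation: NFKC, case-folding, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", question).casefold()
    return " ".join(text.split())


def plan_cache_key(question: str, prompt_version: str) -> str:
    """Hash the prompt version together with the normalised question."""
    payload = f"{prompt_version}\x00{normalise(question)}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class PlanCache:
    """In-memory stand-in for whatever store holds the plans."""

    def __init__(self):
        self._plans = {}

    def get_or_plan(self, question, prompt_version, planner):
        """Return a cached plan, calling planner(question) only on a miss."""
        key = plan_cache_key(question, prompt_version)
        if key not in self._plans:
            self._plans[key] = planner(question)
        return self._plans[key]
```

Because the version is part of the key, bumping it makes every old entry unreachable without any explicit invalidation step.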

The honest trade-off: the algebra prompt isn’t small. Tool calling let me give the model a narrow, task-scoped context. The operator grammar is rich enough to express complex genealogical queries across multiple languages, and the prompt reflects that. The gain in reliability and debuggability has been worth it, but it’s a real trade-off, not a free lunch.

Determinism compounds

Once execution is deterministic, a lot of previously hard problems become mechanical.

The quality gate is a good example. After the answer agent produces a response, a fast Haiku call judges it against the plan and the retrieved facts and returns one of five possible verdicts. If the plan was the problem, the orchestrator re-plans with Sonnet plus the judge’s feedback, bypassing the cache. If the answer was the problem, it re-runs the answer agent with a stronger model. The retry paths require no heuristics: execution state is fully reproducible, so retry is just re-entry.
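A rough sketch of that routing, under stated assumptions: only three verdicts are modelled here (pass, plan at fault, answer at fault), since the full set of five is not spelled out, and the `plan`, `execute`, `answer`, and `judge` callables are hypothetical stand-ins for the real agents.

```python
from enum import Enum


class Verdict(Enum):
    # Hypothetical names; only the verdicts described in the text are modelled.
    PASS = "pass"
    PLAN_FAULT = "plan_fault"
    ANSWER_FAULT = "answer_fault"


def answer_with_gate(question, plan, execute, answer, judge):
    """Run the pipeline once, then retry along the path the judge blames."""
    current_plan = plan(question, feedback=None, use_cache=True)
    facts = execute(current_plan)
    reply = answer(facts, strong=False)

    verdict, feedback = judge(current_plan, facts, reply)
    if verdict is Verdict.PLAN_FAULT:
        # Re-plan with the judge's feedback and skip the cache, then
        # re-enter the deterministic execution and the answer agent.
        current_plan = plan(question, feedback=feedback, use_cache=False)
        facts = execute(current_plan)
        reply = answer(facts, strong=False)
    elif verdict is Verdict.ANSWER_FAULT:
        # The plan and facts were fine, so only the answer step is redone.
        reply = answer(facts, strong=True)
    return reply
```

The retry branches are plain re-entry into the same functions, which is only safe because `execute` returns the same facts for the same plan.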

The inference loop follows the same pattern. For hypothesis questions (“could X be the father of Y?”), the planner emits a terminal inference operator rather than a value-return operator. The orchestrator detects this, runs an inference agent over the retrieved facts, and allows up to two follow-up sub-questions — each re-entering the full planner pipeline, cache-first. Because sub-questions are internally generated and precise, they skip the discourse enrichment step. Clean re-entry.
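The bounded follow-up loop might look something like this. The `infer` and `ask_pipeline` callables, their return shapes, and the `can_ask` and `enrich_discourse` flags are assumptions made for the sketch; only the limit of two sub-questions and the skipped enrichment step come from the description above.

```python
MAX_SUB_QUESTIONS = 2


def run_inference(question, facts, infer, ask_pipeline):
    """Let the inference agent ask at most two follow-ups before answering.

    infer(question, facts, can_ask) returns ("answer", text) or
    ("ask", sub_question). ask_pipeline(sub_question, enrich_discourse)
    re-enters the planner pipeline and returns a list of new facts.
    """
    facts = list(facts)  # avoid mutating the caller's list
    for remaining in range(MAX_SUB_QUESTIONS, -1, -1):
        kind, value = infer(question, facts, can_ask=remaining > 0)
        if kind == "answer":
            return value
        if remaining == 0:
            raise RuntimeError("inference agent asked beyond its budget")
        # Sub-questions are generated internally, so enrichment is skipped.
        facts.extend(ask_pipeline(value, enrich_discourse=False))
```

Treating an over-budget follow-up as an error, rather than silently answering, is one possible design choice; a real orchestrator might instead force a best-effort answer.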

Determinism also made the eval harness trivial to build. Plan comparisons across prompt versions, regression detection, score tracking — all straightforward once the execution layer produces stable, reproducible outputs.

The synthetic fact sheet

The answer agent never receives raw query results. Before it’s called, a deterministic C# layer constructs a structured payload from the execution output:

PEOPLE — named entities relevant to the answer

FACTS — grounded key/value statements derived from the plan

EVENTS — structured event records (birth, death, marriage, etc.)

The model is constrained to reference only what’s in these blocks. It cannot reach into the GEDCOM, cannot use outside knowledge, cannot hallucinate a fact that isn’t in FACTS. This grounding — not a bigger model — is what produces reliable answers. Haiku, given a well-constructed fact sheet, handles the majority of genealogy questions correctly.
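As an illustration of the payload shape, here is a small Python sketch (the real layer is C#, and its exact format is not shown in this post). The field names, the rendering format, and the sample person are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FactSheet:
    people: List[str] = field(default_factory=list)
    facts: Dict[str, object] = field(default_factory=dict)
    events: List[dict] = field(default_factory=list)

    def render(self) -> str:
        """Serialise the three blocks into the text handed to the answer agent."""
        lines = ["PEOPLE"]
        lines += [f"- {name}" for name in self.people]
        lines.append("FACTS")
        lines += [f"- {key}: {value}" for key, value in self.facts.items()]
        lines.append("EVENTS")
        lines += [
            f"- {event['type']} | {event['person']} | {event['date']}"
            for event in self.events
        ]
        return "\n".join(lines)


# Invented sample data, not taken from any real tree.
sheet = FactSheet(
    people=["Anna Example"],
    facts={"child_count": 3},
    events=[{"type": "birth", "person": "Anna Example", "date": "1901"}],
)
prompt_block = sheet.render()
```

Everything the answer agent can cite has to pass through `render`, so a fact that execution never produced simply has nowhere to appear in the prompt.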

Multi-agent specialisation at the model tier

Each agent does exactly one thing. Model assignment is deliberate:

Sonnet for planning — the algebra requires genuine compositional reasoning

Haiku for everything else: extraction, subject resolution, quality gate, answer generation, guardrails

I started with Qwen2.5 for the lighter agents. In retrospect, moving to Haiku earlier would have paid off: its instruction-following is significantly better, and that matters when you need tight JSON contracts and precise output shapes. I sometimes wonder whether adopting Haiku earlier might even have kept tool calling viable for longer. Probably not, but it would have been a closer contest.

The cost difference between Haiku and Sonnet is large enough that keeping Sonnet scoped to the planning step — and nothing else — is where the per-query economics become workable at scale.

Security looks different for agentic systems

The OWASP Agentic Top 10 is a useful starting frame. Defence-in-depth here means catching attacks at two boundaries:

    User input
         │
         ▼
┌─────────────────┐
│  Ingress Guard  │──► BLOCKED  (jailbreak / unsafe content)
└────────┬────────┘
         │ safe
         ▼
┌─────────────────────────────┐
│   Multi-Agent Pipeline      │
│   Plan → Execute → Answer   │
└────────┬────────────────────┘
         │
         ▼
┌─────────────────┐
│  Egress Guard   │──► BLOCKED  (sensitive output / policy violation)
└────────┬────────┘
         │ safe
         ▼
      Answer

A few patterns that emerged in practice:

Sub-question injection. The inference agent emits natural-language sub-questions that re-enter the planner. A jailbroken inference agent could attempt to inject planner DSL directly. The orchestrator validates sub-questions before planning: it rejects anything over a length threshold and any string containing operator syntax.

External data sandboxing. Supplemental data from web fetches is XML-fenced before reaching the LLM context. The model is instructed to treat that block as data, not instructions.

Web fetch body cap. Streaming reads are capped at 128 KB. Malicious or oversized external pages cannot cause token overflow.
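The three patterns above can be sketched as small standalone helpers. The 200-character threshold, the regular expression for "operator syntax", and the `external_data` tag name are assumptions; only the 128 KB cap comes from the text.

```python
import io
import re
from xml.sax.saxutils import escape

MAX_SUB_QUESTION_CHARS = 200  # assumed threshold
OPERATOR_SYNTAX = re.compile(r"[(){}\[\]|]|->|=>")  # assumed DSL markers
MAX_BODY_BYTES = 128 * 1024


def is_safe_sub_question(text: str) -> bool:
    """Accept only short strings that contain no DSL-looking syntax."""
    return len(text) <= MAX_SUB_QUESTION_CHARS and not OPERATOR_SYNTAX.search(text)


def fence_external(text: str) -> str:
    """Wrap fetched text in an XML-style fence.

    Escaping <, > and & stops the payload from closing the fence early.
    """
    return f"<external_data>\n{escape(text)}\n</external_data>"


def read_capped(stream, limit=MAX_BODY_BYTES, chunk_size=8192) -> bytes:
    """Read at most `limit` bytes from a binary stream, chunk by chunk."""
    chunks, total = [], 0
    while total < limit:
        chunk = stream.read(min(chunk_size, limit - total))
        if not chunk:
            break
        chunks.append(chunk)
        total += len(chunk)
    return b"".join(chunks)


# An oversized body is truncated at the cap instead of being read in full.
body = read_capped(io.BytesIO(b"x" * 200_000))
```

A character filter this blunt will also reject some legitimate questions, for example ones with parentheses. That may be an acceptable cost for internally generated sub-questions, but it would be too strict for user input.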

None of this is exotic — but agentic systems accumulate more surfaces than stateless APIs, and the failure modes are less familiar to most teams.

Language-agnostic from the start

Genealogy users skew older, rarely reset conversations, and ask follow-ups across many turns. “How many children did she have?” — asked after a question about someone three exchanges back — is a normal interaction pattern. Getting pronoun resolution right across languages matters.

The approach is two-stage. A deterministic multilingual preprocessor (no LLM) classifies tokens as pronouns, possessives, demonstratives, or interrogatives using per-language lexical tables. Romance language subject-verb inversions (“est-elle” → “elle”) are resolved with a post-pass that checks the suffix of any hyphenated token against all pronoun tables. This stage produces a structured list of candidate referents with no English assumptions anywhere.
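A stripped-down sketch of the first stage, assuming tiny hand-written tables for English, French, and Polish and covering only pronouns and possessives. The table contents, the tokenising regex, and the function names are illustrative, not the production preprocessor.

```python
import re

# Tiny illustrative lexical tables; real ones would be far larger.
PRONOUNS = {
    "en": {"she", "he", "they"},
    "fr": {"elle", "il", "elles", "ils"},
    "pl": {"ona", "on", "oni", "one"},
}
POSSESSIVES = {
    "en": {"her", "his", "their"},
    "fr": {"sa", "son", "ses", "leur"},
    "pl": {"jej", "jego", "ich"},
}
TABLES = {"pronoun": PRONOUNS, "possessive": POSSESSIVES}

# Runs of Unicode letters, optionally joined by hyphens ("est-elle").
TOKEN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def classify(word, lang):
    """Return the category of a word in the given language, or None."""
    for kind, table in TABLES.items():
        if word in table.get(lang, ()):
            return kind
    return None


def candidates(text, lang):
    """List (token, kind) pairs for every referring token in the text."""
    found = []
    for match in TOKEN.finditer(text.casefold()):
        token = match.group()
        kind = classify(token, lang)
        if kind is None and "-" in token:
            # Post-pass for inversions: test the part after the last
            # hyphen against the pronoun tables of every language.
            suffix = token.rsplit("-", 1)[1]
            if any(suffix in words for words in PRONOUNS.values()):
                token, kind = suffix, "pronoun"
        if kind:
            found.append((token, kind))
    return found
```

For example, `candidates("Quand est-elle née ?", "fr")` yields `[("elle", "pronoun")]`, and adding a language means adding table entries rather than new branches in the code.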

An LLM stage then resolves those candidates against discourse anchors from the conversation history. One hard-won rule: the output key must be the exact input token, never translated or normalised. A Haiku failure mode early on produced “she” as the output key for the Polish token “ona” — which broke resolution for the downstream executor that was looking for “ona”.
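One cheap way to enforce the exact-key rule is a check between the resolver and the executor. The function name and the person identifier `I42` below are hypothetical.

```python
def check_resolution_keys(input_tokens, resolved):
    """Reject resolver output whose keys are not verbatim input tokens."""
    allowed = set(input_tokens)
    unknown = [key for key in resolved if key not in allowed]
    if unknown:
        raise ValueError(
            f"resolver returned keys not present in the input: {unknown}"
        )
    return resolved


# The verbatim Polish token passes; a translated key would be rejected.
ok = check_resolution_keys(["ona"], {"ona": "I42"})
```

With a guard like this, the "she" instead of "ona" failure surfaces as an immediate error at the boundary rather than as a missing referent further downstream.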

First-person references in any language (“my”, “mein”, “mon”, “mio”) map to a sentinel value the runtime resolves to the home person at query time. The planner emits this natively, so first-person works without any English keywords in the pipeline.

What surprised me

The algebra paid off faster than expected. Plan caching alone meaningfully reduced LLM call volume within weeks of the initial implementation. Structural validation caught prompt regressions that would have been silent failures in a tool-calling design — wrong answers with no error signal.

Haiku is capable of most of it. The instinct is always to reach for the strongest model. In a multi-agent pipeline where each agent has a narrow job and receives a deterministically constructed fact sheet, Haiku handles the majority of work correctly. The planning step is where the reasoning budget actually matters.

Language-agnostic is harder upfront, much cheaper long-term. Building pronoun resolution as an English feature and retrofitting other languages later would have been painful. Building it as a multilingual grammar problem from the start — no hardcoded English logic, per-language lexical tables — meant French and Polish support was additive, not surgical.

Determinism unlocks everything above it. The quality gate, the eval harness, the inference retry loop, and the plan cache — none of them would be clean to implement on top of a non-deterministic execution layer. Each layer that becomes deterministic makes the one above it easier to reason about.

This is version 5 of the architecture. Versions 1–4 are a graveyard of approaches that seemed reasonable and didn’t survive contact with real queries.