Comparing My Build with GPT’s Response
When I asked GPT to explain why my little intent‑detection system works, it delivered an answer so clear, so structured, so annoyingly eloquent that I briefly considered hiring it as my co‑author. If you can explain things that well, you’re a better writer than me – and I say that without a hint of bitterness. Well… maybe a hint.
Seriously: re‑read that explanation. Let it sink in. If you take anything away from this blog series, let it be that moment of clarity about how LLMs reason, why they sometimes fail, and how to design around their quirks.
The Takeaways (for switching GEDCOMs)
Here’s what all this means in practice:
- It’s safe to use an LLM to detect file‑switch intent. Worst case, it pops up an “Open File” dialog and the user cancels it. No harm done.
- It may occasionally miss a switch request. Worst case, the user gets a slightly bewildering genealogy answer instead of a file picker.
- A “Change GEDCOM” button is still a good idea. UI affordances exist for a reason.
- If the LLM says “yes” but the user cancels, record that phrase. Add it to an exclusion list so the system gets smarter over time.
- Phrase harvesting could be smarter. For example, if the user clicks the “Change File” button immediately after receiving an answer, the original question was probably a switch request that the system missed.
This hybrid approach of rules, harvesting, and a tiny LLM gives you the best of all worlds: speed, adaptability, and a safety net.
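To make that concrete, here’s a minimal sketch of the flow described above, written in Python purely for illustration. None of these helper names (`detect_switch_rules`, `ask_tiny_llm`, `show_open_file_dialog`) come from my actual build; they’re hypothetical stand-ins for the rule matcher, the tiny model call, and the file-picker UI.

```python
# Illustrative sketch of the rules -> exclusion list -> tiny LLM -> confirmation
# flow. Every helper below is a hypothetical stand-in, not my real code.

EXCLUSION_LIST: set[str] = set()   # phrases the LLM flagged but the user cancelled

def detect_switch_rules(message: str) -> bool:
    """Cheap pattern match for the obvious switch requests."""
    keywords = ("load", "open", "switch", "new file", "other gedcom")
    text = message.lower()
    return any(k in text for k in keywords)

def ask_tiny_llm(message: str) -> bool:
    """Placeholder for the small model's YES/NO intent check."""
    return False  # replace with a real call to your local model

def show_open_file_dialog() -> bool:
    """Placeholder UI call: returns True only if the user actually picked a file."""
    return False  # replace with your real file picker

def handle_message(message: str) -> str:
    normalised = message.strip().lower()

    # Phrases we already know are false positives (the user cancelled before).
    if normalised in EXCLUSION_LIST:
        return "answer_question"

    rule_hit = detect_switch_rules(message)
    llm_hit = not rule_hit and ask_tiny_llm(message)  # only ask the LLM when rules miss

    if rule_hit or llm_hit:
        if show_open_file_dialog():
            return "file_switched"
        if llm_hit:
            # The LLM said YES but the user cancelled: remember the phrase.
            EXCLUSION_LIST.add(normalised)
        return "answer_question"

    # No switch intent detected: answer the genealogy question as usual.
    return "answer_question"
```

The important property is the ordering: the exclusion list and the rules run before the model, so the LLM only ever sees the messages the cheap checks couldn’t resolve.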
The Lightbulb Moment
After GPT gave its beautifully structured explanation, I asked a follow‑up question: Why are you so confident a bigger model would catch more obscure switch requests?
Its answer triggered a lightbulb moment.
If a more powerful model like “Reasoner v1” can detect subtler, more ambiguous phrasing, then I shouldn’t replace my current system. I should add a skill.
A “skill” in this context is a specialised capability the system can call when needed. We’ll talk about skills properly in a later section, but the important idea is this:
Bigger models shouldn’t replace smaller ones; they should complement them.
Use the tiny model for speed. Use the big model for nuance. Use harvesting to reduce reliance on both.
That’s the architecture of a system that improves itself over time.
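To show what “complement, not replace” might look like in practice, here’s a rough escalation sketch, again in illustrative Python. `tiny_classify` and `big_classify` are assumptions standing in for whatever small and large models you actually run, and the confidence threshold is likewise made up.

```python
# A rough escalation sketch: tiny model first, bigger model only when needed.
# tiny_classify() and big_classify() are hypothetical wrappers, not real APIs.

from dataclasses import dataclass

@dataclass
class IntentResult:
    is_switch: bool     # did the model think this is a file-switch request?
    confidence: float   # 0.0 - 1.0, however your model reports certainty

def tiny_classify(message: str) -> IntentResult:
    """Fast 1B-class model: cheap, good enough for obvious phrasing."""
    return IntentResult(is_switch=False, confidence=0.0)  # replace with a real call

def big_classify(message: str) -> IntentResult:
    """Larger model: slower, but better on indirect phrasing like 'the Smith dataset'."""
    return IntentResult(is_switch=False, confidence=0.0)  # replace with a real call

def detect_switch_intent(message: str, threshold: float = 0.8) -> bool:
    small = tiny_classify(message)
    if small.confidence >= threshold:
        return small.is_switch               # tiny model is sure: trust it, stay fast
    return big_classify(message).is_switch   # otherwise pay for the nuance
```

The tiny model answers most messages on its own; the expensive model is only consulted when the cheap one admits it isn’t sure.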
A Bigger Model Does Not Guarantee Success
And here’s the part I appreciated most: ChatGPT was refreshingly honest about the limitations of larger models. Bigger doesn’t always mean better. More parameters don’t magically eliminate ambiguity. Even the smartest models can misinterpret intent, especially when phrasing is vague or context is thin.
ChatGPT was very open about this – and I’ve included the transcript next, because it’s worth reading.
A bigger model like me is less likely to get it wrong – but never immune, and the reasons are grounded in how scale changes behaviour rather than in any magical guarantee.
🧠 Why larger models handle nuance better
1. Richer internal representations
A small 1B model has a very compressed “mental space.” It can recognise surface patterns like “load”, “open”, “new file”, but it struggles with semantic indirection such as:
- “Switch over to the Smith dataset”
- “Start fresh with my cousin’s tree”
- “Use the other genealogy file instead”
A larger model has far more parameters to encode:
- Longer-range dependencies
- Subtle intent cues
- Rare phrasing patterns
- Domain-transfer generalisation
So it can map “Smith dataset” → “probably a file” → “probably a load action” even if the phrasing is novel.
2. More diverse training exposure
Bigger models are trained on:
- More text
- More varied phrasing
- More examples of intent classification tasks
This means they’ve seen hundreds of ways humans indirectly express “load a file,” not just the obvious ones.
3. Better instruction-following alignment
Modern alignment techniques (RLHF, DPO, etc.) scale better with larger models. That gives them stronger:
- Task adherence (“only answer YES/NO”)
- Context tracking
- Resistance to distraction
A 1B model can be derailed by a long, chatty message. A larger model is far more stable.
⚠️ But here’s the important truth: no model is perfect
Even a very capable model can still misclassify because:
- Language is ambiguous
- User intent is often implicit
- Models don’t have grounding in your app’s semantics
- LLMs don’t “understand” consequences
So the difference is not “big models never fail” — it’s:
Big models fail less often, and in more predictable ways.
That’s why your hybrid approach (pattern match → LLM → user confirmation) is exactly the right architecture.
🧪 A practical way to think about trust
You can treat model size like this:
| Model size | Strength | Weakness |
|---|---|---|
| 1B | Fast and cheap | Misses nuance |
| 7B | Good generalisation | Still brittle on edge cases |
| 70B+ | Strong semantic reasoning | Still not 100% reliable |
Even at the top end, you still need:
- Guardrails
- Confirmation UX
- Observability and feedback loops
Which you’re already doing.
🎯 The real answer to your question
Yes — larger models are better because they have:
- More parameters
- More training data
- Better alignment
- Better semantic generalisation
But none of that makes them trustworthy in the sense of “safe to rely on without guardrails.”
They’re trustworthy in the sense of:
“They’ll get it right more often, especially in weird or indirect phrasing.”
They’re not trustworthy in the sense of:
“They will always correctly infer user intent.”
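To make the transcript’s “only answer YES/NO” point concrete, here’s roughly what that kind of constrained prompt can look like. This is a hedged sketch, not my production prompt, and `query_local_model` is a placeholder for whichever inference call your own stack uses.

```python
# Illustrative only: a constrained YES/NO classification prompt of the kind the
# transcript describes. query_local_model() is a placeholder, not a real API.

PROMPT_TEMPLATE = """You are an intent classifier for a genealogy assistant.
The user is either asking a genealogy question OR asking to load a different
GEDCOM file. Reply with exactly one word: YES if the message asks to open or
switch to another file, NO otherwise.

Message: "{message}"
Answer:"""

def query_local_model(prompt: str) -> str:
    """Placeholder: send the prompt to your local model and return its raw text."""
    raise NotImplementedError("wire this up to your own inference stack")

def is_switch_request(message: str) -> bool:
    reply = query_local_model(PROMPT_TEMPLATE.format(message=message))
    # Anything other than a clean "YES" is treated as NO: err on the safe side.
    return reply.strip().upper().startswith("YES")
```

Forcing a one-word answer keeps parsing trivial and gives a small model very little room to wander off task.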
Why Bother With All This Detail?
You might be wondering why I’ve spent so much time dissecting what appears to be a tiny, almost trivial part of the build. Switching GEDCOM files isn’t exactly the glamorous heart of the project. But this stage right here is where understanding the boundaries of what we’re doing becomes absolutely essential.
If we don’t understand the limits now, we’ll trip over them later. If we don’t understand how the LLM behaves in small tasks, we’ll misjudge it in big ones. And if we don’t design the scaffolding correctly, the whole structure wobbles.
This isn’t about switching files. It’s about learning how to work with an LLM without blindly trusting it.
A Glimpse of What’s Next
In the next post, we’re going to take the exact same LLM and ask it to do something far more interesting and far more consequential than detecting file‑switch intent.
We’re going to ask it to choose skills.
That’s where things get fun.
That’s where the architecture starts to matter.
And that’s where all this groundwork suddenly pays off.
Next part: Generative AI for Genealogy – Part III
