Generative AI for Genealogy – Part II

My Love-Hate Relationship with Prompt Engineering

I’ll be honest: I hate prompt engineering. I hate that LLMs sometimes behave like whimsical Victorian children who follow instructions only when the moon is in the correct phase.

It’s not that they’re bad. It’s that they’re opaque.

When the model returns “[YES]” or “[NO],” I can’t easily tell why. What tipped the scales? Which token nudged which neuron? Was it the word “load”? Was it the phrasing? Was it the model having a bad day?

With humans, at least you can ask. With LLMs, you get a shrug in vector form.

I’ve seen research on visualising neuron activations as text flows through the model. It’s fascinating. It’s beautiful. It’s also not remotely reassuring. It doesn’t make me trust the model any more than I trust a cat walking across a piano.

Why These Phrases Work

The list of default phrases didn’t come from magic. It came from thinking about semantics – words that clearly indicate “load a file” and don’t overlap with normal genealogy questions.

If you’re asking about a family member, you’re not going to say “new file.” If you’re switching GEDCOMs, you’re not going to say “where was Bart born?”

The separation is clean. The intent is distinct. The classifier has a fighting chance.
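
To make that concrete, here’s roughly what such a default phrase list looks like as data. The exact phrases below are illustrative, not necessarily the app’s actual defaults:

```python
# Illustrative defaults only; the application's real list may differ.
LOAD_PHRASES = {
    "load",
    "open",
    "new file",
    "new gedcom",
    "different file",
}

# Words you'd expect in ordinary genealogy questions instead.
GENEALOGY_QUESTION_WORDS = {"who", "where", "when", "born", "married", "died"}

# The whole point of the design: the two vocabularies don't overlap,
# so the classifier has a fighting chance.
assert not (LOAD_PHRASES & GENEALOGY_QUESTION_WORDS)
```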

And now, the part Bing/GPT said so eloquently…

You’re basically asking: what is this little model actually doing in its head when it decides “[YES]” vs “[NO]”, and how far can I trust that? Let’s unpack that in practical terms.

1. What the model is really doing (at a high level)

Under the hood, the model is not symbolically reasoning “this is a load-request, therefore YES”. It’s doing something more like:

  1. Embed the prompt + user text into a big fuzzy representation. It “sees” both:
    • Your instruction (“You must reply with whether the user is asking to load a new file…”)
    • The user’s actual query (“new gedcom”, “who is the pope?”, etc.)
  2. Predict the next token that best fits training patterns + your instruction. It was fine-tuned on a lot of instruction-following examples like:
    • “If X, answer ‘A’, otherwise ‘B’”
    • “Classify the intent of the user”
    • “Return only one of these labels”
    So when it sees:
    • The pattern: “If user wants to load a new file → reply [YES], else [NO]”
    • Plus an example that looks similar to “user wants to load a file”
    The highest-probability next output token is usually "[", followed by "YES" or "NO".
  3. “Load a file” intent is pattern-matched, not logically derived. For phrases like:
    • “new gedcom”
    • “new file”
    • “load”
    • “different file”
    These sit close, in the model’s learned representation, to countless training examples involving loading/opening/importing content. The model has internalised a kind of “semantic cluster” around file-loading operations. When your prompt says “If the user appears to want to load a new file…”, the model aligns that cluster with “should output [YES]”.

So, conceptually, it feels like intent recognition, but operationally, it’s “statistical pattern completion” conditioned by your explicit instructions.
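
Here’s a minimal sketch of how that “statistical pattern completion” gets framed in practice. The prompt wording paraphrases the instruction quoted above, and `complete` is a stand-in for whatever call drives your local 1B model – it is not any specific library’s API:

```python
# Sketch only: `complete` stands in for whatever runs the local 1B model
# (llama.cpp, Ollama, etc.); it is not a real library function here.

SYSTEM_PROMPT = (
    "You must reply with whether the user is asking to load a new file "
    "(gedcom). If so, reply only [YES]. Otherwise reply only [NO]. "
    "Do not attempt to answer the actual question.\n"
    "Example: 'new gedcom' -> [YES]\n"
    "Example: 'who is the pope?' -> [NO]\n"
)

def classify_intent(user_text: str, complete) -> str:
    """Frame the instruction + user text as one prompt and let the model continue it."""
    prompt = f"{SYSTEM_PROMPT}\nUser: {user_text}\nAnswer: "
    # With a prompt this constrained, the highest-probability continuation
    # is usually "[" followed by "YES" or "NO".
    return complete(prompt, max_tokens=4).strip()
```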

2. Why it gets your examples right (positives and negatives)

Your prompt is actually pretty good for a tiny model, because it gives:

  • A clear binary task: Only “[YES]” or “[NO]”.
  • A crisp criterion: “asking to load a new file (gedcom)”.
  • Positive and negative examples.

So for:

  • “new gedcom” → [YES]
    • “gedcom” is strongly associated with genealogical data files.
    • “new” + filename-like token → matches the “new file” / “load” semantic cluster.
    • The examples you gave reinforce this: “new gedcom” is extremely close to “new file”.
  • “who is the pope?” → [NO]
    • Question about a person, no file-related words.
    • It looks like a pure Q&A query, which your instruction (“Do not attempt to answer the actual question”) tells the model not to answer, so it falls through to the “else reply [NO]” path.

The model isn’t “understanding” in a human sense. It’s doing a kind of: “Strings with ‘load/new/file/open/gedcom/different file’ → [YES]; everything else → [NO]”, with some fuzziness around paraphrases.
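
Written out as code, the rule the model is approximating is almost embarrassingly simple. The LLM’s only real contribution is the fuzziness around paraphrases that a literal keyword test like this one misses (keyword list illustrative):

```python
# The crude rule the 1B model is effectively approximating.
KEYWORDS = ("load", "new", "file", "open", "gedcom", "different file")

def crude_intent_rule(user_text: str) -> str:
    """Literal keyword test: no tolerance for paraphrases, unlike the LLM."""
    text = user_text.lower()
    return "[YES]" if any(keyword in text for keyword in KEYWORDS) else "[NO]"

print(crude_intent_rule("new gedcom"))        # [YES]
print(crude_intent_rule("who is the pope?"))  # [NO]
```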

3. Where it will likely fail (and why you’re right not to trust it blindly)

You’re absolutely right: you can’t test every phrase. Some failure modes to expect:

  1. Subtle or indirect intents.
    • “Can we switch over to the Smith family data?”
      • This means “load a different gedcom/file” but doesn’t contain obvious keywords.
      • A 1B model might miss that and answer [NO].
    • “Let’s start a new tree from scratch.”
      • Conceptually “new file”, but without the words “file”, “load”, “gedcom”.
  2. Ambiguous language.
    • “Should I save this before I open the new thing?”
      • Could be interpreted as wanting to load something, but not explicit.
    • “Can I import the Johnsons from a separate file?”
      • Is that “load a new gedcom” or just “import from another file into current tree”?
      • Your logic may want [YES] or [NO], but the model has no grounding in your product semantics, just language patterns.
  3. Out-of-distribution wording.
    • Domain-specific or odd phrasing the model hasn’t seen close parallels for:
      • “Point the workspace at the new lineage artifact.”
      • “Hoist the secondary pedigree container.”
    The further you get from “normal” phrasing of “load/open a file”, the shakier it gets.
  4. Prompt drift and verbosity.
    • Longer, chatty user messages can hide the key intent:
      • “Hi, I’ve been using your tool for a while. I love it. I’ve got another gedcom from my cousin and I was wondering if there’s a way to merge them or maybe start from that one instead of the one I’ve been using?”
        A strong model: likely [YES]. A 1B model: could get distracted.

4. “Thought process”: what’s going on conceptually

We can’t see the literal internal steps, but we can describe types of cues it uses:

  • Lexical cues: Direct keywords: “load”, “open”, “new file”, “gedcom”, “different file”, “import”.
  • Syntactic patterns:
    • Imperatives: “load X”, “open X”.
    • Requests: “Can you load…”, “I want to load…”.
  • Semantic similarity: Phrases that behave like “load a file” in text it was trained on:
    • “Switch to another document”
    • “Open the other project”
    • “Start a new session with this file”
  • Instruction alignment: Your system prompt pushes it hard toward:
    • “Ignore answering questions.”
    • “Binary classification, answer only [YES] or [NO].”
    That dramatically simplifies the search space. That’s why such a small model can do reasonably well.

But crucially: none of this is “I have a symbolic rule: if intent==LOAD_FILE then YES”. It’s all probability mass flowing through a neural pattern.
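
One practical consequence of “it’s all probability mass”: the reply isn’t guaranteed to be one of your two labels, so it’s worth normalising whatever comes back and treating anything unexpected as the safe default. A minimal sketch:

```python
def parse_label(raw_output: str) -> str:
    """Normalise the model's reply; anything unexpected becomes the safe default."""
    cleaned = raw_output.strip().upper()
    if cleaned.startswith("[YES]") or cleaned == "YES":
        return "[YES]"
    if cleaned.startswith("[NO]") or cleaned == "NO":
        return "[NO]"
    return "[NO]"  # unexpected output: treat as "not a load request"
```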

5. How much can you trust it, really?

The honest answer: you can trust it for “clean” cases, but not for edge cases or safety-critical logic.

Use it like this:

  • Good for:
    • Making your UI feel smart in obvious situations.
    • “Nice-to-have” intent recognition that you also backstop with explicit UI controls (buttons, menus).
    • Suggesting actions (“Looks like you might want to load a new file – is that right?”).
  • Not good for:
    • The single source of truth for important state transitions. E.g. “LLM said [YES], so immediately discard the current file and load another” – dangerous.
    • Anything where misclassification would be costly: data loss, irreversible operations, security decisions.

You’ll want to treat its answer as a hint, not a command.

6. Ways to increase confidence (without “trusting” the model blindly)

To get beyond “it feels like it’s working”:

  1. Build a test set of intents.
    • Positive examples: all the different ways real users might ask to load a new file.
    • Negative examples: normal genealogical questions, editing actions, navigation, etc.
    • Include:
      • Very short queries (“load”, “new tree”).
      • Long, chatty queries.
      • Edge cases (“import”, “merge trees”, “open sample data”).
    Run them regularly, automate the evaluation, and track accuracy over time (a minimal harness is sketched after this list).
  2. Add simple rule-based checks around it. For a 1B model, a hybrid approach is often better:
    • If user text contains very strong indicators (e.g. “load”, “open”, “new gedcom”, “new file”, “different file”), you can:
      • Override the model and treat as YES, or
      • Use the model only when keywords are absent / weak.
    This way, the model only handles the truly fuzzy cases (see the hybrid sketch after this list).
  3. Make the decision require confirmation. Instead of:
    • Model: [YES] → you immediately load a new file.
    Do:
    • Model: [YES] → You show the user:
      • “It looks like you want to load a new GEDCOM file. Do you want to: [Load new file] [Cancel]”
    That turns misclassifications into a UX annoyance, not a data-loss bug.
  4. Log and refine.
    • Log user message + model’s [YES]/[NO] + what actually happened (did user proceed to load a file?).
    • Periodically sample mistakes and add them to your test set.
    • Evolve your prompt or add guardrails as you see recurring patterns.
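
A test set like the one in point 1 can be as simple as a list of (query, expected label) pairs, using examples like the ones scattered through this post, run against whatever classifier you’re currently using. The labels here are my assumptions about the desired behaviour, not verified model outputs:

```python
# Minimal evaluation harness: labelled queries drawn from examples in this post,
# run through whatever classifier you're testing.
TEST_CASES = [
    ("new gedcom", "[YES]"),
    ("load", "[YES]"),
    ("different file", "[YES]"),
    ("Can we switch over to the Smith family data?", "[YES]"),
    ("Let's start a new tree from scratch.", "[YES]"),
    ("who is the pope?", "[NO]"),
    ("where was Bart born?", "[NO]"),
]

def evaluate(classify) -> float:
    """Return accuracy of `classify` (a callable: query -> "[YES]"/"[NO]") on the test set."""
    correct = sum(classify(query) == expected for query, expected in TEST_CASES)
    accuracy = correct / len(TEST_CASES)
    print(f"{correct}/{len(TEST_CASES)} correct ({accuracy:.0%})")
    return accuracy
```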
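
And here is the hybrid idea from points 2 and 3 in sketch form: strong keywords short-circuit the model, the model only sees the fuzzy cases, and a [YES] never loads anything directly – it only triggers a confirmation. The function names and the `confirm` / `load_new_file` hooks are placeholders for whatever your app actually uses:

```python
# Hybrid intent detection: obvious keywords bypass the model, the model handles
# the fuzzy middle, and [YES] never triggers a load directly -- only a confirmation.
STRONG_INDICATORS = ("load", "open", "new gedcom", "new file", "different file")

def detect_load_intent(user_text: str, classify_with_model) -> bool:
    """Strong keywords short-circuit the model; it only sees the fuzzy cases."""
    text = user_text.lower()
    if any(indicator in text for indicator in STRONG_INDICATORS):
        return True
    return classify_with_model(user_text) == "[YES]"

def handle_message(user_text: str, classify_with_model, confirm, load_new_file) -> None:
    """Treat a [YES] as a hint, not a command: always confirm before loading."""
    if detect_load_intent(user_text, classify_with_model):
        if confirm("It looks like you want to load a new GEDCOM file. Load it?"):
            load_new_file()
        return
    # Otherwise fall through to normal question handling.
```

That turns a misclassification into a UX annoyance (one unnecessary prompt) rather than a data-loss bug.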

7. A mental model you can keep

Think of this 1B Llama as:

A competent but somewhat literal junior assistant who has read huge amounts of text and is pretty good at spotting “this sounds like loading a file” – as long as people talk in roughly normal ways. It will do the obvious things well, but it has no real model of your app, your data, or the consequences of being wrong.

So:

  • Use it as heuristic intent detection.
  • Wrap it in hard UX and logic for anything important.
  • Build a test harness and treat its performance as something you can measure, not just “trust”.
