Generative AI for Genealogy – Data vs. GEDCOM files

In case you missed it: Generative AI for Genealogy – Introduction

Genealogy data is beautiful, GEDCOM is… less so

This is why your family tree file looks like it was designed by a committee of caffeinated squirrels.

Most people start an AI project because they want to change the world. I started mine because I wanted to ask a simple question about my family tree without needing a PhD in deciphering GEDCOM files. If you’ve ever opened a GEDCOM and felt your soul leave your body, you’ll understand. It’s like reading ancient scripture, except the scripture is angry, inconsistent, and written by 47 different software vendors who never once spoke to each other.

Meanwhile, genealogy itself has become easier than ever. Ancestry.com lowered the bar so far that even your cat could build a family tree. Unfortunately, this also means people accept hints that are so obviously wrong that even their cat would notice. And once enough people accept the wrong hint, it becomes “fact,” and then it gets offered to everyone else like a contagious rumour wearing a waistcoat.

So before we talk about AI, we need to talk about data – the good, the bad, and the GEDCOM.

The Data: What Genealogy Should Look Like

At the heart of genealogy are people – otherwise the “gene” part would be awkward.

People have facts:

  • “Born 1949 in Peterborough, Northamptonshire.”
  • “Married in 1972.”
  • “Favourite GenAI: ChatGPT”, obviously.

Yes, I did write that – it is actually my favourite AI.

Facts can involve:

  • a date
  • a place
  • a value (height, occupation, cause of death, favourite biscuit)

Some facts are shared, marriage being the obvious one. But in reality, many things could be shared: photos, census entries, immigration records, that one family story everyone tells differently.
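To make that concrete, here is a minimal Python sketch of a person/fact model. The class and field names (Person, Fact, attach) are illustrative assumptions, not part of GEDCOM or of any particular genealogy library, and this is not necessarily the model used later in the series. The one design point it shows is that a fact can be attached to more than one person, so a marriage is stored once rather than copied.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Person:
    name: str
    facts: list["Fact"] = field(default_factory=list)


@dataclass(eq=False)
class Fact:
    kind: str                      # e.g. "Birth", "Marriage", "Occupation"
    date: Optional[str] = None     # kept as text; real-world dates are messy
    place: Optional[str] = None
    value: Optional[str] = None    # height, occupation, favourite biscuit
    # Back-references are left out of repr to avoid infinite recursion.
    people: list[Person] = field(default_factory=list, repr=False)

    def attach(self, person: Person) -> None:
        """Link this fact to a person, in both directions."""
        self.people.append(person)
        person.facts.append(self)


# Hypothetical example data: one marriage fact shared by two people.
alice = Person("Alice")
bob = Person("Bob")
marriage = Fact(kind="Marriage", date="1972")
marriage.attach(alice)
marriage.attach(bob)
```

Both people now point at the same Fact object, so correcting the date in one place corrects it for everyone involved.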

Facts are supported by citations, which point to sources – the “proof.” Sources come in two flavours:

  • Original: the actual document or a legible image
  • Derivative: a transcription, index, or someone’s best guess after squinting at 19th‑century handwriting

Original sources are better because they’re verifiable. Derivative sources are… well, they’re like photocopies of photocopies of photocopies. Useful, but only if you squint.
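The fact, citation, and source chain can be sketched in a few lines of Python. Everything named here (SourceKind, Source, Citation, best_citation, and the sample titles) is a hypothetical illustration rather than an established API. It simply shows one way to record whether a source is original or derivative and to prefer the original when both exist.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceKind(Enum):
    ORIGINAL = "original"      # the actual document or a legible image
    DERIVATIVE = "derivative"  # transcription, index, or best guess


@dataclass(frozen=True)
class Source:
    title: str
    kind: SourceKind


@dataclass(frozen=True)
class Citation:
    source: Source
    detail: str = ""           # e.g. a page or entry reference


def best_citation(citations: list[Citation]) -> Optional[Citation]:
    """Prefer a citation backed by an original source, if there is one."""
    for citation in citations:
        if citation.source.kind is SourceKind.ORIGINAL:
            return citation
    # Fall back to the first derivative citation, or None if there are none.
    return citations[0] if citations else None


# Hypothetical example data.
index = Source("Online index", SourceKind.DERIVATIVE)
register = Source("Parish register image", SourceKind.ORIGINAL)
cites = [Citation(index), Citation(register, "p. 12")]
chosen = best_citation(cites)
```

Here chosen is the citation pointing at the register image, even though the index entry was listed first.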

Genealogists often use the word Event interchangeably with Fact, and GEDCOM does too – though in a way that suggests the authors had a vague idea of what data modelling was but decided to freestyle it anyway.

And then there are relationships, the real backbone of a family tree. Unfortunately, this is where Ancestry followed the GEDCOM standard a little too faithfully. Fixing the modelling issues inherited from GEDCOM makes exporting back to GEDCOM painful, because you end up trying to shoehorn modern, sensible data into a format that predates the Spice Girls.

GEDCOM: The Format We Love to Hate

GEDCOM is the de facto standard because everything outputs it. Not because it’s good. Not because it’s modern. Not because it makes sense. Simply because it exists.

I’m using the non‑XML 5.5.x version for my proof‑of‑concept because:

  • it’s what everyone exports
  • it’s what everyone expects
  • it’s what everyone complains about

You can find the official standards at gedcom.org, where they are presented with the confidence of a format that absolutely believes it is perfect and not at all responsible for decades of genealogical chaos.

GEDCOM has quirks. Many quirks. Enough quirks to fill a small book. For example:

  • Events and facts are sometimes the same thing, sometimes not.
  • Shared events exist, but only in certain contexts.
  • Some tags are overloaded.
  • Some tags are under‑documented.
  • Some tags appear to have been invented on a Friday afternoon.

Parsing GEDCOM is non‑trivial because you’re not just reading a file, you’re decoding the collective historical trauma of the genealogy software industry.
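For a sense of the easy part, here is a minimal Python sketch that splits a single GEDCOM 5.5‑style line into its pieces: a level number, an optional cross‑reference id such as @I1@, a tag, and an optional value. The names GedcomLine, LINE_RE, and parse_line are my own, and the sample record is made up. It deliberately ignores the hard parts (continuation lines, character encodings, vendor‑specific tags), so treat it as a starting point under those assumptions, not a real parser.

```python
import re
from typing import NamedTuple, Optional


class GedcomLine(NamedTuple):
    level: int
    xref: Optional[str]
    tag: str
    value: Optional[str]


LINE_RE = re.compile(
    r"^\s*(?P<level>\d+)"          # nesting level
    r"(?: (?P<xref>@[^@]+@))?"     # optional record id, e.g. @I1@
    r" (?P<tag>[A-Za-z0-9_]+)"     # tag, e.g. INDI, BIRT, DATE
    r"(?: (?P<value>.*))?$"        # optional value
)


def parse_line(raw: str) -> GedcomLine:
    """Split one GEDCOM line into (level, xref, tag, value)."""
    match = LINE_RE.match(raw.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a GEDCOM line: {raw!r}")
    return GedcomLine(
        int(match["level"]), match["xref"], match["tag"], match["value"]
    )


# Made-up sample record for illustration.
sample = """0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1949
2 PLAC Peterborough, Northamptonshire"""

parsed = [parse_line(raw) for raw in sample.splitlines()]
for line in parsed:
    print(line)
```

Notice that the birth itself (1 BIRT) carries no value at all: its date and place only make sense once you use the level numbers to nest the following lines underneath it. That nesting, multiplied by every vendor's interpretation, is where the real work starts.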

I’ve built multiple GEDCOM parsers over the years, including one for an application that generates an offline Ancestry‑style website and another that prints 50‑foot‑wide family trees. (Yes, fifty. Yes, it barely fits in a village hall.) So I’ve seen the weirdness up close.

Why This Matters for AI

If you want an AI to answer genealogy questions, you need clean, structured, unambiguous data. GEDCOM gives you… something else.

So this post exists to set the stage:

  • what genealogy data should look like
  • what GEDCOM actually gives you
  • why parsing it is a heroic act
  • and why AI needs a saner internal model than the one GEDCOM provides

This is the foundation for the rest of the series. Part I kicks off the architecture. Part II dives further into the UI and the craziness behind selecting a different GEDCOM file. And somewhere down the line, we’ll talk about the moment the AI confidently invented a great‑uncle who absolutely never existed.

Next Part: Generative AI for Genealogy – Part I
