Generative AI for Genealogy – Part IV

This follows on from:

Normalisation: Cleaning Up the Linguistic Junk Drawer

Ask ten software architects what “normalisation” means and you’ll get ten different answers, each delivered with the confidence of someone who once read half a database textbook. Context matters, purpose matters, and let’s be honest, ego matters.

For this project, I’m using a very practical definition: normalisation is the process of turning whatever the user typed into something an LLM can understand without having a nervous breakdown.

That means removing invisible Unicode goblins, flattening exotic whitespace, and aligning synonyms so the model doesn’t treat “mum”, “mom”, and “cuz” as three unrelated species. It’s not glamorous work, but it’s essential, especially when you’re dealing with small models that panic at the sight of a zero‑width space.

Why normalisation matters (and why small models cry without it)

There are three major sources of chaos in user input:

  • Invisible Unicode noise: BOMs, ZWSPs, LRM/RLM… the kind of characters that haunt your dreams.
  • Whitespace weirdness: NBSP, NNBSP, IDEOGRAPHIC SPACE… basically every space except the one you actually want.
  • Synonym soup: mum/mom/mother, cuz/cousin, abt/about, etc.

Large models can often power through this mess. Small models? They fold like a cheap deckchair.

Normalisation gives them a clean, predictable surface to work with. It’s like tidying your desk before doing real work, except your desk is full of invisible characters that shouldn’t exist.

The two‑step dance: sanitise + substitute

My pipeline is simple:

1. Sanitise invisible characters

  • Canonicalise the string
  • Remove Unicode formatting characters (BOM, ZWSP, ZWJ, LRM/RLM…)
  • Convert all exotic spaces to ' '
  • Convert tabs/newlines/CR to ' '
  • Collapse multiple spaces into one

2. Replace synonyms

This aligns user phrasing with the phrasing in my prompts. “Who is the mum of X?” becomes “Who is the mother of X?” The model doesn’t need this, but it certainly appreciates it.
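To make the two steps concrete, here is a deliberately cut‑down sketch. It is not the engine listed at the end of this post: the class name `TwoStepSketch` and the hard‑coded synonym map are assumptions for illustration, it skips punctuation smoothing, and it does not preserve the casing of replaced words.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical illustration only; names and sample data are not from the real engine.
public static class TwoStepSketch
{
    // Assumed sample mapping: variant -> canonical.
    private static readonly Dictionary<string, string> Synonyms =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["mum"] = "mother",
            ["mom"] = "mother",
            ["cuz"] = "cousin",
        };

    public static string Normalise(string input)
    {
        // Step 1: sanitise. Canonicalise, drop format characters, flatten every kind of space.
        var sb = new StringBuilder(input.Length);

        foreach (char ch in input.Normalize(NormalizationForm.FormC))
        {
            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(ch);

            if (cat == UnicodeCategory.Format) continue; // BOM, ZWSP, ZWJ, LRM/RLM...

            bool spaceLike = cat == UnicodeCategory.SpaceSeparator || ch is '\t' or '\n' or '\r';
            sb.Append(spaceLike ? ' ' : ch);
        }

        string clean = Regex.Replace(sb.ToString(), " {2,}", " ").Trim();

        // Step 2: substitute whole-word synonyms (no casing preservation in this sketch).
        return Regex.Replace(
            clean,
            @"\b\w+\b",
            m => Synonyms.TryGetValue(m.Value, out string? canonical) ? canonical : m.Value);
    }
}
```

Feeding it `"Who\u200B is\u00A0the   mum of X?"` (a zero‑width space, a non‑breaking space and a run of ordinary spaces) should give back `"Who is the mother of X?"`.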

If this were my only competitive differentiator, I’d be in trouble. But it’s not. It’s just one of the quiet, unglamorous foundations that make the whole system behave.

Observability: the unsung hero

You’ll notice the code (at the bottom of the post) references my observability library. That’s deliberate. Normalisation is only useful if you can see what happened.

Never underestimate the importance of observability. It’s the difference between “the model is wrong” and “oh look, the user typed a zero‑width joiner between every letter.”

GPT’s verdict (and the extra homework it gave me)

I asked GPT to critique my pipeline. It responded with the AI equivalent of a warm pat on the back:

Your normalisation pipeline is already stronger than what most developers ever implement…

Which is flattering, but then it did what GPT always does: it gave me more work.

Here are the improvements it suggested:

1. Case normalisation

Lowercase everything after Unicode cleanup. Great for classification. Terrible for entity extraction. Conclusion:

  • File‑switch skill → lowercase is fine
  • Genealogy Q&A → absolutely not
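One way to express that split is to make the purpose an explicit parameter, so lowercasing can never leak into the Q&A path by accident. This is only a sketch: the `NormalisationPurpose` enum and `ApplyCasePolicy` method are hypothetical names, not part of the engine below.

```csharp
using System;

// Hypothetical: what the normalised text will be used for.
public enum NormalisationPurpose
{
    Classification,     // e.g. deciding which skill to route to
    QuestionAnswering   // e.g. genealogy Q&A, where names must keep their capitals
}

public static class CaseSketch
{
    // Lowercase only when the text is used for classification; otherwise leave casing alone.
    public static string ApplyCasePolicy(string sanitised, NormalisationPurpose purpose) =>
        purpose == NormalisationPurpose.Classification
            ? sanitised.ToLowerInvariant()
            : sanitised;
}
```

With this, `ApplyCasePolicy("Switch to the Simpson File", NormalisationPurpose.Classification)` returns `"switch to the simpson file"`, while the same call with `QuestionAnswering` returns the text untouched.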

2. Punctuation smoothing

Replace curly quotes, em‑dashes, ellipses, etc. with ASCII. This avoids tokenisation surprises and existential dread.

3. Genealogical shorthand expansion

  • “b.” → “born”
  • “d.” → “died”
  • “abt” → “about”
  • “circa” → “about”
  • “w/” → “with”

This massively improves intent detection for tiny models.
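A rough sketch of how such an expansion could work, using the same "longest first, guarded by letter/number boundaries" idea as the synonym regex in the engine below. The class name `ShorthandSketch` is an assumption, and the table only holds the five examples above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical illustration; not part of the engine listed at the end of the post.
public static class ShorthandSketch
{
    // Shorthand -> expansion, taken from the examples above.
    private static readonly Dictionary<string, string> Expansions =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["b."] = "born",
            ["d."] = "died",
            ["abt"] = "about",
            ["circa"] = "about",
            ["w/"] = "with",
        };

    // Longest key first; only match when not glued to a letter or digit on either side.
    private static readonly Regex Pattern = new(
        @"(?<![\p{L}\p{N}])(?:"
        + string.Join("|", Expansions.Keys.OrderByDescending(k => k.Length).Select(Regex.Escape))
        + @")(?![\p{L}\p{N}])",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static string Expand(string input) =>
        Pattern.Replace(
            input,
            m => Expansions.TryGetValue(m.Value, out string? full) ? full : m.Value);
}
```

For example, `Expand("b. abt 1850, d. circa 1900 w/ issue")` should produce `"born about 1850, died about 1900 with issue"`. One caveat with this naive version: a middle initial such as the “B.” in “John B. Smith” would also be expanded, so a real implementation probably wants an extra guard (for instance, only expanding “b.” and “d.” when a date follows).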

4. Strip filler words (carefully)

Not just “please” and “hey”, but structural fluff like: “can you maybe”, “would you be able to”, “just wondering if”. But only for classification. Never for Q&A.
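As a sketch of the "only for classification" rule, the stripping can live in a method whose name makes the restriction obvious. The filler list here is just the handful of phrases mentioned above, and `FillerSketch` and `StripForClassification` are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical illustration; intended for the classification path only, never for Q&A.
public static class FillerSketch
{
    // Assumed sample list, taken from the phrases mentioned in the text.
    private static readonly string[] Fillers =
    {
        "would you be able to",
        "just wondering if",
        "can you maybe",
        "please",
        "hey",
    };

    // Whole-phrase match, longest first, plus any trailing comma/exclamation mark and spaces.
    private static readonly Regex FillerRegex = new(
        @"\b(?:"
        + string.Join("|", Fillers.OrderByDescending(f => f.Length).Select(Regex.Escape))
        + @")\b[,!]?\s*",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static string StripForClassification(string input)
    {
        string stripped = FillerRegex.Replace(input, string.Empty);

        // Tidy up any double spaces left behind by the removal.
        return Regex.Replace(stripped, " {2,}", " ").Trim();
    }
}
```

Here `StripForClassification("Hey, can you maybe switch to the Simpson file please")` should come out as `"switch to the Simpson file"`.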

5. Normalise question forms

  • “whats” → “what is”
  • “who’s” → “who is”
  • “where’d” → “where did”

6. Domain‑specific expansion

  • “ftm” → “family tree maker”
  • “ged” → “gedcom”
  • “fam” → “family”

7. Keep a reversible log

Store both the raw input and the normalised version. This is already handled by observability.
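If you don’t have an observability layer to lean on, the same idea can be sketched with a tiny record that keeps the raw and normalised text side by side. Everything here (`NormalisationLogEntry`, `NormalisationLog`) is a hypothetical stand‑in, not my observability library, and the in‑memory list is not thread‑safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a real observability sink.
public sealed record NormalisationLogEntry(string Raw, string Normalised, DateTimeOffset At)
{
    // True when normalisation actually altered the input.
    public bool Changed => !string.Equals(Raw, Normalised, StringComparison.Ordinal);
}

public static class NormalisationLog
{
    private static readonly List<NormalisationLogEntry> s_entries = new();

    // Keeps the raw input next to the normalised output, so the original is never lost.
    public static NormalisationLogEntry Record(string raw, string normalised)
    {
        var entry = new NormalisationLogEntry(raw, normalised, DateTimeOffset.UtcNow);
        s_entries.Add(entry);
        return entry;
    }

    public static IReadOnlyList<NormalisationLogEntry> Entries => s_entries;
}
```

Because the raw text is stored untouched, "reversing" a normalisation is simply a matter of reading `Raw` back from the entry.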

The nuance: remove noise, not meaning

The golden rule of normalisation is simple: Strip anything that doesn’t change the user’s intent.

Lowercasing? Great for classification. Terrible for identifying “Bart” as a name instead of a verb.

Removing filler? Great for intent detection. Terrible for Q&A, where tone and phrasing matter.

Normalisation is not about dumbing down the input. It’s about removing the linguistic static that hides meaning.

Coming up next: the thing that didn’t work

In the next chapter, I’ll cover something that should have worked but absolutely didn’t: matching examples within a skill.

It’s not glamorous. It’s not exciting. But it’s one of those engineering dead‑ends that teaches you more than the successes. And yes, I’ll bring my best humour, because the topic deserves it.

Next part: Generative AI for Genealogy – Part V

Please feel free to use my normaliser in your own fun projects. If you spot an error, please leave a comment!

Please note: you will need to replace the call to my observability library with one of your own. Don’t forget to create the synonym file, and adjust the path accordingly.
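For reference, a minimal synonyms file might look like the following. The format (canonical value, a colon, then comma‑separated variants, with `#` starting a comment line) is what the loader below expects; apart from the `mother` line, the entries are just examples taken from earlier in this post.

```text
# canonical: variant, variant, ...
mother: mom, mum, mam, mommy, mummy, ma
cousin: cuz
about: abt
```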

using LlmIntegration.observability;
using System.Diagnostics;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

namespace LlmIntegration;

/// <summary>
/// Normalisation engine for input text.
/// </summary>
internal static class NormalisationEngine
{
    /// <summary>
    /// Where the synonyms file is located.
    /// </summary>
    private const string SYNONYMS_PATH = @".\llm\synonyms.md";

    /// <summary>
    /// List of (canonical, variant) synonym pairs loaded from the synonyms file.
    /// </summary>
    private static readonly List<(string, string)> s_synonyms = [];

    /// <summary>
    /// The compiled synonym regex (we use RegEx to find the synonyms in text).
    /// </summary>
    private static Regex? s_synonymRegex;

    /// <summary>
    /// Maps each variant (e.g. "mum") to its canonical form (e.g. "mother"); used for lookups during replacement.
    /// </summary>
    private static Dictionary<string, string>? s_synonymMap;

    /// <summary>
    /// Constructor - loads synonyms from file.
    /// </summary>
    static NormalisationEngine()
    {
        if (!File.Exists(SYNONYMS_PATH))
        {
            Debug.WriteLine($"Synonyms file not found at {SYNONYMS_PATH}. No synonyms will be applied.");
            return;
        }

        foreach (string line in File.ReadAllLines(SYNONYMS_PATH))
        {
            string trimmedLine = line.Trim();

            if (trimmedLine.StartsWith('#') || string.IsNullOrWhiteSpace(trimmedLine))
            {
                continue; // Skip comments and empty lines
            }

            // format:
            // canonical: variant, variant2...
            // e.g. mother: mom, mum, mam, mommy, mummy, ma

            // split into the canonical value (left of ':') and its variants (right of ':')
            string[] parts = trimmedLine.Split(':', 2);

            if (parts.Length < 2) continue; // no ':' on this line

            string canonical = parts[0].Trim().ToLowerInvariant();
            string[] variants = parts[1].Split(',', StringSplitOptions.RemoveEmptyEntries);

            // add each variant, trimmed and lower-cased. Variants can be multiple, separated by commas.
            foreach (string variant in variants)
            {
                string trimmedVariant = variant.Trim().ToLowerInvariant();

                s_synonyms.Add((canonical, trimmedVariant));

                if (Debugger.IsAttached) Debug.WriteLine($"Loaded synonym: '{trimmedVariant}' => '{canonical}'");
            }
        }

        if (Debugger.IsAttached) Debug.WriteLine($"Loaded {s_synonyms.Count} synonyms from {SYNONYMS_PATH}.");

        EnsureSynonymRegexInitialized();
    }

    /// <summary>
    /// Takes care of initializing the synonym regex from the synonym map.
    /// </summary>
    private static void EnsureSynonymRegexInitialized()
    {
        // Build a canonical map: variant -> canonical (last wins)
        Dictionary<string, string> map = new(StringComparer.OrdinalIgnoreCase);

        foreach (var (canonical, variant) in s_synonyms)
        {
            if (string.IsNullOrWhiteSpace(canonical) || string.IsNullOrWhiteSpace(variant)) continue;
            map[variant.Trim()] = canonical.Trim();
        }

        s_synonymMap = map;

        if (map.Count == 0)
        {
            s_synonymRegex = null;
            return;
        }

        // Longest-first to prefer phrases (multi-word supported)
        string alternation = string.Join("|",
            map.Keys
               .OrderByDescending(k => k.Length)
               .Select(Regex.Escape));

        // Unicode-aware non-letter/number/mark boundaries
        string pattern = $@"(?<![\p{{L}}\p{{N}}\p{{M}}])(?<key>{alternation})(?![\p{{L}}\p{{N}}\p{{M}}])";

        s_synonymRegex = new Regex(
            pattern,
            RegexOptions.Compiled |
            RegexOptions.CultureInvariant |
            RegexOptions.IgnoreCase);
    }

    /// <summary>
    /// Applies the casing style of the original text to the replacement string, preserving all-uppercase, title case,
    /// or lowercase formatting.
    /// </summary>
    /// <remarks>If the original span contains no letter characters, the replacement is returned unchanged.
    /// Only simple casing styles (all uppercase, title case, or lowercase) are detected; mixed or complex casing is not
    /// preserved.</remarks>
    /// <param name="original">The source span whose casing style is to be analyzed and applied. Only letter characters are considered when
    /// determining casing.</param>
    /// <param name="replacement">The string to which the detected casing style from the original span will be applied.</param>
    /// <returns>A string containing the replacement text adjusted to match the casing style of the original span. Returns the
    /// replacement as all uppercase if the original is all uppercase, with an initial capital if the original is title
    /// case, or unchanged otherwise.</returns>
    private static string ApplyCasing(ReadOnlySpan<char> original, string replacement)
    {
        // Preserve simple ALLCAPS, TitleCase, or lower
        bool anyLetter = false;
        bool allUpper = true;
        bool titleCase = true;

        for (int i = 0; i < original.Length; i++)
        {
            char c = original[i];

            if (!char.IsLetter(c)) continue;

            anyLetter = true;

            if (!char.IsUpper(c)) allUpper = false;

            if (i == 0)
            {
                if (!char.IsUpper(c)) titleCase = false;
            }
            else
            {
                if (char.IsUpper(c)) titleCase = false;
            }
        }

        if (!anyLetter) return replacement;

        if (allUpper) return replacement.ToUpperInvariant();

        if (titleCase) return replacement.Length > 0
            ? char.ToUpperInvariant(replacement[0]) + (replacement.Length > 1 ? replacement[1..] : string.Empty)
            : replacement;

        return replacement; // lower or mixed stays as-is
    }

    /// <summary>
    /// Normalizes the input text.
    /// Currently little more than synonym replacement.
    /// Enhancements could include:
    /// - More advanced text normalization techniques
    /// - Contextual synonym replacement
    /// - Integration with external knowledge sources
    /// </summary>
    /// <param name="input"></param>
    /// <returns></returns>
    internal static string NormalizeText(string input)
    {
        string originalInput = input; // keep for observability (strings are immutable, so no copy is needed)

        input = input
            .Trim()
            .Replace("\\r", "\r")  // replace escaped carriage returns with real carriage returns
            .Replace("\\n", "\n"); // replace escaped new lines with real new lines

        input = SanitizeInvisibleChars(input);

        if (s_synonymRegex != null && s_synonymMap != null && s_synonymMap.Count > 0 && input.Length > 0)
        {
            var applied = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

            string replaced = s_synonymRegex.Replace(input, m =>
            {
                string key = m.Groups["key"].Value;
                if (!s_synonymMap.TryGetValue(key, out string? canonical))
                {
                    // Fallback to lower for safety
                    s_synonymMap.TryGetValue(key.ToLowerInvariant(), out canonical);
                }

                if (canonical is null) return key; // Shouldn’t happen

                applied.TryGetValue(key, out int count);
                applied[key] = count + 1;

                return ApplyCasing(key.AsSpan(), canonical);
            });

            input = replaced;
        }

        // provide observability if any changes were made
        if (!string.Equals(originalInput, input, StringComparison.Ordinal))
        {
            LlmObservability.Adjustment("LLM Input Normalisation", MakeInvisibleCharactersVisible(originalInput), MakeInvisibleCharactersVisible(input));
        }

        return input;
    }

    /// <summary>
    /// Remove BOM/zero-width/control format chars and normalize odd spaces to ASCII space.
    /// </summary>
    /// <param name="input">Text to sanitise.</param>
    /// <returns>Sanitised text.</returns>
    private static string SanitizeInvisibleChars(string? input)
    {
        if (string.IsNullOrEmpty(input)) return input ?? string.Empty;

        // normalize first to canonical form
        string normalized = input.Normalize(NormalizationForm.FormC);
        normalized = normalized.Replace("…", "..."); // normalize ellipsis to three dots

        StringBuilder sb = new(normalized.Length);

        foreach (var ch in normalized)
        {
            var cat = CharUnicodeInfo.GetUnicodeCategory(ch);

            // drop all Unicode format characters (includes U+FEFF BOM, ZWSP/ZWJ, LRM/RLM, etc.)
            if (cat == UnicodeCategory.Format)
                continue;

            // convert all space separators (NBSP, NNBSP, IDEOGRAPHIC SPACE, etc.) to ' '
            if (cat == UnicodeCategory.SpaceSeparator)
            {
                sb.Append(' ');
                continue;
            }

            // drop most control characters except common whitespace; normalize them to a space as well
            if (cat == UnicodeCategory.Control)
            {
                if (ch == '\n' || ch == '\r' || ch == '\t')
                {
                    sb.Append(' ');
                }

                // else skip
                continue;
            }

            // normalize some common punctuation variants per GPT guidance
            // (no '\r' case is needed here: carriage returns are control characters, handled above)
            switch (ch)
            {
                case '‘': // left single quote
                case '’': // right single quote / apostrophe
                    sb.Append('\''); // normalize fancy apostrophes to straight
                    continue;

                case '“': // left double quote
                case '”': // right double quote
                    sb.Append('"'); // normalize fancy quotes to straight
                    continue;

                case '—': // em dash
                case '–': // en dash
                    sb.Append('-'); // normalize dashes to hyphen
                    continue;
            }

            sb.Append(ch);
        }

        return CollapseMultipleSpacesIntoSingleSpace(sb);
    }

    /// <summary>
    /// Collapse multiple spaces into a single space, in one pass. This avoids doing multiple Replace() calls.
    /// </summary>
    /// <param name="sbContainingTextToRemoveSpacesFrom">Builder holding the sanitised text; it is cleared and reused.</param>
    /// <returns>Input with multiple contiguous spaces removed.</returns>
    private static string CollapseMultipleSpacesIntoSingleSpace(StringBuilder sbContainingTextToRemoveSpacesFrom)
    {
        string textToCollapse = sbContainingTextToRemoveSpacesFrom.ToString();
        bool lastWasSpace = false;
        bool lastWasNewLine = false;

        sbContainingTextToRemoveSpacesFrom.Clear(); // we're going to reinsert the cleaned chars

        foreach (var ch in textToCollapse)
        {
            char thisChar = ch;

            if (thisChar == '\t') // replace tabs with spaces
            {
                thisChar = ' ';
            }

            // multiple spaces in a row get collapsed to single space
            if (thisChar == ' ')
            {
                if (lastWasSpace) continue; // this removes space following spaces

                sbContainingTextToRemoveSpacesFrom.Append(' ');
                lastWasSpace = true;
                lastWasNewLine = false;

                continue;
            }

            // multiple new lines in a row get collapsed to single new line
            if (thisChar == '\n')
            {
                if (lastWasNewLine) continue; // remove new line following new line

                sbContainingTextToRemoveSpacesFrom.Append('\n');
                lastWasNewLine = true;
                lastWasSpace = false;

                continue;
            }

            sbContainingTextToRemoveSpacesFrom.Append(thisChar);
            lastWasSpace = false;
            lastWasNewLine = false; // reset, otherwise every later new line would be dropped
        }

        return sbContainingTextToRemoveSpacesFrom.ToString().Trim(); // even after collapsing, we have to trim leading/trailing spaces (avoids logic elsewhere)
    }

    /// <summary>
    /// For observability, makes invisible characters visible. Otherwise we can't see what was changed.
    /// </summary>
    /// <param name="input"></param>
    /// <returns></returns>
    private static string MakeInvisibleCharactersVisible(string input)
    {
        if (string.IsNullOrEmpty(input)) return input ?? string.Empty;

        var sb = new StringBuilder(input.Length * 2);
        int i = 0;

        while (i < input.Length)
        {
            char ch = input[i];

            // combine CRLF into a single token
            if (ch == '\r')
            {
                if (i + 1 < input.Length && input[i + 1] == '\n')
                {
                    sb.Append("<CRLF>");
                    i += 2;
                    continue;
                }

                sb.Append("<CR>");
                i++;
                continue;
            }

            // common ASCII whitespace
            if (ch == '\n') { sb.Append("<LF>"); i++; continue; }
            if (ch == '\t') { sb.Append("<TAB>"); i++; continue; }
            if (ch == ' ') { sb.Append(ch); i++; continue; }

            // handle a few known ASCII controls
            switch (ch)
            {
                case '\v': // VT
                    sb.Append("<VT>");
                    i++;
                    continue;
                case '\f': // FF
                    sb.Append("<FF>");
                    i++;
                    continue;
            }

            // Unicode aware handling
            var cat = CharUnicodeInfo.GetUnicodeCategory(ch);

            // Unicode format characters (zero-width, directional marks, BOM, etc.)
            if (cat == UnicodeCategory.Format)
            {
                switch (ch)
                {
                    case '\uFEFF': sb.Append("<BOM>"); break;          // ZERO WIDTH NO-BREAK SPACE (BOM)
                    case '\u200B': sb.Append("<ZWSP>"); break;         // ZERO WIDTH SPACE
                    case '\u200C': sb.Append("<ZWNJ>"); break;         // ZERO WIDTH NON-JOINER
                    case '\u200D': sb.Append("<ZWJ>"); break;          // ZERO WIDTH JOINER
                    case '\u2066': sb.Append("<LRI>"); break;          // LEFT-TO-RIGHT ISOLATE
                    case '\u2067': sb.Append("<RLI>"); break;          // RIGHT-TO-LEFT ISOLATE
                    case '\u2068': sb.Append("<FSI>"); break;          // FIRST STRONG ISOLATE
                    case '\u2069': sb.Append("<PDI>"); break;          // POP DIRECTIONAL ISOLATE
                    case '\u200E': sb.Append("<LRM>"); break;          // LEFT-TO-RIGHT MARK
                    case '\u200F': sb.Append("<RLM>"); break;          // RIGHT-TO-LEFT MARK
                    case '\u202A': sb.Append("<LRE>"); break;          // LEFT-TO-RIGHT EMBEDDING
                    case '\u202B': sb.Append("<RLE>"); break;          // RIGHT-TO-LEFT EMBEDDING
                    case '\u202C': sb.Append("<PDF>"); break;          // POP DIRECTIONAL FORMAT
                    case '\u202D': sb.Append("<LRO>"); break;          // LEFT-TO-RIGHT OVERRIDE
                    case '\u202E': sb.Append("<RLO>"); break;          // RIGHT-TO-LEFT OVERRIDE
                    default: sb.Append("<FMT>"); break;                // Generic format char
                }

                i++;
                continue;
            }

            // Any Unicode space separator -> show as <SP>
            if (cat == UnicodeCategory.SpaceSeparator)
            {
                // Includes NBSP (\u00A0), NNBSP (\u202F), IDEOGRAPHIC SPACE (\u3000), etc.
                sb.Append("<SP>");
                i++;
                continue;
            }

            // Control characters: C0, DEL, C1; avoid hex, use generic tags
            if (cat == UnicodeCategory.Control)
            {
                // Handle NEL (Next Line) U+0085
                if (ch == '\u0085')
                {
                    sb.Append("<NEL>");
                }
                else if (ch == '\u0000')
                {
                    sb.Append("<NUL>");
                }
                else
                {
                    sb.Append("<CTRL>");
                }

                i++;
                continue;
            }

            // Otherwise, visible character
            sb.Append(ch);
            i++;
        }

        return sb.ToString();
    }
}
