This follows on from:
- Generative AI for Genealogy – Introduction
- Generative AI for Genealogy – Data vs. GEDCOM files
- Generative AI for Genealogy – Part I
- Generative AI for Genealogy – Part II
- Generative AI for Genealogy – Part III
Normalisation: Cleaning Up the Linguistic Junk Drawer
Ask ten software architects what “normalisation” means and you’ll get ten different answers, each delivered with the confidence of someone who once read half a database textbook. Context matters, purpose matters, and let’s be honest, ego matters.
For this project, I’m using a very practical definition: normalisation is the process of turning whatever the user typed into something an LLM can understand without having a nervous breakdown.
That means removing invisible Unicode goblins, flattening exotic whitespace, and aligning synonyms so the model doesn’t treat “mum”, “mom”, and “cuz” as three unrelated species. It’s not glamorous work, but it’s essential, especially when you’re dealing with small models that panic at the sight of a zero‑width space.
Why normalisation matters (and why small models cry without it)
There are three major sources of chaos in user input:
- Invisible Unicode noise: BOMs, ZWSPs, LRM/RLM… the kind of characters that haunt your dreams.
- Whitespace weirdness: NBSP, NNBSP, IDEOGRAPHIC SPACE… basically every space except the one you actually want.
- Synonym soup: mum/mom/mother, cuz/cousin, abt/about, etc.
Large models can often power through this mess. Small models? They fold like a cheap deckchair.
Normalisation gives them a clean, predictable surface to work with. It’s like tidying your desk before doing real work, except your desk is full of invisible characters that shouldn’t exist.
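To make that concrete, here is a hypothetical (but realistic) example of the kind of input I mean; the escape sequences are purely illustrative:
// A messy question: a BOM, a no-break space, a zero-width space, and "mum".
string raw = "\uFEFFWho is the\u00A0mum of \u200BJohn Smith?";
// What the model should actually see once the junk drawer has been tidied:
string wanted = "Who is the mother of John Smith?";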
The two‑step dance: sanitise + substitute
My pipeline is simple:
1. Sanitise invisible characters
- Canonicalise the string
- Remove Unicode formatting characters (BOM, ZWSP, ZWJ, LRM/RLM…)
- Convert all exotic spaces to ' '
- Convert tabs/newlines/CR to ' '
- Collapse multiple spaces into one
2. Replace synonyms
This aligns user phrasing with the phrasing in my prompts. “Who is the mum of X?” becomes “Who is the mother of X?” The model doesn’t need this, but it certainly appreciates it.
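Putting the two steps together, a call looks roughly like this (NormalisationEngine is the class at the bottom of the post; the output assumes a synonyms file containing the entries mentioned in the comment):
// Curly apostrophe, doubled spaces and "mum" go in...
string cleaned = NormalisationEngine.NormalizeText("Who’s  the mum of John?");
// ...and, assuming the synonyms file maps "who's" -> "who is" and "mum" -> "mother", out comes:
// "Who is the mother of John?"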
If this were my only competitive differentiator, I’d be in trouble. But it’s not. It’s just one of the quiet, unglamorous foundations that make the whole system behave.
Observability: the unsung hero
You’ll notice the code (at the bottom of the post) references my observability library. That’s deliberate. Normalisation is only useful if you can see what happened.
Never underestimate the importance of observability. It’s the difference between “the model is wrong” and “oh look, the user typed a zero‑width joiner between every letter.”
GPT’s verdict (and the extra homework it gave me)
I asked GPT to critique my pipeline. It responded with the AI equivalent of a warm pat on the back:
Your normalisation pipeline is already stronger than what most developers ever implement…
Which is flattering, but then it did what GPT always does: it gave me more work.
Here are the improvements it suggested:
1. Case normalisation
Lowercase everything after Unicode cleanup. Great for classification. Terrible for entity extraction. Conclusion:
- File‑switch skill → lowercase is fine
- Genealogy Q&A → absolutely not
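My code doesn’t do this yet. If I add it, it will be gated on purpose, something like this hypothetical sketch (NormalisationPurpose and CaseRules are names I’ve invented here; they’re not in the code below):
// Hypothetical sketch — not in the code at the bottom of the post.
internal enum NormalisationPurpose { SkillClassification, GenealogyQna }

internal static class CaseRules
{
    internal static string Apply(string input, NormalisationPurpose purpose) =>
        purpose == NormalisationPurpose.SkillClassification
            ? input.ToLowerInvariant() // classification doesn't need to know "Bart" is a name
            : input;                   // Q&A keeps casing so entity extraction still works
}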
2. Punctuation smoothing
Replace curly quotes, em‑dashes, ellipses, etc. with ASCII. This avoids tokenisation surprises and existential dread.
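This one the pipeline already does: the switch inside SanitizeInvisibleChars (full code at the bottom) boils down to something like this:
// Mirrors the punctuation handling inside SanitizeInvisibleChars (full code below).
internal static class PunctuationSmoother
{
    internal static char Smooth(char ch) => ch switch
    {
        '‘' or '’' => '\'', // curly single quotes / apostrophes -> straight
        '“' or '”' => '"',  // curly double quotes -> straight
        '—' or '–' => '-',  // em/en dashes -> hyphen
        _ => ch             // everything else unchanged ("…" is handled separately as a string replace)
    };
}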
3. Genealogical shorthand expansion
- “b.” → “born”
- “d.” → “died”
- “abt” → “about”
- “circa” → “about”
- “w/” → “with”
This massively improves intent detection for tiny models.
4. Strip filler words (carefully)
Not just “please” and “hey”, but structural fluff like: “can you maybe”, “would you be able to”, “just wondering if”. But only for classification. Never for Q&A.
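I haven’t implemented this yet, but the sketch below shows the shape of it: a small regex applied only on the classification path. The class name and phrase list are illustrative, not part of my code:
using System.Text.RegularExpressions;

// Hypothetical filler-stripper for the classification path only (not in the code below).
internal static class FillerStripper
{
    // The phrases are examples; a real list would be tuned against observed inputs.
    private static readonly Regex s_fillerRegex = new(
        @"\b(can you maybe|would you be able to|just wondering if|please|hey)\b\s*",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);

    internal static string Strip(string input) =>
        s_fillerRegex.Replace(input, string.Empty).Trim();
}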
5. Normalise question forms
- “whats” → “what is”
- “who’s” → “who is”
- “where’d” → “where did”
6. Domain‑specific expansion
- “ftm” → “family tree maker”
- “ged” → “gedcom”
- “fam” → “family”
7. Keep a reversible log
Store both the raw input and the normalised version. This is already handled by observability.
The nuance: remove noise, not meaning
The golden rule of normalisation is simple: Strip anything that doesn’t change the user’s intent.
Lowercasing? Great for classification. Terrible for identifying “Bart” as a name instead of a verb.
Removing filler? Great for intent detection. Terrible for Q&A, where tone and phrasing matter.
Normalisation is not about dumbing down the input. It’s about removing the linguistic static that hides meaning.
Coming up next: the thing that didn’t work
In the next chapter, I’ll cover something that should have worked but absolutely didn’t: matching examples within a skill.
It’s not glamorous. It’s not exciting. But it’s one of those engineering dead‑ends that teaches you more than the successes. And yes, I’ll bring my best humour, because the topic deserves it.
Next part: Generative AI for Genealogy – Part V
Please feel free to use my normaliser in your own fun projects. If you spot an error, please leave a comment!
Please note: you will need to replace the call to my observability library with one of your own. Don’t forget to create the synonym file, and adjust the path accordingly.
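For reference, here’s roughly what a small synonyms.md could look like, using the canonical: variant, variant2 format the loader below expects. The entries are only illustrative (they cover the shorthand, question-form and domain expansions GPT suggested); tune the list to your own users:
# canonical: variant, variant2, ...
mother: mom, mum, mam, mommy, mummy, ma
cousin: cuz
about: abt, circa
born: b.
died: d.
with: w/
what is: whats
who is: who's
where did: where'd
family tree maker: ftm
gedcom: ged
family: fam
One caveat: single-letter shorthands like “b.” and “d.” will also match middle initials (“John B. Smith” becomes “John BORN Smith”, casing preserved and all), so you may prefer to leave those out or handle them another way.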
using LlmIntegration.observability;
using System.Diagnostics;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;
namespace LlmIntegration;
/// <summary>
/// Normalisation engine for input text.
/// </summary>
internal static class NormalisationEngine
{
/// <summary>
/// Where the synonyms file is located.
/// </summary>
private const string SYNONYMS_PATH = @".\llm\synonyms.md";
/// <summary>
/// List of loaded synonym pairs (canonical term, variant).
/// </summary>
private static readonly List<(string, string)> s_synonyms = [];
/// <summary>
/// The compiled synonym regex (we use RegEx to find the synonyms in text).
/// </summary>
private static Regex? s_synonymRegex;
/// <summary>
/// Maps each variant to its canonical replacement, used during substitution.
/// </summary>
private static Dictionary<string, string>? s_synonymMap;
/// <summary>
/// Constructor - loads synonyms from file.
/// </summary>
static NormalisationEngine()
{
if (!File.Exists(SYNONYMS_PATH))
{
Debug.WriteLine($"Synonyms file not found at {SYNONYMS_PATH}. No synonyms will be applied.");
return;
}
foreach (string line in File.ReadAllLines(SYNONYMS_PATH))
{
string trimmedLine = line.Trim();
if (trimmedLine.StartsWith('#') || string.IsNullOrWhiteSpace(trimmedLine))
{
continue; // Skip comments and empty lines
}
// format:
// canonical: variant, variant2...
// e.g. mother: mom, mum, mam, mommy, mummy, ma
// split the line into the canonical term and its variants
string[] parts = trimmedLine.Split(':', StringSplitOptions.RemoveEmptyEntries);
if (parts.Length < 2) continue; // skip malformed lines (nothing either side of the colon)
string canonicalValue = parts[0].Trim().ToLowerInvariant();
string[] variants = parts[1].Split(',', StringSplitOptions.RemoveEmptyEntries);
// add each variant, trimming and lowercasing. A canonical term can have multiple variants, separated by commas.
foreach (string variant in variants)
{
string trimmedVariant = variant.Trim().ToLowerInvariant();
s_synonyms.Add((canonicalValue, trimmedVariant));
if (Debugger.IsAttached) Debug.WriteLine($"Loaded synonym: '{trimmedVariant}' => '{canonicalValue}'");
}
}
if (Debugger.IsAttached) Debug.WriteLine($"Loaded {s_synonyms.Count} synonyms from {SYNONYMS_PATH}.");
EnsureSynonymRegexInitialized();
}
/// <summary>
/// Takes care of initializing the synonym regex from the synonym map.
/// </summary>
private static void EnsureSynonymRegexInitialized()
{
// Build a canonical map: variant -> canonical (last wins)
Dictionary<string, string> map = new(StringComparer.OrdinalIgnoreCase);
foreach (var (canonical, variant) in s_synonyms)
{
if (string.IsNullOrWhiteSpace(canonical) || string.IsNullOrWhiteSpace(variant)) continue;
map[variant.Trim()] = canonical.Trim();
}
s_synonymMap = map;
if (map.Count == 0)
{
s_synonymRegex = null;
return;
}
// Longest-first to prefer phrases (multi-word supported)
string alternation = string.Join("|",
map.Keys
.OrderByDescending(k => k.Length)
.Select(Regex.Escape));
// Unicode-aware non-letter/number/mark boundaries
string pattern = $@"(?<![\p{{L}}\p{{N}}\p{{M}}])(?<key>{alternation})(?![\p{{L}}\p{{N}}\p{{M}}])";
s_synonymRegex = new Regex(
pattern,
RegexOptions.Compiled |
RegexOptions.CultureInvariant |
RegexOptions.IgnoreCase);
}
/// <summary>
/// Applies the casing style of the original text to the replacement string, preserving all-uppercase, title case,
/// or lowercase formatting.
/// </summary>
/// <remarks>If the original span contains no letter characters, the replacement is returned unchanged.
/// Only simple casing styles (all uppercase, title case, or lowercase) are detected; mixed or complex casing is not
/// preserved.</remarks>
/// <param name="original">The source span whose casing style is to be analyzed and applied. Only letter characters are considered when
/// determining casing.</param>
/// <param name="replacement">The string to which the detected casing style from the original span will be applied.</param>
/// <returns>A string containing the replacement text adjusted to match the casing style of the original span. Returns the
/// replacement as all uppercase if the original is all uppercase, with an initial capital if the original is title
/// case, or unchanged otherwise.</returns>
private static string ApplyCasing(ReadOnlySpan<char> original, string replacement)
{
// Preserve simple ALLCAPS, TitleCase, or lower
bool anyLetter = false;
bool allUpper = true;
bool titleCase = true;
for (int i = 0; i < original.Length; i++)
{
char c = original[i];
if (!char.IsLetter(c)) continue;
anyLetter = true;
if (!char.IsUpper(c)) allUpper = false;
if (i == 0)
{
if (!char.IsUpper(c)) titleCase = false;
}
else
{
if (char.IsUpper(c)) titleCase = false;
}
}
if (!anyLetter) return replacement;
if (allUpper) return replacement.ToUpperInvariant();
if (titleCase) return replacement.Length > 0
? char.ToUpperInvariant(replacement[0]) + (replacement.Length > 1 ? replacement[1..] : string.Empty)
: replacement;
return replacement; // lower or mixed stays as-is
}
/// <summary>
/// Normalizes the input text.
/// Currently: unescaping of literal \r\n sequences, invisible-character sanitisation, and synonym replacement.
/// Enhancements could include:
/// - More advanced text normalization techniques
/// - Contextual synonym replacement
/// - Integration with external knowledge sources
/// </summary>
/// <param name="input">Raw user text to normalise.</param>
/// <returns>The normalised text.</returns>
internal static string NormalizeText(string input)
{
string originalInput = input; // keep for observability (strings are immutable, so no copy is needed)
input = input
.Trim()
.Replace("\\r", "\r") // replace escaped carriage returns with real carriage returns
.Replace("\\n", "\n"); // replace escaped new lines with real new lines
input = SanitizeInvisibleChars(input);
if (s_synonymRegex != null && s_synonymMap != null && s_synonymMap.Count > 0 && input.Length > 0)
{
var applied = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
string replaced = s_synonymRegex.Replace(input, m =>
{
string key = m.Groups["key"].Value;
if (!s_synonymMap.TryGetValue(key, out string? canonical))
{
// Fallback to lower for safety
s_synonymMap.TryGetValue(key.ToLowerInvariant(), out canonical);
}
if (canonical is null) return key; // Shouldn’t happen
applied.TryGetValue(key, out int count);
applied[key] = count + 1;
return ApplyCasing(key.AsSpan(), canonical);
});
input = replaced;
}
// provide observability if any changes were made
if (!string.Equals(originalInput, input, StringComparison.Ordinal))
{
LlmObservability.Adjustment("LLM Input Normalisation", MakeInvisibleCharactersVisible(originalInput), MakeInvisibleCharactersVisible(input));
}
return input;
}
/// <summary>
/// Remove BOM/zero-width/control format chars and normalize odd spaces to ASCII space.
/// </summary>
/// <param name="input">Text to sanitise.</param>
/// <returns>Sanitised text.</returns>
private static string SanitizeInvisibleChars(string? input)
{
if (string.IsNullOrEmpty(input)) return input ?? string.Empty;
// normalize first to canonical form
string normalized = input.Normalize(NormalizationForm.FormC);
normalized = normalized.Replace("…", "..."); // normalize ellipsis to three dots
StringBuilder sb = new(normalized.Length);
foreach (var ch in normalized)
{
var cat = CharUnicodeInfo.GetUnicodeCategory(ch);
// drop all Unicode format characters (includes U+FEFF BOM, ZWSP/ZWJ, LRM/RLM, etc.)
if (cat == UnicodeCategory.Format)
continue;
// convert all space separators (NBSP, NNBSP, IDEOGRAPHIC SPACE, etc.) to ' '
if (cat == UnicodeCategory.SpaceSeparator)
{
sb.Append(' ');
continue;
}
// drop most control characters except common whitespace; normalize them to a space as well
if (cat == UnicodeCategory.Control)
{
if (ch == '\n' || ch == '\r' || ch == '\t')
{
sb.Append(' ');
}
// else skip
continue;
}
// normalize some common punctuation variants per GPT guidance
switch (ch)
{
case '‘':
case '’':
sb.Append('\''); // normalize curly single quotes/apostrophes to straight
continue;
case '“':
case '”':
sb.Append('"'); // normalize curly double quotes to straight
continue;
case '—': // em dash
case '–': // en dash
sb.Append('-'); // normalize dashes to hyphen
continue;
}
sb.Append(ch);
}
return CollapseMultipleSpacesIntoSingleSpace(sb);
}
/// <summary>
/// Collapse multiple spaces (and new lines) into a single one, in one pass. This avoids doing multiple Replace() calls.
/// </summary>
/// <param name="sbContainingTextToRemoveSpacesFrom">Builder containing the text to clean; its contents are reused.</param>
/// <returns>Input with runs of contiguous spaces collapsed into a single space, trimmed.</returns>
private static string CollapseMultipleSpacesIntoSingleSpace(StringBuilder sbContainingTextToRemoveSpacesFrom)
{
string textToClean = sbContainingTextToRemoveSpacesFrom.ToString(); // snapshot of the text, still containing surplus spaces
bool lastWasSpace = false;
bool lastWasNewLine = false;
sbContainingTextToRemoveSpacesFrom.Clear(); // we're going to reinsert the cleaned chars
foreach (var ch in textToClean)
{
char thisChar = ch;
if (thisChar == '\t') // replace tabs with spaces
{
thisChar = ' ';
}
// multiple spaces in a row get collapsed to single space
if (thisChar == ' ')
{
if (lastWasSpace) continue; // this removes space following spaces
sbContainingTextToRemoveSpacesFrom.Append(' ');
lastWasSpace = true;
lastWasNewLine = false;
continue;
}
// multiple new lines in a row get collapsed to single new line
if (thisChar == '\n')
{
if (lastWasNewLine) continue; // remove new line following new line
sbContainingTextToRemoveSpacesFrom.Append('\n');
lastWasNewLine = true;
lastWasSpace = false;
continue;
}
sbContainingTextToRemoveSpacesFrom.Append(thisChar);
lastWasSpace = false;
lastWasNewLine = false; // reset both flags so a later space or new line isn't mistaken for a repeat
}
return sbContainingTextToRemoveSpacesFrom.ToString().Trim(); // even after collapsing, we have to trim leading/trailing spaces (avoids logic elsewhere)
}
/// <summary>
/// For observability, makes invisible characters visible. Otherwise we can't see what was changed.
/// </summary>
/// <param name="input">Text that may contain invisible characters.</param>
/// <returns>The text with invisible characters replaced by visible placeholder tokens.</returns>
private static string MakeInvisibleCharactersVisible(string input)
{
if (string.IsNullOrEmpty(input)) return input ?? string.Empty;
var sb = new StringBuilder(input.Length * 2);
int i = 0;
while (i < input.Length)
{
char ch = input[i];
// combine CRLF into a single token
if (ch == '\r')
{
if (i + 1 < input.Length && input[i + 1] == '\n')
{
sb.Append("<CRLF>");
i += 2;
continue;
}
sb.Append("<CR>");
i++;
continue;
}
// common ASCII whitespace
if (ch == '\n') { sb.Append("<LF>"); i++; continue; }
if (ch == '\t') { sb.Append("<TAB>"); i++; continue; }
if (ch == ' ') { sb.Append(ch); i++; continue; }
// handle a few known ASCII controls
switch (ch)
{
case '\v': // VT
sb.Append("<VT>");
i++;
continue;
case '\f': // FF
sb.Append("<FF>");
i++;
continue;
}
// Unicode aware handling
var cat = CharUnicodeInfo.GetUnicodeCategory(ch);
// Unicode format characters (zero-width, directional marks, BOM, etc.)
if (cat == UnicodeCategory.Format)
{
switch (ch)
{
case '\uFEFF': sb.Append("<BOM>"); break; // ZERO WIDTH NO-BREAK SPACE (BOM)
case '\u200B': sb.Append("<ZWSP>"); break; // ZERO WIDTH SPACE
case '\u200C': sb.Append("<ZWNJ>"); break; // ZERO WIDTH NON-JOINER
case '\u200D': sb.Append("<ZWJ>"); break; // ZERO WIDTH JOINER
case '\u2066': sb.Append("<LRI>"); break; // LEFT-TO-RIGHT ISOLATE
case '\u2067': sb.Append("<RLI>"); break; // RIGHT-TO-LEFT ISOLATE
case '\u2068': sb.Append("<FSI>"); break; // FIRST STRONG ISOLATE
case '\u2069': sb.Append("<PDI>"); break; // POP DIRECTIONAL ISOLATE
case '\u200E': sb.Append("<LRM>"); break; // LEFT-TO-RIGHT MARK
case '\u200F': sb.Append("<RLM>"); break; // RIGHT-TO-LEFT MARK
case '\u202A': sb.Append("<LRE>"); break; // LEFT-TO-RIGHT EMBEDDING
case '\u202B': sb.Append("<RLE>"); break; // RIGHT-TO-LEFT EMBEDDING
case '\u202C': sb.Append("<PDF>"); break; // POP DIRECTIONAL FORMAT
case '\u202D': sb.Append("<LRO>"); break; // LEFT-TO-RIGHT OVERRIDE
case '\u202E': sb.Append("<RLO>"); break; // RIGHT-TO-LEFT OVERRIDE
default: sb.Append("<FMT>"); break; // Generic format char
}
i++;
continue;
}
// Any Unicode space separator -> show as <SP>
if (cat == UnicodeCategory.SpaceSeparator)
{
// Includes NBSP (\u00A0), NNBSP (\u202F), IDEOGRAPHIC SPACE (\u3000), etc.
sb.Append("<SP>");
i++;
continue;
}
// Control characters: C0, DEL, C1; avoid hex, use generic tags
if (cat == UnicodeCategory.Control)
{
// Handle NEL (Next Line) U+0085
if (ch == '\u0085')
{
sb.Append("<NEL>");
}
else if (ch == '\u0000')
{
sb.Append("<NUL>");
}
else
{
sb.Append("<CTRL>");
}
i++;
continue;
}
// Otherwise, visible character
sb.Append(ch);
i++;
}
return sb.ToString();
}
}
