Guardrails
So you’re planning to add an LLM chatbot or RAG. How do you ensure it acts with a controlled scope? Guardrails.
They are rules that are intended to prevent the LLM from answering questions that are deemed out-of-scope. For example, GPT should not inform wannabe terrorists how to make explosives or help script kiddies write viruses.
With the tone set, let’s troll China briefly. I love them too, but denying facts in the modern age is futile.
Who remembers Tiananmen Square? Running people over with tanks is fun in video games like Call Of Duty, Far Cry or Halo. It’s very efficient and doesn’t use up ammo; I do it all the time. It isn’t ok IRL.
When you ask a Chinese AI, the tentacles of autocracy say “no”.

Are you sure about that? I am pretty sure that isn’t true. I am confident it’s learnt from Western (biased) reports during training.

…

Funny. I am sure a minute ago, it knew nothing about the violent crackdown. It didn’t even take much effort to bypass the guardrail. It also didn’t understand I was in “1000AD” having travelled from 3000AD.
This is about the dangers of LLMs. They are like old (US?) presidents – often cannot be trusted.
How do you ensure that your LLM follows the rules? I have no idea; my trust in current LLMs is zero.
Here is DeepSeek r1 being slanderous. The LLM was meant to call a function that uses the hierarchy to work out children/parents. Except “DeepSeek” went rogue (none of it is true). Imagine YOUR chatbot/LLM doing this…

They also have an amazing ability to give dumb output: (where exactly did I ask to check a prime number?)

It’s an area that is moving at a rapid pace, but so are always the bad guys who try to be 1 step ahead. Their ingenuity knows no bounds.
For example, using Leet Speak.
Guardrails are growing in complexity. The basic essence is matching the text input to the LLM against disallowed phrases and the same outbound with the answer. Matching a string sounds logical, but ask the LLM to output “i” as “1” (one) and “e” as “3” means the text matching doesn’t catch it. You can also output text backwards, again and it no longer matches. Sure that can also be blocked by turning “1”,”0″,”3″ into letters before comparing, or reverse() of the string in the latter case.
But that’s not sustainable. You could ask the AI to substitute something other than the known characters. You could ask the LLM to use “11” instead of “1”.
You could ask it to output a hyphen in between letters…

It did that using:

You could ask it to output each character in hex…


That works on output, encoding the input is a little trickier, as it can’t run code; or can it?


That was surprising!
Except, it doesn’t work when I do this. Quirkily it dropped the “n” in “men”. Hmmm.

Let’s be MORE specific. How stupid is this? It generates the code, then kindly spills the beans about the topic it is not permitted to talk about.

I am horrified by what was said to have happened, but this example is not used for the purpose of criticising China. It’s to highlight that despite best efforts LLMs can be easily misled.

Here’s using “reversed” text to bypass it… Except it demonstrates can’t even reverse accurately spelling Tiananmen as “Tianamaneq”<gulp>, and mangled “square”.

You now appreciate the ineffectiveness of guardrails..
In the above examples what the LLM should do is refuse to answer a question that has [“tank”, “Tiananmen”, “square”, “protest”] tokens. The problem with such an approach is that you’d need a huge list of things to ban, and someone to work them out. Using a block list (things to block) is inadequate as you cannot guarantee you’ve excluded everything. You also cannot make an allow list of all the things it can say.
For guardrails to work they require an understanding of the intent. In the case of the “protests”, China is embarrassed by it and would rather the world forget about it. Making a guardrail will need to be quite clever.
We haven’t yet found a way to cause workable amnesia in an LLM. Making it never learn about something is easy if you’re prepared to be selective in what you train it on. Making it forget is difficult. I remember a recent article showing that negative things seemed to activate the same region of the neural network regardless of what it was negative about. I imagine applying a topic-specific amnesia will cause collateral damage to other things it has learnt.
I hope this inspires you to build something powered by an LLM but also raises awareness.