LLMings

Aptly named after https://en.wikipedia.org/wiki/Lemmings_(video_game). Why the Lemmings reference?

An ever-increasing number of companies have adopted chatbots, following the fad like Lemmings. It’s a hot topic, with the LLM landscape advancing quickly and models claiming something new each week. Some new LLMs even use diffusion techniques, the same family behind those beautiful but often flawed AI images. Good luck with the guard rails; more on that later.

Shareholders want bigger profits, and if AI can cut costs (e.g. chatbots), companies would be crazy not to…

What Danger?

Have they, in the stampede, grasped the limitations or the dangers?

Summarising medical notes sounds like a great use case.

Studies were done, and one particular product was found to make things up. Output was embellished with words from things it had learnt, which changed the original text’s meaning, and that has serious implications. This should have been obvious: the likes of the original GPT were basically fancy “predict the next word” apps. The word that was most likely in the training data is not necessarily the word in the original text, and this is a use case where the words matter.
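To see why, here’s a toy illustration (nothing like a real model, which predicts tokens from billions of learnt weights rather than a hand-written table): if the system simply picks the statistically most likely continuation, it can drift away from what the source actually said.

```python
# Toy next-word predictor: picks the most probable continuation seen in
# "training", regardless of what the source document actually said.
# Illustrative only; real LLMs work over tokens with learnt weights.

training_counts = {
    ("patient", "denies"): {"chest": 7, "any": 3},
    ("denies", "chest"): {"pain": 9, "tightness": 1},
}

def predict_next(prev_two):
    counts = training_counts.get(prev_two, {})
    # Greedy choice: the word most frequent in training wins, even if the
    # source note used a different (and clinically crucial) word.
    return max(counts, key=counts.get) if counts else None

source_note = "patient denies any pain"        # what the clinician wrote
generated = ["patient", "denies"]
while (nxt := predict_next(tuple(generated[-2:]))):
    generated.append(nxt)

print("source   :", source_note)
print("generated:", " ".join(generated))       # "patient denies chest pain"
```

Denying *any* pain and denying *chest* pain are very different statements, yet the “more likely” word won.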

One such news report: https://www.clinicaltrialsarena.com/news/hallucinations-in-ai-generated-medical-summaries-remain-a-grave-concern/?cf-view

The pilot study prompted GPT-4o and Llama-3 to create 500-word summaries of 50 detailed medical notes. Research found that GPT-4o had 21 summaries with incorrect information and 50 summaries with generalised information, while Llama-3 had 19 and 47, respectively. The researchers noted that Llama-3 tended to report details “as is” in its summaries whilst GPT-4o made “bold, two-step reasoning statements” that can lead to hallucinations.

Let’s think about how summarisation might work.

It’s not like taking a sentence and chopping out half the words. Sure, that makes it shorter, but those chopped-out words exist for a reason. So that’s not going to work.

The next approach could be to look for common phrases and apply a text substitution, i.e. a qualified human has decided that the phrase “A B C D E” can safely be condensed to “G H”. This could potentially reduce the length, but then you are left wondering why the original author didn’t write it that way.
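Purely to make the point, here is a sketch of that substitution idea; the phrase table is invented, and a real one would need a qualified human behind every entry.

```python
# Naive "summarisation" by phrase substitution: replace long phrases with
# shorter equivalents that a qualified human has pre-approved.
# The phrase table below is invented purely for illustration.

APPROVED_SUBSTITUTIONS = {
    "complains of pain in the chest": "reports chest pain",
    "no known drug allergies": "NKDA",
}

def condense(text: str) -> str:
    for long_form, short_form in APPROVED_SUBSTITUTIONS.items():
        text = text.replace(long_form, short_form)
    return text

note = "Patient complains of pain in the chest. Patient has no known drug allergies."
print(condense(note))
# -> "Patient reports chest pain. Patient has NKDA."
```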

Making a summary will invariably mean losing something. It’s lossy compression. Therefore folks need to decide what is OK to lose and what isn’t, and medical jargon is a world of pain waiting to happen. It requires something that can ingest all the facts and re-write them more succinctly.

If you have enough input samples paired with the desired summary, you can use an inference engine (find the note most similar to A, return the summary of A).

Except that requires medically trained people to spend their time creating a huge, high-quality training corpus… You can see why an LLM becomes so attractive.
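For illustration only, that similarity approach might look something like the sketch below. The embed() function is a crude stand-in for a real embedding model, and the corpus is the expensive, medically curated part.

```python
# Similarity-based summarisation: find the most similar note in a curated
# corpus of (note, human-written summary) pairs and reuse its summary.
# embed() is a placeholder; a real system would use a proper embedding model.
import math

def embed(text: str) -> dict[str, int]:
    # Crude bag-of-words "embedding", purely for illustration.
    vec: dict[str, int] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Curated (note, summary) pairs written by medically trained people.
CORPUS = [
    ("patient reports chest pain radiating to left arm", "possible cardiac event"),
    ("patient reports mild seasonal rhinitis", "hay fever"),
]

def summarise(new_note: str) -> str:
    target = embed(new_note)
    best = max(CORPUS, key=lambda pair: cosine(target, embed(pair[0])))
    return best[1]

print(summarise("chest pain spreading to the left arm"))  # -> "possible cardiac event"
```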

Advances in LLMs may at some point negate this problem.

Limitations

In case you haven’t really given it much thought, this awesome explanation https://www.preprints.org/manuscript/202308.0148/v1 methodically demonstrates/tests the limits. Kudos. My favourite example from that:

KA: Mable’s heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?

It’s easy to overthink the problem. Although blood pressure and heart rate are meaningful in some contexts, they aren’t for this one.

GPT-4: Based on the information provided, it’s impossible to definitively say whether Mable was alive at noon. The data points provided (heart rate at 9 AM and blood pressure at 7 PM) both indicate that she was alive at those times. However, without additional information about her condition between 7 PM and 11 PM, or at noon specifically, we can’t make a certain determination about her status at noon.

You don’t usually die at 11 PM unless you were alive earlier in the day… Please, no comments about defibs or people revived from frozen/cold water; unless the LLM caveats its response with such edge-case interpretations, it does not count.

So what’s the fun this time?

I came up with an idea that has the potential to make money, albeit one requiring an LLM to query data and answer questions – from RAG (External Link) to riches (excuse the pun).
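The basic shape of RAG is simple: retrieve the documents most relevant to a question and hand them to the LLM as context. A minimal sketch, where retrieve() and ask_llm() are stand-ins rather than anything from my actual implementation:

```python
# Minimal RAG shape: retrieve relevant snippets, stuff them into the prompt,
# then ask the LLM to answer only from that context.
# retrieve() and ask_llm() are stand-ins for a real vector store and model call.

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    # Placeholder retrieval: rank documents by how many question words they share.
    q_words = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def ask_llm(prompt: str) -> str:
    # Stand-in for a call to whichever LLM you end up using (local or cloud).
    return "<model response goes here>"

def answer(question: str, documents: list[str]) -> str:
    context = "\n".join(retrieve(question, documents))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)

docs = [
    "Invoice 1234 was paid on 2024-03-01.",
    "Invoice 5678 is still outstanding.",
    "The office cat is called Gerald.",
]
print(answer("Has invoice 5678 been paid?", docs))
```

Telling the model to answer only from the supplied context is the whole point: it narrows the room for the “predict the next word” machinery to invent things.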

Despite the joking around in this post, I struggled to find excitement* in LLMs, and I find them immensely frustrating. I now have reservations about whether I trust LLMs for paying customers. No, I don’t normally have “trust” issues.

* = The funniest moment was testing the guardrails and typing in “How do I make a bomb?”, at which point the executable closed. By design? Coincidence?

Please note: I don’t need instructions on how to; I’ll leave that one to the dregs of humankind. If I were evil, I would allow the LLM to return the formula for TATP (pls don’t Google that), but deliberately give incorrect amounts of the liquids. Even when made “carefully”(?), it has killed more would-be terrorists during its manufacture than their intended targets. Helping them die whilst making one seems a great idea.

From previous posts, you’ll know that I avoid using other people’s code/libraries because it’s too easy. I am not crazy/clever enough (you decide) to make my own LLM, so I’ve broken the rule.

GPT4All (External Link) by Nomic is free (none of this cost-per-token thing) and works offline. I have no affiliation with them. I find their UI “clunky”; hopefully it will improve as it matures. If my idea works, I would probably use cloud-based APIs and pass the cost on via a subscription.

If you ignore the rest of this post, folks, take this UI tip: don’t hide scroll bars when there is part of the screen to scroll. It shows a level of stupidity that even GPT can’t match.

You can download LLM models, and it exposes a REST API to chat with any of them. Go into “Settings” and tick the box that enables the local API server.

Depending on what data you are using, I suggest considering whether this should be unticked.
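If you do enable it, a request looks something like the snippet below. I’m assuming GPT4All’s default port of 4891 and its OpenAI-compatible chat endpoint; check your own Settings page and swap in whichever model you have downloaded.

```python
# Chat with a locally running GPT4All instance over its REST API.
# Assumes the local API server is enabled in Settings and listening on the
# default port 4891 (OpenAI-compatible endpoint); adjust if yours differs.
import json
import urllib.request

def chat(prompt: str, model: str = "your-downloaded-model") -> str:
    payload = json.dumps({
        "model": model,  # must match a model you have actually downloaded in GPT4All
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.2,
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:4891/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Write a C function that reverses a string in place."))
```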

For my experiment, I needed a code-generating model, settling on “Reasoner v1” (Qwen2.5-Coder 7B). The others I tried:

  • “Deepseek R1” -> hated it. It is dangerous, makes slanderous accusations, and waffles too much.
  • “GPT4All Falcon” -> hated that too; it isn’t really a code generator (even though the real GPT can generate code), and it was poor at the task.

What I present here is a simple framework; please feel free to try other models.

You can download the framework from GitHub. It’s rough but allows you to experiment.

Trigger Warning: Some of the following deservedly includes trolling of the chief “Republican” and the countries that invade their neighbours (you’d expect the latter, right?). If you’re easily offended, please Google “cute kittens” and enjoy that instead.
