Three bits of AI jargon come up again and again, and unlike most of the vocabulary, these three cost you money and expose you to risk directly. Hallucinations are a risk problem — the model states falsehoods with total confidence. Tokens are a cost problem — they're the unit you're billed in. Context windows are a limit problem — they cap what the model can consider at once. Understand these three and you'll spot a bad AI deal, a runaway bill, and a fabricated answer before they hurt you.
The short version: A hallucination is the model confidently making something up. A token is a chunk of text (about ¾ of a word) and it's the unit you pay for. A context window is the maximum number of tokens a model can read and write in one go. The first is your risk exposure, the second is your bill, and the third is a hard limit that also feeds the bill. All three are manageable once you see how they work.
What is an AI hallucination, and why does it happen?
A hallucination is when an AI model states something false as though it's a fact — a fabricated statistic, a citation to a paper that doesn't exist, an invented clause in a contract, a wrong sum presented as certain. The output reads fluently and confidently. The confidence is identical whether the model is right or completely wrong, which is exactly what makes it dangerous.
Here's why it happens, and it's worth understanding because the fix follows from the cause. A language model doesn't hold a database of facts it looks things up in. It predicts the most likely next chunk of text given everything before it. Most of the time, likely text is also true text — the model has seen enough correct examples that the plausible answer is the correct one. But when it hasn't seen the answer, or the question sits in a gap, it still produces the most plausible-sounding continuation. That continuation can be pure fiction, delivered with the same steady confidence as a fact.
There's no built-in flag that says "I'm unsure here." Unless the surrounding system is designed to ground answers and admit uncertainty, the model has no natural way to say "I don't know." It would rather give you a smooth wrong answer than an honest gap.
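To make the mechanism concrete, here is a toy sketch in Python. It isn't a real model: the prompts, the company name and the probabilities are all invented for illustration. It shows one thing only, that picking the most likely continuation always returns something, whether or not that something is well supported.

```python
# Toy stand-in for a language model: a hand-written table of next-chunk
# probabilities. Every prompt, name and number here is invented.
NEXT_CHUNK_PROBS = {
    # Well-covered question: one continuation dominates.
    "The capital of France is": {"Paris": 0.97, "Lyon": 0.02, "Nice": 0.01},
    # Question in a gap: nothing is well supported, but something still ranks first.
    # (Only the top candidates are listed, so these do not sum to 1.)
    "The 2019 revenue of Example Widgets Ltd was": {
        "4.2 million": 0.21,
        "3.8 million": 0.20,
        "5.1 million": 0.19,
    },
}


def predict_next(prompt: str) -> str:
    """Greedy decoding: return the single most likely continuation."""
    candidates = NEXT_CHUNK_PROBS[prompt]
    return max(candidates, key=candidates.get)


for prompt in NEXT_CHUNK_PROBS:
    # Both answers come back in the same form, with no marker of uncertainty.
    print(prompt, predict_next(prompt))
```

The second answer wins with barely a fifth of the probability, yet it comes back looking exactly like the first. Real models often sample rather than always taking the top choice, but the point stands: a weak guess and a solid fact leave the system in the same packaging.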
Where hallucinations bite hardest:
- Legal, financial, medical or safety-related answers where a confident falsehood causes real harm
- Anything customer-facing, where a made-up policy or price becomes a promise you have to honour
- Citations and figures — models are notorious for inventing plausible-looking sources and numbers
- Niche or recent topics the model has thin or no training on
A widely reported example of the risk: a Canadian airline was held to a refund policy that its support chatbot had invented — the tribunal ruled the company was responsible for what its AI told a customer. The lesson lands for any UK business too. If your AI tells a customer something, you own it.
How do I reduce hallucinations?
You can't eliminate them entirely, but you can cut them down hard with a few deliberate choices. The point is to stop the model guessing and give it real material to work from, then keep a human in the loop where it counts.
- Ground it in real sources (RAG). Instead of asking the model to answer from memory, retrieve the relevant passages from your own documents and have it answer from those. This is the single biggest lever — see our guide on RAG, fine-tuning and prompting for how that works. An answer built from a real document you control is far harder to fabricate.
- Demand citations. Ask the model to quote or point to the source of each claim. Answers it can't source are answers to distrust, and asking for sources makes gaps visible.
- Keep a human check where risk is real. For anything legal, financial, medical or customer-binding, unverified AI output should never go straight out. Draft with AI, decide with a person.
- Constrain the task. A narrow, well-defined question grounded in specific data hallucinates far less than an open-ended "tell me about..." prompt.
- Let it say "I don't know." Systems can be designed to return "not found in the source material" rather than inventing an answer. That honesty is a feature worth building in.
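As a rough illustration of grounding, citations and an honest "not found" working together, here is a minimal Python sketch. Everything in it is hypothetical: the two passages, the file names, the `grounded_prompt` helper and the crude word-overlap matching, which stands in for whatever retrieval a real system would use. The model call itself is left out.

```python
import re
from typing import Optional

NOT_FOUND = "Not found in the source material."

# Hypothetical knowledge base; in practice these come from your own documents.
PASSAGES = {
    "refunds.md": "Refunds are available within 30 days of purchase with proof of payment.",
    "delivery.md": "Standard delivery takes 3 to 5 working days within the UK.",
}


def words(text: str) -> set:
    """Lower-case the text and split it into a set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, min_overlap: int = 2) -> list:
    """Return (source, passage) pairs sharing at least min_overlap words with the question."""
    question_words = words(question)
    return [
        (source, text)
        for source, text in PASSAGES.items()
        if len(question_words & words(text)) >= min_overlap
    ]


def grounded_prompt(question: str) -> Optional[str]:
    """Build a prompt that restricts the model to retrieved sources.

    Returns None when nothing relevant was retrieved, so the caller can
    reply with NOT_FOUND instead of letting the model answer from memory.
    """
    hits = retrieve(question)
    if not hits:
        return None
    sources = "\n".join(f"[{source}] {text}" for source, text in hits)
    return (
        "Answer using only the sources below and cite the source name for each claim. "
        f'If the answer is not in the sources, reply exactly: "{NOT_FOUND}"\n\n'
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


print(grounded_prompt("Can I get refunds after 30 days?"))
print(grounded_prompt("What is your policy on pets?") or NOT_FOUND)
```

The design choice worth noticing is the early return: when retrieval finds nothing, the model is never asked, so it has no chance to invent a policy.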
Get grounding and human review right and hallucinations move from a landmine to a managed, acceptable risk. Ignore them and you're one confident fabrication away from a refund you didn't agree to, or worse.
What is a token, and how does it drive my bill?
A token is a chunk of text — roughly four characters, or about three-quarters of a word in English. Models don't read letters or words; they read tokens. Common words are usually one token; longer or unusual words split into several. As a rough rule, 1,000 tokens is about 750 words, and a page of typical prose is somewhere around 500 tokens.
This matters because tokens are the unit you're billed in. Every AI system that calls a model pays per token, and the bill has two sides:
- Tokens in — everything you send: your instructions, the question, and any documents or context you attach
- Tokens out — everything the model generates in reply
Output tokens usually cost several times more than input tokens, so a system that generates long, verbose answers costs more than one that returns something tight. But the bigger surprise for most businesses is on the input side. If your system attaches a large document to every single request, you pay for all those input tokens every single time.
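The rules of thumb above (about four characters per token, about three-quarters of a word per token) are enough for back-of-envelope estimates. A small Python sketch, with hypothetical function names, and with the caveat that real counts come from each model's own tokenizer and will differ:

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough English-only estimate: about 4 characters per token."""
    return math.ceil(len(text) / 4)


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough estimate: about 0.75 words per token."""
    return round(word_count / 0.75)


print(estimate_tokens_from_words(750))  # the 750-word rule of thumb from above
print(estimate_tokens("Please summarise the attached refund policy in two sentences."))
```

Use it to size a prompt or a document before you send it, not to reconcile an invoice.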
A worked cost example
Let's make it concrete with rough, illustrative maths. Exact prices vary by model and change often, so treat these as a way to reason, not a quote. Suppose a mid-range model charges roughly £2.50 per million input tokens and £10 per million output tokens.
You build a support assistant. Each request sends:
- A system prompt and instructions: ~500 tokens
- Three retrieved document passages for grounding: ~2,000 tokens
- The customer's question: ~100 tokens
- The model's answer: ~400 tokens out
| Item | Tokens | Rough cost per request |
|---|---|---|
| Input (prompt + passages + question) | ~2,600 in | ~£0.0065 |
| Output (the answer) | ~400 out | ~£0.0040 |
| Total per request | ~3,000 | ~£0.0105 |
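The same per-request arithmetic as a short Python sketch, using the illustrative prices assumed above (£2.50 and £10 per million tokens). The function name is made up for this example, and the prices are not a quote for any real model.

```python
# Illustrative prices in GBP per million tokens, as assumed in this example.
INPUT_PRICE_PER_MILLION = 2.50
OUTPUT_PRICE_PER_MILLION = 10.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in GBP: input and output are priced separately."""
    return (
        input_tokens * INPUT_PRICE_PER_MILLION
        + output_tokens * OUTPUT_PRICE_PER_MILLION
    ) / 1_000_000


# System prompt + retrieved passages + customer question.
input_tokens = 500 + 2_000 + 100
output_tokens = 400

cost = request_cost(input_tokens, output_tokens)
print(f"{input_tokens} in + {output_tokens} out = GBP {cost:.4f} per request")
```

Swap in your own token counts and your provider's current prices to get a figure for your system.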
A penny a request sounds trivial. Now scale it: 5,000 requests a month is about £53. Fine. But watch what happens if you get lazy with design and stuff the entire 40-page knowledge base — say 20,000 tokens — into every request instead of retrieving only the relevant passages:
| Design | Input tokens per request | Rough cost at 5,000 requests/month |
|---|---|---|
| Retrieve only relevant passages | ~2,600 | ~£53 |
| Dump the whole knowledge base each time | ~20,600 | ~£278 |
Same feature, more than five times the bill, purely because of how the system was built. That gap between retrieving and dumping is the difference between a system an engineer designed and one that was thrown together. It's exactly the kind of thing our AI System Audit catches: real token maths on your actual usage, in a written report, so you know whether a system is costing what it should.
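If you want to run the retrieve-versus-dump comparison with your own numbers, a sketch along these lines will do it. It assumes the same illustrative prices and token counts as the worked example, and it assumes the dump design still sends the system prompt and the question alongside the 20,000-token knowledge base. The function and variable names are hypothetical.

```python
def monthly_cost(
    input_tokens: int,
    output_tokens: int,
    requests_per_month: int,
    input_price_per_million: float = 2.50,   # GBP, illustrative
    output_price_per_million: float = 10.00,  # GBP, illustrative
) -> float:
    """Monthly cost in GBP for a fixed request shape and volume."""
    per_request = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return per_request * requests_per_month


PROMPT, QUESTION, ANSWER = 500, 100, 400

# Input tokens per request under each design.
designs = {
    "retrieve relevant passages": PROMPT + 2_000 + QUESTION,
    "dump whole knowledge base": PROMPT + 20_000 + QUESTION,
}

costs = {name: monthly_cost(tokens, ANSWER, 5_000) for name, tokens in designs.items()}
for name, cost in costs.items():
    print(f"{name}: GBP {cost:.2f} per month")

ratio = costs["dump whole knowledge base"] / costs["retrieve relevant passages"]
print(f"dump costs {ratio:.1f}x as much as retrieve")
```

The exact multiple shifts with what else you send on each request and how long the answers are, which is why it pays to plug in real usage rather than trust a rule of thumb.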
The practical takeaways on tokens:
- Long documents and verbose prompts cost money on every call, not once
- Retrieving only what's relevant beats sending everything — cheaper and often more accurate
- Output length is a cost lever; ask for concise answers when you don't need an essay
- Usage volume multiplies everything, so per-request cost matters enormously at scale
What is a context window, and why does it limit what the model can see?
A context window is the maximum amount of text — measured in tokens — a model can hold in mind for a single request. It covers both what you send in and what the model writes back. Think of it as the model's short-term working memory for one conversation or task. Everything the model reasons over has to fit inside that window; anything beyond it, the model simply can't see.
Modern models have large windows: some hold hundreds of thousands of tokens, the equivalent of a long book. But "large" is not "infinite", and two problems follow from the limit:
- Overflow. If your input plus the expected output exceeds the window, something has to give. The system either refuses, truncates your content, or drops the earliest part of a long conversation. A support bot that "forgets" what the customer said ten messages ago has usually run out of context window.
- Cost and focus. Even when everything fits, filling a huge window has a downside beyond the token bill. Bury the one relevant paragraph inside 300 pages of context and the model can lose the thread — accuracy can dip when the signal is drowned in noise. More context is not automatically better.
This is one of the strongest arguments for RAG over brute force. Rather than cramming your entire document library into the context window on every request — expensive, and easy for the model to lose focus in — you retrieve the few passages that actually matter and hand the model a clean, tight context. Smaller, sharper context usually beats bigger, noisier context on both cost and accuracy.
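The overflow case described above, where the earliest part of a long conversation gets dropped, can be sketched in a few lines of Python. This is one simple policy among several, not how any particular product does it. The function names are hypothetical and the token counts use the rough four-characters-per-token estimate.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token."""
    return math.ceil(len(text) / 4)


def fit_to_window(messages: list, window_tokens: int, reserved_for_output: int) -> list:
    """Keep the most recent messages that fit; drop the earliest ones.

    The window covers input and output together, so space for the reply
    is set aside before any messages are counted.
    """
    budget = window_tokens - reserved_for_output
    kept = []
    used = 0
    # Walk backwards from the newest message and stop at the first that won't fit.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # three messages, about 10 tokens each
print(len(fit_to_window(history, window_tokens=35, reserved_for_output=10)))
```

With a 35-token window and 10 tokens reserved for the reply, only the two newest messages survive. That is the "forgetful" support bot in miniature: nothing is broken, the oldest message simply no longer fits.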
What the context window means for you in practice:
- There's a hard ceiling on how much a model can consider at once — you can't paste in unlimited data
- Long conversations and huge documents can push against or exceed the limit
- Bigger context costs more tokens and can reduce accuracy if it's mostly noise
- Good system design feeds the model the right context, not the most context
How these three connect — and why they're really one story
Hallucinations, tokens and context windows look like three separate bits of jargon. They're actually one design problem seen from three angles.
A model hallucinates when it's guessing instead of reading real sources. The fix — grounding it in your documents — means putting the right material into the context window. But the context window is finite and every token in it costs money. So the engineering job is to retrieve the right passages: enough context to keep answers grounded and cut hallucinations, few enough tokens to keep the bill sane and the focus sharp. Get that balance right and you've solved the risk problem and the cost problem in the same move.
That balance is what separates a cheap chatbot that guesses and racks up surprise bills from a system that answers accurately at a predictable cost. It doesn't happen by accident. It's a deliberate set of choices about what goes into the window and what stays out.
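One way to picture that set of choices is as a budget: take the most relevant passages first, skip anything too weak to help, and stop adding once the token budget is spent. The Python sketch below assumes each passage already carries a relevance score from some retriever. The `Passage` type, the scores, the threshold and the budget are all invented for illustration.

```python
from typing import NamedTuple


class Passage(NamedTuple):
    text: str
    score: float  # hypothetical relevance score, higher is better
    tokens: int   # token count of the passage


def select_passages(passages: list, token_budget: int, min_score: float = 0.3) -> list:
    """Greedily pick the highest-scoring passages that fit within the budget."""
    chosen = []
    used = 0
    for passage in sorted(passages, key=lambda p: p.score, reverse=True):
        if passage.score < min_score:
            break  # everything after this is weaker still: noise, not grounding
        if used + passage.tokens <= token_budget:
            chosen.append(passage)
            used += passage.tokens
    return chosen


candidates = [
    Passage("refund policy", 0.9, 700),
    Passage("refund exceptions", 0.8, 900),
    Passage("company history", 0.4, 600),
    Passage("office opening hours", 0.2, 300),
]

for passage in select_passages(candidates, token_budget=2_000):
    print(passage.text, passage.tokens)
```

Here the two refund passages go in, the company history is left out because it would break the budget, and the opening hours are left out because they are too weakly related to be worth paying for.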
| Term | It's really about | Your exposure | The lever |
|---|---|---|---|
| Hallucination | The model guessing | Risk — false answers, liability | Grounding (RAG), citations, human check |
| Token | The unit of text | Cost — you pay per token | Retrieve, don't dump; keep answers tight |
| Context window | The model's working memory | Limits — hard ceiling, focus loss | Feed the right context, not the most |
The honest verdict
None of these three are reasons to avoid AI. They're reasons to build it — or buy it — with your eyes open.
Hallucinations are real and they carry genuine liability, so anything customer-facing or high-stakes needs grounding and a human in the loop, full stop. Tokens mean AI has a running cost that scales with use, and a badly designed system can cost many times what a well-designed one does for the identical feature. Context windows mean there's a limit to what a model can consider, and pretending otherwise leads to systems that forget, truncate or lose focus.
If a vendor waves these away — "don't worry about hallucinations", "the cost is nothing", "the window's huge, just paste everything in" — that's your signal they haven't thought hard about your risk or your bill. The people who take these three seriously are the ones building systems that hold up.
If you want the token maths and the risk exposure worked out for a specific system, in writing, before you commit — that's precisely what the AI System Audit delivers, and our pricing guide covers what building it properly costs. For the wider vocabulary with an honest verdict on each term, start with the AI glossary, and to avoid the traps that catch most companies, read the mistakes UK businesses make with AI.