AI hallucinations, tokens and context windows: the jargon that hits your bills and your risk

What an AI hallucination, a token and a context window actually are — and how each one maps to real cost and real risk for your business.

Christian Gibbs · founder — last updated 3 July 2026 · 12 min read

Three bits of AI jargon come up again and again, and unlike most of the vocabulary, these three cost you money and expose you to risk directly. Hallucinations are a risk problem — the model states falsehoods with total confidence. Tokens are a cost problem — they're the unit you're billed in. Context windows are a limit problem — they cap what the model can consider at once. Understand these three and you'll spot a bad AI deal, a runaway bill, and a fabricated answer before they hurt you.

The short version: A hallucination is the model confidently making something up. A token is a chunk of text (about ¾ of a word) and it's the unit you pay for. A context window is the maximum number of tokens a model can read and write in one go. The first is your risk exposure, the second is your bill, and the third is a hard limit that affects both. All three are manageable once you see how they work.

What is an AI hallucination, and why does it happen?

A hallucination is when an AI model states something false as though it's a fact — a fabricated statistic, a citation to a paper that doesn't exist, an invented clause in a contract, a wrong sum presented as certain. The output reads fluently and confidently. The confidence is identical whether the model is right or completely wrong, which is exactly what makes it dangerous.

Here's why it happens, and it's worth understanding because the fix follows from the cause. A language model doesn't look facts up in a database. It predicts the most likely next chunk of text given everything before it. Most of the time, likely text is also true text: the model has seen enough correct examples that the plausible answer is the correct one. But when it hasn't seen the answer, or the question falls into a gap in what it has seen, it still produces the most plausible-sounding continuation. That continuation can be pure fiction, delivered with the same steady confidence as a fact.

There's no built-in flag that says "I'm unsure here." Unless the surrounding system is designed to ground answers and admit uncertainty, the model has no natural way to say "I don't know." It would rather give you a smooth wrong answer than an honest gap.

Where hallucinations bite hardest:

Legal, financial, medical or safety-related answers where a confident falsehood causes real harm
Anything customer-facing, where a made-up policy or price becomes a promise you have to honour
Citations and figures — models are notorious for inventing plausible-looking sources and numbers
Niche or recent topics where the model has little or no training data

A widely reported example of the risk: a Canadian airline was held to a refund policy that its support chatbot had invented — the tribunal ruled the company was responsible for what its AI told a customer. The lesson lands for any UK business too. If your AI tells a customer something, you own it.

How do I reduce hallucinations?

You can't eliminate them entirely, but you can cut them down hard with a few deliberate choices. The point is to stop the model guessing and give it real material to work from, then keep a human in the loop where it counts.

Ground it in real sources (RAG). Instead of asking the model to answer from memory, retrieve the relevant passages from your own documents and have it answer from those. This is the single biggest lever — see our guide on RAG, fine-tuning and prompting for how that works. An answer built from a real document you control is far harder to fabricate.
Demand citations. Ask the model to quote or point to the source of each claim. Answers it can't source are answers to distrust, and asking for sources makes gaps visible.
Keep a human check where risk is real. For anything legal, financial, medical or customer-binding, unverified AI output should never go straight out. Draft with AI, decide with a person.
Constrain the task. A narrow, well-defined question grounded in specific data hallucinates far less than an open-ended "tell me about..." prompt.
Let it say "I don't know." Systems can be designed to return "not found in the source material" rather than inventing an answer. That honesty is a feature worth building in.
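As a rough sketch of how grounding, citations, the "not found" fallback and the human check might fit together in code: the function names, the prompt wording and the NOT_FOUND string below are illustrative assumptions, not any vendor's API, and no model is actually called. It assumes the relevant passages have already been retrieved.

```python
import re

# Hypothetical fallback text the model is told to return when the sources are silent.
NOT_FOUND = "Not found in the source material."


def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt that restricts the model to numbered source passages."""
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer the question using only the numbered sources below.\n"
        "After each claim, cite the source number in square brackets.\n"
        f"If the sources do not contain the answer, reply exactly: {NOT_FOUND}\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


def needs_human_review(answer: str, high_risk: bool) -> bool:
    """Flag answers that are high-risk, or that make claims without any citation."""
    has_citation = re.search(r"\[\d+\]", answer) is not None
    is_honest_gap = answer.strip() == NOT_FOUND
    return high_risk or (not is_honest_gap and not has_citation)
```

A citation check like this only proves the answer points at a source, not that the source actually supports the claim, which is why the high-risk path goes to a person regardless.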

Get grounding and human review right and hallucinations move from a landmine to a managed, acceptable risk. Ignore them and you're one confident fabrication away from a refund you didn't agree to, or worse.

What is a token, and how does it drive my bill?

A token is a chunk of text — roughly four characters, or about three-quarters of a word in English. Models don't read letters or words; they read tokens. Common words are usually one token; longer or unusual words split into several. As a rough rule, 1,000 tokens is about 750 words, and a page of typical prose is somewhere around 500 tokens.
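For back-of-envelope budgeting, the rules of thumb above can be turned into a tiny estimator. This is a sketch only: the function names are made up for illustration, the ratios apply to English prose, and a real tokenizer for your chosen model will give different (and exact) counts.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb from the text, English only
WORDS_PER_TOKEN = 0.75   # about 750 words per 1,000 tokens


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate tokens as characters divided by four, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate tokens as words divided by 0.75, rounded up."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


print(estimate_tokens_from_words(750))  # the 750-word rule of thumb
```

An estimate like this is good enough to spot a design that is ten times too expensive; it is not good enough to reconcile an invoice.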

This matters because tokens are the unit you're billed in. Every AI system that calls a model pays per token, and the bill has two sides:

Tokens in — everything you send: your instructions, the question, and any documents or context you attach
Tokens out — everything the model generates in reply

Output tokens usually cost several times more than input tokens, so a system that generates long, verbose answers costs more than one that returns something tight. But the bigger surprise for most businesses is on the input side. If your system attaches a large document to every single request, you pay for all those input tokens every single time.

A worked cost example

Let's make it concrete with rough, illustrative maths. Exact prices vary by model and change often, so treat these as a way to reason, not a quote. Suppose a mid-range model charges roughly £2.50 per million input tokens and £10 per million output tokens.

You build a support assistant. Each request sends:

A system prompt and instructions: ~500 tokens
Three retrieved document passages for grounding: ~2,000 tokens
The customer's question: ~100 tokens
The model's answer: ~400 tokens out

Item	Tokens	Rough cost per request
Input (prompt + passages + question)	~2,600 in	~£0.0065
Output (the answer)	~400 out	~£0.0040
Total per request		~£0.01
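The arithmetic behind that table, as a short sketch. The prices are the illustrative figures from this example, not a quote for any real model, and the function name is hypothetical.

```python
INPUT_PRICE_PER_MILLION = 2.50    # GBP, illustrative figure from the example
OUTPUT_PRICE_PER_MILLION = 10.00  # GBP, illustrative figure from the example


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in GBP at the illustrative prices above."""
    input_cost = input_tokens * INPUT_PRICE_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_MILLION / 1_000_000
    return input_cost + output_cost


# Prompt (500) + retrieved passages (2,000) + question (100) = 2,600 tokens in.
input_tokens = 500 + 2_000 + 100
per_request = request_cost(input_tokens, output_tokens=400)
print(f"Per request: £{per_request:.4f}")
```

Multiplying the per-request figure by your request volume gives the monthly cost, and swapping in a different input size shows how sensitive the bill is to what you attach.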

A penny a request sounds trivial. Now scale it: 5,000 requests a month is about £53. Fine. But watch what happens if you get lazy with design and stuff the entire 40-page knowledge base — say 20,000 tokens — into every request instead of retrieving only the relevant passages:

Design	Input tokens per request	Rough cost at 5,000 requests/month
Retrieve only relevant passages	~2,600	~£53
Dump the whole knowledge base each time	~20,600	~£278

Same feature, roughly five times the bill, purely because of how the system was built. (The dump figure is the 20,000-token knowledge base plus the same ~600 tokens of prompt and question, with the same ~400-token answer.) That gap between retrieving and dumping is the difference between a system an engineer designed and one that was thrown together. It's exactly the kind of thing our AI System Audit catches: real token maths on your actual usage, in a written report, so you know whether a system is costing what it should.

The practical takeaways on tokens:

Long documents and verbose prompts cost money on every call, not once
Retrieving only what's relevant beats sending everything — cheaper and often more accurate
Output length is a cost lever; ask for concise answers when you don't need an essay
Usage volume multiplies everything, so per-request cost matters enormously at scale

What is a context window, and why does it limit what the model can see?

A context window is the maximum amount of text — measured in tokens — a model can hold in mind for a single request. It covers both what you send in and what the model writes back. Think of it as the model's short-term working memory for one conversation or task. Everything the model reasons over has to fit inside that window; anything beyond it, the model simply can't see.

Modern models have large windows — some hold the equivalent of a long book. But "large" is not "infinite", and two problems follow from the limit:

Overflow. If your input plus the expected output exceeds the window, something has to give. The system either refuses, truncates your content, or drops the earliest part of a long conversation. A support bot that "forgets" what the customer said ten messages ago has usually run out of context window.
Cost and focus. Even when everything fits, filling a huge window has a downside beyond the token bill. Bury the one relevant paragraph inside 300 pages of context and the model can lose the thread — accuracy can dip when the signal is drowned in noise. More context is not automatically better.
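To make the overflow case concrete, here is a minimal sketch of one common way a system might handle a long conversation: keep the newest messages that fit and drop the oldest. The function names and the four-characters-per-token estimate are assumptions for illustration; real systems count tokens with the model's own tokenizer and may summarise old messages rather than discard them.

```python
import math


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token."""
    return math.ceil(len(text) / 4)


def trim_history(messages: list[str], window: int, reserved_for_output: int) -> list[str]:
    """Keep the most recent messages that fit in the window, dropping the oldest first."""
    budget = window - reserved_for_output   # the reply has to fit in the window too
    kept: list[str] = []
    used = 0
    for message in reversed(messages):      # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break                           # this message and everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))             # restore chronological order
```

This is exactly the mechanism behind a bot that "forgets": the early messages were never lost by the model, they were cut before the model ever saw them.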

This is one of the strongest arguments for RAG over brute force. Rather than cramming your entire document library into the context window on every request — expensive, and easy for the model to lose focus in — you retrieve the few passages that actually matter and hand the model a clean, tight context. Smaller, sharper context usually beats bigger, noisier context on both cost and accuracy.
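A toy sketch of that retrieve-then-answer idea, picking passages within a token budget. Keyword overlap stands in for real retrieval here purely to keep the example self-contained; production systems typically use embedding-based search. All names and the token estimate are illustrative assumptions.

```python
import math
import re


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token."""
    return math.ceil(len(text) / 4)


def words(text: str) -> set[str]:
    """Lower-cased set of the words and numbers in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str], token_budget: int) -> list[str]:
    """Pick the passages sharing the most words with the question, within a token budget."""
    query = words(question)
    ranked = sorted(passages, key=lambda p: len(query & words(p)), reverse=True)
    selected: list[str] = []
    used = 0
    for passage in ranked:
        if not query & words(passage):
            break                     # remaining passages share nothing with the question
        cost = estimate_tokens(passage)
        if used + cost <= token_budget:
            selected.append(passage)  # relevant and still within budget
            used += cost
    return selected
```

The budget is the design decision that matters: it caps the input tokens you pay for on every request, whatever the size of the document library behind it.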

What the context window means for you in practice:

There's a hard ceiling on how much a model can consider at once — you can't paste in unlimited data
Long conversations and huge documents can push against or exceed the limit
Bigger context costs more tokens and can reduce accuracy if it's mostly noise
Good system design feeds the model the right context, not the most context

How these three connect — and why they're really one story

Hallucinations, tokens and context windows look like three separate bits of jargon. They're actually one design problem seen from three angles.

A model hallucinates when it's guessing instead of reading real sources. The fix — grounding it in your documents — means putting the right material into the context window. But the context window is finite and every token in it costs money. So the engineering job is to retrieve the right passages: enough context to keep answers grounded and cut hallucinations, few enough tokens to keep the bill sane and the focus sharp. Get that balance right and you've solved the risk problem and the cost problem in the same move.

That balance is what separates a cheap chatbot that guesses and racks up surprise bills from a system that answers accurately at a predictable cost. It doesn't happen by accident. It's a deliberate set of choices about what goes into the window and what stays out.

Term	It's really about	Your exposure	The lever
Hallucination	The model guessing	Risk — false answers, liability	Grounding (RAG), citations, human check
Token	The unit of text	Cost — you pay per token	Retrieve, don't dump; keep answers tight
Context window	The model's working memory	Limits — hard ceiling, focus loss	Feed the right context, not the most

The honest verdict

None of these three are reasons to avoid AI. They're reasons to build it — or buy it — with your eyes open.

Hallucinations are real and they carry genuine liability, so anything customer-facing or high-stakes needs grounding and a human in the loop, full stop. Tokens mean AI has a running cost that scales with use, and a badly designed system can cost many times what a well-designed one does for the identical feature. Context windows mean there's a limit to what a model can consider, and pretending otherwise leads to systems that forget, truncate or lose focus.

If a vendor waves these away — "don't worry about hallucinations", "the cost is nothing", "the window's huge, just paste everything in" — that's your signal they haven't thought hard about your risk or your bill. The people who take these three seriously are the ones building systems that hold up.

If you want the token maths and the risk exposure worked out for a specific system, in writing, before you commit — that's precisely what the AI System Audit delivers, and our pricing guide covers what building it properly costs. For the wider vocabulary with an honest verdict on each term, start with the AI glossary, and to avoid the traps that catch most companies, read the mistakes UK businesses make with AI.

Frequently asked

Straight answers.

What is an AI hallucination?
An AI hallucination is when a model states something false as if it were true — a fake statistic, an invented citation, a made-up policy. It happens because the model predicts plausible text rather than looking facts up. It sounds confident whether it's right or wrong.
Why do AI models hallucinate?
Because a language model generates the most likely next words based on patterns, not a database it checks. When it has no real answer, it fills the gap with something that reads correctly. There's no built-in 'I don't know' unless the system is designed to produce one.
What is a token in AI?
A token is a chunk of text — roughly four characters, or about three-quarters of a word in English. Models read and write in tokens, and you're billed per token. Roughly 1,000 tokens is about 750 words.
How do tokens affect my AI costs?
You pay for tokens in (what you send) and tokens out (what the model returns). Long documents, big prompts and verbose answers all cost more. A system that sends a 10,000-word document to the model on every request racks up tokens fast.
What is a context window?
A context window is the maximum amount of text — measured in tokens — a model can consider at once, covering both your input and its reply. Exceed it and the model can't 'see' the overflow. It's the model's short-term working memory for a single request.
How do I stop an AI from making things up?
Ground it in real sources with RAG so it answers from your documents, ask it to cite where each claim comes from, keep a human checking anything that carries risk, and never let unverified AI output go straight to a customer or a legal, financial or medical decision.

Start here

Want a straight answer on your own systems?

The £8k AI System Audit is a fixed-scope review of what you have and what is worth building. A written report, not a sales deck.
