AI Engineering · Eval Infrastructure

AI System Audit — £8,000 Fixed-Fee Diagnostic for Production AI Systems

For teams running LLM features in production. Two-week diagnostic, eight-thousand-pound fixed fee, no discovery dance. A report a technical founder can act on in fifteen minutes — and an appendix an engineer can verify.

Five questions

The five questions you can't answer without an eval harness

01
Can you catch LLM regressions before customers do?
A prompt edit at four on a Friday. A checkpoint silently updated. The first signal is a support ticket — already in front of users.
02
Do you know your model swap risk?
Provider deprecates a checkpoint. Pricing shifts mid-quarter. Without evals, "should we switch?" is a guess wrapped in a slide deck.
03
What changes when you touch a prompt?
An engineer tightens a prompt to fix one complaint. Two weeks later, five other workflows are quietly worse. Nobody connects it back.
04
Are your LLM outputs faithful to your source data?
RAG retrieves the right document, then the model paraphrases something the document didn't say. Accuracy catches wrong answers; faithfulness catches confident fabrications.
05
Are your confidence scores calibrated?
The model says 0.9 and is wrong four times in ten. Your human-in-the-loop routes the wrong cases and trusts the rest.
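The calibration question above can be checked with a reliability table: bucket predictions by reported confidence and compare each bucket's average confidence with its observed accuracy. The following is a minimal sketch, assuming you already have (confidence, was_correct) pairs from human-labelled traces. The function names and the ten-bin default are illustrative choices, not part of the audit's tooling.

```python
from collections import defaultdict


def reliability_table(records, n_bins=10):
    """Group (confidence, correct) pairs into equal-width confidence bins.

    Returns one row per non-empty bin:
    (bin_lower_edge, mean_confidence, accuracy, count).
    """
    bins = defaultdict(list)
    for conf, correct in records:
        # A confidence of exactly 1.0 falls into the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    rows = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((idx / n_bins, mean_conf, accuracy, len(items)))
    return rows


def expected_calibration_error(records, n_bins=10):
    """Count-weighted average gap between confidence and accuracy."""
    if not records:
        return 0.0
    total = len(records)
    return sum(
        count / total * abs(mean_conf - accuracy)
        for _, mean_conf, accuracy, count in reliability_table(records, n_bins)
    )


# The scenario from the text: the model reports 0.9 and is wrong 4 times in 10.
records = [(0.9, True)] * 6 + [(0.9, False)] * 4
print(round(expected_calibration_error(records), 3))
```

For this sample the gap is roughly 0.3: a reported confidence of 0.9 against an observed accuracy of 0.6. A well-calibrated system keeps that gap close to zero in every bin.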

Deliverables

What you get

2-week diagnostic
Week one — kickoff, read-only access, system map. Week two — analysis, draft, internal review. End-of-week check-ins so nothing lands as a surprise.
30-page report
Findings ranked by severity — failure mode named, evidence attached, fix sized in engineer-days. Prioritised list at the front. Raw eval output in the appendix.
One-page exec summary
Risk level, top three findings, next step, rough cost. No code, no eval jargon — the version that travels to a board.
60-min review call
The engineer who wrote the report walks you through it. Not an account manager. Bring your engineering lead and anyone remediation will touch.

Audit scope

What we look at

Golden set audit
Breadth, labelling consistency, staleness, size against system surface. The typical finding: 40 cases, 38 of them happy paths, edge cases six months old.
CI integration
Do evals run on every PR, or just on a laptop before release? Do they gate the deploy, or post a comment nobody reads?
Production sampling
Real inputs feeding back into the suite — or a golden set frozen while customers ask new questions every week. The most common gap.
Alerting
Score regression, cost spike, latency cliff, tool-call failures. Routed to a channel nobody mutes — or learned from a customer first.
Calibration
Does the judge model agree with humans? Does reported confidence match real accuracy? Without it, your human-in-the-loop routes on a lie.
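The CI integration and alerting items above come down to the same comparison: per-case scores on the golden set before and after a change, with a non-zero exit code when a case gets worse. Here is a rough sketch of such a gate, assuming scores are floats in [0, 1] keyed by a case ID. The case names, the 0.02 tolerance, and the function names are hypothetical.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return the IDs of cases whose score dropped by more than `tolerance`.

    `baseline` and `candidate` map case_id -> score. A case that is missing
    from the candidate run is treated as a regression.
    """
    regressed = []
    for case_id, old_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None or old_score - new_score > tolerance:
            regressed.append(case_id)
    return sorted(regressed)


def gate(baseline, candidate, tolerance=0.02):
    """Print each regression and return a process exit code (0 = pass)."""
    regressed = find_regressions(baseline, candidate, tolerance)
    for case_id in regressed:
        new_score = candidate.get(case_id)
        shown = "missing" if new_score is None else f"{new_score:.2f}"
        print(f"REGRESSION {case_id}: {baseline[case_id]:.2f} -> {shown}")
    return 1 if regressed else 0


# Illustrative scores only: main branch versus the PR under review.
baseline = {"refund-policy": 0.95, "cancel-flow": 0.90, "edge-empty-input": 0.80}
candidate = {"refund-policy": 0.96, "cancel-flow": 0.70, "edge-empty-input": 0.80}

exit_code = gate(baseline, candidate)
# In a CI job, pass exit_code to sys.exit() so a regression fails the build.
```

Comparing case by case rather than on the average score is a deliberate choice here: one workflow getting much worse can hide behind small gains elsewhere.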

Right fit

Who this is for

Teams with paying customers
LLM feature live for 3+ months
>100 production users

Wrong fit

Who this is NOT for

Pre-prototype
Single-prompt apps
Internal-only experiments

After the audit

What happens after the audit

Three paths, your call. Most buyers fix the gaps themselves — the deliverable is built for that. No upsell, no follow-up sales sequence.

DIY fix
Your engineers do the work. We answer questions by email if you get stuck. No follow-up sales call. Most buyers take this path.
Scoped £15–30k remediation
We do the work. Audit findings become the scope — no second discovery. Four to eight weeks, fixed-fee. Same engineer leads the build.
Ongoing partnership
Monthly retainer to keep the harness alive — score review, swap risk runs, quarterly golden-set refresh. From £3,500 a month. Same engineer.

Pricing

£8,000 fixed. No discovery dance.

Pay 50% upfront, 50% on delivery.

Covers the full diagnostic — five audit areas, the report, the exec summary, the review call. Scope fixed in writing before you sign. If it runs long, that's on us.

Full pricing rationale: How much does AI engineering cost?

Frequently asked

Questions before you book

Will this work for a non-OpenAI stack?
Yes. The methodology is provider-agnostic. Anthropic, Google, open-weight, hosted, or self-hosted — the audit reads golden sets, CI configs, sampling logic, alert routing. Stack-specific quirks get noted in the report, not the score. The five audit areas are independent of which API you call.
What do you need access to?
Read-only. CI logs, prompt registry, existing eval configs, a sample of production traces. No direct database access. No commit access to source code. Sensitive payloads can be redacted before they reach us; the audit reads structure and signal, not customer data. NDA signed before kickoff.
How is the £8,000 split?
Four thousand on signature. Four thousand on report delivery. No day-rate clock. Scope is fixed in writing — five audit areas, the report, the exec summary, the review call. If the work runs long, that is on us. The second invoice goes out the day the PDF lands in your inbox.
What if we don't have a golden set yet?
Common starting point. Absence of a golden set is itself a finding — usually the top one. We sample your production traces, cluster by intent, and bootstrap a candidate set as part of the audit. You leave with thirty-to-fifty labelled cases ranked by coverage gap. Not a finished harness — a defensible first version.
Can we extend into a remediation engagement after?
Three paths, your call. Most buyers fix the gaps themselves; the report is built for that. If you want us to do the work, audit findings become the scope: four to eight weeks, £15–30k fixed-fee, same engineer leads. Or a monthly retainer from £3,500 a month. No follow-up sales sequence either way.
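For the golden-set answer above, "ranked by coverage gap" can be read as comparing how often each intent shows up in production with how often it shows up in the golden set. The sketch below assumes each trace already carries an intent label (from clustering or a classifier, which is not shown). The intent names and counts are invented for illustration, and this is not a description of the audit's actual tooling.

```python
from collections import Counter


def coverage_gaps(production_intents, golden_intents):
    """Rank intents by (share of production traffic - share of golden set).

    A large positive gap means the intent is common in production but
    under-represented in the golden set.
    """
    prod = Counter(production_intents)
    gold = Counter(golden_intents)
    prod_total = sum(prod.values())
    gold_total = sum(gold.values()) or 1  # tolerate an empty golden set
    gaps = []
    for intent, count in prod.items():
        prod_share = count / prod_total
        gold_share = gold.get(intent, 0) / gold_total
        gaps.append((intent, prod_share - gold_share))
    return sorted(gaps, key=lambda gap: gap[1], reverse=True)


# Invented example: 100 sampled traces against a 40-case golden set.
production = ["refund"] * 50 + ["cancel"] * 30 + ["billing-dispute"] * 20
golden = ["refund"] * 38 + ["cancel"] * 2

for intent, gap in coverage_gaps(production, golden):
    print(f"{intent}: {gap:+.2f}")
```

Intents at the top of the ranking are where new labelled cases add the most coverage. A negative gap marks an intent the golden set already over-samples.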

Book your audit

Book the call. Start within a week.

Twenty minutes to confirm scope and fit. Contract that afternoon, fifty percent up front, audit starts within five working days.

Start here

Know your eval gaps in two weeks.

Fixed fee, fixed scope — a thirty-page report that names what to fix, in what order, and what each fix costs in engineer-days.

