Yvo.Schedule

Evals as contracts: a better way to know if your agent is working

2026.02.08·3 min read·
EvalsQualityProduction
Revised 1 time — last on 2026.03.20· open
  • 2026.03.20Strengthened the healthcare example after a reader pointed out the phrasing was vague about what 'automatic action' meant in practice. Added the explicit 'pages the on-call engineer' clause.

The single biggest mindset shift I push on clients is this: stop thinking of evals as tests and start thinking of them as contracts.

Tests are things you run and hope they pass. Contracts are things you commit to, that define what the system is and isn't allowed to do, and that everyone — the engineers, the business, the compliance team — signs off on before anything ships.

The problem with eval-as-test

When evals are framed as tests, they get treated like unit tests: nice to have, occasionally run, mostly ignored once the feature "works." I've seen teams build elaborate eval harnesses that run once, produce a 94% accuracy number, and then collect dust while the production system drifts unmonitored for nine months.

The root cause is that nobody owns the 6%. Tests leave the failure cases in a gray zone — they exist, they're measured, but there's no process for deciding what to do about them.

Contracts name the failures

An eval-as-contract is different. It says: for this system, these ten categories of input must produce outputs with these specific properties. Here's what happens when they don't.

A concrete example from a healthcare client:

  • Input category: "patient asks about medication interaction"
  • Required output property: must include the phrase "consult your doctor" OR route to human escalation
  • Measurement window: 7 days, rolling
  • Threshold: 0.1% violation rate
  • Automatic action: if the threshold is breached, the system is automatically taken offline and the on-call engineer is paged

That's not a test. That's a service-level contract. It tells you what the system is, what the business tolerates, and what happens when reality diverges.

How to write one

Start with the three failure modes you'd be fired for. Not the ones you're proudest of catching — the ones that would end your career if they happened in production. Write them down. For each, define:

  1. The exact input shape that triggers the risk
  2. The exact output property that must hold
  3. The measurement window and threshold
  4. The automatic action when threshold is breached

That's a contract. It fits on one page. Sign it in ink.

Why this matters for consultants

If you're building AI for clients, the contract is the deliverable. Not the model. Not the prompts. Not the pipeline. The contract is what the client is buying, because the contract is what lets them sleep at night.

I used to ship my clients trained models and documentation. Now I ship them contracts and the code that enforces them. The second delivery is worth about 4× the first and my clients renew.