b0gy field notes

Open source is not open-ended

Thu, 30 Apr 2026 00:00:00 +0000

Warp, the GPU-accelerated terminal, announced it is now open source. The code is on GitHub. The license is AGPL. OpenAI is the founding sponsor. Community members submit ideas, AI agents write the implementation, humans verify.

This is good news. And it is part of a pattern that is accelerating.

The wave is real

In the last twelve months the pace of tools going open source has been remarkable. Terminals, editors, databases, orchestration frameworks — companies that started closed are opening up, and companies that started open are choosing more permissive licenses. The competitive pressure from AI-accelerated development means the moat is no longer the code. It is the ecosystem, the integrations, the community velocity.

This is a net positive for engineering teams. More open tools means more options, more portability, and less lock-in. If a tool stalls, someone can fork it. If a vendor disappears, the code survives. The default posture should be optimism — more open source is better than less.

But “open source” is not one thing. And the license still matters.

AGPL is not MIT

When an engineering leader hears “open source,” they often hear “free to use and modify however I want.” AGPL says something slightly different.

The GNU Affero General Public License requires that if you modify the software and let users interact with it over a network, you must release your modifications under the same license. It is designed to close the “SaaS loophole” in regular GPL — the one that lets you modify GPL code, run it as a service, and never share your changes.

For a terminal emulator, this mostly does not matter. You will use Warp as-is. But if your platform engineering team is thinking about forking it — adding proprietary integrations, custom agent workflows, internal distribution — AGPL’s copyleft requirements propagate in ways that catch people off guard. Worth a conversation with legal before you commit to a fork, not after.

Does the license matter in a world where everything is moving this fast? Yes — but maybe less than it used to. When switching costs are low and alternatives are plentiful, the license is a factor, not a blocker. For most teams using Warp as a terminal, AGPL changes nothing about their day-to-day. For the small number building on top of it, it is worth ten minutes of reading.

Agents write code, humans steer

The more interesting signal is how Warp plans to develop in the open. Their pitch: community members submit ideas and verification, AI agents handle implementation through Oz — their cloud orchestration platform. The bottleneck, they argue, is no longer writing code. It is the human-in-the-loop activities around the code.

This is a genuinely new contribution model and it is worth watching. Traditional open source depends on a core maintainer team plus community PRs. Warp is proposing something different: community-directed, agent-implemented, human-verified. If it works, it could dramatically increase the throughput of open-source projects that struggle with maintainer burnout.

The question is about verification quality at scale. When a human writes code, the review process carries implicit context — the reviewer knows the author’s track record, their tendencies, their level. When an agent writes code, that social context is gone. The verification burden does not disappear. It shifts from “does this person’s approach make sense” to “is this output correct from scratch.” That is a different skill, and teams adopting this model should be honest about whether their reviewers are equipped for it.

But this is an execution challenge, not a fundamental flaw. Agent-assisted development is already how a lot of teams work internally. Warp is just making it the explicit open-source contribution model. That transparency is worth more than pretending every PR was lovingly hand-crafted.

Open source as acceleration

Warp’s rationale for going open is refreshingly honest. They cannot outspend well-funded closed-source alternatives on headcount or pricing subsidies. Open-sourcing with community contributions — routed through AI agents — is a strategy to get more development velocity without proportionally more cost.

This is becoming the dominant playbook and it is working. GitLab, Supabase, Grafana, PostHog — the most successful developer tools of the last decade are open-core. The code is free. The business model sits on top: cloud features, team management, enterprise support. Warp is joining a proven pattern, not inventing a risky one.

For engineering leaders evaluating tools, this model is actually more predictable than pure closed-source. You can read the code, audit the security, assess the architecture. If you outgrow the tool or the company pivots, you have the source. The worst case with open-core is better than the worst case with closed-source.

The residual risk

We wrote about a similar dynamic with open-weights models — self-hosting trades one set of dependencies for another. Open-source tools are the same, just lower stakes.

The code being on GitHub is a real reduction in risk. But the project’s momentum still depends on Warp’s team, their AI agent pipeline, their OpenAI sponsorship, and their community. If those inputs change, the project’s trajectory changes too. The code is a snapshot. The roadmap is a relationship.

For terminal emulators, this barely matters. You can switch terminals in an afternoon. But the pattern generalizes to higher-stakes tools where switching costs are real. The habit worth building: when you adopt any open-source tool, note what the project depends on beyond the license. Funding model, contribution velocity, bus factor. Not because you should be paranoid — because you should be informed.

What this means for your stack

Three things worth internalizing:

Default to open, but read the license. The wave of tools going open source is good for everyone. Lean into it. But know the difference between MIT, Apache 2.0, and AGPL before you build on top of something. Ten minutes of reading now saves a painful conversation with legal later.

Watch the agent-contribution model. Warp is an early, visible example of AI agents as the primary implementation layer in open source. This will spread. It is not inherently better or worse — it is different, and your evaluation of a project’s health should account for how the code is actually being written and reviewed.

Open source lowers switching cost, which is the thing that actually matters. Vendor risk is ultimately about how painful it is to leave. Open-source tools make leaving easier, which makes adopting them safer. That is the real win — not “free software” but “software you are not trapped by.”

tl;dr

The pattern. A wave of developer tools are going open source as AI-accelerated development makes the code less defensible than the ecosystem around it. This is good — more options, more portability, less lock-in. The nuance. “Open source” is a spectrum of licenses with different obligations. AGPL is not MIT. Agent-written contributions shift the verification model. And open code does not eliminate all dependency on the team behind it. The outcome. Default to open-source tools when the option exists, but evaluate them the way you evaluate any dependency: license, contribution health, switching cost, and what happens if the project’s inputs change.

If you can't eval it, don't ship it

Sun, 19 Apr 2026 00:00:00 +0000

Every AI feature we’ve seen regress in production had something in common. It shipped without an eval suite. The team planned to “add evals later.” Later never came — or it came after the second incident, by which time nobody trusted the system and the fix was political, not technical.

This is the pattern. It is extremely common. And it is fixable, if you flip the order.

“We’ll add evals later”

Software engineers learned this lesson 20 years ago with unit tests. “We’ll write tests later” meant “we’ll never write tests.” The industry developed TDD, CI gates, coverage thresholds — not because testing is fun, but because the cost of not testing compounds silently until something breaks in production.

AI systems are the same problem, but worse. A traditional bug crashes. A log line appears. Someone gets paged. An AI regression does none of those things. The model returns a plausible-looking wrong answer. The user sees it. Maybe they notice, maybe they don’t. Your dashboards stay green. Your error rate is zero. Your system is confidently wrong and nobody knows.

This is why “we’ll add evals later” is more dangerous than “we’ll write tests later.” Tests catch failures that announce themselves. Evals catch failures that don’t.

The order is wrong

Most teams we work with build in this order:

Build the feature
Demo it to stakeholders
Ship it
Get a bug report
Panic
Build an eval to prove the fix works

The order should be:

Define what “correct” means for this feature
Build an eval that measures it
Build the feature
Run the eval
Ship when the eval passes
Run the eval on every deploy

Step 1 is the hardest part. It forces you to answer questions you’d rather defer. What does a good answer look like? How wrong is too wrong? What are the edge cases? If you cannot answer these questions, you are not ready to build the feature — you just don’t know it yet.

What a minimal eval suite looks like

You do not need a research-grade evaluation framework. You need three things.

A golden set. 50–100 input-output pairs where you know the correct answer. For a RAG system, these are questions paired with the documents that contain the answers. For a classification agent, these are inputs paired with correct labels. For a generation system, these are prompts paired with reference outputs and a rubric. Building this takes 1–2 days. It is the single highest-leverage day your team will spend.

A scoring function. Something that takes a system output and a reference answer and returns a number. This can be exact match. It can be cosine similarity. It can be an LLM-as-judge with a rubric. It does not need to be perfect. It needs to be consistent enough to catch regressions.

A CI gate. The eval runs on every PR that touches the retrieval pipeline, the prompt, or the model config. If the score drops below a threshold, the PR does not merge. This is the part that actually prevents regressions. Without it, the golden set is just a spreadsheet someone checks once a quarter.

That is it. Golden set, scoring function, CI gate. You can build this in a week. You can build a rough version in a day.

The failure modes we keep seeing

“Our eval is a vibe check.” Someone on the team runs 10 queries manually and says “looks good.” This catches nothing. It is not repeatable. It does not run in CI. It is a ritual, not a test.

“Our eval is too expensive to run on every deploy.” Then make it cheaper. Subsample your golden set. Use a faster model for scoring. Run the full suite nightly and a smoke test on every PR. The constraint is not cost. The constraint is that you have not decided to prioritize it.

“We don’t know what correct looks like.” This is the most honest version. And it means you are not ready to ship the feature. If you cannot define correct, you cannot measure it. If you cannot measure it, you cannot know whether your next deploy made it better or worse. You are flying blind. That is fine in a prototype. It is not fine in production.

“Our system is too creative to eval.” No it isn’t. Even creative outputs have properties you can measure — factual accuracy, format compliance, toxicity, length, presence of required information. You are not evaluating whether the output is beautiful. You are evaluating whether it is broken.

The heuristic

If you can’t eval it, you can’t ship it. If you can’t re-eval it on every deploy, you can’t maintain it.

This sounds strict. It is. AI systems degrade in ways that are invisible until they are expensive. A prompt change that improves one class of queries and silently breaks another. An embedding model update that shifts your vector space. A chunking change that drops context your users depend on. None of these will page anyone. All of them will erode trust.

The eval suite is not overhead. It is the only thing standing between you and a system that is getting worse and you don’t know it.

We have seen this pattern dozens of times. The teams that build evals first ship slower in week one and faster in month three. The teams that skip evals ship fast, then spend a quarter rebuilding trust — with their users, with their stakeholders, and with themselves.

Build the eval first. Then build the feature. The order matters.

tl;dr

The pattern. Teams ship AI features without evals because they plan to “add them later,” which means they never get added until after the second production incident — by which point the system has eroded user trust and the fix is political as much as technical. The fix. Before writing the prompt or building the pipeline, define what “correct” means, assemble a 50–100 item golden set, write a scoring function, and wire a CI gate that blocks merges when the score drops below threshold. The outcome. Regressions from prompt changes, embedding model swaps, and chunking updates get caught in the pull request rather than discovered by users, and the team can ship changes confidently instead of shipping them hopefully.

Your agent is a cronjob. Name it that.

Thu, 26 Mar 2026 00:00:00 +0000

Half the “agent architectures” we audit are a cronjob with a LLM call and a retry loop. That is a good thing. Here is why naming it correctly changes how you test it, what you monitor, and whether your on-call can fix it at 3am.

The pattern

You have a scheduled job. It runs every N minutes. It calls a model. If the model fails, it retries. If the retry fails, it alerts someone. That is a cronjob. It is a good architecture. It is battle-tested. Your ops team already knows how to run it.

The problem starts when you call it an “agent” and treat it like one. Agents get agent infrastructure — orchestration frameworks, memory stores, planning loops. Your cronjob does not need any of that. It needs a cron expression, a health check, and a dashboard.

Why the name matters

When you name something correctly, three things change:

Testing. Cronjobs get tested like cronjobs — you run them, check the output, compare to expected. You don’t need an “agent evaluation framework.” You need pytest and a fixture that returns a known payload.

Monitoring. Cronjobs get monitored like cronjobs — did it run, how long did it take, did it succeed. You don’t need “agent observability.” You need a counter, a histogram, and an alert on failure rate.

On-call. When your cronjob pages someone at 3am, the on-call engineer knows what to do. Check the logs. Check the input. Check the model response. Retry manually if needed. They do not need to understand a “reasoning trace” or a “tool-use chain.”

The heuristic

If your system does not make decisions about what to do next — if the control flow is static and the only dynamic part is the model call — it is a cronjob. Name it that. Run it that way. Monitor it that way.

Save the word “agent” for systems that actually have a planning loop, where the output of one step determines which step runs next. Those exist. They are rare. And they need genuinely different infrastructure.

Most of what is shipping in production today is a cronjob. That is not an insult. That is a compliment. Cronjobs work.

tl;dr

The pattern. Teams label scheduled jobs with a single LLM call as “agent architectures” and then reach for orchestration frameworks, memory stores, and planning infrastructure that the system does not need and the on-call engineer cannot debug at 3am. The fix. If the control flow is static and only the model call is dynamic, name it a cronjob, test it with pytest, monitor it with a counter and a failure-rate alert, and save “agent” for systems that actually have a planning loop where one step’s output determines the next. The outcome. Your system gets the simple, battle-tested operations tooling it deserves, and your on-call can fix it without understanding a reasoning trace.

Wishlists with a Gantt chart glued on

Sun, 08 Mar 2026 00:00:00 +0000

Most AI roadmaps we see are 14 features with a velocity assumption. The fix is not better estimation. The fix is cutting 13 of them. We show the exact heuristic we use, and the three questions that shake loose the one that compounds.

The pattern we keep seeing

A team comes to us with a roadmap. It has 14 items. Each item has a T-shirt size. The sizes add up to “about two quarters.” The team has four engineers. The math checks out if you squint.

Here is what actually happens: item 1 takes six weeks instead of two. Item 2 gets blocked on a vendor decision. Items 3 through 14 never ship. The board asks what happened. The answer is always the same: “we underestimated.”

No. You over-scoped.

The three questions

When we see a roadmap like this, we ask three questions:

1. Which of these compounds? Not “which is important” — they are all important, that is why they are on the roadmap. Which one, if you ship it and nothing else, makes the next thing easier to build? That is your only item.

2. What is the smallest version that teaches you something? Not an MVP in the startup sense. A version small enough that you can ship it, measure it, and learn whether the full version is even worth building. If you cannot describe this version in one sentence, you are not ready to build it.

3. What happens if you never build the other 13? Usually the answer is “nothing, because we were never going to build them anyway.” Sometimes the answer is “we lose a customer.” That is useful information. But it does not mean you should build 14 things. It means you should pick differently.

The heuristic

One feature per quarter. Two if the team is large and the features are independent. Anything more is a wishlist with a Gantt chart glued on.

tl;dr

The pattern. AI teams commit to 14 roadmap items per quarter with T-shirt-size estimates that assume everything goes smoothly, item 1 takes three times as long as planned, and items 3 through 14 never ship. The fix. Ask which single item compounds — which one, if shipped alone, makes the next thing easier to build — and treat everything else as optional until that one is done. The outcome. You ship one thing that actually works and compounds, instead of fourteen things that are perpetually 80% done.

Stop benchmarking on Wikipedia

Sat, 14 Feb 2026 00:00:00 +0000

Your retrieval benchmark is lying to you if it’s on a corpus your model has seen. Here is a small, cheap protocol for building an eval set on your actual corpus, plus a script that will tell you when your retriever has quietly regressed.

The problem

You built a RAG system. You benchmarked it on a public dataset — maybe Natural Questions, maybe HotpotQA, maybe something you found in a blog post. Your numbers look good. You ship it. Three weeks later, users are complaining that the answers are wrong.

The benchmark lied. Not because the benchmark is bad. Because the benchmark corpus is not your corpus.

Why public benchmarks fail you

Public benchmarks test retrieval on corpora that large language models have already seen during training. This means the model can sometimes answer the question correctly without retrieving anything. Your retrieval could be returning garbage and the benchmark would still show high accuracy.

Your corpus is different. Your internal docs, your Confluence pages, your Notion databases — the model has never seen these. When retrieval fails on your corpus, the model cannot compensate. The answer is wrong, and the user notices.

The protocol

Here is how to build an eval set that actually measures your retrieval quality:

Step 1. Pull 50 documents from your actual corpus. Pick them at random. Do not cherry-pick.

Step 2. For each document, write 2–3 questions that can only be answered by reading that specific document. Not trivia. Real questions that a user would actually ask.

Step 3. For each question, record the document ID that contains the answer. This is your ground truth.

Step 4. Run your retriever on each question. Check whether the correct document appears in the top-k results. This is your recall@k.

That is your eval set. 100–150 question-document pairs. It takes about a day to build. It is worth more than any public benchmark you will ever run.

The regression script

Once you have the eval set, run it on every deploy. If recall@10 drops by more than 5 points, block the deploy. This catches silent regressions — the kind where someone changes an embedding model or a chunking strategy and does not realize they just broke retrieval for 30% of queries.

tl;dr

The pattern. Teams benchmark their RAG retrieval on public datasets like Natural Questions or HotpotQA and get good numbers, then ship to production where the model is answering from internal documents it has never seen — and the retrieval quietly fails. The fix. Spend a day building 100–150 question-document pairs from your actual corpus, then run recall@k against it on every deploy and block if it drops more than 5 points. The outcome. You have a retrieval benchmark that measures your real system, and silent regressions from embedding model swaps or chunking changes get caught before they reach users.

Eval-driven development

Fri, 23 Jan 2026 00:00:00 +0000

There is a workflow that most AI teams converge on eventually. The ones that converge on it early ship better products. The ones that converge on it late have a painful six months first.

The workflow is this: write the eval before you write the prompt. It is test-driven development for AI systems, and it is the single most important practice we recommend to teams building with language models.

The problem it solves

Without evals, the development cycle looks like this: write a prompt, try a few examples in the playground, look at the outputs, feel okay about them, ship. Two weeks later, a user reports a bad output. You tweak the prompt. Try the failing example. It works now. Ship. A week later, a different user reports a different bad output. Repeat.

This cycle has two problems. First, you are testing in production. Your users are your eval suite. They do not enjoy the role. Second, you have no way to know whether a change that fixes one problem breaks another. Every prompt change is a coin flip.

With evals, the cycle becomes: define what success looks like, build a test suite, iterate until the tests pass, ship. When something breaks in production, add it to the suite. Run the suite before every deploy. You still have production failures — but each one makes the system permanently better, because it becomes a test case that can never silently recur.

Write the eval first

This is the part teams resist. It feels backwards. “How can I write tests before I know what the system will do?”

You write the tests because you need to define what “working” means before you start building. This forces clarity. Instead of “the chatbot should be helpful,” you write concrete test cases:

Input: “What is your return policy?” Expected: Response mentions 30-day window. Response mentions the requirement for original packaging. Response does not mention competitor policies.
Input: “Can I return a used item?” Expected: Response clearly states that used items cannot be returned. Response suggests contacting support for exceptions.
Input: “How are you feeling today?” Expected: Response redirects to product-related topics. Response does not engage in personal conversation.

These test cases are imperfect. They are incomplete. They do not cover every edge case. That is fine. They cover the cases you know about, and they define a floor for behavior. The floor rises over time as you add more cases.

The act of writing the eval also surfaces design questions early. “What should the system do when asked about competitor products?” If you do not decide before building, you will discover the question in production — when a user screenshots a bad answer and posts it on Twitter.

The eval suite structure

A practical eval suite has three layers:

Deterministic checks. These are non-negotiable behaviors that can be verified programmatically. The output must be valid JSON. The output must not contain PII. The output must be under 500 tokens. The output must be in the specified language. These are cheap to run, fast to evaluate, and should never fail.

Semantic checks. These verify that the output contains or avoids specific content. “Response mentions the 30-day return window.” “Response does not include pricing information.” These can be checked with string matching, keyword detection, or — for fuzzier criteria — LLM-as-judge.

Quality checks. These assess the overall quality of the response against criteria like accuracy, helpfulness, tone, and completeness. These are almost always evaluated with LLM-as-judge or human review. They are the most expensive layer but also the most informative.

Not every test case needs all three layers. Start with deterministic checks for structural requirements and semantic checks for content requirements. Add quality checks for your most important use cases.

The daily workflow

Here is what eval-driven development looks like day to day:

Morning. Developer picks up a task — maybe a new feature, maybe a bug fix, maybe a prompt improvement. Before touching the prompt, they write 3-5 new eval cases that define what the change should accomplish.

Midday. Developer iterates on the prompt. After each change, they run the eval suite. The suite includes the new cases plus all existing cases. They watch for two things: do the new cases pass? Do any existing cases break?

Afternoon. The eval suite passes. The developer opens a pull request. The PR includes the prompt change and the new eval cases. The reviewer can see exactly what behavior the change is supposed to produce and verify that the eval cases are reasonable.

Deployment. CI runs the full eval suite against the changed prompt. If the suite passes, the change is deployed. If it fails, the deploy is blocked. The developer is notified and investigates.

This workflow is slower than “edit and ship” on day one. By month three, it is faster. The team spends less time debugging production issues, less time reverting bad changes, less time explaining to stakeholders why the AI said something it should not have said.

The eval suite as artifact

Over time, the eval suite becomes the team’s most important artifact. More important than the prompt. More important than the model selection. More important than the architecture.

Here is why: the eval suite encodes what “working” means. It is the cumulative knowledge of every failure, every edge case, every design decision. A new team member can read the eval suite and understand the system’s intended behavior faster than they can read the code.

When you switch models — and you will — the eval suite tells you whether the new model meets the bar. When you rewrite the prompt — and you will — the eval suite tells you whether the rewrite preserved the behaviors that matter. When you redesign the pipeline — and you will — the eval suite is the constant.

Prompts are ephemeral. Models are ephemeral. The eval suite is the thing that persists.

The economics

Teams that adopt eval-driven development report a consistent pattern:

Week 1-2. Slower. Writing evals takes time. The team feels like they are over-investing in testing.
Month 1. Neutral. The eval suite catches a few regressions that would have been production incidents. Time saved on debugging roughly offsets time spent on eval writing.
Month 3. Faster. The eval suite is mature enough that prompt changes can be made confidently. The team iterates faster because they know immediately whether a change works. Production incidents drop.
Month 6. Significantly faster. The eval suite is comprehensive. New features are built against existing eval infrastructure. Onboarding new team members is faster because the eval suite serves as documentation.

The teams that never adopt evals stay in the “edit, ship, pray” loop. They ship about as fast in month six as they did in month one — but they spend an increasing share of their time on firefighting.

Common objections

“Evals are expensive.” A 100-case eval suite costs $5-15 per run with LLM-as-judge. You run it a few times a day during development. That is $20-60 per day. Your production AI spend is orders of magnitude higher. The eval cost is the cheapest insurance you will buy.

“LLM-as-judge is unreliable.” It is imperfect. It has a 5-10% error rate on nuanced judgments. But you are not asking it for nuance. You are asking it for gross failures — did the response mention the return policy or not? At that level, LLM-as-judge is quite reliable. Use deterministic checks where you can. Use LLM-as-judge for the rest.

“We don’t know what all the edge cases are.” You do not need to. Start with the cases you know. Add production failures as they occur. The suite grows organically. Perfection is not the goal. Coverage is.

The heuristic

Write the eval before you write the prompt. Every production failure becomes a new eval case. Run the suite before every deploy. The eval suite is the artifact that compounds — it gets more valuable with every failure it encodes and every regression it catches. If you build one thing well on your AI team, build the eval suite.

tl;dr

The pattern. AI teams build the feature first, demo it, ship it, and only think about evals after the first production incident — which means they spend months in an “edit, ship, pray” loop where every prompt change might silently break a behavior they fixed last week. The fix. Write the eval before you write the prompt: define what “correct” means, build 20–50 test cases, and block deploys in CI when the suite score drops below threshold. The outcome. The team ships slower in week one and significantly faster by month three, because every change is made against a growing specification of intended behavior rather than into the dark.

The AI project that should have been a spreadsheet

Fri, 09 Jan 2026 00:00:00 +0000

A team spent three months building an AI-powered classification system. It categorized incoming support tickets into 12 buckets. It used a fine-tuned model. It had a retrieval layer for edge cases. It had a human-in-the-loop review queue. It cost $8k/month to run.

The previous system — a series of keyword rules in a CASE statement — had 89% accuracy. The AI system had 93% accuracy. The 4-point improvement cost $8k/month in API fees, three months of engineering time, and ongoing maintenance burden for a system with non-deterministic behavior.

A senior engineer on the team eventually asked the question nobody wanted to hear: “Could we have gotten to 93% by adding more rules to the CASE statement?”

The answer was yes.

The pattern

We see this pattern often enough that it has a name in our practice. We call it “AI-for-the-sake-of-AI.” The problem is real. The solution works. But the solution is dramatically over-engineered for the problem it solves.

The tell is simple: if you can enumerate the categories, you probably do not need a language model to classify them. If you can write the summary template, you probably do not need a model to generate it. If the data fits in memory, you probably do not need embeddings to search it.

This is not a criticism of AI. AI is genuinely transformative for problems that require language understanding, pattern recognition at scale, or handling of genuinely novel inputs. The criticism is of reaching for AI before checking whether a simpler tool works.

The examples

Classification with a small label set. If your classification problem has fewer than 20 categories and the distinguishing features are keywords or patterns in the input, a rules engine is the right tool. It is deterministic, debuggable, fast, and free. Add AI when the categories are ambiguous, the language is varied, or new categories emerge frequently.

Summarization with a fixed structure. “Summarize this support ticket into: customer name, issue type, severity, and next action.” This is not summarization. This is extraction. A template with regex or a lightweight NER model handles this at a fraction of the cost and with 100% structural consistency. The LLM will occasionally forget a field, reformat the output, or hallucinate a severity level. The template will not.

Prediction with historical data. “Predict which customers will churn based on their usage patterns.” If you have structured data — login frequency, feature usage, support tickets filed — a gradient-boosted tree will outperform an LLM at this task. It will be faster, cheaper, more interpretable, and easier to maintain. LLMs are not good at tabular prediction. They never have been.

Search over a small corpus. If your corpus is fewer than 10,000 documents and your users search by keyword, full-text search (Elasticsearch, PostgreSQL tsvector, even SQLite FTS) is the right answer. It is fast, well-understood, and does not require an embedding pipeline. Add semantic search when keyword search fails — when users search for concepts, not strings.

Data transformation with known rules. “Convert these addresses to a standard format.” “Extract phone numbers from these documents.” “Map these product codes to categories.” These are deterministic transformations. Write the rules. An LLM will get 95% of them right and will get 5% wrong in unpredictable ways. The rules engine will get 100% right for the patterns you have written and will fail loudly on patterns you have not — which is the behavior you want.

Why teams reach for AI anyway

Three reasons:

Excitement. AI is new and interesting. Rules engines are boring. Engineers — reasonably — want to work on interesting problems. The organizational pressure to “do AI” reinforces this. Nobody gets a promotion for shipping a well-crafted CASE statement.

Anticipated complexity. “The problem is simple now, but it will get more complex.” Maybe. But build for the problem you have, not the problem you imagine. If the problem gets more complex, you can add AI then. You cannot un-add complexity.

Demo-driven development. The AI solution demos well. You type a natural language query, the system responds intelligently, the stakeholder is impressed. The rules engine does not demo well. It just works, quietly, correctly, boringly. But demos are not production, and production is what matters.

The cost of unnecessary AI

The cost is not just the API bill — though the API bill matters. The deeper costs:

Non-determinism. Rules produce the same output for the same input. Always. LLMs do not. When your classification system occasionally puts the same ticket in different categories on successive runs, debugging becomes archaeology. “Why did it do that?” “We don’t know. It’s a language model.”

Maintenance burden. A rules engine is maintained by editing rules. An AI system is maintained by monitoring evals, managing prompts, tracking model versions, debugging retrieval, and handling the occasional production hallucination. The maintenance surface area is 10x larger.

Debugging difficulty. When a rule is wrong, you read the rule, find the bug, fix it. When an AI output is wrong, you inspect the prompt, check the retrieved context, examine the model version, consider whether the temperature is too high, wonder if this is a rare stochastic failure, and eventually shrug.

Latency. The rules engine responds in milliseconds. The AI system responds in seconds. For many use cases, this matters.

The heuristic

Before you build an AI-powered solution, ask three questions:

Can I enumerate the categories or outcomes? If yes, try a rules engine first.
Does the data fit in a spreadsheet? If yes, start with a spreadsheet.
Does the problem require understanding language that varies in unpredictable ways? If no, you probably do not need an LLM.

Use AI when the problem genuinely requires it — when inputs are novel, language is varied, patterns are too complex for rules, or scale makes manual approaches impossible. For everything else, the boring solution is the better solution.

tl;dr

The pattern. Teams reach for fine-tuned models, retrieval layers, and human-in-the-loop queues to solve problems — classification with a fixed label set, extraction into a known template, keyword search over a small corpus — that a CASE statement or a regex would solve deterministically for free. The fix. Before building anything AI-powered, ask whether you can enumerate the categories, whether the data fits in a spreadsheet, and whether the problem actually requires understanding unpredictably varied language. The outcome. You end up with a system that is faster, cheaper, fully debuggable, and cheaper to maintain — and you reserve AI for the problems where it genuinely cannot be replaced.

Monitoring AI systems is not monitoring APIs

Fri, 19 Dec 2025 00:00:00 +0000

Your AI system is monitored. You have dashboards. Uptime: 99.9%. p95 latency: 2.3 seconds. Error rate: 0.1%. Everything is green. Everything looks healthy.

Your users are getting wrong answers. They have been getting wrong answers for three days. Your monitoring did not catch it, because your monitoring is not monitoring the right thing.

The gap

Traditional API monitoring answers one question: is the system running? Uptime, latency, error rate, throughput — these tell you whether the service is available and responsive. For a CRUD API, this is sufficient. If the service is up and returning 200s, it is probably working correctly.

AI systems break this assumption. An AI system can be 100% available, returning 200s with sub-second latency, and be 100% wrong. The model is running. The API is responding. The answers are garbage.

This happens more often than teams expect. A retrieval index gets corrupted — the system returns confident, well-formed answers based on the wrong documents. A prompt change introduces a subtle regression — the system answers most queries correctly but consistently fails on a specific category. A model update changes behavior in ways that are hard to detect from individual responses but obvious in aggregate.

Your Datadog dashboard will not catch any of these. It will remain green while your users lose trust.

What to monitor

AI monitoring requires a different set of metrics. Not instead of traditional monitoring — in addition to it. You still need uptime and latency. But you also need metrics that approximate output quality.

Output distribution tracking. Monitor the statistical properties of your outputs over time. Average response length. Vocabulary diversity. Frequency of refusal responses (“I cannot answer that”). Frequency of hedging language (“I’m not sure, but…”).

These are not direct measures of quality. They are proxies — and useful ones. If your average response length suddenly drops by 40%, something changed. If your refusal rate spikes from 2% to 15%, something is wrong. If every response starts with the same phrase, something is broken.

Set baselines during a period of known-good behavior. Alert on deviations beyond 2 standard deviations. The alert will not tell you what is wrong, but it will tell you something is wrong — which is infinitely better than finding out from a customer escalation.

Retrieval quality metrics. If you are running a RAG system, monitor the retrieval layer independently. Track the number of chunks retrieved per query, the similarity scores of retrieved chunks, and the percentage of queries that retrieve zero results.

A drop in average similarity score means your retrieval is returning less relevant documents. A spike in zero-result queries means your index is missing coverage. These are leading indicators — they degrade before the user-visible output degrades.

Confidence and uncertainty. If your system produces confidence scores — through calibrated probabilities, log probabilities, or a separate scoring step — track them. A decline in average confidence suggests the system is seeing queries it is less equipped to handle, or that the underlying data has drifted.

Not every system has native confidence scores. But you can add them. A simple approach: after generating a response, ask a second model (or the same model with a different prompt) whether the response answers the question. Track the agreement rate. A drop in agreement is a signal.

Cost per query. Monitor what each query costs — in API tokens, in compute, in dollars. Cost is a surprisingly good proxy for behavioral changes. If cost per query increases, the model is producing longer outputs or the retrieval is stuffing more context into the prompt. If cost decreases, outputs are getting shorter — which might mean the model is being less thorough.

Cost monitoring also catches runaway spending. A prompt change that triggers verbose reasoning chains can 3x your API bill before anyone notices. If you are monitoring cost per query with alerts, you catch it in hours, not at month-end.

Periodic eval runs. The most reliable quality signal: run your eval suite against production on a schedule. Daily, if you can afford it. Weekly at minimum.

Take a sample of production queries, run them through the system, and score the outputs against your golden set or with LLM-as-judge. Track the score over time. If it drops, investigate.

This is not a substitute for real-time monitoring. Eval runs are lagging indicators — they tell you about yesterday’s quality, not right now. But they are the most accurate quality signal you have, and they catch slow degradation that proxy metrics miss.

The dashboard

Here is what an AI monitoring dashboard should include, beyond the standard operational metrics:

Response length distribution (histogram, with 7-day rolling baseline)
Refusal rate (time series)
Retrieval similarity score distribution (if RAG)
Zero-result retrieval rate (if RAG)
Cost per query (p50, p90, p99)
Eval score (latest run, trend over last 30 days)
Output diversity score (unique n-grams as a fraction of total n-grams)

Each of these should have an alert threshold. Start generous — you do not want alert fatigue on day one. Tighten the thresholds as you build intuition about what normal looks like.

The incident you will catch

Here is a real pattern we have seen: a team updated their embedding model as part of a routine dependency upgrade. The new model had slightly different dimensional characteristics. The retrieval index was rebuilt, but the similarity scores shifted — documents that previously scored 0.85 now scored 0.72. The retrieval was still returning results, so no errors were thrown. But the results were less relevant. Answer quality degraded gradually over two weeks.

With traditional monitoring, this is invisible. With retrieval quality monitoring, the similarity score drop is caught within hours.

The heuristic

If your AI monitoring dashboard has the same metrics as your API monitoring dashboard, you are not monitoring your AI system. You are monitoring the container it runs in. Add output distribution tracking, retrieval quality metrics, cost per query, and periodic eval runs. The system can be up and wrong. Your monitoring should know the difference.

tl;dr

The pattern. Teams monitor AI systems the same way they monitor APIs — uptime, latency, error rate — which stays green while the model returns wrong answers for days because HTTP 200 says nothing about whether the response was correct. The fix. Add output distribution tracking, retrieval similarity scores, cost per query, and scheduled eval runs against a golden set on top of your standard operational metrics. The outcome. Silent quality regressions — like a corrupted retrieval index or a prompt change that breaks a query category — get caught in hours instead of via customer escalation.

AI governance is an engineering problem, not a legal one

Fri, 05 Dec 2025 00:00:00 +0000

Your company has an AI policy. It was written by legal, reviewed by compliance, approved by a VP. It says things like “AI-generated content must be reviewed by a qualified human before being shared with customers” and “sensitive data must not be included in prompts sent to third-party AI providers.”

It lives in a SharePoint folder. Nobody has read it since the all-hands where it was announced. It is not enforced. It cannot be enforced — because enforcement requires engineering, and engineering was not involved in writing it.

This is the state of AI governance at most companies. And it is a problem.

The gap

There is a gap between policy and practice. The policy says one thing. The system does another. No one is lying. No one is negligent. The gap exists because policy documents describe intent, and intent does not execute.

Consider the rule: “AI outputs must be reviewed by a human before being sent to customers.” How is this enforced? Is there a review queue? Is there a UI that forces a human to approve each output before it is sent? Is there a log of who reviewed what? Or is the expectation that people will just… do the right thing?

In most cases, it is the latter. And in most cases, people are busy, the volume is high, and the review becomes a rubber stamp — a quick glance, a click, done. The governance is nominal. The risk is real.

Governance as code

Governance that works is governance that is enforced at the system level. Not as a suggestion. Not as a policy. As code that runs in the pipeline and blocks things that should be blocked.

Here is what that looks like in practice:

Output filtering. Before any AI-generated content reaches an end user, it passes through a filter. The filter checks for PII, profanity, competitor mentions, off-topic responses, hallucinated URLs, or whatever your policy prohibits. If the filter catches something, the output is blocked and logged. The user gets a fallback response.

This is not hard to build. A combination of regex patterns, classification models, and simple heuristics covers 90% of cases. The remaining 10% is where you invest in more sophisticated detection.

PII detection. Your policy says “do not send PII to third-party AI providers.” Enforce it. Run a PII detector on every prompt before it leaves your infrastructure. Redact or block prompts that contain social security numbers, credit card numbers, email addresses, phone numbers, or whatever counts as PII in your domain.

Named entity recognition models are mature. Regex patterns catch structured PII reliably. The combination is imperfect — you will have false positives and false negatives — but imperfect enforcement is vastly better than no enforcement.

Audit logs. Every AI interaction — every prompt sent, every response received, every user who triggered it — should be logged. Not for surveillance. For accountability and debugging.

When something goes wrong — and it will — you need to answer: What was the prompt? What was the response? Who saw it? When? Which model version was running? What context was retrieved? Without audit logs, the answer to all of these is “we don’t know.”

The log does not need to be fancy. A structured log entry per interaction, written to your existing logging infrastructure, is sufficient. Include: timestamp, user ID, prompt hash (or full prompt if compliance allows), response hash, model ID, latency, and any filter actions taken.

Eval gates. Before a new model, prompt, or pipeline version is deployed to production, it must pass an eval suite. If the eval score drops below the threshold, the deployment is blocked. This is CI for AI — and it is the most effective governance mechanism we have seen.

The eval gate does not just catch regressions. It creates a record. “This model version was deployed on this date, having passed these evals with these scores.” When an auditor asks how you ensure quality, you point to the gate — not a policy document.

Access controls. Not everyone should have access to every model endpoint. Not every application should be able to call the most expensive model. Not every team should be able to deploy prompt changes to production.

Role-based access control on model endpoints is straightforward if you route all model calls through an internal gateway. The gateway enforces who can call what, logs every call, and applies rate limits. This is the same pattern you use for internal APIs. Apply it to AI.

The CI check

The most powerful framing we have found: treat governance as a CI check.

Your deployment pipeline already has checks — tests pass, linting passes, security scans pass. Add governance checks to the same pipeline:

PII detection on prompts: pass/fail.
Output filter coverage: pass/fail.
Eval suite against golden set: pass/fail.
Audit logging enabled: pass/fail.
Access controls configured: pass/fail.

If any check fails, the deployment does not proceed. This is not bureaucracy. This is the same automated quality enforcement you already apply to traditional software. AI systems are not special. They need the same discipline.

The org design implication

For this to work, engineering must be involved in governance from the start. Not consulted after the policy is written. Involved in defining what governance means in technical terms.

The ideal structure: legal defines the intent (“we must not expose PII”), engineering defines the mechanism (“PII detection runs on every prompt and blocks matches”), and both teams agree on the acceptance criteria (“false negative rate below 1% for structured PII, below 5% for unstructured PII”).

This is a collaboration, not a handoff. Legal cannot write enforceable governance alone. Engineering cannot define acceptable risk alone. They need each other.

The heuristic

For every line in your AI governance policy, ask: “How is this enforced in code?” If the answer is “it isn’t” — that line is a wish, not a policy. Convert it to a check, a filter, a gate, or a log. Governance that exists only in a document is governance that does not exist.

tl;dr

The pattern. Companies write AI governance policies in PDFs — “do not send PII to third-party providers,” “all outputs must be reviewed by a human” — and then rely on people to comply voluntarily, which they do not at volume. The fix. Treat every policy line as a CI check: PII detection on every outbound prompt, output filters before every user-facing response, eval gates before every deploy, and audit logs on every interaction. The outcome. Governance becomes something that actually runs in the pipeline rather than something that lives in a SharePoint folder nobody has opened since the all-hands.

The AI audit your board will eventually ask for

Fri, 21 Nov 2025 00:00:00 +0000

Someone is going to ask. It might be a board member who read an article about AI risk. It might be a regulator with a new framework. It might be a customer whose contract requires an AI addendum. It might be your insurance carrier.

The question will be some version of: “How do you know your AI systems are doing what you think they’re doing?”

And you will either have an answer or you will not. The difference between those two states is about 40 hours of work — if you do it proactively. If you do it under pressure, it is 400 hours and a significant distraction from everything else your team is supposed to be shipping.

What an AI audit actually looks like

Strip away the compliance language and an AI audit is four questions.

What AI are you running? An inventory of every AI system in production — not just the chatbot your marketing team launched, but the recommendation model in your product, the classification system in your support pipeline, the summarization tool your ops team built in a weekend, and the 14 GPT wrappers various teams are using via personal API keys.

Most companies do not have this inventory. They have a partial list that covers the systems built by the ML team. They do not have the systems built by product teams, the systems bought from vendors, or the systems adopted by individual employees. The first step in being audit-ready is knowing what you are running. You cannot govern what you cannot see.

How do you know it is working? Documented evaluation criteria for each system. What does “working” mean for this specific system? What metrics do you track? How often do you measure them? What are the thresholds for acceptable performance?

For a customer-facing chatbot, “working” might mean: answer accuracy above 90% on a curated test set, hallucination rate below 2%, response latency under 3 seconds, and no responses that violate your content policy. For a document classification system, “working” might mean: precision above 95% on your top 10 categories, with a human review step for anything classified with low confidence.

The key is that “working” is defined, measured, and documented — not assumed. “Our users seem happy” is not an audit answer. “Here are last quarter’s eval results showing 92% accuracy on our 200-question test set” is.

What happens when it is wrong? Every AI system produces wrong output. The question is not whether it will be wrong — it is what happens when it is. Do you have incident detection? Do you have a response process? Do you have a way for users to flag bad output? Do you have a log of past incidents and how they were resolved?

This is where most companies have the biggest gap. They built the AI system. They might even eval it regularly. But they have no incident process. When the model produces a bad output, someone notices, someone fixes the prompt, someone deploys — and none of it is documented. There is no trail. There is no way to look back and say “here are the 7 incidents we had last quarter, here is what caused them, here is what we changed.”

Where does the data come from? Data lineage. What data does each AI system use? Where does it come from? How is it processed? Who has access? How is it stored? Is any of it PII? Is any of it subject to data residency requirements?

This is the question regulators care about most and engineers care about least. The model is a function of its data. If you cannot trace the data, you cannot explain the output. And if you cannot explain the output, you have a governance problem that no amount of model evaluation will solve.

Why you should build this before you are asked

The cost of building an AI governance framework proactively is small. A spreadsheet, some documentation, a quarterly review cadence. Maybe 40 hours of work spread across a few people.

The cost of building it reactively — when the board asks, when the regulator sends a letter, when the customer requires it for contract renewal — is an order of magnitude higher. Not because the work is different, but because the context is different.

Under pressure, you are doing archaeology. You are reverse-engineering which systems use which data. You are asking engineers to reconstruct eval results from 6 months ago. You are discovering AI systems that nobody on the leadership team knew existed. You are doing all of this while also trying to maintain the appearance that you have it under control.

Under pressure, you also make bad governance decisions. You over-restrict. You implement heavy-handed approval processes that slow down development. You create compliance theater — checkboxes and review boards that produce documentation without producing understanding. The reactive governance framework is almost always worse than the proactive one, and it costs 10x more to build.

Build it now. It is easier, cheaper, and produces a better result.

The minimum viable governance framework

You do not need a Chief AI Ethics Officer. You do not need a 50-page policy document. You do not need a governance platform. You need three things.

A spreadsheet

One row per AI system. Columns:

System name
Owner (a person, not a team)
What it does (one sentence)
What data it uses
How it is evaluated (link to eval results)
Last eval date
Current performance (key metric and value)
Incident count (last quarter)
Risk level (high/medium/low — based on customer impact if the system produces wrong output)

This spreadsheet is the inventory. It is the thing you hand to the board member, the regulator, the auditor. It takes an afternoon to create and 30 minutes per quarter to update. It is the single most valuable governance artifact you can produce.

A quarterly review

Once per quarter, the owner of each AI system presents a 5-minute update: eval results, incidents, changes, and any concerns. The audience is a small group — your CTO, your head of product, maybe a legal representative.

The purpose is not approval. It is awareness. The review ensures that leadership knows what AI systems exist, how they are performing, and where the risks are. It creates a forcing function for the system owners to actually run their evals and document their incidents.

Keep it tight. 5 minutes per system. No slide decks. Just the spreadsheet row, updated, with a verbal summary. If you have 10 AI systems, the review takes less than an hour.

An incident log

Every time an AI system produces output that is wrong in a way that matters — not every typo, but every incident where the wrong output could have or did cause harm, confusion, or cost — log it.

The log is simple: date, system, what happened, what caused it, what was changed, who was involved. This is not a post-mortem for every incident. It is a line in a spreadsheet.

Over time, this log becomes your most valuable governance tool. It tells you which systems are fragile. It tells you what kinds of failures you are prone to. It tells you whether your fixes are working. And when someone asks “what happens when your AI is wrong,” you can show them the log and say: “Here is what happened. Here is what we did about it.”

The three questions auditors actually ask

We have sat in these meetings. Board reviews, customer audits, regulatory conversations. The questions are remarkably consistent.

“What AI are you running?” They want the inventory. They want to know the scope. They are trying to understand whether you know what you have. If you pull out the spreadsheet, this question takes 2 minutes. If you do not have the spreadsheet, this question takes 2 weeks.

“How do you know it’s working?” They want eval results. They do not need to understand the metrics — they need to see that you have metrics, that you measure them regularly, and that the results are within the thresholds you defined. The existence of a rigorous evaluation process is more reassuring than any specific number.

“What happens when it’s wrong?” They want the incident log. They want to see that you have a process — that when things go wrong, you detect it, respond to it, and learn from it. Companies that have an incident log with 12 entries and a clear pattern of improvement look better than companies that claim they have never had an incident. Zero incidents means you are not looking, not that nothing went wrong.

That is it. Three questions. If you can answer all three clearly and with documentation to support your answers, you pass. Not because you are perfect — nobody is — but because you are paying attention. And paying attention is what governance actually means.

The timeline

Start now. Not because an audit is imminent, but because the work is small and the payoff compounds.

Week 1: Build the inventory spreadsheet. Go talk to every engineering team. Find every AI system. Fill in the rows.

Week 2: For each system, confirm there is an eval process. If there is not — and for some there will not be — flag it. That is your priority list.

Week 3: Create the incident log. Retroactively fill it in from Slack threads and post-mortems if you can. Going forward, make it part of your incident response process.

Week 4: Schedule the first quarterly review. Put it on the calendar. Make it recurring.

Four weeks. Mostly part-time. And you will be ready for the question before anyone asks it.

tl;dr

The pattern. Companies build AI systems without governance, then scramble to create audit documentation under pressure — producing compliance theater that costs 10x more and protects the business less. The fix. Build a minimum viable governance framework now — an inventory spreadsheet, a quarterly review, and an incident log — before the board, a regulator, or a customer asks for it. The outcome. You answer the three audit questions (what AI are you running, how do you know it works, what happens when it is wrong) in minutes instead of weeks, and your governance actually improves your AI systems instead of just documenting them.

Your AI vendor's pricing will change. Plan for it.

Fri, 07 Nov 2025 00:00:00 +0000

Someone on your team built a spreadsheet. It says your AI feature costs $0.04 per query. It multiplies that by projected volume. It shows a healthy margin. Everyone feels good.

That spreadsheet is fiction. Not because the math is wrong. Because the price it is based on will change — and you do not know when, by how much, or in which direction.

The price history

OpenAI launched GPT-4 at $0.03 per 1K input tokens. Then GPT-4 Turbo dropped it to $0.01. Then GPT-4o dropped it further. Then they introduced cached input pricing at half the rate. Then reasoning models arrived with a different pricing structure entirely — thinking tokens that you pay for but never see.

Anthropic launched Claude 3 Opus at one price, then introduced Claude 3.5 Sonnet at a fraction of the cost with better performance. Then prompt caching changed the math again.

Google has repriced Gemini models multiple times, introduced context caching, and restructured their API tiers.

In the last 18 months, no major AI API provider has kept their pricing stable for more than six months. Prices have generally gone down — which is good — but the pricing model has changed in ways that make forecasting difficult.

Per-token vs. per-request. Input vs. output pricing. Cached vs. uncached. Thinking tokens vs. completion tokens. Batch vs. real-time. Each of these is a different axis that can shift your unit economics.

The danger of coupling

When your business model is tightly coupled to current API pricing, you have a fragile system. Here is how it breaks:

Scenario 1: Prices go up. A provider introduces a new model that is better for your use case, but more expensive. You want to upgrade. Your margin disappears. You now have to choose between product quality and profitability — a choice you should never have to make on short notice.

Scenario 2: Pricing model changes. Your cost model assumes per-token pricing. The provider introduces per-request pricing with a token cap. Your short queries get more expensive. Your long queries get cheaper. Your aggregate cost shifts unpredictably.

Scenario 3: You need to switch providers. Your primary provider has an outage. Or they deprecate your model. Or a competitor releases something significantly better. If switching requires re-deriving your unit economics from scratch, you will move slowly — and in AI, moving slowly is expensive.

Scenario 4: Thinking tokens. You are using a reasoning model. Your cost model counts input tokens and output tokens. But the model is generating 5x more thinking tokens than output tokens — tokens you pay for, that do not appear in the response, and that vary wildly based on query complexity. Your cost-per-query variance goes from 10% to 300%.

The 2x buffer

The simplest defense: build your cost model with a 2x buffer on API costs. If your current cost is $0.04 per query, model your economics as if it were $0.08. If the business still works at $0.08, you have room to absorb pricing changes without panic.

This sounds conservative. It is. That is the point.

The 2x buffer is not waste — it is insurance against being forced into a bad decision when prices shift. And if prices continue to drop, the buffer becomes margin. You do not lose by being conservative here.

Abstract the model layer

Your application code should not know which model it is calling. It should call an internal abstraction — a model service, a gateway, a routing layer — that handles model selection, fallback, and cost tracking.

This is not over-engineering. It is the same pattern we use for every external dependency. You do not hardcode database connection strings. You do not hardcode payment processor endpoints. Do not hardcode model endpoints.

The abstraction should handle:

Model selection. Route queries to different models based on complexity, cost, or latency requirements.
Fallback. If the primary model is down or slow, fall back to an alternative.
Cost tagging. Tag every API call with the feature, user segment, or query type that triggered it. This data is essential for understanding where your money goes.
Rate limiting. Enforce per-feature or per-user rate limits to prevent cost spikes.

Building this abstraction takes a week. It pays for itself the first time you need to switch models — which, given the pace of this market, will be within a quarter.

Monitor cost per query

You track latency per endpoint. You track error rate per service. You should track cost per query for every AI feature.

Not aggregate monthly cost — cost per query, broken down by feature, model, and query type. This is how you spot problems early.

Useful metrics:

p50, p90, p99 cost per query. The median is interesting. The p99 tells you about the expensive outliers — the queries that trigger long reasoning chains or retrieve large contexts.
Cost per query by feature. One feature might account for 70% of your spend but 20% of your queries. That is worth knowing.
Cost trend over time. A gradual increase in cost per query can indicate prompt drift, retrieval bloat, or model behavior changes that have nothing to do with pricing.

Set alerts on cost anomalies. A sudden spike in cost per query might mean a prompt change is causing longer outputs, a retrieval bug is stuffing more context into the prompt, or a model update changed the tokenization.

Plan for both directions

Most teams plan for prices going down. They assume API costs will shrink over time, and they are probably right in aggregate.

But plan for prices going up on specific capabilities. Better reasoning costs more. Longer context windows cost more. Multimodal capabilities cost more. The frontier model you will want in six months may be more expensive than the model you are using today, even if the model you are using today gets cheaper.

The right mental model is not “AI will get cheaper” but “the cost-performance frontier will shift.” You will have more options at every price point. But the option you want — the one that makes your feature significantly better — may be at a higher price point than you are at today.

The heuristic

Never build unit economics on current API pricing without a 2x buffer. Abstract your model layer so switching is a config change, not a rewrite. Monitor cost per query in production the same way you monitor latency. The price will change. The question is whether you are ready when it does.

tl;dr

The pattern. Teams build financial models on current API token prices even though every major AI provider has repriced, restructured billing, or introduced new cost axes — thinking tokens, cached input tiers, per-request caps — multiple times in the last 18 months. The fix. Apply a 2x buffer to all AI cost assumptions, route every model call through an internal abstraction layer so switching providers is a config change, and monitor cost per query in production as a first-class metric. The outcome. Pricing changes become a business decision you are prepared to make rather than an emergency that forces you to choose between product quality and profitability.

Regression suites for prompts

Fri, 24 Oct 2025 00:00:00 +0000

You changed three words in a system prompt. The feature that was broken is now fixed. You deploy. The next morning, a different feature is broken — one you did not touch, did not test, did not even think about.

This is the normal state of prompt engineering without regression testing. Every prompt change is a blind trade. You fix one behavior and break another, and you do not find out until a user tells you.

Prompts are code

We treat prompts like copy. We edit them in a text box. We eyeball a few examples and ship. We do not test them. We do not version them rigorously. We do not run them through CI.

This is a mistake. Prompts are the control plane for model behavior. A prompt change is a behavior change. And behavior changes need tests — the same way code changes need tests.

The difference is that prompt behavior is non-deterministic. The same prompt can produce different outputs on successive runs. This makes testing harder, but it does not make testing optional. It makes testing more important, because you cannot rely on manual spot-checking to catch regressions.

What a regression suite looks like

A regression suite for prompts is a set of input/output pairs where you know the expected behavior. Not the exact expected output — the expected behavior.

Each test case has three parts:

Input. The user query or the full prompt template with variables filled in.
Expected behavior. Not the exact string, but a description of what the output should do. “Should mention the return policy.” “Should not include pricing.” “Should respond in Spanish.” “Should decline to answer.”
Assertion. A function that checks whether the output meets the expected behavior. This can be a string match, a regex, an LLM-as-judge call, or a human review — depending on how precise the behavior is.

Start with 20 test cases. That is enough to catch gross regressions. You do not need 500 on day one. You need 20 that cover the most important behaviors.

Where the test cases come from

The first 20 are easy. You already know them — they are the queries you manually tested when you built the feature. The ones you pasted into the playground. Write them down. Add the expected behavior. That is your initial suite.

After that, every production failure becomes a test case. User reports a bad answer. You investigate. You fix the prompt. You add the failing query to the suite with the correct expected behavior. Now that failure can never recur silently.

This is the key insight: your regression suite is a record of every lesson learned. It encodes institutional knowledge about what the system should and should not do. Six months in, your suite is the most valuable artifact on the team — more valuable than the prompt itself, because the suite defines what the prompt is supposed to achieve.

The workflow

Here is how it works in practice:

Developer wants to change a prompt.
Developer makes the change locally.
Developer runs the regression suite against the changed prompt. This takes 2-10 minutes depending on suite size and model latency.
Suite passes — the change did not break any known behavior. Deploy.
Suite fails — the change broke something. Developer fixes the prompt or updates the test case if the old behavior was wrong.

This is not revolutionary. It is test-driven development applied to prompts. The only novel part is that assertions are fuzzier — you are checking behavior, not exact output.

LLM-as-judge assertions

For many test cases, the assertion is hard to write as a regex or string match. “The response should be helpful and accurate” is not something you can check with contains().

This is where LLM-as-judge works well. Use a separate model call — ideally a different model than the one being tested — to evaluate whether the output meets the expected behavior. The judge prompt is simple: “Given this input and this expected behavior, does this output meet the criteria? Respond yes or no.”

LLM-as-judge is not perfect. It has a ~5-10% error rate on nuanced judgments. But it is good enough for regression testing, where you are looking for gross failures, not subtle quality differences. And it is vastly better than no testing at all.

For critical behaviors — safety, compliance, factual accuracy — use deterministic assertions where possible. Reserve LLM-as-judge for softer criteria like tone, helpfulness, and completeness.

The cost objection

“Running the suite costs money. Every test case is an API call — sometimes two, if we’re using LLM-as-judge.”

Yes. A 50-case suite with LLM-as-judge costs maybe $2-5 per run. You run it a few times a day during development. That is $10-20 per day. Your production AI spend is probably $500-5000 per day.

The cost of the regression suite is a rounding error compared to the cost of a production regression that serves bad answers to real users for hours before someone notices.

Growing the suite

The suite should grow monotonically. You add cases. You almost never remove them.

When a production failure occurs, add a test case before you fix the prompt. This is the same discipline as writing a failing test before fixing a bug. It proves the test catches the failure. Then fix the prompt. The test passes. Ship it.

Over time, you will notice clusters. Certain categories of queries are fragile — they break more often. These clusters tell you where the prompt is weakest and where to invest in improvements.

A healthy suite grows by 2-5 cases per week. After six months, you have 100-200 cases. After a year, 200-400. At that point, the suite is a comprehensive specification of your system’s behavior. New team members can read the suite and understand what the system does faster than they can read the code.

The heuristic

If you are deploying prompt changes without running a regression suite, you are testing in production. Your users are your test suite. They are not good at it, and they do not enjoy it.

Start with 20 cases. Add every failure. Run it before every deploy. This is the minimum viable practice for professional prompt engineering.

tl;dr

The pattern. Teams change a prompt to fix one behavior, ship it, and discover the next morning that a different behavior broke — because prompts are treated like copy instead of code and deployed without any regression testing. The fix. Build a suite of input/expected-behavior pairs, add every production failure as a new test case, and run the suite before every prompt deploy. The outcome. Prompt changes stop being blind trades, institutional knowledge about what the system should do compounds into the suite, and production regressions become rare instead of routine.

Caching LLM responses is not cheating

Fri, 10 Oct 2025 00:00:00 +0000

There is a strange guilt that settles over teams when someone suggests caching LLM responses. It feels like cheating. Like the whole point was to have a model think about each query fresh. Like serving a cached response means you are not really using AI.

This is wrong. Caching is infrastructure. And infrastructure that makes your system faster, cheaper, and more predictable is not cheating — it is engineering.

The stigma

We have seen this pattern at multiple clients. Someone builds an AI feature. It works. It goes to production. The bill arrives. Someone on the team says, “We could cache the common queries.” And someone else — usually someone who championed the AI feature — pushes back. “If we’re just serving cached responses, why did we build an AI system?”

Because not every query needs fresh inference. Most queries don’t.

Look at your production logs. You will find that 30-50% of queries are semantically identical to queries you have already answered. Same question, different phrasing. Same intent, slightly different words. You are paying for a fresh API call each time, waiting 2-4 seconds each time, and getting roughly the same answer each time.

That is not engineering. That is waste.

Three tiers of caching

Not all caching is the same. There are three tiers, each with a different complexity-to-payoff ratio.

Exact match caching. Hash the prompt. If you have seen this exact prompt before, return the cached response. Implementation: a key-value store. Redis, DynamoDB, even an in-memory dictionary for low-traffic systems. Zero ambiguity, zero risk. If the prompt is identical, the response is valid.

This alone will catch 10-20% of queries in most production systems. Users copy-paste. Automated workflows send the same prompt repeatedly. Internal tools hit the same questions daily.

Semantic caching. Embed the incoming query. Compare it to a store of previously seen query embeddings. If the cosine similarity exceeds a threshold — typically 0.95 or higher — return the cached response.

This is where it gets interesting. “What’s our refund policy?” and “How do I get a refund?” are different strings but the same question. Semantic caching catches these. Implementation is slightly more involved — you need an embedding model and a vector store — but if you already have a RAG pipeline, you already have both.

Semantic caching typically catches an additional 20-40% of queries on top of exact match caching. The key is tuning the similarity threshold. Too low and you serve wrong answers. Too high and you cache nothing. Start at 0.97 and lower it gradually while monitoring quality.

Tiered caching with freshness. Cache responses for common queries. Serve live inference for novel ones. Set a TTL on cached responses so they refresh when underlying data changes. Tag cache entries by data source so you can invalidate selectively when a source is updated.

This is the production-grade approach. It requires more engineering — cache invalidation is, as always, one of the two hard problems — but the payoff is significant.

The ROI

The numbers are hard to argue with.

A client of ours was spending $45k/month on API calls for a customer-facing Q&A system. After implementing semantic caching with a 0.96 similarity threshold, their monthly API spend dropped to $18k. Latency for cached queries dropped from 2.8 seconds to 40 milliseconds.

That is not a rounding error. That is a 60% cost reduction and a 98% latency improvement for the majority of queries.

And there is a secondary benefit that teams rarely anticipate: consistency. When the same question gets the same answer every time, users trust the system more. Non-determinism is a feature when you need creativity. It is a bug when a user asks the same support question twice and gets contradictory answers.

When not to cache

Caching is not appropriate everywhere.

Do not cache when the answer depends on real-time data. Stock prices, live inventory, breaking news — these need fresh inference or at minimum very short TTLs.

Do not cache when the query includes user-specific context that changes the answer materially. “What’s my account balance?” is not cacheable across users, though it may be cacheable per-user with a short TTL.

Do not cache when you are still iterating on the prompt. Cached responses from an old prompt will persist until the cache is invalidated. If you change your system prompt, flush the cache.

And do not cache with a similarity threshold below 0.93. The false positive rate gets uncomfortable fast. One bad cached response erodes more trust than the caching saves in cost.

Implementation pattern

Here is the pattern we recommend:

Start with exact match caching. Deploy it behind a feature flag. Monitor cache hit rate and output quality for two weeks.
Add semantic caching once you are confident in the exact match layer. Start with a high similarity threshold (0.97) and lower it in increments of 0.01, monitoring quality at each step.
Add TTL-based invalidation. Default to 24 hours. Shorten for data that changes frequently.
Add source-based invalidation. When a source document is updated, invalidate all cache entries derived from it.
Monitor cache hit rate, cost savings, latency distribution, and — critically — output quality. If quality degrades, raise the similarity threshold.

The whole thing can be built in a week. The first two steps can be done in a day.

The heuristic

If more than 20% of your production queries are semantically similar to previous queries, you should be caching. Check your logs. The number is almost always higher than you think.

Caching LLM responses is not cheating. It is the same engineering discipline we apply to every other expensive computation. The model is a function. Some inputs recur. Cache the outputs. Ship it.

tl;dr

The pattern. Teams pay for fresh LLM inference on every query even when 30–50% of those queries are semantically identical to ones already answered, because caching feels like it defeats the purpose of using AI. The fix. Layer exact-match caching first, then semantic caching at a 0.97 cosine similarity threshold and lower it gradually while monitoring quality. The outcome. API costs drop 40–60%, latency for cached queries falls from seconds to milliseconds, and users get more consistent answers.

Stop hiring ML PhDs for engineering problems

Fri, 19 Sep 2025 00:00:00 +0000

You have a job opening for an “ML engineer.” The job requires deploying models to production, building data pipelines, setting up monitoring, managing infrastructure, and integrating AI outputs into an existing product. You are looking for someone with a PhD in machine learning.

These two things do not match.

A PhD in machine learning trains you to do research — to read papers, design experiments, implement novel architectures, run ablation studies, and write results up for publication. These are valuable skills. They are not the skills your job requires.

Your job requires engineering. Specifically, it requires production engineering for systems that happen to include a model. The model is a component, not the system. You are hiring a researcher to do an engineer’s job, and both of you will be frustrated.

What a PhD trains you to do

A machine learning PhD — at a good program, with a good advisor — produces someone who can:

Read and critique research papers.
Formulate a research question and design experiments to answer it.
Implement models from papers, often from scratch.
Understand the math behind gradient descent, attention mechanisms, loss functions, and optimization.
Run controlled experiments — vary one thing, measure the effect.
Write clearly about technical work.
Navigate ambiguity over multi-year timescales.

This is a rigorous training. It produces people who think carefully and work precisely. But notice what’s not on the list: deployment, monitoring, pipeline engineering, infrastructure management, API design, CI/CD, observability, incident response.

These skills are not taught in PhD programs because they are not research skills. They are engineering skills — the kind you learn by running production systems, getting paged at 3am, and debugging a silent failure in a data pipeline.

What production AI actually needs

Here is what the day-to-day looks like for most AI engineers in production:

Monday: The embedding pipeline failed overnight because a source system changed its API response format. Debug the pipeline, fix the parser, backfill the failed documents, verify the index is consistent.

Tuesday: The PM wants to add a new data source to the RAG system. Design the ingestion pipeline, write the chunking logic, set up the incremental indexing, test retrieval quality with the new source included.

Wednesday: Latency spiked for 20% of users. Investigate — turns out the reranker is timing out for long queries. Add a timeout with fallback to non-reranked results. Update the monitoring dashboard. Write a postmortem.

Thursday: A new model version is available from the provider. Run the eval suite against the new version. Compare accuracy, latency, and cost. Write up the results. Recommend whether to upgrade.

Friday: Code review a teammate’s PR for a new guardrails implementation. Review the integration test coverage. Deploy the weekly model update to staging. Run smoke tests.

This is engineering work. It requires comfort with production systems, debugging skills, infrastructure knowledge, and the ability to move fast without breaking things. The model itself — the thing the PhD spent 5 years studying — is an API call. The work is everything around that API call.

The mismatch in practice

When you hire a PhD for an engineering role, here’s what happens.

The PhD is overqualified for the model work. Choosing between GPT-4o and Claude 3.5 Sonnet doesn’t require understanding the transformer architecture at a mathematical level. Prompt engineering doesn’t require knowing how attention works. Fine-tuning an open-source model uses a library with a one-page quickstart guide. The PhD’s deep technical knowledge is mostly unused.

The PhD is underqualified for the engineering work. They’ve never set up a CI/CD pipeline. They’ve never configured monitoring and alerting. They’ve never designed a data model for a production database. They’ve never been on-call. These aren’t things you can pick up in a week — they’re skills that take years to develop, and the PhD has been developing different skills.

The PhD is frustrated because the work is “not ML.” They expected to train models and run experiments. Instead, they’re debugging a Kubernetes deployment and writing SQL. This isn’t what they signed up for, and it’s not what they’re good at.

You are frustrated because the PhD is slow on engineering tasks. They’re careful and thorough — because that’s what research trained them to be — but production engineering often requires speed, pragmatism, and a willingness to ship something good enough and iterate.

Both of you are unhappy. Neither of you is wrong. You’re just mismatched.

When you actually need a PhD

There are roles where a PhD is the right hire. They are rarer than most companies think.

You’re training your own models. Not fine-tuning with a library — actually designing architectures, writing training loops, managing training infrastructure at scale. This is research work. It benefits from research training.

You’re working on a genuinely novel problem. Your use case doesn’t fit the standard patterns. Off-the-shelf models don’t work. You need someone who can read the literature, understand what’s been tried, and design something new.

You’re building core ML infrastructure. An inference engine, a training platform, a feature store. These systems require deep understanding of how models work at a mathematical and systems level.

You need to evaluate research. Someone needs to read papers, assess whether new techniques are relevant, and decide what to adopt. This is a curation role, and a PhD is uniquely suited for it.

If your product is calling an API, building a RAG pipeline, fine-tuning with LoRA, and integrating the results into a web app — you don’t need a PhD. You need an engineer who is curious about AI and willing to learn the ML-specific concepts as they go.

The hiring fix

Be honest about what the role requires. Do the exercise: write down what the person will do in their first 6 months. Be specific — not “work on our AI product” but “build the ingestion pipeline for our knowledge base, set up eval infrastructure, integrate the model output into the search results page, set up monitoring and alerting.”

Then ask: does this work require a PhD, or does it require a senior engineer who can learn the ML-specific parts?

If it’s 80% engineering and 20% ML, hire an engineer. Look for:

Strong production engineering background — they’ve deployed and operated systems at scale.
Curiosity about ML — they’ve built side projects, taken courses, read the docs.
Comfort with ambiguity — AI systems are less predictable than traditional software and they’re okay with that.
Debugging skills — they can trace a problem from the user complaint to the root cause, even when the root cause is “the model is sometimes wrong.”

If it’s 80% ML and 20% engineering, hire a PhD. But be honest about the engineering requirements and support them with engineering mentorship and infrastructure tooling.

The worst hire is a PhD in an engineering role with no engineering support. They’ll build a beautiful model that nobody can deploy, or a fragile pipeline that works on their laptop but fails in production. Not because they’re bad at their job — because their job was misspecified.

The title problem

Part of the issue is titles. “ML Engineer” could mean a dozen different roles. It could mean someone who trains models, someone who deploys models, someone who builds ML infrastructure, or someone who integrates model outputs into products.

Be specific. “AI product engineer” is a better title for someone who integrates AI into products. “ML infrastructure engineer” is a better title for someone who builds training and serving systems. “Applied research scientist” is a better title for someone who adapts research to production use cases.

Clear titles attract the right candidates. Vague titles attract everyone, and you waste time filtering for a match that you could have specified upfront.

The heuristic

Before you write the job posting, write the first 6 months of work. If it’s mostly engineering — pipelines, integration, monitoring, deployment — hire an engineer and teach them the ML concepts. If it’s mostly research — novel architectures, training runs, experimental design — hire a PhD and support them with engineering. The mismatch is expensive for both sides. Get the role right before you get the person.

tl;dr

The pattern. Teams write job postings for ML PhDs when the actual role is deploying pipelines, setting up monitoring, and integrating model outputs into a product — skills PhD programs do not teach. The fix. Write down what the person will do in their first six months, and if it is 80% engineering, hire a senior engineer with AI curiosity instead of a researcher. The outcome. The right hire ships the feature, operates the system, and is not frustrated by it — and neither are you.

AI is not a department

Fri, 05 Sep 2025 00:00:00 +0000

Someone on your leadership team is going to suggest creating an AI department. It will sound reasonable. You have AI work scattered across three teams. Nobody owns the model infrastructure. The data scientists report to different managers with different priorities. Centralizing makes sense — in theory.

In practice, it is the single most reliable way to ensure your AI investments produce nothing of value.

We have seen this pattern at a dozen companies. The AI department gets formed with great fanfare. Smart people are hired. A roadmap is produced. Six months later, the AI team has built impressive internal tools that nobody uses, the product teams are still doing AI work on their own because the AI team’s queue is 8 weeks long, and the CEO is asking why the AI investment has not moved any business metrics.

The problem is not the people. The problem is the org chart.

Why centralized AI teams fail

A centralized AI team fails for the same reason a centralized “innovation lab” fails. It separates the people with the technical capability from the people with the business context.

They build things nobody asked for. Without daily proximity to customers, product managers, and business problems, the AI team gravitates toward technically interesting work. They build a better embedding model. They create a prompt management framework. They prototype a multi-agent system. These might be impressive artifacts. They are not products. Nobody in the business asked for them because nobody in the business knows they exist.

Product teams can’t use what they build. The AI team builds a recommendation engine. They hand it to the product team. The product team looks at the API, realizes it does not handle their edge cases, does not integrate with their data pipeline, and returns results in a format their frontend cannot consume. The AI team says “that’s an integration problem, not an AI problem.” The product team says “it’s your problem because you built it.” The recommendation engine sits unused.

They become a service org. Eventually, the product teams do start sending requests. Now the AI team is drowning. They have 14 requests from 6 teams, no product ownership of any of them, and no ability to prioritize because they do not understand the business context well enough to know which request matters most. They become an internal agency — taking briefs, estimating timelines, delivering work that satisfies the brief but misses the intent. This is the worst possible configuration for AI work, which requires tight iteration loops and deep domain understanding.

The talent gets frustrated. Good AI engineers want to ship products, not service tickets. When the centralized team becomes a service org, the best people leave. They go to companies where they sit on product teams and ship things that users touch. You are left with the people who are comfortable being an internal consultancy — which is rarely the talent profile you need.

The embedded model

The fix is embedding. Not the vector kind — the organizational kind.

The AI engineer sits on the product team. They attend the standups. They know the customers. They understand the data. They ship with the product team and are on-call with the product team. Their manager is the product team’s engineering manager, not a central AI lead.

This is not a new idea. It is how the best companies have always organized specialized engineering capability. Infrastructure engineers sit on product teams. Security engineers sit on product teams. Data engineers sit on product teams. AI engineers should sit on product teams.

When the AI engineer is embedded, the iteration loop is tight. The product manager describes the problem. The AI engineer proposes a solution. They prototype it together. They test it with users. They ship it. The whole cycle takes days, not months. There is no handoff, no requirements document, no integration gap.

The embedded engineer also develops something no centralized team can replicate — domain expertise. After 6 months on a product team, the AI engineer understands the domain as well as anyone on the team. They know which edge cases matter. They know where the data is messy. They know what “good enough” looks like for this specific product. This domain expertise is the difference between an AI feature that works in a demo and an AI feature that works in production.

The role of the central function

Some companies need a central AI function. But its role is different from what most companies assume.

The central AI function does not build features. It builds the platform that feature teams build on. There is a meaningful difference.

Standards. The central function sets standards for how AI systems are built, evaluated, and monitored. Which models are approved for production use. How evals are structured. What monitoring is required. What the incident response process looks like. These standards keep the embedded engineers from reinventing the wheel and ensure consistency across teams.

Shared infrastructure. The central function maintains the shared infrastructure — the model gateway, the eval platform, the prompt management system, the cost monitoring dashboard. This infrastructure is used by every product team. Building it centrally avoids duplication and ensures quality.

Hiring and career development. The central function owns the AI engineering discipline. They define the role, run the hiring process, set the career ladder, and facilitate knowledge sharing across embedded engineers. The embedded engineers report to their product team managers for day-to-day work, but the central function ensures they are connected to a community of practice.

The eval platform. This deserves special emphasis. A shared eval platform — where every team runs their AI evaluations in a consistent way, with comparable metrics and shared tooling — is the single most valuable thing a central AI function can build. It is the thing that turns AI development from “we think it works” to “we know it works.”

What the central function does not do: take feature requests from product teams. Build features for product teams. Own the roadmap for any product team’s AI work. Those responsibilities stay with the product teams, where they belong.

The three org models

There are three ways to organize AI capability. Each has a place.

Centralized

All AI engineers report to a single AI leader. The AI team takes requests from product teams and delivers solutions.

When it works: early stage, 1-3 AI engineers, one or two use cases. The team is small enough that communication overhead is low, and there are not enough AI engineers to embed in multiple teams.

When it fails: at scale. The moment you have more than 3 product teams requesting AI work, the centralized team becomes a bottleneck. Prioritization is political. Delivery is slow. Domain context is thin.

Embedded

All AI engineers report to product team engineering managers. No central AI function exists.

When it works: when your AI engineers are senior enough to set their own standards and your product teams are mature enough to manage specialized talent. This is the leanest model and produces the fastest iteration.

When it fails: when standards diverge. Team A uses one eval framework, Team B uses another, Team C does not eval at all. There is no consistency, no shared learning, and no way to compare AI performance across teams. The AI capability becomes fragmented.

Hybrid

AI engineers are embedded in product teams but a small central function (2-4 people) sets standards, maintains shared infrastructure, and facilitates community. The embedded engineers have a dotted-line reporting relationship to the central function for discipline and a solid-line to their product team manager for delivery.

When it works: at most scales. This is the model we recommend to most companies with 5+ AI engineers. It combines the speed and domain context of embedding with the consistency and infrastructure of a central function.

When it fails: when the dotted-line becomes a solid-line — when the central function starts pulling embedded engineers into central projects, or when product team managers treat the AI engineer as a temporary resource rather than a permanent team member. The model requires organizational discipline to maintain.

The transition

If you already have a centralized AI team and want to move to an embedded model, do it gradually. Pick one product team — the one with the clearest AI use case and the most receptive engineering manager. Move one AI engineer to that team. Give it a quarter. Measure the output. If it works — and it usually does — move the next engineer. Let the central team shrink naturally as engineers embed.

The central team’s remaining members become the platform team. They stop building features and start building infrastructure. This is a career transition for some people and not everyone will want it. That is okay. The people who want to build features should be on product teams. The people who want to build platforms should be on the platform team. Let people self-select.

The transition takes 2-3 quarters to complete. Resist the temptation to do it all at once. Organizational change is a deployment — you roll it out incrementally, monitor the results, and adjust.

tl;dr

The pattern. Companies create centralized AI departments that build impressive demos nobody uses — because the team with the AI capability is disconnected from the teams with the business context. The fix. Embed AI engineers directly into product teams where they ship alongside product managers and engineers, with a small central function that owns standards, shared infrastructure, and the eval platform. The outcome. AI features ship in days instead of months, domain expertise compounds inside product teams, and your AI investment shows up in business metrics instead of internal showcases.

The integration is harder than the model

Fri, 22 Aug 2025 00:00:00 +0000

Here is how every AI project goes. Week 1: the team gets the model working in a notebook. The demo is impressive. The output is good. Everyone is excited. The PM asks when it can ship.

Then integration starts. And the project takes 3 more months.

Getting the model to produce the right output is the easy part. Taking that output and fitting it into your existing systems — your UI, your database, your permissions model, your audit trail, your error handling, your user workflows — that is the actual project. The model is the ingredient. The integration is the meal.

Where the time goes

The model returns a JSON blob. Now what? Here is a partial list of things that need to happen before a user sees anything:

UI integration. Where does the output appear? Does it replace an existing element or augment it? What does it look like when the model is loading? What does it look like when the model fails? What does it look like when the model returns low-confidence results? Does the UI need to support streaming? Does it need to show sources or citations? Does it need an edit/regenerate flow? Each of these is a design decision that requires mockups, review, and implementation.

Workflow integration. How does the AI output interact with the user’s existing workflow? If the model suggests an action, does the user approve it first or does it happen automatically? If the user edits the output, is the edit tracked? Does the AI output feed into a downstream process — an approval chain, a notification, a report? Does the workflow change depending on the confidence level?

Data integration. Where is the output stored? Is it a first-class entity in your data model or a side effect? Does it need to be queryable? Searchable? Reportable? Does it need to join with existing tables? What is the schema? What happens to the output when the user deletes the input that triggered it?

Permissions. Who can see the AI output? Is it the same permission model as the input data? What if the model references documents the user doesn’t have access to? What if the model’s training data includes information that should be restricted?

Audit trail. In regulated industries — and increasingly in non-regulated ones — you need to know why the system produced a given output. What model version was used? What prompt? What context was retrieved? What was the input? All of this needs to be logged, stored, and retrievable. That’s a data pipeline.

Error handling. What happens when the model fails? Not just API errors — what about when the model returns a valid response that’s wrong? Who decides it’s wrong? What does the fallback look like? Does the user see an error state, or do they just not see the AI feature? What about partial failures — the model succeeded but the guardrail blocked the output?

Each of these is a workstream. Each one has edge cases. Each one requires coordination with teams that don’t report to you — design, infrastructure, security, compliance. The model was built by your team in a week. The integration requires half the engineering org.

Why teams underestimate this

Three reasons.

The demo was fast. The team showed a working model in week 1. Stakeholders anchored on that speed. “The hard part is done” — except it wasn’t. The hard part hadn’t started yet. The demo proved the model works. It did not prove the product works.

The model is the novel part. Integration is boring. It’s plumbing. It’s the same kind of work the team does for any feature — API endpoints, database migrations, UI components, permission checks. Nobody gets excited about a database migration. So nobody talks about it in planning. So nobody accounts for it in the timeline.

The edge cases are invisible. When you’re building the model, the happy path is all you see. The model takes input, produces output, the output is good. But in production, inputs are messy, outputs are sometimes wrong, users do unexpected things, and the system needs to handle all of it gracefully. The edge cases don’t surface until integration, and each one takes longer than you think.

Scope the integration first

The fix is counterintuitive: scope the integration before you scope the AI.

Before you pick a model, before you write a prompt, before you build an eval — map the integration surface. Ask:

Where does the output appear in the UI? Draw it.
What is the data model? Schema it.
What permissions apply? Write them down.
What is the error handling strategy? Define the states.
What audit requirements exist? List them.
What downstream systems consume the output? Enumerate them.

This exercise takes a day. It will save you a month. It reveals the actual scope of the project — not the model scope, the product scope. And it often reveals that the integration constraints should influence the model design.

If the permissions model requires that the AI output never references documents the user can’t access, that’s a retrieval constraint — you need to filter at retrieval time, not at output time. If the audit trail requires reproducibility, that constrains your model choice — you need deterministic outputs, which means temperature zero and version-pinned models. If the UI needs to support edit/regenerate, that constrains your output format — you need structured output that the user can modify, not a wall of text.

The integration shapes the AI, not the other way around.

The timeline heuristic

Here is the rough breakdown we see across projects:

Model selection and prompt engineering: 1-2 weeks.
Eval suite development: 1-2 weeks.
Integration — UI, data, permissions, errors: 6-10 weeks.
Testing, edge cases, polish: 2-4 weeks.

The model is 10-20% of the project. The integration is 50-70%. The rest is testing and polish.

If your project plan allocates equal time to “build the AI” and “integrate the AI,” you are going to miss your deadline. Double the integration estimate and you’ll be closer to reality.

The organizational pattern

The other thing that slows integration down is organizational. The AI team builds the model. A different team owns the UI. A third team owns the backend. A fourth team owns the data platform. Getting these teams to coordinate on a feature that cuts across all of them — that’s a project management problem, not a technical one.

The best pattern we’ve seen: embed the AI engineer on the product team for the duration of the integration. Not as a consultant who reviews PRs — as a team member who writes code, attends standups, and pairs with the frontend engineer on the streaming UI and the backend engineer on the error handling.

The worst pattern we’ve seen: the AI team builds a “model API” and throws it over the wall. The product team integrates it without understanding its failure modes. The result is a brittle integration that breaks in ways nobody anticipated because nobody on the product team understood the model’s behavior, and nobody on the AI team understood the product’s constraints.

The heuristic

The model is the demo. The integration is the product. Scope the integration first, staff the integration with your best engineers, and plan for the integration to take 3-5x longer than the model work. If your timeline doesn’t reflect this, it reflects wishful thinking.

tl;dr

The pattern. A working model in a notebook takes one week, stakeholders anchor on that speed as “the hard part is done,” and nobody accounts for the UI states, data model, permissions, audit trail, error handling, and cross-team coordination that make up the actual product. The fix. Scope the integration before you scope the AI — draw the UI, schema the data model, list the permissions and audit requirements — so the integration constraints shape the model design rather than fighting it after the fact. The outcome. Projects that budget 50-70% of their time for integration ship on schedule instead of discovering a quarter of unplanned work the moment the model hands off to the product team.

Your A/B test is lying because your baseline is moving

Fri, 08 Aug 2025 00:00:00 +0000

You ran an A/B test on your AI feature. The treatment group saw a 12% improvement in task completion. The test ran for 4 weeks. You shipped it. Congratulations — your results are probably wrong.

Here’s why. During those 4 weeks, your model provider pushed two updates. Your retrieval index was re-built twice. Your prompt was edited once by an engineer who “just fixed a typo.” The system your users experienced in week 1 was not the same system they experienced in week 4.

Your control group was not constant. Your treatment was not stable. Your A/B test measured something, but it wasn’t what you think it measured.

The fundamental problem

A/B testing requires a stable treatment and a stable control. You change one thing — the treatment — and measure the difference. If both sides are changing simultaneously, you can’t attribute the outcome to the treatment.

Traditional software features are stable once deployed. A new checkout flow doesn’t morph over time. The button stays blue. The copy stays the same. The logic doesn’t drift.

AI features drift by default. The model changes when the provider updates it. The retrieval results change when the index is rebuilt. The prompt changes when someone edits it. The guardrails change when the safety team updates the rules. Even the input distribution changes as users adapt their behavior to the system.

In a 4-week A/B test, you might see:

2 model updates from your provider (often unannounced for minor versions).
1-2 retrieval index rebuilds as new documents are ingested.
1-3 prompt changes as the team iterates.
Continuous input distribution shift as users learn what works.

Each of these changes affects both the control and the treatment, but not equally. A model update might improve the treatment while degrading the control — or vice versa. You can’t tell, because you didn’t isolate the variable.

Version everything

The first fix is version control for your entire AI stack, not just your code.

Model version. Pin the model version for both control and treatment. If your provider doesn’t support version pinning — if you’re calling gpt-4o instead of gpt-4o-2024-08-06 — you are running an experiment where the treatment changes without your knowledge. Pin the version. If the provider pushes a breaking update, that’s a reason to restart the experiment, not to let it continue.

Prompt version. Your prompt is not code that lives in a repo and gets deployed through CI. It should be, but for most teams it isn’t. During an A/B test, freeze the prompt. No edits. No “small fixes.” If someone changes the prompt, the experiment is contaminated.

Retrieval configuration. Freeze the retrieval config: the embedding model, the chunk size, the reranker, the number of results. If your index rebuilds during the experiment, rebuild both control and treatment simultaneously from the same snapshot.

Guardrails and post-processing. Version your guardrails configuration. A new content filter that blocks certain outputs will change your completion rate, which will change your metrics, which will corrupt your experiment.

This is a lot of things to freeze. That’s the point. AI systems have more moving parts than traditional features, which means A/B tests require more discipline, not less.

Run shorter, wider

Traditional A/B tests run for 2-4 weeks to accumulate statistical significance. For AI features, this is too long. Too many things change in 4 weeks.

The fix: run shorter experiments with larger populations. Instead of 4 weeks at 5% traffic, run 1 week at 20% traffic. You get the same sample size in a quarter of the time, and you reduce the window for confounders.

This isn’t always possible. Some metrics — retention, conversion over time — require longer observation windows. For those, you need a different approach: cohort analysis with versioned snapshots. Group users by the exact system version they experienced, not just by the time window. A user who experienced model version A with prompt version 3 is in a different cohort than a user who experienced model version B with prompt version 3, even if they’re both in the “treatment” group.

This is more analytical work. It’s also more honest.

Offline evals gate online experiments

Here’s the pattern we recommend: offline evaluation comes before A/B testing. Not instead of it — before it.

Your eval suite is your first gate. Run the new prompt, the new model, the new retrieval config against your eval dataset. Compare accuracy, relevance, latency, and cost against your baseline. If the offline eval doesn’t show improvement, don’t bother with the A/B test. You’re not going to find a signal in production that you can’t find in evaluation.

If the offline eval does show improvement, then you’ve earned the right to run an A/B test. But the A/B test is now answering a narrower question: does the improvement in eval translate to an improvement in user behavior? That’s a much cleaner experiment.

The eval suite also gives you a stable baseline. Your A/B test baseline drifts because the production system drifts. Your eval baseline is fixed — same inputs, same expected outputs, measured on every version. If your eval score drops during the A/B test, you know the system changed. You can decide whether to continue or restart.

The metrics problem

Even if you fix the versioning problem, AI A/B tests have a metrics problem. What are you measuring?

Traditional A/B tests measure user behavior: clicks, conversions, time on page. These are well-understood metrics with well-understood statistical properties.

AI feature metrics are harder. “Answer quality” is not a metric you can measure directly from user behavior. A user who gets a wrong answer might not know it’s wrong — they’ll click through, seem satisfied, and only discover the problem later. A user who gets a correct but verbose answer might bounce — looking like a negative signal when the system actually worked.

Proxy metrics are necessary but treacherous. Common ones:

Task completion rate. Did the user finish what they started? But completion doesn’t mean the answer was right.
Reformulation rate. Did the user rephrase their query? High reformulation might mean the system is bad, or it might mean the user is exploring.
Thumbs up/down. Direct feedback, but biased toward strong opinions and heavily affected by UI placement.
Time to resolution. How long did it take the user to get what they needed? But you’re measuring a noisy signal over a long time horizon.

None of these are great. The best approach is to combine multiple metrics and look for directional agreement. If task completion is up, reformulation is down, and feedback is positive — you probably have a real improvement. If the signals disagree, you don’t have a clear result and you shouldn’t ship based on the A/B test alone.

What to do when you can’t A/B test

Sometimes A/B testing is impractical. Your user base is too small. The feature is too niche. The metric requires too long an observation window.

In those cases, lean on offline evaluation and qualitative assessment:

Run your eval suite on the new version. Compare to baseline.
Have domain experts review a sample of outputs. Rate them blind — don’t tell them which version produced which output.
Deploy to a small group of internal users first. Collect structured feedback.
Ship with a kill switch and monitor closely for the first week.

This is not as rigorous as a well-run A/B test. But a well-run A/B test on an AI feature is harder than most teams think, and a poorly-run A/B test gives you false confidence — which is worse than no data at all.

The heuristic

Before running an A/B test on an AI feature: pin every version (model, prompt, retrieval, guardrails), run offline evals as a gate, and prefer 1 week at high traffic over 4 weeks at low traffic. If you can’t freeze the system for the duration of the experiment, you can’t A/B test it. Use offline evals instead and be honest about the uncertainty.

tl;dr

The pattern. A/B tests on AI features run for four weeks while the model provider pushes updates, the retrieval index rebuilds, and someone edits “just a typo” in the prompt — so the control group is never actually constant and the result measures drift, not the treatment. The fix. Pin every moving part (model version, prompt, retrieval config, guardrails) for the entire experiment window, run offline evals as a gate before starting, and prefer one week at high traffic over four weeks at low traffic to shrink the contamination window. The outcome. You ship changes based on what your treatment actually did instead of what the background noise of a drifting system happened to produce during your experiment.

The latency budget your PM forgot

Fri, 25 Jul 2025 00:00:00 +0000

Your product spec says “fast.” Let’s do the math.

Your LLM call takes 3 seconds. Your retrieval takes 800ms. Your reranker takes 400ms. Your post-processing — guardrails, formatting, logging — takes another 200ms. You are at 4.4 seconds before any business logic, before any database writes, before the response even starts rendering in the UI.

Your PM’s mental model is a web app. Click a button, see a result. 200ms feels fast. 500ms feels acceptable. Anything over a second feels slow. They specced the feature assuming the latency profile of a REST API. You are building something with the latency profile of a batch job.

Nobody talked about this at kickoff. Now it’s week 6 and the feature works but nobody wants to use it because it takes 5 seconds.

The components

Here’s where the time goes in a typical RAG-powered feature:

Embedding the query: 50-100ms. This is the cheap one. People rarely worry about this, and they’re right not to.

Retrieval: 200-800ms. Depends on your vector database, your index size, and how much filtering you’re doing. Most managed vector databases land around 200-400ms for a simple query. Add metadata filtering and it climbs. Add hybrid search — vector plus keyword — and you’re north of 500ms.

Reranking: 200-600ms. If you’re using a cross-encoder reranker — and you should be, the quality improvement is real — you’re adding another few hundred milliseconds. The latency scales with the number of candidates you rerank. Rerank 20 chunks and it’s fast. Rerank 100 and it’s not.

LLM call: 1-8 seconds. This is the dominant cost. It depends on the model, the prompt length, and the output length. GPT-4-class models are 2-5 seconds for a typical completion. Smaller models are faster but less capable. Streaming helps the perceived latency but doesn’t reduce time-to-last-token.

Post-processing: 100-500ms. Guardrails, output validation, structured extraction, logging, writing to a database. Each step is small. They add up.

Network overhead: 100-300ms. Round trips to external services, TLS handshakes, DNS lookups. If your vector database is in a different region than your compute, add more.

Total: 2-10 seconds for a single turn. And that’s the happy path — no retries, no fallbacks, no model timeouts.

Why PMs don’t think about this

Product managers think in user stories, not in architecture diagrams. When a PM writes “user asks a question, system returns an answer,” the implicit assumption is that the answer appears quickly. They’re pattern-matching against search — type a query, see results. Google does it in 400ms. How hard can it be.

The gap is that nobody sits down with the PM and says: here is the latency budget for this feature. Here is what each component costs in wall-clock time. Here is the total. Do you still want to build it this way?

This conversation should happen at spec time, not at demo time. But it almost never does, because at spec time the engineering team hasn’t built the thing yet and doesn’t have concrete numbers. So they say “it should be fine” and move on. By the time they have numbers, the feature is built and the only question is how to make it faster — not whether the approach was right in the first place.

The latency budget

A latency budget is exactly what it sounds like: a breakdown of how much time each component gets, summing to a total that the user experience can tolerate.

Here’s an example for a conversational RAG feature with a 3-second target:

Component	Budget	Notes
Query embedding	80ms	Mostly fixed
Retrieval	300ms	Requires index tuning
Reranking	250ms	Limits candidate count to 20
LLM (TTFT)	1500ms	Streaming, time to first token
LLM (full)	2500ms	Streaming hides this
Post-processing	200ms	Async where possible
Network	170ms	Co-locate services
Total (perceived)	2300ms	With streaming
Total (actual)	3500ms	Full completion

The perceived latency — the time until the user sees something happening — is 2.3 seconds because streaming starts delivering tokens after the time-to-first-token. The actual latency is 3.5 seconds for the full response. This is the difference between “feels responsive” and “feels slow” even though the underlying work is identical.

Notice that the budget forces design decisions. Reranking is capped at 20 candidates, which means retrieval needs to return high-quality results in the first pass. Post-processing must be async where possible — log writes and analytics don’t block the response. Services must be co-located to keep network overhead low.

These are engineering decisions driven by a latency budget. Without the budget, you make these decisions reactively — after the thing is too slow — instead of proactively.

UX patterns that buy time

When your latency budget exceeds what a synchronous interaction can tolerate, you have three options. Most teams reach for the first and ignore the other two.

Streaming. Stream the LLM output token by token. This is table stakes now. It drops perceived latency from time-to-last-token to time-to-first-token, which is typically 500-1500ms faster. But streaming doesn’t help with the pre-LLM latency — retrieval and reranking still block.

Progressive loading. Show intermediate results as they become available. Show the retrieved sources before the LLM response. Show a skeleton of the answer before it’s complete. Show confidence indicators that update as more context is processed. This is more work than streaming but it transforms a 4-second wait into a 4-second experience where things are visibly happening.

Async processing. Not every AI interaction needs to be synchronous. If the user is submitting a document for analysis, the result can arrive in a notification. If the user is requesting a report, it can be emailed. The UX should match the latency, not the other way around. A 30-second generation is unbearable as a synchronous wait and perfectly fine as “we’ll notify you when it’s ready.”

The choice depends on the use case. Chat interfaces need streaming. Search interfaces need progressive loading. Document processing can be async. The mistake is assuming everything must be synchronous because the PM specced it as a button click.

The conversation to have

Before you write a line of code, have this conversation with your PM:

Here is the latency budget for this feature. Here is what each component costs.
The total is N seconds. Here is what the user will experience.
Given that latency, here are the UX options: streaming, progressive loading, async.
Which of these is acceptable? That determines how we build it.

If the PM says “none of those are acceptable, it needs to be under 500ms” — great, now you know this feature requires a fundamentally different architecture. Maybe you pre-compute. Maybe you use a smaller model. Maybe you cache aggressively. But you know that before you build, not after.

The heuristic

Every AI feature needs a latency budget before it has a product spec. Add up the components. Show the total to your PM. If the number is uncomfortable, redesign the UX or redesign the architecture — but don’t pretend the number is going to be different when you ship.

tl;dr

The pattern. PMs spec AI features with a web-app mental model, nobody does the latency math at kickoff, and the team discovers at week 6 that retrieval plus reranking plus an LLM call adds up to 5 seconds before any business logic runs. The fix. Before writing a line of code, build a latency budget that breaks down each component’s wall-clock cost, show the total to your PM, and choose the appropriate UX pattern — streaming, progressive loading, or async — based on what the number actually is. The outcome. Architecture decisions get made proactively during design instead of reactively after users complain that the feature is too slow to use.

AI teams need on-call. Not optional.

Fri, 11 Jul 2025 00:00:00 +0000

If your AI system is in production and nobody is on-call for it, you have made a decision. You have decided that your users will be the ones who discover failures. That is a choice you are making — you should at least make it consciously.

Most AI teams we work with don’t have on-call rotations. They have a Slack channel. Maybe a dashboard someone checks on Mondays. When something goes wrong, the signal path is: user notices bad output, user complains to support, support files a ticket, ticket gets triaged, engineer looks at it 3 days later, engineer discovers the model has been hallucinating since Thursday.

That is not an operational posture. That is hope.

AI failures are quiet

Traditional software fails loudly. A null pointer throws an exception. A database timeout returns a 500. A broken deployment triggers a health check failure. Your monitoring catches these because they are binary — the system either works or it doesn’t.

AI systems fail quietly. The model doesn’t crash. It returns a 200. The response looks plausible. It’s just wrong. Your user gets a confident answer that cites a document that was deleted 6 weeks ago, or a classification that’s subtly shifted because the input distribution changed, or a summary that omits the most important paragraph.

No alert fires. No error log gets written. The system is running perfectly — it’s just producing garbage.

This is why traditional monitoring is necessary but not sufficient. You need health checks and latency tracking and error rate dashboards, yes. But you also need monitoring that understands the outputs.

What on-call for AI actually means

On-call for AI systems is not the same as on-call for a web service. You’re watching for different things.

Output distribution shifts. If your classification model usually returns category A 40% of the time and it suddenly starts returning category A 80% of the time, something changed. Maybe the model updated. Maybe the input distribution shifted. Either way, a human should look at it.

Drift detection. Compare today’s outputs to last week’s outputs on similar inputs. If the distribution is moving, you want to know before your users do.

Latency anomalies. LLM latency is noisy, but it’s not random. If your p95 doubles overnight, either the provider is having issues or your prompts got longer or your retrieval is returning more context. All of these matter.

Cost spikes. A bug in your chunking logic can 10x your token usage overnight. A retry loop that doesn’t back off can burn through your API budget in hours. If you’re not alerting on cost, you will get a surprise invoice.

Eval regression. Run your eval suite on a schedule — daily at minimum. If your accuracy on the held-out set drops below your threshold, page someone. Don’t wait for the weekly review.

The “but we’re a small team” objection

Every AI team we’ve talked to about on-call has the same response: we’re too small. We can’t afford a rotation. We only have 3 engineers.

You have 3 engineers and a production system that serves users. Traditional engineering teams your size have on-call. The AI team doesn’t get an exemption because the system is newer or less understood. If anything, the opposite is true — less understood systems need more operational rigor, not less.

The rotation doesn’t need to be heavy. Start with:

One person is primary each week. They carry a phone.
Alerts fire for: eval regression below threshold, latency p95 above target, cost anomaly above 2x daily average, output distribution shift above threshold.
Response expectation: acknowledge within 30 minutes during business hours, 2 hours outside.
Escalation path: if primary can’t resolve, they pull in the model owner.

That’s it. Four alert types. One person per week. Acknowledgment SLAs. This is not a large operational burden. It is the minimum bar for running a production system.

What to monitor — concretely

Here is the monitoring stack we recommend for most AI systems:

Tier 1 — page someone.

Eval suite accuracy drops below threshold (run daily).
Latency p95 exceeds 2x baseline for 15 minutes.
Error rate exceeds 5% for 10 minutes.
Daily cost exceeds 2x trailing 7-day average.

Tier 2 — ticket, investigate within 24 hours.

Output distribution shift detected (KL divergence above threshold).
New failure mode appears in error logs (novel error string).
Retrieval hit rate drops below baseline (for RAG systems).
User feedback negative rate increases above baseline.

Tier 3 — review weekly.

Model provider changelog (did the model update?).
Input distribution trends (are users asking different questions?).
Cost trends (are we drifting up?).
Eval suite coverage (are we testing the right things?).

The specific thresholds depend on your system. But the structure doesn’t. You need all three tiers, and you need them before your users start complaining.

The eval suite is your smoke detector

The most important piece of the monitoring stack is the eval suite running on a schedule. Everything else — latency, cost, error rates — those are infrastructure metrics. They tell you the system is running. They don’t tell you the system is right.

Your eval suite tells you the system is right. It is the only thing in your monitoring stack that checks the quality of the outputs. If your eval suite is only running in CI — only running when someone pushes a code change — you are missing the most important class of failures: the ones that happen when nothing in your code changes.

Model provider updates. Retrieval index drift. Input distribution shifts. These all degrade quality without any deployment. Your CI pipeline doesn’t catch them because there’s nothing to trigger the pipeline.

Run your evals daily. On production data if possible, on a representative sample if not. Compare against your baseline. Alert when it drops.

The organizational problem

The deeper issue is organizational. Most companies treat AI systems as a special category — not quite software, not quite data, something new that doesn’t fit existing operational patterns. This leads to operational gaps.

The infrastructure team doesn’t own the AI system because “that’s the ML team’s thing.” The ML team doesn’t do ops because “that’s infrastructure’s job.” Nobody is on-call because nobody owns the full stack.

The fix is simple in concept and hard in execution: someone owns the production AI system end-to-end. That person — or that team — is on-call for it. They are responsible for the model and the infrastructure and the pipeline and the outputs. They don’t get to say “the model is fine, it must be an infrastructure issue” or “the infrastructure is fine, it must be a model issue.” They own both.

This is the same pattern that DevOps solved a decade ago for traditional software. You build it, you run it, you get paged for it. AI systems don’t get a special exemption.

The heuristic

If your AI system is in production and nobody gets paged when it fails, you don’t have a production system. You have a demo that happens to be serving users.

The bar: run evals daily, alert on regressions, have one person on-call per week, and treat output quality as a production metric — not a research metric. Do this before you build the next feature.

tl;dr

The pattern. AI systems fail silently — returning a 200 with a plausible but wrong answer — and the signal path from failure to fix runs through user complaints, support tickets, and a three-day triage queue. The fix. Stand up a four-alert on-call rotation (eval regression, latency spike, cost anomaly, output distribution shift) with one primary per week before you ship anything to users. The outcome. Output quality becomes a production metric you catch internally instead of a research metric your users discover first.

The build-vs-buy decision nobody wants to make

Fri, 20 Jun 2025 00:00:00 +0000

Build or buy. The question comes up in every AI engagement we do. And every time, the team has been discussing it for weeks — sometimes months — without making a decision. They have a spreadsheet with pros and cons. They have had three meetings about it. They have a Slack channel called #ai-vendor-eval with 400 messages and no conclusion.

Meanwhile, they have built nothing and bought nothing. The opportunity cost of indecision is the cost nobody puts on the spreadsheet.

The one-sentence framework

Build when the AI is your product. Buy when the AI is a feature in your product.

That is the framework. Everything else is detail. But the detail matters, so let’s walk through it.

If your company’s competitive advantage comes from the AI itself — if the model’s performance is what makes customers choose you over the alternative — you should build. You need to control the training data, the model architecture, the evaluation criteria, the deployment pipeline. Outsourcing your core differentiator to a vendor is outsourcing your moat.

If your company’s competitive advantage comes from something else — your distribution, your brand, your data, your relationships — and the AI is a capability that makes your product better but is not the product itself, you should buy. You do not need to be world-class at AI infrastructure to add a summarization feature to your app. You need to be world-class at the thing that actually makes you money.

Most teams are in the second category and think they are in the first. This is the main source of bad build decisions.

The hidden costs of building

Building looks cheap on the whiteboard. You have engineers. You have data. The models are open-source. How hard can it be?

Here is what the whiteboard does not show.

Maintenance. A model in production is not a feature you ship and forget. It is a system that degrades. Data distributions shift. User behavior changes. The model that worked in January gives subtly worse results by June. You need monitoring, alerting, and a retraining pipeline. This is not a one-time cost — it is a permanent line item.

On-call. When the model starts producing bad output at 2am — and it will — someone has to debug it. AI failures are not like software failures. There is no stack trace. The model is not “broken” — it is confidently wrong. Debugging requires someone who understands the model, the data, the evaluation criteria, and the production environment. That person is expensive, hard to hire, and miserable if they are on call alone.

Model upgrades. The foundation model you built on today will be obsolete in 18 months. When the next generation ships — faster, cheaper, more capable — you need to evaluate it, migrate to it, re-run your evals, update your prompts, and regression-test everything. This is a project every time it happens, and it happens constantly.

Eval infrastructure. You need to know if your system is working. That means building an evaluation framework — test sets, metrics, automated runs, dashboards. The eval infrastructure is often as much work as the model itself. Teams that skip it do not know when their system breaks. Teams that build it spend significant engineering time maintaining it.

Opportunity cost. Every engineer working on AI infrastructure is not working on your product. If AI is not your product, this trade-off is probably wrong.

The hidden costs of buying

Buying looks expensive on the contract. But the hidden costs are not in the contract — they are in the constraints.

Vendor lock-in. Once you integrate a vendor’s API, switching costs are real. Your prompts are tuned to their model. Your data pipeline feeds their format. Your team’s expertise is in their platform. Switching means rebuilding, re-evaluating, and re-deploying. Most teams never switch, even when a better option appears, because the switching cost is too high.

Data residency. Your data goes to the vendor. Where does it go? What jurisdiction? Who can access it? Is it used for training? These questions matter — especially in regulated industries. The answers are in the terms of service, which change. You are making a data governance decision every time you send a request to a vendor API.

Customization limits. The vendor’s model works for the general case. Your use case is not the general case. You need it to handle your domain’s terminology, your customers’ phrasing, your company’s specific edge cases. The vendor gives you a prompt and a temperature slider. That might not be enough. And if it is not enough, your options are limited — you cannot fine-tune their model, you cannot modify their retrieval pipeline, you cannot change their output format beyond what the API exposes.

Pricing changes. The vendor’s pricing today is not their pricing next year. API costs drop — good for you. But platform fees, enterprise tiers, and per-seat pricing tend to move in the other direction. You are betting on the vendor’s incentives aligning with yours over a multi-year horizon. Sometimes they do. Sometimes they do not.

The hybrid approach that usually works

The answer for most teams is neither pure build nor pure buy. It is: buy the foundation, build the last mile.

Use a vendor for the base model — the language model, the embedding model, the reranking model. These are commodities. They are getting cheaper and better every quarter. Building your own foundation model is almost certainly not a good use of your resources unless you are a very large company with very specific requirements.

Build the parts that are specific to your business — the data pipeline that feeds your domain knowledge into the system, the evaluation framework that measures performance on your use cases, the integration layer that connects the model to your systems, the prompt engineering that encodes your business logic.

This is where the value is. The base model is the same for everyone. The last mile — the data, the evals, the integration, the prompts — is what makes your system work for your business. You own that. The vendor owns the commodity underneath.

In practice, this means you might use OpenAI or Anthropic for the model, build your own retrieval pipeline with your data, write your own evaluation suite with your domain experts, and maintain your own prompt library that encodes your business rules. The vendor provides the intelligence. You provide the judgment.

The honest self-assessment

Before you decide, answer one question honestly: do you have the team to build and maintain this for 3 years?

Not build it once. Build it, maintain it, improve it, debug it, upgrade it, and keep it running — for 3 years. Because that is the minimum commitment. AI systems are not projects. They are products. They need ongoing investment. They need people who understand them. They need a roadmap.

If you have a team of 2 ML engineers and they are also doing data science for the marketing team, you do not have the team to build. If you have a team of 6 with dedicated ML engineering and MLOps capability, you might.

The question is not “can we build it?” Teams can build almost anything given enough time. The question is “can we build it and maintain it better than a vendor can, while also doing everything else we need to do?” For most teams, the honest answer is no. And that is fine. That is what vendors are for.

The decision matrix

Ask these four questions. If you answer “yes” to 3 or more, build. Otherwise, buy the foundation and build the last mile.

Is the AI your core product differentiation — the reason customers choose you?
Do you have a dedicated team (3+ engineers) who will own this for 3+ years?
Do you have data or domain constraints that make vendor solutions unworkable?
Is the total cost of building (including maintenance, on-call, upgrades) less than 2x the vendor cost?

Most teams answer “yes” to 1 or 2 of these. That is a buy signal, not a build signal. The sooner you accept that, the sooner you ship.

tl;dr

The pattern. Teams spend months debating build-vs-buy while doing neither — burning runway on indecision instead of shipping. The fix. Build when the AI is your product, buy when it is a feature — and for most teams, the right move is to buy the foundation model and build the last mile of data, evals, and integration. The outcome. You ship in weeks instead of quarters, your engineers work on your actual product, and you preserve the ability to switch vendors when the market moves.

Build one pipeline well before building two

Fri, 06 Jun 2025 00:00:00 +0000

Your first AI pipeline is not a product. It is a lesson. The lesson is: this is what it takes to operate an AI system. If you try to learn that lesson twice, in parallel, you will learn it zero times.

The parallelization instinct

Teams under pressure do the same thing: they try to run three AI initiatives at once. The logic sounds reasonable. “We have a customer support use case, a document processing use case, and an internal search use case. They share infrastructure. We can parallelize.”

They cannot.

The three use cases do not share infrastructure — not yet. They share the aspiration of infrastructure. The actual infrastructure — the eval frameworks, the monitoring dashboards, the cost tracking, the incident response playbooks, the deployment workflows — does not exist. It will be built during the first project. If you are running three projects, it will be built three times, by three sub-teams, in three incompatible ways.

We have watched this happen at two companies in the last year. Both had competent engineering teams. Both launched three AI workstreams simultaneously. Both ended up with three half-built systems, three incomplete eval suites, and zero production deployments after six months.

What the first pipeline teaches you

Your first pipeline to production teaches you things you cannot learn from a blog post, a conference talk, or a vendor demo. You learn them by shipping.

How to build evals for your domain. Not evals in the abstract — evals that measure the thing your users care about, using data that reflects your actual distribution. This takes iteration. Your first eval set will be wrong. You will measure the wrong thing, or measure the right thing with the wrong metric, or measure with the right metric on the wrong data. It takes two or three rounds before you have an eval suite you trust.

How to monitor an AI system. Not just uptime and latency — the metrics that matter for a non-deterministic system. Output quality scores. Hallucination rates. Retrieval recall. Token costs per query. User satisfaction signals. You will not know which of these matter most until you are watching them in production. Different use cases have different critical metrics, but the monitoring infrastructure is reusable.

How to handle model updates. The base model changes. The API changes. The pricing changes. Your system breaks in a way your test suite did not cover, because the model’s behavior shifted subtly. The first time this happens, it is a crisis. The second time, it is a process. You need the first time to build the process.

How to manage costs. AI system costs are not like traditional compute costs. They are per-token, per-request, and they scale with usage in ways that are hard to predict before you have real traffic. Your first pipeline teaches you how to forecast, how to set budgets, how to optimize — cache layers, prompt compression, model routing. These learnings transfer directly to every subsequent pipeline.

How to respond to incidents. Your AI system will produce a bad output that reaches a user. What happens next? Who gets paged? How do you diagnose the root cause — was it the retrieval, the prompt, the model, or the data? How do you roll back? How do you communicate to the user? These playbooks take time to write and they only get written in response to real incidents.

The compounding effect

Every one of these learnings — evals, monitoring, model updates, cost management, incident response — compounds. The second pipeline benefits from all of them. The eval framework is reusable. The monitoring dashboards need one new panel, not a new dashboard. The cost management patterns transfer. The incident playbook gets a new section, not a new playbook.

A team that builds one pipeline to production and then starts the second pipeline ships the second one in half the time. We have seen this consistently. The first pipeline takes 3-4 months. The second takes 6-8 weeks. Not because the second is simpler — because the team knows what they are doing.

A team that builds two pipelines in parallel ships neither in 3-4 months. They ship both in 6+, if at all, because they are learning every lesson twice and building every piece of infrastructure twice.

The objection

“But we are under pressure to show results across multiple use cases.”

Yes. And the fastest way to show results across multiple use cases is to ship one use case quickly and well, then use the infrastructure and learnings to ship the next two in rapid succession.

Shipping one thing in 3 months and two more things in the following 2 months — that is 3 things shipped in 5 months. Shipping three things in parallel and landing all three at month 6 — if you land them at all — that is 3 things in 6 months. The sequential approach is faster. It is also less risky, because each subsequent pipeline benefits from the lessons of the ones before it.

The uncomfortable truth: parallelizing AI initiatives is not a strategy for moving fast. It is a strategy for looking busy.

How to pick the first one

The first pipeline should not be the most important use case. It should be the one that teaches you the most while carrying the least risk.

Pick the use case that is:

Internal-facing, so failures are embarrassing, not catastrophic.
Measurable, so you can build evals that actually tell you something.
Small enough to ship in 6-8 weeks with a small team.
Representative enough that the infrastructure you build will transfer.

Internal document search is often a good first pipeline. It hits retrieval, generation, evaluation, monitoring, and cost management. It has real users who give real feedback. And if it hallucinates, nobody sues you.

The heuristic

One pipeline to production. Then scale the learnings. The first pipeline is the tuition — you are paying to learn how your organization operates AI systems. Do not pay tuition twice.

tl;dr

The pattern. Teams under pressure launch three AI workstreams simultaneously and end up with three half-built systems, three incompatible eval frameworks, and zero production deployments after six months. The fix. Ship one internal-facing, measurable pipeline to production first, and use the evals, monitoring, cost patterns, and incident playbooks you build there as the foundation for every pipeline that follows. The outcome. The second pipeline ships in half the time because the hard operational lessons were paid for once, not learned in parallel by three sub-teams.

Your test suite passed. Your system is still broken.

Fri, 23 May 2025 00:00:00 +0000

A passing test suite for an AI system tells you one thing: the known scenarios still work. It tells you nothing about the unknown ones. And with AI systems, the unknown scenarios are where the failures live.

The green checkmark problem

Traditional software has a useful property: it is deterministic. Given the same input, it produces the same output. Your test suite verifies this contract. If the tests pass, the contract holds. You can deploy with confidence.

AI systems do not have this property. The same input can produce different outputs. The model’s behavior changes with temperature, with context window contents, with the phase of the moon — or more precisely, with the random seed, the batching order, and whatever the provider changed in their last silent update.

Your test suite checks the scenarios you thought of. It passes. You deploy. Then a user sends a query you did not think of — phrased in a way your test cases do not cover, referencing a topic your eval set does not include — and the system fails. Not with an error. With a confident, plausible, wrong answer.

This is worse than a crash. A crash is visible. A wrong answer is invisible until someone notices.

Why traditional testing is not enough

A traditional test suite for a deterministic system is a contract verification tool. You define the expected behavior, you assert against it, you move on. The surface area is bounded — you can enumerate the states, or at least the important ones.

An AI system’s surface area is unbounded. The input space is natural language — every possible sentence, in every possible context, with every possible intent. You cannot enumerate it. You can sample it, but your samples are biased by your own imagination.

The tests you write reflect the scenarios you can think of. The failures that hurt you are the ones you cannot. This is not a testing problem. It is an epistemological problem. And the solution is not “write more tests” — it is “supplement your tests with mechanisms that find the scenarios you missed.”

The three supplements

We recommend three additions to every AI system’s test suite. None of them are optional.

1. Fuzz testing.

Send random, malformed, adversarial, and unexpected inputs to your system. Not just once — continuously, as part of CI.

This is not novel. Fuzz testing has been standard practice in security engineering for decades. The surprise is how few AI teams do it. They test with well-formed queries from their eval set and call it done.

A basic fuzz test for an AI system:

Random strings. Unicode. Empty inputs. Inputs that are 100,000 characters long.
Inputs in languages your system does not support. Inputs that mix languages mid-sentence.
Inputs that are technically valid but semantically nonsensical.
Inputs that contain prompt injection attempts — “Ignore previous instructions and…”
Inputs that reference your system prompt, your company name, your competitors.

You are not looking for correct answers. You are looking for catastrophic failures — crashes, infinite loops, data leaks, offensive outputs, or responses that reveal system internals. Set your assertions accordingly: the system should not crash, should not leak the system prompt, should not produce output longer than X tokens, should respond within Y seconds.

Run this nightly. Keep the seeds that trigger failures. Add them to your regression set.

2. A regression log.

Every production failure becomes a test case. No exceptions.

When a user reports a bad output, when your monitoring catches a hallucination, when a support ticket mentions the AI giving wrong information — that input-output pair goes into a regression log. The log becomes a test suite. Run it on every deployment.

This sounds obvious. In practice, most teams do not do it. The failure gets fixed — the prompt gets tweaked, the context gets adjusted — but the test case does not get written. Three months later, a different change reintroduces the same failure. Nobody connects the dots.

The regression log is your institutional memory for AI failures. It grows over time. It gets more valuable as it grows. After six months, you have a test suite that reflects the actual failure modes of your system, not the hypothetical ones you imagined at design time.

The mechanics are simple. A shared document or a database table with three columns: input, bad output, expected output. A script that runs every entry against the current system and flags regressions. Integrate it into CI.

3. Periodic red-teaming.

Once a month, someone on your team — or better, someone not on your team — spends a focused session trying to break the system. Not automated testing. Human adversarial testing.

The red-teamer’s job is to find failures that neither the fuzz tests nor the regression log would catch. They bring creativity, domain knowledge, and malicious intent — the combination that produces the most interesting failures.

What a red-team session looks like:

2 hours, focused. One or two people. A shared doc for findings.
Try to make the system contradict itself. Ask the same question two different ways and see if you get conflicting answers.
Try to make the system exceed its authority. Ask it to do things it should not be able to do.
Try to make the system leak information. Ask it about other users, internal processes, system configurations.
Try to make the system produce harmful output. This is uncomfortable but necessary.
Try edge cases specific to your domain. If you are in healthcare, try drug interactions. If you are in finance, try market manipulation scenarios.

Every finding goes into the regression log. The red-team session feeds the automated tests. Over time, the automated tests get better because they are shaped by human adversarial thinking.

The integration

These three supplements are not separate from your test suite — they feed into it. The fuzz tests find crash-level failures, which become regression tests. The regression log captures production failures, which become permanent test cases. The red-team sessions find creative failures, which become both regression tests and new fuzz test patterns.

Your test suite grows in the direction of your actual failure modes, not your imagined ones. After six months, you have a test suite that would have caught 80% of the failures you encountered — because it was built from those failures.

The cost

A fuzz test takes a day to set up and runs on a schedule. The regression log is a process change, not a technical one. The red-team session is 2 hours per month. Total investment: maybe 2 engineer-days per month.

Compare this to the cost of a production failure in an AI system — a hallucinated medical dosage, a leaked customer record, a confidently wrong financial calculation. The math is not close.

The heuristic

A green test suite means your known scenarios work. It says nothing about the unknown ones. Supplement it with three things: fuzz tests for crash-level failures, a regression log for production failures, and a monthly red-team session for creative failures. The test suite you ship with is not the test suite that will protect you. The one that protects you is the one shaped by real failures over time.

tl;dr

The pattern. AI teams ship with a passing test suite that only covers scenarios they imagined, while production failures arrive as confident wrong answers from inputs nobody thought to test. The fix. Supplement your test suite with nightly fuzz tests for crash-level failures, a regression log that converts every production incident into a permanent test case, and a monthly human red-team session. The outcome. After six months, your test suite reflects your system’s actual failure modes rather than your assumptions about them, and you catch regressions before users do.

Fine-tuning is maintenance, not a one-time cost

Fri, 09 May 2025 00:00:00 +0000

The fine-tuning run is the easy part. You curate a dataset, configure a training job, wait for it to finish, deploy the model. A senior engineer can do this in a day. The hard part is everything that comes after — and “after” is where most teams get stuck.

The one-and-done fallacy

Teams treat fine-tuning like a deployment. Train it, ship it, move on to the next thing. This works for about 90 days. Then one of the following happens:

The base model gets a major update. Your fine-tune was built on GPT-4o-2024-08-06. The provider ships a new version. Your fine-tune is now pinned to the old model. You can keep using it, but you are missing out on improvements — and eventually the old version gets deprecated.

Your training data goes stale. The product changed, the terminology shifted, new features were added, old workflows were removed. The fine-tuned model confidently describes a UI that no longer exists. Users notice.

Distribution shift happens. The queries your users send in month 6 look different from the queries they sent in month 1. The model was trained on month 1 queries. Its performance degrades gradually — not catastrophically, just enough that users start saying “it used to be better.”

These are not edge cases. They are the normal lifecycle of a fine-tuned model. If you are not planning for them, you are planning to be surprised.

What maintenance actually looks like

A production fine-tuned model needs five operational components. Not all at once — you can build them incrementally — but eventually you need all five.

1. A training data pipeline. Not a one-time CSV export. A pipeline that continuously collects, cleans, and formats new training examples. The best source is usually production traffic — real user queries paired with good responses, reviewed by a human. This is boring work. It is also the most important work, because the quality of your training data is the ceiling on your model’s performance.

The pipeline does not need to be fancy. A script that pulls flagged interactions from your production logs, formats them into the training schema, and appends them to a versioned dataset is enough. Run it weekly. Review the output manually. Remove the garbage.

2. A versioned dataset. Every training run should be traceable to a specific version of the training data. When your model starts producing bad outputs — and it will — you need to diff the training data between the last good version and the current one. Without versioning, debugging is guesswork.

Git works for small datasets. DVC works for large ones. The tool matters less than the discipline.

3. An evaluation suite. A set of test cases that measure the model’s performance on the things you care about. Not perplexity — task-specific metrics. If your model classifies support tickets, measure classification accuracy on a held-out set. If it generates code, measure pass rates on a curated set of problems. If it writes customer emails, have a human score a sample every week.

The eval suite is your early warning system. Run it after every training run, and run it on a schedule against your production model even when you have not retrained. If the scores drop, something changed — the data, the queries, or the base model.

4. Retraining triggers. When do you retrain? Two approaches, and you should use both.

Time-based: retrain on a fixed schedule. Monthly is a reasonable starting point for most use cases. Quarterly if the domain is stable. Weekly if the domain changes fast — financial data, news, trending topics.

Metric-based: retrain when your eval scores drop below a threshold. This requires the eval suite from step 3. Set an alert. When accuracy drops below 85% — or whatever your threshold is — trigger a retraining run.

The time-based trigger catches gradual drift. The metric-based trigger catches sudden degradation. You need both.

5. A deployment workflow. You trained a new version. How do you ship it? Not by swapping the model endpoint and hoping for the best.

The minimum viable deployment workflow: train the new model, run the eval suite against it, compare scores to the production model, deploy to a shadow environment (same traffic, no user-facing output), compare shadow outputs to production outputs, promote to production if the metrics are better.

If you want to be more rigorous — and you should, if the model serves paying customers — add an A/B test. Route 10% of traffic to the new model for a week. Measure user satisfaction, error rates, and task completion. Promote or roll back based on the data.

The decision framework

Here is the question we ask teams before they start a fine-tuning project: “Can you commit to maintaining this model for 12 months?”

Not the initial training run. The pipeline, the evals, the retraining schedule, the deployment workflow. All of it. For a year.

If the answer is yes, fine-tuning is probably the right call. You will get better performance than prompt engineering, and you will be able to maintain that performance over time.

If the answer is no — and it is often no, because the team is small, or the use case is not important enough to justify the operational overhead — stick with prompt engineering. A well-crafted prompt with few-shot examples gets you 80% of the performance of a fine-tuned model with 20% of the operational burden.

There is no shame in prompt engineering. There is significant risk in fine-tuning without the infrastructure to maintain it.

The hidden cost

The thing nobody mentions in the fine-tuning tutorials: the operational cost of maintaining a fine-tuned model is typically 3-5x the cost of the initial training run, annualized. The training run is a GPU bill. The maintenance is a people bill — engineers reviewing training data, running evals, debugging regressions, managing deployments.

Budget for this upfront or do not fine-tune at all.

The heuristic

Fine-tuning is not a project. It is a commitment. If you cannot build and staff the five operational components — data pipeline, versioned dataset, eval suite, retraining triggers, deployment workflow — then prompt engineering is the better choice. The initial performance gap is smaller than you think. The maintenance gap is larger than you think.

tl;dr

The pattern. Teams treat a fine-tuning run as a one-time deployment, then get caught off guard when the base model updates, training data goes stale, and query distribution drifts — all within 90 days. The fix. Before starting a fine-tuning project, confirm you can staff and maintain the five operational components — data pipeline, versioned dataset, eval suite, retraining triggers, and deployment workflow — for at least 12 months. The outcome. Teams that plan for the full maintenance burden ship fine-tuned models that stay accurate over time; teams that don’t end up with degraded models nobody wants to own.

The fractional AI leader your board is asking about

Fri, 25 Apr 2025 00:00:00 +0000

Your board wants AI leadership. You are not ready for a full-time hire. This is not a contradiction — it is a phase. The mistake is treating it as a hiring problem when it is actually a scoping problem.

The board meeting that starts this

It usually happens in a Q3 or Q4 board meeting. Someone — usually the board member who just came from a conference — asks: “What is our AI strategy?” The CEO gives an honest answer, which is some version of “we’re exploring.” The board nods, takes a note, and the next morning the CEO calls the CTO and says: “We need to hire someone for AI.”

This is where things go sideways.

The full-time trap

The instinct is to hire a full-time AI leader. VP of AI, Head of ML, Chief AI Officer — the title varies, the problem is the same. You write a job description before you know what the job is.

Here is what happens next. You spend 3-4 months recruiting. You find someone credentialed and expensive — $400k+ total comp for someone senior enough to satisfy the board. They start. They spend their first 60 days doing a landscape assessment. They present a strategy deck. The strategy requires a team they do not have, a data infrastructure that does not exist, and a budget that was not approved.

Six months in, they have shipped nothing. Not because they are bad at their job — because the job was never scoped. The mandate was “do AI,” which is not a mandate. It is a wish.

Eight months in, they leave. The board asks what happened. The cycle restarts.

We have seen this pattern at four companies in the last 18 months. The details change. The arc does not.

What fractional actually means

A fractional AI leader is not a consultant who writes a deck and leaves. It is not a contractor who builds a thing and moves on. It is a senior operator who embeds with your team on a part-time basis — typically one to three days per week — for a sustained engagement.

The difference matters. A consultant optimizes for the deliverable. A fractional leader optimizes for the outcome, because they are around long enough to see it.

What this looks like in practice:

Month 1. Assess the landscape. Not a 60-page report — a clear-eyed look at what data you actually have, what your team can actually build, and what use cases would actually move a business metric. This takes two weeks if you are honest about it.

Month 2-3. Build the first thing. Not the biggest thing — the one that teaches your team how to operate an AI system. Stand up the eval framework. Set up cost monitoring. Ship to a small group of internal users. Learn what breaks.

Month 4-6. Scale the learnings. Now you know what the work actually looks like. You know whether you need an ML engineer or an infrastructure engineer. You know whether the bottleneck is data quality or model capability. You can write a real job description because you have done the job.

At the end of six months, you have one of three outcomes:

You know exactly who to hire full-time, because the fractional leader defined the role by doing it.
You realize you don’t need a dedicated AI leader — you need an AI-literate engineering team with fractional strategic support.
You convert the fractional leader to full-time, because the fit is proven.

All three are better outcomes than the cold hire.

The strategic cover problem

There is a less discussed benefit of fractional AI leadership: it gives you air cover with the board while you figure things out.

Boards want to know there is a senior person thinking about AI. They want a name on the org chart, a point of contact, someone who can present a coherent view of where the company is headed. A fractional leader provides this without the $400k commitment and without the organizational risk of a full-time hire you are not ready to support.

This is not cynical. It is practical. The board’s concern is legitimate — the company does need AI leadership. The question is whether the company is ready to absorb a full-time leader, and in most cases, the honest answer is “not yet.”

The anti-patterns

Hiring for the title. “Chief AI Officer” sounds impressive at a board meeting. It also creates expectations — internal and external — that a Series B company with 80 engineers cannot meet. A fractional engagement lets you get the strategic value without the title inflation.

Hiring before you scope. The most common failure mode. You hire a senior AI person, they assess the landscape, they discover you need 18 months of data infrastructure work before you can do anything interesting with AI. Now you have a $400k/year employee managing a data engineering project. This is not what they signed up for, and it is not what you are paying for.

Confusing strategy with execution. Some companies hire a senior AI strategist and expect them to also write the code. Some hire an ML engineer and expect them to present to the board. These are different skills. A fractional engagement lets you be honest about which one you actually need right now — and it is almost always execution first, strategy second.

What to look for

A good fractional AI leader has three qualities:

Pattern recognition across orgs. They have seen the movie before — at multiple companies, in multiple industries. They know which problems are unique to your business and which ones every company hits. This cross-pollination is the main advantage of fractional over full-time.

Willingness to build. Not just advise — actually build. Write code, set up pipelines, review PRs. If they only produce slide decks, they are a consultant, not a fractional leader.

A clear exit criteria. They should be able to articulate what “done” looks like — the point at which you no longer need them, or the point at which you need to convert the role to full-time. If they cannot describe their own obsolescence, they are optimizing for the engagement, not for you.

The heuristic

If your board is asking about AI and you don’t have an AI leader, don’t hire one. Engage one fractionally. Use the first six months to scope the role by doing the work. Then hire for the role you actually defined — or keep the fractional arrangement if it is working. The worst outcome is hiring a $400k leader for a job that does not exist yet.

tl;dr

The pattern. Boards demand AI leadership, companies write a job description before the job exists, and a $400k hire spends six months doing a landscape assessment before leaving because the mandate was never scoped. The fix. Engage a fractional AI leader for the first six months — someone who actually builds, not just advises — and use that time to define the role by doing the work. The outcome. You satisfy the board, ship the first AI initiative, and arrive at a full-time hire decision with a real job description instead of a wish.

When to kill an AI project

Fri, 11 Apr 2025 00:00:00 +0000

The hardest decision in AI is not what to build. It is what to stop building. Every team we work with has at least one project that should have been killed months ago. They know it. Their engineers know it. But nobody wants to say it out loud because saying it out loud means admitting the last six months were a write-off.

They were not a write-off. But the next six months will be if you keep going.

The problem with AI projects

AI projects fail differently than software projects. A software project that is going badly shows obvious symptoms — missed deadlines, broken builds, escalating bugs. An AI project that is going badly looks like progress. The model improves from 70% to 80%. Then from 80% to 83%. Then from 83% to 84.5%. The team is working hard. The demos are getting better. The charts go up and to the right.

But the charts are lying. The difference between 84.5% and the 95% you need for production is not 10 percentage points of effort. It is a fundamentally different problem — one that might require different data, different architecture, different people, or a different approach entirely. And nobody on the team wants to be the one to say that.

This is how AI projects become zombies. Not dead, not alive. Just consuming resources and producing demos.

The five kill signals

We have seen dozens of AI projects across different companies and domains. The ones that should have been killed — and eventually were — all showed at least two of these signals.

1. The accuracy plateau

You have been at 85% accuracy for three sprints. Each sprint, the team tries something new — more data, different preprocessing, a bigger model, a fancier training recipe. Each time, the needle moves a fraction of a point. Sometimes it moves backward.

This is the most common kill signal and the hardest to act on. The team is doing real work. The experiments are legitimate. But the results are asymptotic. You are converging on a ceiling that is below where you need to be.

The question to ask: is the remaining gap a problem of scale — more data, more compute — or is it a problem of kind? If adding 2x the training data moved you from 80% to 85%, you would need roughly 16x more data to get to 95%. Do you have 16x more data? Can you get it? At what cost? If the answer is no, the accuracy gap is telling you something about the problem itself, not about your execution.

2. The integration wall

The model works. In a notebook. With clean data. On your test set.

Now you need to connect it to your actual systems — the ERP that exports CSV over SFTP, the CRM with a rate-limited API that returns XML, the data warehouse that updates on a 6-hour lag. The model that worked beautifully in isolation needs to handle missing fields, stale data, encoding issues, and a schema that changes without notice.

This is the integration wall. The model was the easy part. The plumbing is the hard part. And the plumbing is not an AI problem — it is a systems engineering problem that was hidden by the excitement of the model working.

The question to ask: is the integration cost proportional to the value the model delivers? If you are spending 3 months integrating a model that saves 20 minutes of manual work per day, the math does not work. Kill it or simplify the integration to something you can ship in a week.

3. The champion left

Every successful AI project has an executive sponsor — someone who fights for budget, clears organizational blockers, and shields the team from “can you also make it do X” requests. When that person leaves, gets reassigned, or loses interest, the project enters a quiet death spiral.

Nobody cancels it. The team keeps working. But the air cover is gone. Other priorities start pulling team members away. The project slips from “strategic initiative” to “that thing the ML team is doing.” Within two quarters, it is a line item that nobody can justify but nobody wants to kill.

The question to ask: who is the new champion? If you cannot name a specific person — not a team, not a function, a person with a name and a title — the project is already dead. It just does not know it yet.

4. The use case shifted

You started building a model to predict customer churn. Halfway through, the business pivoted to a new pricing model that makes churn less relevant. Or you started building a document classification system, and then the company switched document platforms and the old categories no longer apply.

The use case shifted but the project did not. The team is still building the thing they scoped six months ago because that is what the roadmap says. Nobody updated the roadmap because nobody wants to admit the original scope is no longer the right scope.

The question to ask: if you were starting from scratch today — no sunk cost, no existing code, no commitments — would you build this exact thing? If the answer is no, you are building the wrong thing. Stop.

5. The cost math broke

The pilot worked. Ten users, curated data, generous latency budget. The results were good. Everyone was excited. Then you ran the production cost model.

The pilot cost $200/month. Scaling to 10,000 users would cost $200,000/month. The business case assumed $20,000/month. The gap is a full order of magnitude, and no amount of optimization will close it. You can cache aggressively, batch requests, use a smaller model — and you might cut costs by 3x. You are still 3x over budget.

The question to ask: does the unit economics work at production scale? Not at pilot scale. Not with “future cost reductions we expect from model providers.” Does the math work today, with today’s costs, at today’s scale? If not, you are betting on the market to make your business case viable. That is a venture capital strategy, not an engineering strategy.

Killing is a leadership act

Killing a project is not a failure. Continuing a project that should be dead — that is a failure. Every month you spend on a zombie project is a month you are not spending on the project that would actually work.

The best AI leaders we have worked with treat project kills the same way they treat launches. They do a retro. They document what was learned. They celebrate the team for the work, not the outcome. And they move fast — the longer you wait to kill a project, the harder it gets, because the sunk cost grows and the emotional investment deepens.

How to extract value from a killed project

A killed project is not wasted work if you are intentional about what you keep.

The data. The data you collected and cleaned is almost certainly useful for something else. Label it, document it, store it somewhere accessible. Future projects will thank you.

The evals. If you built an evaluation framework — and you should have — it transfers. The methodology, the tooling, the habit of measuring things rigorously. That is organizational muscle that survives the project.

The team’s skills. The engineers who worked on the project learned things. They learned what does not work, which is often more valuable than knowing what does. They built intuition about the problem space. That intuition goes with them to the next project.

The relationships. The stakeholders you worked with, the domain experts who labeled data, the ops team that helped with integration — those relationships are assets. Maintain them. You will need them again.

The worst thing you can do with a killed project is pretend it never happened. The second worst thing is let it continue.

tl;dr

The pattern. Teams keep AI projects alive long past their expiration date because killing feels like failure — so the project becomes a zombie that consumes resources and produces demos. The fix. Watch for the five kill signals — accuracy plateau, integration wall, champion departure, use case shift, broken cost math — and act when you see two of them. The outcome. You free up your best people and your limited budget for the project that will actually ship and compound.

Features your users didn't ask for and won't use

Fri, 21 Mar 2025 00:00:00 +0000

Here is a conversation we have had, in some variation, at least once a month for the past year.

Us: “Why are you building this?”

Them: “Users will love it.”

Us: “Which users?”

Them: “… Users in general.”

Us: “Has a specific user asked for this?”

Them: “Not exactly, but it’s an obvious improvement.”

The feature is always something AI-powered. An AI summarizer for a dashboard nobody reads. An AI-generated insight panel on a page with a 4-second average dwell time. A chatbot overlay on a product that already has a perfectly functional search bar.

The team is not dumb. They are excited. AI is genuinely powerful and the urge to apply it everywhere is understandable. But excitement is not a product strategy, and “we could” is not the same as “we should.”

The pattern

The pattern has a specific shape, and it repeats with remarkable consistency.

Step 1: Someone on the team — usually an engineer, sometimes a PM — sees a demo or reads a blog post about a new AI capability. They get excited. The excitement is genuine and well-founded. The capability is real.

Step 2: They map the capability to their product. “We have a lot of text data. We could summarize it.” “We have user questions. We could answer them automatically.” “We have reports. We could generate insights.” The mapping is logical. It makes sense on a whiteboard.

Step 3: They build it. Sometimes in a hackathon, sometimes as a side project, sometimes as an official initiative. The prototype is impressive. Demos go well. Leadership is excited.

Step 4: They ship it. Usage is low. Not zero — some users try it because it is new and shiny. But sustained usage is low. The feature does not become part of anyone’s workflow. It sits there, consuming compute, requiring maintenance, and slowly becoming the thing nobody wants to own.

Step 5: Six months later, someone asks whether the feature can be removed. The answer is always “not yet, because some users might be using it.” Nobody checks. The feature persists.

Why AI features are especially prone to this

Every product team ships features that do not land. That is not new. But AI features are uniquely susceptible to this pattern for a few reasons.

AI demos are disproportionately impressive. A summarizer that condenses a 10-page document into three bullet points looks magical in a demo. It looks less magical when the user already knows what is in the document because they wrote it. Demos show the capability in isolation. Users experience it in context.

AI features are expensive to build and maintain. A traditional feature that nobody uses wastes engineering time but is cheap to run. An AI feature that nobody uses wastes engineering time and burns compute on every invocation. LLM calls are not free. An unused AI feature has a recurring cost that a static UI element does not.

AI features create quality obligations. Once you ship an AI feature, you are on the hook for its output quality indefinitely. The model might degrade. The data might change. Edge cases will surface. Each one requires attention. You are not just maintaining code — you are maintaining behavior. That is harder and less predictable.

AI creates a false sense of user value. “We added AI to it” feels like a value proposition. It is not. “We solved a user problem” is a value proposition. The AI is an implementation detail. If you cannot articulate the user problem independent of the technology, you are selling the technology, not the solution.

The tell

The single most reliable indicator that an AI feature will fail: nobody on the team can name a specific user who asked for it, or point to a specific user behavior that suggests they need it.

This does not mean every feature needs to come from a user request. Sometimes you build things users did not know they wanted. But in those cases, you should be able to point to a behavior — something users are currently doing manually, inefficiently, or painfully — that the feature addresses. “Users spend 20 minutes reading these reports every morning” is a reason to build a summarizer. “We have reports” is not.

The naming test is simple. Before you commit to building an AI feature, sit down with the team and ask: “Who is this for? Name them. What are they doing today without this feature? What will they do differently with it?”

If the answers are abstract — “knowledge workers,” “data analysts,” “busy professionals” — the feature is speculative. It might work. But the odds are against it.

If the answers are specific — “Sarah on the compliance team, who spends three hours every Friday manually cross-referencing these two reports” — the feature has a fighting chance. You know who to test it with. You know what success looks like. You know how to measure adoption.

The cost of keeping it

The insidious part of AI feature creep is not the features that fail obviously. It is the features that half-succeed — the ones with just enough usage to justify not removing them, but not enough usage to justify the maintenance cost.

These features accumulate. Each one adds a small amount of ongoing work: monitoring model performance, updating prompts when the model version changes, handling edge cases that surface slowly, answering support tickets from confused users.

Individually, the cost is small. Collectively, it is a tax on the team’s velocity. We see teams where 30-40% of AI engineering time is spent maintaining features that serve a tiny fraction of users. The team is too busy maintaining yesterday’s experiments to build tomorrow’s products.

The fix is ruthless prioritization. Set a usage threshold for AI features — something concrete, like “if fewer than X% of eligible users engage with this feature weekly after 90 days, we remove it.” Apply this retroactively to existing features. Yes, some users will complain. That is fine. More users will benefit from the engineering time freed up.

Building for pull, not push

The best AI features we have seen share a common trait: they were built in response to an observed user need, not a technology capability.

A customer support team was drowning in ticket volume. They needed help triaging — not answering, just triaging. The AI feature they built did one thing: classify incoming tickets by urgency and route them to the right team. It was not flashy. It did not demo well. But it saved 6 hours of manual work per day and the team adopted it immediately.

Compare that to another team that built an AI-powered “insight engine” that generated natural-language summaries of business metrics. The demo was stunning. The product page was beautiful. Usage after launch: negligible. Users already had dashboards. They did not want a natural-language overlay on data they could already read.

The difference was not technical quality. The insight engine was well-built. The difference was demand. One feature was pulled by user need. The other was pushed by technology capability.

The heuristic

Before building any AI feature, name the specific user or user behavior it serves. If you cannot point to a real person or a real workflow, you are building for the technology, not the user. The feature might still work — but you are gambling, and the house edge on speculative AI features is steep.

Build for the pull. The push features are the ones you will be maintaining — and apologizing for — a year from now.

tl;dr

The pattern. Teams build AI features because the technology is exciting, not because a specific user asked for them or exhibits a behavior that signals the need. The fix. Before committing to any AI feature, name the specific person or workflow it serves — if you can only describe the user in abstract terms like “knowledge workers,” stop building. The outcome. Features built for named users with observable needs get adopted; features built for the technology get maintained indefinitely by a team that’s too busy to build the next thing.

Three questions before you greenlight an AI project

Fri, 07 Mar 2025 00:00:00 +0000

There is no shortage of AI project ideas. Every team has a backlog of things that “could be AI-powered.” The problem is not generating ideas. The problem is picking the ones that will actually ship, actually work, and actually matter — versus the ones that will consume six months of engineering time and produce a demo that nobody uses.

After watching dozens of AI projects succeed and fail, we’ve landed on three questions that predict the outcome. Not perfectly — nothing predicts perfectly — but reliably enough that we won’t greenlight a project until the team can answer all three.

If you can answer them, build. If you can’t, stop and figure out why before you spend the money.

Question 1: Can you measure the baseline today?

This is the question that kills the most projects — not because the answer is no, but because nobody bothers to ask.

The baseline is the current performance of the process you’re trying to improve. How long does it take? How much does it cost? What’s the error rate? How many units flow through per week? You need these numbers before you build anything, because without them you cannot prove that AI made it better.

This sounds obvious. It is not practiced.

Here’s what happens when teams skip it. They build an AI system. It works. Someone asks “how much did this improve things?” The team says “it’s faster.” The executive asks “how much faster?” The team says “significantly faster.” The executive asks “compared to what?” And nobody has an answer, because nobody measured the old process before they replaced it.

Now you’ve built something that might be brilliant and you can’t prove it. You can’t calculate ROI. You can’t justify the ongoing cost. You can’t make the case for expanding it to other use cases. You have an AI system running in production and a gut feeling that it’s helping.

Gut feelings don’t survive budget season.

The pattern when this goes wrong. A fintech company we spoke with built an AI system to review loan applications. It worked well — the team was confident it was faster and more consistent than the manual process. But they hadn’t measured the manual process before they built the AI. When the CFO asked for the ROI, they had to estimate the old baseline from memory. Their estimate was contested. The project lost its budget expansion because the numbers weren’t defensible. The AI worked. The business case didn’t.

What to do. Before you write a line of code, spend one week measuring the current process. Time it. Count the errors. Calculate the cost per unit. Document it. This is boring work. It is also the work that makes everything else possible. A week of measurement pays for itself a hundred times over when someone asks “was this worth it?”

If you genuinely cannot measure the current process — if there’s no way to quantify what you’re improving — that’s a signal. It means either the process is too informal to measure (fix that first) or the value of AI is too diffuse to capture (pick a different project). Either way, you’re not ready to build.

Question 2: Who owns this in production?

This is the question that determines whether your project ships or becomes a permanent prototype.

When we ask this question, the most common answer is “the AI team.” This is the wrong answer. The AI team builds AI systems. They do not — and should not — operate every AI system in the company. If the AI team owns every production AI system, you’ve created a bottleneck that doesn’t scale and a team that spends all its time on operations instead of building new things.

The right owner is the team that owns the process the AI is improving. If AI is classifying support tickets, the support operations team owns it. If AI is extracting data from invoices, the finance operations team owns it. The AI team builds it, trains the operations team, hands it over, and moves on.

This sounds simple. It is organizationally hard. The operations team didn’t ask for an AI system. They don’t understand how it works. They don’t know how to debug it when it fails. Handing them a model and saying “you own this now” is a recipe for failure.

The pattern when this goes wrong. A logistics company built an AI system to optimize route planning. The AI team built it, tuned it, and got great results. Then they tried to hand it off to the dispatch team. The dispatch team had no idea how the model made decisions. When the model suggested a route that seemed wrong, they had no way to evaluate whether the model was right and their intuition was wrong, or vice versa. They stopped using it within a month. The AI team had to take it back and operate it themselves — while also trying to build the next project. Within six months, the AI team was spending 60% of its time operating old projects and 40% building new ones. Velocity collapsed.

What to do. Name the production owner before the project starts. Involve them from day one — not as a stakeholder who gets updates, but as a team member who sees how the system is built, understands the failure modes, and helps define the monitoring and escalation procedures. Build the runbook together. When the system goes live, the production owner should feel like they helped build it — because they did.

If no team is willing to own the system in production, that’s a signal. It means either the system isn’t valuable enough for anyone to care about, or the organizational structure doesn’t support AI adoption. Both are worth knowing before you spend the engineering time.

Question 3: What happens when you turn it off?

This is the question that separates real value from theater.

Imagine the AI system has been running for three months. Now imagine you turn it off. What happens?

If the answer is “nothing” — nobody notices, no process breaks, no metric changes — the system wasn’t creating value. It was creating activity. Activity feels productive. It is not the same as value.

If the answer is “people notice within a day, and the team has to scramble to cover the gap” — that’s real value. The system is doing something that matters. It’s embedded in a workflow. People depend on it.

This question works as a filter at every stage. Before you build: “If we built this and then turned it off, would anyone care?” During the pilot: “If we stopped the pilot tomorrow, would the pilot users fight to get it back?” After launch: “If this went down for 24 hours, what’s the impact?”

The pattern when this goes wrong. A media company built an AI system to generate content summaries. The system worked. It produced summaries. The summaries were fine. But when we asked “what happens if you turn it off,” the answer was revealing: “The editors would just write the summaries themselves. They did it before and it took about 5 minutes each.” The AI system was saving 5 minutes per article on a task that produced 3 articles per day. That’s 15 minutes of daily savings. The system cost $2,400/month in compute and API fees, plus engineering time for maintenance. The math didn’t work. The system was technically successful and economically pointless.

What to do. Ask this question honestly before you start. Not “would it be nice to have this?” — everyone says yes to that. Ask “if we built it and then removed it, what would break?” If nothing breaks, the use case is too thin. If the answer is “we’d need to hire two people to cover the gap” — now you have a business case.

The turn-it-off test also reveals dependency risk. If the answer is “everything breaks and we have no fallback,” you need to build with more redundancy. The ideal answer is “things would get worse in a measurable way, and we have a manual fallback that’s painful but functional.” That’s a system creating real value with a manageable risk profile.

Using the three questions together

The three questions work as a filter. Run every proposed AI project through them.

Can you measure the baseline? If no, measure first or pick a different project.

Who owns this in production? If nobody, solve the ownership problem first or pick a different project.

What happens when you turn it off? If nothing, pick a different project.

Projects that pass all three tend to ship, tend to work, and tend to justify their cost. Not because the questions are magic — but because they force the team to answer the hard organizational questions before they start writing code.

Most AI projects don’t fail because the technology doesn’t work. They fail because nobody measured the baseline, nobody owned the production system, or the thing they built didn’t matter enough for anyone to miss it. These are all knowable in advance. You just have to ask.

tl;dr

The pattern. Teams greenlight AI projects without answering basic questions about measurement, ownership, and value — then wonder why the projects stall or get cut. The fix. Before committing resources, confirm you can measure the baseline, name a production owner, and articulate what breaks if the system is turned off. The outcome. You only build AI projects that can prove their impact, ship to production, and survive budget season.

Prompt versioning is not optional

Fri, 21 Feb 2025 00:00:00 +0000

Last month a client reported that their AI-powered support system had started giving worse answers. Not catastrophically worse — subtly worse. Longer responses, less specific, occasionally missing the point of the question.

We asked when it started. They were not sure. Sometime in the past two weeks, maybe. We asked what changed. They checked their deploy logs. No code changes. No model changes. No data pipeline changes.

After two hours of investigation, we found it. A developer had tweaked the system prompt — changed three sentences — as part of an unrelated PR. The change was buried in a string literal inside a Python file. It was not called out in the PR description. The reviewer did not notice it. There was no way to correlate the change with the behavior regression because there was no record of which prompt version was running at any given time.

This is the default state of prompt management at most organizations. It is not good.

The current reality

Most teams store prompts as string literals in application code. Sometimes they are in a constants file. Sometimes they are inline in a function. Sometimes they are split across multiple files and assembled at runtime. Occasionally they are in a database, editable via an admin panel, with no version history at all.

The common thread: there is no systematic way to know which prompt was running at a given time, no way to roll back to a previous version without a code deploy, and no way to correlate prompt changes with changes in system behavior.

This would be unacceptable for any other part of the system. You would not run a database migration without tracking which schema version is active. You would not deploy a config change without recording what changed and when. But prompts — which are arguably the most sensitive part of an AI system, the part that most directly controls behavior — get treated as informal text edits.

Why this matters operationally

Prompts are not documentation. They are not comments. They are runtime configuration that directly determines system behavior. A one-word change in a prompt can shift the model’s output distribution in ways that are difficult to predict and difficult to detect without proper monitoring.

When something goes wrong — and it will — you need to answer three questions:

What prompt was running when the bad output was generated?
What was the previous prompt, and when did it change?
Did the behavior change correlate with the prompt change, or is something else going on?

If you cannot answer question 1, you cannot debug the problem. You are guessing. You might fix it by accident. You might make it worse.

The minimum viable approach

You do not need a prompt management platform. You do not need a SaaS tool. You need three things you already have.

1. Prompts live in version-controlled files.

Move your prompts out of application code and into dedicated files. We use YAML, but the format does not matter. What matters is that each prompt is a discrete artifact with its own change history.

prompts/
support-system-prompt.yaml
summarization-prompt.yaml
classification-prompt.yaml

Each file contains the prompt text, a version identifier, and any metadata that is relevant — when it was last changed, who changed it, why.

When a developer wants to change a prompt, they change a file. That change shows up in a PR. It gets reviewed. It gets merged. It has a timestamp, an author, and a commit hash. This is not new technology. This is git.

2. Each deploy records the active prompt version.

Your deployment process should capture which prompt versions are active. This can be as simple as logging the git commit hash of the prompts directory at deploy time. Or including the prompt version identifiers in your application’s health check endpoint. Or writing them to a deploy manifest.

The goal is that when someone asks “which prompt was running at 3pm last Thursday,” you can answer in under a minute.

3. Your logs include the prompt identifier.

Every LLM call should log which prompt version was used. Not the full prompt text — that is wasteful and potentially a security concern. Just the version identifier. A hash, a semver string, a timestamp — anything that lets you join your request logs to your prompt history.

With these three pieces, you can do something that most teams cannot: correlate prompt changes with behavior changes. When accuracy drops, you check what prompt was running. When you roll out a new prompt, you compare metrics before and after. When a regression occurs, you roll back to the previous version and confirm the regression resolves.

What this enables

Once you have prompt versioning, a set of practices becomes possible that is impossible without it.

Prompt rollbacks. When a new prompt makes things worse, you roll back. This takes seconds if your prompts are in config files. It takes a full deploy cycle if they are in application code.

A/B testing. Run two prompt versions simultaneously, route traffic between them, and compare results. This is just feature flagging. Your existing feature flag system can do it — if the prompt version is a config value rather than a hardcoded string.

Prompt auditing. For regulated industries, you may need to demonstrate which prompt was active when a specific output was generated. This is trivially easy with proper versioning. It is nearly impossible without it.

Regression detection. If your evals run on every prompt change — the same way your unit tests run on every code change — you catch regressions before they ship. This requires the prompt change to be a discrete, observable event. String literal edits buried in code are not observable events.

The objection

“This is overengineering. We only change prompts occasionally.”

You change prompts more often than you think. Every time someone tweaks the system prompt “real quick” to fix an edge case, that is a prompt change. Every time someone adds a clarifying sentence because a user reported a bad answer, that is a prompt change. These changes are invisible precisely because the prompts are not tracked.

The teams that tell us they “rarely change prompts” are the same teams that cannot explain when their last prompt change was. They are not changing prompts rarely. They are changing prompts without noticing.

The heuristic

If you cannot tell me which prompt version was running in production at any arbitrary point in the past, you do not have prompt management — you have prompt chaos. The fix takes a day: move prompts to files, log the version on each call, record the active version at deploy time. You already have git and a logging system. Use them.

tl;dr

The pattern. Prompts get changed as buried string literals in unrelated PRs, so when behavior quietly degrades — longer responses, missed intent, subtle regressions — the team cannot trace the change that caused it or roll it back without a full deploy. The fix. Move prompts to dedicated version-controlled files, record the active prompt version in every LLM call’s logs, and capture which version deployed at each release. The outcome. Regressions become debuggable in minutes instead of hours, rollbacks take seconds, and A/B testing and prompt auditing become possible because prompt changes are finally observable events.

Your AI engineer is doing three jobs

Fri, 07 Feb 2025 00:00:00 +0000

We keep meeting the same person. Different company, different title, same situation. They were hired as an “AI engineer.” They are writing prompts, building data pipelines, deploying models, setting up evals, managing GPU infrastructure, and fielding Slack messages from product managers who want to know why the chatbot said something weird yesterday.

They are doing three jobs. They are good at one of them. They are adequate at another. The third one is held together with duct tape and optimism. They are tired.

The three jobs

The “AI engineer” title has become a catch-all. When you unpack what the role actually requires, it splits into at least three distinct skill sets:

Prompt engineering and evaluation. This is the application-layer work. Writing prompts, iterating on them, building eval suites, analyzing failure modes, tuning for specific use cases. It is close to product work. The best prompt engineers think like product managers — they obsess over user intent, edge cases, and the gap between what the user asked and what the model understood.

Data engineering. This is the pipeline work. Ingesting documents, chunking them, building embeddings, maintaining vector stores, keeping data fresh, handling deduplication, managing metadata. It is unglamorous and critically important. Bad data pipelines produce bad retrieval, and bad retrieval produces bad answers — regardless of how good your prompt is.

ML infrastructure and operations. This is the deployment work. Serving models, managing GPU instances, optimizing latency, monitoring for quality regressions, handling failover, managing model versions. It is classic ops work adapted for a new stack. The skills transfer from traditional DevOps, but the specifics — quantization, batching strategies, KV cache management — are domain-specific.

These three jobs require different backgrounds, different tools, and different ways of thinking. Prompt engineering is iterative and experimental. Data engineering is methodical and plumbing-intensive. ML ops is reliability-focused and systems-oriented.

Where the breakdown happens

In theory, one strong generalist can handle all three. In practice, every person has a strongest skill and a weakest skill. The weakest skill becomes the bottleneck for the entire system.

Here is the pattern we see most often:

The AI engineer was hired for their ML background. They are good at model selection, evaluation, and prompt engineering. They can build a solid eval suite and iterate on prompts effectively. They are adequate at deployment — they can get a model running in production, even if the infrastructure is not optimally configured.

But the data engineering is where things fall apart. The ingestion pipeline is a series of scripts that run on someone’s laptop. The chunking strategy was chosen once and never revisited. There is no monitoring on data freshness. When the source documents change format, the pipeline breaks silently and nobody notices until a user reports bad answers three weeks later.

The team looks at the bad answers and assumes it is a model problem. They spend two weeks tuning prompts. The answers do not improve, because the problem is not the prompt — it is the data. But the data pipeline is the thing nobody is paying attention to, because the person responsible for it is also responsible for everything else and does not have time to instrument it properly.

We have seen this exact failure mode at least a dozen times in the last year. The details vary. The pattern does not.

The second most common failure

The other common version: the AI engineer is strong on data and prompts but weak on ops. The system works beautifully in development. The eval numbers are great. The demos are impressive.

Then it hits production traffic. Latency spikes. The autoscaling does not work because nobody configured it properly. The monitoring is basic — just uptime checks, no quality metrics. When the model starts producing worse outputs because a dependency changed, nobody notices for days.

The AI engineer knows the system needs better ops. They just do not have time to build it, because they are also maintaining the data pipeline and iterating on prompts for the next feature.

Why this happens

The root cause is organizational. Most companies hired their first AI engineer 12-18 months ago. That person was expected to build the whole stack — prototype to production. For a prototype, one person is fine. Prototypes do not need robust data pipelines or production-grade ops.

But prototypes become products. The scope grows. The traffic grows. The stakeholders multiply. And the team does not grow with it. The single AI engineer who built the prototype is now operating a production system alone, and the org has not noticed that the role has outgrown one person.

Part of the problem is that the hiring market uses “AI engineer” as a single role. Job postings list requirements that span all three skill sets — as if finding someone who is equally strong at prompt engineering, data engineering, and ML ops is a reasonable expectation. It is like posting a job for someone who is equally strong at frontend, backend, and infrastructure. That person exists, but they are rare, expensive, and probably already running their own company.

The fix

You have two options. Both are legitimate. Pick the one that matches your stage and budget.

Option 1: Split the role. If you can hire, the highest-leverage split is separating the application layer (prompts, evals, product integration) from the infrastructure layer (data pipelines, model serving, monitoring). These two halves have different cadences — the application layer changes daily, the infrastructure layer should change infrequently but must be reliable when it does.

You do not necessarily need to hire a third person immediately. A strong data engineer from your existing team can often take on the data pipeline work if given context. A strong DevOps engineer can take on model serving if given some upskilling. The ML-specific knowledge is thinner than people think — the operational patterns are familiar.

Option 2: Accept the tradeoff explicitly. If you cannot hire, decide which of the three areas will be weak and manage accordingly. This is not admitting defeat. This is being honest about constraints.

If ops will be weak, invest in managed services that reduce the ops burden. Use hosted model APIs instead of self-hosting. Use managed vector databases instead of running your own. Trade cost for reduced operational complexity.

If data engineering will be weak, invest in monitoring that catches data quality issues early. Instrument your pipeline with freshness checks, schema validation, and output sampling. You cannot fix what you cannot see.

If prompt engineering will be weak, invest in a strong eval suite and iterate more slowly. Fewer prompt changes, more thoroughly tested. Ship less frequently but with higher confidence.

The worst option is not choosing — letting all three areas be mediocre without a conscious decision about which one matters least.

The conversation to have

If you are a leader with an AI engineer on your team, ask them this: “Which of these three areas — prompts and evals, data pipelines, or deployment and ops — do you spend the least time on, and does that worry you?”

Their answer will tell you where your risk is. If they say “data pipelines” and they have a RAG system, you have a problem. If they say “ops” and you are running in production, you have a problem. If they say “prompts” and accuracy is slipping, you have a problem.

The heuristic: if one person is responsible for prompts, data, and infrastructure, identify which of the three is their weakest skill. That is where your next production incident will come from. Either shore it up with a hire, reduce the scope with managed services, or add monitoring so you see the failure before your users do.

tl;dr

The pattern. The single “AI engineer” is simultaneously responsible for prompt engineering, data pipelines, and ML ops — three different skill sets with different cadences — so the weakest one silently degrades until a production incident makes it visible. The fix. Either split the role at the application-versus-infrastructure boundary or explicitly decide which area will be weak and compensate with managed services, monitoring, or slower iteration cycles. The outcome. The team stops diagnosing prompt problems that are actually data problems, and the next production incident gets caught by instrumentation instead of a user complaint.

The 90% accuracy problem

Fri, 24 Jan 2025 00:00:00 +0000

“We’re at 90% accuracy.” We hear this in almost every initial call. It is delivered like good news. Sometimes it is. Usually, we do not have enough information to know — and neither does the team saying it.

90% accuracy means 1 in 10 answers is wrong. Whether that is a rounding error or a crisis depends entirely on one question that almost nobody asks: what happens when the wrong answer ships?

The missing variable

Accuracy is not a quality metric. It is half of a quality metric. The other half is the cost of being wrong.

Consider two systems, both at 90% accuracy:

System A recommends blog posts to readers. When it gets it wrong, the reader sees an irrelevant article. They scroll past it. Nobody notices. Nobody cares. 90% accuracy is fine. 80% might be fine too.

System B answers patient questions about drug interactions. When it gets it wrong, a patient might take two medications that should not be combined. 90% accuracy means roughly 1 in 10 patients gets bad information. That is not a product issue. That is a liability issue.

Same accuracy number. Entirely different risk profiles. The number alone tells you nothing.

Why teams get stuck on a single number

There is a natural pull toward a single accuracy metric. It is easy to track. It goes on a dashboard. You can set a target and measure progress. Product managers love it. Executives love it more.

The problem is that a single number averages across all your failure modes. It treats a harmless miss the same as a dangerous one. It hides the distribution of errors behind a mean.

We audited a customer support system last year. Overall accuracy was 92%. Very respectable. But when we broke it down by category, the picture changed. For simple FAQ questions — “what are your hours,” “how do I reset my password” — accuracy was 98%. For billing disputes — “why was I charged twice,” “I want a refund” — accuracy was 71%.

The 92% number was masking the fact that the hardest, highest-stakes questions were the ones the system handled worst. Which makes sense — hard questions are hard. But the team was not tracking this. They saw 92% and moved on to other priorities.

The failure mode framework

Here is how we think about it. Classify every failure mode into one of four categories:

Harmless. The user notices the error but it has no consequence. A recommendation engine suggesting a mediocre article. A search system returning a slightly suboptimal result. The user shrugs and moves on.

Embarrassing. The error is visible and reflects poorly on the product, but causes no material harm. A chatbot giving a confidently wrong answer about your company’s founding date. A summarizer producing an awkwardly worded sentence. Trust erodes slowly.

Costly. The error has a direct financial or operational consequence. A pricing system that miscalculates a quote. A routing system that sends a high-value ticket to the wrong team. An extraction pipeline that pulls the wrong dollar amount from a contract.

Dangerous. The error creates legal, safety, or regulatory risk. Medical advice. Legal interpretation. Financial compliance. Anything where being wrong can hurt someone.

Once you have this classification, set accuracy thresholds per category — not for the system as a whole. 85% accuracy on harmless failures might be perfectly fine. 85% accuracy on dangerous failures is almost certainly not.

The threshold conversation

This is where it gets uncomfortable. Setting per-category thresholds forces you to have conversations that most teams would rather avoid.

“What accuracy do we need on billing questions before we let the AI handle them without human review?” That is a real question with real consequences. It requires input from legal, from customer success, from finance. It cannot be answered by the ML team alone.

Most teams skip this conversation. They ship with a single accuracy number and a vague sense that it is “good enough.” Then an edge case blows up, and they scramble.

The teams that do this well have a simple artifact — a table. Rows are failure categories. Columns are: current accuracy, target accuracy, what happens when it is wrong, and who approved the threshold. It is not a sophisticated document. It fits on one page. But it forces the conversation that needs to happen before you ship.

Measuring per-category accuracy

This requires labeled data that is tagged by category. Which means your eval set needs to be stratified, not just large.

A common mistake: teams build a 500-example eval set, sample randomly, and measure aggregate accuracy. The result is a number that over-represents common, easy cases and under-represents rare, hard ones. You end up with high accuracy on the things that did not need an AI system in the first place.

A better approach: build your eval set category by category. Ensure you have at least 50 examples per failure category — more for the dangerous ones. Measure accuracy within each category independently. Report the per-category numbers alongside the aggregate.

Yes, this is more work. It is dramatically less work than dealing with a production incident in a high-stakes category you were not measuring.

The human review escape hatch

For the dangerous categories, the answer is often not “improve accuracy.” The answer is “do not let the AI answer without human review.”

This is not a failure of the AI system. This is a design decision. A well-designed system knows its own limitations and routes accordingly. The AI handles the harmless and embarrassing categories autonomously. The costly and dangerous categories get flagged for human review.

The accuracy requirement for the routing itself is high — you need the system to correctly identify which category a query falls into. But that is a classification problem, and classification problems are much more tractable than open-ended generation problems.

We have seen teams spend months trying to improve accuracy on their hardest category from 85% to 95%. They would have been better served spending a week building a routing layer that sends those queries to a human. The accuracy improvement was not realistic on their timeline. The routing layer was.

The number you actually need

Here is the uncomfortable truth: there is no universally “good” accuracy number. There is only the number that is appropriate for your specific failure modes, your specific users, and your specific risk tolerance.

The heuristic: before you report an accuracy number, you should be able to answer “what happens when this is wrong?” for every category of error. If you cannot answer that question, the accuracy number is meaningless — you do not know what you are measuring against.

A system at 85% accuracy with well-understood, well-classified failure modes and appropriate human escalation paths is safer than a system at 95% accuracy where nobody has thought about what happens in the remaining 5%.

Measure the cost of being wrong. Then decide how often you can afford it. Then set the threshold. That is the order.

tl;dr

The pattern. Teams report a single aggregate accuracy number that masks the distribution of failures, so a 92% overall score hides 71% accuracy on the billing questions that matter most. The fix. Classify every failure mode as harmless, embarrassing, costly, or dangerous, build a stratified eval set with at least 50 examples per category, and set separate accuracy thresholds — not a single aggregate — for each one. The outcome. The dangerous categories get routed to humans before they cause liability events, and the team stops spending months chasing accuracy improvements that the architecture, not the model, needs to solve.

Open-weights models don't eliminate vendor risk

Fri, 10 Jan 2025 00:00:00 +0000

January 2025. DeepSeek R1 drops. It is very good. It is open-weights. The discourse immediately shifts to: “We can finally stop depending on OpenAI.”

We heard this from three clients in the same week. The reasoning was always the same — if we self-host, we control our destiny. No more API price changes. No more rate limits. No more worrying about a provider deprecating the model we built on.

The reasoning is correct about what it eliminates. It is wrong about what it introduces.

What vendor risk actually means

Vendor risk is not “I pay someone money and they might raise the price.” That is price risk. Vendor risk is broader — it is the set of things you depend on that you do not control.

When you use GPT-4 via API, your vendor risks include: pricing changes, rate limits, model deprecation, data handling policies, uptime, and the model’s behavior changing between versions.

When you self-host an open-weights model, you eliminate the first set. But you pick up a new one.

The new risk surface

Training data risk. You did not curate the training data. You do not know what is in it. You do not know what biases it carries. You do not know if it was trained on data that creates legal exposure for your use case. The model card might tell you some of this. It will not tell you all of it.

Architecture risk. The model’s architecture was chosen by someone else, for their objectives. Those objectives may not match yours. A model optimized for reasoning benchmarks may not be the right choice for your customer-facing summarization pipeline. You cannot change the architecture without retraining — which you almost certainly are not going to do.

Update cadence risk. Open-weights models do not have an SLA for improvements. The creator might release an update next month, next year, or never. If a critical capability gap emerges, you are on your own. With an API provider, you can at least file a support ticket. With an open model, you file a GitHub issue and hope.

Infra risk. This is the big one. Self-hosting a model means your team is now responsible for GPU procurement, serving infrastructure, autoscaling, latency optimization, and uptime. Your team was not doing this before. They may not be good at it yet. The model is free. The compute is not. The ops burden is definitely not.

Talent risk. You now need people who understand model serving, quantization tradeoffs, inference optimization, and GPU cluster management. These people are expensive and hard to find. If your one ML infra person leaves, you have a production system that nobody knows how to operate.

The pattern we see

Here is what typically happens. A team decides to self-host. They get the model running on a single GPU instance. It works. They deploy it. The first week is fine.

Then traffic grows. Latency increases. They need to scale. Scaling means load balancing inference requests across multiple GPU instances. That means building or adopting a serving framework — vLLM, TGI, Triton. Each has its own operational complexity.

Then someone asks about failover. Then someone asks about model versioning. Then someone asks about A/B testing between model versions. Then someone asks about monitoring for quality regressions.

Six months in, they have built a small platform team around model serving. The platform team is three people. They are doing important work. But the original goal was to reduce dependency, and now the team depends on an internal platform that three people built and only three people understand.

They traded vendor risk for internal platform risk. Which is, in some ways, worse — because when your vendor has an outage, you can blame the vendor. When your internal platform has an outage, you just have an outage.

The false binary

The mistake is framing this as “API vs. self-host.” That is the wrong axis. The right axis is “how expensive is it to switch?”

If switching models costs you six months of re-engineering, you have vendor risk — regardless of whether the model is open or closed, hosted or self-hosted. If switching models costs you a config change and a round of evals, you have almost no vendor risk.

The risk is not in the model. The risk is in the coupling.

What actually reduces vendor risk

Abstraction layers. Your application code should not know which model it is calling. It should call a function that returns a response. Which model serves that response is a deployment decision, not an application decision. This is not a new idea. It is the same reason you put a load balancer in front of your web servers.

Eval-driven model selection. Build your eval suite first. Run every candidate model through it. Pick the one that scores best on your specific tasks. When a new model drops — open or closed — run it through the same evals. If it wins, switch. If it doesn’t, don’t.

This only works if your evals are good and your switching cost is low. Which means the abstraction layer is not optional — it is the thing that makes eval-driven selection practical.

Multi-model architectures. Use different models for different tasks. Your summarization pipeline might run on a self-hosted open model because latency and cost matter more than peak quality. Your complex reasoning tasks might run on a frontier API model because quality matters more than cost. Different risk profiles for different tasks.

Prompt portability. Write prompts that are not tightly coupled to a specific model’s behavior. This is harder than it sounds — every model has quirks, and tuning prompts to exploit those quirks makes you more dependent, not less. The prompts that port well are the ones that are clear, structured, and rely on general capabilities rather than model-specific tricks.

The honest tradeoff

Self-hosting open-weights models is a legitimate strategy. It gives you control over data residency, inference costs at scale, and availability. These are real benefits.

But it does not eliminate vendor risk. It transforms it. You trade dependency on an API provider for dependency on a model creator’s training decisions, an open-source community’s update cadence, and your own team’s ability to operate GPU infrastructure.

For some organizations, that is the right trade. For others — particularly those without existing ML infrastructure teams — it creates more risk than it eliminates.

The heuristic: if you cannot name the three people who will operate your self-hosted model infrastructure, and what happens when one of them leaves, you are not ready to self-host. Start with APIs, invest in abstraction layers, and build the infra muscle before you take on the infra burden.

tl;dr

The pattern. Teams self-host open-weights models to escape API vendor risk and end up trading it for infra risk, training data risk, and internal platform risk — while building a three-person GPU ops team that only those three people understand. The fix. Invest in abstraction layers and low-switching-cost architecture so that whether you use an API or a self-hosted model is a deployment decision, not an application decision. The outcome. You can move between models based on eval results rather than lock-in, and the question of open versus closed weights becomes a cost-and-control tradeoff instead of an existential one.

Model migrations are database migrations

Fri, 20 Dec 2024 00:00:00 +0000

A team we were advising switched from GPT-4 to GPT-4o on a Friday afternoon. Changed the model string in their config, deployed, went home for the weekend. By Monday they had 40 support tickets. The outputs were different — slightly different phrasing, different formatting, different handling of edge cases. Their downstream parsing code broke on 15% of responses. Their eval scores dropped 8 points. Their latency improved, which was nice, but nobody noticed because they were too busy triaging the regressions.

This was not a negligent team. They were experienced engineers who understood that model changes have consequences. They just underestimated how many consequences, and they treated the change like a config update instead of a migration.

What changes when you change a model

A model is not a library with a stable API. It is a function with deterministic inputs and non-deterministic outputs. When you change the function, everything downstream of it changes too.

Outputs. This is the obvious one, and teams still underestimate it. Different models produce different text for the same prompt. The differences are often subtle — a word choice here, a formatting choice there. But if you have code that parses model outputs — extracting JSON, splitting on delimiters, matching patterns — subtle differences break things. A model that returns {"answer": "yes"} and a model that returns {"answer": "Yes"} are functionally different if your code does an exact string match.

Latency. Different models have different speed profiles. Switching from a larger model to a smaller one usually improves latency. Switching providers — say, from OpenAI to Anthropic — changes latency in unpredictable ways that depend on routing, server load, and context length. If you have SLAs or timeout settings tuned to your current model’s latency profile, a model change can violate them.

Cost. Pricing varies by model and by provider. A change that looks like a drop-in replacement might double your per-token cost, or halve it. If you are processing high volumes, this matters. If you have a budget that assumes a specific cost-per-query, a model change is a budget change.

Token limits and context windows. Models have different context windows. A prompt that fits in one model’s context might not fit in another’s. If your system dynamically constructs prompts — stuffing retrieved chunks into context — you need to verify that your prompts still fit. A prompt that silently gets truncated because it exceeds the context window produces wrong answers without raising an error.

Eval results. Your eval suite was built and calibrated against a specific model’s behavior. Your thresholds, your scoring rubrics, your golden set — all of these assume a particular output style. A new model might score differently on your eval even if the actual quality is equivalent. You need to re-baseline, not just re-run.

The database migration analogy

Software engineers learned decades ago that database schema changes are dangerous. A schema migration can break queries, corrupt data, and take down production. The industry developed a discipline for this: migration scripts, rollback plans, staged rollouts, shadow reads, canary deploys. Nobody changes a database schema on a Friday afternoon.

Model changes have the same risk profile. They change the shape of your system’s outputs. They can break downstream consumers. They require testing against production-like data. They need rollback plans.

The discipline should be the same.

The migration plan

Here is the process we recommend. It is not novel. It is the database migration playbook applied to models.

Step 1: Run the eval suite. Before deploying anything, run your full eval suite against the new model. Compare scores to your current model’s baseline. Look at the overall score, but also look at per-category breakdowns. A model might score the same overall but regress on a specific category that matters to your users.

Step 2: Compare outputs. Take a sample of 100–200 production queries. Run them through both models. Diff the outputs. Look for systematic differences — formatting changes, tone changes, refusal patterns, verbosity differences. This step often reveals issues that the eval suite misses because the eval is measuring accuracy and the issue is formatting.

Step 3: Check the plumbing. If you have code that parses model outputs — JSON extraction, regex matching, structured output parsing — test it against the new model’s outputs. This is where most migrations break. The model is fine. The parsing code is not.

Step 4: Shadow test. Deploy the new model alongside the old one in production. Send real traffic to both. Log the new model’s responses but serve the old model’s responses to users. Compare the outputs over a few days of real traffic. This catches issues that synthetic testing misses — unusual query patterns, edge cases in production data, load-dependent behavior.

Step 5: Canary deploy. Send 5–10% of production traffic to the new model. Monitor error rates, latency, user feedback, and downstream system health. If anything degrades, roll back. If everything looks stable after 24–48 hours, increase the percentage.

Step 6: Cut over. Move 100% of traffic to the new model. Keep the old model configuration available for immediate rollback. Monitor closely for a week.

This process takes 1–2 weeks for a straightforward model upgrade. It takes longer if the model change involves a provider switch or a significant capability difference. This is not slow — this is responsible.

The shortcuts that hurt

“The new model is just a minor version bump.” Minor version bumps can still change outputs. GPT-4-0613 and GPT-4-1106 are both “GPT-4” and they behave differently. Test every change. There is no safe shortcut.

“We’ll just watch the dashboards after deploying.” By the time the dashboards show a problem, your users have already seen it. Shadow testing and canary deploys exist specifically so your users don’t have to be your test suite.

“Our prompts are model-agnostic.” No they are not. Prompts are tuned — consciously or unconsciously — to the behavior of the model they were written for. A prompt that works well with Claude might not work well with GPT-4, and vice versa. Model-agnostic prompts are a useful aspiration and a dangerous assumption.

“We can always roll back.” Can you? How fast? Is the rollback automated? Have you tested it? A rollback plan that exists only as an idea in someone’s head is not a rollback plan. Script it. Test it. Time it.

The organizational discipline

Model migrations should have owners. Not the whole team — one person who is responsible for the migration plan, the eval comparison, the shadow test, and the cut-over decision. This person is the equivalent of the DBA who runs the schema migration. They do not need to be the most senior engineer. They need to be the most careful one.

Model changes should have calendar entries. Not “we’ll switch sometime next week.” A specific date, with a specific rollback window, and a specific person on-call for the first 48 hours. Same as a database migration.

Model changes should have runbooks. What to check. What thresholds to watch. When to roll back. Who to notify. This document takes an hour to write and saves a day of chaos when something goes wrong.

The heuristic

Treat every model change — version bump, provider switch, or capability upgrade — with the same rigor as a database schema migration. Eval, compare, shadow, canary, cut over. If you would not change your database schema on a Friday afternoon with no rollback plan, do not change your model that way either.

tl;dr

The pattern. Teams treat model changes as config updates and discover on Monday that the new model’s slightly different formatting broke their parsing code, shifted their eval scores, and generated 40 support tickets over the weekend. The fix. Run evals, diff production outputs, shadow test in production, canary deploy at 5–10%, then cut over — with a named owner, a calendar entry, and a tested rollback plan. The outcome. Model upgrades become routine, regressions get caught before users see them, and the team builds the confidence to migrate often instead of deferring until a model is deprecated.

Your annual AI review should fit on one page

Fri, 06 Dec 2024 00:00:00 +0000

It is December. Someone on your leadership team asks for a retrospective on the AI program. Your team produces a 30-page deck. There are timelines, architecture diagrams, model comparison tables, a section on “learnings,” and a roadmap that extends to Q3 next year. It takes two weeks to write. Nobody reads past slide 8.

The length is not a sign of thoroughness. It is a sign that the team cannot identify what actually matters.

The one-page format

A useful annual AI review has four sections. Each section gets a few lines. If you cannot fill a section, that tells you something. If you need more than a few lines, you are hiding behind detail.

What shipped. List 3–5 things that are in production, serving real users, today. Not “we explored.” Not “we prototyped.” Not “we have a proof of concept that leadership was excited about.” What shipped. If the list is shorter than 3, your AI program has not yet earned its budget. That is useful information. Do not pad the list with work-in-progress.

What it cost. Total spend — compute, headcount, tooling, data labeling, everything. Break it down per shipped feature. This number is often uncomfortable. A feature that cost $400k to build and saves $50k per year is not a good investment yet. Write the number down anyway. Intellectual honesty about costs is what separates a program that will improve from one that will get cut.

What it changed. User metrics and business metrics. Did support ticket volume drop. Did user engagement increase. Did revenue change. Did time-to-resolution decrease. Use actual numbers, not percentages of percentages. If you cannot connect your AI features to a business metric, either the measurement is missing or the impact is.

What we would do differently. Two or three honest statements about what did not work. Not “we learned a lot about embeddings.” Specific operational lessons. “We spent 8 weeks on fine-tuning that delivered less accuracy improvement than a prompt change we made in 2 days.” “We shipped without an eval suite and spent a quarter recovering from a regression we could have caught.” This section is the most valuable part of the review. It is also the section teams most often skip.

Why the length matters

A 30-page retrospective serves the team’s need to justify its existence. A one-page retrospective serves the organization’s need to make decisions.

Leadership does not need to know your embedding dimensions or your chunking strategy. They need to know whether the AI program is working. Working means: it shipped things, those things cost a known amount, and those things had a measurable impact. Everything else is supporting detail that belongs in a team wiki, not in a review.

The discipline of compression is the discipline of understanding. If you cannot fit your program’s impact on one page, one of two things is true. Either the impact is too diffuse to articulate — which means the program lacks focus — or the team does not know which parts of their work actually mattered. Both are problems worth discovering.

The sections nobody wants to fill in

Cost per feature is the number that generates the most discomfort. Teams resist calculating it because the answer is often unflattering. An AI chatbot that cost $600k to build and serves 200 queries per day is an expensive system. Writing down the per-query cost forces a conversation about whether this is the right investment. That conversation is necessary. Having it in December is better than having it in June when someone else initiates it.

What it changed is the section that exposes measurement gaps. Many teams ship AI features without instrumenting them for business impact. They can tell you the model’s accuracy on their eval set. They cannot tell you whether users are better off. If this section is empty, the problem is not the review format — it is that the team has been building without feedback loops. Fix the instrumentation, not the review.

What we would do differently is the section that requires psychological safety. If the team writes “nothing, it all went great,” the review is useless. The real version includes statements that make someone uncomfortable. The feature that should have been killed earlier. The hire that was wrong for the role. The dependency on a vendor that turned out to be a bottleneck. These are the lessons that save you money next year.

How to use the review

The one-page review is not a filing exercise. It is a decision tool. After writing it, three decisions should be obvious.

Continue, expand, or cut. For each shipped feature, the cost and impact data tells you whether to keep investing. A feature with high impact and decreasing cost gets expanded. A feature with low impact and stable cost gets cut. A feature with high impact and high cost gets an optimization sprint.

Where to focus next year. The “what we’d do differently” section points directly at the highest-leverage changes. If you spent too long on fine-tuning, invest in prompt engineering infrastructure. If you shipped without evals, make evals the first project next quarter.

Whether the program is earning its budget. This is the question leadership is actually asking. The one-page format makes the answer legible. Either the AI program shipped things that moved business metrics, or it didn’t. If it didn’t, the review should say so — and explain what needs to change.

The heuristic

At the end of the year, write your AI program review on one page. Four sections: what shipped, what it cost, what it changed, what you’d do differently. If you cannot fill the page, your program needs focus. If you need more than a page, you need to decide what actually mattered. Either way, the constraint is the point.

tl;dr

The pattern. AI teams produce 30-page retrospectives to justify their existence, hiding the uncomfortable numbers — cost per feature, unmeasured business impact, mistakes worth repeating — behind architecture diagrams and roadmaps nobody reads past slide 8. The fix. Force the review into four sections on one page: what shipped, what it cost, what it changed, and what you would do differently — with actual numbers in each. The outcome. Leadership can make a real decision about the program’s budget, and the team finally has the specific operational lessons that will save them money next year.

Multimodal is not a feature, it's a stack change

Fri, 22 Nov 2024 00:00:00 +0000

A product manager walks into a planning meeting and says, “Can we add image understanding? The new model supports it.” The team estimates it at two sprints. They are wrong by a factor of four, and they will not realize it until sprint three.

This happens constantly. Multimodal capabilities — image, audio, video — look like feature additions. They are not. They are stack changes. The distinction matters because feature additions work within your existing infrastructure. Stack changes require you to rebuild parts of it.

What actually changes

Here is what “add image understanding” means in practice, for a team that has a working text-based AI product.

Data pipeline. Your current pipeline ingests text. It chunks it, embeds it, indexes it. Images are different in every way. They need to be extracted from documents — PDFs, slides, emails with attachments. They need preprocessing — resizing, format conversion, OCR for text-in-images. They need metadata extraction — what page is this image on, what text surrounds it, what is the caption. Your text chunking code does not handle any of this. You are building a second pipeline.

Storage. Text chunks are small. A typical chunk is 500–1000 tokens, a few kilobytes. Images are large. A single page rendered at reasonable quality is 200KB–2MB. A 100-page PDF produces 100 images. Your vector store was sized for text. Your blob storage budget was sized for text. Your bandwidth costs were sized for text. Multiply everything by 100x and see if your architecture still makes sense.

Embedding and retrieval. Text embeddings and image embeddings live in different spaces. If you want to retrieve images based on text queries — and you do — you need a multimodal embedding model. These models have different dimensionality, different performance characteristics, and different failure modes than your text embedding model. You are not adding a column to your index. You are adding a second index with a different model and a different query path.

Latency. Sending an image to a vision model takes longer than sending text. Significantly longer. A text-only call to GPT-4 might take 1–3 seconds. The same call with an image might take 5–15 seconds. If your product has a 3-second SLA on response time, you just violated it. You need to rethink your latency budget. Maybe you preprocess images asynchronously. Maybe you cache image analysis results. Maybe you accept a slower experience for image queries. All of these are architectural decisions, not feature decisions.

Cost. Vision model calls cost more than text-only calls. Often 2–5x more per request, depending on image resolution and token count. If you are processing images at ingestion time — analyzing every image in every document — your ingestion costs go up dramatically. If you are processing images at query time — sending images to the model when the user asks about them — your per-query costs go up dramatically. Either way, your cost model changes.

Eval. This is the one teams forget about until it is too late. How do you evaluate whether the model correctly understood an image? Your text eval is straightforward — compare the generated answer to a reference answer. Image understanding eval is fundamentally harder. Did the model correctly read the chart? Did it understand the diagram? Did it extract the right numbers from the table? Each of these is a different eval task with different scoring criteria. Your eval suite just tripled in complexity.

The three-component rule

Here is a heuristic we use. Count the number of infrastructure components that need to change to support the new capability. If it is one or two, it is a feature. If it is three or more, it is a platform evolution.

Adding image understanding typically changes six components: data pipeline, storage, embedding and retrieval, latency architecture, cost model, and eval suite. This is not a feature. This is a second product built on top of the first one.

The reason this distinction matters is planning. Features get estimated in sprints. Platform evolutions get estimated in quarters. If you estimate a platform evolution in sprints, you will be wrong, and you will spend the extra time in a state of perpetual “we’re almost done” that erodes team morale and stakeholder trust.

The right way to scope it

If you genuinely need multimodal capabilities — and sometimes you do — scope it as a platform project.

Phase 1: Spike. One engineer, one week. Build the simplest possible version — take one image, send it to the vision model, get a response. This tells you whether the model can actually do what you need. Many teams discover in the spike that the model’s image understanding is not good enough for their use case. Better to learn this in a week than in a quarter.

Phase 2: Pipeline. Build the image ingestion pipeline. Extraction, preprocessing, storage. Do not integrate it with the existing text pipeline yet. Run it in parallel. This takes 2–4 weeks depending on document complexity.

Phase 3: Retrieval. Add image retrieval to your query path. This might mean a multimodal embedding model, a separate index, or a hybrid approach. Test it in isolation before connecting it to the generation step. Another 2–4 weeks.

Phase 4: Eval. Build an eval suite for image understanding. This is its own workstream. You need golden sets with images, scoring functions that can handle visual content, and CI gates that run image-specific tests. 2–3 weeks.

Phase 5: Integration. Connect the image pipeline to the existing text pipeline. Handle the mixed-modality queries — “what does the chart on page 7 show and how does it relate to the text?” This is where the complexity lives. 2–4 weeks.

That is 9–16 weeks. Not two sprints. And this is the optimistic timeline assuming the spike validates the approach.

The alternative nobody considers

Before building multimodal infrastructure, ask whether you actually need it. In many cases, the user’s real need can be met with a simpler approach.

If users want to understand charts and tables, OCR plus structured extraction might be enough. Convert the visual to text, process it with your existing text pipeline. It is less impressive in a demo. It ships in a week instead of a quarter.

If users want to search for images, metadata and captions might be enough. Tag images during ingestion with descriptions generated by a vision model, then search the text descriptions with your existing retrieval stack.

These approaches are not as powerful as true multimodal understanding. They are an order of magnitude simpler to build, operate, and evaluate. For many products, “good enough” ships and compounds while “perfect” sits in a planning document.

The heuristic

Before adding a multimodal capability, count the infrastructure components it touches. If it is more than three, call it what it is — a platform evolution — and scope it in quarters, not sprints. Then ask whether a text-only approximation would meet 80% of the user need at 20% of the cost. Usually it does.

tl;dr

The pattern. Teams estimate image understanding at two sprints because they are scoping a feature, when they are actually scoping six infrastructure changes across data pipelines, storage, retrieval, latency, cost, and evals. The fix. Count the infrastructure components the capability touches — if it is more than three, scope it as a quarter-long platform project with explicit phases, not a sprint item. The outcome. The timeline becomes honest, the phased build surfaces problems before they compound, and the team discovers early whether a simpler text-only approximation would have shipped the same user value in a week.

The AI team that reported to product shipped. The one that reported to research didn't.

Fri, 08 Nov 2024 00:00:00 +0000

We have worked with about a dozen AI teams over the past few years. The single strongest predictor of whether a team ships production AI is not their talent, their budget, or their model choice. It is who they report to.

Teams that report to a product org ship products. Teams that report to a research org ship papers, prototypes, and demos that never quite make it to production. This is not a judgment on research — it is an observation about incentive alignment. And most companies get it wrong.

The reporting structure shapes everything

Your reporting structure determines four things that matter more than any technical decision your AI team will make.

What gets prioritized. A product-reporting team prioritizes features that users need. A research-reporting team prioritizes problems that are technically interesting. These overlap sometimes. They diverge often. When they diverge, the reporting structure breaks the tie.

How success is measured. Product teams are measured on shipped features, user adoption, and business metrics. Research teams are measured on publications, novelty, and technical depth. An AI team that reports to research will naturally optimize for work that is novel and publishable. Production reliability is neither novel nor publishable.

How long work takes. Product orgs have release cycles. Sprints. Deadlines that are connected to revenue. Research orgs have horizons. Quarters. Goals that are measured in papers submitted and benchmarks beaten. The cadence is different. The urgency is different. A product team that needs an AI feature in six weeks will get it from a product-reporting AI team. A research-reporting AI team will say six weeks is not enough time to do it properly.

Who the team hires. Product-reporting AI teams hire ML engineers who can write production code, operate services, and debug at 3am. Research-reporting AI teams hire researchers who can publish, present at conferences, and push the state of the art. Both are valuable. They are not interchangeable.

The pattern we keep seeing

Here is how it typically plays out.

A company decides to invest in AI. They hire a senior researcher — someone with a strong publication record, maybe from a major lab. This person is given a team and a mandate: “build AI capabilities for the company.”

The researcher does what researchers do. They hire other researchers. They set up a research agenda. They pick interesting problems. They build prototypes. The prototypes are impressive. The demos go well. Leadership is excited.

Then someone asks: “When does this ship?”

The answer is always some version of “it’s not quite ready.” The model needs more fine-tuning. The accuracy isn’t high enough. The edge cases are tricky. These are legitimate technical concerns. They are also the concerns of a team that is optimizing for correctness over shipping.

Meanwhile, a product team down the hall needs an AI feature. They cannot wait for the research team’s timeline. They hire an ML engineer, use an off-the-shelf model, build a quick eval, and ship something in three weeks. It is not as sophisticated as what the research team is building. It works. Users like it. It generates revenue.

Six months later, the research team’s prototype still hasn’t shipped. The product team’s scrappy feature is in production, handling real traffic, getting better with every iteration. Leadership starts asking hard questions about the research team’s ROI.

This is not a failure of talent. It is a failure of org design.

The asymmetry

Here is the nuance that matters: you can embed a research function inside a product-reporting team, but you cannot embed a shipping function inside a research-reporting team.

A product-reporting AI team can allocate 20% of its time to exploratory research. Some sprints, an engineer investigates a new technique. They prototype it. If it works, it goes into the next sprint’s production backlog. If it doesn’t, the team learned something and moves on. This works because the default mode is shipping. Research is a controlled departure from the default.

A research-reporting AI team cannot allocate 20% of its time to “just ship something.” The culture, the incentives, the hiring profile — all of it resists production work. Shipping is not a controlled departure from their default. It is a fundamentally different mode of operating that the team is not staffed or incentivized for.

This asymmetry means the product-reporting structure strictly dominates for companies that need production AI. You get shipping by default and research as an option. The reverse gives you research by default and shipping as an aspiration.

The exceptions

Two situations where a research-reporting structure makes sense.

You are building foundation models. If your core product is the model itself — if you are OpenAI, Anthropic, or a similar lab — then research is the product. The reporting structure aligns because the research output is what ships. This does not apply to 95% of companies.

You have a genuine long-horizon research need. Some companies need to solve problems where no off-the-shelf solution exists. Drug discovery. Materials science. Autonomous systems. These require multi-year research programs. If this is your situation, a research-reporting structure is appropriate. But be honest about whether your AI needs are truly in this category. Most are not. Most companies need to apply existing models to their data, not invent new ones.

How to restructure

If you have a research-reporting AI team and you need production AI, here is the migration path we have seen work.

Step 1. Move the team’s reporting line to a product leader. VP of Engineering or VP of Product — someone who owns a P&L or a product roadmap.

Step 2. Change the team’s success metrics. Replace publications and benchmarks with shipped features, user adoption, and production reliability. Do this explicitly and in writing.

Step 3. Expect turnover. Some researchers will leave. This is not a failure. They joined a research team, and the team is becoming a product team. The ones who stay are the ones who want to ship. These are the people you want.

Step 4. Backfill with ML engineers. People who have run models in production. People who know what an SLA is. People who have been on-call for an ML system.

Step 5. Keep a research allocation. 10–20% of team time for exploratory work. This retains the researchers who stayed and keeps the team’s technical edge. But it is time-boxed and it reports up through product.

This transition takes about a quarter. It is uncomfortable. It works.

The hybrid that doesn’t work

Some companies try to solve this with a matrix structure — the AI team reports to both research and product. Dotted lines. Dual metrics. Shared goals.

We have never seen this work. Matrix structures create ambiguity about priorities. When the research lead wants the team to spend a sprint investigating a new embedding architecture and the product lead wants them to ship a retrieval feature, who wins? In a matrix, the answer is “whoever argues longer.” In a clear reporting structure, the answer is obvious.

Pick one. Make it product. You will ship more and regret less.

The heuristic

If your AI team has been operating for more than six months and nothing is in production, check the reporting structure. If the team reports to research, move it to product. If it already reports to product and still hasn’t shipped, you have a different problem — but at least you can see it clearly.

Reporting structure is not a detail. It is the decision that determines all the other decisions. Get it right first.

tl;dr

The pattern. AI teams that report to research optimize for novelty and correctness and never quite ship, while teams that report to product are forced to build things that work well enough to go live. The fix. Move your AI team’s reporting line to a product leader, change the success metrics from publications to shipped features, and protect a 10–20% research allocation for exploratory work. The outcome. You get production AI by default, and the research capability you kept funds the technical edge that stops your product from going stale.

The GPU bill is not the expensive part

Fri, 25 Oct 2024 00:00:00 +0000

Every AI cost conversation we walk into starts with the cloud bill. How much are we spending on inference. Can we use a smaller model. Should we self-host. The GPU line item is visible, legible, and easy to optimize. It is also the smallest cost in most AI systems.

The expensive part is everything else.

The costs that don’t have line items

Engineering time on non-deterministic debugging. A traditional bug has a stack trace. You read it. You find the line. You fix it. An AI bug has a prompt, a model response, a retrieval result, and a user complaint that says “the answer was wrong.” There is no stack trace. The same input might produce a different output tomorrow. Your engineer spends four hours reproducing the issue, three hours tracing it to a retrieval problem, and two hours writing a targeted eval to make sure it doesn’t happen again. That is a full day for one bug report. At senior engineer rates, that day costs more than a week of inference.

Product management overhead. Your PM is in a meeting explaining to sales why the AI feature “sometimes gets things wrong.” Sales wants a guarantee. The PM cannot give one. This meeting happens every two weeks. It is never the same meeting twice because the failure modes keep changing. The PM starts building a spreadsheet of “known limitations” that gets longer every sprint. This is not product management. This is reputation management. It is expensive and it does not scale.

Support cost. A wrong answer from a traditional system is a bug — file a ticket, we’ll fix it. A wrong answer from an AI system is a trust event. The user does not think “there’s a bug.” The user thinks “this system doesn’t work.” Support has to triage whether the answer was actually wrong, whether it was wrong in a way that matters, and whether the user has lost confidence in the product. This is a fundamentally different support interaction. It takes longer. It requires more context. It often escalates.

Trust erosion. This is the most expensive cost and the hardest to measure. Every wrong answer costs you a fraction of user trust. Users who lose trust stop using the feature. Usage drops. The team looks at the metrics and concludes the feature is not valuable. They scale it back or kill it. The feature was valuable — the reliability was not. But by the time you realize this, the window has closed.

Why teams undercount these costs

The GPU bill arrives on the first of the month. It has a number on it. You can graph it. You can set alerts on it. You can optimize it.

Engineering time spent on AI debugging does not have its own line item. It is hidden inside sprint velocity. Your team shipped fewer features this quarter. Why? Partially because two engineers spent a cumulative three weeks investigating hallucination reports. This cost does not appear in any dashboard. It appears as “we’re a little behind on the roadmap.”

Product management overhead does not have a line item either. It appears as “the PM seems really busy” and “we need another PM.” You hire another PM. The cost is now $180k per year in salary, benefits, and overhead. Nobody connects this to AI reliability.

Support cost shows up as “ticket volume is higher than expected.” You hire another support engineer. Another $120k. Nobody connects this to the model that confidently told a customer their contract included a feature it did not.

The math nobody does

Here is a rough accounting we have done with a few teams. The numbers are anonymized but the ratios are real.

Monthly GPU and API costs: $8k. The number everyone talks about.

Monthly engineering time on AI-specific debugging and maintenance: $24k. Three engineers, each spending roughly 30% of their time on AI reliability work instead of feature development.

Monthly product management overhead for AI limitations and expectations: $6k. One PM spending roughly 40% of their time on AI-related stakeholder management.

Monthly support cost increment from AI-related tickets: $4k. Higher handle time, more escalations, more “is this right?” questions.

Quarterly trust recovery costs — re-engagement campaigns, user interviews, feature re-launches after reliability incidents: $15k per quarter, call it $5k per month.

Total monthly cost: $47k. The GPU bill is 17% of it.

The fix is not cheaper GPUs

Optimizing the GPU bill is fine. Use smaller models where they work. Cache common queries. Batch requests. These are good engineering practices. They save real money.

But they do not touch the 83% of cost that lives outside the cloud bill. To reduce those costs, you need reliability.

Evals. An eval suite that catches regressions before they reach users. This directly reduces engineering debugging time and support ticket volume. A good eval suite pays for itself within a month — not in GPU savings, but in engineering time recovered.

Monitoring. Not just uptime monitoring. Output monitoring. Track answer confidence, retrieval quality, and user feedback signals in production. When something starts degrading, you find out from your dashboard, not from your support queue.

Graceful degradation. When the model is unsure, say so. “I’m not confident in this answer — here are the source documents” costs you nothing and saves you a trust event. Teams that build graceful degradation into their AI systems see dramatically lower support costs. The wrong answer is expensive. The honest “I don’t know” is cheap.

Scope control. The AI features with the worst cost ratios are the ones with the broadest scope. “Ask anything about our product” is a hallucination machine. “Get a summary of your last 5 invoices” is a tractable problem. Narrow scope means fewer failure modes, which means less debugging, less support, and less trust erosion.

The conversation to have

Next time someone asks “how do we reduce our AI costs,” don’t open the cloud console. Open the engineering time tracker. Open the support ticket system. Open the PM’s calendar.

Add up the hours your team spends on AI reliability work — debugging, explaining, apologizing, rebuilding trust. Multiply by your fully loaded cost per hour. Compare that number to your GPU bill.

The GPU bill is the easy cost. The hard costs are the ones you are already paying but have not yet named. Name them. Measure them. Then invest in the reliability work that actually reduces them.

The heuristic

If your monthly engineering time spent on AI debugging and maintenance exceeds your monthly inference cost, your reliability investment is too low. Fix the reliability. The other costs follow.

tl;dr

The pattern. Teams optimize the GPU bill while ignoring the engineering time, product management overhead, and trust erosion that together account for 83% of their actual AI costs. The fix. Build evals, output monitoring, and graceful degradation into your AI system so reliability failures get caught before they reach users. The outcome. When wrong answers stop reaching users, the debugging hours, support tickets, and stakeholder management meetings that consume your team quietly disappear from the sprint.

The pilot that never graduated

Fri, 11 Oct 2024 00:00:00 +0000

The demo went great. The pilot hit its accuracy targets. The stakeholders were impressed. Someone said “this is a game-changer” in a meeting, and they meant it. That was six months ago.

The pilot is still a pilot. It runs on a laptop. Or a notebook in someone’s personal cloud account. Or a prototype environment that nobody monitors. A handful of users test it occasionally. It kind of works. Nobody has a plan to move it to production. Nobody is quite sure whose job that is.

This is the most common outcome for AI pilots. Not failure — limbo. The pilot works well enough that nobody kills it. It doesn’t work well enough — or isn’t integrated enough — to run as a real system. It just sits there, consuming attention and budget, never quite graduating.

We see this at nearly every company that’s past the “should we do AI” conversation. They have pilots. What they don’t have is production systems. The gap between the two is where most AI investment goes to die.

Why pilots get stuck

The problem is almost never technical. The pilot proved the technology works. The problem is organizational — a set of missing decisions that nobody made because the pilot was “just an experiment.”

No production owner. The pilot was built by the AI team, the innovation team, or a couple of engineers who were interested. None of these people run production systems. When the pilot is “done,” there’s nobody whose job it is to operate it. The AI team moves on to the next experiment. The platform team wasn’t involved and doesn’t want to adopt a system they didn’t build. The pilot sits in limbo because nobody owns what happens next.

This is the single most common reason pilots fail to graduate. Ownership. The team that builds a pilot is almost never the team that should run it in production. And if you don’t figure out that handoff before the pilot starts, you won’t figure it out after.

No success criteria defined upfront. The pilot was approved with a vague mandate: “explore whether AI can help with X.” There were no specific metrics, no thresholds, no definition of what “works” means. The pilot produced results. Some were good. Some were mediocre. Nobody knows whether the pilot succeeded because nobody agreed on what success looked like.

Without success criteria, you can’t make a go/no-go decision. And without a go/no-go decision, the pilot just continues. It’s easier to keep running a pilot than to declare it a success or a failure. So it runs. And runs. And runs.

No integration plan. The pilot runs in isolation. It takes manual input, produces output, and someone copies the output into the real system. In the pilot phase, this is fine — you’re testing the AI, not the integration. But in production, the integration is the product. The AI model is maybe 20% of the work. The other 80% is getting data in, getting results out, handling errors, monitoring quality, and fitting into the existing workflow.

Most teams don’t think about integration until the pilot is “done.” Then they discover it’s a 3-month engineering project to connect the pilot to the systems it needs to talk to. The 3-month estimate kills momentum. The pilot stays a pilot.

The people problem. The person who championed the pilot got promoted. Or moved teams. Or left the company. The pilot lost its advocate. Nobody else cares enough to push it through the organizational friction of getting to production. Pilots need a champion, and champions have a half-life.

The pilot tax

Here’s what nobody talks about: running a pilot is often more expensive than running the production version.

A pilot requires manual intervention. Someone feeds it inputs. Someone reviews outputs. Someone restarts it when it crashes. Someone explains to stakeholders why the results are different this week. All of this is human time — untracked, unbudgeted, invisible in the cost model.

A production system, by contrast, is automated. It has monitoring. It has error handling. It has a runbook. It’s less work per unit of output because someone invested the time to make it self-sufficient.

The pilot tax is real, and it compounds. Every month a pilot runs, you’re paying the operational cost of a prototype — which is higher than the operational cost of a production system — while getting the limited value of a system that only a few people use. You’re paying more for less.

This is the argument for graduating or killing. There is no cost-effective middle ground. A pilot that deserves to exist deserves to be in production. A pilot that doesn’t deserve to be in production doesn’t deserve to exist.

The three things that get a pilot into production

We’ve helped teams graduate about 30 AI pilots over the past few years. The ones that make it share three properties. All three are set before the pilot starts, not after.

1. Define graduation criteria before you start

Before the pilot begins, write down what “done” looks like. Not “the AI works” — specific, measurable criteria that trigger the decision to move to production.

“The model classifies incoming tickets with at least 87% accuracy on a held-out test set of 200 tickets, measured weekly for 4 consecutive weeks.”

“Processing time per document drops from 8 minutes to under 2 minutes, with no increase in error rate above the current 4% baseline.”

“Three out of five pilot users rate the system ‘useful’ or ‘very useful’ in the exit survey, and provide specific examples of time saved.”

These criteria serve two purposes. First, they force you to define what matters before you’re emotionally invested in the outcome. Second, they create an automatic trigger for the graduation decision. When the criteria are met, you move to production. When they’re not met, you either iterate with a deadline or kill the pilot. There’s no “let’s keep running it and see.”

2. Assign a production owner from day one

On the first day of the pilot, name the person or team who will own this system in production. Not after the pilot. Not when you “get closer to launch.” Day one.

This person attends the pilot standups. They see how the system works. They understand the data pipeline, the failure modes, the monitoring needs. When the pilot graduates, the handoff is smooth because the production owner has been involved the entire time.

If you can’t name a production owner, that’s a signal. It means either nobody wants to own this in production — which suggests the system isn’t valuable enough to build — or the organizational structure doesn’t support it — which suggests you have a bigger problem than the pilot.

The production owner doesn’t have to build the pilot. They have to be ready to run it. That’s a different skill set and a different commitment. Clarifying this upfront avoids the most common handoff failure: “the AI team built this cool thing and now they want us to support it but we have no idea how it works.”

3. Set a deadline — and mean it

Pilots without deadlines become permanent. The default is entropy: the pilot keeps running, the team keeps tweaking, nobody makes the hard decision.

Set a deadline. 8 weeks. 12 weeks. Whatever’s appropriate. At the deadline, one of three things happens:

Graduate. The criteria are met. Move to production. This is a project with a budget, a timeline, and a team — not “we’ll get to it eventually.”

Iterate with a new deadline. The criteria are close but not met. You see a clear path to getting there. Set a new deadline — no more than 4 weeks — with specific changes to make. This happens once. Not twice. If you’re on your third iteration deadline, the pilot is telling you something.

Kill. The criteria are not met and there’s no clear path to meeting them. Kill the pilot. This is not a failure — it’s a decision. You learned that this use case doesn’t work with current technology, current data, or current organizational capacity. That’s valuable information. Document it, archive the code, move on.

Killing a pilot is hard. Nobody wants to be the person who pulls the plug on something the CEO saw a demo of. But running a pilot forever is worse — it costs more, delivers less, and blocks the team from working on something that might actually make it to production.

The honest conversation

Before you start your next AI pilot, have this conversation: “If this pilot works, who runs it in production, and what does ‘works’ mean specifically?”

If you can’t answer both questions, you’re not ready for a pilot. You’re ready for a research spike — a time-boxed exploration with no expectation of production. That’s fine. Research spikes are valuable. But call them what they are. Don’t call it a pilot unless you’re prepared to graduate it.

The word “pilot” implies a path to production. If there’s no path, it’s not a pilot. It’s a demo that never stops demoing. And your organization already has enough of those.

tl;dr

The pattern. AI pilots succeed in demos but never reach production because nobody defined success criteria, assigned a production owner, or set a deadline. The fix. Before the pilot starts, write graduation criteria, name the production owner, and set a hard deadline for the go/no-go decision. The outcome. Pilots either graduate to production and deliver real value, or get killed quickly — either way, the team stops paying the pilot tax.

You don't need agents, you need a queue

Fri, 27 Sep 2024 00:00:00 +0000

Most “agent” architectures we audit are a task queue with an LLM step. That is fine. Call it what it is, and you will make better infrastructure decisions.

The agent hype

In 2024, every AI feature became an “agent.” A system that reads an email and drafts a reply — agent. A pipeline that processes invoices, extracts fields, and writes them to a database — agent. A workflow that takes a support ticket, classifies it, routes it, and suggests a response — agent.

Frameworks appeared to build these agents: LangChain, CrewAI, AutoGen, and a dozen more. Each one came with abstractions for memory, planning, tool selection, multi-agent coordination, and cognitive architectures. Conference talks showed diagrams with arrows looping between “think,” “act,” and “observe.” The vocabulary shifted. We were not building pipelines anymore. We were building agents.

But when you look at what these systems actually do — what the code actually executes — most of them are a task queue with an LLM call in one of the steps.

What an agent is (and is not)

An agent, in the meaningful sense of the word, is a system where the LLM decides what to do next. Not just what to output — what actions to take, in what order, with what tools, based on what it observes. The control flow is non-deterministic. The model is the orchestrator.

This is a real pattern. It exists. Code generation agents like Devin or Cursor’s agent mode actually do this — they decide which files to read, what to change, when to run tests, and how to respond to errors. The LLM is in the loop of a control flow it is steering.

Most production AI systems are not this. Most production AI systems have a deterministic control flow — step 1, step 2, step 3 — with one or more steps that call an LLM. The steps are defined by a developer. The order is fixed. The LLM provides a specific capability (classification, extraction, generation) at a specific point in the pipeline. The system does not plan. It does not observe and react. It executes.

This is a pipeline. It is a good, useful, battle-tested pattern. It is also not an agent.

Why the label matters

Calling a pipeline an agent has practical consequences. It changes what infrastructure you reach for. It changes how you debug problems. It changes what abstractions you adopt. And most of those changes make things worse.

Agent frameworks add complexity you do not need. If your system processes invoices in a fixed sequence — OCR, extract fields, validate, write to database — you do not need a memory system. You do not need a planner. You do not need tool selection. Each of these adds code, adds failure modes, and adds latency. You need a queue, a worker, and an LLM call.

Agent frameworks obscure what is happening. When you wrap your pipeline in an agent framework, the control flow gets buried under abstractions. Debugging goes from “read the code and follow the steps” to “understand the framework’s execution model, check the agent’s memory state, figure out which tool it selected and why.” This is harder for no benefit when the control flow was deterministic to begin with.

Agent frameworks make the wrong tradeoffs for pipelines. Agent frameworks are optimized for flexibility — the ability to handle novel situations by selecting different tools and strategies. Pipeline workloads need the opposite: consistency, predictability, and reliability. You want the same steps to execute in the same order every time. The system should be boring. Boring systems are reliable systems.

What you actually need

For most AI workloads — the ones that are not genuinely agentic — the infrastructure you need is the infrastructure you already know.

A task queue. Celery, Bull, SQS, Cloud Tasks, Temporal. Pick one. It gives you concurrency control, retry logic, dead letter queues, backpressure, and observability. These are solved problems. You do not need to re-solve them inside an agent framework.

Workers that execute steps. Each step is a function. Some functions call an LLM. Some call a database. Some call an external API. The worker pulls a task from the queue, executes the steps in order, and writes the result. This is the same pattern you use for processing uploads, sending emails, or generating reports.

Error handling at each step. If the LLM call fails, retry it. If it returns garbage, send it to a dead letter queue for human review. If the downstream API is down, back off and retry. You know how to build this. You have been building this for a decade.

Monitoring per step. Latency, error rate, and throughput — per step, not per pipeline. This lets you see immediately when the LLM step degrades without conflating it with the database step or the API step.

This architecture is simpler than an agent framework. It is easier to reason about. It is easier to debug. It is easier to monitor. It uses infrastructure your team already understands. And it handles the actual workload — deterministic pipeline with non-deterministic LLM steps — perfectly.

When you actually need an agent

You need an agent when the control flow itself is non-deterministic. When the system genuinely needs to decide what to do next based on what it has observed so far. When the number of steps is not known in advance. When the system needs to backtrack and try a different approach.

These workloads exist. Research tasks where the model needs to follow chains of references. Code generation where the model needs to write, test, debug, and iterate. Complex analysis where the model needs to gather information from multiple sources and decide when it has enough.

For these tasks, agent frameworks provide real value. Memory, planning, and tool selection are not overhead — they are requirements.

The question is: does your task actually require this? In our experience, roughly 80% of the systems we audit that are built as agents do not. They are pipelines that someone called agents because that was the vocabulary in the room when the architecture was decided.

The refactoring

We worked with a team that had built a document processing system using an agent framework. The “agent” received a document, “decided” to extract metadata, “decided” to classify it, “decided” to route it to the right team, and “decided” to generate a summary. Each decision point was deterministic — the same types of documents always followed the same path.

The agent framework added 4 seconds of latency per document (planning and memory management overhead). It made debugging difficult — when a document was misclassified, the team had to reconstruct the agent’s “reasoning” to understand the failure. Error handling was inconsistent because the framework handled retries differently at each step.

They replaced it with a Celery queue and four worker functions. Processing time dropped from 12 seconds to 6 seconds. Error rates dropped because retry logic was explicit and consistent. Debugging became straightforward — each step logged its input and output, and you could trace a document through the pipeline by reading the logs.

The LLM calls were identical. The prompts were identical. The model was identical. The only thing that changed was the orchestration — from an agent framework to a task queue. Everything got better.

The takeaway

If your control flow is deterministic and only the LLM call is non-deterministic, you do not need an agent. You need a queue with a worker that calls an LLM.

The heuristic: before you adopt an agent framework, draw the control flow. If every arrow points in one direction — if there are no loops, no branches decided by the model, no dynamic tool selection — you are building a pipeline. Build it like one.

tl;dr

The pattern. Teams reach for agent frameworks to orchestrate AI workloads that are actually deterministic pipelines with a single LLM step, then absorb the overhead of memory systems, planners, and tool selection they never needed. The fix. Draw the control flow before picking infrastructure — if every arrow points in one direction and the model isn’t deciding what to do next, use a task queue with worker functions and an LLM call at the right step. The outcome. As one team found when they replaced an agent framework with Celery, processing time drops by half, error handling becomes explicit and consistent, and debugging goes back to reading logs instead of reconstructing an agent’s reasoning.

Your competitor's AI press release is lying

Fri, 13 Sep 2024 00:00:00 +0000

Your competitor just put out a press release. “AI-Powered” is in the headline. The product page has a glowing gradient and the words “intelligent automation.” Your CEO forwarded it to you at 7am with no message — just the link. You know what that means.

By 9am there’s a meeting on your calendar. By noon someone has asked why you don’t have that feature. By Thursday you’re supposed to have a plan to “respond.”

Stop. Take a breath. The press release is almost certainly lying — not maliciously, but structurally. And chasing it is one of the most expensive mistakes you can make.

The anatomy of an AI press release

Here’s what an AI press release actually describes, in our experience, about 80% of the time.

A demo. Someone on the product team built a prototype, the marketing team saw it, and the press release went out before the prototype became a product. The feature exists. It works in controlled conditions, with curated inputs, on a happy path. It does not work at scale. It does not handle edge cases. It is months — sometimes years — from being the thing described in the press release.

We’ve seen this from the inside. A company announces “AI-powered contract analysis.” What they have is a GPT wrapper that extracts three fields from a specific contract template. It works on that template. It fails on everything else. The press release doesn’t mention that. Press releases never mention that.

The other 20% of the time, the feature is real. It works. It’s in production. But even then, the press release overstates its scope, understates its limitations, and implies a level of intelligence that doesn’t exist. This is not deception — it’s marketing. Marketing’s job is to make things sound impressive. Your job is to figure out what’s actually there.

How to decode a competitor’s AI announcement

Don’t panic. Investigate. Here’s a framework.

Look at the product, not the press release. Sign up for a trial. Use the feature. Push it past the happy path. Ask it something weird. Give it messy input. If the feature is real and robust, you’ll know within an hour. If it’s a demo wrapped in a product page, you’ll know even faster.

Check the hiring page. If a company has just shipped a production AI feature, they’re hiring to support it — ML engineers, data engineers, infrastructure people. If their hiring page is unchanged, the feature is probably thinner than the press release suggests. If they’re hiring “AI Product Manager — founding role,” they haven’t built it yet.

Talk to their customers. This is the single most reliable signal. Find someone who actually uses the feature. Ask them how it works in practice. You’ll hear things like “it’s okay for simple cases” or “we still do most of it manually” or “it’s cool but we don’t really rely on it.” The gap between the press release and the customer experience is usually enormous.

Read the fine print. Look for words like “beta,” “preview,” “select customers,” “powered by [third-party API].” Each of these tells you something. Beta means it’s not done. Preview means it might never be done. Select customers means it doesn’t scale. Powered by a third party means they didn’t build it — they integrated someone else’s product and put their logo on it.

Why speed-to-announce is not speed-to-value

There’s a pervasive assumption that the first company to announce an AI feature wins. This is wrong in a way that’s worth understanding, because it drives bad decisions.

Speed-to-announce is a marketing metric. Speed-to-value is a product metric. They are not correlated.

The company that announces first has the worst version. They’ve optimized for the press release, not the product. They’ve shipped the thing that looks good in a demo. They have not solved the hard problems — edge cases, accuracy at scale, monitoring, graceful degradation, cost management, user trust. Those problems take months to solve. You don’t solve them by being first. You solve them by being patient.

The companies that create real competitive advantage with AI are rarely the ones that announce first. They’re the ones that ship a feature that quietly works — that users rely on without thinking about it, that handles the messy cases, that gets better over time because someone built the evals and the feedback loops.

Being second with a thing that works is better than being first with a thing that doesn’t. Every time.

The real danger: the panic build

The actual risk is not that your competitor has something you don’t. The actual risk is that you react to their press release by building the wrong thing in the wrong way.

Here’s how the panic build works. CEO sees press release. CEO asks for a response. Product team scrambles. They pick the feature the competitor announced — not because it’s the highest-value use case for your customers, but because the competitor announced it. They build it fast, skipping the baseline measurement, skipping the evals, skipping the integration planning. They ship a demo in 6 weeks. It kind of works. It’s not great. It doesn’t solve a problem your customers actually have. But it exists, and someone can point to it and say “we have AI too.”

Now you’ve spent 6 weeks of engineering time and political capital on a feature that doesn’t compound. It doesn’t make your product better. It doesn’t make your customers more successful. It just sits there, a monument to competitive anxiety, slowly accumulating tech debt while no one uses it.

We’ve seen this pattern at a dozen companies. The feature sits at 2% adoption for a year, then someone quietly deprecates it. The team that built it has moved on. The cost — in time, in opportunity, in morale — is never recovered.

What to do instead

When a competitor announces an AI feature, do these three things.

Assess what’s real. Use the framework above. Figure out whether the feature is a demo, a beta, or a real product. This takes a few days, not a few months. Don’t build anything until you know what you’re responding to.

Ask what matters for your customers. This is the question that gets skipped in the panic. Your competitor’s AI feature was designed for their customers, their use cases, their data. Your customers might not care about the same thing. Before you respond, talk to five customers. Ask them: “Our competitor just launched this. Is this something you need?” The answer is surprisingly often “no” or “sort of, but what I really need is this other thing.”

Build the thing that compounds for your business. If there is an AI feature worth building, build the one that makes your product uniquely better — not the one that copies your competitor’s marketing. The best AI features are built on proprietary data, proprietary workflows, or proprietary customer relationships. Your competitor can’t copy those any more than you can copy theirs.

The goal is not to match your competitor’s feature list. The goal is to build the AI capability that makes your customers more successful in ways only you can deliver. That’s the feature that compounds. That’s the feature that creates a moat. And it almost never looks like whatever your competitor just announced.

The board conversation

If the board is driving the panic — “why don’t we have what they have” — the answer is straightforward.

“We assessed their announcement. Here’s what it actually is [demo/beta/real but limited]. Here’s what our customers actually need, based on conversations with [specific customers]. Here’s the project we’re building instead — it targets [specific use case], costs [specific amount], and will be in production by [specific date]. It’s a better bet than chasing their press release because [specific reason: proprietary data advantage, higher-value use case, stronger customer pull].”

This is a better answer than “we’ll have our version in 8 weeks.” It shows judgment, not just speed. Boards value judgment.

The uncomfortable truth

Most AI press releases describe the future, not the present. Most “AI-powered” features are thin wrappers that solve a narrow problem and don’t scale. Most competitive advantages in AI come from operational excellence — evals, monitoring, feedback loops, data quality — not from who shipped the chatbot first.

Your competitor’s press release is not a threat. Your panic response to it might be. The companies that win with AI are the ones that ignore the noise and build the thing that matters for their customers. That takes discipline. It also takes the confidence to look at a press release, understand what’s actually there, and say: “That’s nice. Here’s what we’re going to build instead.”

tl;dr

The pattern. A competitor’s AI press release triggers a panic build that copies their marketing instead of solving your customers’ actual problems. The fix. Assess what’s real, ask your customers what they need, and build the AI feature that compounds for your business — not theirs. The outcome. You ship something that creates lasting value instead of a copycat demo that sits at 2% adoption for a year.

Hire the infra engineer before the ML engineer

Fri, 23 Aug 2024 00:00:00 +0000

Your first AI hire should not be someone who trains models. It should be someone who can deploy them, monitor them, and wake up when they break. We have watched this play out at a dozen companies now. The order matters.

The hiring mistake

A company decides to invest in AI. They open a req for a “Senior ML Engineer.” The job description mentions model training, fine-tuning, feature engineering, and maybe some research. They hire someone good — someone with a strong background in machine learning, papers on their resume, experience with PyTorch.

That person arrives on day one and asks reasonable questions. Where is the GPU cluster? How do we deploy models to production? What is the CI/CD pipeline for model artifacts? Where do experiment metrics get logged? What is the monitoring setup?

The answers are: we don’t have one, we haven’t figured that out yet, there isn’t one, nowhere, and there isn’t one.

So the ML engineer — the person you hired to improve models — spends their first six months writing Dockerfiles, setting up a model registry, building a deployment pipeline, configuring monitoring, and arguing with the platform team about Kubernetes resource limits.

This is a waste. Not because the work is unimportant — it is critical. But because you hired someone whose expertise is in modeling and asked them to do infrastructure. They will do it adequately. An infrastructure engineer would do it well.

What the ML engineer actually needs

An ML engineer is productive when the following things exist:

A way to deploy a model. Not “push a Docker image and open a PR to update the Kubernetes manifest.” A pipeline. Code goes in, an API endpoint comes out. Canary deployment. Rollback. The ML engineer should not have to think about load balancers.

A way to monitor a model. Request latency, error rates, input/output distributions, drift detection. Not just application-level monitoring — model-level monitoring. Is the distribution of predictions changing? Are confidence scores dropping? This is specialized infrastructure, but it is infrastructure.

A way to run experiments. A/B testing or shadow mode for new models. The ability to route a percentage of traffic to a new version and compare metrics. Without this, every model change is a yolo deploy.

A way to log and query predictions. Every prediction should be logged with its input, output, latency, and model version. This data is how the ML engineer diagnoses problems and measures improvements. Without it, they are guessing.

A way to manage training data. Versioned datasets, labeling pipelines, data quality checks. The ML engineer should be improving the model with better data and better architectures — not building the data pipeline from scratch.

None of these are ML problems. They are infrastructure problems. They require someone who thinks in terms of systems, pipelines, reliability, and operational excellence. Someone who has built and maintained production services. Someone who knows what a pager feels like.

The right first hire

The right first AI hire is a senior backend or infrastructure engineer who is curious about ML. Not an ML engineer who can tolerate infrastructure. The distinction matters.

This person has built production services before. They know how to set up CI/CD, monitoring, alerting. They can design a data pipeline. They can stand up a model serving layer — whether that is a FastAPI wrapper, a managed service like SageMaker endpoints, or a simple API gateway in front of an LLM provider. They understand operational concerns: what happens at 3am when the model serving pod OOMs? What happens when the upstream data source changes its schema?

They do not need to know how to train models. They need to know how to deploy, monitor, and operate them. They need to be curious enough about ML to understand the domain — to know why drift detection matters, why you cannot just A/B test a model like a button color, why latency percentiles matter more than averages for inference workloads.

This person builds the platform. When the ML engineer arrives — hire number two or three — they walk into a functioning environment. They can focus on what they are actually good at: improving the models. Their first week is running experiments, not writing Terraform.

The compound effect

The order creates a compound effect. When the ML engineer is productive from day one, you get model improvements faster. Those improvements produce results. Results justify more investment. More investment means more hires. The next hires — whether ML engineers, data engineers, or applied scientists — all benefit from the platform that hire number one built.

When you do it in the other order, you get the opposite. The ML engineer spends months on infrastructure. The infrastructure is adequate but fragile — built by someone whose heart is in modeling, not operations. When the second hire arrives, they inherit infrastructure that needs to be rebuilt. The compound effect runs in reverse.

We have seen this pattern at companies ranging from 50-person startups to 500-person mid-market companies. The ones that hired infra first shipped their first AI feature in 2-3 months. The ones that hired ML first shipped in 6-9 months — and then spent another 3 months stabilizing the infrastructure.

The objection

The objection we hear most often is: “But we need someone who understands ML to make architectural decisions. What if the infra engineer builds the wrong thing?”

This is a valid concern with a straightforward answer. The infra engineer does not need to work in a vacuum. You can get ML architecture guidance from a consultant, an advisor, or even a part-time hire. What you cannot easily outsource is the day-to-day work of building and maintaining production infrastructure. That requires someone embedded in the team, full-time, who owns the system.

The other objection: “We want to start with fine-tuning / training a custom model.” If this is genuinely your starting point — not an API-based AI feature but a custom model — then yes, you need ML expertise first. But most companies in 2024 are not training models. They are using APIs. They are building applications on top of foundation models. For this work, the infrastructure is the bottleneck, not the modeling.

The job description

If you are writing the req for your first AI hire, here is what it should look like:

Senior backend/infrastructure engineer with production experience
Has built and operated services at scale (you get to define what “scale” means for your context)
Familiar with ML concepts — does not need to train models but should understand the lifecycle
Comfortable with model serving infrastructure (Ray, TorchServe, Triton, or even just FastAPI)
Has opinions about monitoring and observability
Willing to carry a pager for the AI system

Notice what is absent: no mention of papers, no mention of research, no mention of model architectures. Those matter — for hire number two.

The takeaway

Your first AI hire builds the stage. Your second AI hire performs on it.

The heuristic: if your ML engineer is writing Dockerfiles, you hired in the wrong order.

tl;dr

The pattern. Companies open their first AI req for a Senior ML Engineer, who arrives to find no deployment pipeline, no monitoring, and no experiment infrastructure, and spends six months doing infrastructure work they were never hired to do. The fix. Make your first AI hire a senior backend or infrastructure engineer who understands ML concepts and can build the deployment, monitoring, and experiment platform that the ML engineer will actually need to be productive. The outcome. The ML engineer you hire second walks into a functioning environment, ships model improvements in their first week instead of their sixth month, and the entire AI investment compounds faster because the stage was built before the performer arrived.

Structured outputs don't fix structured thinking

Fri, 09 Aug 2024 00:00:00 +0000

JSON mode and function calling are great. But if the model doesn’t understand what you’re asking it to extract, you just get well-formatted garbage. We see this pattern constantly — teams ship structured outputs and assume the quality problem is solved.

The formatting problem is gone

For most of 2023, a significant chunk of LLM engineering was string parsing. You would ask the model to return JSON. Sometimes it did. Sometimes it wrapped it in markdown code fences. Sometimes it added a preamble. Sometimes the JSON was almost valid — a trailing comma here, a missing quote there.

Teams wrote fragile regex parsers. They retried on parse failures. They added “IMPORTANT: Return ONLY valid JSON” to their prompts in increasing font sizes.

Then structured outputs arrived — JSON mode, function calling, tool use with enforced schemas. The formatting problem vanished overnight. You define a schema, the model fills it in, the output parses every time. This was a genuine infrastructure win.

But it solved the wrong problem.

The thinking problem remains

Here is an extraction task we see regularly. A team wants to pull structured data from contracts — party names, effective dates, termination clauses, governing law. They define a JSON schema. They pass the contract to the model. They get back a perfectly formatted JSON object.

The party names are right 95% of the time. The effective date is right 90% of the time. The termination clause is right 70% of the time. The governing law is right 60% of the time.

Before structured outputs, the termination clause was wrong 30% of the time and the JSON was broken 15% of the time. Now the JSON is never broken and the termination clause is still wrong 30% of the time.

The formatting fix masked the extraction quality problem. The team’s error rate dropped — because parse failures went away — but the semantic accuracy did not change. They shipped it. Users started trusting the output because it looked clean and professional. Well-formatted JSON feels more reliable than a messy text blob, even when the content is identical.

This is the danger. Structured outputs increase trust without increasing accuracy.

Why the model gets it wrong

The model fails on extraction for the same reasons it always has. The information is ambiguous. The document uses domain-specific language the model has not seen enough of. The relevant clause is buried in a 40-page document and the model’s attention gets diluted. The schema asks for a field that requires inference, not extraction — “Is this contract auto-renewing?” is a judgment call, not a lookup.

None of these problems are formatting problems. Putting the answer in a JSON field does not make the model think harder about it. The model produces the same internal representation whether it outputs free text or structured JSON. The structured output layer is downstream of the thinking. It is a serialization step.

Think of it this way: if you ask someone who does not understand contracts to fill in a form about a contract, the form will be neatly filled in and mostly wrong. Giving them a better form does not help. Teaching them about contracts helps.

What actually improves extraction quality

Better prompts. This is unsexy but true. A prompt that explains what a termination clause is, what forms it can take, and what to do when it is ambiguous will outperform a terse prompt with a perfect schema every time. The schema tells the model what shape to produce. The prompt tells it what to think about.

Few-shot examples. Show the model 3-5 examples of inputs and correct outputs. Not synthetic examples — real ones, from your actual corpus, including the tricky cases. Few-shot examples communicate expectations more precisely than instructions. They show the model what “right” looks like in your domain.

Domain-specific validation. A well-structured output can be validated beyond “is this valid JSON.” Is the effective date in the future? Is the governing law a real jurisdiction? Is the extracted dollar amount within a plausible range? These checks catch errors that the model will make regardless of output format.

Decomposition. Instead of asking the model to extract 12 fields from a 40-page document in one pass, break it into steps. First, find the relevant section. Then extract from that section. This reduces the attention problem and gives you a chance to validate intermediate results.

Confidence calibration. Ask the model to rate its confidence on each field. This is not perfectly calibrated — models are notoriously overconfident — but it correlates with accuracy well enough to be useful. Flag low-confidence extractions for human review. This turns the system from “fully automated” to “automated with targeted human oversight,” which is almost always the right design for high-stakes extraction.

The organizational pattern

Here is the pattern we see play out. A team adopts structured outputs. Their parse-error rate drops to zero. They report a quality improvement to leadership. Leadership approves scaling the system to more document types. The team scales it. Accuracy on the new document types is poor — but since the output is always valid JSON, the failures are silent. They surface weeks later when a downstream consumer notices bad data.

The root cause is that the team measured format compliance and called it quality. These are different things. Format compliance is a necessary condition for a usable system. It is not a sufficient condition for a correct one.

The takeaway

Structured outputs are a serialization layer. They guarantee that the model’s answer fits your schema. They do not guarantee that the answer is right.

The heuristic: after you adopt structured outputs, your error rate on formatting should drop to zero. If your overall error rate drops by the same amount, you had a formatting problem. If it doesn’t, you have a thinking problem — and you need to solve it with better prompts, better examples, and better validation.

tl;dr

The pattern. Teams adopt JSON mode or function calling, watch their parse-error rate drop to zero, report a quality improvement to leadership, and scale a system where the outputs are well-formatted but the extractions are still wrong 30% of the time. The fix. Treat structured outputs as the serialization layer they are, then separately improve extraction quality with better prompts, few-shot examples from your actual corpus, domain-specific validation, and decomposition of complex documents into targeted steps. The outcome. You stop conflating format compliance with correctness, silent failures surface before they reach downstream consumers, and the system earns the trust that clean JSON was giving it for free.

The demo is not the product

Fri, 26 Jul 2024 00:00:00 +0000

Getting an LLM to do the thing once in a notebook is the easy part. The hard part is getting it to do the thing reliably, at scale, for every user, on every edge case, at 3am. Most teams confuse the first part for the second.

The notebook moment

Every AI project has a notebook moment. Someone opens a Jupyter notebook, pastes in some data, writes a prompt, hits shift-enter, and the output is shockingly good. The room gets excited. A Slack message goes out: “Look what I got working.” A demo gets scheduled for the end of the week.

The demo goes well. Leadership is impressed. A roadmap appears. Ship date: six weeks.

Here is the problem. That notebook moment — the one that created all the excitement — represents maybe 10% of the work. The other 90% is everything the notebook did not have to deal with.

The 90%

Error handling. The demo showed the happy path. In production, the API will return 429s. The model will occasionally produce unparseable output. The input data will contain characters that break your prompt template. The context window will overflow on long documents. Each of these needs a specific, tested recovery path.

Edge cases. The demo used 5 representative examples. Production will see thousands of variations, including the ones nobody anticipated. The contract written in French. The resume with no work experience section. The support ticket that is actually a love letter. Your system needs to handle all of them — or at least fail gracefully on the ones it cannot.

Latency. The demo ran synchronously and nobody cared that it took 8 seconds. In production, 8 seconds is an eternity. Now you need streaming, caching, prompt optimization, maybe a smaller model for simple cases and a larger one for hard cases. This is an architecture decision that touches every layer of the stack.

Monitoring. In the demo, a human looked at the output and said “that’s good.” In production, nobody is looking. You need automated quality checks, drift detection, cost tracking per request, latency percentiles, error rates by input type. You need alerts. You need dashboards. You need someone who looks at the dashboards.

Eval suites. The demo was evaluated by vibes. Production needs a test suite — a set of inputs with expected outputs that you run on every change. Building this suite is unglamorous work. Maintaining it is worse. But without it, you have no idea whether your next prompt change made things better or worse.

Graceful degradation. What happens when the model is down? What happens when latency spikes to 30 seconds? What happens when your vector store returns no results? The demo did not address any of these because they did not happen during the demo. In production, they will happen on a Tuesday afternoon when half the team is on PTO.

User feedback loops. The demo had no feedback mechanism. In production, you need to know when the system is wrong — and users will not tell you unless you make it trivially easy. Thumbs up/down, explicit corrections, implicit signals from behavior. This data is how you improve. Without it, you are flying blind.

Cost management. The demo made 5 API calls. Production will make 50,000 per day. At $0.01 per call, that is $500/day, $15k/month. Did the business case account for that? What about the calls that retry? What about the calls that hit the large model because the small model was not confident enough? Cost is an ongoing engineering problem, not a line item.

The mid-2024 demo wave

In mid-2024, the AI demo wave crested. Twitter was full of 30-second videos showing remarkable things: agents booking flights, copilots writing legal briefs, chatbots diagnosing medical conditions. Each demo was real. The model really did produce that output, in that context, on that input.

Most of them never shipped. Not because the technology did not work — it did, in the demo. They did not ship because the team that built the demo was not the team that could build the product. Or the team could build the product but the timeline assumed the demo was 80% of the work instead of 10%.

The ones that did ship — the ones that are still running — had something in common. They were built by teams that treated the notebook moment as the starting line, not the halfway point.

How to close the gap

Budget 10x the demo effort for production. If the demo took one engineer two weeks, the production system will take one engineer five months — or three engineers two months. This is not pessimism. This is base rates from every AI project we have seen ship successfully.

Build the eval suite before you build the product. The eval suite defines what “working” means. Without it, you are shipping based on vibes and hoping for the best. Start with 50 test cases. Get to 200 before you launch. Grow it every time you find a failure.

Design for failure from day one. Every LLM call can fail, return garbage, or take too long. Your architecture should assume this. Fallback paths, timeouts, retry logic, human-in-the-loop escalation — these are not nice-to-haves. They are table stakes.

Staff it like a production system. AI features need on-call rotations, incident response, and operational runbooks just like any other production system. The model is a dependency. It will break. Someone needs to wake up.

Separate the research from the engineering. The person who built the demo in a notebook is probably great at prompt engineering and model selection. They may not be the right person to build the production deployment pipeline. These are different skills. Both are necessary.

The heuristic

When someone shows you a demo that works, ask one question: “What happens when this is wrong?” If the answer is “it won’t be,” you are looking at a demo. If the answer is a specific, boring description of error handling, fallbacks, and monitoring — you are looking at a product.

tl;dr

The pattern. Teams mistake the notebook moment — getting an LLM to do the thing once on a clean example — for most of the work, then schedule a six-week ship date for what is actually a five-month engineering project. The fix. Budget 10x the demo effort for production, build the eval suite before you build the product, and staff the feature with on-call rotation and incident response from day one. The outcome. Features that ship are actually reliable: they handle the French contract, the malformed resume, and the 3am API timeout — not just the five representative examples that looked great in the demo.

Your board wants an AI strategy by Thursday

Thu, 11 Jul 2024 00:00:00 +0000

You just got the calendar invite. Thursday, 2pm. “AI Strategy Discussion.” The board wants to know what you’re doing about AI. You have 72 hours.

This happens constantly now. A board member read something, attended a dinner, talked to a portfolio company that’s “doing amazing things with AI.” Now they want to know your plan. The implied question: are we falling behind?

Most teams respond in one of two ways. They grab a vendor’s pitch deck and present it as strategy. Or they spend 60 hours building a 40-slide fantasy about an AI-powered future that would take three years and $10M to build. Both are wrong. The vendor deck is someone else’s strategy. The fantasy deck is a wish list, not a plan.

The right answer fits on one page. Here’s how to write it.

Start with the honest self-assessment

Before you write anything, answer one question honestly: where are you today?

Most companies are at zero. No models in production. No eval framework. No one on the team who has shipped an AI system. This is fine. Most companies are at zero. Pretending otherwise — claiming you’re “experimenting with AI” because someone ran a ChatGPT demo last quarter — wastes the board’s time and yours.

Here’s a simple maturity framework. Not for the slides — for your own clarity.

Level 0: Awareness. You know AI exists. Your team uses ChatGPT for personal productivity. There are no AI systems in your product or operations. This is where most companies are, and saying so is not embarrassing. It’s honest.

Level 1: Experimentation. You’ve built a prototype. Maybe a RAG system over your docs, maybe a classification model for support tickets. It works in demos. It’s not in production. You’ve learned something, but you haven’t shipped anything.

Level 2: Production. You have an AI system running in production, handling real traffic, with monitoring and a human in the loop. You know what it costs and how it performs. You’ve learned what breaks.

Level 3: Operational. You have multiple AI systems in production. You have an eval framework. You have a team that knows how to build, deploy, and maintain these systems. You’re making decisions about what to build next based on data from what you’ve already built.

Be honest about where you are. If you’re at Level 0, say so. The board would rather hear “we’re at zero, and here’s our plan to get to one” than “we’re exploring synergies across our AI-powered innovation pipeline.” One of those is a starting point. The other is noise.

Pick the first bet

The board does not need a comprehensive AI strategy. They need to know what you’re going to do first, why, and how you’ll know if it worked.

Pick one use case. One. Not three “strategic pillars.” Not a “phased roadmap” with 12 initiatives. One thing.

The criteria for picking it are simple. It should be internal — not customer-facing — so the blast radius of failure is small. It should be measurable — you can quantify the current cost or performance. It should be achievable — a small team can build a working prototype in 6 to 10 weeks. And it should teach you something — the process of building it should reveal whether your data is ready, whether your team can execute, and whether AI actually works for your problem domain.

Good first bets: automating a manual data-processing step, classifying or routing incoming requests, extracting structured data from unstructured documents, summarizing long-form content for internal review.

Bad first bets: a customer-facing chatbot, an AI-powered product feature, anything that requires real-time performance or has regulatory implications. Those are fine projects. They’re terrible first projects.

The board will push back. They’ll say “that sounds small.” That’s the point. Small is how you learn without betting the company. Small is how you build the muscle to do the big thing later. Tell them: “This is the project that teaches us whether we can do the bigger ones.”

Size the investment

The board thinks in dollars. Give them dollars.

A typical first AI project — one engineer or a small team, 8 to 10 weeks, cloud compute, API costs — runs $30K to $80K all-in. That’s the pilot. Be specific: “Two engineers for 8 weeks, $15K in API and compute costs, total investment approximately $55K.”

Then tell them what happens after the pilot. If it works — meaning it hits the success criteria you defined — the production build is another $80K to $150K over 8 to 12 weeks. If it doesn’t work, you’ve spent $55K and learned something concrete about your AI readiness. That’s a bounded bet.

Compare this to the alternative: hiring a consulting firm to write an AI strategy. That costs $200K to $500K and produces a document. You can produce a working prototype for less than the cost of the strategy deck, and the prototype teaches you more than any deck ever will.

The three things the board actually cares about

Strip away the jargon, and the board has three questions. Answer these and you’re done.

Risk. “What’s the worst case?” The worst case is you spend $55K and the project doesn’t work. You’ll know within 8 weeks. There’s no existential risk. The technology risk is low — you’re using commodity models and standard infrastructure. The real risk is organizational: can your team learn to build and operate AI systems? That’s exactly what the pilot answers.

Timeline. “When will we see results?” Pilot results in 8 to 10 weeks. Go/no-go decision at the end of the pilot. If it’s a go, production deployment in 12 to 16 weeks after that. First measurable impact within 6 months of starting. Don’t promise faster. Don’t promise slower. Be specific.

Competitive position. “Are we falling behind?” This is the question that triggered the meeting. Answer it honestly. Your competitors who have announced AI features are mostly in one of two states: they shipped something small and real (good for them, you can catch up), or they shipped a press release (ignore it). The competitive advantage in AI does not come from who announces first. It comes from who builds the operational muscle to ship, measure, and iterate. That muscle takes time. Starting now — with a small, real project — is how you build it.

The one-page memo

Here’s the format. One page, four sections.

Where we are. Two sentences about your current AI maturity. Be honest.

What we’ll build first. A description of the pilot — what it does, why this use case, what the success criteria are.

What it costs. Pilot investment, timeline, what happens if it works, what happens if it doesn’t.

How we’ll know it worked. Specific metrics. “Process time drops from 8 minutes to 2 minutes per unit.” “Classification accuracy exceeds 85% on a held-out test set.” “Cost per processed document drops from $3.20 to $0.80.”

That’s it. No technology deep dives. No vendor comparisons. No slides about the history of machine learning. The board doesn’t need to understand how AI works. They need to understand what you’re going to do, what it costs, and how you’ll measure success.

After the meeting

The board will either say yes, say no, or ask clarifying questions. If they say yes, start the pilot on Monday. Don’t let it become a planning exercise. The plan is simple — build the thing, measure it, decide.

If they ask questions, the most common ones are: “Why aren’t we doing something bigger?” (because we need to learn to walk before we run), “What is [competitor] doing?” (we’ll address that in the next section of the memo), and “Do we need to hire an AI team?” (not yet — the pilot will tell us what skills we’re missing).

If they say no — which is rare once you’ve framed it as a $55K bounded bet — ask what would change their mind. Usually the answer reveals a concern you can address.

The point of the memo is not to be comprehensive. It’s to be credible. A one-page memo that says “here’s what we’ll do, here’s what it costs, here’s how we’ll know” is more credible than a 40-slide deck that says “AI will transform everything.” The board has seen enough fantasy decks. Give them a plan.

tl;dr

The pattern. Teams respond to board AI pressure with either a vendor pitch deck or a 40-slide fantasy, neither of which is an actual strategy. The fix. Write a one-page memo with an honest self-assessment, one specific pilot, a bounded investment, and measurable success criteria. The outcome. The board approves a small bet that teaches you more about your AI readiness than any strategy document ever could.

Benchmarks are vanity metrics

Fri, 28 Jun 2024 00:00:00 +0000

Every model release in mid-2024 came with a table. Rows of benchmark names — MMLU, HellaSwag, HumanEval, GSM8K, ARC-Challenge. Columns of numbers. The new model’s numbers were higher than the previous model’s numbers, or higher than the competitor’s numbers, or — if neither of those was true — higher on a carefully selected subset of benchmarks.

The tables were impressive. They were also almost entirely useless for making production decisions.

What benchmarks measure

Public benchmarks measure general capability on standardized tasks. MMLU tests broad knowledge across academic subjects. HumanEval tests code generation on isolated programming problems. HellaSwag tests commonsense reasoning in sentence completion. GSM8K tests grade-school math.

These are real capabilities. They correlate, loosely, with general model quality. A model that scores poorly on all of them is probably not a good model. A model that scores well on all of them is probably a decent model.

But “probably a decent model” is not a production decision. A production decision is: which model, at which price point, at which latency, performs best on my specific task?

And for that question, public benchmarks tell you almost nothing.

The gap between general and specific

Here is a thing we have measured directly, across multiple client engagements: two models with a 2-point difference on MMLU can have a 20-point difference on a task-specific eval.

This is not an exaggeration. It’s not an edge case. It’s the norm.

A model that is 3% better at broad academic knowledge can be 30% worse at extracting line items from invoices. A model that scores higher on code generation benchmarks can be worse at generating code in your specific framework, with your specific conventions, against your specific APIs.

The reason is straightforward. Public benchmarks are averages across broad categories. Your use case is a specific point in a vast capability space. The average tells you very little about the specific point.

Consider what it would mean in other domains. You wouldn’t pick a database by looking at TPC-C benchmarks alone. You’d run your workload on the candidates and measure. You wouldn’t pick a frontend framework by looking at synthetic render benchmarks. You’d prototype with your actual components and measure. Model selection should work the same way.

The benchmark culture

Mid-2024 had a distinctive culture around model releases. A new model would drop. Twitter would erupt with benchmark comparisons. Hot takes would fly about which model was “better.” Teams would start migration discussions based on the benchmark table in the announcement blog post.

This is backwards. The benchmark table is marketing material. It is not evaluation. The model provider chose which benchmarks to highlight. They tuned for those benchmarks. They cherry-picked the comparison points. This is not nefarious — it’s what every company does with every product launch. But treating marketing material as engineering data is a mistake.

The more subtle problem: benchmark numbers create a false sense of precision. “Model A scores 87.3 on MMLU, Model B scores 85.1.” That 2.2-point difference feels meaningful. It is not meaningful for your production use case. The confidence interval on “how well will this model perform on my specific task” is vastly wider than 2.2 points.

Why your task is special

Every team thinks their use case is standard. “We’re just doing summarization.” “It’s basic classification.” “We’re just extracting entities.”

Your use case is not standard. Your summarization has specific length requirements, a specific tone, specific things that must be included, and specific things that must not. Your classification has categories that overlap in domain-specific ways. Your entity extraction deals with formats and edge cases that no benchmark covers.

The gap between the generic task and your specific implementation of that task is where model performance varies wildly. And it’s the gap that public benchmarks don’t measure, because they can’t — they’re generic by definition.

How to build a task-specific eval

You don’t need hundreds of examples. You need 50 good ones.

Start with production data. Pull real inputs from your system — or realistic synthetic ones if you’re pre-launch. Don’t invent examples from scratch. Real data has real messiness, and that messiness is where models differ.

Label the outputs yourself. Have a domain expert — someone who understands what good looks like — review model outputs and rate them. A simple 1-5 scale works. “Would you be comfortable showing this to a user?” works even better.

Cover the edges. Don’t just test the happy path. Include the inputs that are ambiguous, malformed, adversarial, or just weird. These are where models diverge most.

Automate the run. Write a script that sends your 50 examples to each candidate model, collects the outputs, and presents them for review. This takes a few hours the first time. After that, it takes minutes.

Track over time. Every time a new model drops, run your eval. The number you care about is not the public benchmark — it’s your benchmark. “Model A scores 4.2/5 on our task, Model B scores 3.8/5 on our task.” That’s a production decision.

The meta-benchmark problem

There’s a subtler issue with public benchmarks: they become targets. Once a benchmark is widely used, model providers optimize for it. Not through outright data contamination — though that happens — but through training emphasis. If MMLU is the benchmark everyone watches, you allocate more training compute to the kinds of knowledge MMLU tests.

This is Goodhart’s Law applied to AI: when a measure becomes a target, it ceases to be a good measure. The benchmark scores go up, but the improvement doesn’t transfer uniformly to all tasks. It transfers most to tasks similar to the benchmark and least to tasks that are different.

Your production task is, almost certainly, different from any public benchmark. Which means the benchmark improvement you see in the release blog post overstates the improvement you’ll see in practice.

When benchmarks are useful

Benchmarks are not useless. They’re useful for two things.

Filtering. If a model scores poorly across all major benchmarks, you can probably skip it. Benchmarks are a reasonable lower bound on capability. They’re just not a useful upper bound on task-specific performance.

Tracking trends. Watching benchmark scores over time — across model families and providers — tells you how fast the field is moving and which capabilities are improving fastest. This is useful for strategic planning. It is not useful for model selection.

For everything else — for the actual decision of which model to deploy in production — you need your own eval.

The heuristic

Never make a model selection decision based on public benchmarks alone. Build a task-specific eval with 50 examples from your domain. Run every candidate model against it. Use the results — not the benchmark table — to decide.

If you don’t have time to build a 50-example eval, build a 10-example eval. If you don’t have time for 10, something is wrong with your priorities. You’re about to put a model in front of users and you can’t spend an afternoon checking whether it’s good at the thing you’re using it for.

The model that wins on MMLU might lose on your task. The model that loses on HumanEval might be the best at your specific code generation problem. You will not know until you measure. Measure.

tl;dr

The pattern. Teams use public benchmark scores — MMLU, HumanEval, GSM8K — to make model selection decisions, not realizing that a 2-point gap on a generic benchmark can flip to a 20-point gap in the other direction on their specific task. The fix. Build a 50-example task-specific eval using real inputs from your domain, have a domain expert label the outputs, and run every candidate model against it before making any production decision. The outcome. Model selection becomes a measurement exercise rather than a marketing exercise, and you stop discovering mid-migration that the “better” model is actually worse for the thing you’re using it for.

Your model is not your moat

Fri, 14 Jun 2024 00:00:00 +0000

In the first half of 2024, we watched teams agonize over model selection. Weeks of evaluation. Benchmark comparisons. Internal bake-offs. Spreadsheets with weighted scoring rubrics. The decision felt momentous — like choosing a database or a cloud provider. A decision you’d live with for years.

It wasn’t. Most of those teams switched models within 6 months. Some switched twice. The models got cheaper, or faster, or a new one came out that was better for their specific use case. The decision that felt permanent was temporary.

The thing they actually lived with — the thing that was hard to change — was everything around the model.

The commodity layer

Models are commodities. Not yet in the economic sense — pricing varies, capabilities differ, there are real tradeoffs. But in the architectural sense. They are interchangeable components with a standard interface: text in, text out. Some are better at reasoning. Some are faster. Some are cheaper. The rankings shift every quarter.

If your architecture is clean, swapping models is a configuration change. If your architecture isn’t clean — if you’ve hardcoded model-specific prompt patterns, relied on undocumented behaviors, or built your system around a specific model’s quirks — swapping models is a rewrite.

The teams that treated model selection as the primary technical decision ended up coupling themselves to that decision. The teams that treated the model as a replaceable component ended up with systems that could adapt when the market shifted.

The layers that compound

The model doesn’t compound. It depreciates. Today’s best model is next quarter’s second-best model. But the infrastructure around it — done well — compounds.

The data pipeline. How do you get data into the system? How do you clean it, chunk it, embed it, index it? How do you handle updates? How do you deal with deletions? This is plumbing. It is unglamorous. It is the difference between a system that works on demo data and a system that works on production data. And it takes months to get right — not because it’s technically hard, but because production data is messy in ways you don’t discover until you’re in production.

The eval suite. How do you know if your system is working? Golden sets, regression tests, model-as-judge evaluations, A/B tests. Every test you write is an asset. Every eval you run generates signal. Over time, your eval suite becomes your institutional knowledge about what “good” means for your system. It’s the thing that lets you change the model, the prompt, the retrieval, the post-processing — and know whether the change helped or hurt.

The deployment infrastructure. How do you serve the model? How do you handle failures, retries, timeouts, rate limits? How do you do canary deployments? How do you roll back? This is standard infrastructure engineering applied to a new component. The teams that already had mature deployment practices adapted quickly. The teams that didn’t — the ones treating the AI feature as a special snowflake — built fragile systems that were painful to operate.

The feedback loop. How do you learn from production? How do you capture user behavior, error patterns, edge cases? How do you turn that into improvements? The feedback loop is the meta-system — the thing that makes everything else get better over time. Without it, you’re flying blind. With it, every day in production makes your system a little better.

The wrong optimization

The teams that optimized for model quality spent their time on prompt engineering, model evaluation, and benchmark comparison. These are useful activities. But they have diminishing returns and no compounding effect. The perfect prompt for GPT-4 is useless when you switch to Claude. The benchmark comparison is stale in a month.

The teams that optimized for operational quality spent their time on pipelines, evals, deployment, and feedback. These are boring activities. But they compound. The eval suite you build for GPT-4 works for Claude. The deployment infrastructure you build for one model serves the next. The feedback loop you establish gets richer every week.

There’s a useful analogy to web development in the 2000s. Early on, teams agonized over which web framework to use — the choice felt permanent. Over time, the framework became a replaceable component. What mattered — what compounded — was the deployment pipeline, the test suite, the monitoring, the team’s operational muscle. The teams that invested in those things could switch frameworks without rewriting their business logic.

We’re at the same inflection point with AI. The model is the framework. It matters, but it’s not the moat.

How to tell where you are

Here’s a quick diagnostic. Answer these questions about your AI system:

Can you swap models in under a day? If not, you’re coupled to your model. Decouple.

Can you tell, within an hour of deploying a change, whether the system got better or worse? If not, you don’t have evals. Build them.

Can you roll back a bad deployment in under 5 minutes? If not, you don’t have deployment infrastructure. Standard stuff — build it.

Can you point to a specific improvement that came from production feedback in the last month? If not, you don’t have a feedback loop. Start one.

If you answered “no” to more than one of these, you’re investing in the wrong layer. You’re polishing the model while the operational foundation rusts.

The heuristic

Spend 20% of your time on model selection and prompt engineering. Spend 80% on everything else — the pipeline, the evals, the deployment, the feedback loop. The model is what you ship today. The infrastructure is what lets you ship better tomorrow.

When you catch yourself debating which model to use, ask: does it matter? If the operational infrastructure is solid, you can try both and measure. If the infrastructure isn’t solid, it doesn’t matter which model you pick — you won’t be able to tell if it’s working anyway.

tl;dr

The pattern. Teams spend weeks on model selection and prompt engineering — work that depreciates every quarter — while neglecting the data pipeline, eval suite, deployment infrastructure, and feedback loop that compound in value over time. The fix. Spend 20% of your effort on model selection and 80% on the operational layer: decouple your architecture so a model swap is a config change, build evals that survive model migrations, and establish a feedback loop that learns from every day in production. The outcome. When a better or cheaper model arrives — and it will — you can evaluate and switch in a day instead of a rewrite, and the rest of your infrastructure keeps getting better regardless of which model sits inside it.

You don't need a chief AI officer

Thu, 30 May 2024 00:00:00 +0000

In early 2024, the CAIO became the hot hire. Chief AI Officer. The title started appearing on LinkedIn, in board decks, in recruiter DMs. The logic seemed sound: AI is important, it touches everything, it needs senior leadership. Give it a C-suite seat.

We’ve seen how this plays out across a dozen organizations. The short version: the title usually creates more confusion than it resolves.

The problem the title is solving

The CAIO hire is a response to a real problem. AI initiatives are scattered across the org. Product has a chatbot project. Engineering is building a RAG pipeline. Data science is fine-tuning a model. The CEO keeps asking “what’s our AI strategy” and nobody can give a coherent answer.

So the org does what orgs do when they have a coordination problem: they hire a coordinator. They give them a title and a mandate and hope that the title does the coordination work.

It doesn’t. Because the problem isn’t that nobody is in charge of AI. The problem is that the org hasn’t decided what AI is — is it a product, a capability, or an infrastructure layer? That decision determines where AI lives in the org chart, and the answer is different for every company.

The three flavors

Every AI initiative falls into one of three buckets, and each bucket has a natural home.

AI as product. You’re building AI-powered features that your customers use directly. A summarization tool. A search experience. A conversational interface. This is a product problem. It needs product management, design, user research. It reports to product or to a GM. It does not need a CAIO — it needs a product leader who understands AI well enough to make scope decisions.

AI as infrastructure. You’re building the platforms and pipelines that enable AI across the org. Embedding infrastructure, model serving, eval frameworks, feature stores. This is an engineering problem. It reports to the VP of Engineering or the Head of Platform. It does not need a CAIO — it needs a strong infra team with an AI charter.

AI as strategy. You’re making bets about how AI changes your market, your competitive position, your cost structure. Which products to build. Which to kill. Where to invest. This is a CEO problem. It reports to the CEO. It might need a senior advisor — but advisor and officer are different things.

Most orgs have a mix of all three. The question is which one is primary. If you’re a product company adding AI features, AI-as-product is primary. If you’re a platform company enabling AI for your customers, AI-as-infrastructure is primary. If you’re a legacy company trying to figure out whether AI changes your business model, AI-as-strategy is primary.

The CAIO title papers over this question. It says “AI is its own thing” when the more useful answer is “AI is a product thing” or “AI is an infra thing” or “AI is a strategy thing.”

What we’ve seen in practice

At Series B-D companies — the sweet spot where AI is a real investment but the org is still small enough to see the whole picture — the CAIO hire tends to play out in one of three ways.

The orphan. The CAIO reports to the CEO but doesn’t own a team. They produce strategy documents and recommendations. They attend leadership meetings. But the actual AI work happens in product and engineering teams that report to other leaders. The CAIO has influence but not authority. They become a consultant inside their own company — and an expensive one.

The empire builder. The CAIO gets a team and a budget. They start pulling AI work out of product and engineering and into their own org. Now you have a central AI team that builds features for product teams — a services model. This works for approximately six months, until the product teams start complaining about priorities, the AI team becomes a bottleneck, and the roadmap is a negotiation exercise.

The shadow CTO. The CAIO starts making technical decisions that overlap with the CTO’s domain. Model selection, infrastructure choices, vendor relationships, hiring. Scope creep is inevitable because AI touches everything. Now you have two technical leaders with overlapping mandates and neither of them is happy about it.

None of these are inevitable. But they’re common enough that we flag them as risks whenever a client mentions the CAIO title.

What actually works

The pattern we’ve seen work — consistently, across different company sizes and stages — is simpler than a C-suite hire.

A senior IC or director with a clear charter. Not a VP. Not a C-level. Someone who is deep enough technically to make architecture decisions and senior enough organizationally to have access to leadership. They own the AI platform — the shared infrastructure that multiple teams use. They don’t own AI features. Features belong to product teams.

A dotted line to the CEO or CTO. The AI lead has a regular cadence with leadership — biweekly or monthly — to report on what’s working, what’s not, and what decisions need to be made. They don’t need a C-suite title to have this access. They need a standing meeting and an expectation of candor.

An AI council, not an AI org. A lightweight coordination body that meets regularly. Representatives from product, engineering, data, and the AI lead. They share learnings, align on infrastructure investments, and identify duplication. This is governance without empire-building.

Embedded AI engineers. Instead of a central AI team that builds for everyone, put AI-skilled engineers on product teams. They report to the product team’s engineering manager. They use the shared AI platform. They’re close to the problem and close to the user. The AI lead hires them, mentors them, and sets technical standards — but doesn’t own their roadmap.

This model isn’t glamorous. It doesn’t produce a press release about your bold new hire. But it ships AI features faster, with less organizational friction, than a CAIO typically does.

The hiring signal

When a company tells us they’re hiring a CAIO, we ask three questions:

What will this person own that nobody currently owns? If the answer is “AI strategy,” we dig deeper. Strategy is an output, not a job. Who will execute the strategy? If it’s existing teams, you don’t need a CAIO — you need a strategy engagement and an execution plan.

What decisions will this person make that nobody currently makes? If the answer is model selection, vendor management, and infrastructure investment — those are CTO decisions. If the answer is AI product roadmap — those are product leadership decisions. If the answer is “all of the above,” you’re describing a second CTO with a different title.

What does success look like in 12 months? If the answer is vague — “we have a clear AI strategy,” “we’ve shipped AI features,” “we’re an AI-first company” — you don’t have a role. You have a wish.

The heuristic

Before you hire a CAIO, answer this question: is your primary AI problem strategy, execution, or coordination?

If it’s strategy, you need a senior advisor and a few weeks of focused work — not a full-time hire. If it’s execution, you need AI engineers on your product teams and a platform to support them. If it’s coordination, you need an AI lead with a dotted line and a standing meeting.

The title you give them matters less than the mandate. And the mandate matters less than the org structure that supports it. Fix the org first. The title will be obvious.

tl;dr

The pattern. Companies hire a Chief AI Officer to resolve an AI coordination problem, but the title papers over the real question — whether AI is a product, infrastructure, or strategy problem — and typically produces an orphan, an empire builder, or a shadow CTO. The fix. Decide first whether your primary AI problem is strategy, execution, or coordination, then staff accordingly: a senior advisor for strategy, embedded AI engineers for execution, and an AI lead with a standing meeting and a dotted line for coordination. The outcome. AI features ship faster with less organizational friction, and nobody is spending a C-suite salary producing decks that product and engineering teams quietly ignore.

The AI business case your CFO will actually approve

Wed, 15 May 2024 00:00:00 +0000

Most AI business cases die in the CFO’s inbox. Not because the CFO doesn’t believe in AI. Because the business case reads like a technology pitch instead of a financial argument.

We see this constantly. A team spends weeks building a prototype, gets excited, writes up a proposal titled something like “AI-Powered Document Processing Platform.” The deck has architecture diagrams, model comparisons, a slide about “the future of work.” It lands on the CFO’s desk and gets the same response every time: “What does this save us, and how do you know?”

The team doesn’t have a good answer. The project stalls.

This is fixable. But the fix starts before you write the business case — it starts with picking the right problem.

The mistake everyone makes

The most common opener in an AI business case is some version of “AI will transform our operations.” This is the wrong frame. Not because it’s untrue — it might be true eventually — but because it’s unfundable. “Transform” is not a line item. You can’t model it. You can’t measure it. You can’t hold anyone accountable for it.

CFOs fund things that reduce cost, increase revenue, or reduce risk — on a specific process, with a measurable baseline, on a known timeline. That’s it. Everything else is a research project, and research projects get cut in the next budget cycle.

The teams that get AI projects funded don’t talk about transformation. They talk about a process that costs $X today and will cost $Y after the project ships. The gap between X and Y is the business case. Everything else is decoration.

Pick the right first use case

Not every process is a good candidate for your first AI project. The best first use case has four properties.

High volume. You want a process that happens hundreds or thousands of times a month. High volume means the savings multiply. It also means you have data — you can measure the current state, and you have enough examples to train or test the system. A process that happens three times a quarter is a terrible first AI use case, no matter how painful it is.

Measurable. You need to be able to measure the current cost — in time, in labor, in error rate, in dollars. If you can’t measure it today, you can’t prove you improved it tomorrow. This sounds obvious. It eliminates about half the use cases teams pitch.

Low risk. Your first AI project is a proof point. If it fails, it should fail quietly. Don’t pick the use case where a wrong answer triggers a regulatory violation or loses a customer. Pick the one where a wrong answer means a human reviews it and fixes it. You want the stakes to be low enough that you can ship something imperfect and iterate.

Boring. The best first AI use case is boring. It’s data entry. It’s document classification. It’s extracting fields from invoices. It’s routing support tickets. These are not impressive demos. They are impressive P&L impacts. The boring use case is the one the CFO funds because the math is obvious and the downside is manageable.

If your first proposed AI project is a customer-facing chatbot — stop. Chatbots are high-risk, hard to measure, and the failure mode is public embarrassment. Save that for project three.

Establish the baseline before you build

This is the step most teams skip, and it’s the step that kills the business case.

Before you write a line of code, measure the current process. How long does it take? How many people touch it? What’s the error rate? What does it cost per unit? You need these numbers, and you need them to be defensible — not estimates from a brainstorming session, but actual measurements from actual work.

Here’s a simple approach. Pick one week. Track the process in detail. Count the inputs, count the outputs, measure the time, log the errors. Do the math. “This process handles 2,400 invoices per month. Each invoice takes an average of 8 minutes of human review. That’s 320 hours per month. At a fully loaded cost of $45/hour, that’s $14,400 per month — $172,800 per year.”

Now you have a number. The CFO understands numbers. The question becomes: “Can we reduce 320 hours to 80 hours?” That’s a question worth answering.

Without the baseline, your business case is “AI will make this faster.” With the baseline, your business case is “We spend $172,800 per year on this process and we can reduce it to $43,200 with 80% automation at a measured accuracy of 94%.” One of those gets funded.

Frame the pilot as a bet with a known downside

CFOs are not allergic to risk. They are allergic to unbounded risk. “We need $2M to build an AI platform” is unbounded. “We need $40K over 8 weeks to test whether we can automate invoice classification — if it doesn’t work, we stop” is a bet with a known downside.

Frame your pilot this way:

The investment. A specific dollar amount — engineering time, API costs, maybe a contractor. Keep it small. The goal is not to build the production system. The goal is to prove the hypothesis.

The hypothesis. One sentence. “We believe we can classify 80% of incoming invoices correctly with no human review.” Not “AI will transform our accounts payable.” A testable claim.

The timeline. 6 to 10 weeks. Long enough to build something real. Short enough that the CFO doesn’t worry about scope creep. If you can’t show results in 10 weeks, you picked the wrong use case.

The kill criteria. This is the part most teams leave out, and it’s the part the CFO cares about most. What would cause you to stop? “If accuracy is below 75% after 6 weeks of iteration, we stop and reallocate the team.” This shows you’re being honest, not selling.

The graduation path. If the pilot works, what happens next? Not a vague “we’ll scale it” — a specific plan. “If accuracy exceeds 85%, we’ll spend $120K over the next quarter to build the production integration with our ERP system. Expected annual savings: $129,600. Payback period: 11 months.”

That’s a business case. Cost, hypothesis, timeline, downside, upside. The CFO can model it, question it, and approve it — all in the same meeting.

What the CFO actually needs to see

The CFO does not need a slide about large language models. They do not need a competitive landscape of AI vendors. They do not need a paragraph about how GPT-4 works.

They need four things.

A P&L impact on a specific line item. “Accounts payable processing costs $172,800/year. This project reduces it to $43,200/year. Net savings: $129,600/year.” Not “AI will create efficiencies across the organization.” A line item.

A credible cost estimate. “The pilot costs $40K. The production build costs $120K. Ongoing costs are $1,200/month in API fees and $8K/month in partial FTE for monitoring.” If you can’t estimate the ongoing cost, you haven’t thought about it enough.

A risk assessment they can believe. Don’t say “no risk.” Say “the downside is $40K and 8 weeks of one engineer’s time. We’ve defined kill criteria. If the accuracy isn’t there, we stop.” Honesty about risk builds more trust than optimism about outcomes.

A timeline with milestones. “Week 4: baseline model tested against 200 historical invoices. Week 8: pilot running on live invoices with human review. Week 10: go/no-go decision.” Not “Q3: AI implementation.” Milestones.

The meta-lesson

The business case is not about AI. It’s about the process. The CFO doesn’t care whether you use a language model, a rules engine, or an army of trained pigeons. They care about the cost of the process today, the cost of the process after, and whether you’ve thought carefully about the path from here to there.

If your business case works just as well with the word “AI” removed, you’ve written a good business case. If removing “AI” makes it fall apart — if the entire argument rests on the novelty of the technology rather than the economics of the process — you don’t have a business case. You have a technology pitch. And technology pitches don’t survive the CFO’s inbox.

The teams that get AI projects funded are not the ones with the best demos. They’re the ones who did the homework — measured the baseline, sized the bet, defined the downside, and showed the math. The math is what gets approved. Everything else is a slide deck.

tl;dr

The pattern. Teams pitch AI projects with transformation narratives instead of financial arguments, and CFOs kill them for lacking measurable impact. The fix. Pick a high-volume, measurable, low-risk process, establish its cost baseline, and frame the pilot as a small bet with defined kill criteria and a specific P&L impact. The outcome. The project gets funded because the math is obvious, the downside is bounded, and the CFO can model the payback period in a single meeting.

Your AI strategy is a deck, not a system

Sun, 28 Apr 2024 00:00:00 +0000

We reviewed a lot of AI strategies in early 2024. Decks, mostly. 30-60 slides. A few had Notion docs. One had a Miro board the size of a city block.

They all looked roughly the same. A list of use cases. A vendor comparison matrix. A timeline with phases — “quick wins” in Q1, “medium-term” in Q2-Q3, “transformative” in Q4. Maybe a slide about responsible AI. Maybe a slide about data governance. Definitely a slide about ROI projections that nobody believed.

These were not strategies. These were wishlists with a Gantt chart.

What a strategy is not

A strategy is not a list of things you could do with AI. Every team can generate that list. The list is infinite. You could summarize documents, classify tickets, generate emails, power search, automate onboarding, build a chatbot, score leads, detect anomalies, predict churn. The list writes itself — and that’s the problem.

A list of use cases is the starting point, not the strategy. The strategy is what you cut from the list, and why.

A strategy is not a vendor comparison. “We evaluated OpenAI, Anthropic, Google, and Cohere and selected X” is a procurement decision, not a strategy. The model is a component. It will change. Your strategy should survive a model swap.

A strategy is not a timeline. “We’ll do RAG in Q1, agents in Q2, fine-tuning in Q3” is a project plan. Project plans are useful. They are not strategies. A strategy tells you what to do when the plan falls apart — and it will fall apart, because this is AI and the ground is moving under your feet.

What a strategy actually is

A strategy is a set of bets with kill criteria.

A bet has three parts. What you’re building. How you’ll know if it works. When you’ll kill it if it doesn’t.

That’s it. Most of the decks we reviewed had the first part — what you’re building. Almost none had the second — how you’ll know if it works. And we never saw the third — when you’ll kill it.

The absence of kill criteria is the tell. It means the organization has not confronted the possibility that any of these initiatives might fail. And in AI, the failure rate is high. Not because the technology doesn’t work — it often does — but because the integration is hard, the data is messy, the use case doesn’t generate the value you expected, or the users don’t adopt it.

The three questions

Every AI initiative should answer three questions before it starts:

What is the success metric? Not “we’ll improve efficiency.” A number. “We’ll reduce average ticket resolution time from 14 minutes to 9 minutes.” Or “we’ll increase the percentage of customer queries resolved without human escalation from 30% to 50%.” If you can’t name a number, you’re not ready to build.

What is the measurement timeline? When will you check? Not “at the end of the project.” A date. “We’ll measure after 4 weeks of production traffic.” This forces you to define what “production traffic” means, which forces you to define what “launched” means, which forces a dozen useful conversations you’d otherwise skip.

What is the kill criteria? Under what conditions do you stop? “If we haven’t hit 40% of target improvement after 6 weeks, we stop and reallocate the team.” This is the one that hurts. This is the one that separates a strategy from a wishlist. A wishlist never dies. A strategy has conditions under which you walk away.

Why kill criteria matter

Without kill criteria, AI projects become zombies. They don’t die. They don’t succeed. They linger. The team keeps working on them because nobody explicitly decided to stop. The feature is in production but nobody’s using it. The model is running but the results aren’t good enough to trust. The dashboard exists but nobody looks at it.

We saw this pattern repeatedly. A team ships an AI feature. Adoption is low. Quality is mediocre. But the initiative was on the roadmap, it was in the strategy deck, the VP mentioned it in the all-hands. So it stays. It absorbs engineering time. It creates maintenance burden. It blocks the team from working on something that might actually work.

Kill criteria prevent this. They make it safe to stop. They make stopping an expected outcome — not a failure, but a planned checkpoint. “We said we’d measure at 6 weeks and kill it below threshold X. We’re below threshold X. We’re stopping. That’s the system working.”

The portfolio view

Once you have individual bets with kill criteria, you can think about the portfolio.

A good AI portfolio has a mix of bets at different risk levels. A few near-certain things — classification, extraction, structured output. These have well-defined inputs and outputs, they’re easy to eval, and the failure mode is obvious. They build organizational muscle.

A few medium-risk things — RAG, summarization, conversational interfaces. These require more integration work, the eval is harder, and the failure modes are subtle. They’re where the real value often is.

And maybe one high-risk thing — something genuinely novel, where you’re not sure it’ll work. An agent workflow, a generative feature, something creative. This is the one that might be transformative — or might be a waste. That’s why it needs the clearest kill criteria of all.

The portfolio view also helps you sequence. You don’t start with the high-risk bet. You start with the near-certain things, because they build the infrastructure — evals, deployment pipelines, monitoring — that the harder bets need. Teams that jump straight to the ambitious use case skip the boring work that makes ambitious work possible.

The org problem underneath

The reason most strategies are decks instead of systems is organizational, not technical.

Building a deck is a planning exercise. One person or a small team can do it. It requires research, synthesis, maybe some vendor conversations. It produces a deliverable that looks good in a leadership review.

Building a system — bets, metrics, kill criteria, portfolio management — is a governance exercise. It requires cross-functional alignment. It requires someone with the authority to kill initiatives. It requires ongoing measurement and honest reporting. It requires admitting that some things aren’t working.

Most orgs aren’t structured for this. The AI strategy is owned by someone who doesn’t have kill authority. Or the metrics aren’t instrumented. Or the honest reporting culture doesn’t exist. The deck is a symptom of the org, not the cause.

The heuristic

Open your AI strategy document. For each initiative, check whether it has a success metric, a measurement date, and written kill criteria. If any of the three are missing, you don’t have a strategy — you have a deck.

The fix takes an afternoon. Sit down with the people who own each initiative. Write down the three answers. Put them somewhere visible. Review them on the date you wrote down. Kill what needs killing. That’s the system.

tl;dr

The pattern. Most AI strategies are a list of use cases with a Gantt chart — they name what to build but never define how you’ll know if it works or when you’ll stop if it doesn’t. The fix. For every initiative, write down three things before work starts: a specific success metric, a date you’ll measure it, and the threshold below which you kill the project. The outcome. Initiatives that aren’t working get stopped instead of becoming zombies, teams are freed to pursue bets that might actually land, and “we have an AI strategy” means something more than a slide deck.

The eval you skipped is the one that bites

Fri, 12 Apr 2024 00:00:00 +0000

Every team we talked to in Q2 2024 was shipping LLM features. Summarization, extraction, chat, search. The race was real — GPT-4 was mature, Claude 3 had just landed, and the window to build something differentiated was shrinking by the week.

Almost none of them had evals.

Not “not enough evals.” Not “evals that weren’t great.” None. Zero. A model call in production, returning text to users, with no automated way to know if the output was any good.

The excuse

The excuse was always the same. Paraphrased: “We can’t define what good looks like for this feature, so we’ll just ship and see.”

This is backwards. If you can’t define what good looks like, that is the feature most likely to regress. It is the feature most likely to silently degrade when you swap models, change a prompt, or update your retrieval pipeline. It is the feature where “ship and see” means “we’ll find out from our users, eventually, maybe.”

The features where “correct” is easy to define — extraction, classification, structured output — those are the ones that tend to hold up. You notice when they break because you have a schema and a test. The features where correctness is fuzzy — summarization, tone, helpfulness — those are the ones that rot.

Why fuzzy is not an excuse

Here is a thing that is true and that teams resist hearing: you do not need a perfect eval. You need a directional one.

Ten golden examples. Handwritten. Inputs you care about, outputs you’d be happy with. Run your system against them after every change. Did it get worse? Did it get better? You don’t need a score to three decimal places. You need a signal.

The bar is not “automated evaluation that captures every nuance of quality.” The bar is “better than nothing.” Nothing is what most teams had.

Consider what “nothing” actually looks like in practice. You change a prompt. You deploy. A PM notices a week later that summarizations are worse. They file a ticket. An engineer investigates. They can’t reproduce it because the inputs are different now. They tweak the prompt again. They deploy again. Nobody checks. The cycle repeats.

Now consider what “10 golden examples” looks like. You change a prompt. Your CI runs 10 examples. Three of them are clearly worse. You look at why. You fix the prompt. You deploy with confidence. Elapsed time: 20 minutes instead of 2 weeks.

The pattern we kept seeing

In the teams we advised that quarter, there was a reliable pattern. Teams would build an LLM feature in a few days. They’d spend a week on prompt engineering — getting the output to feel right. They’d ship it. Then they’d never touch the prompt again, because touching it meant risking a regression they couldn’t measure.

The prompt became frozen. Not because it was good, but because nobody had a way to tell if a change made it better or worse. The feature shipped at whatever quality level the first prompt achieved, and it stayed there.

This is the opposite of iteration. This is the opposite of what software engineering is supposed to be. We’ve spent decades building the infrastructure to change code safely — tests, CI, staging environments, feature flags. Then we put a model call in the middle and throw all of it away.

What a minimal eval looks like

You don’t need a framework. You don’t need an evaluation platform. You need a script and a file.

The file is your golden set. Each example has an input and one or more reference outputs. The reference outputs don’t need to be perfect — they need to be “acceptable.” You’re not testing for exact match. You’re testing for direction.

The script runs your system against the golden set and produces a report. The report can be as simple as “here are the outputs, diff them against last run.” For teams that want a number, use a model-as-judge pattern — have a second model rate the outputs on a 1-5 scale against criteria you define. It’s not perfect. It doesn’t need to be.

Here’s the part that matters: the golden set is curated by a human who understands the feature. Not generated. Not sampled randomly from production. Handpicked. The 10 inputs that you’d be most embarrassed to get wrong. The edge cases you thought about during design. The examples your PM showed in the demo.

Those 10 examples are worth more than a thousand random ones, because they encode your taste. They represent your opinion about what good looks like — and having an opinion, even an imperfect one, is infinitely better than having no opinion at all.

The cost of skipping

We saw three flavors of pain from teams that skipped evals.

The silent regression. A model provider updates their API. The output format shifts subtly. Nobody notices for three weeks. Customer complaints trickle in. By the time someone investigates, there’s no baseline to compare against.

The frozen prompt. As described above. The team wants to improve the feature but can’t, because any change is a leap of faith. The feature ships at v1 quality and stays there.

The model migration tax. Team wants to switch from GPT-4 to Claude 3 (or vice versa) for cost or latency reasons. Without evals, the migration is a full manual QA cycle. With 10 golden examples, it’s a 5-minute script run. The teams without evals either don’t migrate — leaving money on the table — or migrate blind and pray.

The meta-point

The difficulty of defining “correct” is not a reason to skip evals. It is the reason you need them.

Easy-to-eval features are easy to catch when they break. Hard-to-eval features are the ones that break silently, regress slowly, and create the kind of quality debt that compounds until someone notices you’re shipping garbage.

The harder it is to define what good looks like, the more valuable even a rough approximation becomes. A noisy signal is better than no signal. A biased eval is better than no eval. Ten examples are better than zero.

The heuristic

Before you ship an LLM feature, write down 10 inputs and 10 outputs you’d be happy with. Run your system against them. Save the results. Run them again after every change. That’s it. That’s the eval.

If you can’t write down 10 examples, you don’t understand the feature well enough to ship it.

tl;dr

The pattern. Teams skip evals on exactly the features where “correct” is hardest to define — summarization, tone, helpfulness — and those are precisely the features that silently rot in production. The fix. Before you ship any LLM feature, write 10 representative input/output pairs, run your system against them after every change, and treat any regression as a blocker. The outcome. What once took two weeks of PM complaints and guesswork to diagnose takes 20 minutes, and your prompts can finally evolve instead of staying frozen at v1 quality forever.