Why AI Spec-Driven Development Was Born — Its Originator on the Origin and the Road to Proof

When I thought “letting AI build software is impossible”

Not long ago, I was cold on the whole idea of “having AI write code.”

I kept hearing the same things from other engineers: “You end up spending all your time fixing it, so what’s the point?” “Pull requests get so huge they’re impossible to review.” “AI-written code never fits the existing architecture.”

I had lived through it too. I’d hand something to AI with high hopes, and back would come an implementation that was subtly off in a dozen ways. Fixing it ate my time, and again and again I felt it would have been faster to write it myself.

“Letting AI build software is impossible” — there was a stretch where I had nearly concluded exactly that.

This article is a look back at why I changed my mind, and what I discovered.

The first AI coding attempt that let me down

The first time I was really disappointed was when I handed a small feature to AI.

I just said, “Build a login feature.” The code that came back was hard to look at: a bespoke implementation that ignored our existing middleware, error responses in a format different from our internal conventions, and the project’s directory conventions completely disregarded.

I tried to review it, but there were so many changed files I had no idea where to start. There was no one to ask, “Why did you implement it this way?” I’d have the AI redo it, and it would come back off in some new direction instead.

After two or three rounds of that, honestly, it felt faster to just write it myself.

This is where I almost arrived at the conclusion that “AI can’t be trusted.” I suspect a lot of engineers have stalled out in the same place.

The moment the same AI behaved like a different one

The turning point came on another project, when I changed how I scoped the work.

Instead of “implement the authentication feature,” I wrote this:

Issue #42: Implement a JWT token verification middleware

Acceptance criteria:
Extract the Bearer token from the Authorization header
Verify the token signature and check expiry
Return 401 on verification failure

Constraints:
Extend the existing authMiddleware.ts
Error responses must follow the format in errors/auth.ts

That tiny issue was all I handed to the AI. The code that came back was so clean it could have come from a different author: it correctly extended the existing middleware, returned 401 in the conventional error format, and even included tests.

Same AI. The only thing that changed was how we handed it the work.

Leave room for the AI to guess, and it will drift on its guesses. Remove the room to guess, and it builds to spec. It should be obvious, yet the first time I experienced it I was quietly astonished.

The problem was never the AI’s ability. It was the resolution of the input we were giving it.
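To make that concrete, here is a minimal sketch of what "building to spec" for Issue #42 could look like. It is not the code the AI produced, and the real authMiddleware.ts and errors/auth.ts are not shown in this article, so every name here (`verifyRequest`, `SignatureVerifier`, the error shape) is an assumption for illustration. Signature checking is delegated to a hypothetical callback so the sketch stays framework-free.

```typescript
// Sketch only: all names and shapes below are assumptions, not the real
// authMiddleware.ts or errors/auth.ts.

// Minimal claims; `exp` is the expiry in seconds since the epoch, as in JWT.
interface TokenClaims {
  sub: string;
  exp: number;
}

// Assumed stand-in for the error format defined in errors/auth.ts.
interface AuthErrorResponse {
  status: 401;
  body: { error: "unauthorized"; reason: string };
}

interface AuthSuccess {
  status: 200;
  claims: TokenClaims;
}

// Hypothetical hook: returns the claims if the signature is valid, else null.
type SignatureVerifier = (token: string) => TokenClaims | null;

// Criterion 3: every failure path returns 401 in the one agreed format.
function unauthorized(reason: string): AuthErrorResponse {
  return { status: 401, body: { error: "unauthorized", reason } };
}

// Criterion 1: extract the Bearer token from the Authorization header.
function extractBearerToken(header: string | undefined): string | null {
  if (!header) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match?.[1] ?? null;
}

function verifyRequest(
  authorizationHeader: string | undefined,
  verifySignature: SignatureVerifier,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): AuthSuccess | AuthErrorResponse {
  const token = extractBearerToken(authorizationHeader);
  if (token === null) return unauthorized("missing bearer token");

  // Criterion 2a: verify the token signature.
  const claims = verifySignature(token);
  if (claims === null) return unauthorized("invalid signature");

  // Criterion 2b: check expiry (the token is expired once now >= exp).
  if (claims.exp <= nowSeconds) return unauthorized("token expired");

  return { status: 200, claims };
}
```

Each acceptance criterion maps to one short, checkable branch, which is also what makes the resulting PR easy to review.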

Testing the hypothesis — the numbers we saw

To test the hypothesis that “the resolution of the spec decides the accuracy of the AI,” we tried it for real on our own SaaS product.

We chose the feature-build phase of a RAG chatbot, set up an issue-driven workflow, and put AI agents in charge of development.

The measurements came out like this:

Code productivity: 115–460× the industry average of 50–200 lines/day
PR merge rate: 98% (315 of 321)
Code-review effort: down 97% (humans handled only 3.3% of the total)
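
As a quick sanity check on how these figures relate to each other, the arithmetic can be sketched as follows. The counts and percentages are the ones quoted above. Reading the multiplier as "output divided by baseline" is an assumption, and the absolute lines-per-day figure it yields is derived here, not a separately reported measurement.

```typescript
// Figures quoted in the article.
const mergedPrs = 315;
const totalPrs = 321;
const humanReviewSharePct = 3.3;
const baselineLinesPerDay = { low: 50, high: 200 };
const productivityMultiplier = { low: 115, high: 460 };

// 315 / 321 is roughly 98.1%, which rounds to the quoted 98%.
const mergeRatePct = (mergedPrs / totalPrs) * 100;
const unmergedPrs = totalPrs - mergedPrs;

// If humans handled 3.3% of review, the reduction is about 96.7%, quoted as 97%.
const reviewReductionPct = 100 - humanReviewSharePct;

// Assumption: multiplier = output / baseline. Under that reading, both ends
// of the range point at the same implied daily output.
const impliedFromHighBaseline = productivityMultiplier.low * baselineLinesPerDay.high;
const impliedFromLowBaseline = productivityMultiplier.high * baselineLinesPerDay.low;
```

Under that assumption, both ends of the 115–460× range correspond to the same output, about 23,000 lines/day, which explains why the multiplier range is exactly as wide as the baseline range.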

What I want to stress is not that “AI is fast.” It is that when the spec is in order, AI can move both fast and accurately.

More important than the numbers is that we achieved this without sacrificing quality. Having 98% of PRs merge as-is is a completely different thing from churning out piles of throwaway code.

Once the hypothesis showed up in the numbers, I was convinced it could be captured as a methodology.

From a personal knack to a procedure a team can reproduce

From there, the work was turning a personal discovery into an organizational procedure.

“I can do it” and “the team can do it” are different problems. I had been choosing issue granularity by intuition, and I had to break that intuition down into a form anyone could write.

After trial and error, we settled on structuring development tasks into four layers:

Epic: the unit of a business problem
User Story: value from the user’s point of view
Use Case: the unit of concrete behavior
Scenario: acceptance criteria precise enough that AI never has to guess

This structure alone dramatically raises the resolution of the spec you can hand to AI.
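
One way to picture the four layers is as nested types, with the Scenario layer being the unit that actually gets handed to the AI. This is a sketch: the field names are assumptions, since only the layer names are defined above. In the sample data, only the Scenario content comes from Issue #42; the epic, story, and use-case wording is invented for illustration.

```typescript
// Field names are assumptions; the article names the layers, not a schema.
interface Scenario {
  title: string;
  acceptanceCriteria: string[];
  constraints: string[];
}
interface UseCase {
  behavior: string;
  scenarios: Scenario[];
}
interface UserStory {
  userValue: string;
  useCases: UseCase[];
}
interface Epic {
  businessProblem: string;
  userStories: UserStory[];
}

// Collect the Scenario-level units: the layer precise enough to delegate.
function scenariosOf(epic: Epic): Scenario[] {
  const result: Scenario[] = [];
  for (const story of epic.userStories) {
    for (const useCase of story.useCases) {
      for (const scenario of useCase.scenarios) result.push(scenario);
    }
  }
  return result;
}

// Illustrative data: the upper three layers are hypothetical wording.
const authEpic: Epic = {
  businessProblem: "Only authenticated users may reach the API",
  userStories: [
    {
      userValue: "As a signed-in user, my requests are accepted without logging in again",
      useCases: [
        {
          behavior: "The API verifies the token on each request",
          scenarios: [
            {
              title: "Implement a JWT token verification middleware",
              acceptanceCriteria: [
                "Extract the Bearer token from the Authorization header",
                "Verify the token signature and check expiry",
                "Return 401 on verification failure",
              ],
              constraints: [
                "Extend the existing authMiddleware.ts",
                "Error responses must follow the format in errors/auth.ts",
              ],
            },
          ],
        },
      ],
    },
  ],
};

const delegableUnits = scenariosOf(authEpic);
```

The upper layers carry the "why" for humans; the flattened scenarios are what leave no room for guessing.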

We also needed a mechanism that turns review feedback into reusable knowledge instead of using it once and discarding it. If the same comment keeps coming up and nothing is done about it, the AI never learns. Turn each comment into a rule that prevents the problem up front next time; that loop is what continuously lifts output quality.
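
A minimal sketch of that loop, under stated assumptions: each review comment is normalized into a reusable `rule` string, and any rule seen on at least two distinct PRs is promoted to a standing constraint on future issues. The `ReviewComment` shape, the threshold of two, the function names, and the sample PR numbers are all hypothetical, not a description of our actual tooling.

```typescript
// Hypothetical shape: a review comment normalized into a reusable rule.
interface ReviewComment {
  prNumber: number;
  rule: string;
}

// Promote any rule that appears on at least `threshold` distinct PRs.
function promoteRecurringFeedback(
  comments: ReviewComment[],
  threshold: number = 2,
): string[] {
  const prsByRule: Record<string, number[]> = {};
  for (const comment of comments) {
    const prs = prsByRule[comment.rule] ?? [];
    if (prs.indexOf(comment.prNumber) === -1) prs.push(comment.prNumber);
    prsByRule[comment.rule] = prs;
  }
  const promoted: string[] = [];
  for (const rule of Object.keys(prsByRule)) {
    const prs = prsByRule[rule] ?? [];
    if (prs.length >= threshold) promoted.push(rule);
  }
  return promoted;
}

// Close the loop: add promoted rules to the next issue's constraints,
// skipping any that are already listed.
function withStandingRules(constraints: string[], promoted: string[]): string[] {
  const merged = constraints.slice();
  for (const rule of promoted) {
    if (merged.indexOf(rule) === -1) merged.push(rule);
  }
  return merged;
}

// Illustrative data only.
const reviewComments: ReviewComment[] = [
  { prNumber: 101, rule: "Error responses must follow the format in errors/auth.ts" },
  { prNumber: 102, rule: "Error responses must follow the format in errors/auth.ts" },
  { prNumber: 102, rule: "Rename this local variable" },
];

const promotedRules = promoteRecurringFeedback(reviewComments);
const nextIssueConstraints = withStandingRules(
  ["Extend the existing authMiddleware.ts"],
  promotedRules,
);
```

The one-off comment stays a one-off, while the repeated one becomes part of every later spec, so the same review remark should not need to be made a third time.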

Treating the spec as a living deliverable. Narrowing scope. Opening a PR at 70% completeness and converging it in review. Turning feedback into knowledge. We integrated all of this into something we could transplant to clients outside our own walls — and that is the methodology we now call AI Spec-Driven Development. (Our services lay out the full picture.)

Not a magic prompt, but a spec as the OS

Looking back over all of it, the core comes down to one realization.

What you need in order to delegate to AI is not a “magic prompt” but a “spec as the operating system.”

AI coding tools return vague output for vague input, and clear output for clear input. That isn’t a limitation of AI — it’s the very principle AI follows.

What changed between the me who thought “letting AI build software is impossible” and the me of today is not my expectations of AI. It is my awareness of what I’m handing to it.

Finally, I’ll leave a question for anyone who read this far.

Is the “spec” you’re handing AI right now at a resolution where the AI never has to guess?

If your answer feels close to “no,” then the next time you delegate something to AI, try just one thing: spell out three acceptance criteria and two constraints before you ask. That alone makes an astonishing difference to the quality of the code that comes back.
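
That rule of thumb is simple enough to check mechanically before an issue goes out. The sketch below is one hypothetical way to do it; `IssueSpec` and `findSpecGaps` are names made up for this example, and the thresholds are just the "three and two" suggested above.

```typescript
// Hypothetical shape for an issue that is about to be handed to the AI.
interface IssueSpec {
  title: string;
  acceptanceCriteria: string[];
  constraints: string[];
}

// Return a list of gaps; an empty list means the rule of thumb is met.
function findSpecGaps(
  spec: IssueSpec,
  minCriteria: number = 3,
  minConstraints: number = 2,
): string[] {
  const gaps: string[] = [];
  if (spec.acceptanceCriteria.length < minCriteria) {
    gaps.push(
      `needs at least ${minCriteria} acceptance criteria, has ${spec.acceptanceCriteria.length}`,
    );
  }
  if (spec.constraints.length < minConstraints) {
    gaps.push(
      `needs at least ${minConstraints} constraints, has ${spec.constraints.length}`,
    );
  }
  return gaps;
}

// The vague request from the first attempt.
const vagueRequest: IssueSpec = {
  title: "Build a login feature",
  acceptanceCriteria: [],
  constraints: [],
};

// Issue #42 as written earlier in the article.
const issue42: IssueSpec = {
  title: "Implement a JWT token verification middleware",
  acceptanceCriteria: [
    "Extract the Bearer token from the Authorization header",
    "Verify the token signature and check expiry",
    "Return 401 on verification failure",
  ],
  constraints: [
    "Extend the existing authMiddleware.ts",
    "Error responses must follow the format in errors/auth.ts",
  ],
};

const vagueGaps = findSpecGaps(vagueRequest);
const issue42Gaps = findSpecGaps(issue42);
```

A count check like this cannot judge whether the criteria are good ones, but it does catch the "Build a login feature" kind of request before it reaches the AI.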

From “letting AI build software is impossible” to “delegating to AI genuinely made things easier.”

The distance between the two is shorter than you think.