The Spec Is the Program

Code is a disposable artifact. The spec is the program.

Everything you think of as "the software" — the functions, the classes, the tests, the
implementation — is one possible compilation of the real program, which lives in the
specification. You can regenerate the code from the spec at any time. You cannot
regenerate the spec from the code. The spec is the investment. Code is what you have
until you regenerate it.

This is not how software development works today. It is how it should work — and that
is the argument this document makes. Accepting it changes almost everything about how
software gets built, how quality gets measured, how failures get diagnosed, and what
development looks like when AI agents are doing the coding.

Part 1: What's Actually Broken

The frustration with AI-assisted development is real and it's valid. The code looks
right. It compiles. Sometimes it even passes the tests. And then it does the wrong thing,
in ways you didn't anticipate, for reasons that are hard to trace.

The standard diagnosis is model quality — better models will fix it. The standard
prescription is better prompting — clearer instructions will fix it. Both are wrong.
The failure is structural, and it happens upstream of code generation.

When you give an AI system an underspecified input, it fills the gap. Not randomly — it
fills it with whatever is statistically most plausible given its training data. The
output looks correct because it is the most probable thing that looks like what you
asked for.

And honestly, this is part of what makes it feel magical. You sketch an idea and the
model fills it in with something believable — something you didn't have to think through,
that you wouldn't have gotten to that fast on your own. That's real. You want some of
that. The problem is that you get it everywhere, including in places where, if you'd
put more thought in, you would have made a different call. "Most probable completion of
a vague description" and "correct implementation of your actual intent" are different
things, and the gap between them is invisible. The code compiles. The tests pass. The
intent is violated — in ways that felt like the right call at the time.

Call this what it is: vibe speccing. We've been blaming vibe coding when the actual
problem is that we've been vibe coding our specifications. The model is filling gaps you
left. You left them because specifying intent rigorously is hard and slow. The model
does its best with what it has. The result is plausible but wrong.

The goal isn't to eliminate the magic. It's to confine it to exactly where you're okay
with it.

The math makes this concrete. Call the ambiguity at each layer of your development
process a residual probability that the output doesn't match intent. In a conventional
AI-assisted pipeline, that ambiguity multiplies:

P(correct) ≈ (1 - A₀)(1 - A₁)(1 - A₂)(1 - A₃)

At a conservative 20% residual ambiguity per layer — intent to prompt, prompt to spec,
spec to tests, tests to code — the probability that the final code faithfully implements
the original intent is roughly 0.8⁴ ≈ 41%. Less than half. Not because anyone failed. Because ambiguity
compounds. And because nothing fails explicitly — code compiles, tests pass — the
failure is invisible until it matters.
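The compounding is easy to see in a few lines of Python (the 20%-per-layer figure is the illustrative number from above, not a measurement):

```python
# Illustrative only: four layers, each leaving 20% residual ambiguity.
ambiguity_per_layer = 0.20
layers = ["intent -> prompt", "prompt -> spec", "spec -> tests", "tests -> code"]

p_correct = 1.0
for layer in layers:
    p_correct *= (1 - ambiguity_per_layer)

print(f"P(correct) ~ {p_correct:.2f}")  # P(correct) ~ 0.41
```

Halving the per-layer ambiguity to 10% still only gets you to about 66% — the structure, not the per-layer number, is the problem.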

No model improvement fixes this. A better model makes better guesses. It cannot
eliminate the gap when the spec leaves room for interpretation. The problem is not the
generator. The problem is the input.

The fix is structural: remove the ambiguity before generation begins.

Part 2: Semantic Lowering

The fix is called semantic lowering. The idea is not to eliminate what makes AI
generation useful — it's to control where the ambiguity lives. Decompose intent
hierarchically until the leaf nodes are tight enough that the model's gap-filling
happens only inside the bounds you've chosen. Where you've specified precisely, the
model executes precisely. Where you've deliberately left room, the model fills it in.
You get the magic exactly where you want it.

You start with a goal — something you want the system to do, expressed in natural
language. That's Level 0. You don't generate code from Level 0. You lower it.

Lowering a level means expanding it into something more detailed and more constrained,
such that every node at the new level is a faithful refinement of the node above it. A
human reviews each lowering step with a single question: "Is this a faithful refinement
of my intent? Did anything get lost?" That's the only human judgment required per step.
Then you lower again.

In practice, a fully coherent spec doesn't start at a feature. It starts at the product.
Level 0 is a business goal or a customer goal — the reason the product exists. The login
feature lives somewhere in the middle of the tree, not at the top.

Level 0 (business goal):
  "Users trust us with sensitive data because accessing it feels
   secure and effortless."

Level 1 (customer scenario):
  "A user returns to the product after being away for a week,
   logs in from a new device without friction, and immediately
   reaches their work."

Level 2 (feature requirements):
  "Users authenticate with email and password.
   Failed attempts are rate-limited.
   All authentication events are audit-logged."

Level 3 (acceptance criteria):
  "POST /auth/login returns 200 + JWT on success.
   Returns 400 :invalid-email on malformed email.
   Returns 401 :wrong-password on incorrect password.
   Returns 429 after 5 failures from the same IP in 15 minutes."

Level 4 (behavioral invariants):
  "JWT payload: {sub, exp, iat, scope}.
   Rate limit: per source IP, rolling 15-minute window.
   Audit log schema: {event, ts, user-id, ip, success, reason}."

The technical detail at Levels 3 and 4 is constrained by the customer scenario at Level
1, which is constrained by the business goal at Level 0. A feature that implements
Level 3 correctly but violates Level 1 — say, a login that requires too many steps and
feels like friction — is semantically wrong even if computationally correct. The tree
makes that visible.
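One way to make the tree concrete — a minimal sketch, assuming a node needs only a level, a statement of intent, and children. The class and field names here are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class IntentNode:
    level: int          # 0 = business goal ... 4 = behavioral invariant
    statement: str      # the intent expressed at this level
    children: list["IntentNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

# The login slice of the example tree above, abbreviated.
root = IntentNode(0, "Users trust us with sensitive data.")
scenario = IntentNode(1, "A returning user logs in from a new device without friction.")
feature = IntentNode(2, "Email/password auth; failed attempts rate-limited; events audit-logged.")
root.children.append(scenario)
scenario.children.append(feature)
```

A question like "does this implementation satisfy Level 1?" then becomes a walk over the nodes at that level, and a lowering step is an append reviewed by a human.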

At each level, ambiguity decreases. Each node is more constrained than the one above.
By Level 3, the solution space is small — not because you've described the code, but
because you've described what the code must do with enough precision that most wrong
implementations are excluded.

You are done lowering when every leaf node can be expressed as a discrete operation:
success or failure, accept or reject, this classification or that one. When you can
compose a decision tree or computation graph from the leaves and traverse it statically
to check feasibility — before any code runs — you have reached terminal depth.

The intent tree you've built is now the fitness function for implementation. An
implementation is semantically sound if it satisfies every node at every level. This
is separate from computational correctness — a system can compile, pass all its tests,
and still be semantically wrong. The intent tree makes that distinction visible. Each
node is a check: does this implementation satisfy the intent expressed here?

Building a good intent tree is expensive, and it is supposed to be. You go up and down
the tree, cross-checking, catching contradictions, filling gaps. A spec of 50,000 tokens
might take a million tokens of computation to produce. That ratio is not waste. It is
the distillation. The meaning per token in the final artifact is what matters. Meaning
comes from the energy invested in creating it.

A vibe-coded spec spends almost no tokens and encodes almost no semantic meaning. A
fully lowered intent tree encodes high semantic meaning precisely because the work was
done to put it there. You cannot shortcut the distillation and get the density.

This also sets a hard ceiling. The spec is the upper bound on correctness for everything
derived from it. You cannot generate semantically correct code from a semantically
incomplete spec. No model, no matter how capable, can exceed the ceiling set by the
input — it can only approach it. The only way to raise the ceiling is to improve the
spec. Every token invested in distillation raises the ceiling for everything downstream,
permanently.

Re-roll convergence is the empirical test for whether you've lowered far enough. Run
the same spec through code generation multiple times — different models, different days,
different runs. High convergence means the solution space is narrow enough to reliably
hit. High variance means the model is still filling gaps statistically; you got lucky
on the passes. That's not a good spec. Lower further.
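A sketch of the convergence check. Here `generate` and `passes_acceptance` are hypothetical stand-ins for your code-generation call and the spec's acceptance tests — the stub simulates a tight spec where most rolls land on the same implementation:

```python
import random

def generate(spec: str, seed: int) -> str:
    """Stand-in for a code-generation call; seeded here for repeatability."""
    rng = random.Random(seed)
    # A tight spec leaves little room: most rolls land on the same implementation.
    return "impl-A" if rng.random() < 0.9 else "impl-B"

def passes_acceptance(code: str) -> bool:
    """Stand-in for running the spec's acceptance tests on generated code."""
    return code == "impl-A"

def convergence_rate(spec: str, runs: int = 20) -> float:
    passing = sum(passes_acceptance(generate(spec, seed)) for seed in range(runs))
    return passing / runs

rate = convergence_rate("POST /auth/login returns 200 + JWT on success ...")
```

A high rate means the spec pins down the solution space; a low rate means the model is still guessing — lower further before trusting any single pass.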

Part 3: From Spec to Running Code

A good intent tree is necessary but not sufficient. You still need to get to running
code. The path has three phases, and the parallel to an optimizing compiler is exact.

Phase 1 is the source code — the canonical, human-readable artifact. Phase 2 is the
intermediate representation — the plannable form that sits between what you meant and
what will execute, and where optimization happens. Phase 3 is machine code — ephemeral,
regenerable, the output of compiling the IR. In a classical compiler, you don't rewrite
source code based on the machine code; you regenerate machine code from the source. You
don't debug machine code if the wrong thing happens; you fix the source or the IR.
Exactly the same discipline applies here.

The phases must not mix.

Phase 1 is semantic: build the intent tree, no code, no side effects, human-gated at
each lowering step. When the leaves can be composed into a plannable action graph —
discrete inputs, classifiable outputs — Phase 1 is done. Before any code runs, you can
check feasibility statically: does a sound path exist through the action graph? If no
path exists, there is no solution. You have discovered infeasibility in the spec, not in
the code — and it cost you nothing to discover it except spec work.

Phase 2 is plannable: the action graph. Each node has discrete inputs and a classifiable
output condition. A planner verifies that a feasible path exists before proceeding. Not
"will the code work" but "is there any code that could work?"

Phase 3 is tactical: code generation. The coding model's only concern is whether it can
meet the exit criteria for the current action node. It does not exercise semantic
judgment — that was done in Phase 1. It is not deciding what the system should do —
those decisions are crystallized in the graph. It is solving a well-bounded problem:
given these inputs and this output condition, produce code that satisfies them.

This separation is what tightens error bounds. When a model is asked to "build a login
system," it must simultaneously reason about intent, requirements, security posture, and
implementation mechanics. The solution space is enormous. When it is asked to "implement
a function that takes an email and password, returns a Result, succeeds when credentials
are valid, returns :invalid-email when email is malformed" — it is solving a specific,
bounded problem. The space it must search is tactics, not strategy.

When things go wrong, the hierarchy tells you exactly where. Tactical failure: the code
doesn't satisfy the exit criteria — try different tactics. If the solution space is
exhausted, surface to Phase 2: the action node may be infeasible as specified. If the
action graph has no implementable path, surface to Phase 1: the spec needs revision.
Debugging is counterfactual reasoning up the tree. What change at which level would have
foreclosed this failure mode? Find that level. Fix it. Re-lower the subtree.

The error structure after semantic lowering looks like this:

P(correct) ≈ (1 - E_spec)(1 - E_code)

E_spec is controlled and deliberate — the ambiguity you chose to leave by stopping at
a given depth. E_code is bounded by the action graph — tactical errors in a
well-defined solution space. The terms are independent, and because both are small,
the total error is approximately their sum: two terms instead of four compounding
layers. With a tight spec and a competent coding model:

P(correct) ≈ 0.95 × 0.90 = 0.855
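The contrast with the vibe-specced pipeline, using the same illustrative figures as above:

```python
four_layer = (1 - 0.20) ** 4          # compounding ambiguity across four layers
two_phase = (1 - 0.05) * (1 - 0.10)   # tight spec (E_spec=0.05) + codegen (E_code=0.10)
print(round(four_layer, 2), round(two_phase, 3))  # 0.41 0.855
```

Same class of models, roughly double the probability of matching intent — the gain comes from restructuring where the error lives, not from a better generator.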

More importantly: the failure modes are different in kind. What remains are bugs —
"code that doesn't implement the intent correctly." That is categorically different from
"code that implements the wrong intent." Bugs are fixable by better code or better
generation. Wrong intent requires going back to the spec. Semantic lowering ensures you
never encounter the second class of failure silently.

Because you're now searching only the space of functional bugs within a correctly
specified intent, that space is itself reducible. The intent is right; the question is
only whether the implementation satisfies the contract. You can apply further
constraints to shrink the search.

This is where skills come in. A skill is a reusable policy for code generation: when
generating authentication code, use these patterns; when touching a database, use
parameterized queries; when handling errors in this domain, follow this convention.
Skills belong in Phase 3 — they constrain how code is written, not what it must
accomplish. The spec determines that. Skills just shrink the solution space the coding
model must search, steering it toward well-understood, validated implementations.

Skills are safe to apply aggressively in Phase 3 because Phases 1 and 2 are the
authority. A skill recommendation that conflicts with the intent tree or the action
graph is simply not applicable — the prior phases already define what correct means.
If the action graph specifies a read-only operation and a skill's preferred pattern
would allow writes, the conflict is immediately visible and the skill doesn't apply.
The worst a misapplied Phase 3 skill can do is produce a bug, not violate intent.
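That conflict check can be mechanical. A minimal sketch, treating permitted effects as sets — the node and field names are illustrative, not a proposed schema:

```python
# From the action graph: this node is read-only.
action = {"name": "fetch-profile", "allowed_effects": {"read"}}

# A Phase 3 skill whose preferred pattern would also write.
skill = {"name": "upsert-with-retry", "effects": {"read", "write"}}

def applicable(skill: dict, action: dict) -> bool:
    # A skill may not introduce effects the action node does not permit.
    return skill["effects"] <= action["allowed_effects"]

print(applicable(skill, action))  # False -- the skill would allow writes
```

The spec's constraint wins by construction; the skill is simply filtered out before it can shape the code.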

This matters because today, skills are typically used the other way around: as a bridge
from a vague spec directly to code, substituting for the Phase 1 and 2 work that was
never done. That's vibe speccing with extra steps — a more structured way of letting
the model guess. Applied correctly, after a well-built intent tree, a skill constrains
execution within an already-bounded space. Applied too early, it reintroduces exactly
the ambiguity you were trying to eliminate. You can use a skill as a reference while
building the intent tree — a checklist to consult, domain knowledge to draw on — but
never to make decisions. Intent is a human judgment. A skill can inform; it cannot
decide.

Part 4: What Falls Out

If the spec is the program, some things follow without separate argument.

Waterfall at agile speed

The conventional wisdom is that waterfall is bad. The actual argument against waterfall
was never that spec-first is wrong. It was that the cycle was too slow and changing
course was too expensive. By the time your product made contact with real customers,
months had passed. When they told you the spec was wrong, you had to rework months of
code.

Those objections disappear. With agent-assisted semantic lowering, building a coherent
intent tree takes hours, not months. With AI code generation, producing code from the
spec takes minutes. Acceptance tests derived from the spec are continuous. Code
regenerable from a revised spec makes changing course cheap. The cycle compresses to
a day.

                                Spec quality   Cycle time   Change cost
  Waterfall (pre-AI)            High           Months       Very high
  Agile (pre-AI)                Low            Weeks        Low
  Semantic lowering (post-AI)   High           Days         Near-zero

You get spec rigor at agile cadence. The traditional tradeoff between them was about
managing a constraint given fixed costs. When those costs collapse, the debate ends.
What remains is just: how good does your spec need to be? The answer depends on your
risk tolerance — and you can now get to "good enough" quickly.

Specs compound; code is disposable

A spec produced today will generate better code when better models exist next year. You
don't regenerate the spec. You regenerate the code. The ratio inverts from what most
teams are used to: invest in the spec, treat the code as ephemeral.

This is also where the ceiling argument pays off. Every improvement to the spec
permanently raises the ceiling for every future compilation. Better model ships —
regenerate. Spec gets tighter — regenerate. The spec accumulates value. The code is
always today's best answer to what the spec asks.

Specs get cheaper to run over time

A spec with high re-roll convergence has something more than working code. It has a
verified input/output distribution. Model outputs that satisfy the spec's exit criteria
are positive examples of correct behavior; outputs that fail are negative. No human
labeling needed.

You can train a specialized model on that signal. Once it matches the foundation model's
convergence rate on the spec, you swap it in without changing the application. Over
time, stable patterns crystallize into faster, cheaper execution: full model inference
for novel problems, specialist models for familiar ones, compiled code for fully
deterministic operations. The spec is what determines which is which — the boundary
between "still needs reasoning" and "just run it."

It's the same thing the brain does during sleep — novel experiences handled slowly
during the day, consolidated into fast pattern-matching overnight. The spec is what
makes something stable enough to consolidate.

The spec as a product map you can query

The tree's highest levels aren't technical. They are business goals and customer goals.
The full tree is a structured model of what the product is and why — queryable,
diffable, searchable.

Agents can mine it for improvements beyond internal coherence: "if this product did X
instead of Y, here is the projected impact on the tree and the customer." Run the
counterfactual. Diff the resulting tree. Present it to a human: do you agree? Do you
want to test it? Click yes and the system generates a new spec subtree, new code, new
acceptance tests — a testable variant ready for customer validation.

The human's role shifts. Not "what should we build?" — an exhausting, open-ended
question. Instead: "do you agree with this direction?" — a judgment call on a concrete
proposal. It's a less exhausting question, and probably a better one — because the
answer is grounded in the product's actual structure, not a gut feeling about the
backlog.

All of this follows from one decision: treating the spec as the program. Are you?