By Shyam Verma

The Bottleneck Moved Up the Stack

A few months ago I asked AI to add link previews to a feed. Type a URL, see a card. The AI wrote it in under an hour. I shipped.

Then I tested with a Facebook URL. Nothing. Instagram? Nothing. Previews worked for blogs, news sites, GitHub. The two domains my users would actually paste returned blank.

I spent four hours debugging. Different parsers, different libraries, headers, user agents, retries. The AI kept rewriting code that wasn't broken.

The problem wasn't in the code. Facebook and Instagram gate previews behind login walls and bot detection. You can scrape with proxies and auth flows, but not from the parser library I was using, and not without an ongoing operational cost. The real fix was "scope this feature to platforms that allow it" or "buy a scraping service." That's not a parser problem. It's a product decision sitting one layer up, and I couldn't see it until I'd spent half a day asking AI to fix code that wasn't broken.
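For the record, the eventual fix was a handful of lines of product scoping, not a better parser. A minimal sketch of the idea, with hypothetical names (the allowlist and helper below are mine, not any library's):

```typescript
// Hosts known to serve Open Graph tags to anonymous requests.
// Hypothetical list: the real set is a product decision, not a parser setting.
const PREVIEW_HOSTS = new Set(["github.com", "medium.com", "dev.to"]);

function canPreview(rawUrl: string): boolean {
  try {
    const host = new URL(rawUrl).hostname.replace(/^www\./, "");
    return PREVIEW_HOSTS.has(host);
  } catch {
    return false; // not a parseable URL, so no preview
  }
}

// Everything else gets a plain link instead of a blank card.
console.log(canPreview("https://github.com/octocat/hello"));      // true
console.log(canPreview("https://www.facebook.com/some-post"));    // false
```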

The AI did exactly what I asked. I just kept asking it code questions about what was a product decision. For four hours.

The bottleneck moved

For most of the last decade, writing code was the bottleneck. You knew what you wanted, you knew how to verify it, you just had to write it. The methodology debates I sat through (agile vs waterfall, monoliths vs microservices, TDD vs not) were really debates about how to organize work around the fact that writing the code was slow.

AI made writing the code cheaper. It did not make the judgment cheaper.

What it didn't help with: knowing whether the thing works, discovering what the thing should be, understanding what users actually want, or seeing the unknowns at the edge of the system before you build into them. Those four feel about the same as they did three years ago.

METR ran a randomized controlled trial in 2025 with experienced developers on real, mature codebases. With AI tools, they were 19% slower. They felt 20% faster. That result is annoying because it matches my own feeling. Code appears quickly, so the work feels faster. Then I lose the time figuring out whether any of the code was the right code. (I tracked the same shift in my own work in Two years of real AI development.)

METR's sample was a specific case: devs on codebases they already knew well. So don't read the 19% as universal. But I see the same felt-vs-measured gap in my own work and in teammates'. I notice the felt-faster part every day. I notice the measured-slower part when a release I thought was a week away turns out to be a month away.

Two objections worth pre-empting before going further.

Maybe AI does help on the upper layers. Fair. It generates tests, drafts specs, summarizes logs, writes migrations. The claim isn't "AI can't touch the layers above code." It's narrower: AI helps once I've correctly named the layer. I generate tests at Layer 2 after I know what "works" means. I probe the system at Layer 3 after I know there's an edge to find. Misclassify, and AI burns cycles on the wrong target.

Maybe the bottleneck didn't move, it just became visible. Also fair. Senior engineers always knew product decisions sat above code. The new thing is that bad layer-judgment now shows up in my AI bill within hours, not as a missed quarter.

The Unknowns Stack

I've started thinking about the work as five layers. Don't get hung up on the names. The habit I'm trying to build is asking, when I'm stuck, what kind of stuck this is. You don't see the next layer up until you walk into it.

The Unknowns Stack: Desire, Vision, Surface, Truth, Code. AI collapsed Layer 1; layers 2-5 didn't move.

Layer 1, Code. Writing the thing. AI made this much cheaper, especially when the spec is clear.

Layer 2, Truth. Verification. Do you even know what "works" means for this feature? When I shipped link previews, what was I checking? Image renders? URL resolves? Domain whitelist? Truth is the criteria for "done." When the criteria are clear (does the test pass, does the request return 200), AI helps a lot. When they aren't, AI generates plausible code for the wrong target. Kent Beck's "augmented coding" work since mid-2025 covers a lot of the same ground from the engineering-discipline side.

Layer 3, Surface. Unknowns at the boundary. The Facebook block was Surface. From inside the code, you can't see it. The system has edges, and the edges have rules you discover by hitting them. Cynefin (Snowden, 2007) calls this the complex domain, where you sense the system's response by probing it. AI can't probe what it can't see.
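Probing here is literal: fetch the URL the way an anonymous preview fetcher would and look at what actually comes back. A minimal sketch, assuming Node 18+ for the global fetch; the helper name is mine:

```typescript
// Probe a URL the way an anonymous preview fetcher would and report what came back.
// Assumes Node 18+ (global fetch); names are illustrative, not from any library.
async function probeForOpenGraph(url: string) {
  const res = await fetch(url, { redirect: "follow" });
  const html = await res.text();
  return {
    status: res.status,
    finalUrl: res.url, // a bounce to a login page shows up here
    hasOgTitle: /<meta[^>]+property=["']og:title["']/i.test(html),
    bytes: html.length,
  };
}

// probeForOpenGraph("https://www.facebook.com/some-post").then(console.log);
```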

Layer 4, Vision. What the product is becoming. Vision sharpens with shipping. You don't know which features matter until users use them, and you don't know which users matter until you've shipped to enough of them. Teresa Torres has codified the discovery side of this for product teams.

Layer 5, Desire. What people actually want. Path-dependent. Formed by living. The thing your founder wants on Monday is not the thing they wanted last Tuesday, because they spent the week experiencing the product they thought they wanted last Tuesday. AI can't predict this. Neither can you.

None of these layers are mine. Tests, discovery, Cynefin, build-measure-learn, all of it predates AI by years. What I need is the list in front of me before I prompt, so I remember to ask which layer I'm trying to move.

When the bottom layer got cheap, the other four stopped hiding.

Software is millions of decisions

Strip the abstraction. Building software is making millions of small decisions. Variable name. Endpoint shape. Database type. Auth flow. Pricing model. Onboarding step order. The question that actually matters now is which of these deserve real attention.

In the era I trained in, every decision got roughly the same level of attention. Waterfall front-loaded all of it. Agile spread it out. Neither one really priced decisions differently, because redoing the work was always the same kind of expensive. Most of it was code to write.

AI changed how much each one costs to redo. Decisions split into four types now:

| Type | Verify cost | Velocity | Layer it sits on | Tool to try |
| --- | --- | --- | --- | --- |
| Non-negotiable | High, irreducible | Slow. Audit. | 2-3 | ADRs (Nygard), Working Backwards |
| Negotiable | Low, taste | Fast. Ship default. | 4 | Bezos Type-2 doors, Shape Up appetites |
| Contextual | Medium, needs cases | Defer. Wait for N. | 3 | Last Responsible Moment (Poppendieck) |
| Fact-verified | Cheap, measurement | Measure. Move on. | 1-2 | Trunk + flags + observability |

Non-negotiable: data model, security, compliance, public API contract. Wrong now equals expensive later. Slow down.

Negotiable: button copy, color choice, dashboard layout, whether settings live in a modal or a page. Wrong now equals a few hours to fix later. Ship a default.

Contextual: which queue to pick, which framework, which auth provider. The right answer depends on facts you don't have yet. Wait until you have N use cases. Don't pre-commit.

Fact-verified: any decision where success is machine-checkable. Did latency drop? Did error rate fall? Did the conversion rate move? Ship the change behind a flag, measure, decide.

The hard part isn't always making the decision. It's noticing what kind of decision you're making. I've lost days treating button-copy decisions like architecture decisions. I've also done the opposite, which is worse: shipping a non-negotiable as a quick default and paying for it for six months.

If I look at the layer I'm stuck on, the bucket usually picks itself. Layer 1 stuck is mostly fact-verified work; AI helps. Layer 2 stuck is contextual plus non-negotiable; I slow down. Layer 4 or 5 stuck — I stop pretending methodology can resolve it upfront. I ship something small and find out.

The indie hacker loop, scaled

Indie hackers were already living in this world. Ship something small, watch what breaks, ship the next version. Eric Ries named the loop fifteen years ago — build, measure, learn — but the people who actually ran it were the ones whose writing-cost was already low enough.

What changed is that small teams now have roughly that writing-cost. A small team can probe the system at a speed that used to take ten engineers. Probing surfaces edges (Layer 3). Shipping sharpens what you're aiming at (Layer 4). Watching users tells you what they actually want (Layer 5).

This isn't "indie hacking is now enterprise." Compliance, multi-tenancy, ten-year data retention still need teams. The narrower claim: the loop those people ran under solo constraints now works at small-team scale.

EcoMitram, the broad-then-narrow lesson

I build EcoMitram, an environmental awareness app for rural India. We were on v2 (which had its own data-quality archaeology before we even shipped). I started v3.

The plan was simple. Migrate v2 features to v3 with cleaner architecture. AI wrote most of the migration. It went fast. I added features I'd been wanting to add: gamification, multi-language, offline mode, group reporting. Scope ballooned. The AI kept up. Features kept coming.

Then I tried to ship. Edges everywhere. Multi-language fallbacks broke: users on Gujarati saw English labels mid-flow because translations existed for happy paths, not error states. Offline reports submitted from low-connectivity villages collided when devices reconnected, two field workers editing the same submission with no merge strategy. The photo-upload pipeline retried indefinitely on 2G, draining batteries that take a full day to charge from solar. AI couldn't fix the edges. I assumed the AI was the failure.

It wasn't. Most of those edges weren't real yet.

I'd built elaborate handling for use cases that didn't exist at my scale. Three field workers in three months. None of them were triggering the edge cases I was patching, but I'd written code for a hypothetical fleet of fifty. I'd misclassified contextual decisions (which conflict-resolution rule, which fallback locale, which retry policy) as non-negotiable. I'd built infrastructure for traffic that wasn't there.

I pulled back to trunk. Deleted half the v3 code. Shipped a narrow version. Three features. One language. No gamification. The next week I added one thing. The week after, another. Each week was faster than the previous one, not because I got better at AI prompting, but because there was less surface area for the AI to confuse itself on.

The actual lesson was simpler than I want to make it sound: I had built for a version of EcoMitram that didn't exist yet. Three users. Fifty users' worth of edge cases.

Coming back used to be expensive. A second migration was real work. AI made the second migration cheap, which is what changes the calculus. Broad scope no longer costs only time. It costs confusion: phantom edges, AI cycles burned on problems that aren't real yet, debt accreting around use cases that never arrive.

Now I ship narrower. Not as a slogan. Because broad-v3 nearly buried me in fake edge cases.

What to try Monday morning

Three things I'd run for the next week. Each one matches a different decision type from the table. None of the techniques are new. What changed for me is that I can now ship the underlying code in an afternoon, so the same old technique pays off far more than it used to.

Next non-negotiable decision (data model, security, compliance, public API): write a one-page ADR. Michael Nygard wrote up the format in 2011. Title, context, decision, consequences. Write the consequences down before you write the code. AI can draft the ADR with you, but you have to read it.

Next architectural call where you're not sure if it's reversible: classify it Bezos-style before deciding. Type 1 doors (one-way, hard to reverse) get audit-grade attention. Type 2 doors (reversible) get shipped fast. Bezos described it in the 2016 shareholder letter. Most teams treat every decision like Type 1, which is why they ship slowly even with AI writing the code.

Next ship: put the change behind a flag and pick one metric before shipping. If the metric moves, keep it. If not, kill it. Ship-and-measure is the bucket where AI scales hardest, because the answer is in the data instead of in your head.
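A minimal sketch of the shape of it; the flag store and the threshold are hypothetical stand-ins for whatever flag and analytics stack you already run:

```typescript
// One flag, one metric, one decision rule written down before shipping.
// The flag store and the 2% noise floor are illustrative, not a specific library.
const FLAGS: Record<string, boolean> = { "link-previews-v2": true };

const isEnabled = (flag: string): boolean => FLAGS[flag] ?? false;

// Keep the change only if the metric beats the baseline by more than the
// noise floor chosen up front.
function keepChange(baseline: number, withChange: number, minLift = 0.02): boolean {
  return (withChange - baseline) / baseline > minLift;
}

if (isEnabled("link-previews-v2")) {
  // e.g. conversion 3.1% before, 3.4% behind the flag: roughly 9.7% lift, keep it
  console.log(keepChange(0.031, 0.034)); // true
}
```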

That's three. ADR for the next non-negotiable. Bezos classification for the next architectural call. One-metric-as-flag for the next ship.

The literature has more techniques. Geoffrey Huntley's Ralph Loop is the right tool when you have a tight, machine-verifiable target. Shape Up's appetites work when you can carve negotiable decisions into time-boxed bets. Teresa Torres' continuous discovery is the move at Layer 4 if you have a product team and recurring customer access. They all help. Three habits I run every week beat seven I read about and never get around to.

Five things I do differently now

After a year of running this loop on EcoMitram and a few smaller projects, here's what changed in my own workflow. Nothing on this list is new. The change is that I do them on purpose now, instead of when I happen to remember.

1. I name the layer before I prompt. Before I send a request to AI, I ask which layer this sits on. If it's Surface (the system has rules I don't know yet), no prompt will help, I need to probe the boundary. If it's Truth (I don't know what "works" looks like), I write the test first.

Snowden's Cynefin paper makes the same diagnose-first argument for decision-making in general. The classic mistake is picking the tool before you've worked out what kind of situation you're in. Simon Willison's "context engineering" framing is the same idea applied to AI: the real skill is constructing the right context for the task, not crafting a clever instruction. Naming the layer takes thirty seconds. Skipping it costs hours.

2. I time-box AI debugging. If AI can't fix a bug in two or three rounds of iteration, the bug isn't where I think it is. It's almost always one layer up. Switch contexts. Read the docs of the system at the boundary. Look at the network log. Stop typing into the AI. (I keep a Bug Investigation Protocol in my CLAUDE.md for this exact reason.)

Anthropic's own Claude Code docs codify this: "After two failed corrections, /clear and write a better initial prompt incorporating what you learned. A clean session with a better prompt almost always outperforms a long session with accumulated corrections." The trap, the one I keep falling into, is feeling like the next prompt will finally do what you're asking. It won't. The next prompt has to be on a fresh context, at a different layer.

3. I delete more than I iterate. When AI builds something that doesn't work after two rounds of feedback, the issue is usually the prompt or the scope, not the code. Delete the file. Re-scope. Re-prompt.

Joel Spolsky in 2000 called rewriting "the single worst strategic mistake any software company can make." That advice held up well for two decades of hand-written code. Simon Willison in 2025: "Writing code is cheap now." Both right, in their time. The thing that changed in between is what a rewrite actually costs.

4. I ship to one real user before I ship to many. Layer 5 (Desire) only reveals itself in production. Nielsen's classic usability-testing finding: the first user surfaces about 31% of usability problems, five users about 85%. That's specifically a usability-testing curve, not a "you've now learned everything" claim. But the same shape shows up in product testing too. One real person doing one real task tells me more than all the internal demos I've ever done put together.
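Those two numbers come from the standard Nielsen-Landauer curve, where each user independently surfaces roughly 31% of the problems. A quick check:

```typescript
// Nielsen-Landauer: share of usability problems found after n test users,
// assuming each user independently surfaces a fraction L of them.
const L = 0.31;
const found = (n: number): number => 1 - Math.pow(1 - L, n);

console.log(found(1).toFixed(2)); // "0.31", one user
console.log(found(5).toFixed(2)); // "0.84", five users, roughly the 85% figure
```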

What the user says and what the user does aren't the same thing. The intention-behavior gap (Sheeran & Webb) is the formal version: stated intentions explain only a fraction of actual behavior. So I treat survey data and demo reactions as weak signal. Demos and surveys produce, in Rob Fitzpatrick's framing, "bad data": opinions, hypotheticals, compliments. Only observed behavior on a real task counts.

On EcoMitram, the field worker who silently ignores half the features I built tells me more than the survey that scores the app four-out-of-five. The features I cut after watching one user are the ones I'd otherwise still be debugging.

5. I write a one-sentence success criterion before any non-trivial feature. "This is done when X happens with Y characteristics." If I can't write the sentence, I'm still at Layer 2 and AI will burn cycles producing plausible-but-wrong code.

GitHub shipped a Spec Kit toolkit under the "spec-driven development" label in September 2025: spec is the primary artifact, code is its expression. The older version is Dan North's BDD, Given-When-Then, named almost twenty years before any agent existed. Writing the criteria down isn't new. AI just made me pay for skipping it sooner, in cycles I burned on plausible-but-wrong code.
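Written that way, the criterion is one small step from an executable check. A sketch using a generic test runner; the feature and module names are invented for illustration:

```typescript
// Given-When-Then as an executable "done" criterion.
// Hypothetical feature and module; the runner is vitest, but any would do.
import { describe, it, expect } from "vitest";
import { buildPreview } from "./preview"; // hypothetical module under test

describe("link previews", () => {
  it("is done when an allowlisted URL yields a card with a title and an image", async () => {
    // Given a URL on a host we chose to support
    const url = "https://github.com/octocat/hello";
    // When the preview is built
    const card = await buildPreview(url);
    // Then the card has the two fields the feed actually renders
    expect(card.title).toBeTruthy();
    expect(card.imageUrl).toBeTruthy();
  });
});
```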

Five gaps no methodology has solved yet

These are the parts I still don't know how to handle. None of the methodology guides I've found cover them well.

1. When does a contextual choice become architecture? Code accretes around an assumption that started as contextual. By month three, ten files depend on it. It's now non-negotiable, and nobody noticed when it crossed over. The closest thing we have is "tech debt," which is a description, not a fix.

2. Auditing AI-generated structural decisions. When an AI agent generates code, it makes Type 1 calls inside that code: schema shape, auth assumptions, failure modes. Code review catches syntax. Architecture review catches macro structure. The stuff in the middle gets missed, because nothing about it looks suspicious at the file level.

3. Discovering verification criteria while you build. Continuous discovery happens before build. Observability happens after ship. The middle phase, where you're building and discovering what "done" means simultaneously, has no formal loop. You learn the criteria as you go, but most methodology assumes you already knew them.

4. Solo non-negotiable decisions. Indie frameworks tell you to ship fast and iterate. They don't tell a solo founder how to make a one-shot call on auth, payments, data residency, or security. Those are the exact decisions that don't get a second chance.

5. Re-anchoring vision. Vision sharpens with shipping. It also drifts. Six months in, my mental model of the product has moved, the team's hasn't, and there's no calendar event for "is the vision still the vision."

If you have a working pattern for any of these, write back. I'd use it.

What didn't change

Not everything got cheap.

Cross-team coordination is still hard. Conway's Law didn't move. Two teams shipping into each other's surface can still break each other's work faster than AI can fix it.

Data migrations are still expensive. AI rewrites application code in minutes. Schema refactors are different — the data itself is the constraint, and the data doesn't fit in any context window. Renaming a column doesn't make the old rows stop existing. A schema you get right on day one still saves you years of grief later. (I wrote up a real production migration in WordPress to Next.js + Directus — the data side was the part AI couldn't shortcut.)

API contracts to external consumers still need stability. Once a third party depends on your endpoint shape, you can't iterate on it the way you iterate on internal code. The contract is non-negotiable from day one.

Security and compliance still need upfront thought. You can't ship-fast-and-fix-later your way through SOC 2, GDPR, HIPAA, or PCI. The criteria are external. They're regulated. They don't bend.

None of this is exceptional. These are just the parts of the work where code was never the bottleneck.

Closing

I don't think we have the right methodology for this yet. Agile, ADRs, Shape Up, discovery, flags. All useful. None of them were built for a world where code is cheap and judgment is the bottleneck.

AI didn't change what kinds of decisions exist. It changed the price of the cheap ones so much that the expensive ones are now the actual work. If a methodology fits this era at all, my guess is it starts by asking which layer the team is stuck on. I haven't read anyone who's nailed that one yet.

The thing I keep wanting is a detector. When did this reversible choice become permanent? When did the vision drift? Which AI-generated structure needs a real architecture review? If you've got a working pattern for any of that, write back. I'd use it.
