How I Write Software With Agents

There’s a lot of churn in the AI/LLM/Agent space, but I’ve landed on an approach that reliably produces production-quality code for non-trivial work. The core insight is simple: use multiple LLMs to develop a solid architecture, then implement it through a phased crawl/walk/run approach.

Getting the plan right before writing code is where most of the value is. Here’s how that plays out in practice.

Process Overview

  1. Set up your environment
  2. Arrive at an architecture through multiple LLMs
  3. Adversarially review the result
  4. Generate a crawl/walk/run implementation plan
  5. Implement phase by phase, reviewing as you go

Environment Prep

Dev Containers

The agent needs to run your tests, catch its own errors, and iterate without you babysitting it. Without that feedback loop, you’re stuck manually pasting error messages back in or clicking “yes, run bin/rails test” after every step.

Dev Containers solve this for most backend stacks. Download the .devcontainer/ setup from Claude Code and adapt it to your stack; it’s a reasonable starting point. Once you have the Dev Container plugin in VS Code or Cursor, the next time you open the project it’ll ask if you want to use Dev Containers. Say yes, and the agent gets a sandboxed environment to work in without touching your host machine (don’t sue me if Claude finds a Docker exploit and pwns your system!).
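To give a feel for it, here’s a minimal sketch of a .devcontainer/devcontainer.json for a Rails-ish project (the image tag, name, and postCreateCommand are assumptions; adapt them to your stack):

{
  "name": "myapp-dev",
  // Assumed image/tag for illustration; pick one matching your stack.
  "image": "mcr.microsoft.com/devcontainers/ruby:3.3",
  "forwardPorts": [3000],
  "postCreateCommand": "bundle install && bin/rails db:prepare"
}

With that in place, the agent can run bin/rails test inside the container and read the failures itself.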

Browser Integration

If you’re on Cursor, you can wire up a browser. Without one, the agent can’t see when a change breaks your front-end. Browser use is exceptionally expensive, though; use a cheap model for anything browser-related, since all you’re looking for is basic page rendering, a 500 page, etc.

Agent Instructions

AGENTS.md — Put one in the root of your project. Mine contains the items below (a condensed example follows the list):

  • Project overview
  • Common development commands
  • Architecture notes (non-obvious things agents can’t infer from the code, like “multi-tenancy is handled through subdomains”)
  • Technology stack
  • Testing conventions and how to run them
  • Key directories
  • Development process expectations: ask questions before starting, update ChangeLog.md, run tests and lint before declaring done
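As a rough illustration (the project name and commands are invented; yours will differ), a condensed AGENTS.md might read:

# AGENTS.md

## Overview
Acme is a multi-tenant Rails SaaS app. Multi-tenancy is handled through subdomains.

## Commands
- bin/rails test: run the test suite
- bin/rubocop: lint

## Process
Ask clarifying questions before starting. Update ChangeLog.md when behavior changes. Run tests and lint before declaring done.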

Framework-specific rules — If you’re using a popular framework, someone has probably already done this work. Cursor Directory, community rules repos, and open-source apps like Maybe are all worth cribbing from. The specific content matters less than having a declared, standardized approach.

General-purpose config — I use Superpowers to give models a baseline of guidance I don’t have to type every session.

What I Don’t Do

I’ve always been skeptical of cargo-cult effort estimation. It feels even less relevant now that AI is writing all the code. Any feature that’s technically possible can be implemented with minimal human effort. I do still think about whether something is worth doing at all, and I try to enforce a time-box (“feature foo is not worth doing if it takes more than X of my days working on it”).

Little things? JFDI.

If it’s small enough that I wouldn’t sit down and plan it out myself, I don’t make the agent plan it either. Adding a field to a form, tweaking how an existing behavior works, small refactors: these go straight to implementation.

Please read @AGENTS.md, and update @ChangeLog.md if appropriate. 

Our task for today is: <<the task>>

Do you have any questions before we begin?

<<If not, GO>>

Arrive at an architecture

This is where the real work happens for anything non-trivial.

When to use this process:

  • Larger features: “add Sign in with Google SSO to our Rails app”
  • Exploratory work: “how should we approach migrating from PostgreSQL to BigQuery?”
  • Things I don’t know well enough to spot a hallucination
  • Things I’m not sure are even possible (tl;dr: you can’t)

How much rigor I apply scales with how unfamiliar the territory is. “Add SSO with Devise/OmniAuth”? I already know the rough shape of the solution, so I’ll run it through one LLM to validate my assumptions and generate a plan. “Build a distributed image ingestion pipeline at 30k/sec”? I’m opening three browser tabs.

For the complex or unfamiliar cases, I’ll open ChatGPT, Gemini, Claude, and Grok simultaneously and give them the same prompt. The reason to do this isn’t just to collect opinions. Hallucinations become detectable through consensus: when three LLMs agree and one says something different, the outlier is almost always the one that made something up. The false confidence of a single LLM giving a plausible-sounding wrong answer is a real danger when you’re in territory you don’t know well.

As I work through each conversation, I’ll refine the prompt based on clarifying questions one LLM raises, then carry those refinements into the others.

The goal at this stage is high-level structure, no code:

Claude|ChatGPT|Gemini|Grok
I'm working on a new thing. It is: 
* high level description
* key things it has to do
* any functional requirements
* any non-functional requirements

I am wondering how best to conceptually architect this.

No code, but:
* What components should I have?
* What should the architecture look like?
* Are there examples like this that you know of we can crib from?

Adversarial Review

Once I have responses from all the models, I synthesize a single architecture document, picking the best ideas from each and discarding the wrong stuff (and there’s always some). Sometimes one LLM is a clear winner. Usually it’s a mix.

If one LLM has a suggestion I’m genuinely uncertain or just curious about, I’ll send it to the others for a hard critique:

Claude
ChatGPT had a suggestion. 

I am not sure if this is a good idea or not, and I need your help. 

Please critique it as strongly as possible. If it holds up, incorporate it. If not, tell me what to tell ChatGPT.

If Claude pushes back, I’ll take its objection back to ChatGPT:

ChatGPT
Claude says <<foo>>. What do you think? Be as critical as possible. Is Claude right? If so, let's incorporate it. If not, why not?

At that point I have enough to make the call myself. The LLMs are surfacing trade-offs I might have missed; the decision/responsibility is still mine.

Once I’m satisfied, I ask for a final architecture document:

Let's recap everything as an architecture document we can start implementing from. 

Don't forget: testability, what done looks like, functional and non-functional requirements, goals, budgetary and cost constraints.

Detailed enough to implement from, but keep code minimal unless a specific implementation detail is critical to nail down. Focus on areas like (pick what is necessary for this feature and add any you think I missed):
 
* Design guidelines and principles
* Core concepts and metaphors
* Functional and non-functional requirements
* Budgetary/cost considerations
* Technology stack
* Domain models
* Services
* Background jobs/workers
* UX flow
* Implementation phases

Break it up into different Implementation Phases that are grouped logically for this project/feature. 

The Plan: Crawl, Walk, Run

This is the part that makes everything else work.

The LLM takes the architecture document and breaks it into three, uh, Subphasii? Phases-sub?:

Crawl: Skeletonize. Class definitions, relationships, attributes, just the framing. No real implementation. The goal is to verify that the architecture makes sense before any meaningful code exists.

Walk: Flesh it out with the simplest thing that could possibly work. Write tests that validate the core assumptions.

Run: Full implementation.

The value is that wrong assumptions surface during Crawl, when nothing has been built on top of them yet. If I try to one-shot a non-trivial feature, I routinely get deep into implementation before discovering that some gem doesn’t do what we thought, or an API doesn’t work quite like we expected. Working things back out is expensive in both tokens and time.
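For a concrete picture of what a Crawl artifact looks like, imagine a hypothetical image-ingestion feature; at this stage the code is nothing but framing (every name here is invented for illustration):

# Crawl phase: relationships and attributes only; no behavior yet.
class ImageUpload
  attr_reader :id, :source_url, :status # :pending, :processed, :failed

  def initialize(id:, source_url:, status: :pending)
    @id = id
    @source_url = source_url
    @status = status
  end
end

class ImageIngestionPipeline
  # Will pull pending uploads and hand each to a processor.
  def run
    raise NotImplementedError
  end
end

If the framing already feels wrong (say, status doesn’t belong on the upload), you find out here, not three phases in.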

I paste the architecture into Claude (in my experience, Claude is consistently better at plan generation than the alternatives, but by the time you’re reading this that may have changed; constant experimentation with new models is key!) and say:

This is a high-level architecture for [description of problem]. We’re going to implement this with a Cursor Agent.

We need to take a crawl/walk/run approach to expose uncertainty early and avoid building a lot of code on wrong assumptions.

Crawl: Skeletonize: class definitions, relationships, attributes. Just the framing. No implementation.

Walk: Flesh out the implementation, simplest thing that could possibly work. Write tests that exercise this functionality and validate deeper assumptions.

Run: Full implementation.

Take the architecture document and break it into small, discrete chunks covering all three sub-phases (Crawl, Walk, Run). Keep the end goal in mind so we don’t back ourselves into a corner with short-sighted thinking, but don’t contort Crawl/Walk into a pretzel either.

(Depending on how uncertain I feel about the earlier phases, I might have it plan just Phase 1 if I’m expecting to go back to it with changes.)

I briefly scan the plan headings and high-level steps before implementation. I suspect that by the time you’re reading this (and/or the next model rolls around) I won’t even bother, since I never uncover anything major here, but it’s good situational awareness for what the agent is about to build.


Implementation

Review as You Go

One smart model did the thinking (expensive). A cheaper model now does the heavy lifting. The key is keeping each unit of work small enough that I can actually review it.

It’s easy to look at a 500-line architecture.md and convince yourself you understand it, if you review it at all (I bet most people don’t; they just skim it, if that!). Reviewing working code or clicking through a running skeleton is a different kind of understanding, and it’s where the “wait, this is totally wrong” moments happen while they’re still cheap.

I put the architecture and phase files in a predictable spot:

docs/features/[feature-name]/
-- architecture.md
-- phase1.md
-- phase2.md
-- phase3.md
-- and so on.
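Each phase file is just a concrete checklist the agent can execute against. For the hypothetical SSO feature, a phase1.md might start like this (contents invented for illustration):

Phase 1: Crawl
1. Add the OmniAuth gems to the Gemfile; no configuration yet.
2. Skeletonize SessionsController#omniauth and a User.from_omniauth class method.
3. Stub the /auth/google_oauth2/callback route.
4. Add pending tests naming the behaviors we expect.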

Then I start a fresh agent session for each logical chunk (I pick chunk sizes by weighing the size and complexity of each step against how much I can reasonably review in one go; as models improve, I expect those chunks to keep growing):

Claude
Please read @AGENTS.md and @architecture.md.

We will be implementing @phase1.md, steps [1-4][5-7][8] etc.

Ensure tests pass and lint is clean, then I'll review. If review passes, commit the code.

Cursor manages the TODO list internally, which means if the agent breaks partway through I can restart without losing my place.

I typically just look at the test suite - does it seem to cover what we want? - and the UI. Do the flows “feel” right? Does it behave the way I expect and want?

Once a chunk is done and reviewed, I archive that agent session and start a new one for the next chunk. Fresh context means the LLM isn’t carrying around all the reasoning from steps it already completed, and at each chunk we have a working system (it might not do the thing we eventually want it to do, but it’s not broken), so I don’t have to keep a bunch of “where are we again?” context in my brain.


What Actually Goes Wrong

The process breaks in two ways, and both are easily recoverable.

The first is a bad assumption baked into the architecture. This almost always surfaces in Crawl. Since no meaningful code has been written, the fix is easy: go back to the main LLM window, correct the architecture and regenerate the plan, and restart.

This is the whole point of having a Crawl phase - we can fix things early before we’ve written gobs of code and have to spend a lot of tokens to undo all that work.

The second is an agent that’s gotten stuck or starts spinning its wheels (you get a feel for it by watching the thinking output; sometimes it starts going in circles). That’s a sign the context has gotten polluted or we’re biting off more than we can chew. I just stop the agent, ask it to summarize where we’re at, and then open a new one. This is exactly why the discrete phase files and the “commit after each chunk” discipline pay off: we have a checkpoint where we can easily pick back up where we left off.
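The handoff prompt can be as simple as something like:

Claude
We're stopping here. Summarize what's done, what's in progress, what's broken, and the next step, so I can paste it into a fresh session.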


I’m quite happy with how this has held up. What does your process look like?
