4 days ago i wrote about ralph, the AI coding loop that builds software while you sleep. zero lines written manually. that post blew up. but here's the thing: after watching dozens of people try to implement it, i realized most are doing it wrong.
even anthropic's official plugin misses the point. the creator, geoff huntley, has called this out publicly.
so what's the actual canonical way? and why does getting it wrong turn a $30 build into a $300 mess that still doesn't work?
let me break it down.
the one insight that makes ralph work
ralph isn't about running AI in a loop. any script can do that.
ralph works because it keeps the AI operating in its smartest mode.
here's what most people don't understand about large language models: they get worse as context grows.
think of the context window like a whiteboard. at the start, it's clean. the AI reads your instructions clearly. executes precisely. makes smart decisions.
but as the conversation continues, that whiteboard fills up. old code. previous attempts. failed approaches. random tangents.
by the time the window is mostly full, the AI is wading through noise to find signal. it forgets things. contradicts itself. makes obvious mistakes.
ralph solves this by wiping the whiteboard after every single task.
fresh start. full brainpower. every time.
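in claude code terms, that means each task gets a brand-new session. a tiny sketch of the idea (flag names vary by version; -p just means non-interactive, and the point is that nothing resumes the previous conversation):

```bash
# every iteration spawns a fresh claude process with an empty context window.
# ralph never resumes the old session (no --continue / --resume).
cat prompt.md | claude -p    # task 1: clean whiteboard
cat prompt.md | claude -p    # task 2: clean whiteboard again
```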
why anthropic's plugin gets it wrong
the official ralph wiggum plugin from anthropic does something the creator never intended.
instead of wiping context completely, it compacts.
compaction means the AI looks at everything that happened and picks out what it thinks is important to carry forward.
sounds reasonable. but there's a fatal flaw.
the AI doesn't know what's actually important. it guesses. and when it guesses wrong, critical information disappears. bugs compound. features break in ways that don't make sense.
the whole point of ralph vanishes.
the plugin also adds max iterations and completion conditions. but sometimes you want ralph to keep running indefinitely. it finds bugs you didn't know existed. adds improvements you didn't think of. surfaces edge cases that would have broken production.
when you cap iterations, you cut off that discovery.
the growing file problem
another common mistake: letting the AI modify its own instructions on each loop.
one popular approach has the agent update an agents.md file every iteration. learning from what it did. adding notes for next time.
sounds smart. breaks everything.
models are verbose by default. each loop adds tokens. ten iterations in, you've stuffed the context window before the actual task even starts.
in trying to make the AI smarter, you've pushed it into exactly the zone where it starts making mistakes.
the canonical ralph keeps the prompt static. the only thing that changes is a simple flag marking tasks complete. nothing else grows.
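a quick way to watch that from outside the loop, assuming your plan file (prd.md in the setup below) marks each task with a passes: true/false line:

```bash
# count tasks still marked unfinished. this count shrinking is the only
# state that should change between iterations -- the prompt never grows.
grep -c 'passes: false' prd.md
```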
what geoff huntley actually intended
the original implementation is brutally simple.
one bash while loop.
it reads a prompt file. runs claude. waits for it to finish. loops again.
that's it.
no compaction. no growing memory files. no clever additions.
the prompt tells the AI: read the plan, pick the most important incomplete task, implement it, test it, commit it, mark it done. exit when everything passes.
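a bare-bones prompt.md in that spirit. i'm paraphrasing the idea here, not quoting the official file:

```bash
# write the static prompt once. it never changes between iterations.
cat > prompt.md <<'EOF'
read the plan file. pick the most important task that is not yet done.
implement only that task. run its validation steps.
if they pass, commit the change and mark the task done in the plan.
if every task is already done, make no changes and exit.
EOF
```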
the key is that the loop lives outside the model's control. the AI can't decide when to stop. it can't modify the loop itself. it just executes tasks until the external script sees everything marked complete.
this keeps the AI focused on one thing at a time with maximum context available for that one thing.
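here's roughly what the whole thing looks like. this is a sketch, not the exact script from the repo, and the claude flags (-p for non-interactive mode, the permissions flag) may differ across versions:

```bash
#!/usr/bin/env bash
# ralph, the short version: a fresh claude process per iteration,
# with loop control living outside the model.
while true; do
  # hand the static prompt to a brand-new session
  cat prompt.md | claude -p --dangerously-skip-permissions

  # the script, not the model, decides when to stop:
  # exit once no task in the plan is still marked unfinished
  if ! grep -q 'passes: false' prd.md; then
    break
  fi
done
```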
the task structure that works
your plan file needs specific structure. vague tasks produce vague results.
each task should have:
a clear category (frontend, backend, database, etc.)
a specific description of what done looks like
concrete validation steps the AI can check
a passes flag (true or false)
the AI reads the plan, finds tasks where passes is false, picks the highest priority one, implements it, runs the validation steps, and only marks passes true if everything checks out.
your exit condition: the loop only stops when every single task shows passes true.
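here's one hypothetical entry in that shape. the exact layout is up to you; the validation commands below are placeholders for whatever your stack uses:

```bash
# append a sample task to the plan. field names mirror the structure above.
cat >> prd.md <<'EOF'
## task: password reset flow
category: backend
done means: POST /api/reset-password emails a single-use token that expires after 15 minutes
validation:
- npm test -- reset.test.ts passes
- requesting a reset for an unknown email returns 200 and sends nothing
passes: false
EOF
```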
if you're lazy with task definitions, ralph is lazy with implementation. garbage in, garbage out.
variations that don't break the core
some people have built on ralph correctly.
one approach runs multiple ralphs in parallel. different tasks, different instances, all feeding into the same codebase. dramatically faster for large projects.
another uses browser automation for validation. instead of just running tests, the AI actually opens the app, clicks through flows, verifies things work like a real user would.
a third treats tasks as github issues. ralph picks the most important open issue, implements it, closes the issue when done. clean integration with existing workflows.
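a rough sketch of that last flavor, using the gh CLI. the details here are mine, not lifted from a specific repo:

```bash
# github-issues ralph: open issues are the task list,
# zero open issues is the exit condition.
while true; do
  gh issue list --state open --limit 20 > open-issues.txt
  [ -s open-issues.txt ] || break   # nothing open left: stop
  # fresh session per iteration. the prompt tells the agent to pick one issue,
  # implement it, and close it (gh issue close) once validation passes.
  cat prompt.md open-issues.txt | claude -p --dangerously-skip-permissions
done
```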
these all work because they add capabilities without breaking the fundamental principle: fresh context every loop.
the moment you start accumulating state inside the context window, you're back to the original problem.
when ralph makes sense
ralph is a proof of concept machine.
you have an idea. you want to see if the architecture works. you want to validate your tech stack choices before committing to a full build.
ralph can build the entire thing overnight. multiple versions if you want. you wake up, review what it created, and know whether your approach is sound.
this is incredibly valuable. the exploratory phase that used to take weeks now takes hours.
but ralph is not production engineering.
the code needs review. edge cases need human judgment. architectural decisions need context ralph doesn't have.
use ralph to validate. use your actual engineering process to ship.
the setup in 10 minutes
ralph is free and open source: https://github.com/snarktank/ralph
you need four files.
prompt.md: static instructions that never change. tells the AI how to read the plan, pick tasks, implement, validate, and mark complete.
prd.md: your task list. every feature broken into specific, validatable chunks with passes flags.
activity.md: a log file. each loop appends a short note about what happened. the log lives on disk and the AI starts fresh every loop, so you get visibility into the run without that history piling up inside the context window.
settings.json: sandbox configuration. limits what commands the AI can run. you're giving it permissions to act autonomously, so constrain the damage radius.
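as a starting point for that last file: claude code reads permission rules from settings.json as allow/deny lists. the entries below are placeholder guesses, so check the ralph repo and the claude code docs for the exact schema and tighten the rules for your own stack:

```bash
# a conservative sandbox config. adjust the allow/deny lists to your project.
cat > settings.json <<'EOF'
{
  "permissions": {
    "allow": ["Edit", "Write", "Bash(npm test:*)", "Bash(git commit:*)"],
    "deny": ["Bash(rm:*)", "Bash(curl:*)", "WebFetch"]
  }
}
EOF
```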
create the prd by describing your project to claude. have it generate the task breakdown. review and adjust. then run the loop.
most projects need 15-25 iterations. budget accordingly.
the cost math
typical ralph run: 10-20 iterations at roughly $2-3 each. call it $30-50 for a complete proof of concept.
one builder shipped an entire app that would have cost $50,000 to hire out. spent under $300.
the economics only work if you're getting clean iterations. if your implementation is wrong and the AI is spinning in circles, you burn tokens accomplishing nothing.
correct setup: $30 and a working proof of concept.
wrong setup: $300 and a broken mess you have to rebuild manually.
the difference is fresh context every loop.
the bottom line
ralph works because it respects how AI actually functions. limited context, maximum focus, clean resets.
the moment you add compaction, growing files, or accumulated state, you're fighting against the architecture instead of working with it.
one bash loop. static prompt. clean task list. fresh start every iteration.
that's it. that's the whole thing.
part one showed you what ralph can build. this one showed you how to set it up right.
i put together a one-pager with my exact claude code setup. the interview prompt. the progress file template. the feature testing checklist. everything i covered here in a copy-paste format.
RT + comment "RALPH" and i'll send it over.
if you want more breakdowns like this:
youtube (launching soon): https://www.youtube.com/@damianplayer (deep dives and tutorials)
linkedin: https://www.linkedin.com/in/damianplayer (daily posts on AI and business)


