Absolutely, LLMs are great for greenfield projects. They can get you to a prototype for a new idea faster than any tool yet invented. Where they start to break down, I find, is when you ask them to make changes or refactors to existing code and mature projects. They usually lack context, so they don't hesitate to introduce lots of extra complexity, add frameworks you don't need, and in general just make the situation worse. Or if they do get you to a solution, it will have taken so long that you might as well have done the heavy lifting yourself. LLMs are still no substitute for actually understanding your code.
The first part of this, where you told it to ask YOU questions rather than laboriously building prompts and context yourself, was the magic ticket for me. And I doubt I would have stumbled on that sort of inverse logic on my own. Really great write-up!
That lonely/downtime section at the end is a giant red flag for me.
It looks like the sort of nonproductive yak-shaving you do when you're stuck or avoiding an unpleasant task--coasting, fooling around incrementally with your LLM because your project's fucked and you psychologically need some sense of progress.
The opposite of this is burnout--one of the things they don't tell you about successful projects with good tools is they induce much more burnout than doomed projects. There's a sort of Amdahl's Law in effect, where all the tooling just gives you more time to focus on the actual fundamentals of the product/project/problem you’re trying to address, which is stressful and mentally taxing even when it works.
Fucking around with LLM coding tools, otoh, is very fun, and like constantly clean-rebuilding your whole (doomed) project, gives you both some downtime and a sense of forward momentum--look how much the computer is chugging!
The reality testing to see if the tool is really helping is to sit down with a concrete goal and a (near) hard deadline. Every time I've tried to use an LLM under these conditions it just fails catastrophically--I don't just get stuck, I realize how basically every implicit decision embedded in the LLM output has an unacceptably high likelihood of being wrong, and I have an amount of debug cycles ahead of me exceeding the time to throw it all away and do it without the LLM by, like, an order of magnitude.
I'm not an LLM-coding hater and I've been doing AI stuff that's worked for decades, but current offerings I've tried aren't even close to productive compared to searching for code that already exists on the web.
This is the first article I’ve come across that truly utilizes LLMs in a workflow the right way. I appreciate the time and effort the author put into breaking this down.
I believe most people who struggle to be productive with language models simply haven’t put in the necessary practice to communicate effectively with AI. The issue isn’t with the intelligence of the models—it’s that humans are still learning how to use this tool properly. It’s clear that the author has spent time mastering the art of communicating with LLMs. Many of the conclusions in this post feel obvious once you’ve developed an understanding of how these models "think" and how to work within their constraints.
I’m a huge fan of the workflow described here, and I’ll definitely be looking into aider and Repomix. I’ve had a lot of success using a similar approach with Cursor in Composer Agent mode, where Claude-3.5-sonnet acts as my "code implementer." I strategize with larger reasoning models (like o1-pro, o3-mini-high, etc.) and delegate execution to Claude, which excels at making inline code edits. While it’s not perfect, the time savings far outweigh the effort required to review an "AI Pull Request."
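For anyone wanting to reproduce that split outside Cursor, here is a minimal sketch of the "reasoning model plans, Claude implements" handoff, assuming the openai and anthropic Python clients; the model names, prompts, and feature description are placeholder assumptions rather than the commenter's actual setup.

```python
# Sketch of the planner/implementer handoff. Model names, prompts, and the
# feature description below are illustrative assumptions.
from openai import OpenAI
import anthropic

planner = OpenAI()                    # reads OPENAI_API_KEY
implementer = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY

feature = "Add CSV export to the reports page"  # hypothetical task

# 1. Strategize with a reasoning model: produce a file-by-file plan.
plan = planner.chat.completions.create(
    model="o3-mini",  # assumed reasoning model
    messages=[{
        "role": "user",
        "content": f"Draft an implementation plan for: {feature}. "
                   "List the files to touch and the changes per file.",
    }],
).choices[0].message.content

# 2. Delegate execution to Claude, which does the actual code edits.
edits = implementer.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed implementer model
    max_tokens=4000,
    system="You are the code implementer. Follow the plan exactly.",
    messages=[{"role": "user", "content": plan}],
).content[0].text

print(edits)  # review like an "AI pull request" before applying anything
```

The useful part of this shape is that the plan becomes a reviewable artifact before any code is generated.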
Maximizing efficiency in this kind of workflow requires a few key things:
- High typing speed – Minimizing time spent writing prompts means maximizing time generating useful code.
- A strong intuition for "what’s right" vs. "what’s wrong" – This will probably become less relevant as models improve, but for now, good judgment is crucial.
- Familiarity with each model’s strengths and weaknesses – This only comes with hands-on experience.
Right now, LLMs don’t work flawlessly out of the box for everyone, and I think that’s where a lot of the complaints come from—the "AI haterade" crowd expects perfection without adaptation.
For what it’s worth, I’ve built large-scale production applications using these techniques while writing minimal human code myself.
Most of my experience using these workflows has been in the web dev domain, where there's an abundance of training data. That said, I’ve also worked in lower-level programming and language design, so I can understand why some people might not find models up to par in every scenario, particularly in niche domains.
Has anyone who evolved from a baseline of just using Cursor chat and freestyling to a proper workflow like this got any anecdata to share on noticeable improvements?
Does the time invested into the planning benefit you? Have you noticed fewer hallucinations? Have you saved time overall?
I’d be curious to hear because my current workflow is basically:
1. Have idea
2. create-next-app + ShadCN + TailwindUI boilerplate
3. Cursor Composer on agent mode with Superwispr voice transcription
I’m gonna try the author’s workflow regardless, but would love to hear others’ opinions.
Something I quickly learned while retooling this past week is that it’s preferable not to add opinionated frameworks to the project, as they increase the size of the context the model needs to be aware of. That context is also not likely to be available in the training data.
For example, rather than using Plasmo for its browser extension boilerplate and packaging utilities, I’ve chosen to ask the LLM to set up all of that for me, so it won’t have any blind spots when tasked with debugging.
This is all fine for a solo dev, but how does this work with a team / squad, working on the same code base?
Having 7 different instances of an LLM analyzing the same code base and making suggestions would not just be economically wasteful, it would also be impractical, or even dangerous?
Outside of RAG, which is a different thing, are there products that somehow "centralize" the context for a team, where all questions refer to the same codebase?
I ended up finishing my side projects when I kept these in mind, rather than focusing on elegant code for elegant code's sake.
It seems the key to using LLMs successfully is to make them create a specification and execution plan, through making them ask /you/ questions.
If this skill--specification and execution planning--is passed onto LLMs, along with coding, then are we essentially souped-up tester-analysts?
> if it doesn’t work, Q&A with aider to fix
I fix errors myself, because LLMs are capable of producing large chunks of really stupid/wrong code, which needs to be reverted, and that's why it makes sense to see the code at least once.
Also, I used to find myself in situations where I tried to use an LLM for the sake of using an LLM to write code (a waste of time).
Your workflow is much more polished; will definitely try it out for my next project.
Would be great if there were more details on the costs of doing this work - especially when loading lots of tokens of context via Repomix and then generating code with that context (context-loaded inference API calls are more expensive, correct?). A dedicated post discussing this and related considerations would be even better. Are there cost estimations in tools like aider (vs just refreshing the LLM platform’s billing dashboard)?
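As a rough answer, here is a back-of-the-envelope estimator, assuming a ~4 characters/token heuristic, placeholder per-million-token prices, and an assumed Repomix output filename (check your provider's pricing page for real numbers):

```python
# Back-of-the-envelope cost for stuffing a Repomix bundle into context.
# Prices and the output filename are placeholder assumptions.
def estimate_cost(bundle_path: str,
                  input_price_per_mtok: float = 3.00,    # assumed $/1M input tokens
                  output_price_per_mtok: float = 15.00,  # assumed $/1M output tokens
                  expected_output_tokens: int = 4_000) -> float:
    text = open(bundle_path, encoding="utf-8").read()
    input_tokens = len(text) / 4  # crude ~4 chars/token heuristic
    return (input_tokens * input_price_per_mtok
            + expected_output_tokens * output_price_per_mtok) / 1_000_000

print(f"~${estimate_cost('repomix-output.xml'):.2f} per context-loaded request")
```

For what it's worth, aider does, as far as I know, print per-message token and cost estimates when it knows the model's pricing, which partly answers the billing-dashboard question.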
Am I the only one that doesn’t see the hype with Claude? I recently tried it, hit the usage limit, read around, found tons of blogs and posts from devs saying Claude is the best code-assistant LLM… so I purchased Claude Pro… and I hate it. I have been asking it surface-level questions about Apache Spark (configuring the number of task retries, errors, error handling, etc.) and it hallucinated so much, so frequently. It reminds me of ChatGPT 3…
What am I doing wrong or what am I missing? My experience has been so underwhelming I just don’t understand the hype for why people use Claude over something else.
Sorry, I know there are many models out there, and Claude is probably better than 99% of them. Can someone help me understand the value of it over o1/o3? I honestly feel like I like 4o better.
/frustration-rant
https://news.ycombinator.com/item?id=43057907
Other than that, a great article! Very insightful.
I think LLM codegen still requires a mental model of the problem domain. I wonder how many upcoming devs will simply never develop one. Calculators are tools for engineers /and/ way too many people can't even do basic receipt math.
I find making the LLMs think and plan the project a bit worrying. I understand this helps with procrastination, but when these systems eventually get better and more integrated, the most likely thing to happen to software devs is moving away from purely coding to more of a solution architect role (aka planning stuff), not to mention the negative impact of giving up critical thinking to LLMs.
This is pretty much my flow that I landed on as well. Dump existing relevant files into context, explain what we are trying to achieve, and ask it to analyze various approaches, considerations, and ask clarifying questions. Once we both align on direction, I ask for a plan of all files to be created/modified in dependency order with descriptions of the required changes. Once we align on the plan I say lets proceed one file at a time, that way I can ensure each file builds on the previous one and I can adjust as needed.
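A sketch of what the two prompts in this flow might look like; the wording and the example filename are illustrative, not the commenter's exact prompts.

```python
# Sketch of the two prompts implied by the flow above.
PLANNING_PROMPT = """\
Here are the relevant existing files and what we are trying to achieve.
Analyze possible approaches, list trade-offs and considerations, and ask me
clarifying questions before proposing anything.

Once we agree on a direction, produce a plan listing every file to be created
or modified, in dependency order, with a short description of the changes.
"""

PER_FILE_PROMPT = """\
Let's proceed one file at a time. Produce the full contents of the next file
in the plan ({filename}), building only on files we have already finalized.
"""

print(PER_FILE_PROMPT.format(filename="src/db/schema.ts"))  # hypothetical file
```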
If I have to go to this much effort, what is AI buying us here? Why don't we just put the effort in to learn to write code ourselves? Instead of troubleshooting AI problems and coming up with clever workarounds for those problems, troubleshoot your code, solve those problems directly!
This is a great article -- I really appreciate the author giving specific examples. I have never heard of mise (https://mise.jdx.dev/) before either, and the integration with the saved prompts is a nifty idea -- excited to try it out!
I have been using LLMs for a long time, but these prompt ideas are fantastic; they really opened up a world for me.
A lot of the benefit of LLMs is bringing up ideas or questions I am not thinking of right now, and this really does that. Typically this would happen as I dig through a topic, not beforehand. So that's a net benefit.
I also tried it and it worked a charm; the LLM did respect the context and the step-by-step approach, poking holes in my ideas. Amazing work.
I still like writing code and solving puzzles in my mind, so I won't be doing the "execution" part. From there on, I mostly use the LLM as autocomplete, for when I'm stuck, or for obscure bug solving. Otherwise, I don't get any satisfaction from programming, having learnt nothing.
Great write up! Roughly how many Claude tokens are you using per month with this workflow? What’s your monthly API costs?
Also what do you mean by “I really want someone to solve this problem in a way that makes coding with an LLM a multiplayer game. Not a solo hacker experience.” ?
Something I’ve started to do recently is mob programming with LLMs.
I act as the director, creativity and ideas person. I have one LLM that implements, and a second LLM that critiques and suggests improvements and alternatives.
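A minimal sketch of that implementer/critic pairing, assuming the anthropic Python client; the model name, role prompts, single review round, and example task are all illustrative assumptions.

```python
# Mob-programming sketch: one model implements, a second critiques, the
# director (the human) decides what lands. Everything below is illustrative.
import anthropic

client = anthropic.Anthropic()        # reads ANTHROPIC_API_KEY
MODEL = "claude-3-5-sonnet-latest"    # assumed; any capable model works


def ask(system: str, prompt: str) -> str:
    msg = client.messages.create(model=MODEL, max_tokens=4000, system=system,
                                 messages=[{"role": "user", "content": prompt}])
    return msg.content[0].text


task = "Write a rate-limiting middleware for our Express API"  # director's idea

draft = ask("You are the implementer. Write complete, working code.", task)
review = ask("You are the critic. Point out bugs, risks, and better alternatives.",
             f"Task: {task}\n\nProposed implementation:\n{draft}")
revised = ask("You are the implementer. Revise your code to address the critique.",
              f"Task: {task}\n\nDraft:\n{draft}\n\nCritique:\n{review}")

print(revised)  # the director reviews and decides what actually lands
```

Keeping the human as the final reviewer, rather than looping implementer and critic automatically, seems to be the part that keeps this from going off the rails.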
About the first step, you probably also need to give the LLM some kind of context so it has enough information to iterate with you about the new feature idea.
So either you put the whole codebase into the context (which will mostly lead to problems, as tokens are limited) or you have some kind of summary of your current features, etc.
Or do you do some kind of "black box" iterations, which I feel won’t be that useful for new features, as the model should know about current features, etc.?
What’s the way here?
As opposed to Vintage Pioneer code?
Given a 3,648,318-token repository (number from Repomix), I'm not sure what the cost would be of using a leading LLM to analyse it and ask for improvements.
Isn't the input token limit way smaller than that?
This part is unclear to me in the "non-Greenfield" part of the article.
Iterating with aider on very limited scopes is easy; I've used it often. But what about understanding a whole repository and acting on it? Following imports to understand a TypeScript codebase as a whole?
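Quick arithmetic on that question, assuming a 200k-token input window (an assumption about current frontier models; check your provider's documented limit):

```python
# A 3,648,318-token repo vs an assumed 200k-token input window.
REPO_TOKENS = 3_648_318   # from Repomix, as quoted above
CONTEXT_WINDOW = 200_000  # assumed model input limit

chunks_needed = -(-REPO_TOKENS // CONTEXT_WINDOW)  # ceiling division
print(f"Repo is ~{REPO_TOKENS / CONTEXT_WINDOW:.0f}x the window; "
      f"at least {chunks_needed} separate context loads would be needed, "
      "so one-shot whole-repo analysis isn't possible at this size.")
```

Which is presumably why the practical answers are either filtering what goes in (Repomix's include/ignore options) or sending a condensed view of the repo rather than every file.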
> I really want someone to solve this problem in a way that makes coding with an LLM a multiplayer game. Not a solo hacker experience. There is so much opportunity to fix this and make it amazing.
This, I think, is the grand vision -- what could it look like?
In my mind programming should look like a map -- you can go anywhere, and there'll be things happening. And multiple people.
If anyone wants to work on this (or has comments), hit me up!
In our company we are only allowed to use GitHub Copilot with GPT or Claude, but not Claude directly. I'm really struggling to get good results from it, so I'll try to adapt your workflow to that setup. To the community: do you have any additional guidance for that setup?
I don’t mind LLMs, but what irks me is the customer noncompete: you have these systems that can do almost anything, and the legal terms explicitly say you’re not allowed to use the thing for anything that competes with the thing. But if the thing can do almost anything, then you really can’t use it for anything. Making a game with Grok? No, that competes with the xAI game studio. Making an agents framework with ChatGPT? No, that competes with Swarm. Making legal AI with Claude? No, that competes with Claude. It seems like the only American companies making AI we can actually use for work are HuggingFace and Meta.
This is effing great...thanks for sharing your experience.
I was just wondering how to give my edits back to in-browser tools like Claude or ChatGPT, but the idea of Repomix is great; will try!
Although I have been flying a bit with Copilot in VS Code, so right now I essentially have two AIs: one for larger changes (in the browser), and one for minor code fixes (in VS Code).
I'm curious to see his mise tasks. He lists a few of them near the end but I'm not sure what his LLM CLI is. Is that an actual tool or is he using it as a placeholder for "insert your LLM CLI tool here"?
Also, don't forget that your favorite AI tools can be of great help with the factors that cause us to make software: research, subject expertise, marketing, business planning, etc.
“Over my skis” ~ in over my head.
“Over my skies” ~ very far overhead. In orbit maybe?