On a similar note, has anyone found themselves absolutely not trusting non-code LLM output?
The code is at least testable and verifiable. For everything else I am left wondering if it's the truth or a hallucination. It incurs more mental burden that I was trying to avoid using LLM in the first place.
Example usage from that README (and the blog post):
% go run main.go \
--spec 'develop a function to take in a large text, recognize and parse any and all ipv4 and ipv6 addresses and CIDRs contained within it (these may be surrounded by random words or symbols like commas), then return them as a list' \
--sig 'func ParseCidrs(input string) ([]*net.IPNet, error)'
In Rust, there's a controversial practice around putting unit tests in the same file as the actual code. I was put off by it at first, but I'm finding LLM autocomplete is able to be much more effective just being able to see the tests.
> For best results, our project structure needs to be set up with LLM workflows in mind. Specifically, we should carefully manage and keep the cognitive load required to understand and contribute code to a project at a minimum.
What's the main barrier to doing this all the time? Sounds like a good practice in general.
In my experience, I let the LLM help me produce code and tests. Most of my human effort is dedicated to verifying the tests, and then using the tests to verify the code.
Automation doesn't seem like a good idea. I feel it's mandatory to carefully guard the LLM, not only to verify that the LLM-generated tests (functions) are as expected, but also to modify some code that, while not affecting the correctness of the function, has low performance or poor readability.
We implemented something similar for our Java backend project based on my rant here: https://testdriven.com/testdriven-2-0-8354e8ad73d7 Works great! I only look at generated code if it passes the tests. Now, can we use LLMs to generate tests from requirements? Maybe, but tests are mostly declarative and are easier to write than production code most of the time. This approach also allows us to use cheaper models, because the tool will automatically tell the model about compile error and failed tests. Usually, we give it up to five attempts to fix the code.
Super interesting approach! We've been working on the opposite - always getting your Unit tests written with every PR. The idea is that you don't have to bother running or writing them, you just get them delivered in your Github repo. You can check it out here https://www.codebeaver.ai
If you want better tests with more cases exercising your code: write property based tests.
Tests form an executable, informal specification of what your software is supposed to do. It should absolutely be written by hand, by a human, for other humans to use and understand. Natural language is not precise enough for even informal specifications of software modules, let alone software systems.
If using LLM's to help you write the code is your jam, I can't stop you, but at least write the tests. They're more important.
As an aside, I understand how this antipathy towards TDD develops. People write unit tests, after writing the implementation, because they see it as boilerplate code that mirrors what the code they're testing already does. They're missing the point of what makes a good test useful and sufficient. I would not expect generating more tests of this nature is going to improve software much.
Hey, yeah, this is a fun idea. I built a little toy llm-tdd loop as a Saturday morning side project a little while back: https://github.com/zephraph/llm-tdd.
This doesn't actually work out that well in practice though because the implementations the llm tended to generate were highly specific to pass the tests. There were several times it would cheat and just return hard coded strings that matched the expects of the tests. I'm sure better prompt engineering could help, but it was a fairly funny outcome.
Something I've found more valuable is generating the tests themselves. Obviously you don't wholesale rely on what's generated. Tests can have a certain activation energy just to figure out how to set up correctly (especially if you're in a new project). Having an LLM take a first pass at it and then ensuring it's well structured and testing important codepaths instead of implementation details makes it a lot faster to write tests.
It basically starts from some model or controller, and then parses the Ruby code into an AST, and load all the references, and then parses that code into an AST, up to X number of files, and ships them all off to GPT4-o1 for writing a spec.
I found sometimes, without further prompting, the LLM would write specs that were so heavily mocked that it became almost useless like:
```
mock(add_two_numbers).and_return(3)
...
expect(add_two_numbers(1, 2)).to_return(3)
```
(Not that bad, just an illustrating example)
But the tests it generates is quite good overall, and sometimes shockingly good.
> recognize and parse any and all ipv4 and ipv6 addresses and CIDRs contained within it (these may be surrounded by random words or symbols like commas), then return them as a list'
Did I miss the generated code and test cases? I would like to see how complete it was.
For example, for IPv4 does it only handle quad-dotted IP addresses, or does it also handle decimal and hex formats?
For that matter, should it handle those, and if so, where there clarification of what exactly 'all ipv4 ... addresses' means?
I can think of a lot of tricky cases (like 1.2.3.4.5 and 3::2::1 as invalid cases, or http://[2001:db8:4006:812::200e] to test for "symbols like commas"), and would like to see if the result handles them.
I’m not going to claim I’ve solved this and figured out “the way” to use LLMs for tests, but I’ve found that copy-and-pasting code + tests and then providing a short essay about my own reasoning of edge cases followed with something along the lines of
“your job is to find out what edge cases my reasoning isn’t accounting for, cases that would expose latent properties of the implementation not exposed via its contract, cases tested for by other similar code, domain exceptions I’m not accounting for, cases that test unexplored code paths, cases that align exactly with chunking boundaries or that break chunking assumptions, or any other edge cases I’m neglecting to mention that would be useful both to catch mistakes in the current code and to handle foreseeable mistakes that could arise from refactoring in the future. Try to understand how the existing test cases are defined to catch possibly problematic inputs and extend accordingly. Take into account both the api contract and the underlying implementation and approach this matter from an adversarial perspective where the goal of the tests is to challenge the author’s assumptions and break their code” has been useful.
Test-driven development with an LLM for fun and profit
(blog.yfzhou.fyi)207 points by crazylogger 16 January 2025 | 84 comments
Comments
1. Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.
2. Coding agents do massively better when they have a test-driven reward signal.
3. If a problem can be framed in a way that a coding agent can solve, that speeds up development at least 10x from the base case of human + assistant.
4. From (1)-(3), if you can get all the necessary context into 50k tokens and measure progress via tests, you can speed up development by 10x.
5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.
Sure enough, I see HN projects evolving in that direction.
The code is at least testable and verifiable. For everything else I am left wondering if it's the truth or a hallucination. It incurs more mental burden that I was trying to avoid using LLM in the first place.
Example usage from that README (and the blog post):
The all important prompts it uses are in https://github.com/yfzhou0904/tdd-with-llm-go/blob/main/prom...Is the label "TDD" being hijacked for something new? Did that already happen? Are LLMs now responsible for defining TDD?
No clunky loop needed.
It's gotten me back into TDD.
What's the main barrier to doing this all the time? Sounds like a good practice in general.
Automation doesn't seem like a good idea. I feel it's mandatory to carefully guard the LLM, not only to verify that the LLM-generated tests (functions) are as expected, but also to modify some code that, while not affecting the correctness of the function, has low performance or poor readability.
If you want better tests with more cases exercising your code: write property based tests.
Tests form an executable, informal specification of what your software is supposed to do. It should absolutely be written by hand, by a human, for other humans to use and understand. Natural language is not precise enough for even informal specifications of software modules, let alone software systems.
If using LLM's to help you write the code is your jam, I can't stop you, but at least write the tests. They're more important.
As an aside, I understand how this antipathy towards TDD develops. People write unit tests, after writing the implementation, because they see it as boilerplate code that mirrors what the code they're testing already does. They're missing the point of what makes a good test useful and sufficient. I would not expect generating more tests of this nature is going to improve software much.
Edit added some wording for clarity
That's what the software industry has been trying and failing at for more than a decade.
This doesn't actually work out that well in practice though because the implementations the llm tended to generate were highly specific to pass the tests. There were several times it would cheat and just return hard coded strings that matched the expects of the tests. I'm sure better prompt engineering could help, but it was a fairly funny outcome.
Something I've found more valuable is generating the tests themselves. Obviously you don't wholesale rely on what's generated. Tests can have a certain activation energy just to figure out how to set up correctly (especially if you're in a new project). Having an LLM take a first pass at it and then ensuring it's well structured and testing important codepaths instead of implementation details makes it a lot faster to write tests.
https://gist.github.com/czhu12/b3fe42454f9fdf626baeaf9c83ab3...
It basically starts from some model or controller, and then parses the Ruby code into an AST, and load all the references, and then parses that code into an AST, up to X number of files, and ships them all off to GPT4-o1 for writing a spec.
I found sometimes, without further prompting, the LLM would write specs that were so heavily mocked that it became almost useless like:
``` mock(add_two_numbers).and_return(3) ... expect(add_two_numbers(1, 2)).to_return(3) ``` (Not that bad, just an illustrating example)
But the tests it generates is quite good overall, and sometimes shockingly good.
Did I miss the generated code and test cases? I would like to see how complete it was.
For example, for IPv4 does it only handle quad-dotted IP addresses, or does it also handle decimal and hex formats?
For that matter, should it handle those, and if so, where there clarification of what exactly 'all ipv4 ... addresses' means?
I can think of a lot of tricky cases (like 1.2.3.4.5 and 3::2::1 as invalid cases, or http://[2001:db8:4006:812::200e] to test for "symbols like commas"), and would like to see if the result handles them.