In my experience, the most pernicious temptation is to take the buggy, non-working code you have now and to try to modify it with "fixes" until the code works. In my experience, you often cannot get broken code to become working code because there are too many possible changes to make. In my view, it is much easier to break working code than it is to fix broken code.
Suppose you have a complete chain of N Christmas lights and they do not work when turned on. The temptation is to go through all the lights and to substitute in a single working light until you identify the non-working light.
But suppose there are multiple non-working lights? You'll never find the error with this approach. Instead, you need to start with the minimal working approach -- possibly just a single light (if your Christmas lights work that way), adding more lights until you hit an error. In fact, the best case is if you have a broken string of lights and a similar but working string of lights! Then you can easily swap a test bulb out of the broken string and into the working chain until you find all the bad bulbs in the broken string.
Starting with a minimal working example is the best way I have found to fix a bug. You will find you resist this, because you believe that you are close and that starting from scratch is too time-consuming. In practice, it tends to be a real time-saver, not the opposite.
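The Christmas lights argument can be sketched in a few lines of Python. This is a toy model under stated assumptions: bulbs are plain booleans, the chain is wired in series so it only lights if every bulb works, and the function names are made up for the illustration.

```python
def chain_lights_up(chain):
    """A series-wired chain only lights if every bulb in it works."""
    return all(chain)


def naive_search(broken):
    """Swap one known-good bulb into each position of the broken chain.

    Returns the positions where the swap made the whole chain light up.
    """
    found = []
    for i in range(len(broken)):
        trial = list(broken)
        trial[i] = True  # substitute a working bulb at position i
        if chain_lights_up(trial):
            found.append(i)
    return found


def search_from_working(broken, working):
    """Move each bulb from the broken chain into a known-working chain."""
    bad = []
    for i, bulb in enumerate(broken):
        trial = list(working)
        trial[0] = bulb  # test exactly one suspect bulb at a time
        if not chain_lights_up(trial):
            bad.append(i)
    return bad


broken = [True, False, True, True, False, True]  # two bad bulbs
working = [True] * 6

print(naive_search(broken))                  # [] (no single swap helps)
print(search_from_working(broken, working))  # [1, 4]
```

With two bad bulbs, fixing the broken chain one substitution at a time never produces a working chain, while testing from a known-working baseline isolates each fault independently.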
Really, that's important. You need to think clearly; deadlines and angry customers are a distraction. That's also when having a good manager who can trust you is important: his job is to shield you from all that so that you can devote all of your attention to solving the problem.
For #4 (divide and conquer), I've found `git bisect` helps a lot. If you have a known good commit and one of dozens or hundreds of commits after that is bad, this can help you identify the bad commit / code in a few steps.
I jumped into a pretty big unknown code base in a live consulting call and we found the problem pretty quickly using this method. Without that, the scope of where things could be broken was too big given the context (unfamiliar code base, multiple people working on it, only able to chat with 1 developer on the project, etc.).
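The idea behind `git bisect` can be sketched as a plain binary search. This is only an illustration of the principle, not how git implements it: `find_first_bad`, the commit names, and the `is_bad` check are hypothetical, and the sketch assumes a linear history in which every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence


def find_first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    every commit after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return commits[hi]


# Hypothetical history: 200 commits, the bug was introduced in c137.
history = [f"c{i}" for i in range(200)]
tested = []


def is_bad(commit: str) -> bool:
    tested.append(commit)  # record each "experiment" we had to run
    return int(commit[1:]) >= 137


culprit = find_first_bad(history, is_bad)
print(culprit, "found after", len(tested), "tests")
```

Each test halves the range of suspect commits, which is why hundreds of commits can be narrowed down in a handful of steps.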
Some additional rules:
- "It is your own fault". Always suspect your code changes before anything else. It can be a compiler bug or even a hardware error, but those are very rare.
- "When you find a bug, go back hunt down its family and friends". Think where else the same kind of thing could have happened, and check those.
- "Optimize for the user first, the maintenance programmer second, and last if at all for the computer".
> #1 Understand the system: Read the manual, read everything in depth, know the fundamentals, know the road map, understand your tools, and look up the details.
Maybe I misunderstand, but "Read the manual, read everything in depth" sounds like: oh, I have a bug in my code, so first read the entire manual of the library I'm using, all 700 pages, then read 7 books on the library details, and now that a month or two has passed, go look at the bug.
I'd be curious if there's a single programmer that follows this advice.
I have been bitten more than once thinking that my initial assumption was correct, diving deeper and deeper - only to realize I had to ascend and look outside of the rabbit hole to find the actual issue.
The article is a 2024 "review" (really more of a very brief summary) of a 2002 book about debugging.
The list is fun for us to look at because it is so familiar. The enticement to read the book is the stories it contains. Plus the hope that it will make our juniors more capable of handling complex situations that require meticulous care...
The discussion on the article looks nice but the submitted title breaks the HN rule about numbering (IMO). It's a catchy take on the post anyway. I doubt I would have looked at a more mundane title.
Also sometimes: the bug is not in the code, it's in the data.
A few times I looked for a bug like "something is not happening when it should" or "This is not the expected result", when the issue was with some config file, database records, or something sent by a server.
For instance, particularly nasty are non-printable characters in text files that you don't see when you open the file.
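One way to hunt for such characters is to scan the text by Unicode category. The sketch below is an assumption about what counts as "invisible" (control and format characters other than tab, plus non-ASCII spaces); the function name and the sample config text are made up for the illustration.

```python
import unicodedata


def find_invisible_chars(text: str):
    """Return (line, column, codepoint, name) for hard-to-see characters.

    Flags control (Cc) and format (Cf) characters other than tab, and
    space separators (Zs) other than the ordinary ASCII space.
    """
    hits = []
    # Split on "\n" only, so a stray "\r" is reported rather than swallowed.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            category = unicodedata.category(ch)
            suspicious = (
                (category in ("Cc", "Cf") and ch != "\t")
                or (category == "Zs" and ch != " ")
            )
            if suspicious:
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


# Hypothetical config text with a zero-width space and a no-break space.
sample = "host = example.com\u200b\nport\u00a0= 8080\n"
for hit in find_invisible_chars(sample):
    print(hit)
```

For a real file, the text would typically come from something like `open(path, encoding="utf-8").read()`; which characters deserve flagging depends on the file format.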
"simulate the failure" is sometimes useful, actually. Ask yourself "how would I implement this behavior", maybe even do it.
Also: never reason on the absence of a specific log line. The logs can be wrong (bugged) too, sometimes. If you are printf-debugging a problem around a conditional, for instance, log both branches.
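A minimal sketch of "log both branches", using Python's standard `logging` module. The `apply_discount` function and its discount rule are hypothetical; the point is only that every path through the conditional leaves a trace, so a missing line means the function was never reached rather than "probably the other branch".

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("debug-sketch")


def apply_discount(total: float, is_member: bool) -> float:
    """Hypothetical function under investigation."""
    if is_member:
        log.debug("apply_discount: member branch, total=%r", total)
        return total * 0.9
    else:
        # Logged as well: silence should never be the evidence.
        log.debug("apply_discount: non-member branch, total=%r", total)
        return total


apply_discount(100.0, True)
apply_discount(100.0, False)
```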
I also think it is worthwhile stepping through working code with a debugger. The actual control flow reveals what is actually happening and will tell you how to improve the code. It is also a great way to demystify how others' code runs.
One good timesaver: debug in the easiest environment that you can reproduce the bug in. For instance, if it’s an issue with a website on an iPad, first see if you can reproduce it in Chrome using the responsive tools in the web developer tools. If that doesn’t work, see if it reproduces in desktop Safari. Then the iPad simulator, and only then the real hardware. Saves a lot of frustration and time, and each step towards the actual hardware eliminates a whole category of bugs.
I just spent a whole day trying to figure out what was going on with a radio. Turns out I had tx/rx swapped. When I went to check tx/rx alignment I misread the documentation in the same way as the first time. So, I would even add "try switching things anyways" to the list. If you have solid (but wrong) reasoning for why you did something then you won't see the error later even if it's right in front of you.
I used to manage a team that supported an online banking platform and gave a copy of this book to each new team member. If nothing else, it helped create a shared vocabulary.
It's useful to get the poster and make sure everyone knows the rules.
Over twenty-five-odd years, I have found that general debugging prowess is best achieved by doing it. I'd recommend taking the list/buying the book, using https://up-for-grabs.net to find bugs on GitHub/Bugzilla, etc. and doing the following:
1. set up the dev environment
2. fork/clone the code
3. create a new branch to make changes and tests
4. use the list to try to find the root cause
5. create a pull request if you think you have fixed the bug
> Make it fail: Do it again, start at the beginning, stimulate the failure, don't simulate the failure, find the uncontrolled condition that makes it intermittent, record everything and find the signature of intermittent bugs
Unfortunately, I have found many times that this is actually the most difficult step. I've lost count of how many times our QA reported an intermittent bug in their env, only for us to never be able to reproduce it again in the lab. Until it hits 1 or 2 customers in the field, but then when we try to take a look at the customer's env, it's gone and we don't know when it could come back again.
Taking the time to speed up my iteration cycles has always been incredibly valuable. It can be really painful because it's not directly contributing to determining/fixing the bug (which could be exacerbated if there is external pressure), but it's always been worth it. Of course, this only applies to instances where it takes ~4+ minutes to run a single 'experiment' (test, startup etc). I find when I do just try to push through with long running tests I'll often forget the exact variable I tweaked during the course of the run. Further, these tweaks can be very nuanced and require you to maintain a lot of the larger system in your head.
I know it is the best route, I do know the system (maybe I wrote it) and yet time and again I don’t take the time to read what I should… and I make assumptions in hopes of speeding up the process/ fix, and I cost myself time…
> Check that it's really fixed, check that it's really your fix that fixed it, know that it never just goes away by itself
I wish this were true, and maybe it was in 2004, but when you've got noise coming in from the cloud provider and noise coming in from all of your vendors I think it's actually quite likely that you'll see a failure once and never again.
I know I've fixed things for people without asking if they ever noticed it was broken, and I'm sure people are doing that to me also.
> Quit thinking and look (get data first, don't just do complicated repairs based on guessing)
From my experience, this is the single most important part of the process.
Once you keep in mind that nothing paranormal ever happens in systems and everything has an explanation, it is your job to find the reason for things, not guess them.
I tell my team: just put your brain aside and start following the flow of events checking the data and eventually you will find where things mismatch.
I've had trouble keeping the audit trail. It can distract from the flow of debugging, and there can be lots of details to it, many of which end up being irrelevant; i.e. all the blind rabbit holes that were not on the maze path to the bug. Unless you're a consultant who needs to account for the hours, or a teller of engaging debugging war stories, the red herrings and blind alleys are not that useful later.
I had the incredible luck to stumble upon this book early in my career and it helped me tremendously in so many ways. If I could name only one it would be that it helped me get over the sentiment of being helpless in front of a difficult situation. This book brought me to peace with imperfection and me being an artisan of imperfection.
One thing I have been doing is to have the software create a directory called "debug" and write lots of different files with debugging information as the main program executes, but only outside of hot loops. I then visually inspect the logs after the program has exited.
For intermediate representations this is better than printf to stdout.
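A rough sketch of that approach, assuming JSON-serializable intermediate data. The `dump_debug` helper, the file names, and the two-stage pipeline are made up for the illustration, and the example writes under a temporary directory rather than a "debug" folder next to the program.

```python
import json
import tempfile
from pathlib import Path


def dump_debug(debug_dir: Path, name: str, data) -> Path:
    """Write one intermediate result to its own file for later inspection."""
    debug_dir.mkdir(parents=True, exist_ok=True)
    path = debug_dir / f"{name}.json"
    path.write_text(json.dumps(data, indent=2, sort_keys=True), encoding="utf-8")
    return path


def run_pipeline(raw: str, debug_dir: Path) -> dict:
    """Hypothetical two-stage pipeline with a dump after each stage."""
    tokens = raw.split()
    dump_debug(debug_dir, "01_tokens", tokens)

    counts = {}
    for tok in tokens:  # hot loop: no file writes in here
        counts[tok] = counts.get(tok, 0) + 1
    dump_debug(debug_dir, "02_counts", counts)
    return counts


debug_dir = Path(tempfile.mkdtemp()) / "debug"
result = run_pipeline("a b a c a", debug_dir)
print(sorted(p.name for p in debug_dir.iterdir()))
```

Numbering the files by stage keeps them in pipeline order when listed, which makes it easier to spot the first stage whose output looks wrong.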
I can't comment further on David A. Wheeler's review because his words were from 2004 (He said everything true), and I can't comment on the book either because I haven't read it yet.
Thank you for introducing me to this book.
One of my favorite rules of debugging is to read the code in plain language. If the words don't make sense somewhere, you have found the problem or part of it.
#7 Check the plug: Question your assumptions, start at the beginning, and test the tool.
I have found that 90% of network problems are bad cables.
That's not an exaggeration. Most IT folks I know throw out ethernet cables immediately. They don't bother testing them. They just toss 'em in the trash, and break a new one out of the package.
Each comment: "..and this is my 10th rule: <insert witty rule>"
Total number of rules when reaching the end of the post: 9 + n + n * m, with n being number of users commenting, m being the number of users not posting but still mentally commenting on the other users' comments.
Review was good enough to make me snag the entire book. I'm taking a break from algorithmic content for a bit and this will help. Besides, I've got an OOM bug at work and it will be fun to formalize the steps of troubleshooting it. Thanks, OP!
Nice classic that sticks to timeless principles. The nine rules are practical, with war stories that make them stick. But I agree that "don't panic" should be added.
Personally, I’d start with divide and conquer.
If you’re working on a relevant code base chances are that you can’t learn all the API spec and documentation because it’s just too much.
I love the "if you didn't fix it, it ain't fixed". It's too easy to convince yourself something is fixed when you haven't fully root-caused it. If you don't understand exactly how the thing you're seeing manifested, papering over the cracks will only cause more pain later on.
As someone who has been working on a debugging tool (https://undo.io) for close to two decades now, I totally agree that it's just weird how little attention debugging as a whole gets. I'm somewhat encouraged to see this topic staying near the top of Hacker News for as long as it has.
Debugging: Indispensable rules for finding even the most elusive problems (2004)
(dwheeler.com) | 524 points by omkar-foss | 13 January 2025 | 225 comments
Comments
Here's a walk through on using it: https://nickjanetakis.com/blog/using-git-bisect-to-help-find...
The Martian by Andy Weir https://en.wikipedia.org/wiki/The_Martian_(Weir_novel)
https://en.wikipedia.org/wiki/Zen_and_the_Art_of_Motorcycle_...
https://en.wikipedia.org/wiki/The_Three-Body_Problem_(novel)
To Engineer Is Human - The Role of Failure in Successful Design By Henry Petroski https://pressbooks.bccampus.ca/engineeringinsociety/front-ma...
https://en.wikipedia.org/wiki/Surely_You%27re_Joking,_Mr._Fe...!
Julia Evans also has a very nice zine on debugging: https://wizardzines.com/zines/debugging-guide/
> Assumption is the mother of all screwups.
1. Is this mistake somewhere else also?
2. What next bug is hidden behind this one?
3. What should I do to prevent bugs like this?
Rule 1 should be: Reproduce with most minimal setup.
99% of the time you’ll already have found the bug.
1% for me was a font that couldn’t do a combination of letters in a row. life ft, just didn’t work and that’s why it made mistakes in the PDF.
No way I could’ve ever known that if I hadn’t reproduced it down to the letter.
Just split code in half till you find what’s the exact part that goes wrong.
It's useful to get the poster and make sure everyone knows the rules.
https://debuggingrules.com/download-the-poster/
10) Enable frame pointers [1].
[1] The return of the frame pointers:
https://news.ycombinator.com/item?id=39731824
https://www.udacity.com/course/debugging--cs259
And use Rule 0 from GuB-42: Don't panic
(edited for line breaks)
"Dave was asked as the author of Debugging to create a list of 5 books he would recommend to fans, and came up with this.
https://shepherd.com/best-books/to-give-engineers-new-perspe..."
Don't be too embarrassed to scatter debug log messages in the code. It helps.
My second rule:
Don't forget to remove them when you're done.
The only thing I don't agree with is that the book costs US$ 4.291,04 on Amazon.
Wheeler gets close to it by suggesting to locate which side of the bug you're on, but often I find myself doing this recursively until I locate it.
I've followed https://debugbetter.com/ for a few weeks and the content has been great!
You can’t trust a thing this person says if they’re not recommending a duck.
Title is: David A. Wheeler's Review of Debugging by David J. Agans
Yeah, ain't nobody got time for that. If e.g. debugging a compile issue meant we read the compiler manual, we'd get nothing done...