War story: the hardest bug I ever debugged

(clientserver.dev)

Comments

aetimmes 27 March 2025
(disclaimer: I know OP IRL.)

I'm seeing a lot of comments saying "only 2 days? must not have been that bad of a bug". Some thoughts here:

At my current day job, our postmortem template asks "Where did we get lucky?" In this instance, the author definitely got lucky that they were working at Google where 1) there were enough users to generate this Heisenbug consistently and 2) that they had direct access to Chrome devs.

Additionally - the author (and his team) triaged, root caused and remediated a JS compiler bug in 2 days. The sheer amount of complexity involved in trying to narrow down where in the browser code this could all be going wrong is staggering. Consider that the reason it took him "only" two days is because he is very, _very_ good at what he does.

jdwithit 27 March 2025
I wish I could recall the details better but this was 20+ years ago now. In college I had an internship working at Bose, doing QA on firmware in a new multi CD changer addon to their flagship stereo. We were provided discs of music tracks with various characteristics. And had to listen to them over and over and over and over and over and over, running through test cases provided by QA management as we did. But also doing random ad-hoc testing once we finished the required tests on a given build.

At one point I found a bug where if you hit a sequence of buttons on the remote at a very specific time--I want to say it was "next track" twice right as a new track started--the whole device would crash and reboot. This was a show stopper; people would hit the roof if their $500 stereo crashed from hitting "next". Similar to the article, the engineering lead on the product cleared his schedule to reproduce, find, and fix the issue. He did explain what was going on at the time, but the specifics are lost to me.

Overall the work was incredibly boring. I heard the same few tracks so many times I literally started to hear them in my dreams. So it was cool to find a novel, highest severity bug by coloring outside the lines of the testcases. I felt great for finding the problem! I think the lead lost 20% of his hair in the course of fixing it, lol.

I haven't had QA as a job title in a long time but that job did teach me some important lessons about how to test outside the happy path, and how to write a reproducible and helpful bug report for the dev team. Shoutout to all the extremely underpaid and unappreciated QA folks out there. It sucks that the discipline doesn't get more respect.

BobbyTables2 27 March 2025
Interesting writeup, but 2 days to debug “the hardest bug ever”, while accurate, seems a bit overdone.

Though abs() returning negative numbers is hilarious.. “You had one job…”

To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.

I’m not just talking about concurrency issues either…

The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.

2 days is cute though.

nneonneo 27 March 2025
FWIW: this type of bug in Chrome is exploitable to create out-of-bounds array accesses in JIT-compiled JavaScript code.

The JIT compiler contains passes that will eliminate unnecessary bounds checks. For example, if you write “var x = Math.abs(y); if(x >= 0) arr[x] = 0xdeadbeef;”, the JIT compiler will probably delete the if statement and the internal nonnegative array index check inside the [] operator, as it can assume that x is nonnegative.

However, if Math.abs is then “optimized” such that it can produce a negative number, then the lack of bounds checks means that the code will immediately access a negative array index - which can be abused to rewrite the array’s length and enable further shenanigans.
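
A toy sketch of that mechanism, in Python rather than JavaScript. The flat "heap" with the length stored just before the elements is an assumption made for illustration and is not V8's real memory layout; `broken_abs`, `make_heap` and `unchecked_store` are hypothetical names.

```python
# Toy model only: an array's length lives in the slot right before its elements.
# This is NOT how V8 lays out memory; it just shows why a missing check matters.

def broken_abs(y):
    """Stand-in for the miscompiled Math.abs: behaves like the identity function."""
    return y


def make_heap(length):
    """Return (heap, base): heap[base - 1] holds the length, elements start at base."""
    heap = [0] * (length + 1)
    heap[0] = length
    return heap, 1


def unchecked_store(heap, base, index, value):
    """The store after the compiler has 'proved' index >= 0: no bounds check left."""
    heap[base + index] = value


heap, base = make_heap(4)
x = broken_abs(-1)                      # the optimizer assumed this is >= 0
unchecked_store(heap, base, x, 0xDEADBEEF)
corrupted_length = heap[base - 1]       # the length slot now holds attacker data
```

In this model the write with index -1 lands on the length slot, which is the "rewrite the array's length" step described above.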

Further reading about a JIT CVE pretty much exactly in this mold (CVE-2020-9802, which is in WebKit's JavaScriptCore rather than Chrome): https://shxdow.me/cve-2020-9802/

perihelions 27 March 2025
My own story: I spent >10 hours debugging an Emacs project that would occasionally cause a kernel crash on my machine. Proximate cause was a nonlocal interaction between two debug-print statements. (Wasn't my first guess). The Elisp debug-print function #'message has two effects: it appends to a log, and also does a small update notification in the corner of the editor window. If that corner-of-the-window GUI object is thrashed several hundred times in a millisecond, it would cause the GPU driver on my specific machine to lock up, for a reason I've never root-caused.

Emacs' #'message implementation has debounce logic: if you repeatedly debug-print the same string, it gets deduplicated. (If you call (message "foo") 50 times fast, the string printed is "foo [50 times]"). So if you debug-print a variable that infrequently changes (as was the case), no GUI thrashing occurs. The bug manifested when there were *two* debug-print statements active, which circumvented the debouncer, since the thing being printed was toggling between two different strings. Commenting out one debug-print statement, or the other, would hide the bug.
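
A minimal sketch of why alternating strings defeat that kind of deduplication. This is a toy model in Python, not Emacs' actual implementation; `DedupLog` and its `gui_updates` counter are hypothetical, and the model assumes a repeated string causes no redraw.

```python
class DedupLog:
    """Toy logger that collapses consecutive identical messages."""

    def __init__(self):
        self.lines = []        # entries of [text, repeat_count]
        self.gui_updates = 0   # how often the on-screen notification would redraw

    def message(self, text):
        if self.lines and self.lines[-1][0] == text:
            self.lines[-1][1] += 1   # same as last time: bump the counter only
            return
        self.lines.append([text, 1])
        self.gui_updates += 1        # new text: the GUI object gets touched


# One debug-print of a rarely changing value: almost no redraws.
same = DedupLog()
for _ in range(50):
    same.message("foo")

# Two debug-prints taking turns: every call looks "new" to the debouncer.
alternating = DedupLog()
for i in range(50):
    alternating.message("foo" if i % 2 == 0 else "bar")
```

Here `same.gui_updates` stays at 1 while `alternating.gui_updates` reaches 50, which matches the "either print alone hides the bug" behaviour.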

jason_tko 27 March 2025
Reminds me of the classic bug story where users couldn’t send emails more than 500 miles.

https://web.mit.edu/jemorris/humor/500-miles

friendzis 27 March 2025
My hardest bug story, almost circling back to the origin of the word.

An intern gets a devboard with a new mcu to play with. A new generation, but mostly backwards compatible or something like that. Intern gets the board up and running with embedded equivalent of "hello world". They port basic product code - ${thing} does not work. After enough hair is pulled, I give them some guidance - ${thing} does not work. Okay, I instruct intern to take mcu vendor libraries/examples and get ${thing} running in isolation. Intern fails.

Okay, we are missing something huge that should be obvious. We start pair programming and strip the code down layer by layer. Eventually we are at a stage where we are accessing hand-coded memory addresses directly. ${thing} does not work. Okay, set up a peripheral and read state register back. Assertion fails. Okay, set up peripheral, nop some time for values to settle, read state register back. Assertion fails. Check generated assembly - nopsled is there.

We look at the manual: the bit that switches the peripheral into the state we care about is not set. No matter how we poke the mcu, whatever we write to the control register, the bit is just not set and the peripheral never switches into the mode we need. We get a new devboard (or resolder mcu on the old one, don't remember) and it works first try.

"New device - must be new behavior" thinking with lack of easy access to the new hardware led us down a rabbit hole. Yes, nothing too fancy. However, I shudder thinking what if reading the state register gave back the value written?

latexr 27 March 2025
It’s amusing how so many of the comments here are like “You think two days is hard? Well, I debugged a problem which was passed down to me by my father, and his father before him”. It reminds me of the Four Yorkshiremen sketch.

https://youtube.com/watch?v=sGTDhaV0bcw

The author’s “error”, of course, was calling it “the hardest bug I ever debugged”. It drives clicks, but comparisons too.

jandrese 27 March 2025
At least the author worked for Google. It's another layer of fun to go through the work of tracking down a bug like that as a third party and then trying to somehow contact a person at the company who can fix it, especially when it is a big company and doubly so if the product is older and on a maintenance only schedule.

Me: "Your product is broken for all customers in this situation, probably has been so for years, here is the exact problem and how to fix it, can I talk with someone who can do the work?"

Customer Support: "Have you tried turning your machine off and turning it back on again?"

zdc1 27 March 2025
My worst bug had me using statistics to try and correlate occurrence rates with traffic/time of day, API requests, app versions, Node.js versions, resource allocations, etc. And when that failed I was capturing Prod traffic for examination in Wireshark...

Turned out that Node.js didn't gracefully close TCP connections. It just silently dropped the connection and sent a RST packet if the other side tried to reuse it. Fun times.

sandos 27 March 2025
Complaining about "slow to reproduce" and talking _seconds_. Dear, oh dear those are rookie numbers!

Currently working a bug where we saw file system corruption after 3 weeks of automated testing, 10s of thousands of restarts. We might never see the problem again, even? It has only happened once so far.

TrackerFF 27 March 2025
My hardest debug was actually not software related, it was my first car - late 80s VW Passat. The problem was that the battery would simply not charge, and I had to jump-start it every time I used it, or park at the top of a hill/street and start it rolling down.

Bought a brand new battery, but the problem persisted. Started looking at all the various parts in the car that were connected to the electrical system. Took them out, troubleshooting the parts to the best of my ability, even ended up buying a new alternator AND solenoid just out of sheer desperation.

3 months went by, countless hours in the garage, and I thought to myself...could it be...could it be the new battery I bought? Bought yet another battery, and everything worked. Just like that.

Turns out the battery I had in my car originally had degraded, and couldn't store enough charge. And the second (brand new) one I bought turned out to also be defective, having the very same fault.

Those faulty batteries would charge up to measure the correct voltage, but didn't have the correct charge capacity - and thus the car couldn't draw enough current to start the engine.

And don't get me started on the weird wacky world of electronics...but the car debugging was by far the longest I've spent, at one point I had almost every component out of the car, going over the wiring.

kevincox 27 March 2025
We had a fun bug where our VPN was crashing on macOS. The error was pretty clear, we were subtracting two timestamps and getting a negative, which should never happen as these were from a monotonic clock. We spent lots of time analyzing all of the code to make sure that the arguments were all in the right order and being subtracted from the right values and everything looked fine.

However, we still saw these crash reports from one device (conveniently the partner of the CEO, so we got full debug reports). The system logs were suspicious, though: lots of clock jumps, especially when coming out of sleep. At the end of the day we concluded it was bad hardware (an M1 Max) and the OS was trusting it too much, returning out-of-order values for a supposedly monotonic clock. We updated the code to use saturating arithmetic to mitigate the problem.
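
For anyone unfamiliar with the mitigation, a minimal sketch of saturating subtraction on timestamps, in Python. `saturating_elapsed_ns` is a hypothetical helper, not the code from that VPN client.

```python
import time


def saturating_elapsed_ns(start_ns: int, end_ns: int) -> int:
    """Elapsed nanoseconds, clamped to 0 if the clock appears to run backwards."""
    return max(0, end_ns - start_ns)


start = time.monotonic_ns()
elapsed = saturating_elapsed_ns(start, time.monotonic_ns())

# If a misbehaving clock hands back an earlier value, the result is 0, not negative.
backwards = saturating_elapsed_ns(start_ns=100, end_ns=40)
```

The trade-off is that a backwards jump is silently reported as "no time passed" rather than crashing, so it hides the hardware problem instead of fixing it.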

donatj 27 March 2025
In the late 1990s my friend was writing a game for his TI-83 calculator in TI-Basic. He was running into this bizarre bug we boiled down to a single IF after almost an hour of back and forth over a single calculator. The IF was not behaving as you would expect and it made zero sense. In the early version of TI-Basic, operators are actually single symbols, rather than made from text characters. In frustration I delete the IF symbol, insert a new one, and fire the game up. Everything works, and my friend just about dies in disbelief. It's probably my most frustrating bug fix.

I was telling someone the story a couple years ago and they said the opcodes linked to the symbols could get corrupted or something like that.

wasabiketchup 27 March 2025
Fun one:

I work on server software for online backups for customers. We do thousands of mounts/unmounts of a particular filesystem daily. Once every month or so, we get an issue where a file timestamp fails to save; the error happens at the filesystem level.

Hard to reproduce! It's a filesystem bug! So it's fully theoretical, reading code and seeing how it would happen.

Found out after a while, the conditions were fun. I don't remember exactly, but it was like, you need to follow these steps: 1/ Create a folder 2/ Create 99 files in it (no more, no less) 3/ Create a new folder 4/ Copy the first of the 99 files into the new folder

The issue was linked to some data structure caching, and cache eviction.

Had fun finding it out!

epolanski 27 March 2025
The worst bug I've ever encountered was a JS file that kept not running, with very cryptic and hard to understand trace that made no sense. TypeScript and others parsed it fine without any issues.

After 3 days of literally trying everything, I don't know why, I thought of rewriting the file character by character by hand and it worked. What was happening?

Eventually opened the two files side by side in a hex editor and here it is: several exotic unicode characters for "empty" space.
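
A hedged sketch of how one could scan a source file for this kind of character without a hex editor, in Python. `find_suspicious_chars` is a hypothetical helper; the set of Unicode categories it flags (space separators, line/paragraph separators, invisible format characters) is a judgment call, not an exhaustive list.

```python
import unicodedata

# Zs = space separators, Zl/Zp = line/paragraph separators, Cf = invisible format chars
SUSPICIOUS_CATEGORIES = {"Zs", "Zl", "Zp", "Cf"}


def find_suspicious_chars(text):
    """Return (line, column, codepoint, name) for non-ASCII blank or invisible characters."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) < 128:
                continue  # plain ASCII, including the ordinary space, is fine
            if unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


sample = "let a\u00a0= 1;\nlet b = 2;\u200b"
hits = find_suspicious_chars(sample)
```

On the sample, this reports a no-break space on line 1 and a zero width space on line 2, both of which look like nothing at all in most editors.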

bambax 27 March 2025
> When doing the refactoring, they needed to provide new implementations for every opcode. Someone accidentally turned Math.abs() into the identity function for the super-optimized level. But nobody noticed because it almost never ran — and was right half of the time when it did.

That's the perfect optimization: extremely fast, and mostly right -- probably more often than 50% if there are more positive numbers than negative ones.

noduerme 27 March 2025
Amazing war story. Very well told.

Honestly, of all the stupid ideas, having your engine switch to a completely untested mode when under heavy load, a mode that no one ever checks and it might take years to discover bugs in, is absolutely one of the most insane things I can think of. That's at best really lazy, and at worst displays a corporate culture that prizes superficial performance over reliability and quality. Thankfully no one's deploying V8 in, like, avionics. I hope.

At least this is one of those bugs you can walk away from and say, it really truly was a low-level issue. And it takes serious time and energy to prove that.

farazbabar 27 March 2025
One of the interesting ones we encountered was in the JDBC driver of our chosen database at the time. Under load, the application core dumped. Mind you this is java, running a native jdbc driver, no JNI in sight. It took some gdb stepping to figure out that under load, the JIT compiler got a little aggressive and inlined a little more code than there was room in the JIT buffer - result? a completely random core dump. Once I did find it, it was a simple matter of increasing JIT buffer size and adding more heap and ram. Tracing assembler generated from byte code generated from java was just part of the issue, the fact that the code itself had nothing to do with the issue is what made it interesting as the buffer size is set in a completely different area by the jvm. Fun times.

gpvos 27 March 2025
It seems to me that V8 had very bad unit tests if this wasn't caught before release. Making sure all operators act the same way when optimized and not is a no-brainer.
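
A minimal sketch of that kind of check, as a differential test in Python. This says nothing about V8's real test suite; `reference_abs`, `buggy_fast_abs` and `first_mismatch` are hypothetical names, and the buggy version simply mimics the identity-function bug from the article.

```python
def reference_abs(x):
    """Slow but obviously correct implementation."""
    return -x if x < 0 else x


def buggy_fast_abs(x):
    """Hypothetical 'optimized' version with the bug from the article: identity."""
    return x


def first_mismatch(reference, candidate, inputs):
    """Return the first input where the two implementations disagree, else None."""
    for value in inputs:
        if reference(value) != candidate(value):
            return value
    return None


# Include negatives and edge values, not just the happy path.
inputs = [0, 1, 7, -1, -7, 2**31 - 1, -2**31]
mismatch = first_mismatch(reference_abs, buggy_fast_abs, inputs)
```

The catch in a real JIT is forcing the optimized tier to actually run during the test, which is the part that apparently got missed.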

richardw 27 March 2025
Funkiest for me was a random crash in a C# app. No pattern whatsoever. No function or user role or part of the software or time of day. I had to learn crash dump analysis and bought my first Kindle books (on desktop, no kindle because I needed it asap), one of which had a trick to make a memory issue crash closer to the source, rather than leave it around to be stumbled over hours later. Which was the source of the randomness. Click button, crash. Move mouse, crash.

This had worked perfectly for many years but windows was upgraded underneath it, and some smartass had used clever tricks for a hover menu that didn’t work in a future (safer) version of the OS. A rarely triggered hover menu.

Thank you, authors of Advanced Windows Debugging and Advanced .NET Debugging.

ramshanker 27 March 2025
Seems it is a story time thread. Here goes my strangest one.

Back in 2005, when I had only paid-by-cash internet cafe access to a computer, one of the shopkeepers offered me free time on a computer IF I typed and ran a 15-page class 12 computer project, printed on A4 sheets, in the compiler, Turbo C++. I gladly accepted the offer and typed things.

When I finished typing and taking out all the compile errors, the program didn't work as expected. A few hours later, I found out that 1 or 2 pages of the printed source code were not in the original order. :-O . So I had to swap code from one function to another to finally get it working. That was one hell of a lesson!

Shopkeeper must have sold that project to many students, and I got some Free internet access.

taneq 27 March 2025
I had one that took literally years to reproduce. It was in PLC code, on a touchscreen controller running a soft PLC with Busybox under the hood. These devices were used 24/7 and usually absolutely bullet proof. Every now and then I’d get a comment that sometimes they’d crash on startup but a power cycle usually fixed it. Finally managed to get it to happen in the workshop, and dropped everything to try and figure it out.

The ultimate cause was in the network initialisation using a network library that was a tissue-paper-thin wrapper around Linux sockets. When downloading a new software version to the device, it would halt the PLC but this didn’t cleanly shut down open sockets, which would stay open, preventing a network service from starting until the unit was restarted. So I did the obvious thing and wrote the socket handle to a file. On startup I’d check the file and if it existed, shut that socket handle. This worked great during development.

Of course this file was still there after a power cycle. 99% of the time nothing would happen, but very occasionally, closing this random socket handle on startup would segfault the soft PLC runtime. So dumb, but so hard to actually catch in the wild.

protocolture 27 March 2025
I had something like this once.

Vendor provided an outlook plugin (ew) that linked storage directly in outlook (double ew) and contained a built in pdf viewer (disgusting) for law firms to manage their cases.

One user, regardless of PC, user account or any other isolation factor, would reliably crash the program and outlook with it.

She could work for 40 minutes on another user's logged-in account on another PC and reproduce the issue.

Turns out it was a memory allocation issue. When you open a file saved in the addon's storage, via the built in pdf viewer, it would allocate memory for it. However, when you close the pdf file, it would not deallocate that memory. After debugging her usage for some time, I noted that there was a memory deallocation, but it was performed at intervals.

If there were 20 or so pdf allocations and then she switched customer case file before a deallocation, regardless of available memory, the memory allocation system in the addon would shit the bed and crash.

This one user, an absolute powerhouse of a woman I must say, could type 300 wpm and would rapidly read -> close -> assign -> allocate -> write notes faster than anyone I have ever seen before. We legitimately got her to rate limit herself to 2 files per 10 minutes as an initial workaround while waiting for a patch from the vendor.

I had to write one hell of a bug report to the vendor before they would even look at it. Naturally they could not reproduce the error through their normal tests and tried closing the bug on me several times. The first update they rolled out upped it to something like 40 pdfs viewed every 15 minutes. But she still managed to touch the new ceiling on occasion (I imagine billing each of those customers 7 minutes a pop or whatever law firms do) and ultimately they had to rewrite the entire memory system.

jonnycoder 27 March 2025
I’m not even close to being on par with other FAANG engineers but this is far from being a very difficult bug in my experience. The hardest bugs are the ones where the repro takes days. But nonetheless the op’s tenacity is all that matters and I would trust them to solve any of the hard problems I’ve faced in the past.

GarnetFloride 27 March 2025
I didn’t fix this bug but I did reproduce it so it could be fixed, but it took years. At one company I worked for we had an email archive and we were seeing an uptick in customers having issues with deleting expired emails. Most companies have a retention policy of about 7 years, and the company was now 10 years old and early customers were beginning to delete old emails. Developers couldn’t find the bug, and reducing the scope of the deletion usually worked, so it was usually marked as not reproducible. While devs tried to debug it, no one would let us poke around their prod email server very much, for obvious reasons.

I had been promoted to technical writer and I needed a better test system that didn’t have customer data for screenshots. Something I needed was unique data because the archive used single instance storage, so I put together a bash script to create and send emails generated from random lines of public domain books I got from Gutenberg.

This worked great for me and at one point I had it fire off 1 million emails just for fun. I let my test email server and archive server chew on them over the weekend. It worked great but I had nearly maxed out my storage. No problem, use the deletion function. And it didn’t work.

It Didn’t Work. I had reproduced the bug in-house on a system we had full control over. Engineering and QA both took copies of my environments and started working on the bug.

I also learned the lore of the deletion feature. The founding developer didn’t think anyone wanted a deletion feature because it made no sense to him. But after pressure from the CEO, Board of Directors and customers he banged out some code over a weekend and shipped it. It was now 10 years later and he was long gone, and it was finally beginning to bite us.

After devs banged on the code for a while they found there was a design flaw: it failed if the number of items to delete was more than 500. QA had tested the feature, repeatedly, but their test data set just happened to be just smaller than 500 items so the bug never triggered. I only exceeded that because Austin Powers is funny.

Now that we could reproduce it and knew there was a design flaw, the code for deletion needed to be replaced. It ended up taking over two years to replace the code, because project management never thought it was all that important compared to new features, even though customers were complaining about it.

MrMcCall 27 March 2025
The early-to-mid-90s "High C/C++" compiler had a bug in its floating point library for basic math functions. It ended up being a bit of a Heisenbug to track down, and I didn't initially believe it wasn't my code, but it actually ended up being in their supplied library.

It took me maybe three days to track down, from first clues to final resolution, on a 486/50 luggable with the orange on black monochrome built-in screen.

wglb 27 March 2025
This is a very fun post, not only on its own merits, but also how it spurs many other hard-to-debug stories.

I like the hard-earned lessons that are often taken away from such sessions.

While nowhere on the scale of this story, I helped a fellow student while I was at the University where his program was outputting highly bogus numbers from punched card deck input. I ultimately suggested that he print out the numbers that were being read by the program and presto the field alignments were off. This has now become my first step in debugging.

A co-op stint during my EE degree program was at a pulp bleach plant in Longview, Washington. They were implementing instrumentation of various metrics in the bleach tower. The engineers told a story about one of their instruments to measure flow or temperature or acidity. The instrument was failing but the manufacturer couldn't find any flaw and shipped it back. The cycle repeated several times until one of the engineers accompanied the instrument to the repair lab. The technicians were standing the instrument on its side, not flat as it was in the instrument rack back at the plant. Laying it flat exposed the error.

Another bug sticks in my mind from reading Coders At Work by Peter Seibel. Guy Steele is telling about a bug Bill Gosper reported in the bignum library. One thing that caught his eye was a conditional step he didn't quite understand. Since it was based on the division algorithms from Knuth: "And what caught my eye in Knuth was a comment that this step happens rarely—with a probability of roughly only one in two to the size of the word." The error was in a rarely-executed piece of code. The lesson here helped him find similar bugs.

While three of us were building a compiler at Sycor, we kept a large lab notebook in which we wrote brief release notes, and a one-line note about each bug we found and fixed.

My most recent bug was a new emacs snippet that was causing errors in eval_buf. Made no sense, so I ultimately decided to clear out the .emacs.d directory and start over. There were files that were over 20 years old--I just copied the directory when I built a new machine.

koliber 27 March 2025
And somewhere out there is a person reading this post and coming to the conclusion "How can Google be stupid enough to hire people stupid enough to have abs() return a negative value."

Love the story! There is so much complexity in the world around us that seemingly obviously wrong things happen through the most unlikely chains of dependency.

nopurpose 27 March 2025
> It didn’t correspond to a Google Docs release. The stack trace added very little information. There wasn’t an associated spike in user complaints

Where can mere mortals complain about a Google product?

Taniwha 27 March 2025
In interviews I've never forced anyone to code, what I do is try to get them to tell me these sorts of war stories - I want to hear how you fixed it, why it was coolly bizarre, and I'm hoping for some enthusiasm when you talk about it.

I couldn't always get people to talk this way, but people who did usually worked out well

__turbobrew__ 27 March 2025
So far my record is 3 weeks. It was a Heisenbug triggered when two different eBPF-based systems raced with each other. eBPF is a great tool in the right place but is it ever a pain in the ass to debug.

The fix ended up being one character -> change the priority of an eBPF tc filter from 0 to 1.

yard2010 27 March 2025
> Someone accidentally turned Math.abs() into the identity function for the super-optimized level.

Oh my god.

AnimalMuppet 27 March 2025
I've told my personal worst here a couple of times. So this time I'm going to talk about a co-worker named Ed.

On an embedded system, we had this bug that we couldn't find. It was around for a month or two. Random crashes that we couldn't reproduce, couldn't even debug. We started calling it "the phantom".

Finally Ed said, "I think the phantom showed up after we made that change to the ethernet driver." We reverted it, and the bug disappeared.

We never found the bug in the source code. But Ed debugged it using the calendar.

high_na_euv 27 March 2025
>There you have it: 2 days to find an issue that was already fixed and would have been resolved with no interaction

I work with LLVM and a huge % of my work is fixing bugs that are already fixed in upstream

animal531 27 March 2025
As far as I'm concerned if you can use a debugger it automatically shouldn't qualify as the most difficult ever.

As per the compute shader post from a few days ago, currently I'm "debugging" some pretty advanced code that's being ported to a shader, and the only way to do it is by creating an array of e.g. ints and inserting values into it in both the original and the shader code to see where they diverge. It's not the most difficult but it's quite time consuming.
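
The comparison step can be sketched like this in Python, assuming both versions have already dumped their intermediate values into plain lists. `first_divergence` is a hypothetical helper and the two traces are made-up numbers.

```python
def first_divergence(trace_a, trace_b):
    """Index of the first entry where two traces differ, or None if they match."""
    for i, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a != b:
            return i
    if len(trace_a) != len(trace_b):
        # One trace stopped early: the divergence is where the shorter one ends.
        return min(len(trace_a), len(trace_b))
    return None


# Values recorded at the same checkpoints in the CPU code and in the shader port.
cpu_trace = [3, 12, 48, 7, 91]
shader_trace = [3, 12, 47, 7, 91]
where = first_divergence(cpu_trace, shader_trace)
```

Everything before the reported index is trusted, so the next round of instrumentation only has to go between that checkpoint and the previous one.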

almostdeadguy 27 March 2025
I often read these stories about hard to debug problems because I enjoy debugging (call it a love for software true crime) and this is the first one I’ve read where I had an “oh god no” reaction when the author described where they needed to look for the culprit. The description of the layout engine and all of the browser specific tweaks makes it sound like an absolutely tedious nightmare to debug.

latexr 27 March 2025
> I do it a few more times. It’s not always the 20th iteration, but it usually happens sometime between the 10th and 40th iteration. Sometimes it never happens. Okay, the bug is nondeterministic.

That’s an incorrect assumption. Just because your test case isn’t triggering the bug reliably, it does not mean the bug is nondeterministic.

That is like saying the “OpenOffice can’t print on Tuesdays” bug is nondeterministic because you can’t reproduce it every day. It is deterministic, you just need to find the right set of circumstances.

https://beza1e1.tuxen.de/lore/print_on_tuesday.html

From the writing it appears the author found one way to reproduce the bug sometimes and then relied on it for every test. Another approach would have been to tweak their test case until they found a situation which reproduced the bug more or less often, trying to find the threshold that causes it and continuing to deduce from there.

hatmanstack 27 March 2025
I'll clear my schedule.

the best line of the piece.

TrayKnots 27 March 2025
When I heard about abs neg value my mind immediately jumped to abs(INT.min())... But then again, JS...
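
For reference, a small Python simulation of that fixed-width case. Python integers do not overflow, so `wrap_int32` (a hypothetical helper) imitates 32-bit two's-complement wraparound by hand. In C, abs(INT_MIN) is undefined behaviour; in Java, Math.abs(Integer.MIN_VALUE) is documented to return Integer.MIN_VALUE.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrap_int32(value):
    """Reduce an integer to the signed 32-bit range, as wraparound arithmetic would."""
    return (value + 2**31) % 2**32 - 2**31


def abs_int32(value):
    """abs() as a fixed-width negate computes it, with no overflow check."""
    return wrap_int32(-value) if value < 0 else value


ordinary = abs_int32(-5)             # behaves as expected
edge_case = abs_int32(INT32_MIN)     # +2**31 does not fit, so it wraps back around
```

The edge case comes out negative because 2**31 is one more than `INT32_MAX`, so the negation wraps to `INT32_MIN` again.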

danielodievich 27 March 2025
When I was 12 I was just learning stuff and wrote something in C, which crashed at unpredictable intervals and I could not explain it. I took it to my 14 year old uncle who was better than me at coding for help. Now mind you this is ~ 40 years ago but I seem to remember that Borland Turbo C (I still love that IDE blue color) had debugging with breakpoints (mind blowing!) which eventually led to "duh you didn't dispose of your pointer and are reusing it and the memory there is now garbage" or something like that. I vaguely recall * or * being somewhere nearby. This was my first intro to RTFM and debugging and what a powerful intro.

jiehong 27 March 2025
Worst debugging issues are always things I can't access directly, on top of being rare.

Think network appliances in the middle that don't log, or not at the level you need (and sometimes they can't log what you need).

Those usually mean that no reproduction is possible, except in production or very close to it, with tools you don't always control.

Annoying ones are those of "This http request is sometimes slow", and chasing each box in the middle shows a new box that is supposed to be transparent but isn't, or some rare timing issues due to boxes interacting in a funny way.

Rygian 27 March 2025
> What can I even do from here as the newsletter author? Normally I like finding a teachable lesson. But it was 2 days of grueling debugging and somehow there aren’t any teachable lessons there.

A lesson to learn seems obvious to me: the V8 team did not communicate upfront sufficiently on the "oops our Math.abs() may return negative numbers, we fixed that in version X, be warned".

Which the V8 team should be able to do in an "advisory for Google developers that work on high-performance client-side view rendering stuff" sort of weekly newsletter.

cromulent 27 March 2025
We were building an app to sign up for a toll road that shipped electronic tags to the users, circa 2008.

  <input name="tag" id="tag"
this failed in IE with very strange results. Took a long time to realize we had hit a browser bug and change it to:

  <input name="tagx" id="tagx"
which worked fine.

ZaoLahma 27 March 2025
It's amazing how often it happens in large companies that different people from different organizations are troubleshooting or fixing the same fault, independent from each other, without even knowing. Sometimes you don't even realize until you've implemented a fix which causes a merge conflict with the fix that someone else is working on.
chromanoid 27 March 2025
Great writeup :)

It's like this https://geek-and-poke.com/geekandpoke/2017/8/13/just-happene... but actually true, which is really bad for mental health :D

cellular 27 March 2025
If this was a regression, could a binary search be done on check-ins? Or is the code too distributed?
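
For reference, the binary search over check-ins asked about here is the idea behind `git bisect`. A minimal sketch, assuming the history is linear, the oldest commit is known good, the newest is known bad, and the bug never disappears once introduced. The commit names and the `reproduces` check are hypothetical stand-ins for "build this revision and try to trigger the bug":

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the culprit is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # culprit is at mid or earlier
        else:
            lo = mid  # culprit is after mid
    return commits[hi]


# Hypothetical history in which the regression landed in "c5".
history = [f"c{i}" for i in range(10)]
tested = []


def reproduces(commit: str) -> bool:
    """Stand-in for building a revision and trying to trigger the bug."""
    tested.append(commit)
    return int(commit[1:]) >= 5


culprit = first_bad(history, reproduces)
print(culprit, "found after", len(tested), "test runs")
```

The number of probes grows with the logarithm of the history length, but each probe is only as trustworthy as the reproduction. With an intermittent bug, `is_bad` would have to retry many times before calling a revision good.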
sofixa 27 March 2025
My hardest bug to debug was related to broken drivers and a useless vendor. In total I spent around 2 months on and off trying to chase that one, and by the end was starting to go crazy.

A new customer comes in and we deploy a new VMware vSphere private cloud platform for them (first using this type of hardware). Nothing special or too fancy, but the first ones with 10G production networking.

After a few weeks, integration team complains that a random VM stopped being able to communicate with another VM, but only one other specific VM. Moving the "broken" VM to a different ESXi fixed things, so we suspected a bad cable/connection/port/switch. Various tests turned up nothing, so we just waited for something to happen again.

A few days later, same thing. Some more debugging, packet capture, nothing. Rebooting the ESXi fixed the issue, so it was not the cables/switch, probably. Support ticket was opened at VMware for them to throw all sorts of useless "advice" (update drivers, firmware, OS, etc etc).

This kept happening more and more, at some point there were multiple daily occurrences of this - again, just specific VMs to other specific VMs, but could always SSH, and communicate with other things, for which we had to reboot the hypervisor to fix it. VMware are completely and utterly useless, even with all the logs, timelines, etc.

A few weeks in, customer is getting pissed. We say that we've tried all sorts of debugging of everything (packet capture on the ESX, switch stuff, in the guest OSes, etc etc), and there's no rhyme nor reason - all sorts of VMs, of different virtual hardware versions, on different guest OSes, different virtual NIC types, different ESXes, and we're trying stuff with the vendor, it probably being a software bug.

One morning I decided to just go and read all of the logs on one of the ESXes, trying to see if I could spot something weird (early on we tried grepping for errors and warnings, which yielded just VMware vomit and nothing of use). There are too many of them, and I don't see anything. In desperation, I Googled various combinations of "vmware" "nic type" "network issues", and boom, I stumble upon Intel forums with months of people complaining that the Intel X710 NIC's drivers are broken, throw a "Malicious Driver Detected" message (not error) in the logs, and just shut down traffic on that specific port. And what do you know, those are the NICs we're using, and we have those messages. The piece of shit of a driver had been known to not work for months (there was either that, or it crashing the whole machine), but was proudly sitting on VMware's compatibility list. When I told VMware's support about it, they said they were aware internally, but refused to remove it from the compatibility list. But if we upgraded to the beta release of the next major vSphere, there was a newer driver that supposedly fixed everything. We did that and everything was then finally fixed, but there were machines with similar issues where the driver wasn't updated for years after that.

This is the event that taught me that enterprise vendors don't know that much even about their own software, VMware's support is useless, hardware compatibility lists are also useless. So you actually need to know what you're doing and can't rely on support saving you.
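
The log-reading step is where filtering by severity failed, because the decisive line was logged as an ordinary message. One alternative, sketched below under assumptions, is to group lines by a normalized template and look at the rare ones regardless of level. The sample lines are made up for illustration (only the "Malicious Driver Detected" phrase comes from the story), and `rare_messages` is a hypothetical helper, not a VMware or Intel tool:

```python
import re
from collections import Counter


def rare_messages(lines, max_count=3):
    """Group log lines by template and return (example, count) for rare ones.

    Digits are collapsed to "N" so that lines differing only in counters,
    port numbers, or timings fall into the same template.
    """
    counts = Counter()
    example = {}
    for line in lines:
        text = line.strip()
        template = re.sub(r"\d+", "N", text)
        counts[template] += 1
        example.setdefault(template, text)  # keep the first real occurrence
    return [(example[t], n) for t, n in counts.items() if n <= max_count]


# Made-up sample log: lots of routine noise plus one unusual info-level line.
sample = [f"info: heartbeat seq={i} ok" for i in range(50)]
sample += [f"warning: storage latency {10 + i} ms" for i in range(20)]
sample.append("info: vmnic4: Malicious Driver Detected, port disabled")

rare = rare_messages(sample)
for text, count in rare:
    print(count, text)
```

This only narrows down what to read; deciding which rare line matters still takes a human, or a search engine.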

rozumbrada 27 March 2025
I have no doubt that V8 has rich test suites - including tests for the absolute value function.

But then a production optimized build apparently contains different code? This sounds to me like a system flaw

prinny_ 27 March 2025
«Math.abs() is returning negative values for negative inputs.», man I would have reached for the bible if that happened to me. Fascinating in hindsight.
rustybolt 27 March 2025
> How it took me 2 days

That can't possibly be the hardest bug ever

gmm1990 27 March 2025
Does anyone have insight into how the v8 abs val function wasn't tested for negative values?
nixpulvis 27 March 2025
Wouldn't the fact it only occurred on some specific browser be a big hint?
Thaxll 27 March 2025
Hardest problems are the ones you can't repro, and they're often network related.
octernion 27 March 2025
excellent post. i think the lesson is a good one: it's better to have less bugs than more bugs, and for some users, it would still have had an annoying bug.
Centigonal 27 March 2025
> Next, the reproduction was slow. It took probably 20 seconds just to load the dev version of the editor, and another 40 seconds to trigger the issue.

60 seconds to reproduce? Slow!? Laughs in enterprise software

qwertytyyuu 27 March 2025
Damn this would have taken me much more than 2 days
Arch-TK 27 March 2025
The worst bugs I've ever dealt with were a result of working at a company which was using the Clarion programming language.

The language compiler was most likely written by someone who had never read a book about compilation, it was basically just like if you had written a compiler using macros. I don't think it had anything like an optimisation pass. This combined with it being a higher level language meant that debugging with a debugger was just infeasible. Even if you had figured out the issue, you wouldn't know what exactly caused it from the code side as most lines of code would get turned into pages of assembly. Not only that, I believe the format for the debug symbols was custom so line number information was something you would only get if you used the terrible debugger which shipped with the language. Windows is also a terrible development environment due to the incredible lack of any good documentation for almost anything at the WinAPI level.

The applications I was working on were multi-threaded Windows applications. Concurrency issues were everywhere. Troubleshooting them sometimes took months. In many cases the fixes made absolutely no sense.

The IDE (which you were basically forced to use) was incessantly buggy. You could reliably crash it in many contexts by simply clicking too fast. After 5 years of working with that tooling, I had gained an intuition for where I needed to slow down my clicks to prevent a crash.

The IDE also operated on these binary blobs which encapsulated the entire project. I never put in the time to investigate the format of these blobs but, unsurprisingly, given the quality of the IDE, it was possible to put these opaque binary blobs in erroneous states. You could just revert to a previous version of the blob and copy paste all your work (no way of easily accessing the raw text in the IDE because of this idiotically designed templating feature which was used throughout). If your project was in a weird state, you would get mystery compiler errors with a 32-bit integer printed as hex as an error identifier.

Searching the documentation or the internet for these numbers would either produce no results or would produce forum or comp.lang.clarion results for dozens of unrelated issues.

The language itself was an insane variation of Pascal and/or COBOL. It had some nice database related features (as it was effectively CRUD domain specific) but that was about it. You look on GitHub these days to see people discussing the soundness and ergonomics issues of the never type in Rust for many months before even considering partially stabilising it. Meanwhile in Clarion, you get a half-arsedly written document page which serves as the language specification and out of it you get a half baked feature which doesn't work half the time. The documentation would often have duplicate pages for some features which would provide you with non-overlapping, sometimes conflicting or just outright wrong information.

When dealing with WINAPI you would need to deal with pointer types, and sometimes you would need to do pointer type conversions. The language wouldn't let you just do something like `void *p = &foo;` (this is C, actually very sane compared to Clarion). You had to do the language equivalent of `void *p = 1 ? &foo : NULL;` which magically lost enough type information for the language to let you do it. There was no documented alternative to this (there was casting, it just didn't work in this case), this wasn't even itself documented and was just a result of frustration and trial and error.

Not only this, the people I was working with had all entered this terrible proprietary language (oh wait, did I mention, you had to pay for a license for this shit) at a time when you were writing pure WinAPI code in C or C++. So for them, the fact that it had a forms editor was so amazing that they literally never considered for the next 25 years looking at alternative options. So when I complained about the complete insanity of using this completely ridiculous language I would get told that the alternatives were worse.

Do you want to experience living hell when debugging? Find a company writing Clarion. Apparently it's still popular in the US government.