How a 20 year old bug in GTA San Andreas surfaced in Windows 11 24H2
(cookieplmonster.github.io) | 1346 points by yett | 23 April 2025 | 295 comments

Comments

I'm glad they tracked it down even further to figure out exactly why.

IMHO, if something isn’t part of the contract, it should be randomized. E.g. if iteration order of maps isn’t guaranteed in your language, then your language should go out of its way to randomize it. Otherwise, you end up with brittle code: code that works fine until it doesn’t.
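Go famously does exactly this for map iteration order. In C++ you can approximate it by mixing a per-process random seed into the hash - a sketch, not a hardened implementation:

    #include <cstddef>
    #include <random>
    #include <unordered_map>

    // Mixing a per-process random seed into the hash shifts bucket placement -
    // and therefore iteration order - from run to run, flushing out code that
    // silently depends on it.
    struct SeededHash {
        static std::size_t seed() {
            static const std::size_t s = std::random_device{}();
            return s;
        }
        std::size_t operator()(int key) const {
            return std::hash<int>{}(key) ^ seed();
        }
    };

    using RandomizedMap = std::unordered_map<int, int, SeededHash>;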
> Not ignore the compilation warnings – this code most likely threw a warning in the original code that was either ignored or disabled!
What compiler error would you expect here? Maybe not checking the return value from scanf to make sure it matches the number of parameters? Otherwise this seems like a data file error that the compiler would have no clue about.
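For reference, the defensive pattern looks something like this (a sketch - the function and fallback value are invented here, not taken from the game):

    #include <cstdio>

    // sscanf returns the number of fields it successfully converted, so a
    // short count means the data line was missing values.
    bool parseWheelScale(const char* line, float* wheelScale) {
        if (std::sscanf(line, "%f", wheelScale) != 1) {
            *wheelScale = 1.0f; // explicit fallback instead of stack garbage
            return false;       // signal that the data file is malformed
        }
        return true;
    }

As far as I know, no mainstream compiler warns about an ignored sscanf return value by default, which is presumably how this slipped through.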
Knowing C/C++, I more or less guessed what's happening (uninitialized variable) early in the blog post.
It blows my mind that these languages allow you to leave variables uninitialized, which has caused countless bugs (including production bugs that I have seen first hand), and you often need to rely on additional compiler flags or static analysis tools/valgrind etc. to catch them. Even though newer languages often use a different solution (default zero value, or must initialize a variable before use), people still go back to C/C++ all the time.

https://web.archive.org/web/20250423144746/https://cookieplm...

    while (this->m_fBladeAngle > 6.2831855) { this->m_fBladeAngle = this->m_fBladeAngle - 6.2831855; }

Like, “let’s just write a while loop that could turn into an infinite loop coz I’m too lazy to do a division”
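To be fair, the loop-free version is a one-liner - fmod does the division that loop avoids, and it takes the same time no matter how enormous the (possibly uninitialized) angle is. A sketch:

    #include <cmath>

    // Wrap a non-negative angle into [0, 2*pi). std::fmod keeps the sign of
    // its first argument, so a negative input would need one extra adjustment.
    float wrapBladeAngle(float angle) {
        return std::fmod(angle, 6.2831855f);
    }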
My takeaway, speaking as someone who leans towards functional programming and immutability, is "this is yet another example of a mutability problem that could never happen in a functional context"
(so, for example, this bug would have never been created by Rust unless it was deeply misused)
> all these findings prove that the bug is NOT an issue with Windows 11 24H2, as things like the way the stack is used by internal WinAPI functions are not contractual and they may change at any time, with no prior notice. The real issue here is the game relying on undefined behavior (uninitialized local variables), and to be honest, I’m shocked that the game didn’t hit this bug on so many OS versions, although as I pointed out earlier, it was extremely close
This sentence is the real takeaway point of the article. Undefined behavior is extremely insidious and can lull you into the belief that you were right, when you already made a mistake 1000 steps ago but it only got triggered now.
I emphasized this point in my article from years ago (but after the game was released):
> When a C or C++ program triggers undefined behavior, anything is allowed to happen in the program execution. And by anything, I really mean anything: The program can crash with an error message, it can silently corrupt data, it can morph into a colorful video game, or it can even give the right result.
> If you’re lucky, the program triggering UB will show an appropriate error message and/or crash, making you immediately aware that something went wrong. If you’re unlucky, the program will quietly mangle data, and by the time you notice the problem (via effects such as crashes or incorrect output) the root cause has been buried in the past execution history. And if you’re very unlucky, the program will do exactly what you hoped it should do, until you change some unrelated code / compiler versions / compiler vendors / operating systems / hardware platforms – and then a new bug becomes visible, and you have no clue why seemingly correct code now fails to work properly.

-- https://www.nayuki.io/page/undefined-behavior-in-c-and-cplus...
As I wrote in my article, this point really got hammered into me when a coworker showed me a patch that he made - which added a couple of innocuous, totally correct print statements to an existing C++ program - and that triggered a crash. But without his print statements, there was no crash. It turned out that there was a preexisting out-of-bounds array write, and the layout of the stack/heap somehow masked that problem before, and his unlucky prints unmasked the problem.
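A minimal model of that failure mode (hypothetical code, not my coworker's program) looks like this:

    #include <cstdio>

    int main() {
        int important = 42;
        int buf[4] = {0, 0, 0, 0};
        buf[4] = 7; // out of bounds: whether this clobbers 'important', a saved
                    // register, or nothing at all depends entirely on how the
                    // compiler happened to lay out the stack frame
        std::printf("%d\n", important); // adding or removing unrelated code can
                                        // reshuffle that layout and (un)mask the bug
        return 0;
    }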
Okay so then, how can we do better as developers today?
0) Read, understand, and memorize what actions in C or C++ are undefined behavior. Avoid them in your code at all costs. Also obey the preconditions of any API you use, whether in the standard library, operating system, etc.
1) Compile your application in Debug mode and compare its behavior to Release mode. If they differ by anything other than speed, then you have a serious problem on your hands.
2) Compile and run with sanitizers like -fsanitize=undefined,address to catch undefined behavior at runtime (a concrete example follows this list).
3) Use managed languages like Java, C#, Python, etc. where you basically don't have to worry about UB in normal day-to-day code. Or use very well-designed low-level languages like Rust that are safe by default and minimize your exposure to UB when you really need to do advanced things. Whereas C and C++ have been a bonanza of UB like we have never seen before in any other language.
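On point 2, here is the kind of bug those flags catch at runtime (note that reads of uninitialized locals specifically need MemorySanitizer, clang's -fsanitize=memory; ASan covers out-of-bounds accesses and friends):

    // build with: g++ -g -fsanitize=undefined,address oob.cpp && ./a.out
    #include <cstdio>

    int main() {
        int a[4] = {1, 2, 3, 4};
        a[4] = 7; // ASan aborts here with a stack-buffer-overflow report
                  // instead of silently corrupting whatever lives next door
        std::printf("%d\n", a[0]);
        return 0;
    }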
> all these findings prove that the bug is NOT an issue with Windows 11 24H2, as things like the way the stack is used by internal WinAPI functions are not contractual and they may change at any time, with no prior notice.
This reminds me of an excellent article I read a while back, the gist of it was that, given sufficient success, there's no such thing as a private API.
Once this category of error is raised to your attention, you start to notice it more and more.
A little piece of technology made sense in the original context, but then it got moved to a different context without realizing that move broke the contract. Specifically in this case a flying boat became an airplane.
---
I recently worked a bug that feels very similar:
A Linux CUPS printer would not print to the selected tray; instead it always requested manual feed.
Ok. Try a bunch of command line options, same issue.
Ok. Make the selection directly in the PPD (PostScript Printer Description) file. Same issue.
Ok! Decompile the PXL file. Wrong tray is set in the PXL file... why?
Check the Debug2 log level for CUPS - the wrong MediaPosition is being sent to Ghostscript (which compiles the printer options into the print job) by a CUPS filter... why?
The CUPS filter is translating the MediaPosition from the PPD file... because the philosophy of CUPS is to do what the user intended. The intention inferred from MediaPosition in the PPD is that it corresponds to the PWG (Printer Working Group) MediaPosition, NOT the vendor MediaPosition (or local equivalent - in this case MediaSource).
AHA!! My PPD file had been copied from a previous generation of server, from a time when that CUPS filter did NOT translate the MediaPosition, so the VENDOR MediaSource numbers were used. Historically, this makes sense. The vendor tray number is set in the vendor PPD file because CUPS didn't know how to translate it.
Fast forward to a new execution context: CUPS filters have gotten better at translating user intention, and now it's translating a number that doesn't need to be translated, and silently selecting the wrong tray.
TL;DR: There is no such thing as a printer command, only printer suggestions.

I spent hours looking for a badger.

https://static.wikia.nocookie.net/gta-myths/images/f/f1/Egg_...
I hope someone can figure out the Red Dead Redemption 2 bug where random animals and characters disappear silently if you have too many texture mods installed.
tl;dr of the explanation: the Skimmer vehicle is missing a wheel scale definition, so its wheel scale gets read from uninitialized memory. On previous versions of Windows, this happened to be the wheel scale of the previously-loaded vehicle, so things happened to work fine. Starting on Windows 11 24H2, LeaveCriticalSection (which gets called between loading vehicles) uses more stack space than before, so it now overwrites that memory with a gigantic value, resulting in the Skimmer spawning so high up that it may as well not exist at all.
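In code terms, the shape of the bug is roughly this (a paraphrase of the article's description - identifiers invented, not actual game source):

    #include <cstdio>

    void loadVehicleLine(const char* line) {
        float frontWheelSize, rearWheelSize, wheelScale; // uninitialized locals
        int n = std::sscanf(line, "%f %f %f",
                            &frontWheelSize, &rearWheelSize, &wheelScale);
        // For the Skimmer's line, fewer fields are present than the format
        // string expects, so wheelScale is never written: it keeps whatever the
        // previous call left in that stack slot. Pre-24H2 that leftover happened
        // to be a sane value; the new LeaveCriticalSection leaves a huge float.
        std::printf("parsed %d fields, wheel scale: %f\n", n, wheelScale);
    }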
On Windows 11 24H2, more stack space was modified by a new implementation of Critical Sections.
IMHO this shows the downfall of Microsoft. Why did they do that? Critical sections have been there for many decades and should be basically bug-free by now. My best guess is someone thought they'd "improve" things and rewrote it, then made some microbenchmark that maybe showed the dubious improvement.
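To be fair to Microsoft, nothing was broken contractually - the API promises mutual exclusion and nothing else. The usual usage is the sketch below; how much scratch stack space each call touches is an implementation detail:

    #include <windows.h>

    CRITICAL_SECTION g_cs;

    void withLock() {
        EnterCriticalSection(&g_cs);
        // ... protected work ...
        LeaveCriticalSection(&g_cs);
        // The contract is mutual exclusion, nothing more. How these calls use
        // the stack internally may change in any Windows update, which is why
        // code reading leftover stack memory was never actually "working".
    }

    int main() {
        InitializeCriticalSection(&g_cs);
        withLock();
        DeleteCriticalSection(&g_cs);
        return 0;
    }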
The other comment here mentions Raymond Chen, who wrote this article about why backwards-compatibility is very important (and arguably what got Microsoft into the position it's in today):

https://devblogs.microsoft.com/oldnewthing/20031224-00/?p=41...

and also this memorable case: https://news.ycombinator.com/item?id=2281932
Surprised to see the return value of sscanf being ignored; that seems like a pretty rookie mistake, and this bug would never have made it out of the original programmer's system if they had bothered to check it.
It has always been too easy to read & write beyond the stack. This should fail, plain and simple.
Mitigations exist - ASLR, NX pages, stack-smashing protection etc. but nothing comprehensively stops reads of stale data beyond the stack.
Thought experiment for a moment: what if the hardware ensured the unused part of a stack region could not be read or written?
There are many ways to skin this cat; here’s one based around tracking each stack’s start address A, size S, and current depth D (a toy model in code follows the list):
1. Add an instruction to inform the CPU there is a stack at address A of size S. Its depth D is initially 0.
2. Add a jump instruction which reserves N bytes on the stack at address A, growing depth D to (D+N). Maybe this can be its own “reserve” instruction so as not to need a new jump instruction.
3. Give existing return instructions stack awareness. If returning to an address inside a stack, un-reserve the bytes reserved by the most recent jump, making the new depth (D-N).
4. Fail reads or writes to the stack region beyond its current depth. In other words, fail all reads and writes between A+D and A+S.
5. The arithmetic is reversed on architectures whose stacks grow downwards.
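To make the bookkeeping concrete, here is a toy software model of it (an illustration of the idea, not a hardware design):

    #include <cstddef>
    #include <cstdint>

    struct StackInfo {
        std::uintptr_t base = 0;     // A: start address (grow-upward convention)
        std::size_t size = 0;        // S: total size in bytes
        std::size_t depth = 0;       // D: bytes currently reserved
        std::size_t frames[64] = {}; // the "reservation stack": bytes per frame
        int nframes = 0;
    };

    // Step 2: a call/reserve grows the depth and records the frame size.
    bool reserve(StackInfo& s, std::size_t n) {
        if (s.depth + n > s.size || s.nframes == 64) return false; // fall back
        s.frames[s.nframes++] = n;
        s.depth += n;
        return true;
    }

    // Step 3: a return un-reserves the most recent frame.
    void unreserve(StackInfo& s) {
        if (s.nframes > 0) s.depth -= s.frames[--s.nframes];
    }

    // Step 4: an access to the unused region [A+D, A+S) must fault.
    bool accessOk(const StackInfo& s, std::uintptr_t addr) {
        if (addr < s.base || addr >= s.base + s.size) return true; // not this stack
        return addr < s.base + s.depth; // only the reserved part is touchable
    }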
Downsides I can see:
It cements one calling convention. The CPU memory manager will need a lot of state per stack, of which there are many per process: address A, size S, current depth D, plus a reservation stack - i.e. the sizes of each frame’s stack memory. That’s a lot of bookkeeping! It’s far from zero cost. The limits of how much bookkeeping the CPU can do impose limits on how deep a stack can go and how many stacks are supported - so when there are too many stacks, or one goes too deep, the CPU needs to either signal failure or engage a fallback mode and revert to behaving as CPUs do today. And of course falling back puts things back to square one. It’d therefore only mitigate situations in which an attacker cannot control the depth of the stack, or where the bug always happens inside the maximum depth the CPU can bookkeep for.
That said, stacks are ubiquitous! Hardware stack awareness opens up all kinds of new mitigations.

Why isn’t this a common idea? Has it been tried?
The core problem is some compilers initialising memory to zero in Debug mode, masking the behaviour of uninitialised data, since in most cases zero is a legit value. In Release mode, this zeroing doesn’t happen.

Devs need to be aware that the following C++ initialiser exists, which zeros data structures for you:

    MyStruct s = { };

LOL!
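To flesh that out a little (struct and field names invented for illustration): empty braces value-initialize every member, and C++11 default member initializers remove the foot-gun entirely.

    struct MyStruct {
        float scale;
        int flags;
    };

    struct MyStructSafe {
        float scale = 1.0f; // C++11 default member initializer
        int flags = 0;
    };

    int main() {
        MyStruct a = { };  // value-initialized: scale == 0.0f, flags == 0
        MyStruct b{};      // same effect, brace form
        MyStruct c;        // default-initialized: members are indeterminate -
                           // reading them is exactly the UB from the article
        MyStructSafe d;    // always starts from known values
        (void)a; (void)b; (void)c; (void)d;
        return 0;
    }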