I would ignore this benchmark because it's not going to predict anything about real-world code.
In real-world code, your caches and the CPU's pipeline are influenced by some complex combination of what happens at the call site and what else the program is doing. So a particular kind of call will perform better or worse than another kind of call depending on what else is happening.
A version of this benchmark with predictive power would compare the different kinds of call across a sufficiently diverse sample of large programs that use those calls and also do other interesting things.
> Passing structs up to size 256 is very cheap, and uses SIMD registers.
Presumably this means for all arguments combined? If, for example, you pass four pointers, each pointing to a 256-byte struct, you probably don't want to pass all four structs (or even just one or two of the four?) by value instead.
There is no pass-by-value overhead. There are only implementation decisions.
Pass by value describes the semantics of a function call, not the implementation. Passing a const reference in C++ is pass-by-value. If the user opts to pass "a copy" instead, nothing requires the compiler to actually copy the data. The compiler is required only to supply the actual parameter as if it were copied.
I usually use ChatGPT for such microbenchmarks. Of course I design the benchmark myself and use the LLM only as a dumb code generator, so I don't need to remember how to measure time with nanosecond precision (I still have to add workarounds to prevent the compiler from over-optimizing the code). It's amazing that when you get curious (for example: what is the fastest way to find an int in a small sorted array, using linear search, binary search, or a branchless full scan?) you can get the answer in a couple of minutes instead of spending 20-30 minutes writing the code manually.
By the way, the fastest was the branchless linear scan, up to 32-64 elements, as far as I remember.
Quantifying pass-by-value overhead (owen.cafe)
116 points by todsacerdoti | 28 October 2025 | 28 comments

Comments
What a fascinating CPU bug. I am quite curious as to how that came to pass.