Branchless UTF-8 Encoding (cceckman.com)
103 points by vortex_ape 9 hours ago | 28 comments

Comments

First of all, a recent LLVM patch apparently changes codegen to use CMOV instead of a branch:

https://github.com/llvm/llvm-project/pull/102885

Beyond that, Intel recently updated their manual to retroactively define the behavior of BSR/BSF on zero inputs: it leaves the destination register unmodified. This matches the AMD manual, and I suspect it matches the behavior of all existing x86-64 processors (but that will need to be tested, I guess).
If so, you don't need either a branch or CMOV. Just set a register to 32, then run BSF with the same register as destination. If the BSF input is nonzero, the 32 is overwritten with the trailing-zero count (BSF yields the index of the lowest set bit, which is exactly that count). If the BSF input is zero, then BSF leaves the register unmodified and you get 32.
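A minimal sketch of that idiom using Rust inline assembly (my illustration, not from the thread; it assumes the leave-unmodified behavior actually holds on the CPU it runs on):

    use std::arch::asm;

    /// Branch-free trailing-zero count for baseline x86-64.
    fn tzcnt32(x: u32) -> u32 {
        let mut result: u32 = 32; // the answer for x == 0
        unsafe {
            // BSF writes the index of the lowest set bit; on zero input it
            // (per the updated Intel/AMD manuals) leaves the destination alone.
            asm!(
                "bsf {r:e}, {x:e}",
                r = inout(reg) result,
                x = in(reg) x,
                options(nomem, nostack),
            );
        }
        result
    }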
Since this behavior is now guaranteed for future x86-64 processors, and assuming it's indeed compatible with all existing x86-64 processors (maybe even all x86 processors period?), LLVM will no longer need the old path regardless of what it's targeting.
Note that if you're targeting a newer x86-64 version, LLVM will just emit TZCNT, which does what you'd expect and returns 32 if the input is zero (or 64 for the 64-bit TZCNT). But as the blog post demonstrates, many people still build for baseline x86_64.
(Intel does document one discrepancy between processors: "On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.")
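For comparison, Rust's `trailing_zeros` already promises the TZCNT semantics at the language level; the only question is whether the compiler needs a branch (or CMOV, or the BSF trick above) to deliver it on a given target:

    fn main() {
        // Guaranteed by the language, whatever the target:
        assert_eq!(0u32.trailing_zeros(), 32);
        assert_eq!(0u64.trailing_zeros(), 64);
    }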
    /// Encode a UTF-8 codepoint.
    /// […]
    /// Returns a length of zero for invalid codepoints (surrogates and out-of-bounds values);
    /// it's up to the caller to turn that into U+FFFD, or return an error.
It's not a "UTF-8 codepoint"; that's horridly mangling the terminology. Code points are just code points.
The input to a UTF-8 encode is a scalar value, not a code point, and encoding a scalar value is infallible. What doubly kills me is that Rust has a dedicated type for scalar values. (`char`.)
(In languages with non-[USV]-strings…, Python raises an exception, JS emits garbage.)
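To make the distinction concrete in Rust (`char` can't hold a surrogate, so encoding never fails; the fallible step happens earlier):

    fn main() {
        // `char` is a Unicode scalar value by construction, so encoding is infallible:
        let mut buf = [0u8; 4];
        let s = '€'.encode_utf8(&mut buf);
        assert_eq!(s.as_bytes(), &[0xE2, 0x82, 0xAC]);

        // Converting a raw code point to a scalar value is where failure lives;
        // surrogates and out-of-range values are rejected here, before encoding:
        assert!(char::from_u32(0xD800).is_none());
        assert!(char::from_u32(0x110000).is_none());
    }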
If you have access to the BMI2 instruction set, I can do branchless UTF-8 encoding like in the article using only 9 instructions and 73 bytes of lookup tables:

https://gist.github.com/Validark/457b6db8aa00ded26a6681d4d25...

Only if you don't consider conditional move instructions branching/cheating :)
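Not the gist's exact 9-instruction sequence, but a sketch of the general BMI2/PDEP idea it builds on; the mask and marker tables below are my own illustration, not the gist's 73-byte tables:

    #[cfg(all(target_arch = "x86_64", target_feature = "bmi2"))]
    fn encode_utf8_pdep(c: char, out: &mut [u8; 4]) -> usize {
        use std::arch::x86_64::_pdep_u32;
        let cp = c as u32;
        // 1..=4 bytes; these comparisons compile to flag-setting, not branching, code:
        let len = 1 + (cp >= 0x80) as usize + (cp >= 0x800) as usize + (cp >= 0x10000) as usize;
        // NOTE: illustrative tables. Payload-bit slots and the
        // lead/continuation-byte markers for each encoded length:
        const PAYLOAD: [u32; 5] = [0, 0x0000_007F, 0x0000_1F3F, 0x000F_3F3F, 0x073F_3F3F];
        const MARKER: [u32; 5] = [0, 0x0000_0000, 0x0000_C080, 0x00E0_8080, 0xF080_8080];
        // PDEP scatters the code point's bits into the payload slots in one step:
        let word = unsafe { _pdep_u32(cp, PAYLOAD[len]) } | MARKER[len];
        out[..len].copy_from_slice(&word.to_be_bytes()[4 - len..]);
        len
    }

Built with BMI2 enabled (e.g. `RUSTFLAGS="-C target-feature=+bmi2"`), there's no conditional control flow here: just the table loads, a PDEP, an OR, and the store.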
I'm surprised there are no UTF-8-specific decode instructions yet, the way ARM has "FJCVTZS - Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero"
> So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”. Put differently, the “count leading zeros” intrinsic isn’t necessarily a branchless instruction. This might look nicer on another architecture!
Yes, RISC-V for example defines the instructions for counting leading / trailing zeros (clz, clzw, ctz, ctzw) such that an N-bit zero value has N of them.
I don't know if I can show it on Rust Godbolt because none of the default RISC-V targets that Rust has support the Zbb extension, but I checked with a custom target that I use locally for my emulator, and `leading_zeros()` indeed compiles to just one `clz` without any further branches. Here's a C demonstration though: https://gcc.godbolt.org/z/cKx3ajsjh
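The Rust side of that experiment is tiny; with a Zbb-enabled RISC-V target (the flags here are my guess at a reproduction, e.g. `--target riscv64gc-unknown-linux-gnu -C target-feature=+zbb`) it should lower to a single clz-family instruction:

    // Zbb defines clz/clzw of zero as the operand width, so no zero check is needed:
    pub fn count_leading_zeros(x: u32) -> u32 {
        x.leading_zeros()
    }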
Not if you're targeting x86-64-v3 or higher. Haswell (Intel) and Piledriver (AMD) introduced the LZCNT instruction that doesn't have this problem.

I love weird little tricks with popcnt/leading/trailing zero instructions.
I recently had a lot of fun getting Swift’s OptionSet bitset interface to iterate over active members.
(Unfortunately, because of weird specifics of Swift protocol associated types, I wasn’t able to actually conform OptionSet to Collection like I wanted to originally. I find it amusing that one of the first examples in the official documentation for Swift macros is to make OptionSet use an associated enum like it should.)
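The underlying bit-iteration trick is language-agnostic; sketched here in Rust rather than Swift, to match the rest of the thread: grab the lowest set bit with a trailing-zero count, clear it, repeat.

    /// Visit the index of every set bit, lowest first.
    fn for_each_set_bit(mut bits: u32, mut f: impl FnMut(u32)) {
        while bits != 0 {
            f(bits.trailing_zeros()); // index of the lowest set bit
            bits &= bits - 1;         // clear that bit
        }
    }

    fn main() {
        for_each_set_bit(0b1010_0110, |i| print!("{i} ")); // prints: 1 2 5 7
    }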