Branchless UTF-8 Encoding

(cceckman.com)

Comments

comex 17 January 2025
Incidentally, this automatic branch-if-zero from LLVM is being improved.

First of all, a recent LLVM patch apparently changes codegen to use CMOV instead of a branch:

https://github.com/llvm/llvm-project/pull/102885

Beyond that, Intel recently updated their manual to retroactively define the behavior of BSR/BSF on zero inputs: they leave the destination register unmodified. This matches the AMD manual, and I suspect it matches the behavior of all existing x86-64 processors (but that will need to be tested, I guess).

If so, you don't need either a branch or CMOV. Just set a register to 32, then run BSF with the same register as destination. If the BSF input is nonzero, the 32 is overwritten with the trailing-zero count. If the BSF input is zero, BSF leaves the register unmodified and you get 32.
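
A minimal sketch of that sequence as Rust inline assembly (the function name is mine; this assumes the unmodified-on-zero behavior holds on the target CPU):

    use core::arch::asm;

    /// Trailing-zero count on baseline x86-64 without a branch or CMOV,
    /// relying on BSF leaving its destination unmodified for zero input.
    fn tzcnt_via_bsf(x: u32) -> u32 {
        let mut count: u32 = 32; // result for x == 0; BSF won't touch it
        unsafe {
            asm!(
                "bsf {count:e}, {x:e}",
                count = inout(reg) count,
                x = in(reg) x,
                options(pure, nomem, nostack),
            );
        }
        count
    }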

Since this behavior is now guaranteed for future x86-64 processors, and assuming it's indeed compatible with all existing x86-64 processors (maybe even all x86 processors period?), LLVM will no longer need the old path regardless of what it's targeting.

Note that if you're targeting a newer x86-64 version, LLVM will just emit TZCNT, which does what you'd expect and returns 32 if the input is zero (or 64 for a 64-bit TZCNT). But as the blog post demonstrates, many people still build for baseline x86_64.

(Intel does document one discrepancy between processors: "On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.")

orlp 17 January 2025
If you have access to the BMI2 instruction set, you can do branchless UTF-8 encoding like in the article using only 9 instructions and 73 bytes of lookup tables:

    branchless_utf8:
        mov     rax, rdi
        lzcnt   ecx, esi
        lea     rdx, [rip + .L__unnamed_1]
        movzx   ecx, byte ptr [rcx + rdx]
        lea     rdx, [rip + example::DEP_AND_OR::h78cbe1dc7fe823a9]
        pdep    esi, esi, dword ptr [rdx + 8*rcx]
        or      esi, dword ptr [rdx + 8*rcx + 4]
        movbe   dword ptr [rdi], esi
        mov     qword ptr [rdi + 8], rcx
        ret

The code:

    static DEP_AND_OR: [(u32, u32); 5] = [
        (0, 0),
        (0b01111111_00000000_00000000_00000000, 0b00000000_00000000_00000000_00000000),
        (0b00011111_00111111_00000000_00000000, 0b11000000_10000000_00000000_00000000),
        (0b00001111_00111111_00111111_00000000, 0b11100000_10000000_10000000_00000000),
        (0b00000111_00111111_00111111_00111111, 0b11110000_10000000_10000000_10000000),
    ];

    const LEN: [u8; 33] = [
        // 0-10 leading zeros: not valid.
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        // 11-15 leading zeros: 4 bytes.
        4, 4, 4, 4, 4,
        // 16-20 leading zeros: 3 bytes.
        3, 3, 3, 3, 3,
        // 21-24 leading zeros: 2 bytes.
        2, 2, 2, 2,
        // 25-32 leading zeros: 1 byte.
        1, 1, 1, 1, 1, 1, 1, 1,
    ];

    /// Safety: the caller must ensure BMI2 is available (for `_pdep_u32`).
    pub unsafe fn branchless_utf8(codepoint: u32) -> ([u8; 4], usize) {
        let leading_zeros = codepoint.leading_zeros() as usize;
        let bytes = LEN[leading_zeros] as usize;
        // `bytes` is at most 4, so it's always in bounds for DEP_AND_OR.
        let (mask, or) = *DEP_AND_OR.get_unchecked(bytes);
        // Deposit the codepoint's bits into each byte's payload slots,
        // then OR in the UTF-8 leading/continuation bit patterns.
        let ret = core::arch::x86_64::_pdep_u32(codepoint, mask) | or;
        // Store big-endian (the `movbe` above) so byte 0 comes first.
        (ret.swap_bytes().to_le_bytes(), bytes)
    }
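
A hypothetical call site, assuming BMI2 is available (e.g. built with `-C target-feature=+bmi2`):

    // U+00E9 ('é') encodes as the two bytes [0xC3, 0xA9]:
    let (buf, len) = unsafe { branchless_utf8(0xE9) };
    assert_eq!(&buf[..len], b"\xC3\xA9");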
deathanatos 18 January 2025

  /// Encode a UTF-8 codepoint.
  /// […]
  /// Returns a length of zero for invalid codepoints (surrogates and out-of-bounds values);
  /// it's up to the caller to turn that into U+FFFD, or return an error.
It's not a "UTF-8 codepoint"; that's horridly mangling the terminology. Code points are just code points.

The input to a UTF-8 encode is a scalar value, not a code point, and encoding a scalar value is infallible. What doubly kills me is that Rust has a dedicated type for scalar values. (`char`.)

(In languages with non-[USV]-strings…, Python raises an exception, JS emits garbage.)
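
In Rust terms, a small sketch of the distinction: encoding a `char` is infallible, and a surrogate can't even be constructed as a `char`:

    let mut buf = [0u8; 4];
    let s: &str = 'é'.encode_utf8(&mut buf); // no error case to handle
    assert_eq!(s.as_bytes(), b"\xC3\xA9");
    // Surrogates are rejected at the boundary instead:
    assert!(char::from_u32(0xD800).is_none());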

xeeeeeeeeeeenu 17 January 2025
> So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”.

Not if you're targeting x86-64-v3 or higher. Haswell (Intel) and Piledriver (AMD) introduced the LZCNT instruction, which doesn't have this problem.
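
For example (function name is mine), building with `-C target-cpu=x86-64-v3` or `-C target-feature=+lzcnt` should reduce this to a single lzcnt plus a ret, with no branch:

    pub fn lz(x: u32) -> u32 {
        x.leading_zeros()
    }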

koala_man 17 January 2025
I'm surprised there are no UTF-8-specific decode instructions yet, the way ARM has "FJCVTZS - Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero".
Arnavion 17 January 2025
>So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”. Put differently, the “count leading zeros” intrinsic isn’t necessarily a branchless instruction. This might look nicer on another architecture!

Yes, RISC-V for example defines the instructions for counting leading / trailing zeros (clz, clzw, ctz, ctzw) such that an N-bit zero value has N of them.

I don't know if I can show it on Rust Godbolt, because none of Rust's default RISC-V targets support the Zbb extension, but I checked with a custom target that I use locally for my emulator, and `leading_zeros()` indeed compiles to just one `clz` without any further branches. Here's a C demonstration though: https://gcc.godbolt.org/z/cKx3ajsjh

purplesyringa 18 January 2025
Instead of

    let surrogate_mask = surrogate_bit << 2 | surrogate_bit << 1 | surrogate_bit;
    len &= !surrogate_mask;
consider

    len &= surrogate_bit.wrapping_sub(1);
This should work better. Alternatively, invert the conditions and do

    len &= non_surrogate_bit.wrapping_neg();
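
Spelled out as a sketch (assuming `surrogate_bit` is always 0 or 1):

    // surrogate_bit == 1  ->  wrapping_sub(1) == 0         -> len cleared
    // surrogate_bit == 0  ->  wrapping_sub(1) == u32::MAX  -> len unchanged
    fn masked_len(len: u32, surrogate_bit: u32) -> u32 {
        len & surrogate_bit.wrapping_sub(1)
    }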
emilfihlman 18 January 2025
I mean, isn't the trivial answer to just collapse the if-else tree into math that's always evaluated?

  #include <stdint.h>

  int utf8_len(uint32_t code)
  {
      uint32_t a = (code <= 0x7F);
      uint32_t b = (code <= 0x07FF);
      uint32_t c = (code < 0xD800) || (0xDFFF < code);
      uint32_t d = (code <= 0xFFFF) * c; /* surrogates don't count as 3-byte */
      uint32_t e = (code <= 0x10FFFF);
      uint32_t v = (c && e);             /* valid scalar value? */
      return -1 * !v + v * (4 - a - b - d);
  }
This should be easy for a compiler to optimise.
marxisttemp 18 January 2025
I love weird little tricks with popcnt/leading/trailing zero instructions.

I recently had a lot of fun getting Swift’s OptionSet bitset interface to iterate over active members.

(Unfortunately, because of weird specifics of Swift protocol associated types, I wasn't able to actually conform OptionSet to Collection like I wanted to originally. I find it amusing that one of the first examples in the official documentation for Swift macros is making OptionSet use an associated enum, like it should.)
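
The underlying trick in Rust rather than Swift (a sketch; names are mine):

    // Visit each set bit's index, lowest first, using trailing_zeros
    // and the classic clear-lowest-set-bit step:
    fn for_each_set_bit(mut bits: u32, mut f: impl FnMut(u32)) {
        while bits != 0 {
            f(bits.trailing_zeros());
            bits &= bits - 1; // clear the lowest set bit
        }
    }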

lxgr 17 January 2025
> Compiler explorer confirms that, with optimizations enabled, this function is branchless.

Only if you don't consider conditional move instructions branching/cheating :)

RenThraysk 17 January 2025
OR-ing 1 onto the codepoint before calling leading_zeros() should get a decent compiler to remove the branch.
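
A sketch of that idea (function name is mine). For the length computation it's harmless, since 31 and 32 leading zeros both select the 1-byte case:

    fn leading_zeros_branchless(codepoint: u32) -> u32 {
        // (codepoint | 1) is never zero, so the compiler can drop the
        // zero-input branch around bsr on baseline x86-64.
        (codepoint | 1).leading_zeros()
    }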
Dwedit 17 January 2025
Wouldn't branchless UTF-8 encoding always write 3 bytes to RAM for every character (possibly to the same address)?
sylware 18 January 2025
A CPU ISA can load the lookup tables into registers (RISC-V, for instance); then, on sufficiently big data, UTF-8 processing is branchless without any cache/memory requests for the tables.
ThatGuyRaion 17 January 2025
So is this potentially performance-improving?
decafbad 18 January 2025
Check out Erlang's bit syntax for parsing.