BIO: The Bao I/O Coprocessor

(bunniestudios.com)

Comments

bunnie 23 March 2026
Hello again HN, I'm bunnie! Unfortunately, time zones strike again...I'll check back when I can, and respond to your questions.
MayeulC 4 hours ago
Hey, glad to see you here. I'm a huge fan of your projects, and the Baochip was one I didn't see coming. Very nice surprise!

I ordered a few, thinking it would make a good logic analyzer (before the details of the BIO were published). Obviously, it's going to be a stretch with multiple cycles per instruction and a reduced instruction set. I'll see how far I can push it if I rely on multiple BIOs, perhaps with some tricks such as relying on an external clock signal. At first glance, they seemed perfect for doing some basic RLE or Huffman compression on the fly, but I am less sure now; I will have to play with it. Bit-packing may be somewhat expensive to perform, too.

One thing stood out to me in this design: the liberal use of the 16 extra registers. It's a very clever trick, but wouldn't some of these be better exposed as memory addresses? Or do you foresee applications where they are in the hot path (where the inability to write immediate values may matter)? Stuff like core ID, debug, or even GPIO direction could be hard-wired to memory addresses, leaving space for some extra features (not sure which: general-purpose registers? More queues? More GPIOs? A special-purpose HW block?).

I really like the "snap to quantum" mechanism: as you wrote, it is good for portability, though there should be a way to query frequency, if portability is really a goal.

Anyway, it's plenty for a v1, plenty of exciting things to play with, including the MMU of the main core!

Lerc 8 hours ago
I'm currently elbow deep in making a PIO+DMA sprite and tile display renderer.

Losing the high maximum data rate is quite a cost, but in my use case BIO would be the clear winner. Indexed pixel format conversion on PIO means shifting out the high bits of the palette address, then the index, then some zeros. That goes to a FIFO which is read by a DMA simply to write it to the readaddr+trigger of another DMA, which feeds into another FIFO (which is read by the program doing the transparency).

That, I suspect, becomes a much simpler task with BIO.

It is an interesting case, where just knowing that the higher potential rate of the PIO is there is a kind of comfort even when you don't currently need it.

Although for those higher rates it is very rarely reactive and most often just wiggling wires in a predetermined fashion.

I wonder if having a register that can be DMA'd to could perform the equivalent function of side-set to play a fixed sequence to some pins at full clock speed. Like playing macros.

I guess another approach: a 32-bit register could shift out 4 bits of side-set per clock cycle. Then you could pre-program the next 8 cycles in a single 32-bit write. It would give you breathing space to drive the main data while the side-set does fixed-pattern signaling.

jononor 9 hours ago
Very much looking forward to playing with the BIO functionality on the Baochips that I have ordered. Thanks for the nice write-up! It is fascinating to see how widely applicable the "just throw a RISC-V core (or 4) in there" design pattern is. The wide range of standardized CPU designs, the number of mature open-source implementations, the lack of royalty fees, and the ready-to-run programming toolchains really drive this to a new level. And CPUs are small in die area anyway compared to SRAM! It was cool to see on the RP2350 how they just threw in another two RISC-V cores next to the ARMs.

For the reasons above, I think this trend will continue. For example, in my specialization of edge machine learning, we are seeing MEMS sensors that integrate user-programmable DSP+ML+CPU right there on the sensor chip.

mrlambchop 23 March 2026
I loved this article and had wanted to play with PIO for a long time (or at least, learn from it through playing!).

One thing jumped out here - I assumed the CISC-style PIO had a mental model of "one instruction per cycle", and thus it was pretty easy to reason about the underlying machine (including any delay slots etc...).

For this RISC model using C, we are now reasoning about compiled code with somewhat variable instruction timing (1-3 cycles), and that introduces an uncertainty: the compiler, and understanding its implementation.

I think this means that the PIO is timing-first (timing == waveform), whereas BIO is clarity-first, with C as the expression and then explicit hardware synchronization.

I like both models! I am wondering about the quantum delays, however, that are being used to set the deadlines - here, human-derived wait delays use knowledge of the compiled instructions to set the timing.

Might there not be a model of 'preparing the next hardware transaction' and then 'waiting for an external synchronization', such as an external signal or internal clock, so we don't need to count the instruction cycles so precisely? On the external-signal side, I guess the instruction is 'wait for GPIO change' or something, so the value is immediately ready (int i = GPIO_read_wait_high(23) or something), and the write side is doing the same, but synchronizing (GPIO_write_wait_clock(24, CLOCK_DEF)) as an alternative to the explicit quantum delays.

This might be a shadow register / latch model in more generic terms - prep the work in shadow, latch/commit on trigger.

Anyway, great work Bunnie!

throwa356262 12 hours ago
The large area usage was a surprise. But is the real PIO also this huge?

My point is, maybe this is one of those designs that blows up in FPGA. Or maybe the open-source version of the PIO is simply not as area-efficient as the RPi version?

alex7o 23 March 2026
This is actually super cool: you can use these as both math accelerators and as I/O, and since they run in lockstep you can kind of use them as int-only shader units. I don't know how this is useful yet.

Btw, I am curious about edge cases. Maybe I missed this in the article, but what is the size of the FIFO?

Or the more dangerous part: timing is now complex to determine in complex cases. For example, if each read from the FIFO is an ISR, you only have until the next FIFO read's worth of instructions, otherwise you stall the system, and that looks too hard to debug to me.

guenthert 23 March 2026
I appreciate the intro, motivation and comparison to the PIO of the RP2040/2350. How would this compare to the (considerably older, slower, but more flexible) Parallax P8X32A ("Propeller")?
genxy 21 hours ago
Thanks for making this a blog post!

Have some on the way! Can't wait!

jauntywundrkind 23 March 2026
3 comments on this, from 2d ago, https://news.ycombinator.com/item?id=47469744
dmitrygr 23 March 2026
> Above is the logic path isolated as one of the longest combination paths in the design, and below is a detailed report of what the cells are.

Which is an argument either that "fpga_pio" is badly implemented, or that PIO is unsuitable for FPGA impls. Real silicon does not need to use a shitton of LUT4s to implement this logic; it can be done much more efficiently and closes timing at higher clocks (as we know, since PIO will run near a GHz).

RS-232 21 hours ago
> The build script compiles C code down to a clang intermediate assembly, which is then handed off to a Python script that translates it into a Rust macro which is checked into Xous as a buildable artifact using its pure-Rust toolchain.

Ah yes, the good ol' "we solved the C problem by turning it into four other problems" pipeline.