Efficient Computer's Electron E1 CPU – 100x more efficient than Arm?

(morethanmoore.substack.com)

Comments

pclmulqdq 18 hours ago
This is a CGRA. It's like an FPGA but with bigger cells. It's not a VLIW core.

I assume that, like all past attempts at this, it's about 20x more efficient when the code fits in the array (FPGAs get this ratio), but once your code size grows past something very trivial, the grid config needs to switch, and that costs tons of time and power.
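
Rough back-of-envelope model of that effect (all numbers below are illustrative guesses, not measurements from the E1 or any FPGA):

    # Toy model: effective efficiency gain of a spatial array vs. a baseline CPU
    # when the kernel periodically exceeds the array and forces a reconfiguration.
    # All parameters are made-up illustrative values, not vendor numbers.

    def effective_gain(base_gain=20.0, ops_between_reconfigs=1e6, reconfig_cost_ops=1e5):
        """base_gain: efficiency ratio while the code fits the array.
        ops_between_reconfigs: useful work done per configuration.
        reconfig_cost_ops: work-equivalent cost of swapping the grid config."""
        useful_time = ops_between_reconfigs / base_gain
        total_time = useful_time + reconfig_cost_ops
        return ops_between_reconfigs / total_time

    for window in (1e7, 1e6, 1e5, 1e4):
        print(f"reconfig every {window:.0e} ops -> effective gain "
              f"{effective_gain(ops_between_reconfigs=window):.1f}x")

The gain collapses quickly once reconfigurations become frequent relative to the useful work per configuration.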

gamache 16 hours ago
Sounds a lot like the GreenArrays GA144 (https://www.greenarraychips.com/home/documents/greg/GA144.ht...)! Sadly, without a bizarre and proprietary FORTH dialect to call its own, I fear the E1 will not have the market traction of its predecessor.
Imustaskforhelp 17 hours ago
Pardon me, but could somebody here explain this to me like I'm 15? It's late at night and I can't go down another rabbit hole, and I'd appreciate it. Cheers and good night, fellow HN users.
kendalf89 19 hours ago
This grid-based architecture reminds me of a programming game from Zachtronics, TIS-100.
archipelago123 17 hours ago
It's a dataflow architecture. I assume the hardware implementation is very similar to what is described here: https://csg.csail.mit.edu/pubs/memos/Memo-229/Memo-229.pdf. The problem is that it becomes difficult to exploit data locality, and there is only so much optimization you can perform at compile time. Also, the motivation for these types of architectures (e.g. the lack of ILP extracted by von Neumann-style architectures) is largely absent in modern OoO cores.
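
For anyone wanting to poke at the execution model that memo describes, here's a minimal, heavily simplified token-based dataflow interpreter; the node set and graph are my own toy example, not the E1's actual ISA:

    # Minimal dataflow sketch: a node fires once all of its input tokens have
    # arrived, consumes them, and pushes its result to its consumers.
    # Toy example only; not the E1's instruction set.

    graph = {
        # node: (operation, list of input node names)
        "a":   ("const", []),        # produces 3
        "b":   ("const", []),        # produces 4
        "mul": ("mul", ["a", "b"]),
        "add": ("add", ["mul", "a"]),
    }
    consts = {"a": 3, "b": 4}

    tokens = {name: [] for name in graph}   # arrived operands per node
    ready = [n for n, (op, ins) in graph.items() if not ins]
    results = {}

    while ready:
        node = ready.pop()
        op, ins = graph[node]
        if op == "const":
            results[node] = consts[node]
        elif op == "mul":
            results[node] = tokens[node][0] * tokens[node][1]
        elif op == "add":
            results[node] = tokens[node][0] + tokens[node][1]
        # deliver the result to every node that lists this one as an input
        for consumer, (_, cins) in graph.items():
            for i in cins:
                if i == node:
                    tokens[consumer].append(results[node])
            if (cins and len(tokens[consumer]) == len(cins)
                    and consumer not in results and consumer not in ready):
                ready.append(consumer)

    print(results["add"])   # (3 * 4) + 3 = 15
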
pedalpete 13 hours ago
Though I'm sure this is valuable in certain instances, thinking about many embedded designs today, is the CPU/micro really the energy hog in these systems?

We're building an EEG headband with a bone-conduction speaker, so in order of power draw, our speaker/sounder and LEDs cost orders of magnitude more than our microcontroller.

In anything with a screen, that screen is going to suck all the juice, then your radios, etc. etc.

I'm sure there are very specific use cases where a more energy-efficient CPU will make a difference, but I struggle to think of anything with a human interface where the CPU is the power bottleneck, though I could be completely wrong.
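
A crude power-budget illustration of the point (all numbers are generic order-of-magnitude guesses for a hypothetical wearable, not measurements from any real device):

    # Back-of-envelope power budget for a hypothetical always-on wearable (mW).
    # Values are rough order-of-magnitude guesses, not measured data.
    budget_mw = {
        "bone-conduction speaker (active)": 50.0,
        "LEDs": 10.0,
        "BLE radio (duty-cycled)": 3.0,
        "EEG AFE": 1.0,
        "microcontroller": 0.5,
    }
    total = sum(budget_mw.values())
    for part, mw in sorted(budget_mw.items(), key=lambda kv: -kv[1]):
        print(f"{part:35s} {mw:6.1f} mW  ({100 * mw / total:4.1f}%)")
    print(f"{'total':35s} {total:6.1f} mW")
    # Even a 100x more efficient MCU only shaves ~0.5 mW off ~64.5 mW here.
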

ZiiS 19 hours ago
Percentage chance this is 100x more efficient at the general-purpose computing ARM is optimized for: 1/100%
nubinetwork 2 hours ago
100x more efficient at what cost? If it's slower than a Pentium II, nobody's gonna want it, except for the embedded users...
pbhjpbhj 3 hours ago
From a position of naive curiosity -- would this work as a coprocessor: take the most inefficient/most optimisable procedures and compile them for the fabric? Or would you lose all your gains to the extra processing needed to ship data between cores/processors?

How 2D is it? Compiling to a fabric sounds like it needs lots of difficult routing; it seems like 3D would make the routing much more compact?
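
One way to get intuition for the 2D-vs-3D question: compare the average Manhattan distance between random tiles in a flat grid versus a stack with roughly the same tile count. This is a purely geometric sketch that ignores all real place-and-route constraints:

    # Average Manhattan distance between two random tiles: 2D grid vs. 3D stack
    # with roughly the same number of tiles. Pure geometry, ignores real routing.
    import random

    def avg_dist(dims, samples=20000):
        total = 0
        for _ in range(samples):
            a = [random.randrange(d) for d in dims]
            b = [random.randrange(d) for d in dims]
            total += sum(abs(x - y) for x, y in zip(a, b))
        return total / samples

    print("2D 16x16  :", round(avg_dist((16, 16)), 2))    # ~10.6 hops, 256 tiles
    print("3D 6x6x7  :", round(avg_dist((6, 6, 7)), 2))   # ~6.2 hops, 252 tiles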

gchadwick 7 hours ago
> The interconnect between tiles is also statically routed and bufferless, decided at compile time. As there's no flow control or retry logic, if two data paths would normally collide, the compiler has to resolve it at compile time.

This sounds like the most troublesome part of the design to me. It's very hard to do this static scheduling well. You can end up having to hold everything up waiting for some tiny thing to complete so you can proceed forward in lock step. You'll also have situations where the static scheduling works 95% of the time, but in the other 5% something fiddly happens. Without any ability for dynamic behaviour and data movement, small corner cases dominate how the rest of the system behaves.

Interestingly, you see this very problem in hardware design! All paths between logic gates need to stay under some maximum delay to reach a target clock frequency. Often you get long, fiddly paths relating to corner cases in behaviour that require significant manual effort to resolve before you achieve timing closure.
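
A toy illustration of the lock-step problem (latencies are made up): with a static, bufferless schedule the whole epoch runs at the latency of the slowest path, whereas dynamic flow control could run at the average.

    # Static, bufferless scheduling toy: every tile must advance in lock step,
    # so the epoch length is set by the worst-case path, not the common case.
    # Latencies are purely illustrative.
    paths = {
        "hot loop body":        4,   # cycles, 95% of the work
        "rare error handling": 40,   # cycles, fiddly corner case
        "address calc":         3,
    }
    static_epoch = max(paths.values())       # everyone waits for the worst path
    dynamic_avg = 0.95 * 4 + 0.05 * 40       # what dynamic flow control could average
    print("static lock-step epoch :", static_epoch, "cycles")
    print("dynamic average        :", dynamic_avg, "cycles")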

mellosouls 2 hours ago
Sidenote/curio: Arm as a main processor was pioneered in the Acorn Archimedes. Its (non-BBC) predecessor in the Acorn product range was the ... Electron.
Grosvenor 19 hours ago
Is this the return of Itanium? Static scheduling and pushing everything to the compiler, it sure sounds like it.
rpiguy 21 hours ago
The architecture diagram in the article resembles the approach Apple took in the design of their neural engine.

https://www.patentlyapple.com/2021/04/apple-reveals-a-multi-...

Typically these architectures are great for compute. How will it do on scalar tasks with a lot of branching? I doubt well.

artemonster 18 hours ago
As a person who is heavily invested and interested in the CPU space, especially embedded, I am HIGHLY skeptical of such claims. Somebody played TIS-100, remembered that the GA144 failed, and decided to try their own. You know what would be a simple proof of your claims? No, not a press release. No, not a pitch deck or a youtube video. And NO, not even working silicon, you silly. A SIMPLE FUCKING ISA EMULATOR WITH A PROFILER. Instead we got a bunch of whitepapers. Yeah, I call it a 90% chance of total BS and vaporware.
variadix 17 hours ago
Pretty interesting concept, though as other commenters have pointed out, the efficiency gains likely break down once your program doesn't fit onto the mesh all at once. It also looks like it requires a "sufficiently smart compiler", which isn't a good sign either. The need to do routing etc. reminds me of the problems FPGAs have during place and route (effectively balanced min-cut / graph partitioning, which is NP-hard); hopefully compilation doesn't take as long as FPGA synthesis does.
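
For a flavour of why the compiler's job starts to look like place and route, here's a trivial greedy placement sketch: put each dataflow op on the free tile closest to its already-placed producers and sum the resulting wire lengths. Real tools use far smarter placement plus an actual router; this is just the shape of the problem, with a made-up op list.

    # Greedy placement sketch: assign each op to the free grid tile that minimizes
    # Manhattan distance to its already-placed producers. Toy heuristic only.
    GRID = 4  # 4x4 tile array

    ops = {                      # op: list of producer ops (topological order)
        "ld_a": [], "ld_b": [],
        "mul":  ["ld_a", "ld_b"],
        "acc":  ["mul"],
        "st":   ["acc"],
    }

    free = [(x, y) for x in range(GRID) for y in range(GRID)]
    placed = {}

    def cost(tile, producers):
        return sum(abs(tile[0] - placed[p][0]) + abs(tile[1] - placed[p][1])
                   for p in producers)

    total_wire = 0
    for op, producers in ops.items():
        best = min(free, key=lambda t: cost(t, producers))
        placed[op] = best
        free.remove(best)
        total_wire += cost(best, producers)

    print(placed)
    print("total wire length:", total_wire)
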
wolfi1 19 hours ago
Reminds me of the architecture of transputers, but on the same piece of silicon.
icandoit 17 hours ago
I wondered if this was using interaction combinators like the vine programming language does.

I haven't read much that explains how they do it.

I have been very slowly trying to build a translation layer between starlark and vine as a proof of concept for massively parallel computing. If someone better qualified finds a better solution, the market is sure to have demand for you. A translation layer is bound to be cheaper than teaching devs to write in jax or triton or whatever comes next.

vendiddy 19 hours ago
I don't know much about CPUs so maybe someone can clarify.

Is this effectively a bunch of tiny processors on a single chip, each with its own storage and compute?

ACCount36 13 hours ago
I can't see this ever replacing general purpose Arm cores, but it might be viable in LP-optimized always-on processors and real time control cores.
SoftTalker 19 hours ago
> Efficient’s goal is to approach the problem by static scheduling and control of the data flow - don’t buffer, but run. No caches, no out-of-order design, but it’s also not a VLIW or DSP design. It’s a general purpose processor.

Sounds like a mainframe. Is there any similarity?

nnx 12 hours ago
Not sure about general-purposeness, but the architecture looks rather perfect for LLM inference?

Wonder why they do not focus their marketing on this.

trhway 14 hours ago
> spatial data flow model. Instead of instructions flowing through a centralized pipeline, the E1 pins instructions to specific compute nodes called tiles and then lets the data flow between them. A node, such as a multiply, processes its operands when all the operand registers for that tile are filled. The result then travels to the next tile where it is needed. There's no program counter, no global scheduler. This native data-flow execution model supposedly cuts a huge amount of the energy overhead typical CPUs waste just moving data.

Should work great for NNs.
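
A tiny sketch of why that maps well onto NN kernels: a dot product as multiply "tiles" feeding an accumulate "tile", each firing once its operand slots are full. This is my own toy model of the firing rule quoted above, not Efficient's toolchain.

    # Toy spatial pipeline for a dot product: each "tile" fires when all of its
    # operand slots are filled and forwards its result to the next tile.
    # My own illustration of the firing rule, not Efficient's toolchain.

    class Tile:
        def __init__(self, fn, n_operands, dest=None):
            self.fn, self.n, self.dest, self.slots = fn, n_operands, dest, []

        def receive(self, value):
            self.slots.append(value)
            if len(self.slots) == self.n:        # all operand registers filled
                result = self.fn(self.slots)
                self.slots = []
                if self.dest:
                    self.dest.receive(result)
                else:
                    print("result:", result)

    # A 4-element dot product: four multiply tiles feed one accumulate tile.
    acc = Tile(fn=sum, n_operands=4)
    muls = [Tile(fn=lambda s: s[0] * s[1], n_operands=2, dest=acc) for _ in range(4)]

    a, b = [1, 2, 3, 4], [10, 20, 30, 40]
    for tile, x, y in zip(muls, a, b):
        tile.receive(x)
        tile.receive(y)
    # prints: result: 300  (1*10 + 2*20 + 3*30 + 4*40)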

renewiltord 18 hours ago
Is there a dev board available? Seems hard to find. I am curious.