What does HackerNews think of Reduceron?

FPGA Haskell machine with game-changing performance. Reduceron is Matthew Naylor, Colin Runciman and Jason Reich's high-performance FPGA softcore for running lazy functional programs, including hardware garbage collection. Reduceron has been implemented on various FPGAs with clock frequencies ranging from 60 to 150 MHz depending on the FPGA. A high degree of parallelism allows Reduceron to implement graph evaluation very efficiently. This fork aims to continue that development, with a view to practical applications. Comments, questions, etc. are welcome.

Language: Haskell

Hey I don't disagree. You bring up some good points, so let me address them individually:

> AMD CPUs see major performance increases every generation. The 7700X is more than 100% faster than the 2700X, and that is at the same core count. Higher core counts beyond 8 cores have become much more accessible.

You can check my math, but single-threaded performance has only increased by a factor of about 3 since 2000. So a modern 8-core machine would be about 24 times faster than, say, a MIPS or PowerPC 601 at the same clock speed, back when CPUs had 4 pipeline stages, didn't suffer the kinds of cache-miss penalties we see today, and so didn't need excessively complex branch prediction. A 16-core machine would be 48 times faster. But if you look at transistor count, CPUs in 2000 had 1-10 million transistors, while today they have 1-10 billion and GPUs have 10-100 billion. So CPUs should have 1,000-10,000 times the performance, not 24 or 48. There's simply no way for traditional CPUs to scale to the level I'm talking about, which I realized in the late 90s while getting my computer engineering degree from UIUC.

https://en.wikipedia.org/wiki/Transistor_count
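In case anyone wants to check that arithmetic, here is the same back-of-the-envelope comparison written out in Haskell; the factors are the rough figures quoted above, not measurements:

```haskell
-- Back-of-the-envelope check of the figures above; these are the rough
-- numbers quoted in the thread, not benchmarks.
singleThreadGain, eightCore, sixteenCore :: Double
singleThreadGain = 3                      -- claimed single-thread gain since ~2000
eightCore        = 8  * singleThreadGain  -- ~24x for an 8-core machine
sixteenCore      = 16 * singleThreadGain  -- ~48x for a 16-core machine

transistorGain :: Double
transistorGain = 1e9 / 1e6                -- ~1,000x more transistors (1-10M -> 1-10B)

-- The gap being pointed at: a ~1,000-10,000x transistor budget buying only ~24-48x.
shortfall :: Double
shortfall = transistorGain / eightCore    -- ~42x unaccounted for, at the low end
```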

> Having thousands of dumb cores is pointless unless you want to work with sparse data. An out-of-order core isn't bottlenecked by compute; it is mostly bottlenecked by memory access latency, and by bandwidth if you do end up using vector instructions.

I agree, so the chip I'm envisioning would have an array of local memories (one in or beside/above each CPU), which negates this argument. The problem then becomes orchestrating those memories to appear as one coherent address space. I want to use a content-addressable memory with copy-on-write, so that the memory works like a hash tree (as in BitTorrent). A cyclic redundancy check (CRC) or similar could be used for the hashing, with a fallback to a lower-collision hash like SHA when a block clashes. This would all be encapsulated in an abstraction below the level of the code, following the same principles as cache coherence. We'd mainly use auto-parallelizing higher-order methods along the lines of map-reduce and scatter-gather arrays, within a runtime similar to Go/Erlang, Octave/MATLAB or Haskell/Julia (or even vanilla C or Lisp), to make it appear that we are programming a single synchronous-blocking thread of execution.

I've had this approach fully formed in my head since about 2010. The original idea came from programming games back in the 90s, when Apple crippled its memory busses for cost reasons and to prevent competition between its entry-level machines and its flagship lines; that's when I saw that the memory bus is the only real bottleneck in computers today.
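To make the hash-memory idea slightly more concrete, here is a toy Haskell sketch of a content-addressed block store with a cheap hash and a stronger fallback key on clashes. The FNV-style function standing in for CRC, the Strong constructor standing in for SHA, and all the names are illustrative assumptions, not the actual design:

```haskell
-- Toy sketch only: a content-addressed block store with a cheap hash and a
-- stronger fallback key on clashes. The FNV-style hash stands in for CRC and
-- the Strong constructor stands in for SHA; none of this is the real design.
import qualified Data.Map.Strict as M
import Data.Bits (xor)
import Data.Char (ord)
import Data.Word (Word64)

type Block = String                    -- stand-in for a fixed-size memory block

-- Cheap hash (FNV-1a): the kind of thing that is trivial to compute in hardware.
cheapHash :: Block -> Word64
cheapHash = foldl step 0xcbf29ce484222325
  where step h c = (h `xor` fromIntegral (ord c)) * 0x100000001b3

-- Keys: normally the cheap digest; escalate to a strong key only on a clash.
data Key = Cheap Word64 | Strong Block deriving (Eq, Ord, Show)

type Store = M.Map Key Block

-- Copy-on-write flavour: identical blocks dedup to one entry; a clash with a
-- different block falls back to the strong key instead of overwriting.
insertBlock :: Block -> Store -> (Key, Store)
insertBlock b s =
  case M.lookup k s of
    Nothing           -> (k, M.insert k b s)   -- new block
    Just b' | b' == b -> (k, s)                -- duplicate content: share it
    Just _            -> let k' = Strong b     -- cheap-hash clash: escalate
                         in (k', M.insert k' b s)
  where
    k = Cheap (cheapHash b)
```

In the imagined hardware, the map would be the array of local memories and the cheap hash would be computed next to them; the point is only that clashes and deduplication can be handled below the level of the programming model.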

You do touch on one major limitation, though: there would be a 10-100x overhead to build my design on an FPGA out of multiplexers and LUTs, then another 10-100x for the hash memory, and at least 50% of the die lost to local memories. So I might need 4 orders of magnitude more transistors to achieve existing performance. The hashing could then add 10-100x the latency, putting us at 6 orders of magnitude worse performance in the worst case. I'd plan to get around that by de-prioritizing latency, optimizing the best case (roughly the no-branch-miss, no-cache-miss case), and then throwing hardware at it. It's much simpler to just increase the die size by 10-100x on a side than to develop the next architecture or the next set of VLSI design rules. So at scale, this new chip simply wouldn't face the linear speed-increase limitations that all processors today face. It might be as simple as plugging in another chip on a PCI bus to get another 1000+ cores under the same hash memory, just like how we program the web with hash-id caching at the edges with services like Cloudflare.

> So the only remaining use case is sparse data. The expectation is that you are going to get cache misses all the time anyway, so the benefit of a large cache is negated by the fact that the same data is rarely accessed again.

Exactly, which is why each core would have little or no cache. If some number of cores want to crunch away on sparse data while others run the OS, there is no conflict. Note that this strategy runs counter to just about every processor since the DEC Alpha, which I recall had 3/4 of its die area devoted to cache. It went that route because DEC was never able to figure out multicore, and its demise coincided with the loss of R&D funding after the Dot Bomb around 2001. Note that we also lost cluster computing (remember the Beowulf cluster jokes?) because GPUs were "good enough" for the primarily graphics-driven workloads of desktop publishing and gaming. Never mind that GPUs fail for most other real-world business-logic use cases.

> Oh, also, I am already assuming you want to tape out your chip and that the FPGA is just a stepping stone. If all you do is insist on running softcore processors with no special architecture (e.g. the Reduceron) on an FPGA, then the whole exercise is meaningless.

Thank you, I hadn't heard of the Reduceron!

https://www.cs.york.ac.uk/fp/reduceron/

https://github.com/tommythorn/Reduceron

Without getting too far out in the weeds, I appreciate what they're trying to do, but wide busses won't work for that. This strikes me as perhaps an extension of SIMD and VLIW, which are what I'm trying to get away from so that we can get back to ordinary desktop computing.

--

Which brings me to my main point. It sounds like you're concerned with your use cases around machine learning. And that's great, I'm not knocking it; GPUs or AI cores for TensorFlow are fine for that.

But what I need is horsepower for unpredictable workloads. I want to explore genetic algorithms, simulated annealing, k-means clustering, and a host of algorithms beyond those, which don't work well within a shader paradigm. I need to interact with system calls, and disks, and the network. I need to run multiple types of workloads simultaneously. And I can't be dropping down to intrinsics from a high-level language like Python to do that, although the rest of the world seems to enjoy that mental tax for reasons I may never understand.

Other than internet-distributed stuff like SETI@home, Folding@home and BOINC, or EC2 on AWS, there is simply no machine available today that can do what I need. So you're talking past me with efficiency and optimization concerns, while I'm unhoused with no meal ticket.

Now, a handful of people over the last 2 decades have grokked what I'm getting at, and are trying to achieve desktop computing on GPU:

https://en.wikipedia.org/wiki/Transputer

https://en.wikipedia.org/wiki/Multiple_instruction,_multiple...

https://www.microsoft.com/en-us/research/video/mimd-on-gpu/

http://aggregate.org/MOG/

Due to an almost complete lack of industry support, these projects are destined to fail.

Which is why I've all but given up on the status quo ever changing. It's like I can see an entire alternate reality where kids could build stuff like C-3PO the way Anakin did, with off-the-shelf multiprocessing hardware. But because we're focused on SIMD and going to great lengths to achieve even the slightest parallelism, we can't even build neurons in hardware. Which is really where I'm going with this: a way to emulate 100 billion neurons, where each one has the computing power of perhaps a MOS 6502 or Zilog Z80. Then let a genetic algorithm evolve that core into something that can actually think emergently with its neighbors.

Versus the million-monkeys approach we have today, where teams of scientists work frantically for decades, spending billions of dollars to build things like large language models (LLMs). For all of the excitement around that, I just find myself tired and dejected, reminiscing about an alternate history that never came to be.

Anyway, sorry for the overshare, I'll just go back to my day job building CRUD apps on the web, running as fast as I can in place to make rent like Sam on Quantum Leap, never able to exit the Matrix. Maybe Gen Z will pull off what Gen X was blocked from doing at every turn. But I digress.

> I've had a similar experience at university while using Haskell. It's a very elegant lazy pure functional language, but it's glacially slow. The natural, low-friction path is very inefficient. The efficient path is unidiomatic and verbose.

Well, if you run a Haskell program on a "C-Machine", of course a comparable program in a "C-Language" will be faster, as it doesn't need to bridge any gap in execution semantics.

The point is: almost all modern computers are "C-Machines". Modern CPUs even go to great lengths to present a "PDP-7-like computer" to the outside world, even though they work quite differently internally. (The most effective optimizations, like cache hierarchies, pipelining, out-of-order execution, JIT compilation to native instructions [CPU-internal "micro-ops"], and some more magic, are "hidden away"; they're "transparent" to programmers and often not even accessible to them.) So not only is there nothing but "C-Machines"; those "C-Machines" are even highly optimized to execute "C-Languages" as efficiently as possible, and nothing else! If you want to feed in something that's not a "C-Language", you first have to translate it to one. That transformation will almost always make your program less efficient than writing it (by hand) in a "C-Language" directly. That's obvious.

On the other hand, running Haskell on a "Haskell-Machine"¹ is actually quite efficient. (I think that, depending on the problem, it even outperforms a "C-Machine" by some factor; I don't remember the details and would need to look through the papers to be sure…). On such a machine an idiomatic C or Rust program would be "glacially slow", of course, for the same reason as the other way around: the need to adapt execution semantics before such non-native programs can run will obviously make the "translated" programs much slower than programs built in languages much closer to the execution semantics provided by the machine's hardware-implemented evaluator.
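For readers who haven't seen what such a hardware evaluator actually evaluates, here is a deliberately tiny Haskell sketch of template instantiation, the style of graph reduction the Reduceron implements. The real machine shares and updates graph nodes and performs roughly one whole template instantiation per clock cycle using wide parallel memories; this toy version is sequential, does no sharing, and is purely illustrative:

```haskell
import qualified Data.Map as M

-- A tiny expression language: integers, variables, application, strict add.
data Expr = Num Int | Var String | App Expr Expr | Add Expr Expr

-- A supercombinator ("template"): parameter names plus a body.
data Template = Template [String] Expr

type Program = M.Map String Template

-- Unwind the spine collecting arguments, then instantiate the template by
-- substituting arguments for parameters in one go. (No sharing, no partial
-- applications; the real machine updates the graph and shares results.)
reduce :: Program -> Expr -> Int
reduce prog = go []
  where
    go args (App f a) = go (a : args) f
    go _    (Add x y) = reduce prog x + reduce prog y
    go _    (Num n)   = n
    go args (Var f)   =
      let Template ps body = prog M.! f
          env              = M.fromList (zip ps args)
      in go (drop (length ps) args) (instantiate env body)

    instantiate env (Var x)   = M.findWithDefault (Var x) x env
    instantiate env (App f a) = App (instantiate env f) (instantiate env a)
    instantiate env (Add x y) = Add (instantiate env x) (instantiate env y)
    instantiate _   e         = e

-- double x = x + x ; main = double 21  ==>  42
demo :: Int
demo = reduce prog (Var "main")
  where
    prog = M.fromList
      [ ("double", Template ["x"] (Add (Var "x") (Var "x")))
      , ("main",   Template []    (App (Var "double") (Num 21))) ]
```

On a conventional CPU every one of those substitutions becomes a chain of dependent loads and stores; doing the whole instantiation as one wide operation is exactly the kind of step a dedicated evaluator can make cheap.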

That said, I understand why we can't have dedicated hardware evaluators for every kind of (significantly different) language. Developing and optimizing hardware is just too expensive and takes too much time, at least if you'd like to compete with the status quo.

But I could imagine, in the future, some kind of high-level "meta language" targeting FPGAs that could be compiled down to efficient hardware-based evaluators for programs written in it. Maybe this could even end the predominance of "C-Languages" when it comes to efficient software? Actually, the inherently serial command-stream semantics of "C-Languages" aren't well suited to the parallel data-flow hardware architectures we've been using at the core for some time now (where we even do a lot of gymnastics to hide the true nature of the metal by "emulating a PDP-7" to the outside world, as that's what "C-Languages" are built for and expect as their runtime).

To add to the topic of the submission: are there any hardware implementations of APLs? Maybe on GPUs? (That seems like a good fit for array-processing languages.)

¹ https://github.com/tommythorn/Reduceron

> I also recall reading a paper that got into implementing a GC on silicon for a graph rewriting based architecture

Maybe this one:

https://github.com/tommythorn/Reduceron

You should examine why you dream of this; what is the attraction?

This isn't my itch, but some possible answers could include having hardware enforced type safety and architectural support for efficient execution. To give a rather extreme example of what is possible with a dedicated architecture, study the Reduceron [1,2]. IMO, doing something similar for Lisp would be much much easier.

Could it be done? Yes, absolutely and it would be great fun. You'd have to work with an FPGA though unless you have a sizable fortune to fab a chip (though 28nm is almost affordable).

[1] https://www.cs.york.ac.uk/fp/reduceron [2] https://github.com/tommythorn/Reduceron

PS: Reduceron has a hardware garbage collector

To sum it up, it's a kind of chicken-and-egg problem. Imperative/OO is tied to the physical implementation of processors, whereas FP is tied to a mathematical construct for which conventional CPUs are not optimized.

Related:

https://github.com/tommythorn/Reduceron

https://stackoverflow.com/questions/15005670/why-is-there-no...

Some basic misunderstanding here: "However, as a stack machine architecture, there is no opportunity for the instruction level parallelism exploited by modern architectures as the exact state of the stack which is depended on by every instruction changes with every instruction. This forces all program execution to wait during slow operations like main memory reads, ..."

This is of course nonsense. Even an in-order implementation with non-blocking caches will let execution proceed in parallel with a load until the result is demanded. An actual out-of-order implementation will rename the (implicit) registers and proceed as normal.
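A contrived illustration of that point, sketched as plain Haskell data rather than any real ISA (the ops and the trace are made up):

```haskell
-- Contrived stack-machine trace (not a real ISA): ops that don't consume the
-- pending load's result can issue while the load is outstanding, just as on a
-- register machine with a non-blocking cache.
data StackOp
  = Load Int   -- push mem[addr]; the value may arrive many cycles later
  | Push Int   -- push an immediate
  | AddOp      -- pop two values, push their sum
  deriving Show

-- The two pushes and the first AddOp depend only on immediates, so an in-order
-- core keeps issuing them under the Load's shadow; only the final AddOp, which
-- consumes the loaded value, has to wait for memory.
trace :: [StackOp]
trace = [Load 0x100, Push 2, Push 3, AddOp, AddOp]
```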

The _only_ issue with stack architectures is that they are awkward (but not impossible) for compilers to generate optimized code for.

EDIT: I agree with the basic motivation for pushing all the way to hardware (just not for Lisp). That's why I work on https://github.com/tommythorn/Reduceron