Finally! I've waited 25 years for something like this, and the wait totally derailed the career I had envisioned for myself. I might even lose the option of making fun of all new chips for providing marginal performance increases at exponentially growing complexity, and have to find a new shtick.

A few predictions/concerns:

* I can't find a price, so I'm predicting it will be over $1,000, which will keep it from going mainstream (and also make it ~10 times more expensive than it should be).

* There will be poison pill(s). Maybe flash memory that can only be written 1000 times, preventing its use for evolutionary hardware and genetic algorithms. Maybe the place-and-route software won't be good enough to prevent short circuits, so certain configurations will burn up. Maybe some aspect of the software will be proprietary and/or encumbered by patents, preventing hobbyists from thinking outside the box and "getting real work done".

* Any dedicated hardware like memories, ALUs, etc. may be misaligned for various use-cases. I just want an array of RISC-V, Arm, DEC Alpha, PowerPC 601/603, something like that, starting with 2 cores and eventually topping out at 100 or 1,000. And right where I'll need memories near the CPUs, something in the FPGA will lack the interconnect to allow that.

I hope I'm wrong about any or all of these. Price I can live with, as long as economies of scale or competition kick in and eventually deliver something under $1,000. The rest of it... eh, I'm not holding my breath. I've been underwhelmed by all previous FPGAs, but maybe they didn't count. Maybe this is something new that finally manifests the original vision of what FPGAs could be.

AMD CPUs see major performance increases every generation. The 7700X is more than 100% faster than the 2700X, and that is at the same core count. Higher core counts beyond 8 cores have become much more accessible.

Having thousands of dumb cores is pointless unless you want to work with sparse data. An out-of-order core isn't bottlenecked by compute; it is mostly bottlenecked by memory access latency, and also by bandwidth if you do end up using vector instructions. This means most of your core will be memory, and your thousand-core chip will turn back into a dozen-core chip. If you need a dumb accelerator, then GPUs already exist and you don't need a custom chip.

So the only remaining use case is sparse data. The expectation is that you are going to get cache misses all the time anyway, so the benefit of a large cache is negated by the fact that the same data is rarely accessed again. The problem with this idea is that sparse workloads are pretty rare. The only use case that could possibly benefit from a custom chip is sublinear machine learning (SLIDE), which basically does nothing but predict which neurons are activating and ignore everything else.

Oh, also, I am already assuming you want to tape out your chip and that the FPGA is just a stepping stone. If all you do is insist on running softcore processors with no special architecture (e.g. the Reduceron) on an FPGA, then the whole exercise is meaningless.

Hey I don't disagree. You bring up some good points, so let me address them individually:

AMD CPUs see major performance increases every generation. The 7700X is more than 100% faster than the 2700X, and that is at the same core count. Higher core counts beyond 8 cores have become much more accessible.

You can check my math, but single-threaded performance has only increased by a factor of about 3 since 2000. So a modern 8-core machine would be about 24 times faster than, say, a MIPS or PowerPC 601 at the same clock speed, back when CPUs had 4 pipeline stages, didn't suffer the kinds of cache-miss penalties we see today, and so didn't need excessively complex branch prediction. A 16-core machine would be 48 times faster. But if you look at transistor count, CPUs in 2000 had 1-10 million transistors, while today they have 1-10 billion and GPUs have 10-100 billion. So CPUs should have 1,000-10,000 times the performance, not 24 or 48. There's simply no way for traditional CPUs to scale to the level I'm talking about, which I realized in the late 90s while getting my computer engineering degree from UIUC.

https://en.wikipedia.org/wiki/Transistor_count
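
Just to make that gap explicit, here's the back-of-the-envelope arithmetic as a throwaway C snippet (the 3x single-thread figure and the ~1,000x transistor growth are my rough numbers from above, not measurements):

    #include <stdio.h>

    int main(void) {
        /* Rough figures from the argument above -- estimates, not measurements. */
        double single_thread_gain = 3.0;    /* single-thread speedup since ~2000   */
        double cores              = 8.0;    /* typical modern desktop core count   */
        double transistor_gain    = 1000.0; /* ~1-10M transistors then, ~1-10B now */

        double realized = single_thread_gain * cores;
        printf("realized speedup:  ~%.0fx\n", realized);               /* ~24x   */
        printf("transistor budget: ~%.0fx\n", transistor_gain);        /* ~1000x */
        printf("unrealized factor: ~%.0fx\n", transistor_gain / realized);
        return 0;
    }

By this admittedly crude accounting, a factor of roughly 40 of the transistor budget went somewhere other than throughput: caches, branch predictors, and out-of-order bookkeeping.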

Having thousands of dumb cores is pointless unless you want to work with sparse data. An out-of-order core isn't bottlenecked by compute; it is mostly bottlenecked by memory access latency, and also by bandwidth if you do end up using vector instructions.

I agree, which is why the chip I'm envisioning would have an array of local memories (one in or beside/above each CPU), which negates this argument. Then the problem becomes one of orchestrating those memories to appear as one coherent address space. I want to use a content-addressable memory with copy-on-write, so that the memory works like a hash tree (as in BitTorrent). A cyclic redundancy check (CRC) or similar could be used for the hashing, with a fallback to a stronger, lower-collision hash like SHA when a block clashes. This would all be encapsulated in an abstraction below the level of the code, following the same principles as cache coherence. We'd mainly use auto-parallelizing higher-order methods along the lines of map-reduce and scatter-gather arrays, within a runtime similar to Go/Erlang, Octave/MATLAB or Haskell/Julia (or even vanilla C or Lisp), to make it appear that we are programming a single synchronous-blocking thread of execution.

I've had this approach fully formed in my head since about 2010. The original idea goes back to programming games in the 90s, when Apple crippled its memory buses for cost reasons and to prevent competition between its entry-level machines and its flagship lines; that's when I saw that the memory bus is the only real bottleneck in computers today.
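
To make that concrete, here's a minimal software sketch of the kind of content-addressed block store I mean. Everything in it is a placeholder for illustration: 64-byte blocks, a tiny open-addressed table, and FNV-1a standing in for SHA so the sketch stays self-contained; the real mechanism would live below the ISA in hardware, one slice per core:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical parameters, purely for illustration. */
    #define BLOCK_SIZE 64
    #define TABLE_SIZE 1024

    typedef struct {
        int      used;
        uint32_t crc;                /* cheap first-level hash (CRC-32) */
        uint64_t strong;             /* fallback hash; would be SHA in the real design */
        uint8_t  data[BLOCK_SIZE];
    } block_t;

    static block_t table[TABLE_SIZE];

    /* Bitwise CRC-32 (IEEE polynomial): slow, but dependency-free. */
    static uint32_t crc32(const uint8_t *p, size_t n) {
        uint32_t c = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++) {
            c ^= p[i];
            for (int k = 0; k < 8; k++)
                c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : (c >> 1);
        }
        return ~c;
    }

    /* FNV-1a 64-bit stands in for SHA, just to keep this sketch self-contained. */
    static uint64_t strong_hash(const uint8_t *p, size_t n) {
        uint64_t h = 0xcbf29ce484222325ull;
        for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 0x100000001b3ull; }
        return h;
    }

    /*
     * Store a block by content. Identical contents share a slot (so a "write"
     * never mutates shared data in place); a CRC clash is resolved by the
     * stronger hash. Returns a slot index, or -1 if the toy table fills up.
     */
    static int block_store(const uint8_t *data) {
        uint32_t crc    = crc32(data, BLOCK_SIZE);
        uint64_t strong = 0;                 /* computed lazily, only on a CRC hit */
        size_t   idx    = crc % TABLE_SIZE;

        for (size_t probe = 0; probe < TABLE_SIZE; probe++, idx = (idx + 1) % TABLE_SIZE) {
            block_t *b = &table[idx];
            if (!b->used) {                  /* free slot: copy the block in */
                b->used   = 1;
                b->crc    = crc;
                b->strong = strong_hash(data, BLOCK_SIZE);
                memcpy(b->data, data, BLOCK_SIZE);
                return (int)idx;
            }
            if (b->crc == crc) {             /* CRC agrees: confirm with the stronger hash */
                if (!strong) strong = strong_hash(data, BLOCK_SIZE);
                if (b->strong == strong)
                    return (int)idx;         /* same content: share the existing block */
                /* genuine CRC clash: fall through and keep probing */
            }
        }
        return -1;
    }

    int main(void) {
        uint8_t a[BLOCK_SIZE] = "hello, content-addressed world";
        uint8_t b[BLOCK_SIZE] = "hello, content-addressed world";

        int sa = block_store(a);
        int sb = block_store(b);             /* identical contents: deduplicated  */
        b[0] = 'H';                          /* "write" to the block...           */
        int sc = block_store(b);             /* ...and it lands in a fresh slot   */

        printf("a=%d, b=%d (shared), modified=%d (copy-on-write)\n", sa, sb, sc);
        return 0;
    }

The point of the layering is that identity is decided by hashes: two cores holding the same block can share it without ever shipping the data around, and a write always lands in a fresh block rather than mutating a shared one, which is what makes the hash-tree / copy-on-write bookkeeping (and cache-coherence-style invalidation) tractable.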

You do touch on one major limitation, though: there would be a 10-100x overhead to build my design on an FPGA out of multiplexers and LUTs, then another 10-100x for the hash memory, and at least 50% of the die lost to local memories. So I might need 4 orders of magnitude more transistors to match existing performance. Then the hashing could add 10-100x the latency, putting us at 6 orders of magnitude worse performance in the worst case. I'd plan to get around that by de-prioritizing latency, optimizing the best case (roughly the no-branch-miss, no-cache-miss path), and then throwing hardware at it. It's much simpler to just increase the die size by 10-100x on a side than to develop the next architecture or the next set of VLSI design rules. So at scale, this new chip simply wouldn't face the linear speed-increase limitations that all processors today face. It might be as simple as plugging another chip into a PCI bus to get another 1000+ cores under the same hash memory, much like how we program the web with hash-id caching at the edges using stuff like CloudFlare.

So the only remaining use case is sparse data. The expectation is that you are going to get cache misses all the time anyway, so the benefit of a large cache is negated by the fact that the same data is rarely accessed again.

Exactly, which is why each core would have little or no cache. If some number of cores want to crunch away on sparse data while others run the OS, there is no conflict. Note that this strategy runs counter to just about all processors since the DEC Alpha, which I recall had 3/4 of its die area devoted to cache; it ended up there because DEC was never able to figure out multicore, and its demise coincided with the loss of R&D funding after the Dot Bomb around 2001. Note that we also lost cluster computing (remember the Beowulf cluster jokes?) because GPUs were "good enough" for the primarily graphics-driven workloads of desktop publishing and gaming. Never mind that GPUs fail for most other real-world business-logic use-cases.

Oh, also, I am already assuming you want to tape out your chip and that the FPGA is just a stepping stone. If all you do is insist on running softcore processors with no special architecture (e.g. the Reduceron) on an FPGA, then the whole exercise is meaningless.

Thank you, I hadn't heard of the Reduceron!

https://www.cs.york.ac.uk/fp/reduceron/

https://github.com/tommythorn/Reduceron

Without getting too far out into the weeds, I appreciate what they're trying to do, but wide buses won't work for that. It strikes me as perhaps an extension of SIMD and VLIW, which are exactly what I'm trying to get away from so that we can get back to ordinary desktop computing.

--

Which brings me to my main point. It sounds like you're concerned with your use-cases around machine learning. And that's great, I'm not knocking it; GPUs or AI cores for TensorFlow are fine for that.

But what I need is horsepower for unpredictable workloads. I want to explore genetic algorithms, simulated annealing, k-means clustering, and a host of algorithms beyond those that don't work well within a shader paradigm. I need to interact with system calls, and disks, and the network. I need to run multiple types of workloads simultaneously. And I can't be dropping down to intrinsics from a high-level language like Python to do that, although the rest of the world seems to enjoy that mental tax for reasons I may never understand.
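
As a toy illustration of what I mean by unpredictable, and with the caveat that this is just pthreads on an ordinary machine rather than any real many-core part: each worker below runs a small (1+1)-style mutate-and-select loop whose fitness function, a Collatz trajectory length, has completely data-dependent control flow.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N_WORKERS   8        /* stand-ins for independent cores */
    #define GENERATIONS 10000

    /* Fitness with data-dependent control flow: Collatz trajectory length. */
    static unsigned fitness(unsigned x) {
        unsigned steps = 0;
        while (x > 1 && steps < 1000) {      /* how long this runs depends on the data */
            x = (x & 1) ? 3 * x + 1 : x / 2;
            steps++;
        }
        return steps;
    }

    typedef struct { unsigned rng, best, best_fit; } worker_t;

    /* Each worker evolves its own candidate independently -- MIMD, not SIMD. */
    static void *evolve(void *arg) {
        worker_t *w = (worker_t *)arg;
        unsigned candidate = w->rng;
        for (int g = 0; g < GENERATIONS; g++) {
            unsigned mutant = candidate ^ (1u << (rand_r(&w->rng) % 20)); /* flip one bit */
            unsigned f = fitness(mutant);
            if (f > w->best_fit) {           /* keep the mutant if it scores better */
                candidate   = mutant;
                w->best     = mutant;
                w->best_fit = f;
            }
        }
        return NULL;
    }

    int main(void) {                         /* build with -pthread */
        pthread_t tid[N_WORKERS];
        worker_t  w[N_WORKERS];

        for (int i = 0; i < N_WORKERS; i++) {
            w[i] = (worker_t){ .rng = 12345u + (unsigned)i, .best = 0, .best_fit = 0 };
            pthread_create(&tid[i], NULL, evolve, &w[i]);
        }
        for (int i = 0; i < N_WORKERS; i++) {
            pthread_join(tid[i], NULL);
            printf("worker %d: best=%u, fitness=%u\n", i, w[i].best, w[i].best_fit);
        }
        return 0;
    }

Lock-stepped SIMD lanes would all wait on the slowest trajectory every generation; independent cores just run, and any of them could make a system call or switch to a different algorithm without dragging its neighbors along.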

Other than internet-distributed stuff like SETI@home, Folding@home and BOINC, or EC2 on AWS, there is simply no machine available today that can do what I need. So you're talking past me with efficiency and optimization concerns, while I'm unhoused with no meal ticket.

Now, a handful of people over the last few decades have grokked what I'm getting at, from the Transputer through more recent attempts at general-purpose MIMD, including MIMD on GPU:

https://en.wikipedia.org/wiki/Transputer

https://en.wikipedia.org/wiki/Multiple_instruction,_multiple...

https://www.microsoft.com/en-us/research/video/mimd-on-gpu/

http://aggregate.org/MOG/

Due to an almost complete lack of industry support, these projects are destined to fail.

Which is why I've all but given up on the status quo ever changing. It's like I can see an entire alternate reality where kids could build stuff like C-3PO the way Anakin did, with off-the-shelf multiprocessing hardware. But because we're focused on SIMD and going to great lengths to achieve even the slightest parallelism, we can't even build neurons in hardware. Which is really where I'm going with this: a way to emulate 100 billion neurons, where each one has the computing power of perhaps a MOS 6502 or Zilog Z80. Then let a genetic algorithm evolve that core into something that can actually think emergently with its neighbors.

Versus the million-monkeys approach we have today, where teams of scientists work frantically for decades, spending billions of dollars, to build stuff like large language models (LLMs). For all of the excitement around that, I just find myself tired and dejected, reminiscing about an alternate history that never came to be.

Anyway, sorry for the overshare. I'll just go back to my day job building CRUD apps on the web, running in place as fast as I can to make rent like Sam on Quantum Leap, never able to exit the Matrix. Maybe Gen Z will pull off what Gen X was blocked from doing at every turn. But I digress.