Finally! I've waited 25 years for something like this, and the wait totally derailed the career I had envisioned for myself. I might even lose the option of making fun of all new chips for providing marginal performance increases at exponentially growing complexity, and have to find a new shtick.

A few predictions/concerns:

* I can't find a price, so I'm predicting it will be over $1,000, which will keep it from going mainstream (and also make it ~10 times more expensive than it should be).

* There will be poison pill(s). Maybe flash memory that can only be written 1000 times, preventing its use for evolutionary hardware and genetic algorithms. Maybe the place-and-route software won't be good enough to prevent short circuits, so certain configurations will burn up. Maybe some aspect of the software will be proprietary and/or encumbered by patents, preventing hobbyists from thinking outside the box and "getting real work done".

* Any dedicated hardware like memories, ALUs, etc. may be misaligned for various use-cases. I just want an array of RISC-V, Arm, DEC Alpha, PowerPC 601/603, something like that, starting with 2 cores and eventually topping out at 100 or 1,000. And right where I'll need memories near the CPUs, something in the FPGA will lack the interconnect to allow that.

I hope I'm wrong about any or all of these. Price I can live with, as long as economies of scale or competition kick in and eventually deliver something under $1,000. The rest of it... eh, I'm not holding my breath. I've been underwhelmed by all previous FPGAs, but maybe they didn't count. Maybe this is something new that finally manifests the original vision of what FPGAs could be.

AMD CPUs see major performance increases every generation. The 7700X is more than 100% faster than the 2700X, and that is at the same core count. Higher core counts beyond 8 cores have become much more accessible.

Having thousands of dumb cores is pointless unless you want to work with sparse data. An out-of-order core isn't bottlenecked by compute; it is mostly bottlenecked by memory access latency, and also by bandwidth if you do end up using vector instructions. This means most of your core will be memory, and your thousand-core chip will turn back into a dozen-core chip. If you need a dumb accelerator, then GPUs already exist and you don't need a custom chip.

So the only remaining use case is sparse data. The expectation is that you are going to get cache misses all the time anyway, so the benefit of a large cache is negated by the fact that the same data is rarely accessed again. The problem with this idea is that sparse workloads are pretty rare. The only use case that could possibly benefit from a custom chip is sublinear machine learning (SLIDE), which basically does nothing but predict which neurons are activating and ignore everything else.

Oh, also, I am already assuming you want to tape out your chip and that the FPGA is just a stepping stone. If all you do is insist on running softcore processors with no special architecture (e.g. the Reduceron) on an FPGA, then the whole exercise is meaningless.

Hey I don't disagree. You bring up some good points, so let me address them individually:

AMD CPUs see major performance increases every generation. The 7700X is more than 100% faster than the 2700X, and that is at the same core count. Higher core counts beyond 8 cores have become much more accessible.

You can check my math, but single-threaded performance has only increased by a factor of about 3 since 2000. So a modern 8-core machine would be about 24 times faster than, say, a MIPS or PowerPC 601 at the same clock speed, back when CPUs had 4 pipeline stages, didn't suffer the kinds of cache-miss penalties we see today, and so didn't need excessively complex branch prediction. A 16-core machine would be 48 times faster. But if you look at transistor count, CPUs in 2000 had 1-10 million transistors, while today they have 1-10 billion and GPUs have 10-100 billion. So CPUs should have 1,000-10,000 times the performance, not 24 or 48. There's simply no way for traditional CPUs to scale to the level I'm talking about, which I realized in the late 90s while getting my computer engineering degree from UIUC.

https://en.wikipedia.org/wiki/Transistor_count
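
Just to make that gap explicit, here's the back-of-the-envelope arithmetic as a throwaway C snippet (the 3x single-thread figure and the ~1,000x transistor growth are my rough numbers from above, not measurements):

    #include <stdio.h>

    int main(void) {
        /* Rough figures from the argument above -- estimates, not measurements. */
        double single_thread_gain = 3.0;    /* single-thread speedup since ~2000   */
        double cores              = 8.0;    /* typical modern desktop core count   */
        double transistor_gain    = 1000.0; /* ~1-10M transistors then, ~1-10B now */

        double realized = single_thread_gain * cores;
        printf("realized speedup:  ~%.0fx\n", realized);               /* ~24x   */
        printf("transistor budget: ~%.0fx\n", transistor_gain);        /* ~1000x */
        printf("unrealized factor: ~%.0fx\n", transistor_gain / realized);
        return 0;
    }

By this admittedly crude accounting, a factor of roughly 40 of the transistor budget went somewhere other than throughput: caches, branch predictors, and out-of-order bookkeeping.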

Having thousands of dumb cores is pointless unless you want to work with sparse data. An out-of-order core isn't bottlenecked by compute; it is mostly bottlenecked by memory access latency, and also by bandwidth if you do end up using vector instructions.

I agree, which is why the chip I'm envisioning would have an array of local memories (one in or beside/above each CPU), which negates this argument. Then the problem becomes one of orchestrating those memories to appear as one coherent address space. I want to use a content-addressable memory with copy-on-write, so that the memory works like a hash tree (as in BitTorrent). A cyclic redundancy check (CRC) or similar could be used for the hashing, with a fallback to a stronger, lower-collision hash like SHA when a block clashes. This would all be encapsulated in an abstraction below the level of the code, following the same principles as cache coherence. We'd mainly use auto-parallelizing higher-order methods along the lines of map-reduce and scatter-gather arrays, within a runtime similar to Go/Erlang, Octave/MATLAB or Haskell/Julia (or even vanilla C or Lisp), to make it appear that we are programming a single synchronous-blocking thread of execution.

I've had this approach fully formed in my head since about 2010. The original idea goes back to programming games in the 90s, when Apple crippled its memory buses for cost reasons and to prevent competition between its entry-level machines and its flagship lines; that's when I saw that the memory bus is the only real bottleneck in computers today.
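
To make that concrete, here's a minimal software sketch of the kind of content-addressed block store I mean. Everything in it is a placeholder for illustration: 64-byte blocks, a tiny open-addressed table, and FNV-1a standing in for SHA so the sketch stays self-contained; the real mechanism would live below the ISA in hardware, one slice per core:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical parameters, purely for illustration. */
    #define BLOCK_SIZE 64
    #define TABLE_SIZE 1024

    typedef struct {
        int      used;
        uint32_t crc;                /* cheap first-level hash (CRC-32) */
        uint64_t strong;             /* fallback hash; would be SHA in the real design */
        uint8_t  data[BLOCK_SIZE];
    } block_t;

    static block_t table[TABLE_SIZE];

    /* Bitwise CRC-32 (IEEE polynomial): slow, but dependency-free. */
    static uint32_t crc32(const uint8_t *p, size_t n) {
        uint32_t c = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++) {
            c ^= p[i];
            for (int k = 0; k < 8; k++)
                c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : (c >> 1);
        }
        return ~c;
    }

    /* FNV-1a 64-bit stands in for SHA, just to keep this sketch self-contained. */
    static uint64_t strong_hash(const uint8_t *p, size_t n) {
        uint64_t h = 0xcbf29ce484222325ull;
        for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 0x100000001b3ull; }
        return h;
    }

    /*
     * Store a block by content. Identical contents share a slot (so a "write"
     * never mutates shared data in place); a CRC clash is resolved by the
     * stronger hash. Returns a slot index, or -1 if the toy table fills up.
     */
    static int block_store(const uint8_t *data) {
        uint32_t crc    = crc32(data, BLOCK_SIZE);
        uint64_t strong = 0;                 /* computed lazily, only on a CRC hit */
        size_t   idx    = crc % TABLE_SIZE;

        for (size_t probe = 0; probe < TABLE_SIZE; probe++, idx = (idx + 1) % TABLE_SIZE) {
            block_t *b = &table[idx];
            if (!b->used) {                  /* free slot: copy the block in */
                b->used   = 1;
                b->crc    = crc;
                b->strong = strong_hash(data, BLOCK_SIZE);
                memcpy(b->data, data, BLOCK_SIZE);
                return (int)idx;
            }
            if (b->crc == crc) {             /* CRC agrees: confirm with the stronger hash */
                if (!strong) strong = strong_hash(data, BLOCK_SIZE);
                if (b->strong == strong)
                    return (int)idx;         /* same content: share the existing block */
                /* genuine CRC clash: fall through and keep probing */
            }
        }
        return -1;
    }

    int main(void) {
        uint8_t a[BLOCK_SIZE] = "hello, content-addressed world";
        uint8_t b[BLOCK_SIZE] = "hello, content-addressed world";

        int sa = block_store(a);
        int sb = block_store(b);             /* identical contents: deduplicated  */
        b[0] = 'H';                          /* "write" to the block...           */
        int sc = block_store(b);             /* ...and it lands in a fresh slot   */

        printf("a=%d, b=%d (shared), modified=%d (copy-on-write)\n", sa, sb, sc);
        return 0;
    }

The point of the layering is that identity is decided by hashes: two cores holding the same block can share it without ever shipping the data around, and a write always lands in a fresh block rather than mutating a shared one, which is what makes the hash-tree / copy-on-write bookkeeping (and cache-coherence-style invalidation) tractable.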

You do touch on one major limitation, though: there would be a 10-100x overhead to build my design on an FPGA out of multiplexers and LUTs, then another 10-100x for the hash memory, and at least 50% of the die lost to local memories. So I might need 4 orders of magnitude more transistors to match existing performance. Then the hashing could add 10-100x the latency, putting us at 6 orders of magnitude worse performance in the worst case. I'd plan to get around that by de-prioritizing latency, optimizing the best case (roughly the no-branch-miss, no-cache-miss path), and then throwing hardware at it. It's much simpler to just increase the die size by 10-100x on a side than to develop the next architecture or the next set of VLSI design rules. So at scale, this new chip simply wouldn't face the linear speed-increase limitations that all processors today face. It might be as simple as plugging another chip into a PCI bus to get another 1000+ cores under the same hash memory, much like how we program the web with hash-id caching at the edges using stuff like CloudFlare.

So the only remaining use case is sparse data. The expectation is that you are going to get cache misses all the time anyway, so the benefit of a large cache is negated by the fact that the same data is rarely accessed again.

Exactly, which is why each core would have little or no cache. If some number of cores want to crunch away on sparse data while others run the OS, there is no conflict. Note that this strategy runs counter to just about all processors since the DEC Alpha, which I recall had 3/4 of its die area devoted to cache; it ended up there because DEC was never able to figure out multicore, and its demise coincided with the loss of R&D funding after the Dot Bomb around 2001. Note that we also lost cluster computing (remember the Beowulf cluster jokes?) because GPUs were "good enough" for the primarily graphics-driven workloads of desktop publishing and gaming. Never mind that GPUs fail for most other real-world business-logic use-cases.

Oh, also, I am already assuming you want to tape out your chip and that the FPGA is just a stepping stone. If all you do is insist on running softcore processors with no special architecture (e.g. the Reduceron) on an FPGA, then the whole exercise is meaningless.

Thank you, I hadn't heard of the Reduceron!

https://www.cs.york.ac.uk/fp/reduceron/

https://github.com/tommythorn/Reduceron

Without getting too far out into the weeds, I appreciate what they're trying to do, but wide buses won't work for that. It strikes me as perhaps an extension of SIMD and VLIW, which are exactly what I'm trying to get away from so that we can get back to ordinary desktop computing.

--

Which brings me to my main point. It sounds like you're concerned with your use-cases around machine learning. And that's great, I'm not knocking it; GPUs or AI cores for TensorFlow are fine for that.

But what I need is horsepower for unpredictable workloads. I want to explore genetic algorithms, simulated annealing, k-means clustering, and a host of algorithms beyond those that don't work well within a shader paradigm. I need to interact with system calls, and disks, and the network. I need to run multiple types of workloads simultaneously. And I can't be dropping down to intrinsics from a high-level language like Python to do that, although the rest of the world seems to enjoy that mental tax for reasons I may never understand.
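
As a toy illustration of what I mean by unpredictable, and with the caveat that this is just pthreads on an ordinary machine rather than any real many-core part: each worker below runs a small (1+1)-style mutate-and-select loop whose fitness function, a Collatz trajectory length, has completely data-dependent control flow.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N_WORKERS   8        /* stand-ins for independent cores */
    #define GENERATIONS 10000

    /* Fitness with data-dependent control flow: Collatz trajectory length. */
    static unsigned fitness(unsigned x) {
        unsigned steps = 0;
        while (x > 1 && steps < 1000) {      /* how long this runs depends on the data */
            x = (x & 1) ? 3 * x + 1 : x / 2;
            steps++;
        }
        return steps;
    }

    typedef struct { unsigned rng, best, best_fit; } worker_t;

    /* Each worker evolves its own candidate independently -- MIMD, not SIMD. */
    static void *evolve(void *arg) {
        worker_t *w = (worker_t *)arg;
        unsigned candidate = w->rng;
        for (int g = 0; g < GENERATIONS; g++) {
            unsigned mutant = candidate ^ (1u << (rand_r(&w->rng) % 20)); /* flip one bit */
            unsigned f = fitness(mutant);
            if (f > w->best_fit) {           /* keep the mutant if it scores better */
                candidate   = mutant;
                w->best     = mutant;
                w->best_fit = f;
            }
        }
        return NULL;
    }

    int main(void) {                         /* build with -pthread */
        pthread_t tid[N_WORKERS];
        worker_t  w[N_WORKERS];

        for (int i = 0; i < N_WORKERS; i++) {
            w[i] = (worker_t){ .rng = 12345u + (unsigned)i, .best = 0, .best_fit = 0 };
            pthread_create(&tid[i], NULL, evolve, &w[i]);
        }
        for (int i = 0; i < N_WORKERS; i++) {
            pthread_join(tid[i], NULL);
            printf("worker %d: best=%u, fitness=%u\n", i, w[i].best, w[i].best_fit);
        }
        return 0;
    }

Lock-stepped SIMD lanes would all wait on the slowest trajectory every generation; independent cores just run, and any of them could make a system call or switch to a different algorithm without dragging its neighbors along.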

Other than internet-distributed stuff like SETI@home, Folding@home and BOINC, or EC2 on AWS, there is simply no machine available today that can do what I need. So you're talking past me with efficiency and optimization concerns, while I'm unhoused with no meal ticket.

Now, a handful of people over the last few decades have grokked what I'm getting at, from the Transputer through more recent attempts at general-purpose MIMD, including MIMD on GPU:

https://en.wikipedia.org/wiki/Transputer

https://en.wikipedia.org/wiki/Multiple_instruction,_multiple...

https://www.microsoft.com/en-us/research/video/mimd-on-gpu/

http://aggregate.org/MOG/

Due to an almost complete lack of industry support, these projects are destined to fail.

Which is why I've all but given up on the status quo ever changing. It's like I can see an entire alternate reality where kids could build stuff like C-3PO the way Anakin did, with off-the-shelf multiprocessing hardware. But because we're focused on SIMD and going to great lengths to achieve even the slightest parallelism, we can't even build neurons in hardware. Which is really where I'm going with this: a way to emulate 100 billion neurons, where each one has the computing power of perhaps a MOS 6502 or Zilog Z80. Then let a genetic algorithm evolve that core into something that can actually think emergently with its neighbors.

Versus the million-monkeys approach we have today, where teams of scientists work frantically for decades, spending billions of dollars, to build stuff like large language models (LLMs). For all of the excitement around that, I just find myself tired and dejected, reminiscing about an alternate history that never came to be.

Anyway, sorry for the overshare. I'll just go back to my day job building CRUD apps on the web, running in place as fast as I can to make rent like Sam on Quantum Leap, never able to exit the Matrix. Maybe Gen Z will pull off what Gen X was blocked from doing at every turn. But I digress.