I'm hoping someday there will be an embedded Linux processor with this much cache. 128 MB of on-die SRAM would mean the PCB no longer needs separate DRAM, and board routing complexity would drop with it. That much RAM ought to be enough for a lot of embedded applications.

The economics don't work out. Why would you avoid something as trivial as board routing, as cheap as $2-per-gigabyte DRAM, and as performance-enhancing as having gigabytes of main memory, just to use 128 MB of on-die (or on-package) SRAM at ~$500/GB? At that price, the 128 MB alone costs on the order of $60, versus a few dollars for multiple gigabytes of DRAM.

The main distinction between application processors that can run Linux and microcontrollers that use onboard RAM (and often Flash) is that the former have an MMU. It's attractive to imagine that your SBC might need only something as simple as the DIP-packaged ATmega on an Arduino, and I can imagine a system-on-module built that way - actually, saying that, several already exist, e.g. this i.MX6 device, a 148-pin quad-flat "SOM" with 512 MB of DDR3L and 512 MB of Flash:

https://www.seeedstudio.com/NPi-i-MX6ULL-Dev-Board-Industria...

Whether you consider that Seeed-branded metallic QFP (which obviously contains discrete DRAM, Flash, and an i.MX6) to be a single package, while a comparably-sized piece of FR4 - with a BGA each for the application processor, DRAM, and Flash, on a mezzanine or Compute Module-style SODIMM edge connector - would not satisfy your desire for an embedded Linux processor with less routing complexity, I don't know. Either way, people build SOMs precisely for those who don't want to pay for 8-layer boards and BGA fanout.

I don't think there are enough embedded applications that need 128 MB of onboard SRAM yet can't support the power budget, size, complexity, and cost of a few GB of DRAM.

> Why would you avoid something as trivial

L3 cache is more than an order of magnitude faster than going out to RAM.

You're talking about a maximum of ~50 GB/s for DDR5, versus ~1500 GB/s for L3 cache:

https://en.wikipedia.org/wiki/List_of_interface_bit_rates#Dy...

https://meterpreter.org/amd-ryzen-9-7900x-benchmark-zen-4-im...

It's a paradigm-shifting increase in processing speed when you don't need to hit RAM.
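
A minimal sketch of how you can observe this yourself, assuming Linux and gcc (the buffer sizes and the ~4 GiB read target are arbitrary illustrative choices): sweep the working-set size and measure effective read bandwidth; throughput drops in steps as the buffer spills out of L1/L2/L3 into DRAM.

    /* cache_vs_ram.c - sweep working-set size, measure read bandwidth.
       Rough illustration only; build with: gcc -O2 cache_vs_ram.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <time.h>

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        /* Working sets from 256 KiB (cache-resident) to 1 GiB (DRAM-bound). */
        for (size_t bytes = 256 * 1024; bytes <= ((size_t)1 << 30); bytes *= 4) {
            size_t n = bytes / sizeof(uint64_t);
            uint64_t *buf = malloc(bytes);
            if (!buf) return 1;
            for (size_t i = 0; i < n; i++) buf[i] = i;  /* touch every page */

            int passes = (int)(((size_t)4 << 30) / bytes);  /* ~4 GiB of reads */
            if (passes < 1) passes = 1;

            uint64_t sum = 0;
            double t0 = now_sec();
            for (int p = 0; p < passes; p++)
                for (size_t i = 0; i < n; i++)
                    sum += buf[i];
            double dt = now_sec() - t0;

            printf("%8zu KiB: %7.1f GB/s (checksum %llu)\n", bytes / 1024,
                   (double)bytes * passes / dt / 1e9, (unsigned long long)sum);
            free(buf);
        }
        return 0;
    }

Note that a sequential sum is prefetch-friendly, so this understates the gap you'd see with pointer-chasing access patterns, where DRAM latency rather than bandwidth dominates.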

+1, totally agree with that.

There is a use case where you can improve performance by keeping compressed (LZ4) data in RAM and decompressing it in small blocks that fit in cache. ClickHouse demonstrates this [1][2]: the whole data-processing step after decompression fits in cache, and the compression saves RAM bandwidth. A sketch of the pattern follows after the references.

[1] https://presentations.clickhouse.com/meetup53/optimizations/

[2] https://github.com/ClickHouse/ClickHouse
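
Not ClickHouse's actual code, but a minimal sketch of that pattern in C, assuming liblz4 is installed (the 64 KiB block size and the byte-sum "processing" step are illustrative stand-ins): only compressed blocks live in RAM, and each block is decompressed into one reused, cache-sized scratch buffer right before it's processed.

    /* lz4_blocks.c - keep data LZ4-compressed in RAM, decompress per
       cache-sized block. Sketch only; build with: gcc -O2 lz4_blocks.c -llz4 */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <lz4.h>

    #define BLOCK 65536  /* 64 KiB: small enough to stay resident in L2/L3 */

    typedef struct { char *data; int csize; int usize; } Block;

    int main(void) {
        /* Synthetic, compressible input. */
        size_t total = 16 * 1024 * 1024;
        char *input = malloc(total);
        for (size_t i = 0; i < total; i++) input[i] = (char)(i % 64);

        /* Compress block-by-block; only compressed blocks stay in RAM. */
        size_t nblocks = (total + BLOCK - 1) / BLOCK;
        Block *blocks = malloc(nblocks * sizeof(Block));
        for (size_t b = 0; b < nblocks; b++) {
            int usize = (int)(b + 1 < nblocks ? BLOCK : total - b * BLOCK);
            int bound = LZ4_compressBound(usize);
            blocks[b].data = malloc(bound);
            blocks[b].csize = LZ4_compress_default(input + b * BLOCK,
                                                   blocks[b].data, usize, bound);
            blocks[b].usize = usize;
        }
        free(input);  /* the full uncompressed copy never exists again */

        /* Process: decompress each block into one reused scratch buffer. */
        char scratch[BLOCK];
        uint64_t sum = 0;
        for (size_t b = 0; b < nblocks; b++) {
            int n = LZ4_decompress_safe(blocks[b].data, scratch,
                                        blocks[b].csize, BLOCK);
            for (int i = 0; i < n; i++)   /* stand-in for real per-block work */
                sum += (unsigned char)scratch[i];
        }
        printf("checksum: %llu\n", (unsigned long long)sum);
        for (size_t b = 0; b < nblocks; b++) free(blocks[b].data);
        free(blocks);
        return 0;
    }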