The 7F52 has 16MB of L3 cache per core.

I'd love to see what is possible with a tiny runtime/OS in the kilobyte range, running a microservice written in a native language on each core, with everything served out of the L3 cache.

I imagine the throughput would be amazing. Single thread per core, i.e. cooperative multitasking. For stream-oriented workflows, or even for processing data in reasonably sized chunks, it might be screaming fast!
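
To make the "single thread, cooperative multitasking" idea concrete, here is a rough sketch of what the per-core loop could look like. Everything in it (`Task`, `run_cooperative`, `make_chunked_sum`) is a made-up name for illustration, not anything from IncludeOS or a real runtime, and it assumes each task does one small chunk of work per call and then returns control voluntarily.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical task type: each call performs one small chunk of work and
// returns true once the task has finished.
using Task = std::function<bool()>;

// Round-robin cooperative scheduler for a single thread. There is no
// preemption: a task keeps the core until its call returns.
// Returns the total number of task steps executed.
std::size_t run_cooperative(std::deque<Task> tasks) {
    std::size_t steps = 0;
    while (!tasks.empty()) {
        Task task = std::move(tasks.front());
        tasks.pop_front();
        ++steps;
        if (!task()) {
            // Not finished yet: requeue at the back so other tasks get a turn.
            tasks.push_back(std::move(task));
        }
    }
    return steps;
}

// Hypothetical example task: sums `data` in chunks of `chunk` elements,
// yielding back to the scheduler after each chunk. `data` and `out` must
// outlive the task, since they are captured by reference.
Task make_chunked_sum(const std::vector<std::uint32_t>& data,
                      std::size_t chunk,
                      std::uint64_t& out) {
    assert(chunk > 0);  // a zero-sized chunk would never make progress
    return [&data, chunk, &out, pos = std::size_t{0}]() mutable {
        const std::size_t end = std::min(pos + chunk, data.size());
        for (; pos < end; ++pos) {
            out += data[pos];
        }
        return pos == data.size();
    };
}
```

You'd build a `std::deque<Task>` per core, push a few tasks into it and call `run_cooperative` from the one thread pinned to that core. The chunk size is the knob: small enough that each step's working set stays in cache, large enough that the scheduling overhead doesn't dominate. A real runtime would presumably swap `std::function` for something allocation-free, this is just the shape of it.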

A unikernel as a C++ header would be a good candidate: https://github.com/includeos/IncludeOS