I’ve been working on an early design of a high-performance dynamic binary translator that cannot JIT, and have reached a conclusion very similar to the author’s. We have an existing threaded interpreter, but it’s a mess of hard-to-maintain assembly for two architectures, and we run into funny issues all the time where the two diverge. Plus, since it was handwritten by people who are not scheduling experts, there is probably some performance left on the table, both because of our poor choices and because the design makes it difficult to write complex-but-more-performant code. Nobody wants to write an efficient hash for TLB lookups in a software MMU using GAS macros.
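
To make that last point concrete, here's the flavor of thing I mean. This is a made-up sketch (the sizes, the hash, and the struct layout are purely for illustration), not our actual MMU, but even something this simple is miserable to express as assembler macros:

    #include <stddef.h>
    #include <stdint.h>

    #define TLB_BITS 10
    #define TLB_SIZE (1u << TLB_BITS)
    #define PAGE_SHIFT 12

    struct tlb_entry {
        uint64_t tag;        /* guest page number, ~0ull when empty */
        uint8_t *host_base;  /* host pointer to the start of that guest page */
    };

    static inline void *tlb_lookup(struct tlb_entry tlb[TLB_SIZE], uint64_t guest_addr) {
        uint64_t page = guest_addr >> PAGE_SHIFT;
        /* simple multiplicative hash; in GAS macros even this turns into a
         * pile of scratch-register bookkeeping spread across two ports */
        size_t idx = (page * 0x9E3779B97F4A7C15ull) >> (64 - TLB_BITS);
        struct tlb_entry *e = &tlb[idx];
        if (e->tag == page)
            return e->host_base + (guest_addr & ((1u << PAGE_SHIFT) - 1));
        return NULL;  /* slow path: walk the guest page tables and refill */
    }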

The core point I’ve identified is that existing compilers are pretty good at converting high-level descriptions of operations into architecture-specific code (at least, better than we are, given the number of instructions we have to implement) but absolutely awful at register selection and at dealing with the open control flow that an interpreter needs. Writing everything in assembly lets you handle those two things, but you miss out on all the processor-specific knowledge that LLVM has encoded in TableGen.
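
By "open control flow" I mean dispatch where any handler can jump to any other handler, e.g. the classic computed-goto pattern. Toy sketch using the GCC/Clang labels-as-values extension, not our real bytecode:

    #include <stdint.h>

    int64_t run(const uint8_t *code, int64_t acc) {
        /* bytecode: opcode byte, then an immediate byte for add/sub */
        static void *dispatch[] = { &&op_add, &&op_sub, &&op_halt };
        const uint8_t *pc = code;
        goto *dispatch[*pc++];

    op_add:
        acc += *pc++;
        goto *dispatch[*pc++];
    op_sub:
        acc -= *pc++;
        goto *dispatch[*pc++];
    op_halt:
        return acc;
    }

For example, run((const uint8_t[]){0, 5, 1, 2, 2}, 0) computes 0 + 5 - 2 and returns 3. Every goto * has every handler as a possible successor, so the allocator has to pick one assignment that works across all of them, and in my experience that is exactly where it starts shuffling and spilling.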

Anyways, the current plan is that we’re going to generate LLVM IR for each case and run it through a custom calling convention to take that load off the compiler, similar to what the author did here. There’s a lot more that I’m handwaving over that’s still going to be work, like whether we can automate translating the semantics of each instruction into code, how we plan to pin registers, and how we plan to perform further optimizations on top of what the compiler spits out, but I think this is going to be the new way that people write interpreters. Nobody needs another bespoke macro assembler for every interpreter :)
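
If it helps to picture it, the shape we're aiming for looks roughly like this at the C level. Very much a sketch: the handler table, the encoding (opcode in the top byte of a fixed 32-bit word), and Clang's musttail are stand-ins for the generated IR and the custom calling convention:

    #include <stdint.h>

    struct cpu;  /* guest register file, flags, software-MMU state, ... */

    typedef void handler_fn(struct cpu *cpu, const uint32_t *pc);

    extern handler_fn *table[256];  /* hypothetical: one handler per opcode byte */

    void emulate_add(struct cpu *cpu, const uint32_t *pc) {
        /* ...instruction semantics, ideally generated from a high-level
         * description rather than handwritten... */
        pc++;
        /* musttail guarantees a real tail call, so cpu/pc stay in argument
         * registers from handler to handler; a custom calling convention at
         * the IR level would widen the set of pinned registers beyond that. */
        __attribute__((musttail)) return table[*pc >> 24](cpu, pc);
    }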

Great summary, this matches my experience. For straight-line code, modern C compilers can't be beat. But when it comes to register allocation, they constantly make decisions that are real head-scratchers.

One of the biggest problems is when cold paths compromise the efficiency of hot paths. You would hope that __builtin_expect() would help, but from what I can tell it has no direct impact on register allocation. I wish the compiler would use that information to guarantee that cold paths never compromise the register allocation of hot paths, but I constantly see register shuffles or spills on the hot path that exist only for the benefit of a cold path.
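
A toy example of the pattern (made-up names; the real cases are interpreter loops, but the effect is the same):

    #include <stddef.h>
    #include <stdint.h>

    void report_overflow(uint64_t acc, size_t i);  /* cold, out of line */

    uint64_t sum(const uint32_t *v, size_t n) {
        uint64_t acc = 0;
        for (size_t i = 0; i < n; i++) {
            acc += v[i];
            /* marked unlikely, yet the mere presence of this call tends to
             * push v/n/acc/i toward callee-saved registers or the stack,
             * costing the hot path moves it would not otherwise need */
            if (__builtin_expect(acc > UINT64_MAX / 2, 0))
                report_overflow(acc, i);
        }
        return acc;
    }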

Is there anywhere I can follow your work? I am very interested in keeping track of the state of the art.

Yeah, I did a quick check in LLVM at some point to see what it does (the query I relied on: https://github.com/llvm/llvm-project/search?q=getPredictable...) and all the results seemed to be code motion or decisions about how to lower a branch. Similarly, cold-path outlining seemed to just split the function in a fairly simple way rather than doing anything beyond that. Perhaps I missed something, but I think the current hints only help the branch predictor or the instruction cache rather than significantly altering codegen.
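
For what it's worth, the closest workaround I know of is to outline the cold path by hand and give it a callee-saves-heavy calling convention so the call can't disturb the hot path's registers. Roughly like this (hypothetical names, not code from either project; preserve_most is Clang-only, on x86-64 and AArch64):

    #include <stdint.h>

    __attribute__((preserve_most, noinline, cold))
    void *tlb_refill_slow(uint64_t guest_addr);

    void *translate(uint64_t *tags, void **vals, uint64_t guest_addr) {
        uint64_t page = guest_addr >> 12;
        uint64_t idx = page & 1023;
        if (__builtin_expect(tags[idx] == page, 1))
            return (char *)vals[idx] + (guest_addr & 0xfff);  /* hot path */
        /* preserve_most makes the callee save nearly every register, so this
         * call no longer forces the caller to shuffle its live values */
        return tlb_refill_slow(guest_addr);
    }

It helps, but it's still manual, per-call-site work; what I'd really like is for the compiler to do this automatically whenever it decides a path is cold.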

Unfortunately, I don't have much to share at the moment besides my thoughts; I've done a few small tests but haven't been able to really do a full implementation yet. The primary consumer of this work would be iSH (https://github.com/ish-app/ish), which has a need for a fast interpreter, so you can at least take a look at the current implementation to see what we'd like to replace. The nature of the project means that most of my time has been tied up in things like making sure that keyboard avoidance is set up correctly and that users can customize the background color of their terminal :/

With that said, I'd be happy to chat more if you'd like; feel free to send me an email or whatever. Not sure I can say I'm at the state of the art yet, but perhaps we can get there :)