The x86 decoders as built consume a reasonable amount of power; the trouble is making them wider without that power cost ballooning.

I have an AMD CPU. Zen CPUs come with a fairly wide backend, but the frontend is what it is (especially on early Zen), and without SMT it's essentially impossible to keep all those execution units fed. It's not that 8 x86 decoders wouldn't be a benefit; it's that adding decoders isn't cheap on x86, and each extra decoder is a serious cost.

If you compare with the big ARM cores, having a wide frontend is not a complex research problem or an impractical cost. 8-wide ARM decode is completely practical. You even have open-source superscalar RISC-V cores publicly available on GitHub running on FPGAs with 8-wide decode. Large frontends are (relatively) cheap and easy, if you're not x86.

So when we notice that the narrower x86 CPU's decode doesn't consume that much (a "drop in the ocean"), that's because it was designed narrower to keep the PPA reasonable! The reason I can't feed my Zen backend isn't that a wide frontend is useless and I should just enable SMT anyway; it's that x86 makes wide decode much less practical than competing architectures.
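To make that concrete, here's a toy sketch (Python, purely illustrative, not any real decoder's logic) of why fixed-length decode parallelizes trivially while x86-style variable-length decode doesn't: with fixed 4-byte instructions every decode slot's offset is a constant, but with variable-length instructions each offset depends on the length of the instruction before it, and on x86 that length only falls out of a fairly involved decode of prefixes, opcode, ModRM and SIB bytes.

    # Toy model of the frontend asymmetry; not any real decoder's logic.
    FETCH_BYTES = 32  # illustrative fetch-block size

    def fixed_width_slots(width=8):
        # Fixed 4-byte instructions (AArch64-style): every slot's offset is a
        # constant, so all decoders can start in parallel.
        return [4 * i for i in range(width)]

    def variable_length_slots(length_at, width=8):
        # x86-style: slot i's offset is only known after slot i-1's length is
        # decoded, so boundary finding is a serial chain unless you spend
        # hardware to break it. `length_at(offset)` stands in for that
        # length decode.
        offsets, off = [], 0
        for _ in range(width):
            if off >= FETCH_BYTES:
                break
            offsets.append(off)
            off += length_at(off)
        return offsets

    print(fixed_width_slots())  # [0, 4, 8, 12, 16, 20, 24, 28]
    print(variable_length_slots(lambda off: [3, 5, 2, 7, 1, 4, 6, 2][off % 8]))

Real x86 frontends do pay for ways around that serial chain (predecode/marker bits in the instruction cache, uop caches), which is exactly the "each extra decoder is a serious cost" part.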

>You even have open-source superscalar RISC-V cores publicly available on GitHub running on FPGAs with 8-wide decode. Large frontends are (relatively) cheap and easy, if you're not x86.

Which one? I know BOOM can technically go eight wide insofar as it's parametrizable, but I suspect any BOOM backend that could support that much throughput would be a nightmare to instantiate on nearly any FPGA.

I had VROOM! in mind (https://github.com/MoonbaseOtago/vroom) because I remembered it aims for an average of 4 IPC with a width of 8. Though looking again, that's 8 compressed 16-bit instructions or 4 uncompressed 32-bit instructions.

So you could argue that a real mix of instructions won't be all 16-bit but some 16 and some 32, so 8 is rarely achieved in practice, and also that the block diagram only shows 4 decode blocks. But it can in fact peak at 8 instructions decoded per clock, so I'll call that 8-wide decode.

(You could even argue it's especially impressive, since RISC-V technically qualifies as a variable-length encoding like x86; it's just that only the 16- and 32-bit instruction encodings are really in use at the moment.)
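For what it's worth, the 16/32 split is about as cheap as variable length gets: the low two bits of the first 16-bit parcel tell you whether the instruction is 16 or 32 bits, so sizing a whole fetch window is a couple of gates per parcel. Rough sketch below; the 16-byte window and the 8-instruction cap are taken from the "8 compressed or 4 uncompressed" figure above, not from VROOM!'s actual RTL:

    # Simplified RISC-V length rule: low two bits != 0b11 means a 16-bit
    # compressed instruction, == 0b11 means 32-bit (ignoring the reserved
    # longer formats, which as noted above aren't really in use).
    def insn_bytes(parcel16):
        return 4 if (parcel16 & 0b11) == 0b11 else 2

    def decoded_per_window(parcels, window_bytes=16, max_decode=8):
        # Count whole instructions fitting in one fetch window. Window size
        # and decode cap are illustrative, per the figures quoted above.
        count, used, i = 0, 0, 0
        while i < len(parcels) and count < max_decode:
            size = insn_bytes(parcels[i])
            if used + size > window_bytes:
                break
            used += size
            i += size // 2   # a 32-bit instruction spans two 16-bit parcels
            count += 1
        return count

    compressed = [0x0001] * 8                   # all 16-bit parcels
    mixed      = [0x0001, 0x0003, 0x0000] * 4   # alternating 16-bit / 32-bit
    print(decoded_per_window(compressed))  # 8 -> the peak
    print(decoded_per_window(mixed))       # 5 -> a more realistic mix

Which lines up with the point above: the 8-wide peak needs a run of compressed instructions, while a realistic mix lands closer to 5 or 6 per window.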