I don't think they even tried to read the ISA spec documents. If they did, they would have found that the rationale for most of these decisions is solid: evidence was considered, all the factors were weighed, and decisions were made accordingly.
But ultimately, the gist of their argument is this:
>Any task will require more RISC-V instructions than any contemporary instruction set.
Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.
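For example, something along these lines, with the obvious caveats: the sysroot paths below are hypothetical, and raw file size is a crude proxy (it includes symbols and data, not just code), so comparing `.text` sections would be more rigorous:

```c
/* Rough sketch: compare the size of the same binary across
 * per-architecture sysroots of one distribution.  The paths are
 * hypothetical; adjust them to however your distro lays things out. */
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    const char *paths[] = {
        "/srv/sysroots/riscv64/usr/bin/gzip",
        "/srv/sysroots/aarch64/usr/bin/gzip",
        "/srv/sysroots/x86_64/usr/bin/gzip",
    };
    for (unsigned i = 0; i < sizeof paths / sizeof paths[0]; i++) {
        struct stat st;
        if (stat(paths[i], &st) == 0)
            printf("%-36s %10lld bytes\n", paths[i], (long long)st.st_size);
        else
            perror(paths[i]);
    }
    return 0;
}
```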
I am sorry, but saying that RISC-V is a winner in code density is beyond ridiculous.
I am familiar with many tens of instruction sets, from the first vacuum-tube computers to all the important instruction sets still in use, and there is no doubt that RISC-V requires more instructions and a larger code size than almost all of them, for any task.
Even the hard-to-believe "research" results published by the RISC-V developers have always shown worse code density than ARM; the so-called better results were for the compressed extension, not for the normal encoding.
Moreover, the results for RISC-V are hugely influenced by the programming language and the compiler options that are chosen. RISC-V has an acceptable code size only for unsafe code: if the programming language or the compiler options require run-time checks to ensure safe behavior, the RISC-V code size increases enormously, while for other CPUs it barely changes.
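One concrete case (my example, not the poster's; the instruction sequences in the comments are typical compiler output written from memory, so treat the counts as approximate): overflow-checked signed addition.

```c
/* AArch64 has condition flags, so an overflow-checked add can compile
 * to roughly two instructions (adds; b.vs).  RV64 has no flags, so the
 * check is synthesized from comparisons, typically something like
 * add; slt; slti; bne -- about twice as many instructions.  Exact
 * sequences depend on the compiler and the surrounding code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

bool checked_add(int64_t a, int64_t b, int64_t *out) {
    /* Signed overflow occurred iff (a + b < a) differs from (b < 0). */
    return __builtin_add_overflow(a, b, out);
}

int main(void) {
    int64_t r;
    printf("%d\n", checked_add(INT64_MAX, 1, &r)); /* prints 1: overflow */
    return 0;
}
```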
The RISC-V ISA has only one good feature for code size: the combined compare-and-branch instructions. Because there is typically one branch for every 6 to 8 instructions, using one instruction instead of two saves a lot.
Apart from this one good feature, the rest of the ISA is full of bad ones, which frequently require at least two instructions where any other CPU needs one, e.g. the lack of indexed addressing, which any loop accessing an aggregate data structure needs in order to be implemented with a minimum number of instructions.
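Both points show up in the most ordinary loop. A sketch (mine; the instruction sequences in the comments are typical compiler output quoted from memory, not disassembly of any particular build):

```c
#include <stddef.h>
#include <stdio.h>

long sum(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        /* AArch64 loads with a scaled index in one instruction:
         *     ldr x3, [x0, x2, lsl #3]
         * RV64 has no indexed addressing, so the same access takes
         * three instructions (unless the compiler strength-reduces
         * the loop to a running pointer):
         *     slli t0, a2, 3
         *     add  t0, a0, t0
         *     ld   t1, 0(t0)                                      */
        s += a[i];
    }
    /* The loop back-edge is where RISC-V wins: one fused
     * compare-and-branch (bltu a2, a1, .loop) versus two
     * instructions on AArch64 (cmp x2, x1; b.lo .loop).          */
    return s;
}

int main(void) {
    long a[] = {1, 2, 3, 4};
    printf("%ld\n", sum(a, 4)); /* 10 */
    return 0;
}
```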
> the so-called better results were for the compressed extension, not for the normal encoding.
Ignoring RISC-V’s compressed encoding seems a rather artificial restriction.
The compressed encoding has good code density, but low speed.
The compressed RISC-V encoding must be compared with the ARMv8-M encoding, not with ARMv8-A.
The base 32-bit RISC-V encoding may be compared with ARMv8-A, because only it can reach comparable performance.
All the comparisons where RISC-V has better code density pit the compressed encoding against 32-bit ARMv8-A. This is a classic apples-to-oranges comparison, because the compressed encoding will never have performance in the same league as ARMv8-A.
When the comparisons are matched, 16-bit RISC-V encoding against 16-bit ARMv8-M and 32-bit RISC-V against 32-bit ARMv8-A, RISC-V loses in code density both times, because only the RISC-V branch instructions are frequently shorter than ARM's, while all the other instructions are frequently longer.
There are good reasons to use RISC-V for various purposes where either the lack of royalties or the easy customization of the instruction set is important, but claiming that it should be chosen not because it is cheaper but because it is better sounds like the story of the sour grapes.
The value of RISC-V is not in its instruction set, because there are thousands of people who could design better ISAs in a week of work.
What is valuable about RISC-V is the set of software tools: compilers, binutils, debuggers, etc. While a better ISA could be designed in a week, recreating the complete software environment would take years of work.
> The compressed encoding has good code density, but low speed.
That's 100% nonsense. They have the same performance, and in fact some pipelines can get better performance, because they fetch a fixed number of bytes per cycle and, with compressed instructions, that means more instructions fetched.
The rest of the argument rests on this fallacy and falls apart with it.
They have the same performance only in low-performance CPUs intended for embedded applications.
If you want to use RISC-V at a performance level good enough for something like a mobile phone or a personal computer, you need to decode at least 8 instructions per clock cycle, and preferably many more, because matching 8 instructions of other CPUs takes at least 10 to 12 RISC-V instructions, and sometimes many more.
Nobody has succeeded in decoding a significant number of compressed RISC-V instructions simultaneously, and it is unlikely that anyone will attempt it, because a decoder able to do that costs much more in area and power than one that decodes fixed-length instructions in parallel.
This is why ARM, too, uses a compressed encoding in its -M CPUs for embedded applications, but a 32-bit fixed-length encoding in its -A CPUs, for applications where more than 1 watt per core is available and high performance is needed.
ARM doesn't have any cores that do 8-wide decode. Neither do Intel or AMD. Apple does, but Apple is not ARM and doesn't share its designs with ARM or ARM's customers.
The Cortex-X1 and X2 have 5-wide decode. The Cortex-A78 and Neoverse N1 have 4-wide decode.
ARM uses a compressed encoding (Thumb-2) in its 32-bit A-series CPUs, for example the Cortex-A7, A15, and so on. The A15 is pretty fast, running at up to 2.5 GHz. It was used in phones such as the Galaxy S4 and Note 3, back before 64-bit became a selling point.
Several organisations are making wide RISC-V implementations. Most of them aren't disclosing what they are doing, but one has actually published details of how its 4-8 wide RISC-V decoder works: it decodes 16 bytes of code at a time, which is 4 instructions if they are all 32-bit, 8 if they are all 16-bit, and somewhere in between for a mix.
https://github.com/MoonbaseOtago/vroom
Everything is there, in the open, including the GPL-licensed SystemVerilog source code. It's not complex. The decode scheme is modular and extensible to as wide as you want, with no increase in complexity, just slightly longer latency.
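The length-finding step itself is simple in principle: RISC-V marks instruction length in the low two bits of the first 16-bit parcel (0b11 means a 32-bit instruction in the standard encodings; anything else is a 16-bit compressed one). Here is a minimal software model of scanning a 16-byte fetch group -- my sketch of the general scheme, not code taken from that repository; a hardware decoder does this speculatively at every parcel and then selects, which is why extra width costs area but adds little logic depth:

```c
/* Find instruction starts inside one 16-byte (8-parcel) fetch group.
 * RISC-V encodes length in the low two bits of the first parcel:
 * 0b11 => 32-bit instruction, anything else => 16-bit compressed. */
#include <stdint.h>
#include <stdio.h>

/* Writes byte offsets of instruction starts into starts[] (at most 8
 * for a 16-byte group) and returns how many were found. */
static int find_starts(const uint16_t parcels[8], int starts[8]) {
    int n = 0, off = 0;                  /* off counts 16-bit parcels */
    while (off < 8) {
        int len = ((parcels[off] & 0x3) == 0x3) ? 2 : 1;
        if (off + len > 8)
            break;           /* 32-bit instruction straddles the group */
        starts[n++] = off * 2;
        off += len;
    }
    return n;
}

int main(void) {
    /* Two compressed nops (0x0001), one 32-bit nop (0x00000013, shown
     * as two little-endian parcels), then four more compressed nops. */
    const uint16_t group[8] = {
        0x0001, 0x0001, 0x0013, 0x0000, 0x0001, 0x0001, 0x0001, 0x0001
    };
    int starts[8];
    int n = find_starts(group, starts);
    printf("%d instructions at byte offsets:", n);
    for (int i = 0; i < n; i++)
        printf(" %d", starts[i]);
    printf("\n");            /* 7 instructions at 0 2 4 8 10 12 14 */
    return 0;
}
```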
There are practical limits to how wide is useful, not because you can't build it, but because most code has a branch every 5 or 6 instructions on average. You can build a 20-wide machine if you want -- it just won't be any faster, because most of the code you'll be executing can't fill it.