I really enjoyed this. I use a 5700 XT (RDNA 1) as one of my main development GPUs, and appreciate its particular strengths, and because of this article am quite tempted to get an RDNA 3 card.

A few observations.

First, the astonishingly low latency for LDS (aka workgroup shared memory). Vello uses this fairly extensively (especially in prefix sum / monoid scan operations, plus the stack monoid), so I'd expect performance to be quite sweet on this card. If every card had this, I'm not sure there'd still be a motivation to do subgroups. I also really liked the chart comparing shared memory latency across cards - I already knew that Intel was slow (especially Gen9 and earlier), so it was nice to see quantitative data.
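To make it concrete why LDS latency matters for these scans, here is a minimal CPU-side sketch in Rust of a workgroup-level inclusive scan over a monoid (Hillis-Steele style). It is not Vello's actual shader code; the `Monoid` trait, the `Sum` type, and `workgroup_inclusive_scan` are names I made up for illustration, and the slice stands in for workgroup shared memory. Each pass of the outer loop corresponds to one round of shared memory reads and writes separated by barriers, so the cost of a round is dominated by LDS latency.

```rust
/// A monoid: an associative `combine` with an identity element.
/// (Hypothetical trait for this sketch.)
trait Monoid: Copy {
    fn identity() -> Self;
    fn combine(self, other: Self) -> Self;
}

/// Simple example monoid: addition on u32.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Sum(u32);

impl Monoid for Sum {
    fn identity() -> Self {
        Sum(0)
    }
    fn combine(self, other: Self) -> Self {
        Sum(self.0 + other.0)
    }
}

/// CPU simulation of an inclusive scan in workgroup shared memory.
/// `shared` plays the role of the LDS array, one element per "thread".
fn workgroup_inclusive_scan<M: Monoid>(shared: &mut [M]) {
    let n = shared.len();
    let mut stride = 1;
    while stride < n {
        // On a GPU every thread reads here, then hits a barrier.
        // The snapshot models "all reads happen before any write".
        let prev: Vec<M> = shared.to_vec();
        for i in 0..n {
            // Threads without a left partner combine with the identity.
            let left = if i >= stride {
                prev[i - stride]
            } else {
                M::identity()
            };
            // Left operand first, so non-commutative monoids also work.
            shared[i] = left.combine(prev[i]);
        }
        // A second barrier would go here before the next round.
        stride *= 2;
    }
}
```

Scanning `[Sum(1), Sum(2), Sum(3), Sum(4), Sum(5)]` in place takes three rounds (strides 1, 2, 4) and leaves the running totals in the array. A workgroup of 256 threads would need eight such rounds, which is why the per-access latency adds up.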

Second, it seems really odd to me that there's chip area for dual-issue ALU, giving over 60 TFLOPS of f32, but the shader compiler only achieves single issue most of the time. I wonder if there are plans to improve performance over time through software updates.

I'd also be really curious whether using Vulkan subgroup size control would help unlock this higher performance. With that feature, the application can query what subgroup sizes are available (32 and 64 here) and explicitly choose one.
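As a rough sketch of the selection step, assuming the application has already read `minSubgroupSize` and `maxSubgroupSize` from `VkPhysicalDeviceSubgroupSizeControlProperties`: the struct and function below are hypothetical stand-ins written in plain Rust, not part of any Vulkan binding. The chosen value would then be passed as `requiredSubgroupSize` in `VkPipelineShaderStageRequiredSubgroupSizeCreateInfo` when creating the compute pipeline.

```rust
/// Hypothetical mirror of the two range fields reported by
/// VkPhysicalDeviceSubgroupSizeControlProperties.
#[derive(Clone, Copy, Debug)]
struct SubgroupSizeRange {
    min_subgroup_size: u32,
    max_subgroup_size: u32,
}

/// Returns `preferred` if it is a valid required subgroup size for this
/// device (a power of two within the reported range), otherwise `None`,
/// in which case the caller would leave the choice to the driver.
fn pick_required_subgroup_size(range: SubgroupSizeRange, preferred: u32) -> Option<u32> {
    let in_range = preferred >= range.min_subgroup_size
        && preferred <= range.max_subgroup_size;
    if preferred.is_power_of_two() && in_range {
        Some(preferred)
    } else {
        None
    }
}
```

With a reported range of 32 to 64, asking for 32 or 64 succeeds, while asking for 16 or 48 returns `None`. Whether forcing one size or the other actually gets the compiler to dual-issue more often is exactly the thing I'd want to measure.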

I'm really glad this kind of microbenchmarking is being done. We need more of it.