Very interesting and useful to see.

And on an entirely different approach to vectorization for the masses: I do wish it were easier to access vectorization through BLAS, a library that is well supported across nearly all languages and heavily optimized, but hard to install correctly.

The good news is that the Gonum team has been working on an optimized pure Go version of BLAS. It's at parity with Netlib BLAS for some of the important functions (GEMV, etc.).

Why is this good news? Go is a very easy language to use, and it favours cross-compilation to different targets, which makes it available across many platforms. To install, one simply runs `go get gonum.org/v1/gonum`.

Netlib BLAS is a very low bar [1], and not at all how one should go about writing a performance-portable BLAS. BLIS (https://github.com/flame/blis/) is a much better approach, and underlies vendor implementations on AMD (https://developer.amd.com/amd-aocl/blas-library/) and many embedded systems.

[1] GEMV is entirely limited by memory bandwidth, thus quite uninteresting from a vectorization standpoint. Maybe you meant GEMM?
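
To put rough numbers on the footnote, here is a minimal back-of-the-envelope sketch in Go. It assumes float64 data and an idealized machine where every operand is moved to or from memory exactly once; the function names `gemvIntensity` and `gemmIntensity` are made up for illustration and are not part of Gonum or any BLAS.

```go
package main

import "fmt"

// bytesPerElem assumes float64 operands.
const bytesPerElem = 8

// gemvIntensity estimates flops per byte for y = alpha*A*x + beta*y,
// where A is m x n. Idealized assumption: A and x are read once,
// y is read once and written once.
func gemvIntensity(m, n int) float64 {
	fm, fn := float64(m), float64(n)
	flops := 2 * fm * fn         // one multiply and one add per matrix element
	elems := fm*fn + fn + 2*fm   // A, x, and y (read + write)
	return flops / (bytesPerElem * elems)
}

// gemmIntensity estimates flops per byte for C = alpha*A*B + beta*C,
// where A is m x k, B is k x n, and C is m x n. Idealized assumption:
// A and B are read once, C is read once and written once.
func gemmIntensity(m, n, k int) float64 {
	fm, fn, fk := float64(m), float64(n), float64(k)
	flops := 2 * fm * fn * fk
	elems := fm*fk + fk*fn + 2*fm*fn
	return flops / (bytesPerElem * elems)
}

func main() {
	for _, n := range []int{100, 1000, 4000} {
		fmt.Printf("n=%d  GEMV: %.3f flops/byte  GEMM: %.1f flops/byte\n",
			n, gemvIntensity(n, n), gemmIntensity(n, n, n))
	}
}
```

Under these assumptions GEMV stays near 0.25 flops per byte no matter how large the matrix gets, while GEMM's ratio grows linearly with the dimension (62.5 flops per byte at n = 1000). That is why GEMM, not GEMV, is where blocking and vectorization in the implementation actually show up.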