Maybe this will need C based highly optimized backend as well, however, the code looks very clean and beautiful to me. I hope someday differentiable programming becomes possible in common lisp.

BLAS/LAPACK is fortran, not C.

BLAS is a set of subroutine interfaces that originated in Fortran, but can be implemented in any language; most mainstream implementations use one or more of Fortran, C, C++, assembly, and DSLs.

It's my impression that common implementations of BLAS have fortran at the core because fortran (moreso than C) lends itself to automatic vectorization.

BLAS inner loops are usually explicitly hand-vectorized, so Fortran’s autovectorization advantages don’t help. I’ve written many BLAS kernels in assembly over the years.

Not arguing with that, but I think the jury is out on whether they need to be hand vectorized. With recent GCC, generic C for BLIS' DGEMM gives about 2/3 the performance of the hand-coded version on Haswell, and it may be somewhat pessimized by hand-unrolling rather than letting the compiler do it. The remaining difference is thought to be mainly from prefetching, but I haven't verified that. (Details are scattered in the BLIS issue tracker earlier this year.)

For information of anyone who doesn't know about performance of level 3 BLAS: It doesn't come just from vectorization, but also cache use with levels of blocking (and prefetch). See the material under https://github.com/flame/blis. Other levels -- not matrix-matrix -- are less amenable to fancy optimization, with lower arithmetic density to pit against memory bandwidth, and BLIS mostly hasn't bothered with them, though OpenBLAS has.