Maybe this will need a highly optimized C-based backend as well; still, the code looks very clean and beautiful to me. I hope differentiable programming becomes possible in Common Lisp someday.
BLAS/LAPACK is Fortran, not C.
BLAS is a set of subroutine interfaces that originated in Fortran, but can be implemented in any language; most mainstream implementations use one or more of Fortran, C, C++, assembly, and DSLs.
It's my impression that common implementations of BLAS have Fortran at the core because Fortran (more so than C) lends itself to automatic vectorization.
BLAS inner loops are usually explicitly hand-vectorized, so Fortran’s autovectorization advantages don’t help. I’ve written many BLAS kernels in assembly over the years.
For anyone who doesn't know about the performance of level 3 BLAS: it comes not just from vectorization, but also from cache use through multiple levels of blocking (and prefetching). See the material under https://github.com/flame/blis. The other levels (those that are not matrix-matrix) are less amenable to fancy optimization, since they have lower arithmetic density to pit against memory bandwidth, and BLIS mostly hasn't bothered with them, though OpenBLAS has.
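To make the blocking and arithmetic-density points concrete, here is a rough sketch in plain Python. It is only an illustration, not how BLIS or OpenBLAS are written: real kernels also pack panels into contiguous buffers, use a hand-vectorized register-level microkernel, and prefetch. The function names and the default block size are made up for this example, and the flop and element counts are the usual back-of-envelope figures for square operands of size n.

```python
def matmul_naive(a, b):
    """Textbook triple loop: C = A * B for lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block=4):
    """Same product, computed tile by tile.

    Each (i0, p0, j0) step touches only a block x block tile of A, B and C,
    so with a suitable block size the working set can stay in cache while
    it is reused. `block` is a hypothetical tuning parameter.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for p0 in range(0, k, block):
            for j0 in range(0, m, block):
                # Small dense update on one tile: C_tile += A_tile * B_tile
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


def flops_per_element(kind, n):
    """Rough flops per operand element touched, for square size n."""
    if kind == "axpy":   # level 1, y = alpha*x + y: 2n flops, 2n elements
        return 2 * n / (2 * n)
    if kind == "gemv":   # level 2, y = A*x + y: 2n^2 flops, n^2 + 2n elements
        return 2 * n * n / (n * n + 2 * n)
    if kind == "gemm":   # level 3, C = A*B + C: 2n^3 flops, 3n^2 elements
        return 2 * n ** 3 / (3 * n * n)
    raise ValueError(kind)
```

The ratio is the point: for axpy and gemv it stays at or below about 2 flops per element no matter how large n gets, so those routines are limited by memory bandwidth, while for gemm it grows like 2n/3, which is what gives blocking something to exploit.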