What does HackerNews think of blis?

Column Vectors vs. Row Vectors | Oct 2022

Here's BLIS's object API:

https://github.com/flame/blis/blob/master/docs/BLISObjectAPI...

Searching "object" in BLIS's README (https://github.com/flame/blis) to see what they think of it:

"Objects are relatively lightweight structs and passed by address, which helps tame function calling overhead."

"This API abstracts away properties of vectors and matrices within obj_t structs that can be queried with accessor functions. Many developers and experts prefer this API over the typed API."

In my opinion, this API is a strict improvement over BLAS. I do not think there is any reason to prefer the old BLAS-style API over an improvement like this.

Regarding your own experience, it's great that using BLAS proved to be a valuable learning experience for you. But your argument that the BLAS API is somehow uniquely helpful for learning how to program numerical algorithms efficiently, or that it will help you avoid performance problems, doesn't hold. It is possible to replace the BLAS API with a more modern and intuitive API that has the same benefits. To be clear, the benefits here are direct memory management and control of striding and matrix layout, which create opportunities for optimization. There is nothing unique about BLAS in this regard: it's possible to learn these lessons using any of the other listed options if you're paying attention and being systematic.
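
To make "control of striding and matrix layout" concrete, here is a minimal C sketch of a matrix view that carries explicit row and column strides. The names (matview, mv_at, mv_transpose and so on) are invented for illustration; this is not BLIS's obj_t or any part of its API, only the same idea in miniature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view type, NOT BLIS's obj_t. */
typedef struct {
    double *data;   /* first element */
    size_t  m, n;   /* rows, columns */
    size_t  rs, cs; /* row stride and column stride, in elements */
} matview;

/* Element (i, j) lives at data[i*rs + j*cs], whatever the layout. */
double *mv_at(matview v, size_t i, size_t j) {
    assert(i < v.m && j < v.n);
    return &v.data[i * v.rs + j * v.cs];
}

/* Row-major storage: consecutive elements of a row are adjacent. */
matview mv_row_major(double *data, size_t m, size_t n) {
    matview v = { data, m, n, n, 1 };
    return v;
}

/* Column-major storage (the BLAS convention) with leading dimension ld. */
matview mv_col_major(double *data, size_t m, size_t n, size_t ld) {
    assert(ld >= m);
    matview v = { data, m, n, 1, ld };
    return v;
}

/* Transposing copies nothing: swap the dimensions and the strides. */
matview mv_transpose(matview v) {
    matview t = { v.data, v.n, v.m, v.cs, v.rs };
    return t;
}
```

With a view like this, the same kernel can walk a row-major matrix, a column-major matrix, or a transposed one without copying, which is the kind of control the BLAS leading-dimension arguments give you in a less direct form.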

Doing small network scientific machine learning in Julia faster than PyTorch | Apr 2022

The article asks "Which Micro-optimizations matter for BLAS3?", implying small dimensions, but doesn't actually tell me. The problem is well-studied, depending on what you consider "small". The most important thing is to avoid the packing step below an appropriate threshold. Implementations include libxsmm, blasfeo, and the "sup" version in blis (with papers on libxsmm and blasfeo). Eigen might also be relevant.

https://libxsmm.readthedocs.io/

https://blasfeo.syscop.de/

https://github.com/flame/blis
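
As a rough illustration of "avoid the packing step below an appropriate threshold", the C sketch below dispatches between an unpacked and a packed path. Everything here is an assumption made for the example: the function names, the SMALL_DIM cutoff, and the packing itself, which is simplified to a plain transposed copy of B. It is not how libxsmm, blasfeo, or the BLIS "sup" code is actually structured.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cutoff; real libraries tune this per architecture. */
enum { SMALL_DIM = 16 };

/* Unpacked path: read A and B in place.
   C (m x n) += A (m x k) * B (k x n), all row-major and contiguous. */
void gemm_unpacked(size_t m, size_t n, size_t k,
                   const double *a, const double *b, double *c) {
    for (size_t i = 0; i < m; i++)
        for (size_t p = 0; p < k; p++) {
            double aip = a[i * k + p];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aip * b[p * n + j];
        }
}

/* Packed path: copy B into a transposed scratch buffer first, so every
   dot product streams through contiguous memory. The copy costs O(k*n)
   and only pays off once the matrices are large enough.
   Returns -1 if the scratch buffer cannot be allocated. */
int gemm_packed(size_t m, size_t n, size_t k,
                const double *a, const double *b, double *c) {
    double *bt = malloc(k * n * sizeof *bt);
    if (!bt)
        return -1;
    for (size_t p = 0; p < k; p++)
        for (size_t j = 0; j < n; j++)
            bt[j * k + p] = b[p * n + j];
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * bt[j * k + p];
            c[i * n + j] += acc;
        }
    free(bt);
    return 0;
}

/* Dispatch: skip packing entirely when every dimension is small. */
int gemm(size_t m, size_t n, size_t k,
         const double *a, const double *b, double *c) {
    if (m == 0 || n == 0 || k == 0)
        return 0;
    if (m <= SMALL_DIM && n <= SMALL_DIM && k <= SMALL_DIM) {
        gemm_unpacked(m, n, k, a, b, c);
        return 0;
    }
    return gemm_packed(m, n, k, a, b, c);
}
```

The right cutoff depends on cache sizes and the kernel, so it would have to be measured rather than guessed.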

Auto-vectorization for the Masses (2011) | Feb 2020

Netlib BLAS is a very low bar [1], and not at all how one should go about writing a performance-portable BLAS. BLIS (https://github.com/flame/blis/) is a much better approach, and underlies vendor implementations on AMD (https://developer.amd.com/amd-aocl/blas-library/) and many embedded systems.

[1] GEMV is entirely limited by memory bandwidth, thus quite uninteresting from a vectorization standpoint. Maybe you meant GEMM?
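
A back-of-the-envelope count shows why the footnote singles out GEMV. The figures below are leading-order estimates for double precision, assuming an $m \times n$ matrix for GEMV, square $n \times n$ operands for GEMM, and that each operand is moved between memory and cache only once:

```latex
\begin{aligned}
\text{GEMV } (y \leftarrow \alpha A x + \beta y):\quad
  & \text{flops} \approx 2mn, \quad \text{words moved} \approx mn
  \;\Rightarrow\; \frac{\text{flops}}{\text{word}} \approx 2
  \;\; (\approx 0.25\ \text{flop/byte}) \\
\text{GEMM } (C \leftarrow AB + C):\quad
  & \text{flops} \approx 2n^3, \quad \text{words moved} \approx 4n^2
  \;\Rightarrow\; \frac{\text{flops}}{\text{word}} \approx \frac{n}{2}
\end{aligned}
```

GEMV's ratio is a constant, so no amount of vectorization lifts it above what the memory system can deliver, while GEMM's ratio grows with $n$ and leaves room for the compute units to become the bottleneck.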

Reducing the Performance Gap of Intel's MKL on AMD Threadripper | Dec 2019

As with a previous article like this -- it's pointless. Just use BLIS, as AMD does; it is also infinitely faster than MKL on non-x86 systems. https://github.com/flame/blis

AMD Ryzen Threadripper 3000 32-core CPU is more bad news for Intel | Sep 2019

The BLAS and LAPACK subset of the Intel Math Kernel Library (MKL) API is very well implemented in open source projects such as OpenBLAS and BLIS:

https://github.com/flame/blis

Both are well optimized for AMD CPUs.

Numpy Clone in Common Lisp | May 2019

Not arguing with that, but I think the jury is out on whether they need to be hand vectorized. With recent GCC, generic C for BLIS' DGEMM gives about 2/3 the performance of the hand-coded version on Haswell, and it may be somewhat pessimized by hand-unrolling rather than letting the compiler do it. The remaining difference is thought to be mainly from prefetching, but I haven't verified that. (Details are scattered in the BLIS issue tracker earlier this year.)

For the information of anyone who doesn't know about the performance of level 3 BLAS: it doesn't come just from vectorization, but also from cache use with levels of blocking (and prefetch). See the material under https://github.com/flame/blis. Other levels -- not matrix-matrix -- are less amenable to fancy optimization, with lower arithmetic density to pit against memory bandwidth, and BLIS mostly hasn't bothered with them, though OpenBLAS has.
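
For readers who haven't seen "levels of blocking" written down, the C sketch below tiles a row-major GEMM so that each tile of A, B and C is reused while it is still in cache. The block sizes MC, KC and NC are placeholder values, and the loop structure is a simplification chosen for the example; the actual BLIS algorithm adds packing and a register-level micro-kernel, as described in the material linked from its repository.

```c
#include <stddef.h>

/* Hypothetical block sizes; real values are tuned to the cache hierarchy. */
enum { MC = 64, KC = 64, NC = 64 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C (m x n) += A (m x k) * B (k x n), all row-major and contiguous.
   The three outer loops pick a tile; the three inner loops do the
   arithmetic on that tile only. */
void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *a, const double *b, double *c) {
    for (size_t jc = 0; jc < n; jc += NC) {
        size_t jmax = min_sz(jc + NC, n);
        for (size_t pc = 0; pc < k; pc += KC) {
            size_t pmax = min_sz(pc + KC, k);
            for (size_t ic = 0; ic < m; ic += MC) {
                size_t imax = min_sz(ic + MC, m);
                for (size_t i = ic; i < imax; i++)
                    for (size_t p = pc; p < pmax; p++) {
                        double aip = a[i * k + p];
                        for (size_t j = jc; j < jmax; j++)
                            c[i * n + j] += aip * b[p * n + j];
                    }
            }
        }
    }
}
```

The arithmetic is the same as in the naive triple loop; only the order of memory accesses changes, which is where the cache benefit comes from.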

How to write efficient matrix multiplication | May 2018

This thread is old, but for the sake of archives:

BLIS actually tells you how to write a fast production large-matrix GEMM, and the papers linked from https://github.com/flame/blis would be a better reference than the Goto and van de Geijn.

For small matrices see the references from https://github.com/hfp/libxsmm but you're not likely to be re-implementing that unless you're serious about a version on non-x86_64.

Is software prefetching (__builtin_prefetch) useful for performance? | May 2018

It's useful in linear algebra implementations (if not necessarily using __builtin_prefetch). You can study the (micro-)architecture-dependent uses in e.g. https://github.com/flame/blis https://github.com/xianyi/OpenBLAS and https://github.com/hfp/libxsmm and it may be discussed in the papers describing them. libxsmm provides a view of the complexity outside "large" dimension BLAS.
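
For anyone who hasn't used it, this is roughly what a software prefetch looks like with the GCC/Clang builtin. The dot-product kernel, the function name and the PREFETCH_AHEAD distance are assumptions made for the example, and the 8-element step assumes 64-byte cache lines. For a simple streaming loop like this one the hardware prefetcher may already do the job, so treat it as a sketch of the mechanics rather than a guaranteed speedup.

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in elements; it has to be tuned by
   measurement on each (micro-)architecture. */
enum { PREFETCH_AHEAD = 64 };

double dot_prefetch(const double *x, const double *y, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Issue one prefetch per 8 doubles (one assumed 64-byte line),
           and only for addresses that are still inside the arrays.
           Arguments: address, 0 = read, 3 = keep in all cache levels. */
        if ((i & 7) == 0 && i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&x[i + PREFETCH_AHEAD], 0, 3);
            __builtin_prefetch(&y[i + PREFETCH_AHEAD], 0, 3);
        }
#endif
        acc += x[i] * y[i];
    }
    return acc;
}
```

A prefetch is only a hint, so the result is identical with or without it; whether it helps is something to measure with the real access pattern.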