What does HackerNews think of LoopVectorization.jl?

Macro(s) for vectorizing loops.

Language: Julia

It is a little disappointing that they're setting the bar against vanilla Python in their comparisons. While I'm sure they have put massive engineering effort into their ML compiler, the demos they showed of matmul are not that impressive in an absolute sense; with the analogous Julia code, making use of [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl) to automatically choose good defaults for vectorization, etc...

    julia> using LoopVectorization, BenchmarkTools, Test
           function AmulB!(C,A,B)
               @turbo for n = indices((C,B),2), m = indices((C,A),1)
                   Cmn = zero(eltype(C))
                   for k = indices((A,B),(2,1))
                       Cmn += A[m,k]*B[k,n]
                   end
                   C[m,n]=Cmn
               end
           end
           M = K = N = 144; A = rand(Float32, M,K); B = rand(Float32, K,N); C0 = A*B; C1 = similar(C0);
           AmulB!(C1,A,B)
           @test C1 ≈ C0
           2e-9*M*K*N/@belapsed(AmulB!($C1,$A,$B))
    96.12825754527164

I'm able to achieve 96 GFLOPs on a single core (Apple M1) or 103 GFLOPs on a single core (AMD EPYC 7502). And that's not even as good as what you can achieve using e.g. TVM to do the scheduling exploration that Mojo purports to do.
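
For anyone wondering where the `2e-9*M*K*N` factor comes from: each of the M*N output entries needs K multiplies and K adds, so one call performs 2*M*K*N floating-point operations, and the 1e-9 scales that to GFLOPs. A minimal sketch of the arithmetic follows; the helper names `matmul_flops` and `gflops` are made up for illustration, and the timing is a placeholder rather than a measurement.

```julia
# Floating-point operations in an (M×K) * (K×N) matrix multiply:
# every one of the M*N outputs takes K multiplies and K adds.
matmul_flops(M, K, N) = 2 * M * K * N

# Convert a runtime in seconds into GFLOPs (1e9 operations per second).
gflops(M, K, N, seconds) = 1e-9 * matmul_flops(M, K, N) / seconds

M = K = N = 144
println(matmul_flops(M, K, N))   # operations per call for the 144×144 case

# Hypothetical runtime, chosen only to illustrate the conversion.
t = 62.0e-6
println(gflops(M, K, N, t))
```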

Perhaps they have more extensive examples coming that showcase the capabilities further. I understand it's difficult to show all strengths of the entire system in a short demonstration video. :)

EDIT: As expected, there are significantly better benchmarks shown at https://www.modular.com/blog/the-worlds-fastest-unified-matr... so perhaps this whole discussion truly is just a matter of the demo not showcasing the true power of the system. Hopefully achieving those high performance numbers for sgemm is doable without too much ugly code.

> My mistake in retrospect was using small arrays as part of a struct, which being immutable got replaced at each time step with a new struct requiring new arrays to be allocated and initialized. I would not have done that in c++, but julia puts my brain in matlab mode.

I see. Yes, it's an interesting design space where Julia makes both heap and stack allocations easy enough, so sometimes you just reach for the heap like in MATLAB mode. Hopefully Prem and Shuhei's work lands soon enough to stack allocate small non-escaping arrays so that users don't need to think about this.
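
To make the pattern concrete, here is a rough sketch of the two designs, not the original poster's code. The type names `HeapState` and `TupleState` and the functions `advance` and `simulate` are invented for illustration. The first struct holds a `Vector`, so rebuilding it at every step creates a new heap array; the second holds an `NTuple`, which is an immutable value that the compiler can typically keep off the heap.

```julia
# Immutable struct holding a heap-allocated array.
struct HeapState
    pos::Vector{Float64}
end

# Immutable struct holding a fixed-size tuple instead.
struct TupleState
    pos::NTuple{3,Float64}
end

# Each step builds a new struct. For HeapState the broadcast creates a new Vector;
# for TupleState broadcasting over a tuple returns another tuple.
advance(s::HeapState, dt) = HeapState(s.pos .+ dt)
advance(s::TupleState, dt) = TupleState(s.pos .+ dt)

function simulate(s, dt, nsteps)
    for _ in 1:nsteps
        s = advance(s, dt)
    end
    return s
end

h = simulate(HeapState(zeros(3)), 0.5, 4)
t = simulate(TupleState((0.0, 0.0, 0.0)), 0.5, 4)
println(h.pos, " ", t.pos)
```

Comparing `@allocated simulate(...)` for the two versions (after a warm-up call so compilation is not counted) should show the difference on your own machine.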

> Alignment I'd assumed, but padding the struct instead of the tuple did nothing, so probably extra work to clear a piece of an simd load. Any insight on why avx availability didn't help would be appreciated. I did verify some avx instructions were in the asm it generated, so it knew, it just didn't use.

The major differences at this point seem to come down to GCC (g++) vs LLVM and proofs of aliasing. LLVM's auto-vectorizer isn't that great, and it seems less reliable at proving that two arrays do not alias. For the first part, some people have just improved the loop analysis code from the Julia side (https://github.com/JuliaSIMD/LoopVectorization.jl); forcing SIMD onto LLVM can help it make the right choices. But for the second part you do need to do `@simd ivdep for ...` (or use LoopVectorization.jl) to match some C++ examples. This is hopefully one of the things that JET.jl and other new analysis passes can help with, along with the new effects system (see https://github.com/JuliaLang/julia/pull/43852; this is a pretty huge new compiler feature in v1.8, but right now it's manually specified and it will take time before things like https://github.com/JuliaLang/julia/pull/44822 land and start to make it more pervasive). When that's all together, LLVM will have more ammo for proving things more effectively (pun intended).
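
As a minimal sketch of what the `@simd ivdep` annotation looks like in plain Base Julia (the function name `axpy_ivdep!` is just an illustration): `ivdep` is a promise from you to the compiler that no iteration depends on another, so it is only safe when that is really true.

```julia
# y[i] += a * x[i], with the programmer asserting that iterations are independent.
function axpy_ivdep!(y::Vector{Float64}, a::Float64, x::Vector{Float64})
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    # @inbounds drops bounds checks; @simd ivdep tells the compiler it may
    # reorder and vectorize without proving the absence of loop-carried dependencies.
    @inbounds @simd ivdep for i in eachindex(x, y)
        y[i] += a * x[i]
    end
    return y
end

x = [1.0, 2.0, 3.0]
y = [1.0, 1.0, 1.0]
println(axpy_ivdep!(y, 2.0, x))
```

If the loop body did read a value written by an earlier iteration, the `ivdep` promise would be false and the result could silently be wrong, which is why the compiler does not assume it on its own.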

And the "how" behind Octavian.jl is basically LoopVectorization.jl [1], which helps make optimal use of your CPU's SIMD instructions.

Currently there can be some nontrivial compilation latency with this approach, but since LV ultimately emits custom LLVM it's actually perfectly compatible with StaticCompiler.jl [2] following Mason's rewrite, so stay tuned on that front.

[1] https://github.com/JuliaSIMD/LoopVectorization.jl

[2] https://github.com/tshort/StaticCompiler.jl

I agree with you that Fortran is running on more than just legacy here. At the same time, I also think Julia has caught up a lot as far as SIMD, multicore, MPI and GPU.

For SIMD, Chris Elrod's LoopVectorization.jl [1] is an (IMHO) incredibly impressive piece of work (which incidentally provides the foundation for I think the first pure Julia linear algebra library competitive with BLAS).

Multithreading is pretty easy with things like `@spawn`/`@sync` and `@threads for` in the base language, as well as super low-overhead multithreading from the Polyester.jl [2] package (which LoopVectorization also uses to provide a version of its vectorization macro that'll also multithread your loops in addition to SIMD-vectorizing them).
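
A small sketch of the two Base approaches, using only `Threads.@threads`, `Threads.@spawn` and `@sync` (the function names and the chunking scheme are my own, not from any package): the first splits a loop across threads, the second spawns one task per chunk and waits for all of them.

```julia
# Loop-level threading: iterations are divided among the available threads.
function threaded_square!(out, x)
    Threads.@threads for i in eachindex(out, x)
        out[i] = x[i]^2
    end
    return out
end

# Task-level threading: one task per chunk, each writing its own partial sum.
function spawn_sum(x; nchunks = Threads.nthreads())
    chunklen = max(1, cld(length(x), nchunks))
    chunks = collect(Iterators.partition(eachindex(x), chunklen))
    partials = zeros(eltype(x), length(chunks))
    # @sync waits until every task spawned inside the block has finished.
    @sync for (j, c) in enumerate(chunks)
        Threads.@spawn partials[j] = sum(view(x, c))
    end
    return sum(partials)
end

x = collect(1.0:100.0)
println(spawn_sum(x))
println(threaded_square!(similar(x), x)[1:3])
```

Start Julia with `julia -t 4` (or set `JULIA_NUM_THREADS`) to actually get more than one thread; with a single thread both functions still run, just serially.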

MPI.jl [3] has been totally problem free for me, though I wouldn't be surprised if the Fortran bindings still have an edge somewhere, and CUDA.jl [4] seems to provide pretty seamless GPU support which should play nicely with MPI.jl's CUDA-aware MPI [5], but I don't work as much with GPUs myself.

[1] https://github.com/JuliaSIMD/LoopVectorization.jl

[2] https://github.com/JuliaSIMD/Polyester.jl

[3] https://github.com/JuliaParallel/MPI.jl

[4] https://github.com/JuliaGPU/CUDA.jl

[5] https://juliaparallel.github.io/MPI.jl/latest/usage/#CUDA-aw...

Yes, and sorry if I also came off as combative here, it was not my intention either. I've used some Common Lisp before I got into Julia (though I never got super proficient with it) and I think it's an excellent language and it's too bad it doesn't get more attention.

I just wanted to share what I think is cool about julia from a metaprogramming point of view, which I think is actually its greatest strength.

> here is a hypothetical question that can be asked: would a julia programmer be more powerful if llvm was written in julia? i think the answer is clear that they would be

Sure, I'd agree it'd be great if LLVM was written in julia. However, I also don't think it's a very high priority because there are all sorts of ways to basically slap LLVM's hands out of the way and say "no, I'll just do this part myself."

E.g. consider LoopVectorization.jl [1] which is doing some very advanced program transformations that would normally be done at the LLVM (or lower) level. This package is written in pure Julia and is all about bypassing LLVM's pipelines and creating hyper efficient microkernels that are competitive with the handwritten assembly in BLAS systems.

To your point, yes Chris' life likely would have been easier here if LLVM was written in julia, but also he managed to create this with a lot less man-power in a lot less time than anything like it that I know of, and it's screaming fast so I don't think it was such a huge impediment for him that LLVM wasn't implemented in julia.

[1] https://github.com/JuliaSIMD/LoopVectorization.jl

Growing up, what my late father probably wanted most from me is for me to find a project of my own. When I was in high school, he once threatened me with "get a life, or I will get you one". Engines, and especially motorcycles, were always a passion of his. He grew up on a farm, and "was rebuilding tractor engines when the other kids were learning to ride bicycles." He still holds a few land speed records he set with motorcycles he designed and built.

But I had no real hobbies or passions of my own, other than playing card games.

It wasn't until my twenties, after I already graduated college with degrees I wasn't interested in and my dad's health failed, that I first tried programming. A decade earlier, my dad was attending the local Linux meetings when away from his machine shop.

Programming, and especially performance optimization/loop vectorization are now my passion and consume most of my free time (https://github.com/JuliaSIMD/LoopVectorization.jl).

Hearing all the stories about people starting and getting hooked when they were 11 makes me feel like I lost a dozen years of my life. I had every opportunity, but just didn't take them. If I had children, I would worry for them.

I think it's more interesting to see what people do with the language instead of focusing on microbenchmarks. There's for instance this great package https://github.com/JuliaSIMD/LoopVectorization.jl which exports a simple macro `@avx` which you can stick on loops to vectorize them in ways better than the compiler (=LLVM). It's quite remarkable that you can implement this in the language as a package, as opposed to having LLVM improve or the julia compiler team figure this out.

See the docs which kinda read like blog posts: https://juliasimd.github.io/LoopVectorization.jl/stable/

And then replacing the matmul.jl with the following:

    @avx for i = 1:m, j = 1:p
        z = 0.0
        for k = 1:n
            z += a[i, k] * b[k, j]
        end
        out[i, j] = z
    end

I get a 4x speedup from 2.72s to 0.63s. And with @avxt (threaded) using 8 threads it goes down to 0.082s on my AMD Ryzen CPU. (So this is not dispatching to MKL/OpenBLAS/etc). Doing the same in native Python takes 403.781s on this system -- haven't tried the others.