This looks quite interesting with big names behind it.

I'd love to see a comparison vs Julia though, which I think tried to tackle some of the same problems.

It is a little disappointing that they're setting the bar against vanilla Python in their comparisons. While I'm sure they have put massive engineering effort into their ML compiler, the matmul demos they showed are not that impressive in an absolute sense; the analogous Julia code, using [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl) to automatically choose good defaults for vectorization, unrolling, etc., gets comparable performance:

    julia> using LoopVectorization, BenchmarkTools, Test
           # C = A*B; @turbo picks the vectorization and unrolling strategy
           function AmulB!(C,A,B)
               @turbo for n = indices((C,B),2), m = indices((C,A),1)
                   Cmn = zero(eltype(C))
                   for k = indices((A,B),(2,1))
                       Cmn += A[m,k]*B[k,n]
                   end
                   C[m,n] = Cmn
               end
           end
           M = K = N = 144; A = rand(Float32, M,K); B = rand(Float32, K,N); C0 = A*B; C1 = similar(C0);
           AmulB!(C1,A,B)
           @test C1 ≈ C0  # check against the built-in (BLAS) matmul
           # a matmul does 2*M*K*N flops, so this expression is throughput in GFLOPS
           2e-9*M*K*N/@belapsed(AmulB!($C1,$A,$B))
    96.12825754527164
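(For scale: 2·M·K·N = 2·144³ ≈ 6.0e6 flops per multiply, so 96 GFLOPS works out to roughly 62 µs per call.)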
I'm able to achieve 96 GFLOPS on a single core (Apple M1), or 103 GFLOPS on a single core (AMD EPYC 7502). And that's not even as good as what you can achieve using e.g. TVM to do the kind of scheduling exploration that Mojo purports to do.
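And for what it's worth, going multithreaded in this framework is a one-macro change. A minimal sketch, continuing the REPL session above (`AmulB_threaded!` is just a name I picked, and I haven't timed this on the machines mentioned):

    julia> # sketch: @tturbo is the multithreaded counterpart of @turbo,
           # so the kernel body itself is unchanged
           function AmulB_threaded!(C,A,B)
               @tturbo for n = indices((C,B),2), m = indices((C,A),1)
                   Cmn = zero(eltype(C))
                   for k = indices((A,B),(2,1))
                       Cmn += A[m,k]*B[k,n]
                   end
                   C[m,n] = Cmn
               end
           end
           AmulB_threaded!(C1,A,B); @test C1 ≈ C0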

Perhaps they have more extensive examples coming that showcase the capabilities further. I understand it's difficult to show all the strengths of an entire system in a short demonstration video. :)

EDIT: As expected, there are significantly better benchmarks shown at https://www.modular.com/blog/the-worlds-fastest-unified-matr... so perhaps this whole discussion is just a matter of the demo not showcasing the system's true power. Hopefully achieving those high performance numbers for sgemm is doable without too much ugly code.