Yesterday, I watched Emery Berger’s “Python performance matters” Which is you should not bother to write fast Python, just delegate all heavy lifting to optimized C/c++ and the likes. His group has a Python profiler whose main goal is to point out which parts should be delegated. Of course optimized is better but the numeric microbenchmarks in this post are mostly examples of code that is better delegated to low overhead compiled languages.
> you should not bother to write fast Python, just delegate all heavy lifting to optimized C/c++ and the likes
Certainly that's something you can do, but unfortunately for Python it opens the door for languages like Julia, which are trying to say that you can have your cake and eat it too.
I like Julia but its easy to write slow julia unless you keep the performance tips in mind. Arrays are horribly slow, tuples are much faster, but having tuples be a multiple of 128 bytes adds 10% or more to speed. I honestly don't understand how much slower arrays are. Its like 30x or similar.
That’s bizarre, are they implemented as linked lists or something? Why would arrays be that slow?
They aren't, which is why the statement doesn't make sense haha. The difference is really just that tuples of isbits types can be stack allocated (and tuples of isbits types are isbits, so they can nest, etc.), and so in some cases with sufficiently small amounts of data, creating stack allocated objects is much faster than creating heap allocated objects.
But if you compare concretely typed heap allocated arrays in Julia and C, there's no real performance difference (Steven Johnson has a nice notebook displaying this https://scpo-compecon.github.io/CoursePack/Html/languages-be..., and if you see the work on LoopVectorization.jl it makes it really clear that the only major difference is that C++ tends to know how to prove non-aliasing (ivdep) in a bit more places (right now)). So the real question is, did you actually want to use a heap allocated object? I think this really catches people new to performance optimization off guard since in other dynamic languages you don't have such control to "know" things will be placed on the stack, so you generally heap allocate everything (Python even heap allocates numbers), which is one of the main reasons for the performance difference against C. Julia gives you the tools to write code that doesn't heap allocate objects, but also makes it easy to heap allocate objects (because if you had to `malloc` and `free` everywhere... well you'd definitely lose the "like Python" and and be much closer to C++ or Rust in terms of "ease of use", which would defeat the purpose for many cases). But if you come from a higher level language, there's this magical bizarre land of "things that don't allocate" (on the heap) and so you learn "oh I got 30x faster from Python, but then someone on a forum showed me how to do 100x better by not making arrays?", which is somewhat obvious from a C++ mindset but less obvious from a higher level language mindset.
And FWIW, this is probably the biggest performance issue newcomers run into, and I think one of the things to which a solution is required to make it mainstream. Thankfully, there's already prototype PRs that are well underway, for example https://github.com/JuliaLang/julia/pull/43573 automatically stack-allocates small arrays which can prove certain escape properties, https://github.com/JuliaLang/julia/pull/43777 is looking to hoist allocations out of loops so even if they are required they are minimized in the loop contexts automatically, etc. The biggest impediment is that EscapeAnalysis.jl is not in Base, and it requires JET.jl which is not in Base, and so both of those need to be made "Base friendly" to join the standard library and then the compiler can start to rely on its analysis (which will be nice because JET.jl can do things like throw more statically-deducible compile time errors, another thing people ask for with Julia). There's a few people who are working really hard on doing this, so it is a current issue but it's a known one with prototyped solutions and a direction to get it into the shipped compiler. When that's all said and done, of course "good programming" will still help the compiler in some cases, but in most people shouldn't have to worry about stack vs heap (it's purposefully not part of the user API in Julia and considered a compiler detail for exactly this reason, so it's non-breaking for the compiler to change where objects live and improve performance over time).
My background is over 28 years of c++ programming 20 of that as my profession. My mistake in retrospect was using small arrays as part of a struct, which being immutable got replaced at each time step with a new struct requiring new arrays to be allocated and initialized. I would not have done that in c++, but julia puts my brain in matlab mode. If I had gone full matlab and just used a matrix or tensor instead of an array of structa of arrays then I would have still taken a hit from the mutability, but not the one I did take. After successfully doing some DSP in julia I became arrogant and took on the nbody benchmark at benchmark game. I did discover that targeting skylake instead of ivybridge made no change. It doubled the speed of the c++ and rust solutions. Removing @inbounds was less than a 10% hit. And the iterative fastmath inverse sqrt was absolutely neutral on the skylake machine I used. Sqrt was just as fast. I did get a 10% bump by making the 3tuples into 4tuples for position and velocity. Alignment I'd assumed, but padding the struct instead of the tuple did nothing, so probably extra work to clear a piece of an simd load. Any insight on why avx availability didn't help would be appreciated. I did verify some avx instructions were in the asm it generated, so it knew, it just didn't use.
I see. Yes, it's an interesting design space where Julia makes both heap and stack allocations easy enough, so sometimes you just reach for the heap like in MATLAB mode. Hopefully Prem and Shuhei's work lands soon enough to stack allocate small non-escaping arrays so that user's done need to think about this.
> Alignment I'd assumed, but padding the struct instead of the tuple did nothing, so probably extra work to clear a piece of an simd load. Any insight on why avx availability didn't help would be appreciated. I did verify some avx instructions were in the asm it generated, so it knew, it just didn't use.
The major differences at this point seem to come down to GCC (g++) vs LLVM and proofs of aliasing. LLVM's auto-vectorizer isn't that great, and it seems to be able to prove 2 arrays are not aliasing less reliably. For the first part, some people have just improved the loop analysis code from the Julia side (https://github.com/JuliaSIMD/LoopVectorization.jl), forcing SIMD onto LLVM can help it make the right choices. But for the second part you do need to do `@simd ivdep for ...` (or use LoopVectorization.jl) to match some C++ examples. This is hopefully one of the things that the JET.jl and other new analysis passes can help with, along with the new effects system (see https://github.com/JuliaLang/julia/pull/43852, this is a pretty huge new compiler feature in v1.8, but right now it's manually specified and will take time before things like https://github.com/JuliaLang/julia/pull/44822 land and start to make it more pervasive). When that's all together, LLVM will have more ammo for proving things more effectively (pun intended).