The important part of Julia is its programming model.

The implicitly parallel fork-join model is easy to program and incredibly general. And I'm glad to see a high-level language embrace it.

-------

I should probably note some other languages in this model: CUDA, OpenCL, ISPC, OpenMP, and OpenACC. Most of these are low-level, requiring you to manage memory directly (and since GPUs have very little memory per thread, manual memory management is still hugely important).

But for speed of development, prototyping, and other higher-level reasons, a language like Julia that implements this "mindset" is going to be hugely important moving forward.

------

The fact that parallel fork-join scales from 2 CPU threads all the way up to thousands of GPU threads, or to CPU + SIMD (such as AVX-512), is proof that this methodology is useful. I feel like people are sleeping on this model: it's hugely useful for scaling on what I believe to be the computer of the future.
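To make the fork-join shape concrete, here is a minimal sketch in Python (plain standard library; Python here is just a stand-in for the mindset, not for Julia's implementation): fork a batch of independent tasks across workers, then join by collecting the results.

```python
from concurrent.futures import ThreadPoolExecutor

def fork_join_map(f, data, workers=4):
    # Fork: hand each element to a worker thread.
    # Join: pool.map blocks until every task finishes, preserving input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, data))

squares = fork_join_map(lambda x: x * x, range(8))
```

(For pure-Python CPU-bound work the GIL limits real parallelism; the point is the programming model — the same `f`-over-`data` shape maps onto threaded loops, GPU thread blocks, or SIMD lanes.)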

Classifying ISPC, for example, as low level invalidates this comment for me; ISPC, among others, provides a great interface to the SIMD model for CPUs that simply isn't available in Julia.

ISPC is higher level than its peers, but it's still new/delete-based manual memory allocation. So in the grand scheme of programming languages, it's still rather low level (since you're manually handling memory).

Indeed: ISPC provides constructs for structure-of-arrays and other low-level memory-layout details. This is a good thing: these details have significant implications for the speed of your program.

Nonetheless, any language that wrangles with manual details of memory layout, or new/delete-based memory allocation, is inevitably going to be classified as low level in my book.

> ISPC, among others, provides a great interface to the SIMD model for CPUs that simply isn't available in Julia.

If Julia can compile to GPU assembly (which is innately SIMD), I'm sure an AVX-based compile could work eventually.

They may have to target AVX-512 (since most GPU assembly requires per-thread exec-masks), but the general concept is being proven as Julia can now compile down into PTX or AMDGPU assembly.
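To illustrate what a per-thread exec-mask means, here is a tiny sketch (plain Python standing in for SIMD lanes; the function names are mine, not from any real API): when lanes diverge at a branch, the hardware evaluates both sides for every lane and a mask selects which result each lane keeps.

```python
def masked_branch(cond, then_fn, else_fn, lanes):
    # Every "lane" runs the SAME instruction stream; a divergent branch is
    # handled by computing both sides for all lanes, then selecting
    # per-lane with the exec-mask -- as on GPUs or AVX-512 with predication.
    mask = [cond(x) for x in lanes]
    then_vals = [then_fn(x) for x in lanes]   # executed for every lane
    else_vals = [else_fn(x) for x in lanes]   # also executed for every lane
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

# abs() expressed as a masked branch across 4 lanes:
result = masked_branch(lambda x: x < 0, lambda x: -x, lambda x: x, [-2, 1, -3, 4])
```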

Julia's compilation to GPU assembly / SIMD code is not supported in the general case, only in select circumstances. But that's still an incredible boon for a high-level language.

> but it's still new/delete-based manual memory allocation

I guess we aren’t solving the same problems: memory allocation is trivial in my domain; mapping complex nested control flow onto SIMD is the hard part.

> Julia can now compile down into PTX or AMDGPU assembly.

Sure, but Julia-as-syntax is nothing special now; Numba does this for Python as well.

You can and you can't. Julia is composable. Suppose I want to write a library to find compression polynomials for a Reed-Solomon encoding system (see Mary Wootters' talks on YouTube) for my storage product. I need an LU decomposition algorithm that operates on GF(256) (which is just an 8-bit int), but the +/−/×/÷ operations are all messed up, so I'd have to rewrite LU decomposition. How confident are you that you could get even that right? I'm pretty good at implementing algorithms, but I've messed up LU decomposition before.

Then suppose I rewrite the LU decomposition algorithm in Python. Now I want to accelerate the search by running it on GPUs. I have to re-rewrite the code from scratch: each GF(256) encoding has to have rejiggered operators, so I need to write custom GPU kernels, then figure out how to resequence the operations (× looks different for each GF(256) encoding), etc.

This is all SUPER easy in Julia.
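To show what "the operators are all messed up" means concretely, here is a minimal Python sketch (not the commenter's code) of GF(256) arithmetic via operator overloading. The reduction polynomial 0x11B is the AES choice, picked here just as an example — Reed-Solomon codecs often use 0x11D instead. A generic routine written against `+`, `*`, `/` (like the toy `dot` below, or an LU decomposition) then runs on this type unchanged, which is the composability being claimed for Julia.

```python
class GF256:
    """An element of GF(2^8), reduced by x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    POLY = 0x11B

    def __init__(self, v):
        self.v = v & 0xFF

    def __add__(self, other):          # addition in GF(2^8) is XOR
        return GF256(self.v ^ other.v)

    __sub__ = __add__                  # subtraction == addition in char-2 fields

    def __mul__(self, other):
        a, b, r = self.v, other.v, 0
        while b:                       # carry-less "Russian peasant" multiply
            if b & 1:
                r ^= a
            b >>= 1
            a <<= 1
            if a & 0x100:              # reduce modulo the field polynomial
                a ^= GF256.POLY
        return GF256(r)

    def inverse(self):                 # a^(2^8 - 2) = a^-1, by Fermat
        result, base, e = GF256(1), self, 254
        while e:                       # square-and-multiply exponentiation
            if e & 1:
                result = result * base
            base = base * base
            e >>= 1
        return result

    def __truediv__(self, other):
        return self * other.inverse()

    def __eq__(self, other):
        return self.v == other.v

# A generic routine written once against +/* works on floats AND on GF256:
def dot(xs, ys):
    acc = xs[0] * ys[0]
    for x, y in zip(xs[1:], ys[1:]):
        acc = acc + x * y
    return acc
```

The same idea scales up: a generic LU decomposition that only uses these four operators never has to be rewritten per field — that rewrite-per-encoding burden is exactly what the comment is complaining about.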

> (see mary wootters' talks on youtube)

https://www.youtube.com/watch?v=Gh578e98qAk

> Suppose I want to write a library to find compression polynomials for a Reed-Solomon encoding system (see Mary Wootters' talks on YouTube) for my storage product. I need an LU decomposition algorithm that operates on GF(256)

Surely you use isa-l[1].

[1] https://github.com/intel/isa-l

> Now I want to accelerate the search by running the search on GPUs.

GPUs are float-oriented, so I don't think you'll get the performance you hope for out of 8-bit integer operations. If you have interesting results to share, I'd like to read them.