What does HackerNews think of awesome-tensor-compilers?

A list of awesome compiler projects and papers for tensor computation and deep learning.

> So long as Pytorch only practically works with Nvidia GPUs, everything else is little more than a rounding error.

This is changing.

https://github.com/merrymercy/awesome-tensor-compilers

There are more and better projects that can compile an existing PyTorch codebase into a more optimized format for a range of devices. Triton (which is part of PyTorch), TVM, and the MLIR-based efforts (like Torch-MLIR or IREE) are the big ones, but there are smaller fish like GGML and Tinygrad, and more narrowly focused projects like Meta's AITemplate (which works on AMD datacenter GPUs).

Hardware is in a strange place now... It feels like everyone but Cerebras and AMD/Intel was squeezed out, but with all the money pouring in, I think this is temporary.

> I really want to do some great work and help people.

Have you looked into ML compilation?

https://github.com/merrymercy/awesome-tensor-compilers

IMO there is low hanging fruit in the space between high performance ML compilers/runtimes and the actual projects people use. If you practice porting projects you use to these frameworks, that would give you a massive performance edge.

While not all of these are "languages" per se, I am excited about the various ML compilation efforts:

https://github.com/merrymercy/awesome-tensor-compilers

Modern ML training/inference is inefficient, and lacks any portability. These frameworks are how that changes...

As random examples, TVM runs LLaMA on Vulkan faster than PyTorch CUDA, and AITemplate almost doubles the speed of Stable Diffusion. Triton somewhat speeds up PyTorch training in the few repos that use it now, and should help AMD hardware even more than Nvidia.

(This is more of a link dump than a paper discussion.)

For the line of inquiry w.r.t. tensor compilers and MLIR/LLVM (linalg, polyhedral, [sparse_]tensor, etc.), I personally found the following really helpful: https://news.ycombinator.com/item?id=25545373 (links to a survey), https://github.com/merrymercy/awesome-tensor-compilers

I also have an interest in the community more widely associated with pandas/dataframes-like languages (e.g. modin/dask/ray/polars/ibis), with Substrait/Calcite/Arrow as their choice of IR. Some links: https://github.com/modin-project/modin, https://github.com/dask/dask/issues/8980, https://news.ycombinator.com/item?id=16510610, https://news.ycombinator.com/item?id=35521785

I broadly classify them as such since the former has a stronger disposition towards linear/tensor-algebra, while the latter towards relational algebra, and it isn't yet clear (to me) how well innovations in one carry over to the other (if they do), and hence I'm also curious to hear more about proposals for a unified language across linalg and relational alg (e.g. https://news.ycombinator.com/item?id=36349015).

I'm particularly interested in pandas precisely because it seems to be right at the intersection of both forms of algebra (and draws a strong reaction from people who are familiar/comfortable with one community and not the other). See e.g. https://datapythonista.me/blog/pandas-20-and-the-arrow-revol... and https://wesmckinney.com/blog/apache-arrow-pandas-internals/

XLA is a domain-specific compiler for linear algebra. Triton generates and compiles an intermediate representation for tiled computation; this IR allows more general functions and also claims higher performance.
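To make "tiled computation" concrete, here is a pure-NumPy sketch of the schedule that Triton-style IRs describe (the tile size and shapes are arbitrary illustrations, not Triton's actual API): each program instance computes one small output tile, so the working set at every step fits in fast shared/register memory.

```python
import numpy as np

def tiled_matmul(a, b, tile=16):
    """Block-tiled matrix multiply, written out to show the tiling idea.

    Each (i, j) output tile is accumulated from tile-sized pieces of A
    and B, so the per-step working set stays small and cache-friendly.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=a.dtype)
            for p in range(0, k, tile):
                # One tile-sized multiply-accumulate step.
                acc += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
            out[i:i+tile, j:j+tile] = acc
    return out

a = np.random.rand(48, 40)
b = np.random.rand(40, 56)
print(np.allclose(tiled_matmul(a, b), a @ b))  # prints True
```

In Triton the loop nest over tiles becomes the grid of program instances and the tile body becomes the kernel, but the decomposition is the same.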

obligatory reference to the family of work: https://github.com/merrymercy/awesome-tensor-compilers

Compiling from high-level lang to GPU is a huge problem, and we greatly appreciate efforts to solve it.

If I understand correctly, this (CM) allows for C-style fine-level control over a GPU device as though it were a CPU.

However, it does not appear to address data transit, which is critical for performance. Compilation and operator fusion to minimize transit are possibly more important. See Graphcore Poplar, TensorFlow XLA, ArrayFire, PyTorch Glow, etc.
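The fusion point can be shown without any framework: unfused, every elementwise op reads and writes a full array (a round trip through memory); fused, one loop does all the work per element. A toy NumPy/Python sketch, with invented ops and sizes:

```python
import numpy as np

x = np.random.rand(10_000)

# Unfused: three separate "kernels", each materializing a full
# intermediate array -- three extra round trips through memory.
def unfused(x):
    a = x * 2.0        # kernel 1: read x, write a
    b = a + 1.0        # kernel 2: read a, write b
    return np.sqrt(b)  # kernel 3: read b, write out

# Fused: one loop, one read and one write per element, no intermediates.
# This is the loop a fusing compiler (XLA, Glow, TVM, ...) can generate
# from the same chain of ops; it is written out by hand here for clarity.
def fused(x):
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = (x[i] * 2.0 + 1.0) ** 0.5
    return out

print(np.allclose(unfused(x), fused(x)))  # prints True
```

Same arithmetic, a third of the memory traffic -- and on accelerators, where compute vastly outpaces memory bandwidth, that traffic is usually the bottleneck.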

Further, this obviously only applies to Intel GPUs, so investing time in utilizing low-level control is possibly a hardware dead-end.

The dream world for programmers is one where data transit and hardware architecture are taken into account without living inside a proprietary DSL. Conversely, it is obviously against hardware manufacturers' interests to create this.

Is MLIR / LLVM going to solve this? This list has been interesting to consider:

https://github.com/merrymercy/awesome-tensor-compilers