What does HackerNews think of awesome-tensor-compilers?

A list of awesome compiler projects and papers for tensor computation and deep learning.

> So long as Pytorch only practically works with Nvidia GPUs, everything else is little more than a rounding error.

This is changing.

https://github.com/merrymercy/awesome-tensor-compilers

There are more and better projects that can compile an existing PyTorch codebase into a more optimized format for a range of devices. Triton (which is part of PyTorch), TVM, and the MLIR-based efforts (like Torch-MLIR or IREE) are the big ones, but there are smaller fish like GGML and Tinygrad, and more narrowly focused projects like Meta's AITemplate (which works on AMD datacenter GPUs).

Hardware is in a strange place now... It feels like everyone but Cerebras and AMD/Intel was squeezed out, but with all the money pouring in, I think this is temporary.

> I really want to do some great work and help people.

Have you looked into ML compilation?

https://github.com/merrymercy/awesome-tensor-compilers

IMO there is low hanging fruit in the space between high performance ML compilers/runtimes and the actual projects people use. If you practice porting projects you use to these frameworks, that would give you a massive performance edge.

While not all of these are "languages" per se, I am excited about the various ML compilation efforts:

https://github.com/merrymercy/awesome-tensor-compilers

Modern ML training/inference is inefficient, and lacks any portability. These frameworks are how that changes...

As random examples, TVM runs LLaMA on Vulkan faster than PyTorch CUDA, and AITemplate almost doubles the speed of Stable Diffusion. Triton somewhat speeds up PyTorch training in the few repos that use it now, and should help AMD hardware even more than Nvidia.

(This is more of a link dump than a paper discussion.)

For the line of inquiry w.r.t. tensor compilers and MLIR/LLVM (linalg, polyhedral, [sparse_]tensor, etc.), I personally found the following really helpful: https://news.ycombinator.com/item?id=25545373 (links to a survey), https://github.com/merrymercy/awesome-tensor-compilers

I also have an interest in the community more widely associated with pandas/dataframes-like languages (e.g. modin/dask/ray/polars/ibis), with Substrait/Calcite/Arrow as their choice of IR. Some links: https://github.com/modin-project/modin, https://github.com/dask/dask/issues/8980, https://news.ycombinator.com/item?id=16510610, https://news.ycombinator.com/item?id=35521785

I broadly classify them as such since the former has a stronger disposition towards linear/tensor-algebra, while the latter towards relational algebra, and it isn't yet clear (to me) how well innovations in one carry over to the other (if they do), and hence I'm also curious to hear more about proposals for a unified language across linalg and relational alg (e.g. https://news.ycombinator.com/item?id=36349015).

I'm particularly interested in pandas precisely because it seems to be right at the intersection of both forms of algebra (and draws a strong reaction from people who are familiar/comfortable with one community and not the other). See e.g. https://datapythonista.me/blog/pandas-20-and-the-arrow-revol... and https://wesmckinney.com/blog/apache-arrow-pandas-internals/

XLA is a domain-specific compiler for linear algebra. Triton generates and compiles an intermediate representation for tiled computation; this IR allows more general functions and also claims higher performance.
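To make "tiled computation" concrete, here is a pure-NumPy sketch of the schedule that Triton-style IRs describe (the tile size and shapes are arbitrary illustrations, not Triton's actual API): each program instance computes one small output tile, so the working set at every step fits in fast shared/register memory.

```python
import numpy as np

def tiled_matmul(a, b, tile=16):
    """Block-tiled matrix multiply, written out to show the tiling idea.

    Each (i, j) output tile is accumulated from tile-sized pieces of A
    and B, so the per-step working set stays small and cache-friendly.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=a.dtype)
            for p in range(0, k, tile):
                # One tile-sized multiply-accumulate step.
                acc += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
            out[i:i+tile, j:j+tile] = acc
    return out

a = np.random.rand(48, 40)
b = np.random.rand(40, 56)
print(np.allclose(tiled_matmul(a, b), a @ b))  # prints True
```

In Triton the loop nest over tiles becomes the grid of program instances and the tile body becomes the kernel, but the decomposition is the same.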

obligatory reference to the family of work: https://github.com/merrymercy/awesome-tensor-compilers

Compiling from high-level lang to GPU is a huge problem, and we greatly appreciate efforts to solve it.

If I understand correctly, this (CM) allows for C-style fine-level control over a GPU device as though it were a CPU.

However, it does not appear to address data transit, which is critical for performance. Compilation and operator fusion to minimize transit are possibly more important. See Graphcore Poplar, TensorFlow XLA, ArrayFire, PyTorch Glow, etc.
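The fusion point can be shown without any framework: unfused, every elementwise op reads and writes a full array (a round trip through memory); fused, one loop does all the work per element. A toy NumPy/Python sketch, with invented ops and sizes:

```python
import numpy as np

x = np.random.rand(10_000)

# Unfused: three separate "kernels", each materializing a full
# intermediate array -- three extra round trips through memory.
def unfused(x):
    a = x * 2.0        # kernel 1: read x, write a
    b = a + 1.0        # kernel 2: read a, write b
    return np.sqrt(b)  # kernel 3: read b, write out

# Fused: one loop, one read and one write per element, no intermediates.
# This is the loop a fusing compiler (XLA, Glow, TVM, ...) can generate
# from the same chain of ops; it is written out by hand here for clarity.
def fused(x):
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = (x[i] * 2.0 + 1.0) ** 0.5
    return out

print(np.allclose(unfused(x), fused(x)))  # prints True
```

Same arithmetic, a third of the memory traffic -- and on accelerators, where compute vastly outpaces memory bandwidth, that traffic is usually the bottleneck.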

Further, this obviously only applies to Intel GPUs, so investing time in utilizing low-level control is possibly a hardware dead-end.

The dream world for programmers is one where data transit and hardware architecture are taken into account without living inside a proprietary DSL. Conversely, it is obviously against hardware manufacturers' interests to create this.

Is MLIR / LLVM going to solve this? This list has been interesting to consider:

https://github.com/merrymercy/awesome-tensor-compilers