What does HackerNews think of vast?

VAST is an experimental compiler pipeline designed for program analysis of C and C++. It provides a tower of IRs as MLIR dialects, letting you choose the best-fit representation for a program analysis or further program abstraction.

Language: C++

#61 in C
#30 in C++
I've been researching this pretty deeply for the last few years, and I've come to the conclusion that, without a complete redesign, most popular programming languages cannot have direct control of these optimizations in an ergonomic manner.

The reason I think this is that most languages target C or LLVM, and both C and LLVM have fundamentally lossy compilation processes.

To get around this, you'd need a hodgepodge of preprocessor directives, or a completely different approach.

I found a cool project that uses a "Tower of IRs" to re-establish source-to-binary provenance, which seems to me to be on the right track:

https://github.com/trailofbits/vast

I'd definitely like to see the compilation process become more transparent and easier to work with.

At Trail of Bits, we've been working on this type of IR for C and C++ code [1]. We operate as a kind of Clang middle end, taking in a Clang AST and emitting Clang-compatible LLVM IR out the other end. In between, we progressively lower from a high-level MLIR dialect down to LLVM.

[1] https://github.com/trailofbits/vast

Nod. VAST[1] for LLVM MLIR was mentioned in recent discussion[2]. Hmm, so perhaps with a "degree of absorption" axis, from say idiomatic-reimplementation to might-as-well-be-a-shell-call, with also transliterated-port and c-library-api, perhaps one key threshold might be sufficiently-digested-to-permit-refactoring? Puzzlement. With the pain of dealing with LLIR on one end, to an imaginary scraping of LLVM to a non-C++ knowledge representation sufficient for emitting native implementations... I'm unclear on the space's cost/benefit transitions.

Or consider large-scale code analysis and refactoring tooling - perhaps having that should be a bootstrap target, a unit of absorption is app repo, and a key threshold is ability to refactor source? Or not repo, but arbitrary scale. So Language Server Protocol blended with dynamic loading and calling convention? Smalltalkish "live" image environment with piles of mutating forked repos? That sort of cross-checks - ask a "programmer apprentice" ai, err, or simply a programming team, "I'd like capabilities foo with characteristics bar", a set of forked repos might be an unremarkable outcome. So that might suggest a language bootstrap target of ffi plus refactoring-LSP client?

FFI is an unremarkable bootstrap target, and a refactoring-LSP client gives control over both sides, so maybe next, how to move code across the line? AST scraping and transliteration? Polyglot direct memory access to data types? Suggesting as targets maybe high-end "can exercise and analyze compiler output" ffi, and rich AST tooling? Language implementation, with its specs and test suites, can be a nice context for such work. Which might bring us back around to an emphasis on early implementation of other languages, but with maybe an increased focus on interoperation with existing implementations? Control of config, build, and linkage, might also need emphasis?

Hmm, fun, tnx! [1] https://github.com/trailofbits/vast [2] https://news.ycombinator.com/item?id=33387149

At Trail of Bits, we are creating a new compiler front/middle end for Clang called VAST [1]. It consumes Clang ASTs and creates a high-level, information-rich MLIR dialect. Then, we progressively lower it through various other dialects, eventually down to the LLVM dialect in MLIR, which can be translated directly to LLVM IR.

Our goals with this pipeline are to enable static analyses that can choose the right abstraction level(s) for their purposes and, using provenance, to cross abstraction levels and relate results back to source code.

Neither Clang ASTs nor LLVM IR alone meet our needs for static analysis. Clang ASTs are too verbose and lack explicit representations for implicit behaviours in C++. LLVM IR isn't really "one IR": it's two IRs (LLVM proper, and metadata), where LLVM proper is an unspecified family of dialects (-O0, -O1, -O2, -O3, then all the arch-specific stuff). LLVM IR also isn't easy to relate back to source, even in the presence of maximal debug information. Clang's codegen process does ABI-specific lowering: it takes high-level types/values and transforms them to be more amenable to storage in target-CPU locations (e.g. registers). This actively works against relating information across levels, something we want to solve with intermediate MLIR dialects.

Beyond our static analysis goals, I think an MLIR-based setup will be a key enabler of library-aware compiler optimizations. Right now, library-aware optimizations are challenging because Clang ASTs are hard to mutate, and by the time things are in LLVM IR, the abstraction boundaries provided by libraries are broken down by optimizations (e.g. inlining, specialization, folding), forcing optimization passes to reckon with the mechanics of how libraries are implemented.

We're very excited about MLIR, and we're pushing full steam ahead with VAST. MLIR is a technology that we can use to fix a lot of issues in Clang/LLVM that hinder really good static analysis.

[1] https://github.com/trailofbits/vast