What does HackerNews think of wuffs?

Wrangling Untrusted File Formats Safely

Language: C

#1 in Parsing
#4 in Python
There are already Huffman decoding and some parts of the WebP algorithms in https://github.com/google/wuffs (a language that finds missing bounds checks at compile time). On the contrary, according to the README, this language allows you to write more optimized code (compared to C). WebP decoding is stated as a mid-term target in the roadmap.

> Chrome doesn't yet support a memory safe language

In addition to the safety features you mentioned, Chrome supports Wuffs, a memory safe programming language that supports compile-time correctness checks, designed for writing parsers for untrusted files. I don’t think it existed at the start of the webp project either, but that’s what I would expect the webp parser to be written in, over Rust or a garbage collected language.

https://github.com/google/wuffs

Specifically, since performance is crucial for this type of work, it should be written in WUFFS. WUFFS doesn't emit bounds checks (as Java does, and as Rust would where it's unclear why something should be in bounds at runtime); it just rejects programs where it can't see why the indexes are in bounds.

https://github.com/google/wuffs

You can write the same checks explicitly and meet this requirement. More likely, because you believe you're producing a high-performance piece of software which doesn't need checks, you'll instead be pulled up when the WUFFS tooling won't accept your code, and discover you got it wrong.

This is weaker than full-blown formal verification, but not in the way that matters for program safety, so it is a big improvement on humans writing LGTM.
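A minimal C sketch of the idea, not Wuffs syntax and not code from the Wuffs repository: there are two common ways to make it obvious that an index is in bounds, an explicit check or a mask that narrows the range. The names `lookup_checked`, `lookup_masked` and `TABLE_SIZE` are made up for illustration. A C compiler would accept either function with the check or mask deleted; per the comment above, the Wuffs tooling would not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256u /* hypothetical table size; must be a power of two */

/* Explicit check: the index is shown to be in range before it is used. */
int lookup_checked(const uint8_t table[TABLE_SIZE], size_t i, uint8_t *out) {
    assert(out != NULL);
    if (i >= TABLE_SIZE) {
        return -1; /* reject instead of reading out of bounds */
    }
    *out = table[i];
    return 0;
}

/* Masking: (i & 255) can only be 0..255, so no branch is needed. */
uint8_t lookup_masked(const uint8_t table[TABLE_SIZE], size_t i) {
    return table[i & (TABLE_SIZE - 1u)];
}
```

The masked form is the kind of thing you end up writing in hot loops: it costs one AND instruction rather than a branch, and the range argument is visible right at the index expression.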

> parsing encoded files tends to introduce vulnerabilities

If we are talking about binary formats, now there are systematic solutions like https://github.com/google/wuffs that protect against vulnerabilities. But SQLite is not just a format - it's an evolving ecosystem with constantly added features. And the most prominent issue was not even in core, it was in FTS3. What will SQLite add next? More json-related functions? Maybe BSON? It is useful, but does not help in this situation.

Regarding traces, there are many forensics tools and even books about forensic analysis of SQLite databases. In a well-designed format such tools should not exist in the first place. This is a hard requirement: if it requires rewriting the whole file, then so be it.

I hadn’t seen wuffs before. Interesting approach to writing safe libraries:

Wuffs is not a general purpose programming language. It is for writing libraries, not programs. Wuffs code is hermetic and can only compute (e.g. convert "compressed bytes" to "decompressed bytes"). It cannot make any syscalls (e.g. it has no ambient authority to read your files), implying that it cannot allocate or free memory (and is therefore trivially safe against things like memory leaks, use-after-frees and double-frees).

https://github.com/google/wuffs
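To illustrate what "can only compute" looks like from the caller's side, here is a small C sketch of a decoder in that style. It is not the Wuffs API; the toy run-length format (pairs of count byte, value byte), `rle_decode` and the `rle_status` values are all invented for this example. The function touches only the two buffers it is handed, never allocates, and reports a status instead of growing anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch. */
typedef enum {
    RLE_OK = 0,
    RLE_SHORT_WRITE = 1,     /* caller's dst buffer is too small */
    RLE_TRUNCATED_INPUT = 2  /* src ended in the middle of a pair */
} rle_status;

/* Decodes (count, value) byte pairs from src into dst.
 * All memory is supplied by the caller; *written receives the
 * number of bytes stored in dst, even on error. */
rle_status rle_decode(const uint8_t *src, size_t src_len,
                      uint8_t *dst, size_t dst_len, size_t *written) {
    assert(written != NULL);
    size_t si = 0; /* read position in src */
    size_t di = 0; /* write position in dst */
    while (si < src_len) {
        if (src_len - si < 2) {
            *written = di;
            return RLE_TRUNCATED_INPUT;
        }
        size_t count = src[si];
        uint8_t value = src[si + 1];
        if (dst_len - di < count) {
            *written = di;
            return RLE_SHORT_WRITE;
        }
        for (size_t k = 0; k < count; k++) {
            dst[di++] = value;
        }
        si += 2;
    }
    *written = di;
    return RLE_OK;
}
```

Because the caller owns both buffers, leaks, use-after-frees and double-frees cannot originate inside the decoder; the worst it can do is return an error status.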

There are efforts to do that, notably https://github.com/google/wuffs

RLBox is another interesting option that lets you sandbox C/C++ code.

I think the main reason is that security is one of those things that people don't care about until it is too late to change. They get to the point of having a fast PDF library in C++ that has all the features. Then they realise that they should have written it in a safer language but by that point it means a complete rewrite.

The same reason not enough people use Bazel. By the time most people realise they need it, they've already implemented a huge build system using Make or whatever.

For this work we don't need a general purpose language like Rust.

WUFFS is a special purpose language for Wrangling Untrusted File Formats Safely:

WUFFS pays a high price (loss of generality) for a valuable reward (compile time assurance of memory safety, very high performance) and it makes no sense for people to hand roll this sort of software in C when they should use WUFFS.

https://github.com/google/wuffs

Here's an off-topic answer.

Depends on what you want your toy language to do and what sort of runtime support you'd like to lean on.

JVM is pretty good for a lot of script-y languages, does impose overhead of having a JVM around. Provides GC, Threads, Reflection, consistent semantics. Tons of tools, libraries, support.

WebAssembly is constrained (for running-in-a-browser safety reasons) but then you get to run your code in a browser, or as a service, etc, and Other People are working hard on the problem of getting your WA to go fast. That used to be a big reason for using JVM, but it turns out that Security Is Darn Hard.

I have used C in the (distant) past as an IL, and that works up to a point; implementing garbage collection can be a pain if that's a thing that you want. C compilers have had a lot of work on them over the years, and you also have access to some low-level stuff, so if you were, e.g., trying to come up with a little language that had super-good performance, C might be a good choice. (See also [Wuffs](https://github.com/google/wuffs), by Nigel Tao et al. at Google.)

A suggestion, if you do target C -- don't work too hard to find isomorphisms between C's data structures and YourToyLang's data structures. Back around 1990, I did my C-generating compiler for Modula-3, and a friend at Xerox PARC used C as a target for Cedar Mesa, and Hans used it in a lower-level way (so I was mapping between M-3 records and C structs, for example, Hans was not) and the lower-level way worked better -- i.e., I chose poorly. It worked, but lower-level worked better.

If you are targeting a higher-level language, Rust and Go both seem like interesting options to me. Both have the disadvantage that they are still changing slightly but you get interesting "services" from the underlying VM -- for Rust, the borrow checker, plus libraries, for Go, reflection, goroutines, and the GC, plus libraries.

Rust should get you slightly higher performance, but I'd worry that you couldn't hide the existence of the borrow checker from your toy language, especially if you wanted to interact with Rust libraries from YTL. If you wanted to learn something vaguely publishable/wider-interesting, that question right there ("can I compile a TL to Rust, touch the Rust libraries, and not expose the borrow checker? No+what-I-tried/Yes+this-worked") is not bad.

I have a minor conflict of interest suggesting Go; I work on Go, usually on the compiler, and machine-generated code makes great test data. But regarded as a VM, I am a little puzzled why it hasn't seen wider use, because the GC is great (for lower-allocation rates than Java however; JVM GC has higher throughput efficiency, but Go has tagless objects, interior pointer support, and tiny pause times. Go-the-language makes it pretty easy to allocate less.) Things Go-as-a-VM currently lacks:

- tail call elimination (JVM same)

- don't ever construct a pointer to Object+sizeof(Object) (i.e., to the first byte past the end) (JVM same)

- defined semantics for racy programs that don't use atomics (structures can tear; there is a race detector, use it). (JVM same-ish; racy programs suck)

- integer overflow checking (JVM same)

- consistent conversion from +/-FPInf to integers

- if you're not careful about expressing floating point a+b*c, you'll get the platform multiply-add rounding

- signaling/quiet NaN representation follows the platform

Some of these are on my list of Would-Be-Nice to fix, but that doesn't mean they will happen, because there are O(zero) people using Go as a VM, as far as I know, so their problems have zero weight. Tail-call-elimination in particular would be hard, and I see no substantial benefit in solving the pointer-past-end problem (it's a minor issue in our own code generation, and we deal with it).

Reminds me a bit of WUFFS (https://github.com/google/wuffs), although a bit more structured (or high level)

That sounds a bit like what WUFFS is doing

WUFFS: https://github.com/google/wuffs

> A cooler feature would be requiring the compiler to prove the addition wouldn’t overflow.

Wuffs is specialized enough to do exactly that (https://github.com/google/wuffs).
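For a rough picture of what "proving the addition wouldn't overflow" asks of the programmer, here is a C sketch (not Wuffs code; `add_u32_checked` and `add_bytes` are hypothetical names). Either the code establishes the bound with a comparison before adding, or the operand types are narrow enough that the bound holds for every input. C does not enforce either; the comments only show the reasoning a checker would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Adds a and b only when the sum is known to fit in uint32_t.
 * After the comparison, a <= UINT32_MAX - b, so a + b cannot wrap. */
bool add_u32_checked(uint32_t a, uint32_t b, uint32_t *sum) {
    assert(sum != NULL);
    if (a > UINT32_MAX - b) {
        return false; /* would overflow; refuse */
    }
    *sum = a + b;
    return true;
}

/* Range narrowing: both operands are 0..255, so the sum is at most 510
 * and fits in uint32_t for every possible input, with no runtime check. */
uint32_t add_bytes(uint8_t a, uint8_t b) {
    return (uint32_t)a + (uint32_t)b;
}
```

The second function is the interesting case for a parser: if the types already bound the values, the proof is free and no check is emitted.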

There might be a few others. https://github.com/google/wuffs for example isn't general purpose or mainstream, but it's meant to solve practical problems, so I don't think I'd call it a research language. Opinions may vary.

[WUFFS](https://github.com/google/wuffs) is made for stuff like this, and it has a library available as transpiled C code.

Program and data aren't really different, philosophically. On some level this even applies to people. When someone teaches you French is that program or data? Is it just data? Why can you now understand French then? Or if it's program, how does that work, who taught the teacher how to program you?

So, our best effort is to constrain what certain data can do when we process it, in the hope that this prevents surprising negative consequences like a PDF that steals privileged information and sends it elsewhere.

Notice that, in some sense, a PDF which just contains a photograph of your wife tied to a chair and holding today's newspaper, plus human readable text like, "We have your wife Sarah and all three kids Beth, Jim and Amanda. We are watching. Do not try to call for help. Email the privileged information to [email protected] or we will kill your family" is also potentially effective at doing this, but we would not usually consider that an exploit in this context.

One irritation in this space is that programmers love General Purpose Programming Languages. The idea of the general purpose language is that it can do anything. But the problem in this sort of situation is that we don't want programs which can do anything, in fact doing anything is our worst case scenario. We actually want Special Purpose Programming Languages. We want to write our PDF data processing software in a language that even if we were trying can't do the things that should never happen as a result of processing a PDF.

This is the purpose of languages like WUFFS: https://github.com/google/wuffs

You can't write a WUFFS program to, for example, email anything to [email protected] even if you desperately needed to, which means you definitely won't accidentally write a program which can email the privileged information to the crooks when fed a PDF. Of course the PDF mentioned earlier with the kidnap note inside it could still work. And also of course making a PDF renderer out of WUFFS would be a really big ask. WUFFS-the-library today can render PNG, GIF, BMP but notably not yet JPEG. But it's clearly possible for something like PDF rendering to happen under these constraints. Nobody ordinarily viewing a PDF wants it to do arbitrary stuff.

Idk, but there's an interesting project, wuffs [1]: a programming language specifically aimed at efficiently and safely parsing document structure. There is a PNG decoder [2].

1. https://github.com/google/wuffs

2. https://news.ycombinator.com/item?id=26714831