> AMA!

I have a few questions. Hermit's README says:

> Instead, in order to provide complete determinism, the user should provide a fixed file system base image (e.g., with Docker)

How are you sanitizing the result of stat(), for example?

I'm guessing you're already aware of this, but fixed file system images are not sufficient to guarantee deterministic behavior by filesystems.

Specifically, inode numbers are not guaranteed to be assigned in any order. So how do you return deterministic inode numbers in stat() calls?

I'm sure you're also aware of this, but other filesystem results are also not guaranteed to be deterministic. For example, the `st_blocks` and `st_blksize` fields in `struct stat` can change between stat() calls when nothing else happens from the user-space point of view except time passing (this happens in ZFS, for example, due to delayed allocation).
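(To make the concern concrete, here is a small stand-alone check one could run on such a filesystem; the path and the sleep duration are just placeholders:)

```rust
use std::fs;
use std::os::unix::fs::MetadataExt;
use std::thread::sleep;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let path = "freshly-written-file"; // placeholder: a file written just before this runs
    let before = fs::metadata(path)?;
    sleep(Duration::from_secs(10)); // give delayed allocation time to flush
    let after = fs::metadata(path)?;
    // On ZFS (for example), st_blocks can differ between these two calls even
    // though userspace did nothing to the file in the meantime.
    println!("st_blocks:  {} -> {}", before.blocks(), after.blocks());
    println!("st_blksize: {} -> {}", before.blksize(), after.blksize());
    Ok(())
}
```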

statvfs() is another problematic call which would have to be heavily sanitized.

As another example, there are filesystems that generate a random seed for every directory that is created, to prevent applications from exploiting hash table weaknesses and causing a denial-of-service when creating many directory entries that hash to the same value.

This random seed can have the side effect of readdir() calls returning directory entries in a different order based on the seed, even if the directory was created in the exact same way (with identical directory entries, created in the same order).

It seems like this would require reading the entire directory and then sorting the results, even when doing a single readdir() call to read a single directory entry.

Similarly, telldir() / seekdir() can also be problematic because the cookie value returned by the filesystem in telldir() is opaque and would be different for different directory seed values.

This seems like it would require Hermit to maintain a full in-memory view of all the directory entries of a directory that is opened and is being read, which also means that this in-memory view can become out-of-date between readdir() / telldir() / seekdir() calls by one process and calls that modify the directory by another process (Edit: I guess Hermit can synchronize this view easily since it can assume it has full control of all processes inside the container).

It also seems like Hermit would need to re-read the entire directory when rewinddir() is called, to avoid introducing concurrency bugs that did not previously exist, like those described in the paragraph above (Edit: again, this is probably not necessary since Hermit can maintain a synchronized view of a directory on its own).

Reading the directory also seems to imply that you'd have to assign deterministic inode numbers for all directory entries, which also seems non-trivial.

Do you think this is an accurate assessment of the situation? How does Hermit handle all of this?

Also, CPUs can (according to the manuals, at least) behave non-deterministically when executing instructions in certain undefined conditions. How do you handle this?

Last question: how do you sanitize the RDTSC CPU instruction?

Thanks for all the questions! Whew, here goes.

> How are you sanitizing the result of stat(), for example?

Ah, that part's the same as described in the ASPLOS'20 paper (https://dl.acm.org/doi/10.1145/3373376.3378519). Briefly, we present a somewhat sanitized version of the file system. For example, if you do `hermit run -- ls -l`, you'll see that you are root, and everything owned by you is owned by root (and everything else is currently owned by nfsnobody). The file mod/access times are all set to whatever you provide for `--epoch`, because we think you mainly care about the file system contents, not the mod times. (Mod times do CHANGE as execution progresses though, otherwise programs like `make` become confused.)
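(A minimal sketch of what that sanitization might look like at the syscall boundary, using the libc crate's `stat` layout; the constants and the nfsnobody mapping are assumptions, and this is not Hermit's actual code:)

```rust
use libc::stat;

/// Value passed via `--epoch`, as seconds since the Unix epoch (assumed here).
const EPOCH_SECS: i64 = 1_577_836_800;

/// Rewrite a real `stat` result before copying it back into guest memory.
fn sanitize_stat(st: &mut stat, owned_by_user: bool) {
    // Ownership: the user's files appear owned by root; everything else by nfsnobody.
    let (uid, gid) = if owned_by_user { (0, 0) } else { (65534, 65534) };
    st.st_uid = uid;
    st.st_gid = gid;
    // Timestamps start out pinned to the configured epoch (they can still be
    // advanced deterministically as the run progresses, so `make` stays happy).
    st.st_atime = EPOCH_SECS;
    st.st_mtime = EPOCH_SECS;
    st.st_ctime = EPOCH_SECS;
    st.st_atime_nsec = 0;
    st.st_mtime_nsec = 0;
    st.st_ctime_nsec = 0;
    // Allocation details the filesystem may change behind our back: report a
    // boring constant and a deterministic function of the logical size instead.
    st.st_blksize = 4096;
    st.st_blocks = (st.st_size + 511) / 512;
}
```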

> fixed file system images are not sufficient to guarantee deterministic behavior by filesystems.

Indeed! Hence we present only the sanitized view of the filesystem, so that we can treat each filesystem as its abstract contents (files as bitstrings, and directories as sorted lists of files). For inodes, we translate to and from virtual inodes, rather than reveal the real inodes of the file system.
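(Roughly, the translation can be as simple as handing out small virtual inode numbers in the order the guest first sees each real (device, inode) pair, which is deterministic because guest execution itself is deterministic. A hypothetical sketch, not Hermit's actual data structure:)

```rust
use std::collections::HashMap;

#[derive(Default)]
struct InodeTable {
    forward: HashMap<(u64, u64), u64>, // (st_dev, st_ino) -> virtual inode
    next: u64,
}

impl InodeTable {
    /// Return a stable virtual inode for a real (dev, ino) pair, allocating
    /// the next number the first time the guest observes that pair.
    fn virtualize(&mut self, dev: u64, ino: u64) -> u64 {
        if let Some(&v) = self.forward.get(&(dev, ino)) {
            return v;
        }
        self.next += 1;
        self.forward.insert((dev, ino), self.next);
        self.next
    }
}
```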

If there's a flaw in this abstraction of the file system... we'd love to find it. Mainly we've been using it with ZFS and Meta's EdenFS virtual file system so far.

> `st_blocks` and `st_blksize` fields in `struct stat` can change between stat() calls when nothing else happens

Yes, now you feel our pain. We've been plugging these kinds of leaks, and there are surely some we missed. We report boring constant values where we can get away with it (for st_blksize, for example). Irreproducible "counterexample" programs are a welcome form of contribution ;-)!

> As another example, there are filesystems that generate a random seed

Ah, that's an interesting one. Since we determinize (only) userspace, any random bytes that get into guest memory are deterministic (getrandom, /dev/random, etc.), but we won't see internal nondeterminism inside the filesystem; instead we rely on sanitizing the way the filesystem appears to userspace. But if the seed only affects ordering, we should be ok, because we sort before returning from the getdents syscall.
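(The effect, in sketch form: read the whole directory, sort by raw bytes, and hand the guest that order regardless of the filesystem's per-directory seed. Hypothetical code, not the actual getdents handler:)

```rust
use std::ffi::OsString;
use std::fs;
use std::io;
use std::path::Path;

/// Produce the directory listing in a deterministic order.
fn deterministic_listing(dir: &Path) -> io::Result<Vec<OsString>> {
    let mut names = Vec::new();
    for entry in fs::read_dir(dir)? {
        names.push(entry?.file_name());
    }
    // Sort by raw bytes (OsString ordering on Unix), independent of the
    // filesystem's hash seed or on-disk insertion order.
    names.sort();
    Ok(names)
}
```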

> reading the entire directory and then sorting the results, even when doing a single readdir() call to read a single directory entry.

Yes, unfortunately. That's not great for performance, and we need to implement a caching scheme to at least amortize the overhead of this sort. Still, we will need to trigger the sort even on a single read, as you say. So there is a pathological case there -- listing one item from a ginormous directory -- that would run much slower than the native/nondeterministic version.

> telldir() / seekdir()

Well, these (and readdir) are not syscalls. So if we've handled getdents correctly we should be ok here. But this is a good suggestion for tests we need to add that stress these!

> in-memory view can become out-of-date in-between readdir() / telldir() / seekdir() calls by one process

Yes, we have a "non-interference" assumption: either you're running with a container-private file system (e.g. inside Docker), or none of your directories are concurrently messed with by processes outside the container.

> assign deterministic inode numbers for all directory entries, which also seems non-trivial.

Yep, that's what we do!

> Also, CPUs can (according to the manuals, at least) behave non-deterministically when executing instructions in certain undefined conditions. How do you handle this?

We require a certain level of good behavior from the program. To be practical, we aim to work for realistic programs but not necessarily adversarial ones. One way you can break our sandboxing, for example, is to run CPUID, learn that the processor does not support instruction X, but then execute X anyway. Relatedly, we can trap RDTSC in userspace, but not RDRAND.

If someone wants to use Hermit for, say, reverse engineering malware, then we need a Reverie backend that is hardened by emulating/rewriting the instruction stream carefully to protect against unsupported instructions or undefined conditions.

> Last question: how do you sanitize the RDTSC CPU instruction?

That one we can trap in userspace and return a deterministic virtual time. Currently that time is a linear combination of the branches and system calls executed by all threads up to the current moment. For example, if you do RDTSC in a loop, you will see time advancing.
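(In sketch form -- the coefficients below are made up, and this is not Hermit's actual formula -- the trapped RDTSC returns something like:)

```rust
/// Deterministic "clock" derived from work the guest has already done.
struct VirtualClock {
    retired_branches: u64, // conditional branches retired, summed over all threads
    syscalls: u64,         // system calls executed so far, all threads
}

impl VirtualClock {
    /// Value returned to the guest when its RDTSC instruction traps.
    fn rdtsc(&self) -> u64 {
        const CYCLES_PER_BRANCH: u64 = 3;       // assumed weight
        const CYCLES_PER_SYSCALL: u64 = 1_000;  // assumed weight
        self.retired_branches * CYCLES_PER_BRANCH + self.syscalls * CYCLES_PER_SYSCALL
    }
}
```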

But you cannot use RDTSC to recreate fine-grained timers and mount side-channel attacks under Hermit. (Side-channel attacks in general are only possible after first breaking the sandboxing in some way, and we don't know of a specific attack that will do the trick yet.)

> we need to implement a caching scheme to at least amortize the overhead of this sort. Still, we will need to trigger the sort even on a single read, as you say. So there is a pathological case there -- listing one item from a ginormous directory -- that would run much slower than the native/nondeterministic version.

Actually, it just occurred to me that if you decide to go down this path because of this specific pathological case, the cache you're mentioning could simply be a file that Hermit maintains to represent a sorted view of a directory (which could be deleted after Hermit finishes running if you so desire, although with some care you could keep it between runs!).

So you'd only have to pay the cost of reading the entire directory once per directory -- specifically, the first time some program reads it.

You could even pre-generate such cached views for all directories in the container when Hermit starts, if you'd prefer to pay this cost at start-up (although I don't see why you would).

All subsequent operations relating to a directory could use the corresponding file as a cached and sorted view of it (and such operations that modify the directory would have to keep the file in sync, of course).

So even if a program reads a directory, then closes it, and some other program later opens and reads it, the second read would use the cache file instead and would not have to re-read the whole directory.

This would be even better for directories created by programs running inside Hermit, since the cache file would be created along with the directory itself, so the entire directory would never need to be scanned in one go.

It would also mean that Hermit wouldn't risk using large amounts of memory just for this cache, and wouldn't have to "forget" cache entries while running to limit memory usage, since the cache would be stored persistently on disk (and would still benefit from the in-memory caching done by the underlying filesystem / kernel).

Just an idea.
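(For concreteness, a minimal sketch of that cache-file idea -- the sidecar file name, its format, and its placement inside the directory are all made up, and a real design would keep the cache out of the tree it describes:)

```rust
use std::ffi::OsString;
use std::fs;
use std::io::{self, Write};
use std::os::unix::ffi::{OsStrExt, OsStringExt};
use std::path::{Path, PathBuf};

fn cache_path(dir: &Path) -> PathBuf {
    dir.join(".hermit-dircache") // placeholder name
}

/// Return the sorted listing of `dir`, building the persistent cache on first use.
fn load_or_build_cache(dir: &Path) -> io::Result<Vec<OsString>> {
    if let Ok(bytes) = fs::read(cache_path(dir)) {
        // One NUL-terminated name per entry; it was sorted when it was written.
        return Ok(bytes
            .split(|b| *b == 0)
            .filter(|name| !name.is_empty())
            .map(|name| OsString::from_vec(name.to_vec()))
            .collect());
    }
    // First read of this directory: pay the full scan + sort once, then persist.
    let mut names = Vec::new();
    for entry in fs::read_dir(dir)? {
        names.push(entry?.file_name());
    }
    names.sort();
    let mut out = fs::File::create(cache_path(dir))?;
    for name in &names {
        out.write_all(name.as_bytes())?;
        out.write_all(&[0])?;
    }
    Ok(names)
}
```

(Any operation that creates or removes an entry would of course have to update or invalidate the sidecar file, as noted above.)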

Yeah, it's interesting to think about persisting the state we would need to make the file system more simpatico with Hermit. If we were willing to have a daemon... Meta develops this "Watchman" tool that our build infrastructure uses. I think for existing file systems we could imagine a daemon that watches the directory and caches what we need.

But if we could dream of new file systems, then I want one that is basically btrfs but with O(1) file-level parallel hashes (rather than block-level). Heck, maybe it could even enforce a sorted order on directories for us too. The hashes would be very useful in record-and-replay scenarios, where you could know immediately whether the input files are in your content-addressable blob storage without having to hash them every time.

(We have some hash support in Meta's EdenFS FUSE file system: https://github.com/facebook/sapling)

P.S. about Reproducible Builds -- we pitched this to the community in 2019 (at the Reproducible Builds Summit) and with that ASPLOS'20 paper, but we would be eager to see anyone wanting to pick it up and take it further.