What does HackerNews think of goofys?

a high-performance, POSIX-ish Amazon S3 file system written in Go

Language: Go

#41 in Go
Another interesting discussion: https://stackoverflow.com/questions/14633686/possible-to-sym...

And some cool github links:

- https://github.com/kahing/goofys -- a high-performance, POSIX-ish Amazon S3 file system written in Go

- https://github.com/maxogden/mount-url -- mount a http file as if it was a local file using fuse

> We've had some ideas around using this for distributed querying: in our case, each node responsible for a given partition of a dataset would be able to download just the objects in that partition on the fly (through constraint pruning), so we wouldn't need to manually seed each worker with data.

IMHO, if you're going to do this, I'd recommend not doing this in Postgres itself, but rather doing it at the filesystem level. It's effectively just a tiered-storage read-through cache, and filesystems have those all figured out already.

You know how pgBackRest does "partial restore" (https://pgbackrest.org/user-guide.html#restore/option-db-inc...), by making all the heap files seem to be there, but actually the ones you don't need are sparse files ftruncate(2)'d to the right length to make PG happy? And that this works because PG only cares about DB objects it's not actively querying insofar as making sure they're there under readdir(3) with the expected metadata?

Well, an object-storage FUSE filesystem, e.g. https://github.com/kahing/goofys, would make PG just as happy, because PG could see all the right files as "being there" under readdir(3), even though the files aren't really "there", and PG would block on the first open(2) of each file while goofys fetched the actual object to back the file.

(IIRC PG might open(2) all its files once on startup, just to ensure it can; you can hack around this by modding the origin-object-storage filesystem library to not eagerly "push down" its open(2)s into object fetches — instead just returning a file descriptor connected to a lazy promise for the object — and then have read(2) and write(2) thunk that lazy promise, such that the first real IO done against the virtual file is what ends up blocking to fetch the object.)

So you could just make your pg_base dir into an overlayfs mountpoint for:

• top layer: tmpfs (only necessary if you don't give temp tables their own tablespace)

• middle layer: https://github.com/kahing/catfs

• bottom layer: goofys mount of the shared heap-file origin-storage bucket

Note that catfs here does better than just "fetching objects and holding onto them" — it does LRU cache eviction of origin objects when your disk gets full!

(Of course, this setup doesn't allow writes to the tables held under it. So maybe don't make this your default tablespace, but instead a secondary tablespace that "closed" partitions live in, while "open" partitions live in a node-local tablespace, with something like pg_partman creating new hourly tables, and then pg_cron running a session to note down the old ones and run a VACUUM FREEZE ?; ALTER TABLE ? SET TABLESPACE ?; on them to shove them into the secondary tablespace — which will write through the catfs cache, pushing them down into object storage.)

https://github.com/kahing/goofys/ can detect public buckets automatically and doesn't require prior setup. https://cloud.google.com/storage/docs/gsutil works as well (despite being from Google, it works fine with S3), IIRC.

> 1. FUSE filesystems. This means sending really slow queries to S3, and so you're much better off using compression to get better performance.

As author of https://github.com/kahing/goofys/ I respectfully disagree :-)

I learned Go and wrote https://github.com/kahing/goofys in a month, and learned Rust and wrote https://github.com/kahing/catfs in about the same amount of time.

It did ok for the most part, except it also suggested "goofys" as a tag for https://github.com/kahing/goofys.

Open source project so not quite a "site": submitted goofys (https://github.com/kahing/goofys/) 644 days ago and it got 40 upvotes on HN, and from what I recall it picked up a couple hundred stars on GitHub right after. Now it's approaching 900 stars, with a niche community of users and occasional drive-by contributions.

Compare to catfs (https://github.com/kahing/catfs/), which I recently posted but which did not make it to the front page; right now it's at 14 stars. I would say both projects have similar audiences and comparable complexity, which would mean the HN front page gave goofys a 20x or so boost in terms of GitHub stars.

Note that the first time I posted goofys it did not make it to the front page. @dang emailed me to re-post it, and the second time it was boosted to the front page.

https://github.com/kahing/goofys/

I've been spending what free time I have on this. It started out as a curiosity project to learn Go and to prove that a useful, high-quality s3fs-like tool could be built relatively quickly. These days it's used by everyone from companies moving PBs of storage into S3 to research labs analyzing RNA sequences with 100s of machines.

A couple things I hope to get done this year:

* a reasonably easy way to use it in conjunction with Docker

* a reasonably easy way to expose this over NFS/CIFS (for devices/OSes that don't support FUSE)

* a reasonably easy way to do caching

A bigger vision is to build more things on top of relatively commoditized web services so free software can adapt to the 21st century without a large operating budget.