At that kind of scale, S3 makes zero sense. You should definitely be rolling your own.

10PB costs more than $210,000 per month at S3, or more than $12M after five years.

RackMountPro offers a 4U server with 102 bays, similar to the Backblaze servers, which fully configured with 12TB drives is around $11k total and stores 1.2 PB per server. (https://www.rackmountpro.com/product.php?pid=3154)

That means you could fit all 15PB (for erasure coding with MinIO) in less than two racks for around $150k up-front.

Figure another $5k/mo in opex as well (power, bandwidth, etc.).

Instead of $12M spent after five years, you'd be at less than $500k, including traffic (also far cheaper than AWS). Even if you got AWS to cut their price in half (good luck with that), you'd still be saving more than $5 million.
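For anyone who wants to check the math, here's the back-of-envelope version (the ~$0.021/GB/mo S3 rate, the $11k/server price, and the $5k/mo opex are the assumptions from above):

    import math

    # Back-of-envelope comparison using the figures above.
    s3_monthly = 10_000_000 * 0.021      # 10 PB in GB at ~$0.021/GB/mo -> ~$210k/mo
    s3_five_years = s3_monthly * 60      # ~$12.6M over five years

    servers = math.ceil(15 / 1.2)        # 15 PB raw / 1.2 PB per box -> 13 servers
    diy_upfront = servers * 11_000       # ~$143k of hardware
    diy_five_years = diy_upfront + 5_000 * 60   # plus opex -> ~$443k

    print(f"S3: ${s3_five_years:,.0f} vs DIY: ${diy_five_years:,.0f}")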

Getting the data out of AWS won't be cheap, but check out the Snowball options for that: https://aws.amazon.com/snowball/pricing/

If you have PBs of data that you rarely access, it seems to make sense to compress it first.

I've rarely seen any non-giants with PBs of data properly compressed. For example, small JSON files converted into larger, compressed Parquet files will use 10-100x less space. I'm not familiar with images, but I see no reason why encoding batches of similar images shouldn't get similar or even better compression ratios.
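Rough sketch of the JSON-to-Parquet step with pandas/pyarrow, in case it's useful (the paths, the JSON-lines input format, and the zstd codec are just placeholders; the actual ratio depends entirely on how repetitive your data is):

    import glob
    import pandas as pd

    # Fold many small JSON-lines files into one compressed Parquet file.
    frames = [pd.read_json(p, lines=True) for p in glob.glob("events/*.json")]
    df = pd.concat(frames, ignore_index=True)

    # Columnar layout plus zstd is where most of the win comes from:
    # keys are stored once per column and similar values compress together.
    df.to_parquet("events.parquet", compression="zstd")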

Also, if you decide to move off later on, your transfer costs will be lower if you can move the data out in compressed form.

Could be wrong, but I don't believe batches of already-compressed images compress well.

But I'd be very interested to hear about techniques for this, because I have a lot of space eaten up by timelapses myself.

On the contrary, batches of images with a high degree of similarity compress _very_ well. You have to use an algorithm specifically designed for that task, though. Video codecs are a real-world example: consider that H.265 is really compressing a stream of (potentially) completely independent frames under the hood.

I'm not sure what the state of lossless algorithms might be for that though.
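If you want to experiment with the video-codec approach, something like this via ffmpeg is one possible starting point (the frame naming pattern and settings here are guesses, not a tuned pipeline):

    import subprocess

    # Pack a directory of similar, sequentially numbered frames into a single
    # losslessly encoded H.265 stream. "frames/frame_%05d.png" is a placeholder.
    subprocess.run(
        [
            "ffmpeg",
            "-framerate", "1",              # framerate doesn't matter for storage
            "-i", "frames/frame_%05d.png",
            "-c:v", "libx265",
            "-x265-params", "lossless=1",   # x265's lossless mode
            "timelapse.mkv",
        ],
        check=True,
    )

One caveat: depending on the source pixel format, ffmpeg may convert RGB to subsampled YUV on the way in, which would quietly break "lossless"; you'd want to pin the pixel format (or use a codec like FFV1) if exact round-tripping matters.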

Best I know of for that is something like lrzip still, but even then it's probably not state of the art. https://github.com/ckolivas/lrzip

It'll also take a hell of a long time to do the compression and decompression. It'd probably be better to do some kind of chunking and deduplication instead of compression itself, simply because I don't think you're ever going to have enough RAM to hold any kind of dictionary that would effectively handle that much data. You also wouldn't want to have to re-read and reconstruct that dictionary just to get at some random image.
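A toy version of the chunk-and-dedup idea, purely for illustration (fixed-size chunks and SHA-256 here; real tools like borg or restic use content-defined chunking so an insert doesn't shift every chunk boundary):

    import hashlib
    from pathlib import Path

    CHUNK = 4 * 1024 * 1024   # fixed 4 MiB chunks, purely illustrative
    store = {}                # chunk hash -> chunk bytes (a directory on disk in practice)
    manifests = {}            # file path -> ordered list of chunk hashes

    def ingest(path: Path) -> None:
        """Split a file into chunks, storing each unique chunk only once."""
        hashes = []
        with path.open("rb") as f:
            while chunk := f.read(CHUNK):
                digest = hashlib.sha256(chunk).hexdigest()
                store.setdefault(digest, chunk)   # dedup happens here
                hashes.append(digest)
        manifests[str(path)] = hashes

The nice property versus one giant compression dictionary is that reading back a single image only needs its own manifest and the chunks it references, not a full pass over the archive.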