Another neat hack is to use Postgres as a quick & dirty replacement for Hadoop/MapReduce when you have a job with big (100 TB+) input data but small (~1 GB) output data. A lot of common tasks fall into this category: generating aggregate statistics from large log files, searching Common Crawl for relevant webpages, identifying abusive users or transactions, etc.

The architecture is to stick a list of your input shards in a Postgres table, with a state flag that goes PENDING -> WORKING -> FINISHED (or ERROR), and then spin up a bunch of worker processes as EC2 spot instances. Each worker claims the next PENDING shard, marks it WORKING, pulls it, processes it, marks it FINISHED, and repeats, writing its output back to the DB in a transaction; the assumption is that aggregation can happen in-process and then be merged in a relatively cheap transaction. If a worker fails or gets preempted, any shards it was working on get retried (or marked as ERROR).
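
A minimal sketch of what one worker's loop could look like in Python, assuming psycopg2 and a hypothetical shards/results schema (the table names, columns, and process_shard() are illustrative, not from the original setup); the FOR UPDATE SKIP LOCKED claim is one common way to let many spot instances grab shards concurrently without blocking each other:

```python
# Rough sketch only. Assumes a shard table roughly like:
#   CREATE TABLE shards (
#     id       bigserial PRIMARY KEY,
#     path     text   NOT NULL,                     -- e.g. an S3 key for one input shard
#     status   text   NOT NULL DEFAULT 'PENDING',   -- PENDING / WORKING / FINISHED / ERROR
#     attempts int    NOT NULL DEFAULT 0
#   );
# Table/column names and process_shard() are made up for illustration.
import psycopg2

CLAIM_SQL = """
    UPDATE shards
       SET status = 'WORKING', attempts = attempts + 1
     WHERE id = (SELECT id FROM shards
                  WHERE status = 'PENDING'
                  ORDER BY id
                  LIMIT 1
                  FOR UPDATE SKIP LOCKED)   -- concurrent workers skip each other's rows
    RETURNING id, path
"""

def process_shard(path):
    """Map/combine step: pull the shard and compute a small in-memory aggregate."""
    raise NotImplementedError  # placeholder for your actual per-shard logic

def run_worker(dsn):
    conn = psycopg2.connect(dsn)
    while True:
        with conn, conn.cursor() as cur:        # txn 1: claim the next PENDING shard
            cur.execute(CLAIM_SQL)
            row = cur.fetchone()
        if row is None:                         # nothing PENDING left -> shut down
            return
        shard_id, path = row
        try:
            output = process_shard(path)
            with conn, conn.cursor() as cur:    # txn 2: write output + mark FINISHED
                cur.execute("INSERT INTO results (shard_id, payload) VALUES (%s, %s)",
                            (shard_id, output))
                cur.execute("UPDATE shards SET status = 'FINISHED' WHERE id = %s",
                            (shard_id,))
        except Exception:
            # Mark ERROR here; a separate sweeper (or retry logic keyed on attempts)
            # can flip stale WORKING rows from killed/preempted workers back to PENDING.
            with conn, conn.cursor() as cur:
                cur.execute("UPDATE shards SET status = 'ERROR' WHERE id = %s",
                            (shard_id,))
```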

Postgres basically functions as the MapReduce Master & Reducer, the worker functions as the Mapper and Combiner, and there's no need for a shuffle phase because output <<< input. Almost all the actual complexity in MapReduce/Hadoop is in the shuffle, so if you don't need that, the remaining stuff takes < 1 hour to implement and can be done without any frameworks.
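
Because the output is tiny, the "reduce" can be as simple as an upsert when each worker commits, e.g. per-key counters merged into a shared table. A hedged sketch, where the counts table and merge_counts() are again just hypothetical names:

```python
# Hypothetical reduce step: merge one worker's in-memory counters into a shared
# aggregate table in a single cheap transaction. Assumes something like:
#   CREATE TABLE counts (key text PRIMARY KEY, n bigint NOT NULL);

MERGE_SQL = """
    INSERT INTO counts (key, n) VALUES (%s, %s)
    ON CONFLICT (key) DO UPDATE SET n = counts.n + EXCLUDED.n
"""

def merge_counts(conn, local_counts):
    """local_counts: a dict built in-process while mapping over one shard."""
    with conn, conn.cursor() as cur:   # one transaction per shard is the only coordination
        cur.executemany(MERGE_SQL, list(local_counts.items()))
```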

That's a very similar pattern to what I've been using at TW.

At the core of our implementation is this library for scheduling jobs across workers: https://github.com/kagkarlsson/db-scheduler

Would highly recommend that library!