I've built three distributed job systems at this point. A handy rule of thumb I've promoted for years is "build for 10x your current scale."

If you need to handle 70 requests/second, design for 700. If you need 20 servers running batch jobs, design for 200 servers. If you're in a startup growing 100% per year, you'll only be at 8x scale in 3 years, so designing for 10x buys you time to rewrite as you grow!

Of the three job systems I've built, the first one tried to avoid SQL for "scalability." Then we hit a bunch of edge cases that needed transactional integrity, and we were in a world of pain.

My two more recent distributed job systems use PostgreSQL as a coordinator. They're literally built around SELECT FOR UPDATE SKIP LOCKED. One of them routinely controls 350 workers and the other does elaborate prioritization for thousands of jobs. Both will keep running just fine until the businesses behind them are earning millions of dollars a year—for example, the one controlling 350 workers should scale to about 2,000 CPUs with a little work.
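
To make that concrete, here's a minimal sketch of the kind of claim query this pattern revolves around. The jobs table and its columns are illustrative, not the actual schema of either system:

```sql
-- Each worker claims one pending job; SKIP LOCKED means concurrent
-- workers simply pass over rows another worker has already locked.
UPDATE jobs
SET    status = 'running', started_at = now()
WHERE  id = (
    SELECT id
    FROM   jobs
    WHERE  status = 'pending'
    ORDER  BY created_at
    LIMIT  1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;
```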

Hyperscale technologies are fairly cheap right up until you discover you need transactions. At that point, faking transactional semantics on top of an eventually consistent data store becomes an engineering nightmare.

So sit down and do the math. If your company were earning $100 million/year, how big would your distributed system need to be? Can you easily get a PostgreSQL instance big enough to handle that load? (Or could you just shard per client?) If so, strongly consider using PostgreSQL. It makes a hundred things trivial.

You can even build an advisory-lock queue in Postgres, which is way slower but has some benefits.
Slower to build or slower to run?
Slower to run. But as long as you keep the Postgres connection open, you know the job is still running, whereas with FOR UPDATE SKIP LOCKED you basically need a status column and a job_timeout.
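In practice that means some kind of periodic sweep, roughly like this sketch (table and column names are made up):

```sql
-- The row lock disappears when the claiming transaction commits,
-- so a crashed worker is only detected by its job sitting in
-- 'running' past its timeout.
UPDATE jobs
SET    status = 'pending', started_at = NULL
WHERE  status = 'running'
AND    started_at + job_timeout < now();
```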

pg_try_advisory_lock/pg_advisory_unlock can hold a lock across transactions, while FOR UPDATE SKIP LOCKED can't, so you'd either need to keep a transaction open or use status + job_timeout (and in Postgres you shouldn't hold transactions open for long).
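
A rough sketch of the session-level variant (using the job id as the lock key is just an assumption here):

```sql
-- Session-level advisory lock: tied to the connection, not to a
-- transaction, so it survives the worker's intermediate commits
-- and is released automatically if the connection dies.
SELECT pg_try_advisory_lock(12345);  -- true = this worker owns job 12345

-- ... run the job, committing as many transactions as you like ...

SELECT pg_advisory_unlock(12345);    -- release when the job finishes
```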

We use C#, but we looked into https://github.com/que-rb/que, which uses advisory locks. Since our jobs take anywhere from 1 minute to 2 hours, advisory locks were a no-brainer. It's just not the best fit if you have thousands of fast jobs per second, but for a more moderate queue—say 10,000 jobs arriving every minute, 10 minutes, or 30 minutes, each taking 1 minute to 2 hours—it's fine.

We also don't delete jobs; we don't care about storage, since the job table doesn't take up much space. And we have plenty of time to catch up at night, since we're only in Europe.