In summary -- their RabbitMQ consumer library and config are broken, in that their consumers are fetching additional messages when they shouldn't. I've never seen this in years of dealing with RabbitMQ. This caused a cascading failure: consumers were, rightfully, unable to grab more messages when only one of the fetched messages had been manually ack'ed. Fixing this one fetch issue in their consumer would have fixed the entire problem. Switching to pg probably made them rewrite their message fetching code, which probably fixed the underlying issue.

It ultimately doesn't matter because of the low volume they're dealing with, but gang, "just slap a queue on it" gets you the same results as "just slap a cache on it" if you don't understand the tool you're working with. If they knew that some jobs would take hours and some jobs would take seconds, why would they not immediately spin up four queues? Two for the short jobs (one acting as a DLQ), and two for the long jobs (again, one acting as a DLQ). Your DLQs have a low TTL, and on expiration those messages get placed back onto the tail of the original queues. Any failure by your consumer, and that message gets dropped onto the DLQ. Your overall throughput is then determined by the number * velocity of your consumers, and not by your queue architecture.
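
For anyone who hasn't wired this up before, here's a rough sketch of the four-queue layout as plain queue arguments. The queue names, the `.retry` suffix, and the 5 second TTL are my own placeholders, not anything from the article. The `x-` keys are the standard RabbitMQ dead-lettering and TTL arguments, and the empty exchange name means the default exchange, which routes by queue name.

```python
from typing import Any, Dict


def retry_topology(work_queue: str, retry_ttl_ms: int) -> Dict[str, Dict[str, Any]]:
    """Build declaration arguments for one work queue and its DLQ/retry queue.

    Hypothetical helper: names and TTL are illustrative assumptions.
    """
    retry_queue = f"{work_queue}.retry"
    return {
        work_queue: {
            # A message rejected without requeue is dead-lettered to the retry queue.
            "x-dead-letter-exchange": "",
            "x-dead-letter-routing-key": retry_queue,
        },
        retry_queue: {
            # Nobody consumes from this queue. Messages sit here until the TTL
            # expires, then get dead-lettered back onto the tail of the work queue.
            "x-message-ttl": retry_ttl_ms,
            "x-dead-letter-exchange": "",
            "x-dead-letter-routing-key": work_queue,
        },
    }


# Two queues for short jobs, two for long jobs.
topology: Dict[str, Dict[str, Any]] = {}
for name in ("jobs.short", "jobs.long"):
    topology.update(retry_topology(name, retry_ttl_ms=5_000))
```

With a client like pika you'd then pass each entry to `channel.queue_declare(queue=name, durable=True, arguments=args)`, and consumers would nack failures with `requeue=False` so the broker does the rerouting.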

This pg queue will last a very long time for them. Great! They're willing to give up the easy fanout architecture for simplicity, which again at their volume, sure, that's a valid trade. At higher volumes, they should go back to the drawing board.

> I've never seen this in years of dealing with RabbitMQ.

Did you do long running jobs like they did? It's a stereotype, but I don't think they used the technology correctly here -- you're not supposed to hold onto messages for hours before acknowledging. They should have used RabbitMQ just to kick off the job, immediately ACKing the request, and job tracking/completion handled inside... a database.
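
Roughly what I mean, as a sketch rather than anything they actually ran. The `jobs` table, the status values, and the function names are made up for illustration, and SQLite is only standing in for whatever database you'd really use. The `ack` callable stands in for the client's acknowledge call.

```python
import sqlite3
from typing import Callable


def make_db() -> sqlite3.Connection:
    """In-memory job table (hypothetical schema, for illustration only)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (id TEXT PRIMARY KEY, payload TEXT NOT NULL, status TEXT NOT NULL)"
    )
    return db


def on_message(db: sqlite3.Connection, job_id: str, payload: str, ack: Callable[[], None]) -> None:
    """Record the job, then ack right away instead of holding the message for hours."""
    # Persist before acking so a crash between the two steps cannot lose the job.
    # INSERT OR IGNORE keeps a redelivered message from creating a duplicate row.
    db.execute(
        "INSERT OR IGNORE INTO jobs (id, payload, status) VALUES (?, ?, 'pending')",
        (job_id, payload),
    )
    db.commit()
    ack()


def run_job(db: sqlite3.Connection, job_id: str, work: Callable[[str], None]) -> str:
    """Run the long job outside the broker's view and track its state in the table."""
    (payload,) = db.execute("SELECT payload FROM jobs WHERE id = ?", (job_id,)).fetchone()
    db.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
    db.commit()
    try:
        work(payload)
        status = "done"
    except Exception:
        status = "failed"
    db.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
    db.commit()
    return status
```

The broker only ever sees a message for the moment it takes to write one row, so a slow job can't hold up delivery to anyone else.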

At which point, if you've got to use a DB to track status, really why bother with the queuing system?

Because a queuing system offers a different thing than a (relational) database.

You can build a queuing system with a database, but you have to do that. Some of the features and constraints of the database might even make your life harder than it has to be.
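
To make "you have to do that" concrete, here's a minimal sketch of the part you end up writing yourself: claiming the next message so that two workers never get the same one. The table and function names are invented for the example, and SQLite stands in for a real server. On Postgres the usual tool for this is `SELECT ... FOR UPDATE SKIP LOCKED` rather than a whole-database write lock.

```python
import sqlite3
from typing import Optional, Tuple


def open_queue() -> sqlite3.Connection:
    """Create an in-memory queue table (hypothetical schema)."""
    # isolation_level=None puts the connection in autocommit mode so the
    # transaction boundaries below are explicit.
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.execute(
        "CREATE TABLE queue ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'ready')"
    )
    return db


def enqueue(db: sqlite3.Connection, payload: str) -> int:
    """Append a message and return its id."""
    cur = db.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def claim_next(db: sqlite3.Connection) -> Optional[Tuple[int, str]]:
    """Atomically mark the oldest ready message as claimed and return it."""
    # BEGIN IMMEDIATE takes the write lock up front, so the SELECT and the
    # UPDATE happen as one unit with respect to other writers.
    db.execute("BEGIN IMMEDIATE")
    try:
        row = db.execute(
            "SELECT id, payload FROM queue WHERE status = 'ready' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            db.execute("UPDATE queue SET status = 'claimed' WHERE id = ?", (row[0],))
        db.execute("COMMIT")
        return row
    except Exception:
        db.execute("ROLLBACK")
        raise
```

And that's before retries, visibility timeouts, or cleanup of finished rows, which is the stuff a broker gives you for free.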

Instead, view it like this: there is a need for a queuing system and a job system. Either or both can be implemented using a database for certain concerns, but each can also be a custom implementation. It's not a great idea to mix the two things unless the operational and infrastructure costs and complexity outweigh the benefits of a clear separation.

There are libraries that implement queueing on top of databases and require very little setup by the user. For example: https://github.com/timgit/pg-boss