This is something that annoyed me a bit with OTP. The basic strategies aren't really enough for that, so you need something like https://github.com/jlouis/fuse
I wrote something like that myself, but it hasn't seen a ton of use: https://github.com/davidw/hardcore
Back in the day, I found this to be a pretty good 'circuit breaker' type of thing: https://github.com/jlouis/fuse
I wrote something similar for the (OTP) application level: https://github.com/davidw/hardcore
This is a fair point, even in Erlang land. There are a zillion things encouraging you to "let it crash" and far fewer going beyond that.
One thing that doesn't get mentioned often enough is a circuit-breaker like fuse: https://github.com/jlouis/fuse
This also has some more advanced topics:
Circuit breakers provide three things your scheme above does not:
First, there is a configurable policy for how many errors to tolerate before the circuit breaks. This policy is not baked into your code but lives outside it, often in configuration files. Some of the work currently going on is support for more advanced policies, and for more advanced ways of ramping connectivity back up on the flip side.
Second, there is no resource buildup. In your example, every request to iTunes will wait, tying up resources while it does. Once the circuit breaks, you respond with an error immediately. In a system handling 10k req/s with, say, a 5-second timeout, the buildup is serious: 10,000 req/s × 5 s means effectively 50k requests waiting at any moment, which could be 50k open network sockets.
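To make the first two points concrete, here is a minimal sketch in Python rather than Erlang. It assumes a simple policy: blow after `max_failures` failures within `window` seconds, then close again `reset_after` seconds later. `CircuitBreaker`, `CircuitOpen` and all parameter names are hypothetical and are not fuse's API. The clock is injectable so the behavior can be tested without sleeping.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised at once while the breaker is blown, instead of waiting on a timeout."""


class CircuitBreaker:
    """Hypothetical breaker: blows after `max_failures` failures within
    `window` seconds and closes again `reset_after` seconds later."""

    def __init__(self, max_failures: int, window: float, reset_after: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # The policy is plain data, so it can come from a config file
        # instead of being baked into each call site.
        self.max_failures = max_failures
        self.window = window
        self.reset_after = reset_after
        self.clock = clock
        self._failures: Deque[float] = deque()
        self._blown_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._blown_at is None:
            return False
        if self.clock() - self._blown_at >= self.reset_after:
            # Heal: close the circuit and forget the old failures.
            self._blown_at = None
            self._failures.clear()
            return False
        return True

    def record_failure(self) -> None:
        now = self.clock()
        self._failures.append(now)
        # Forget failures that have slid out of the window.
        while now - self._failures[0] > self.window:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._blown_at = now

    def call(self, fn: Callable[[], T]) -> T:
        if self.is_open():
            # Fail fast: no socket opened, nothing left waiting on a timeout.
            raise CircuitOpen()
        try:
            return fn()
        except Exception:
            self.record_failure()
            raise
```

A caller would wrap the outbound request, e.g. `breaker.call(lambda: fetch_from_itunes(track_id))` with a hypothetical `fetch_from_itunes`, and turn `CircuitOpen` into an immediate error response instead of a 5-second wait.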
Third, fuse has a built-in monitoring system. Every fuse you create posts its current state to an event manager, which can be used to build monitoring applications (essentially a low-volume pub/sub pattern). If you roll your own, you have to provide this monitoring yourself; with fuse, you get a nice way to plug into the fabric. Riak's search system, Yokozuna, uses this, for instance.
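The monitoring idea is a small publish/subscribe loop: each fuse announces its state changes, and any number of monitoring handlers listen. In Erlang this is an event manager; the sketch below uses a plain Python class as a stand-in. `EventManager`, `MonitoredFuse` and the `"ok"`/`"blown"` state names are hypothetical, not fuse's actual events.

```python
from typing import Callable, List

Handler = Callable[[str, str], None]


class EventManager:
    """Low-volume pub/sub: handlers receive (fuse_name, new_state) events."""

    def __init__(self) -> None:
        self._handlers: List[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._handlers.append(handler)

    def publish(self, name: str, state: str) -> None:
        # Copy the list so a handler may subscribe others while we iterate.
        for handler in list(self._handlers):
            handler(name, state)


class MonitoredFuse:
    """Hypothetical fuse that reports every state change to an EventManager."""

    def __init__(self, name: str, events: EventManager) -> None:
        self.name = name
        self.events = events
        self.state = "ok"

    def set_state(self, state: str) -> None:
        # Only transitions are posted, which keeps the event volume low.
        if state != self.state:
            self.state = state
            self.events.publish(self.name, state)
```

A dashboard, an alerting hook and a metrics exporter can each subscribe independently, without the fuse knowing about any of them.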
Finally, what sets fuse apart from other circuit breakers is that it has a full QuickCheck specification. That is, we have a pseudo-formal account of fuse working as intended according to the specification. In particular, I tend to generate random fuse scenarios for a couple of hours before releasing new versions. This amounts to a couple of million random test cases, and the more time we spend generating test cases, the closer we get to a full model check of the code. There is some novel work in there on handling randomness and time in test cases via Erlang QuickCheck's excellent mocking system. As a result, there have been few bug reports and likewise few fatal errors reported.
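The core of that approach (random command sequences, time under the test's control, and comparison against a simple model) can be sketched without QuickCheck. The version below uses only Python's `random` module and has no shrinking, so it is a much cruder stand-in for what Erlang QuickCheck does. `WindowFuse`, `FakeClock` and `check` are hypothetical names; the model is deliberately naive and rescans the full melt history on every step.

```python
import random
from collections import deque
from typing import Deque, List, Optional, Tuple


class FakeClock:
    """Test-controlled time, so scenarios never actually sleep."""

    def __init__(self) -> None:
        self.t = 0.0

    def __call__(self) -> float:
        return self.t

    def advance(self, dt: float) -> None:
        self.t += dt


class WindowFuse:
    """Implementation under test: blown while >= threshold melts lie in the window."""

    def __init__(self, threshold: int, window: float, clock: FakeClock) -> None:
        self.threshold = threshold
        self.window = window
        self.clock = clock
        self._melts: Deque[float] = deque()

    def _prune(self) -> None:
        now = self.clock()
        while self._melts and now - self._melts[0] >= self.window:
            self._melts.popleft()

    def melt(self) -> None:
        self._prune()
        self._melts.append(self.clock())

    def is_blown(self) -> bool:
        self._prune()
        return len(self._melts) >= self.threshold


def check(n_cases: int, seed: int = 0,
          fuse_cls: type = WindowFuse) -> Optional[Tuple[int, float, List[str]]]:
    """Run random command sequences; return the first counterexample, or None."""
    rng = random.Random(seed)  # seeded, so any failure can be replayed
    for _ in range(n_cases):
        clock = FakeClock()
        threshold = rng.randint(1, 5)
        window = rng.choice([1.0, 2.5, 10.0])
        fuse = fuse_cls(threshold, window, clock)
        model: List[float] = []  # naive model: the full melt history
        trace: List[str] = []
        for _ in range(rng.randint(1, 50)):
            cmd = rng.choice(["melt", "advance", "ask"])
            if cmd == "melt":
                fuse.melt()
                model.append(clock())
            elif cmd == "advance":
                dt = rng.choice([0.0, 0.5, 1.0, 3.0])
                clock.advance(dt)
                cmd = f"advance {dt}"
            trace.append(cmd)
            expected = sum(1 for m in model if clock() - m < window) >= threshold
            if fuse.is_blown() != expected:
                return threshold, window, trace
    return None
```

Running `check` for longer (more cases, more seeds) explores more scenarios, which is the small-scale version of generating test cases for a couple of hours before a release.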
(Edit: for completeness, all the code of fuse is online here: https://github.com/jlouis/fuse including documentation)
Sadly, they are not mentioned much in books or other documentation, despite being a potentially extremely useful piece of infrastructure for some kinds of projects.
If you go with one GenServer per node, that process just connects to Redis and pulls jobs (using BLPOP or similar). When it has a job, it checks out a worker from poolboy [1] and assigns it the work.
You could also have every single worker process go directly to Redis and pull jobs in a loop. But where's the fun in that ;)
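The per-node puller described above can be sketched in Python with two loudly labeled substitutions: a `queue.Queue` stands in for the Redis list (its blocking `get` plays the role of BLPOP), and a `threading.BoundedSemaphore` stands in for poolboy's checkout/checkin. `JobPuller`, `handle` and the `stop` sentinel are hypothetical names.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


class JobPuller:
    """One per node: block for a job (like BLPOP), check out a worker, hand it the job.

    Assumptions: queue.Queue stands in for the Redis list and a
    BoundedSemaphore stands in for poolboy's checkout/checkin.
    """

    def __init__(self, jobs: "queue.Queue[Any]", pool_size: int,
                 handle: Callable[[Any], None]) -> None:
        self.jobs = jobs
        self.handle = handle
        self.slots = threading.BoundedSemaphore(pool_size)
        self.executor = ThreadPoolExecutor(max_workers=pool_size)

    def _work(self, job: Any) -> None:
        try:
            self.handle(job)
        finally:
            self.slots.release()  # check the worker back in, even on error

    def run(self, stop: Any = None) -> None:
        while True:
            job = self.jobs.get()  # blocks until a job arrives, like BLPOP
            if job is stop:
                break
            self.slots.acquire()  # blocks while every worker is busy
            self.executor.submit(self._work, job)
        self.executor.shutdown(wait=True)  # let in-flight jobs finish
```

Because the puller blocks on a free worker before pulling again, a busy node holds at most one job in hand and leaves the rest in the shared list for other nodes.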
If you want a single global coordinator instead of one per node, you can use :global [2] to register a process globally in the cluster. That process is then reachable cluster-wide under its registered name. It can talk to each of your worker pools in the cluster, try to check out workers round-robin, and assign them work. And if you do this, you might as well ask yourself whether you really need Redis, rather than keeping it all within your Elixir system.
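The round-robin part of such a coordinator is small. A globally registered process handles one message at a time, so a single-threaded Python class is a fair stand-in for the sketch below. `Pool`, `try_checkout` and `Coordinator` are hypothetical names; in a real cluster each try would be a non-blocking checkout against a pool on another node, not a local counter.

```python
from typing import List, Optional


class Pool:
    """Hypothetical worker pool on one node, tracked as a free-worker count."""

    def __init__(self, node: str, size: int) -> None:
        self.node = node
        self.free = size

    def try_checkout(self) -> bool:
        # Non-blocking: report "full" instead of waiting for a worker.
        if self.free > 0:
            self.free -= 1
            return True
        return False

    def checkin(self) -> None:
        self.free += 1


class Coordinator:
    """Single global coordinator that tries pools round-robin."""

    def __init__(self, pools: List[Pool]) -> None:
        self.pools = pools
        self._next = 0

    def assign(self, job: object) -> Optional[str]:
        n = len(self.pools)
        for i in range(n):
            pool = self.pools[(self._next + i) % n]
            if pool.try_checkout():
                # Start the next search just after the pool we used.
                self._next = (self._next + i + 1) % n
                return pool.node
        return None  # every pool is busy: queue, shed, or push back
```

The `None` case is where overload handling starts: the coordinator has to decide whether to queue the job, reject it, or slow down its producers.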
Deciding which node this process lives on is still up to you, but there are libraries like locks [3] that can automatically elect a leader in your cluster.
And once this is done you can start dealing with overload :)
Of course, this is just a simple and naive approach; there are a lot of really useful Erlang libraries to check out. Here, in no particular order, is a list of libs whose docs and sources helped me when I was getting started:
https://github.com/heroku/canal_lock - Erlang lock manager for concurrently variable resource numbers
https://github.com/jlouis/safetyvalve - queueing facilities for tasks to be executed so their concurrency and rate can be limited on a running system
https://github.com/fishcakez/sbroker - process broker for matchmaking between two groups of processes using sojourn time based active queue management to prevent congestion.
https://github.com/ferd/backoff - exponential backoffs and timers to be used within OTP processes when dealing with cyclical events, such as reconnections, or generally retrying things
https://github.com/jlouis/fuse - A Circuit Breaker for Erlang
https://github.com/basho/sidejob - Parallel worker and capacity limiting library for Erlang
https://github.com/pspdfkit-labs/sidetask - My humble Elixir wrapper for basho's sidejob
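As one example of what these libraries package up, the exponential backoff idea behind ferd/backoff can be sketched in a few lines of Python. This shows the general technique (doubling delays up to a cap, with optional "full jitter"), not the backoff library's actual API; `backoff_delays` and `retry` are hypothetical names, and `sleep` is injectable so it can be tested without waiting.

```python
import random
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, cap: float,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield base, 2*base, 4*base, ... capped at `cap`.

    With an rng, apply full jitter: a uniform delay in [0, computed delay],
    so many reconnecting clients do not retry in lockstep.
    """
    delay = base
    while True:
        yield rng.uniform(0, delay) if rng else delay
        delay = min(cap, delay * 2)


def retry(fn: Callable[[], T], attempts: int, delays: Iterator[float],
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn up to `attempts` times (at least 1), sleeping between failures."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(next(delays))
    raise ValueError("attempts must be at least 1")
```

The cap matters as much as the doubling: without it, a long outage would push retry intervals out to hours.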
[1] https://github.com/devinus/poolboy (also used in Ecto; look through the Ecto sources if you want to see how it's used in Elixir)
[2] http://erlang.org/doc/man/global.html
[3] https://github.com/uwiger/locks and then https://github.com/uwiger/locks/blob/master/doc/locks_leader...
Enough of this and it will crash the node. You need to design for this in an Erlang system.
The ever-helpful jlouis has some useful writing on the subject: http://jlouisramblings.blogspot.it/2010/11/on-erlang-state-a...
As well as these: https://github.com/jlouis/fuse
Sadly, this is not discussed as much as it should be in Erlang land.