12 factor seems to have stood the test of time really well—I was introduced via Heroku (who I think invented it?) quite a long time ago in tech years, and yet it still seems to be probably the most popular ‘framework’ for devops.

In fact, my startup EnvKey[1] was heavily inspired by the 12 factor approach. While it has always worked well for me, one bit that always felt thorny was using the environment for configuration and secrets. It’s obviously great to get this stuff out of code, but then you face new issues: how to keep it all in sync across many environments, and how to pass potentially highly sensitive data around securely.

EnvKey fills in this gap using end-to-end encryption and a seamless integration that builds on top of environment variables—it’s a drop-in replacement if you already use the environment for config. Check it out if you’re looking for something to smooth out this aspect of 12 factor! We have lots of folks using it with GCP/GKE.
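If you haven't seen the pattern before, "using the environment for config" just means everything environment-specific is read from env vars when the process boots. A rough sketch (the variable names here are made up):

    import os

    # 12-factor-style config: everything environment-specific comes from
    # env vars at boot. These names are just examples.
    DATABASE_URL = os.environ["DATABASE_URL"]            # required; fail fast if missing
    STRIPE_SECRET_KEY = os.environ["STRIPE_SECRET_KEY"]  # secret, never committed to the repo
    LOG_LEVEL = os.environ.get("LOG_LEVEL", "info")      # optional, with a default

    # The same build then runs unchanged in dev, staging, and prod; only the
    # environment it boots in differs.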

1 - https://www.envkey.com

Indeed it has.

I wanted to expose secrets as env vars for my ECS containers. There wasn't an easy way to do that. I wound up writing an entrypoint inside the container that would download secrets from AWS SSM and expose them to application code via env vars.
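The general shape of that kind of entrypoint looks something like this (a simplified sketch assuming boto3; the /myapp/ parameter path and names are illustrative):

    #!/usr/bin/env python3
    # Sketch of an SSM-to-env-vars entrypoint (illustrative only).
    # Fetches parameters under a path, exports them, then execs the real command.
    import os
    import sys
    import boto3

    ssm = boto3.client("ssm")
    paginator = ssm.get_paginator("get_parameters_by_path")

    # e.g. /myapp/DATABASE_URL -> DATABASE_URL
    for page in paginator.paginate(Path="/myapp/", WithDecryption=True):
        for param in page["Parameters"]:
            name = param["Name"].rsplit("/", 1)[-1]
            os.environ[name] = param["Value"]

    # Replace this process with the container's real command,
    # e.g. ENTRYPOINT ["./entrypoint.py"] plus CMD ["gunicorn", "app:app"]
    os.execvp(sys.argv[1], sys.argv[1:])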

Then AWS ECS released the feature to specify env vars as SSM values natively from the ECS agent.

It felt good knowing someone else saw this as the "right" way too.

---

I will say that Factor 9 (Disposability) seems rarely followed: in many applications, SIGTERM is not handled gracefully.

Program shutdown is not supposed to be handled "gracefully", simply because there's no guarantee that your program will be able to gracefully shut down. Google itself doesn't handle program shutdown at all. Your program must be written so that it's safe to outright kill it at any moment, because at scale that's what's going to happen from time to time whether you want it or not. It is best to shed this illusion that your program will have the opportunity to shut down in an orderly fashion, because when it runs on tens of thousands of computers, you can pretty much count on graceful shutdown not happening at least every now and then.

This has the added benefit of making the datacenters easier to manage. Say you have a bunch of workloads packed into racks of servers. Say one of those racks needs an electrical upgrade or hardware replacement. If all programs are preemptible by design, you can just tell cluster management software (Borg) to kill them and restart the tasks elsewhere. Or, in fact, you could even just pull the plug on the rack without telling Borg to do anything. Because workloads are spread out between fault domains, only a small fraction of tasks gets restarted elsewhere, and a properly designed system will not corrupt data or even let the external users know that anything happened.

I'm sorry, but this is mostly wrong.

Yes, you need to design your code to withstand disaster shutdowns and SIGKILL-type situations. That doesn't mean you get to ignore SIGTERM.

The vast majority of shutdowns are due to routine maintenance events. If you get SIGTERM'd and all you do is crash, here's a short list, off the top of my head, of bad things that can and do result (a minimal sketch of the alternative follows the list):

* Tail latency goes up, because clients talking to tasks that don't shut down gracefully have to wait for at least one RPC timeout - possibly more if the channel timeout is longer - before they retry elsewhere. (This will manifest in multiple ways, because things like load balancers will also be slow to respond.)

* System contention goes up - if your server was holding any kind of distributed lock (e.g. DB write) when it went down, everyone else needs to wait for that lock to time out before someone else can take it. (Hopefully your locks require keep-alives to hold them!)

* Corruption of data in transit - a crashing binary is basically a big ol blob of UB. With enough replication and checksumming you can mitigate this, but it doesn't mean you get to do dangerous things now! Guardrails only work if you don't take a sledgehammer to them.

* I really, really hope you don't have any request affinity in your system, because if you do, now your caches are empty, your dependencies are going to see you doing expensive ops much more frequently, and so on. (And if you're a streaming system, well now you're just all kinds of screwed.)
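And handling SIGTERM doesn't require anything elaborate. A minimal drain-on-SIGTERM sketch (the toy HTTP server here stands in for whatever RPC framework you actually use, many of which already have drain hooks built in):

    import signal
    import threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok\n")

    server = HTTPServer(("0.0.0.0", 8080), Handler)

    def drain(signum, frame):
        # Stop accepting new connections and let in-flight requests finish.
        # shutdown() must run in another thread or it deadlocks serve_forever().
        threading.Thread(target=server.shutdown).start()

    signal.signal(signal.SIGTERM, drain)

    server.serve_forever()   # returns once shutdown() has been requested
    server.server_close()    # release the listening socket
    # ...then release distributed locks, flush buffers, and exit 0.
    # (In real setups you'd also fail your readiness check early so the LB
    # stops sending traffic.) None of this replaces kill-safety: SIGKILL or
    # a power cut skips all of it.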

I have hardly ever felt this strongly about anything in software engineering, although I will concede this is an acquired taste, and it took me some time to see the obvious truth in it.

I will reiterate. If your programs are designed to be kill-safe, it's a waste of time to shut them down any other way. It is also harmful to your systems design, because you can't guarantee that your hardware won't give out, especially if you run on e.g. 50-100K cores (as many Google services do). You can basically be certain at that point that individual tasks (and the hardware they run on) will die from time to time. Note that this only applies to shutdown, not to e.g. draining traffic. That part can and should still be orderly in most systems so as not to disrupt the user experience too much, and Google does have that in their RPC stack (lame-duck mode, etc.). For everything else you end up doing a lot of logging, checkpointing, and 2PC. For distributed systems you end up using consensus mechanisms and eliminating SPOFs.
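To make the checkpointing point concrete: the classic kill-safe building block is the atomic write - write to a temp file, fsync, rename - so that being killed at any instant leaves either the old checkpoint or the new one on disk, never a torn one. A minimal sketch (the path and payload are illustrative):

    # Crash-safe checkpoint via write-temp-then-rename: a kill at any
    # point leaves either the previous file or the new one, never a
    # half-written one. Path and payload are illustrative.
    import json
    import os
    import tempfile

    def write_checkpoint(state, path="checkpoint.json"):
        dir_name = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=dir_name, prefix=".ckpt-")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(state, f)
                f.flush()
                os.fsync(f.fileno())    # make the data durable before the rename
            os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
            # (Strictly, fsync the directory too if the rename itself must be durable.)
        except BaseException:
            os.unlink(tmp_path)
            raise

    write_checkpoint({"offset": 12345, "epoch": 7})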

If you rely on orderly shutdown for correctness in your distributed system, I'd like to know the name of your product so I can avoid it.

I think you're talking past each other a bit here.

Nothing the parent is discussing concerns correctness, so I think your last sentence is a bit uncalled for.

I'm actually serious about that last bit. If your distributed system relies on guarantees it _does not have_ in order to operate correctly, one would be well advised to stay away from it.

You can be super serious about it all you want, but it's accusatory and doesn't reflect the post you responded to, which referred to the performance and user-experience implications of shooting nodes in the head.

Personally, I think a really good debate was forming here. As you said, if you drain the node's traffic before killing it, you're probably right: whatever it costs to maintain consistency is probably paid back by the human side of just not having to wait for a clean shutdown of processes at scale.

But when you throw in the sass, people stop listening and we all get dumber for it.

But there's no "debate" here to be had. All the high scale companies (Google, FB, Amazon, Microsoft, Netflix, others) do not rely on their distributed system nodes being able to wind down in an orderly fashion. Shit, Netflix and Google (and likely others as well) stage fault tolerance exercises, taking random nodes (or entire datacenters) out of rotation and checking if things still work. There's no way to get to five nines if you expect your program to always behave.

Here's one from Netflix that will give you an ulcer: https://github.com/Netflix/chaosmonkey

Here's what Google does: https://www.usenix.org/conference/lisa15/conference-program/...

>> we all get dumber for it

Not _all_. Only those who feel inclined to reject the obvious.