What does HackerNews think of chaosmonkey?

Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures.

Language: Go

I do actually agree with you; what I am saying is that a big part of Linux testing is crowdsourced: you and I and hundreds of millions of others are testing for regressions every time we run Linux, or run buggy, incorrect software on machines at random levels of stability and soundness.

Few unit tests in the kernel would be able to compete with so many chaos monkeys (a reference to Netflix's https://github.com/Netflix/chaosmonkey).

It might be a cliché with AWS at this point, but how many meta-services does it have, as in services whose sole purpose is maintaining your own AWS account?

Like the Well-Architected Tool/Framework, Trusted Advisor, etc.?

I'm impressed that Chaos Monkey [0] is now a first-party service. So you can break your own cloud account on demand and pay for breaking it?

[0]: https://github.com/Netflix/chaosmonkey (I think it was the first popular open-source implementation of chaos engineering)

Chaos engineering is the practice of deliberately breaking live systems to improve their robustness. The first widely noted system was Chaos Monkey, developed at Netflix. https://github.com/Netflix/chaosmonkey
But there's no "debate" here to be had. All the high-scale companies (Google, FB, Amazon, Microsoft, Netflix, others) do not rely on their distributed system nodes being able to wind down in an orderly fashion. Shit, Netflix and Google (and likely others as well) stage fault-tolerance exercises, taking random nodes (or entire datacenters) out of rotation and checking if things still work. There's no way to get to five nines if you expect your program to always behave.

Here's one from Netflix that will give you an ulcer: https://github.com/Netflix/chaosmonkey

Here's what Google does: https://www.usenix.org/conference/lisa15/conference-program/...
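
For anyone wondering what these exercises boil down to mechanically, here is a hedged sketch of the core Chaos Monkey loop in Go (the repo's language). It is not Netflix's actual code: the serviceGroups inventory and the terminate function are made-up stand-ins for a real cloud inventory API and a real termination call.

```go
// Hedged sketch of the basic Chaos Monkey idea, not Netflix's implementation:
// on a fixed schedule, pick one instance per service group at random and
// terminate it, regardless of what the rest of the system is doing.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// serviceGroups maps a service name to its running instance IDs.
// In a real setup this would come from the cloud provider's API.
var serviceGroups = map[string][]string{
	"api":      {"i-0a1", "i-0a2", "i-0a3"},
	"checkout": {"i-0b1", "i-0b2"},
}

// terminate is a placeholder for the real "kill this instance" API call.
func terminate(service, instance string) {
	fmt.Printf("terminating %s instance %s\n", service, instance)
}

func main() {
	// A real tool would also respect business hours, per-group opt-in,
	// and a termination probability instead of a guaranteed kill.
	ticker := time.NewTicker(time.Hour)
	defer ticker.Stop()

	for range ticker.C {
		for service, instances := range serviceGroups {
			if len(instances) == 0 {
				continue
			}
			victim := instances[rand.Intn(len(instances))]
			terminate(service, victim)
		}
	}
}
```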

>> we all get dumber for it

Not _all_. Only those who feel inclined to reject the obvious.

I have not used it, but I have heard this is a very useful tool: https://github.com/Netflix/chaosmonkey

I'd say yes. I heard about this tool just a week ago at a developer conference.

https://github.com/Netflix/chaosmonkey

The difference is that when AWS goes down, Netflix/Spotify still have backups and could adapt their infrastructure if the outage involved permanent data loss. You're talking about the people who built https://github.com/Netflix/chaosmonkey

I'd argue that it should be _easier_ for a 2-man company to adapt to cloud service outages, as they likely don't have to keep up with nearly as many backups or moving parts.

The only real way to deal with this is to test distributed systems.

Doing so is hard, but it is the only way to reliably know how a system behaves under unpredictable failures (a minimal sketch of the idea follows the list below).

So learn up:

- http://jepsen.io

- https://github.com/Netflix/chaosmonkey

- https://github.com/gundb/panic-server
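
As a hedged illustration of what tools like these automate, here is a minimal Go sketch: a toy in-process "cluster", a chaos step that kills a random replica, and an invariant that is re-checked after every injected failure. Every name in it (cluster, replica, killRandomReplica) is made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// replica is a toy stand-in for one node holding a full copy of the data.
type replica struct {
	alive bool
	data  map[string]string
}

// cluster is a toy replicated store: writes go to every live replica,
// reads are served by any live replica.
type cluster struct{ replicas []*replica }

func newCluster(n int) *cluster {
	c := &cluster{}
	for i := 0; i < n; i++ {
		c.replicas = append(c.replicas, &replica{alive: true, data: map[string]string{}})
	}
	return c
}

func (c *cluster) put(k, v string) {
	for _, r := range c.replicas {
		if r.alive {
			r.data[k] = v
		}
	}
}

func (c *cluster) get(k string) (string, error) {
	for _, r := range c.replicas {
		if r.alive {
			return r.data[k], nil
		}
	}
	return "", errors.New("no live replicas")
}

// killRandomReplica is the chaos step: pick a victim at random, no warning.
func (c *cluster) killRandomReplica() {
	c.replicas[rand.Intn(len(c.replicas))].alive = false
}

func main() {
	c := newCluster(5)
	c.put("answer", "42")

	// Inject failures one at a time and re-check the invariant after each:
	// as long as any replica survives, reads must still return the write.
	for i := 1; i <= 4; i++ {
		c.killRandomReplica()
		if v, err := c.get("answer"); err != nil || v != "42" {
			fmt.Println("invariant violated after failure", i)
			return
		}
	}
	fmt.Println("invariant held through 4 random failures")
}
```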

The problem with the analogy is that for a learning algorithm, there are clear definitions of the model complexity as it relates directly to the outcome being optimized. YAGNI applied to a model is a penalty term for parameters or various methods of regularization.

But when the “goal” of the system is just “arbitrary short-term desires of management,” you can easily point out the problems, yet there is no agreement on what constraints you can use to trade off against them.

This is especially true for extensibility, where it is easy to get carried away making a system extensible for future changes, many of which turn out to be wasted effort because you did not end up needing that flexibility anyway and everything changed after Q2 earnings were announced, etc.

In those cases, it can actually be more effective engineering to “overfit” to just what management wants right now and accept that you have to pay the pain of hacking extensibility in on a case-by-case basis. This definitely reduces wasted effort from a YAGNI point of view.

The closest thing I could think of to the same idea of “regularizing” software complexity would be Netflix’s ChaosMonkey [0], which is basically like Dropout [1] but for deployed service networks instead of neural networks.

Extending this idea to actual software would be quite cool. Something like the QuickCheck library for Haskell, but which somehow randomly samples extensibility needs and penalizes some notion of how hard the code would be to extend to that case. Not even sure how it would work...

[0]: https://github.com/Netflix/chaosmonkey

[1]: https://en.m.wikipedia.org/wiki/Dropout_(neural_networks)
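
On the QuickCheck reference: Go's standard library ships a testing/quick package that does the same style of randomized-input property checking. The sketch below only illustrates that random-sampling half of the idea, not the extensibility penalty the comment imagines; the Reverse function and the round-trip property are made-up examples.

```go
package main

import (
	"fmt"
	"testing/quick"
)

// Reverse returns a new slice with the elements of s in reverse order.
func Reverse(s []int) []int {
	out := make([]int, len(s))
	for i, v := range s {
		out[len(s)-1-i] = v
	}
	return out
}

func main() {
	// Property: reversing twice gives back the original slice.
	roundTrip := func(s []int) bool {
		r := Reverse(Reverse(s))
		if len(r) != len(s) {
			return false
		}
		for i := range s {
			if r[i] != s[i] {
				return false
			}
		}
		return true
	}

	// quick.Check feeds the property randomly generated inputs (100 by default).
	if err := quick.Check(roundTrip, nil); err != nil {
		fmt.Println("property failed:", err)
		return
	}
	fmt.Println("property held for all sampled inputs")
}
```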

They have a "Chaos Monkey" [1] feature that is intended to bring down individual nodes. "Exposing engineers to failures more frequently incentivizes them to build resilient services."

If Chaos Monkey had been responsible for setting off a global outage, I could imagine business leaders getting cold feet about using a tool like this. In traditional companies, anyway, they'd never have seen the benefit of it, and after hearing only the costs they'd probably be livid that a widespread outage had been caused by something like this.

[1] https://github.com/Netflix/chaosmonkey

Like the SRE book from Google. It's also worth checking out how companies such as Netflix put reliability into practice.

Spinnaker

https://www.spinnaker.io/

Chaos Monkey

https://github.com/Netflix/chaosmonkey

Principles of Chaos Engineering

http://principlesofchaos.org/

There is no mention of Netflix on the site, but the term Chaos Engineering, and the popularization of the technique, seem to come from Netflix. The Chaos Monkey README even links to this site.

https://github.com/Netflix/chaosmonkey

https://medium.com/netflix-techblog/chaos-engineering-upgrad...

http://www.oreilly.com/webops-perf/free/chaos-engineering.cs...

https://en.wikipedia.org/wiki/Chaos_Monkey

Also check out:

- Chaos Monkey by Netflix (https://github.com/Netflix/chaosmonkey)

- Jepsen Tests by Aphyr (http://jepsen.io/)

- PANIC by us (https://github.com/gundb/panic-server)