What does Hacker News think of chaosmonkey?
Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures.
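As a rough illustration of that idea (not Chaos Monkey's actual code, which lives in the Go repo linked throughout this page), the core mechanism is simple: pick a random instance from a group and terminate it, forcing the rest of the system to prove it can recover. The `Instance` type and printed "terminate" step below are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Instance is a hypothetical stand-in for a cloud VM in an auto-scaling group.
type Instance struct {
	ID string
}

// pickVictim mirrors the core of the technique: choose one instance
// uniformly at random, so every instance is equally likely to die.
func pickVictim(group []Instance) Instance {
	return group[rand.Intn(len(group))]
}

func main() {
	group := []Instance{{ID: "i-0a1"}, {ID: "i-0b2"}, {ID: "i-0c3"}}
	victim := pickVictim(group)
	// A real tool would call the cloud provider's terminate API here;
	// this sketch only prints the decision.
	fmt.Println("would terminate", victim.ID)
}
```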
Few unit tests in the kernel would be able to compete with so many chaos monkeys (a reference to Netflix's https://github.com/Netflix/chaosmonkey).
Like the Well-Architected Tool/Framework, Trusted Advisor, etc.?
I'm impressed that Chaos Monkey [0] is now a first-party service. So you can break your own cloud account on demand and pay for breaking it?
[0]: https://github.com/Netflix/chaosmonkey (I think it was the first popular open-source implementation of chaos engineering)
Here's one from Netflix that will give you an ulcer: https://github.com/Netflix/chaosmonkey
Here's what Google does: https://www.usenix.org/conference/lisa15/conference-program/...
>> we all get dumber for it
Not _all_. Only those who feel inclined to reject the obvious.
I'd argue that it should be _easier_ for a 2-man company to adapt to cloud service outages, as they likely don't have to keep up with nearly as many backups or moving parts.
Doing so is hard, but it's the only way to reliably know how a system behaves under unpredictable failures.
So learn up:
But when the “goal” of the system is just “arbitrary short-term desires of management,” you can easily point out the problems, yet there is no agreement on what constraints you can trade off against them.
This is especially true for extensibility, where it's easy to get carried away making a system extensible for future changes, many of which turn out to be wasted effort because you never needed that flexibility anyway, and everything changed after Q2 earnings were announced, etc.
In those cases, it can actually be more effective engineering to “overfit” to just what management wants right now and accept that you have to pay the pain of hacking extensibility in on a case-by-case basis. This definitely reduces wasted effort from a YAGNI point of view.
The closest thing I could think of to the same idea of “regularizing” software complexity would be Netflix’s ChaosMonkey [0], which is basically like Dropout [1] but for deployed service networks instead of neural networks.
Extending this idea to actual software would be quite cool. Something like the QuickCheck library for Haskell, but one that somehow randomly samples extensibility needs and penalizes some notion of how hard the code would be to extend to each sampled case. Not even sure how it would work... (a rough sketch of the randomized-failure-testing half follows the links below).
[0]: https://github.com/Netflix/chaosmonkey
[1]: https://en.m.wikipedia.org/wiki/Dropout_(neural_networks)
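For the randomized-failure-testing half of that idea, a minimal sketch is already expressible with Go's standard `testing/quick` package, which is in the spirit of QuickCheck: let the framework sample random inputs (here, which replica to kill) and assert the system's contract holds for every sample. The `serviceUp` toy model and test below are hypothetical, purely for illustration; sampling "extensibility needs" rather than failures is the part nobody knows how to do yet.

```go
package chaos

import (
	"testing"
	"testing/quick"
)

// serviceUp is a toy model of a replicated service: it can serve
// traffic as long as at least one replica is healthy.
func serviceUp(replicas []bool) bool {
	for _, healthy := range replicas {
		if healthy {
			return true
		}
	}
	return false
}

// TestSurvivesOneRandomFailure lets testing/quick pick random values,
// kills one of three replicas accordingly, and asserts the service
// contract ("still up") holds every time.
func TestSurvivesOneRandomFailure(t *testing.T) {
	prop := func(seed uint8) bool {
		replicas := []bool{true, true, true}
		replicas[int(seed)%len(replicas)] = false // inject one random failure
		return serviceUp(replicas)
	}
	if err := quick.Check(prop, nil); err != nil {
		t.Error(err)
	}
}
```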
If Chaos Monkey had been responsible for setting off a global outage, I could imagine business leaders getting cold feet about using a tool like this. In traditional companies, anyway, they'd never have seen the benefit of it, and after hearing only the costs, they'd probably be livid that a widespread outage had been caused by something like this.
- Spinnaker
- Chaos Monkey (https://github.com/Netflix/chaosmonkey)
- Principles of Chaos Engineering (https://github.com/Netflix/chaosmonkey)
- https://medium.com/netflix-techblog/chaos-engineering-upgrad...
- http://www.oreilly.com/webops-perf/free/chaos-engineering.cs...
- Chaos Monkey by Netflix (https://github.com/Netflix/chaosmonkey)
- Jepsen Tests by Aphyr (http://jepsen.io/)
- PANIC by us (https://github.com/gundb/panic-server)