It seems that they have recreated a service mesh. Running Istio or Consul Connect takes care of the vast majority of the issues listed in this post: encryption, identity, access control. It even does so transparently for the developers (no modification of the code...)
Only the scale that these solutions can support is different.
Isn't Istio implemented mostly as a sidecar container (Envoy proxy), though? The article mentions they are running containers via their Tupperware orchestrator. If they are largely running containerized, where is the scaling issue with adding sidecars to implement the service mesh? I don't have any experience with Istio, but I'm genuinely curious along which axis it (or Connect, Linkerd, etc.) doesn't scale.
Envoy is so slow that deploying it at this scale would be too costly or, if you could afford it, would immediately present itself as a huge opportunity for cost reduction. People who are measuring their tail latency in microseconds aren't going to tolerate Envoy's marginal latency, which will be milliseconds even at the median.
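To make the arithmetic behind that concrete, here is a rough back-of-the-envelope sketch. It assumes the usual sidecar model where each service-to-service call passes through two proxies (one on the caller's side, one on the callee's side) and that calls in a chain happen sequentially. The function names and all the numbers are hypothetical placeholders for illustration, not measurements of Envoy or of anyone's production system.

```python
def added_latency_us(per_proxy_overhead_us: float,
                     call_depth: int,
                     proxies_per_call: int = 2) -> float:
    """Latency added by sidecars along a sequential chain of calls.

    per_proxy_overhead_us: assumed round-trip overhead of one proxy (hypothetical).
    call_depth: number of service-to-service calls made one after another.
    proxies_per_call: 2 in the sidecar model (caller side + callee side).
    """
    return per_proxy_overhead_us * proxies_per_call * call_depth


def overhead_ratio(baseline_call_us: float,
                   per_proxy_overhead_us: float,
                   call_depth: int) -> float:
    """Added sidecar latency relative to the chain's latency without proxies."""
    baseline_total_us = baseline_call_us * call_depth
    return added_latency_us(per_proxy_overhead_us, call_depth) / baseline_total_us


if __name__ == "__main__":
    # Hypothetical inputs: 200 us per call without a mesh,
    # 1000 us (1 ms) of overhead per proxy, a chain of 3 calls.
    print(added_latency_us(1000.0, 3))
    print(overhead_ratio(200.0, 1000.0, 3))
```

Under those made-up inputs the proxies add several times the latency of the calls themselves, which is the shape of the objection: if your baseline is in microseconds, a per-proxy cost in milliseconds dominates everything else.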
Interesting, I wasn't aware that Istio has such performance issues. Isn't Google using this as well, though, or at least an internal version of it? Surely they are on the same scale as FB.
I'm curious as to what the cause of the latency is. TLS handshakes?
https://github.com/istio/istio.io/pull/4220
More here. It basically suggests that you shouldn't stop Istio from scaling out before 500 rps; it doesn't like that at all:
https://kinvolk.io/blog/2019/05/performance-benchmark-analys...