> I should mention that Borg, like Kubernetes, relies on containers like Docker

Hmm, containers _like_ Docker, or Docker itself? I thought Google had been using lmctfy since long before Docker existed?

cgroups were written at Google, and have been used internally for a very long time; they provide "container"-like limits on resource usage for a group of processes.
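
For a feel of the mechanics, here is a minimal sketch against the cgroup v2 filesystem interface (nothing Google-specific; it assumes /sys/fs/cgroup is mounted as cgroup v2, that you run it as root, and the group name and memory cap are made up for illustration):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Create a new cgroup; every file the kernel exposes under this
	// directory is a knob or a counter that applies to the group as a whole.
	group := filepath.Join("/sys/fs/cgroup", "demo")
	if err := os.Mkdir(group, 0o755); err != nil && !os.IsExist(err) {
		panic(err)
	}

	// Cap the group's memory at 256 MiB via the cgroup v2 "memory.max" knob.
	if err := os.WriteFile(filepath.Join(group, "memory.max"), []byte("268435456"), 0o644); err != nil {
		panic(err)
	}

	// Move this process into the group; the limit now covers it and any
	// children it forks.
	pid := []byte(fmt.Sprintf("%d", os.Getpid()))
	if err := os.WriteFile(filepath.Join(group, "cgroup.procs"), pid, 0o644); err != nil {
		panic(err)
	}
	fmt.Println("now running under", group)
}
```

That's the whole trick: no virtualization, just kernel-enforced accounting and limits on a process tree.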

I assume that Google isn't using Docker internally for production services, but don't know for sure (and I assume anyone who does know for sure can't tell you).

It also just doesn't make sense for Google to use Docker (or even Kubernetes) for their core infra. They've been the foremost leader in distributing containerized applications across data centers. Whatever they've already built is almost certainly more battle-tested and more customized to their needs than anything public that's based on their concepts.

Sure, it's battle-tested, but it's quite possible that the advantage of having software with more eyeballs on it is even more valuable.

Last I heard, the biggest reason Google still uses Borg instead of Kubernetes is switching costs.

You heard wrong.

There’s a meme out there, helpfully nudged along by Google, that Kubernetes is Borg “done right” and its successor. It’s even mentioned in this article. Neither of those things is true. Not even remotely. Please tell everyone to stop repeating the meme, because it distracts from Kubernetes’ true purpose, which is to lock people into GKE and force competitors to ship a Kubernetes runtime to compete. It’s partly the OpenStack playbook: if everybody runs the same platform, competition inherently drifts toward other aspects of the business, such as customer support. Seriously, am I the only one who noticed the timing of Kubernetes and Google Cloud? Really? But Google shipped it, so now it has an ecosystem, and zealots who force entire teams onto it for zero upside and nonzero overhead, while the system that actually looks like Borg, HashiCorp’s Nomad (far superior to k8s), twiddles its thumbs with pretty much no mindshare.

For one, if Kubernetes were the successor to Borg, they wouldn’t have hobbled it architecturally as much as they have by marrying it to both Docker (kinda) and etcd, and deciding in the beginning to do every cluster mutation via external consensus in etcd, because you know, that’s a great idea and a classic Google design. Remember when Kubernetes pushed all job state changes through consensus and a flapping job could OOM etcd? I do too. Someone cynical could argue its fundamental architectural limitations are intentional. (I would argue simply that it didn’t have Paul Menage and most of the other names on the Borg paper working on it, to my knowledge.) I can already hear keyboards getting angry, ready to yammer about how it just works. Not at seven-digit machine scale, it doesn’t, and never will. I’m happy it works for you. At large-fleet scale it’s a toy, which I’ll revisit in another point. Everyone I’m aware of operating at Borg or Mesos scale has either ruled out Kubernetes or failed with it. No, really.

Relatedly, if Kubernetes were the successor to Borg, it’d be in C++. It just would be, and that’s not a language flame war. Ever wonder what percentage of systems at Google are C++? Ever wonder what that number does when you qualify “infrastructural”?

For two, Google containers aren’t an entire operating system, which is what a Docker container amounts to unless you go through crazy gymnastics. Seriously, this paragraph could be an essay. To paraphrase Jeremy Clarkson, Docker looks like a proper container system described over a blurry fax. Maybe we do need seventy probably-not-deduplicated copies of getty on every machine, and I’m yelling at clouds. I doubt it, and it reeks of “disk is cheap, fuck it.”

For three, Kubernetes is several orders of magnitude behind Borg and Omega, if that’s still alive, in terms of scheduling performance and maximum “cluster” size (I quote cluster because Google identifies a Borg unit as a cell, and a cluster means something else). This is not fixable in Kubernetes, in my reasonably informed opinion, without doing consensus differently. To my point, Borg does consensus and cluster state much differently, and you know what? It’s fine. Anybody who has used fauxmaster will back me up on that, and John Wilkes even said yup, every time we hit a limit we manage to double it with no end in sight. Why would that suddenly change? etcd remains Kubernetes’ Achilles heel, and this is why messaging around Kubernetes has gravitated toward smaller, targeted clusters. Bonus: if you find a bug in etcd make sure it loudly affects Kubernetes so it gets properly prioritized. Double bonus: someone was brave enough to suggest Consul in #1957 and children. Go read and sigh.

For four, when’s the last time you ran a 10,000+ node MapReduce on Kubernetes? Surprise, the underpinnings of Borg handle both batch and interactive with the exact same control plane, which is where the billions of containers a day number they occasionally talk about comes from. I mean, several JARs of glue might get you to Hadoop scheduling via Kubernetes, but that’s a much different animal than the platform thinking in terms of jobs with different interactivity requirements.

For five, half of Borg is the shit around it. Borg works because everything behaves the same. Everything is a Web server. Everything exposes /statusz. Everything builds and monitors the same way. Everything speaks the same RPC to each other. All of this is implemented by forcing production systems at Google to choose from four (as of my tenure) languages which are well tended and manicured by hundreds of people. Google has a larger C++ standard library team than many startups have engineers. Borg works because apps let it work. They’re not black boxes. Unlike Kubernetes.
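
If you want a feel for what that sameness buys, here’s a rough Go sketch of the pattern (the names are invented and this is obviously not Google code): every binary links one shared server library that wires up the standard introspection endpoint before the app registers anything of its own.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

var startTime = time.Now()

// registerStandardHandlers is the kind of thing a shared server library does
// for every binary in the fleet: one uniform introspection page, no opt-out.
// (Endpoint name and fields are illustrative only.)
func registerStandardHandlers(mux *http.ServeMux) {
	mux.HandleFunc("/statusz", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "uptime: %s\n", time.Since(startTime).Round(time.Second))
		fmt.Fprintf(w, "build: %s\n", "example-build-label")
	})
}

func main() {
	mux := http.NewServeMux()
	registerStandardHandlers(mux) // every app gets this "for free"

	// Only after the standard plumbing does the app add its own handlers.
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "application-specific behavior goes here")
	})
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```

When every task in the fleet answers the same endpoints, the cluster manager and the monitoring stack get introspection for free. That’s the opposite of a black box.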

Which brings me to point the sixth, which is that the reason you haven’t seen open source Borg is (a) they’re not moving off it, like, ever (I’d bet my season tickets on it), because significant parts of every production system and tool would have to change and (b) they can’t unravel Borg and the rest of the google3 codebase, because it’s so fundamental to the Google ecosystem and half of Borg’s magic is wrapped up in other projects within Google which they aren’t keen to show you.

Link to this answer next time anyone gets tempted to relay what they’ve heard about Borg and Kubernetes, please. For years I’ve watched this tale evolve until it’s barely recognizable as factual. Saying Kubernetes is Borg’s proper successor not only drastically insults Borg and the hundreds (thousands?) who have worked on it, it’s also like calling a cotton candy machine the successor to the automobile. They’re that different.

[Disclaimers: I worked on Borg and Omega, and currently work on Kubernetes/GKE. Everything here is my personal opinion.]

There's a lot to unpack here, but I'll do my best.

I don't see Kubernetes locking people into GKE. There's an extensive conformance program (https://github.com/cncf/k8s-conformance) administered by the CNCF. AWS and Azure both have certified hosted Kubernetes offerings. Portability is in Google's best interest.

Go, Docker, and etcd were the best open-source technologies for the job at the time Kubernetes was created (and arguably still are). Open-sourcing Borg would have been impossible, due to its use of many Google-specific libraries (though a number of those have been open-sourced since then), and its close coupling to the Google production environment. Commenting more specifically on each of the pieces you mentioned:

* Go was chosen over C++ because, like C++, it is a systems language, but is much more accessible for building an open-source community.

* Docker was (and still is) by far the most popular container runtime, and the slimmer containerd makes it an even more appropriate container runtime for a system like Kubernetes. While it's true that in Borg the container runtime and "package" (container image) management systems are separate, the tradeoffs between packaging more in the image vs. pre-installing dependencies on the host are exactly the same as with Docker images. In any event, it's entirely feasible to build very slim Docker images (you definitely don't need getty in your image :-).

* You can read the reasons etcd was chosen in this recent comment (https://news.ycombinator.com/item?id=17476142) from a Red Hat employee who is one of the earliest contributors to Kubernetes and one of the most prolific. Regarding consensus, I didn't understand your comment; Borg uses Paxos and etcd uses Raft, but those are basically equivalent algorithms.

Regarding scalability, we do continuous scalability testing as part of the Kubernetes CI pipeline, at a cluster size of 5000 nodes. If you're interested in learning more, I'd encourage you to join the scalability SIG (https://github.com/kubernetes/community/tree/master/sig-scal...). I'm not aware that "messaging around Kubernetes has gravitated toward smaller, targeted clusters." It's true that a lot of people do use small-ish clusters, but AFAICT that's not because of scalability limitations, but rather because (1) the hosted Kubernetes offerings make it so easy to spin up clusters on demand, and (2) until recently, Kubernetes was lacking critical multi-tenancy features that would allow, say, multiple teams within a company to safely share a cluster.

Regarding mixing batch and interactive/serving applications in a single cluster managed by a single control plane, this has been the intention of Kubernetes from the beginning. It's true that open-source batch systems like Hadoop and Spark have traditionally shipped with their own orchestrators/schedulers, but that's starting to change as Kubernetes becomes more popular; for example, Spark now supports Kubernetes natively (https://kubernetes.io/blog/2018/03/apache-spark-23-with-nati...). In terms of features that enable batch and serving workloads to share a node and a cluster, Kubernetes has had the concept of QoS classes (https://kubernetes.io/docs/tasks/configure-pod-container/qua...) from the beginning, and as of the most recent Kubernetes release we now have priority/preemption (https://cloudplatform.googleblog.com/2018/02/get-the-most-ou...). QoS classes and priority/preemption are the two main concepts that allow batch and interactive/serving applications to share nodes and clusters in Borg, and we now have them in Kubernetes.
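
As a rough sketch of how those two concepts surface in the API (this uses the k8s.io/api Go types; the pod name, image, and the "high-priority" PriorityClass are placeholders you'd have to define yourself), a serving pod gets the Guaranteed QoS class by setting requests equal to limits, and its preemption behavior comes from the priority class it references:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Requests == limits for every resource => the kubelet classifies the
	// pod as "Guaranteed"; the priority class governs scheduling/preemption.
	resources := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("500m"),
		corev1.ResourceMemory: resource.MustParse("256Mi"),
	}
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "serving-pod"},
		Spec: corev1.PodSpec{
			PriorityClassName: "high-priority", // must exist as a PriorityClass object
			Containers: []corev1.Container{{
				Name:  "server",
				Image: "example/server:latest",
				Resources: corev1.ResourceRequirements{
					Requests: resources,
					Limits:   resources,
				},
			}},
		},
	}
	fmt.Printf("requests=%v limits=%v\n",
		pod.Spec.Containers[0].Resources.Requests,
		pod.Spec.Containers[0].Resources.Limits)
}
```

A batch pod would typically reference a lower priority class and set requests below limits (Burstable QoS), so the serving workload wins when the node comes under resource pressure.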

On your fifth point, I agree that this is one of the strengths of the Google production environment, but Kubernetes is limited in how prescriptive it can be in dictating how people write applications, since we want Kubernetes to work with essentially any application. This is why we have, for example, extremely flexible liveness/readiness probes in Kubernetes (https://kubernetes.io/docs/tasks/configure-pod-container/con...) rather than the expectation that every application has a built-in web server that exports a predefined /statusz endpoint. That said, we have been more prescriptive in how to build Kubernetes control plane components (for example, such components generally have /healthz endpoints and export Prometheus instrumentation according to the guidelines outlined at https://github.com/kubernetes/community/blob/master/contribu...). Over time, as containers and the "cloud native" architecture become more popular, I think there will be more standardization along the lines you described, as people see the benefit of being able to plug their apps directly into standard container ecosystems. To some extent Istio (https://github.com/istio/istio) is a step in that direction, and in some sense even better, because it interposes transparently rather than requiring you to build your application a particular way.
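
As a sketch of what that flexibility looks like from the application side (the paths and port here are just examples, not anything Kubernetes mandates), all an HTTP probe needs is an endpoint to hit; the application decides what "healthy" and "ready" mean:

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	var ready atomic.Bool

	// Liveness: the process is up and able to serve requests at all.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: the app decides when it can take traffic; until then the
	// probe sees a 503 and Kubernetes keeps the pod out of Service endpoints.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	go func() {
		time.Sleep(5 * time.Second) // stand-in for warm-up (caches, connections, ...)
		ready.Store(true)
	}()

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A liveness probe pointed at /healthz restarts the container if the process wedges, and a readiness probe pointed at /readyz keeps the pod out of rotation until it has warmed up; nothing about the application's internals is prescribed.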

For anyone interested in learning more about the evolution of cluster management systems at Google, I recommend this paper: https://ai.google/research/pubs/pub44843

While Kubernetes is definitely not the same codebase as Borg, I do think it's accurate to say that it is the descendant of Borg.