I have growing concerns over GitLab. I appreciate the openness and forthrightness of their handling of these sorts of incidents and in regard to other issues, but there comes a point where none of that matters if the product itself is inherently unreliable. This post isn't about any one incident, including this one, but an ongoing trend that I don't get the sense is improving.
To be completely fair, I'm on the free plans and have no rightful expectation of any particular performance level. Having said that, I have considered moving to paid tiers, and I do advise others on these sorts of services (whether cloud-based or on-premises). Every time I see a planned maintenance window of "under 1m", I simply realize that I shouldn't plan anything important on GitLab that day. I can't see that this would be any different with GitLab.com paid plans, and I have to imagine there is something inherently difficult in managing the software if its own developer has issues with common maintenance; this colors my impression of what running it in-house might be like. It seems a beast. I get that the scaling issues differ between something like GitLab.com and GitLab on-premises, but the history of these incidents is coloring my impression.
At some point moving fast and breaking things needs to give way to the pursuit of enough stability that users aren't overly concerned about whether or not the tools they depend on, or more importantly the data they're storing, are regularly at risk.
I don't want to buy services/products from a company that is merely trustworthy and open when things go wrong. I want to buy services/products from companies that need to prove those qualities only rarely.
It always seems to be either their storage abstraction (Ceph?) or Postgres. I've never used Postgres but from what I see it's not really designed for enormous scale (it's a very old project). Perhaps they would be well served to see if CockroachDB gives them more stability. I've started using it at a small scale and the clustering aspect seems legit.
I work with PostgreSQL quite a lot and I don't think this is the case.
Yes, PostgreSQL is a mature project. But that by itself has little or no bearing on the degree to which it can scale; Linux is just about as old, yet it still drives most of the infrastructure we're talking about. In context, PostgreSQL has been under continuous development since its inception, and never more so than in the past decade or so, with substantial community and corporate backing. To conflate project age with current robustness is to indulge in fallacious thinking (in this case "cum hoc ergo propter hoc", if I'm not mistaken). The project has advanced with the times. There are good examples of PostgreSQL running at scale, including at companies operating at scale such as Instagram, Skype (up to at least the point of the Microsoft acquisition), Pandora, as well as others. Most of these, I'd bet, are or were using PostgreSQL at scales substantially larger than what GitLab faces.
Relational databases require good management and good design to function properly. There were people who specialized in this, called database administrators (at least for the management piece, anyway). My feeling (perhaps unjustified) is that in start-upish environments there is a tendency to minimize DBA expertise in favor of having more conventional developers who can "get the database to work"... which is a different standard than "getting the database to perform". That's mostly not my world, so I may be jumping to conclusions (I'm an enterprise systems guy on most days).
I think GitLab probably has a complex software product and lacks the correct expertise in infrastructure (including, yes, DBAs) for their SaaS offering. I'm reading tea leaves, but that would be a perfect storm which would produce exactly what we see no matter how robust any one piece of this puzzle might be.
I knew someone here would misinterpret why I mentioned its age. As I said, it's not what it was designed for. Perhaps it has bolted on things in the 20+ years since it was first conceived, but it wasn't purpose-built for scale.
> To conflate project age with current robustness is to indulge in fallacious thinking
It only makes sense to talk about "robustness" in the context something was designed to work in. I can't abuse a system and call it unrobust. Of course Postgres is robust. You're twisting what I was saying.
> There are good examples of PostgreSQL running at scale, including at companies operating at scale such as Instagram
I'd actually be keen to learn more about how Postgres works at Instagram scale without their own customisations to make it work in that setting.
If Postgres works well in that setting, why do things like this exist? https://github.com/citusdata/citus