Hi all - I'm the head of engineering at GitHub. Please accept my sincere apology for this downtime. The cause was a bad deploy (a db migration that changed an index). We were able to revert in about 30 minutes. This is slower than we'd like, and we'll be doing a full RCA of this outage.
For those who are interested, on the first Wednesday of each month, I write a blog post on our availability. Most recent one is here: https://github.blog/2021-03-03-github-availability-report-fe...
@keithba I have built a private GitHub Action around https://github.com/sbdchd/squawk (for Postgres) that lints all our migration files on each PR. The action extracts the raw SQL from the codebase and passes it to squawk. It catches many migrations that take exclusive locks or omit `index concurrently`, which would otherwise have been released to production and caused downtime or degraded service. Maybe something you should start doing.
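For anyone curious what this kind of check looks like in spirit, here is a minimal Python sketch of a single rule: flagging `CREATE INDEX` statements that lack `CONCURRENTLY`. This is not squawk's implementation and not the private action described above. The function name `find_blocking_index_statements` and the sample migration are hypothetical, and the statement splitting is deliberately naive.

```python
import re
from typing import List

# Hypothetical, simplified rule: match CREATE [UNIQUE] INDEX that is NOT
# followed by CONCURRENTLY. A real linter such as squawk parses the SQL
# properly instead of relying on a regular expression.
_CREATE_INDEX_WITHOUT_CONCURRENTLY = re.compile(
    r"^\s*CREATE\s+(UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY\b)",
    re.IGNORECASE,
)


def find_blocking_index_statements(sql: str) -> List[str]:
    """Return the statements in `sql` that create an index without CONCURRENTLY."""
    # Naive split on ';'. Good enough for a sketch, but wrong for SQL that
    # contains semicolons inside string literals or function bodies.
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    return [s for s in statements if _CREATE_INDEX_WITHOUT_CONCURRENTLY.match(s)]


# Hypothetical migration file contents, for illustration only.
migration = """
CREATE INDEX idx_users_email ON users (email);
CREATE INDEX CONCURRENTLY idx_users_name ON users (name);
ALTER TABLE users ADD COLUMN age int;
"""

for statement in find_blocking_index_statements(migration):
    print("missing CONCURRENTLY:", statement)
```

In a CI job you would presumably run something like this (or, better, squawk itself) over the migration files changed in the PR and fail the build if anything is reported.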
Even so, it's always possible for an engineer to submit a schema change which is detrimental to performance. For example, dropping an important index, or changing it such that some necessary column is no longer present. Linters simply cannot catch some classes of these problems, as they're application/workload-specific. Usually they must be caught in code review, but people make mistakes and could approve a bad change.
Disclosure: I'm the author of Skeema, but have not worked for or with GitHub in any capacity.
[1] https://github.com/github/gh-ost
[2] https://github.blog/2020-02-14-automating-mysql-schema-migra...