I get Uber is huge. But honestly, there was nothing out there that could fulfill their use case? Cassandra, Elasticsearch, InfluxDB, etc.? I might be completely wrong, but I just highly doubt that.
I can give you an ex-insider's view on this.
Uber made an early strategic decision to invest in on-premise infrastructure due to fears that either Amazon or Google would enter the on-demand market as competitors and bring their cloud infrastructure to bear, potentially squeezing us on costs. Azure wasn't much of an option at the time. This decision limited our adoption of cloud-native solutions like Spanner and DynamoDB. We ended up doing a lot of sharded MySQL in our own data centers instead.
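To make "sharded MySQL" concrete, here is a minimal sketch of the basic idea: route each row to one of N database instances by hashing its key, so no single server has to hold the whole dataset. This is purely illustrative (the hostnames and key format are made up, and Uber's actual sharding layer was more sophisticated than a bare hash):

```python
import hashlib

# Hypothetical shard hosts -- not real Uber infrastructure.
SHARDS = [f"mysql-{i:02d}.dc1.internal" for i in range(16)]

def shard_for(key: str) -> str:
    """Map a key to a shard with a stable hash, so the same key
    always routes to the same MySQL instance."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("trip:8f14e45f"))  # same trip id -> same host, every time
```

The trade-off versus a single cloud database is that cross-shard queries and resharding become your problem, which is part of why running this at scale takes real engineering investment.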
This on-prem decision led to a lot of challenges internally, where we would adopt OSS and then have difficulty scaling it to our needs. For some tech, like Kafka, it worked out, and we hired Kafka contributors who helped us scale it. For other tech, like Cassandra, it was a pretty epic failure. I am sure more of these war stories exist that I wasn't privy to myself.
Coupled with the fact that we were early adopters of Go, which had its own OSS ecosystem, we found that writing a lot of our own infrastructure solutions was the only viable option at our scale.
What you are seeing now is a lot of that homegrown infrastructure being open sourced in a big way, as people who have left Uber continue to see value in investing in the tech they worked so hard to build. There is probably a nontrivial amount of work needed to scale the Uber OSS down for smaller use cases, but some startups are emerging to make that happen.
Source: I worked at Uber from 2015-2019 on product and platform teams and had several close colleagues in infra.
Netflix loves Cassandra, right? [0][1] So could someone describe why it wasn't a great fit for Uber? How come it was easier to reinvent the wheel in Go than to cobble together something with Cassandra/ES/Kafka (or other Java gadgets from the Hadoop ecosystem)?
[0]: https://netflixtechblog.com/scaling-time-series-data-storage... [1]: https://www.datastax.com/resources/video/cassandra-netflix-a...
Netflix actually built their own metrics time-series store, Atlas, for reasons similar to Uber's for building M3DB (the FOSDEM talk mentions hardware reduction and on-call reduction). However, the open-source Atlas only has an in-memory store component, which was too expensive for Uber to run (since the dataset is in the petabytes).
Ok, but I am fairly confident Netflix is also at that kind of scale.
Netflix has a section on Atlas's documentation about how they get around this: https://github.com/Netflix/atlas/wiki/Overview#cost
They also did this nice video that outlines their entire operation including how they do rollups: https://www.youtube.com/watch?v=4RG2DUK03_0
This is how they do rollups while keeping the tails accurate to parts per million and the middle to parts per hundred: https://github.com/tdunning/t-digest
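A quick sketch of why rollups need a mergeable quantile summary (the role t-digest plays) in the first place: you cannot roll up percentiles by averaging them across windows, because that dilutes tail spikes. This toy example uses raw samples and exact percentiles instead of an actual t-digest, whose point is to do the same merge with bounded memory; the data is synthetic:

```python
import random

random.seed(42)

def p99(values):
    """Exact 99th percentile (nearest-rank) of a list."""
    s = sorted(values)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

# Ten one-minute windows of synthetic latencies; one window has a tail spike.
windows = [[random.expovariate(1 / 10) for _ in range(1000)] for _ in range(10)]
windows[3] = [v * 20 for v in windows[3]]  # the spike

# Wrong rollup: average the per-window p99s -- the spike gets diluted.
avg_of_p99 = sum(p99(w) for w in windows) / len(windows)

# Right rollup: merge the underlying distributions, then take p99.
# A t-digest performs this merge compactly instead of keeping raw samples.
merged = [v for w in windows for v in w]
true_p99 = p99(merged)

print(f"average of per-window p99s: {avg_of_p99:.1f}")
print(f"p99 of merged data:         {true_p99:.1f}")
```

The merged p99 comes out far above the averaged one, which is exactly the tail behavior the t-digest approach preserves at a fraction of the storage.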