This article has a large gap in the story: it ignores sensor data sources, which are the highest-velocity and highest-volume data models by multiple orders of magnitude. They have become ubiquitous in diverse, medium-sized industrial enterprises, and the data intensity has turned those companies into some of the largest customers of cloud providers. Organizations routinely spend $100M/year to deal with this data, and the workloads are literally growing exponentially. Almost no one provides tooling and platforms that address it. (This is not idle speculation; I've run just about every platform you can name through lab tests in anger. They are uniformly inadequate for these data models, and everyone who can afford the tariff relies on bespoke platforms designed by specialists.)

If you add real-time sensor data sources to the mix, the rest of the architecture model kind of falls apart. Requirements upstream have cascading effects on architecture downstream. The deficiencies are both technical and economic.

First, you need a single ordinary server (like EC2) to be able to ingest, transform, and store about 10M events per second continuously, while keeping that data fully online for basic queries. You can't afford the latency overhead and systems cost of splitting these across separate systems. You need this efficiency because the raw source may be 1B events per second; even at 10M events per second per server, you'll need a fantastic cluster architecture. Most open source platforms tap out around 100k events per second per server for these kinds of mixed workloads, and no one can afford to run the 20k+ servers that a throughput-limited software architecture would require (never mind the cluster management aspects at that scale).
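To make the scale gap concrete, here is a back-of-envelope sketch (in Java, since the JVM tooling comes up below) using only the rates quoted above; the server counts are bare quotients that ignore replication, headroom, and failover, all of which push the numbers higher:

```java
public class ThroughputBackOfEnvelope {
    public static void main(String[] args) {
        // Rates from the paragraph above; redundancy and headroom deliberately ignored.
        double sourceEventsPerSec  = 1e9;    // raw source rate
        double perServerTarget     = 10e6;   // ingest + transform + store + query on one box
        double perServerTypicalOss = 100e3;  // where mixed workloads tend to tap out

        System.out.printf("Servers at 10M events/s each:  %,.0f%n",
                sourceEventsPerSec / perServerTarget);       // ~100: hard, but a manageable cluster
        System.out.printf("Servers at 100k events/s each: %,.0f%n",
                sourceEventsPerSec / perServerTypicalOss);   // ~10,000 before any redundancy
    }
}
```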

Second, storage cost and data motion are the primary culprits that make these data models uneconomical. Open source tends to be profligate in both dimensions, and when you routinely operate on endless petabytes of data, that makes the entire enterprise problematic. To be fair, this is not to blame open source platforms per se; they were never designed for workloads where storage and latency costs are critical for viability. It can be done, but it was never a priority, and you would design the software very differently if it were.
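The same kind of sketch for the storage side shows why this dominates the economics. Every input below is an illustrative assumption (event size, replication factor, write amplification, $/TB-month), not a measured figure, but the shape of the result doesn't change much if you move them around:

```java
public class StorageBackOfEnvelope {
    public static void main(String[] args) {
        // Every input here is an illustrative assumption, not a figure from the thread.
        double eventsPerSec  = 1e9;   // fleet-wide raw source rate
        double bytesPerEvent = 64;    // assumed average encoded event size
        double replication   = 3;     // assumed replication factor
        double amplification = 1.5;   // assumed index/metadata/write amplification

        double rawBytesPerDay   = eventsPerSec * bytesPerEvent * 86_400;
        double storedBytesMonth = rawBytesPerDay * 30 * replication * amplification;

        System.out.printf("Raw ingest per day: %,.0f TB%n", rawBytesPerDay / 1e12);
        System.out.printf("Added per month:    %,.0f PB%n", storedBytesMonth / 1e15);

        // Cost scales linearly with whatever $/TB-month you actually pay, so a 2-3x
        // difference in storage efficiency decides whether the workload is viable at all.
        double dollarsPerTbMonth = 20;  // illustrative price, substitute your own
        System.out.printf("Bill for one month's data: $%,.0f/month, and it accumulates%n",
                (storedBytesMonth / 1e12) * dollarsPerTbMonth);
    }
}
```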

I will make a prediction. When software that can address sensor data models becomes a platform instead of bespoke, it will eat the lunch of a lot of adjacent data platforms that aren't targeted at sensor data, for a simple reason: the extreme operational efficiency of data infrastructure required to handle sensor data models applies just as much to any other data model; there simply hasn't been an existential economic incentive to build it for those other data models. I've seen this happen several times: someone pays for bespoke sensor data infrastructure and realizes they can adapt it to run their large-scale web analytics (or whatever) many times faster and at a fraction of the infrastructure cost, even though it wasn't designed for that. And it works.

>> Almost no one provides tooling and platforms that address it

I think this is because the kind of company you describe is not that common (yet?). There are tools and systems you can use, especially from high-frequency trading, which has somewhat similar challenges. KDB+ and co. would be my first stop to check whether there is something I could use. The question is the financial structure and scale of the problem, which determines whether these tools are even in play. There are other interesting projects in the space:

- https://github.com/real-logic/aeron

- https://lmax-exchange.github.io/disruptor/

Of course these are not exactly what you need; long-term storage and querying (the part KDB covers) is largely unsolved.
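For what it's worth, here is roughly what the part they do cover looks like: a minimal Disruptor hot-path sketch along the lines of the LMAX examples, with pre-allocated events published and consumed through a ring buffer. SensorEvent and its fields are made up for illustration, and nothing here touches durable storage or queries:

```java
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorSketch {
    // Mutable event, pre-allocated and reused by the ring buffer (no per-event allocation).
    static final class SensorEvent {
        long sensorId;
        long value;
    }

    public static void main(String[] args) throws Exception {
        // Ring buffer size must be a power of two.
        Disruptor<SensorEvent> disruptor = new Disruptor<>(
                SensorEvent::new, 1 << 16, DaemonThreadFactory.INSTANCE);

        // Consumer side: in a real pipeline this would transform/aggregate before storage.
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                consume(event.sensorId, event.value));

        RingBuffer<SensorEvent> ring = disruptor.start();

        // Producer side: publish by mutating a pre-claimed slot, keeping the hot path GC-free.
        for (long i = 0; i < 1_000_000; i++) {
            ring.publishEvent(
                    (event, seq, id, v) -> { event.sensorId = id; event.value = v; },
                    i % 1024, i);
        }

        disruptor.shutdown();  // drains outstanding events, then stops the handler threads
    }

    static void consume(long sensorId, long value) {
        // Placeholder for downstream work (aggregation, batching to storage, etc.).
    }
}
```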

The other tools that you might be referring to by "most of the open source platforms" are indeed not capable of doing this. I've spent the last 10 years optimizing such platforms, and they are not even remotely close to what you need; you (or anybody who thinks they could be optimized to that level) are wasting your time.