One thing about logging and tracing is the inevitable cost (in real money).

I love observability probably more than most. And my initial reaction to this article is the obvious: why not both?

In fact, I tend to think more in terms of "events" when writing both logs and tracing code. How that event is emitted, stored, transmitted, etc. is in some ways divorced from the activity itself. I don't care whether it goes to stdout, over UDP to an aggregator, into trace spans, or into Kafka, etc.
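As a rough illustration of what I mean (not from any particular library - the `Event` and `Sink` names here are made up for the sketch), application code only ever produces events, and the sink is swappable:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// Event is the one thing application code produces. Where it ends up is a
// separate concern handled by whatever Sink is plugged in.
type Event struct {
	Name  string            `json:"name"`
	Time  time.Time         `json:"time"`
	Attrs map[string]string `json:"attrs,omitempty"`
}

// Sink decides how an event is notified/stored/transmitted: stdout, UDP to
// an aggregator, a span exporter, a Kafka producer, etc.
type Sink interface {
	Emit(e Event) error
}

// StdoutSink is the cheapest possible sink: one JSON line per event.
type StdoutSink struct{}

func (StdoutSink) Emit(e Event) error {
	return json.NewEncoder(os.Stdout).Encode(e)
}

func main() {
	var sink Sink = StdoutSink{} // swap for a UDP/Kafka/trace sink without touching callers
	err := sink.Emit(Event{
		Name:  "order.placed",
		Time:  time.Now(),
		Attrs: map[string]string{"order_id": "1234"},
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "emit failed:", err)
	}
}
```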

But inevitably I bump up against cost. For even medium-sized systems, the amount of data I would like to track gets quite expensive. For example, many tracing services charge for the tags you add to traces, so a simple `trace.String("key", value)` becomes something I think about from a cost perspective. I worked at a place with a $250k/year New Relic bill where we avoided any kind of custom attributes; just getting APM metrics for servers and databases was enough to reach that cost.
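To make that concrete, here's a rough sketch using the OpenTelemetry Go API (the `ENABLE_CUSTOM_ATTRS` flag and the `handleOrder` function are invented for the example) of what it looks like once you start gating custom attributes behind a switch purely for cost reasons:

```go
package main

import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// Hypothetical switch: in practice this might come from config, so the
// expensive per-request attributes can be turned off without a redeploy.
var enableCustomAttrs = os.Getenv("ENABLE_CUSTOM_ATTRS") == "true"

func handleOrder(ctx context.Context, orderID string) {
	_, span := otel.Tracer("orders").Start(ctx, "handleOrder")
	defer span.End()

	// Only attach custom attributes when we're willing to pay for them.
	if enableCustomAttrs {
		span.SetAttributes(attribute.String("order.id", orderID))
	}

	// ... do the actual work ...
}

func main() {
	handleOrder(context.Background(), "1234")
}
```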

Logs are cheap, easy, reliable, and don't lock me into an expensive service to start. I mean, maybe you end up integrating Splunk or perhaps self-hosting Kibana, but you can get 90% of the benefits just by dumping the logs into CloudWatch or even S3 at a much lower price.
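For example (a sketch, not a prescription - the log shipping itself is assumed to be handled by the platform, e.g. a Docker log driver or a Fluent Bit sidecar, rather than by the application), the application side can be as boring as JSON lines on stdout with Go's standard library:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON lines to stdout; whatever runs the process ships them to
	// CloudWatch or S3. No vendor SDK, no lock-in in the application code.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	logger.Info("order placed",
		slog.String("order_id", "1234"),
		slog.Int("items", 3),
	)
}
```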

FWIW, part of the reason you're seeing that is that, at least traditionally, APM companies rebranding as observability companies stuffed trace data into metrics data stores, which become prohibitively expensive to query by custom tags/attributes/fields. Newer tools/companies take a different approach that makes cost far more predictable and generally lower.

Luckily, some of the larger incumbents are also moving away from this model, especially as OpenTelemetry makes tracing more widespread as a baseline of sorts for the data. And you can bet they're hearing about it from their customers right now, and they want to keep those customers.

Cost is still a concern, but it's being addressed as well. Right now every vendor has a different approach (e.g., the one I work for has a robust sampling proxy you can use), but that too is heading toward standardization. OTel is defining how to propagate sampling metadata in signals so that downstream tools can use that metadata about population representativeness to show accurate counts, rates, and so on.
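For head sampling specifically, a minimal sketch with the OTel Go SDK looks like the following (the 10% ratio is an arbitrary example; vendor-side sampling proxies or tail samplers sit elsewhere in the pipeline):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Keep ~10% of root traces; children follow the parent's decision, so a
	// trace is either fully kept or fully dropped. The known sampling rate is
	// what lets downstream tools scale counts back up to population estimates.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))),
	)
	defer func() { _ = tp.Shutdown(context.Background()) }()
	otel.SetTracerProvider(tp)

	_, span := otel.Tracer("example").Start(context.Background(), "work")
	defer span.End()
}
```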

> Newer tools/companies have a different approach that makes cost far more predictable and generally lower.

What newer tools/companies are in this category? Any that you recommend?

I think we fit in that bucket [1] - open source, self-hostable, based on OpenTelemetry, and backed by ClickHouse (columnar, not time-series).

ClickHouse gives users much greater flexibility in tradeoffs than either a time-series or inverted-index-based store could offer (along with S3 support). There's nothing like a system that can balance high performance AND (usable) high cardinality.

[1] https://github.com/hyperdxio/hyperdx

disclaimer (in case anyone just skimmed): I'm one of the authors of HyperDX