We have all our staging systems report to DD and Sentry, tagged with an `env:stage/prod/whatever` tag to slice the metrics by. This helps maintain parity for alerting too (although PagerDuty is not enabled for stage). It's sometimes tough to get right, because the lower volume of traffic on stage makes the alert metric queries very noisy. For example, alerts that fire on error rates may not resolve if no new successful requests come in for a few hours on stage.
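
To make the low-traffic problem concrete, here is a rough sketch of one way to guard an error-rate alert with a minimum request count. It is not DD's or Sentry's API. `Window`, `should_fire`, and the default thresholds are made-up names and numbers for illustration, and treating a too-quiet window as "not firing" is an assumption you may not want for every alert.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Window:
    """Request counts seen in one alert evaluation window (hypothetical shape)."""
    errors: int
    total: int


def error_rate(window: Window, min_requests: int) -> Optional[float]:
    """Return errors/total, or None when there is too little traffic to judge."""
    if window.total < min_requests:
        return None
    return window.errors / window.total


def should_fire(window: Window, threshold: float = 0.05, min_requests: int = 50) -> bool:
    """Fire only when the window has enough traffic AND the rate is over threshold.

    A quiet window counts as "no data" and resolves the alert, so a burst of
    errors followed by hours of silence does not leave the alert stuck.
    """
    rate = error_rate(window, min_requests)
    if rate is None:
        return False
    return rate > threshold
```

With these defaults, 10 errors out of 100 requests fires, while 2 errors out of 3 requests does not, because three requests is too small a sample to say anything about the error rate.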

How do you test the staging environment: is it a replay of prod data, or load generated by scripts? How much extra cost does it incur on DD and Sentry (as a % of prod monitoring cost)?

This is the hard part... I would say my current company does a poor job, and every other company I've worked for has also done a poor job. They've treated it like an internal playground for devs to validate that their features work, not a representative copy of prod with all the scaling problems and user-data funkiness that come with it. Here are some options I see:

- load a replica of your prod DB to stage daily/weekly and have all the same ETL jobs running

- set up load testing or user behavior regression tests that automatically go through critical pathways like user authentication and registration ("bare essentials" functionality, since writing these is tedious). This might be a good chance to use traffic capture to at least get started and make setting up these behavior tests easier

- if it's a consumer-facing product, have employees dogfood the product on stage

- if it's a product for businesses, run your business off the stage or a 3rd slightly more stable "internal" environment to create some consequences for not keeping it running smoothly.
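
For the behavior-test option above, a minimal sketch of what a "bare essentials" check could look like. The paths, expected status codes, and the `send` callable are all hypothetical. In practice `send` would wrap whatever HTTP client you use and point at your stage base URL; `fake_stage` is only a stand-in so the example runs without a network.

```python
from typing import Callable, Dict, List, Tuple

# (method, path, expected status). These paths are placeholders, not a real API.
CRITICAL_PATHS: List[Tuple[str, str, int]] = [
    ("POST", "/register", 201),
    ("POST", "/login", 200),
    ("GET", "/account", 200),
]

# A sender takes (method, path) and returns the HTTP status code it got back.
Send = Callable[[str, str], int]


def run_smoke_checks(send: Send,
                     paths: List[Tuple[str, str, int]] = CRITICAL_PATHS) -> Dict[str, str]:
    """Hit each critical path once and report "ok" or a short failure reason."""
    results: Dict[str, str] = {}
    for method, path, expected in paths:
        key = f"{method} {path}"
        try:
            status = send(method, path)
        except Exception as exc:  # a connection failure is also a failed check
            results[key] = f"error: {exc}"
            continue
        results[key] = "ok" if status == expected else f"got {status}, expected {expected}"
    return results


def fake_stage(method: str, path: str) -> int:
    """Stand-in for stage where /account is broken and returns 500."""
    return {"/register": 201, "/login": 200}.get(path, 500)
```

Calling `run_smoke_checks(fake_stage)` marks registration and login as `ok` and reports `got 500, expected 200` for `GET /account`. Running something like this on a schedule also gives stage a trickle of steady traffic.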

In my experience, stage hasn't had representative load, so the extra billing is proportionally smaller (since you're paying for what you use in most cases). I don't know the billing specifics, but you could also consider dropping the log/metric retention window significantly on stage (say 1 month instead of 6 months) to save costs.

Ultimately I don't think you're going to get the same scaling problems to manifest on stage. It's more of a functionality testing ground IME.

I have 3 YOE as a dev, so don't base your whole business plan on my ideas.

Got it. Have you tried traffic replay tools like https://github.com/buger/goreplay?