Thoroughly surprised by the lack of difference in read latency between SATA SSD and NVMe SSD. I thought NVMe was significantly more performant than SATA. I guess the difference is all on the write end, so if you had some sort of mostly-read workload SATA would be fine.

It's IOPS, not throughput, that NVMe gets you. (Or rather, if you're not being bottlenecked by IOPS, then NVMe won't get you anything over a JBOD of SATA SSDs.)

SATA SSDs will serve you just fine if you're Netflix or YouTube, streaming a bunch of video files to people. High throughput, low IOPS — no advantage to NVMe.

But you'll get a huge boost from NVMe if you're doing a lot of little concurrent random reads from a much-larger-than-memory dataset. For example, if you're Twitter, fetching updates to users' timelines.

If you've got higher concurrency, you'll want several NVMe SSDs behind a hardware RAID controller, in RAID0.

And if you've got really high concurrency, you'll want several of these NVMe SSD RAID controllers, on separate PCI-e lanes, under software RAID0.

That's the point when you're starting to reach those "basically identical to memory" figures that are the use-case for Intel's Optane.

Thanks for this -- this is what I was missing. I was starting to suspect that NVMe is really more of a benefit for "hyperscaler"/IaaS providers... This lays it out well for me, the how included. I rent my hardware, so I don't have access to the hardware RAID stuff, but this really filled in some holes for me.

> If you've got higher concurrency, you'll want several NVMe SSDs behind a hardware RAID controller, in RAID0.

Hmm, why would you want RAID0 here? Faster speed but absolutely no redundancy? What about RAID1+0/0+1?

> That's the point when you're starting to reach those "basically identical to memory" figures that are the use-case for Intel's Optane.

I need to do more reading on this but thanks for pointing this out

> Hmm, why would you want RAID0 here? Faster speed but absolutely no redundancy? What about RAID1+0/0+1?

I can't speak to others†, but for our use-case, we're not using NVMe as "persistent storage but faster and smaller" but rather as "main memory but larger and slower".

Our data isn't canonically on our nodes' NVMe pools, any more than it's canonically in the nodes' local memory. The canonical representation of our (PGSQL data warehouse) data is its WAL segments, shipped to/from object storage.

The NVMe pool can be thought of as holding a "working" representation of the data — sort of like how, if you have a business layer that fetches a JSON document from a document store, the business-layer node's main memory then holds a deserialized working representation of the data.

You wouldn't back up/protect the data on your NVMe pool, any more than you'd back up the memory backing the deserialized JSON. If you wanted to modify the canonical data, you'd modify it in the canonical store. (For JSON, the document database; for our use-case, the WAL segments + base-backups in the object store.)

† I would note that "large ephemeral working-set backing store" seems to be largely considered by IaaS providers to be the main use-case for having NVMe attached to compute. The IaaS providers that support NVMe (AWS, GCP) only support it as instance or scratch (i.e. ephemeral) storage that doesn't survive node shutdown. These are strictly "L4 memory", not stores-of-record.

> I can't speak to others†, but for our use-case, we're not using NVMe as "persistent storage but faster and smaller" but rather as "main memory but larger and slower".

> Our data isn't canonically on our nodes' NVMe pools, any more than it's canonically in the nodes' local memory. The canonical representation of our (PGSQL data warehouse) data is its WAL segments, shipped to/from object storage.

I appreciate this insight -- one of the features that drew me to KeyDB[0] was its FLASH feature. With drives being as fast as they are these days, it feels like a tiered cache of DRAM + NVMe disk isn't so bad! Great to see others think this way as well.

Redis is one thing but treating pg this way is somewhat interesting -- Do you also have the usual hot standby setup with synchronous replication just in case of a machine going down before the latest WAL segment is shipped out?

> † I would note that "large ephemeral working-set backing store" seems to be largely considered by IaaS providers to be the main use-case for having NVMe attached to compute. The IaaS providers that support NVMe (AWS, GCP) only support it as instance or scratch (i.e. ephemeral) storage that doesn't survive node shutdown. These are strictly "L4 memory", not stores-of-record.

Reasonable, thanks for explicitly noting this. I think the write characteristics of NVMe (and technically the read characteristics too, as this article shows) make it attractive and don't preclude using it as the store-of-record medium, but what you said definitely squares with the other configurations I've seen and the offerings at AWS (I haven't really looked at GCP node types).

[0]: https://github.com/EQ-Alpha/KeyDB/wiki/Legacy-FLASH-Storage

> Do you also have the usual hot standby setup with synchronous replication just in case of a machine going down before the latest WAL segment is shipped out?

We don't have to — the data in our data warehouse is a rearrangement/normalization/collation of origin data from public sources. If our DB cluster dies, we stand up a new one from a backup, let it catch up from WAL segments, and then our ETL pipeline will automatically re-bind to it, discover the newest data it has available inside it, and begin ETLing to it. From our ETL ingestion-agent's perspective, there's no difference between "loading to a DB freshly restored from backup because it lost 10 minutes of updates" and "loading to a DB that is/was up-to-date because we just received some new origin data."
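To illustrate the idea (made-up table and column names, nothing like our actual schema), the ingestion side boils down to idempotent loads:

    -- Find the newest origin data this (possibly freshly-restored) DB already has...
    SELECT coalesce(max(observed_at), '-infinity') AS resume_from
    FROM observations;

    -- ...then (re)load everything from that point on; rows we already have are no-ops.
    -- (Assumes a unique constraint on (source_id, observed_at).)
    INSERT INTO observations (source_id, observed_at, payload)
    VALUES ($1, $2, $3)
    ON CONFLICT (source_id, observed_at) DO NOTHING;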

(Our situation of our primary-source data being easily and reliably re-fetched is pretty unique to our vertical; but you can put pretty much any system into the same shape by first writing your primary-source data to an event store / durable unbounded-size MQ [e.g. Kafka] as you receive it; and then ETLing it into your DB from there. Forget DB disaster recovery; just keep the MQ data safe, and you'll always be able to replay the process that translates an MQ event-stream into a DB state.)

We do have hot standbys, but they're async (using wal-g — master writes WAL segments to object store; replicas immediately fetch WAL segments from object store, as quickly as they can manage.) They're for query load scaling, not disaster recovery.
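On the Postgres side it amounts to roughly this (the wal-g bucket/credentials config lives in the environment, omitted here):

    -- Master: ship each completed WAL segment to the object store.
    ALTER SYSTEM SET archive_mode = 'on';
    ALTER SYSTEM SET archive_command = 'wal-g wal-push %p';

    -- Replicas (PG 12+): pull segments back down as fast as they can.
    ALTER SYSTEM SET restore_command = 'wal-g wal-fetch %f %p';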

Appreciate the detailed description of what works for your warehousing solution!

> (Our situation of our primary-source data being easily and reliably re-fetched is pretty unique to our vertical; but you can put pretty much any system into the same shape by first writing your primary-source data to an event store / durable unbounded-size MQ [e.g. Kafka] as you receive it; and then ETLing it into your DB from there. Forget DB disaster recovery; just keep the MQ data safe, and you'll always be able to replay the process that translates an MQ event-stream into a DB state.)

I'm familiar with this almost-event-sourcing model. AFAIK it's what banks and financial firms do (at scales I can't imagine, and with technology they'll never open source), and it's the only way to do a rock-solid, lose-nothing architecture that's still relatively easy to reason about and flexible.

I have no need to do this, but Debezium w/ NATS + JetStream/Liftbridge would be my pick, since I'm a sucker for slightly newer tooling that boasts of being "cloud native" (i.e. easy for me to administer). AFAIK the boring choice would be Debezium + Kafka with some RAIDx drives below it all, just in case.

> We do have hot standbys, but they're async (using wal-g — master writes WAL segments to object store; replicas immediately fetch WAL segments from object store, as quickly as they can manage.) They're for query load scaling, not disaster recovery.

Read-only replicas for load scaling are pretty common; I was surprised to hear wal-g, though -- pg has an absolutely dizzying array of replication options[0], and I don't think I landed on wal-g as the "best" choice for me (with a heavily application-centric viewpoint), so this gives me reason for some pause. I think I mostly had it down to Barman or pgBackRest. The object-store-capable options always caught my eye (and were what put pgBackRest over the top for me, IIRC); I'll go back and revisit. Does Covalent maintain an engineering blog? Because it sounds like y'all are doing stuff I'd love to read about and cargo cult^H^H critically consider.

Weirdly enough, I've just been down a bit of a rabbit hole in the data engineering world and have been thoroughly impressed by dbt[1]. The concept is simple, but combined with a powerful database like pg it holds a lot of promise in my mind. I'm mostly excited by the idea of exposing the power of Postgres and its many extensions and features to the SQL-first view of the world that dbt encourages, and building some sort of "data platform" that is nothing more than enshrining and exposing the best ways to do things with that stack.

Is dbt or a similar tool a part of your stack in any way? Listening to people who sort of... distribute hype around data engineering, you'd think Snowflake is the greatest thing since sliced bread, but the pattern of Debezium + dbt + a big ol' well-configured and replicated Postgres cluster feels like more than enough. I'm aware, of course, that Snowflake is really just a management layer (essentially, they're a Redshift-aaS provider for any cloud), but it feels like that gets lost in discussion. Sounds like you all have a well-considered data pipeline (so the Snowflake part is done for you), so I'm wondering what the other links in the chain look like. Any Metabase or Superset at the end of the pipeline?

[EDIT] Want to note the recent presentation from CERN on the topic of replication that most recently influenced me[2] (skip to the last 2 slides if you want the TL;DR)

[0]: https://wiki.postgresql.org/wiki/Binary_Replication_Tools

[1]: https://docs.getdbt.com

[2]: https://www.slideshare.net/AnastasiaLubennikova/advanced-bac...

pgBackRest is probably strictly better than wal-g for our use-case, but I already knew wal-g was at least satisfactory in solving our problems (from previous usage in implementing something Heroku-Dataclips-alike before), so I threw it on and filed an issue for doing a compare-and-contrast down the line. So far it hasn't caused too many issues, and does indeed solve our main issue: allowing truly-async replication (e.g. replication to DCs in other countries, where the link has a huge RTT, high potential packet loss, and might even go down for hours at a time due to, say, politics — without that causing any kind of master-side buffering or slowdown.)

> Does Covalent maintain an engineering blog?

Not yet, but we've certainly been considering it. Probably something we'll do when we have a bit more breathing room (i.e. when we get out from under all the scaling we're doing right now.)

> Is dbt or a similar tool a part of your stack in any way?

Nothing like that per se (though it'd very much be a data-layer match for how we're doing infra management using k8s "migrations" through GitOps. Imagine SQL schema-definition k8s CRDs.)

But that's mostly because — to our constant surprise — our PG cluster's query plans improve the more we normalize our data. We hardly build matviews or other kinds of denormalized data representations at all (and when we do, we almost always eventually figure out a partial computed-expression GIN index or something of the sort we can apply instead.)
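For a flavor of what I mean (hypothetical table/columns, not our real schema): rather than maintaining a denormalized matview of, say, active rows by tag, we'd index the computed expression directly, and only for the rows that matter:

    -- Partial, computed-expression GIN index; "payload" here is a jsonb column.
    CREATE INDEX events_active_tags_gin
        ON events
        USING gin ((payload -> 'tags'))
        WHERE status = 'active';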

We are planning to offer our users something like dbt, and we may or may not use dbt under the covers to manage it. (Analogous to how GitHub created their own git library to use in their Rails backend.)

One thing we do want to enable is avoiding the recomputation of intermediate "steps" within long chained CTE queries (imagine: taking 10 seconds to compute A, and then using it in a join to quickly select some Bs; then, in another query, taking another 10 seconds to compute A again, because now you need to select Cs instead), without keeping around any sort of persistent explicitly-materialized intermediate, or writing ugly procedural logic. https://materialize.com/ seems promising for that. (Just wish it was a library I could shove inside PG, rather than an external system that is apparently going to grow to having its own storage engine et al.)
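Contrived example of the problem shape, with made-up tables:

    -- Query 1: pays ~10s to compute "a"...
    WITH a AS (
        SELECT key, sum(amount) AS total    -- the expensive part
        FROM big_fact_table
        GROUP BY key
    )
    SELECT b.*
    FROM bs b
    JOIN a USING (key);

    -- Query 2: pays the same ~10s all over again, just to select Cs instead.
    WITH a AS (
        SELECT key, sum(amount) AS total
        FROM big_fact_table
        GROUP BY key
    )
    SELECT c.*
    FROM cs c
    JOIN a USING (key);

What we want is for "a" to stay cheap to reuse across statements, without explicitly materializing and refreshing it ourselves.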

> Any Metabase or Superset at the end of the pipeline?

We've always mentioned to people that Tableau has first-class support for connecting to/querying Postgres. Given the type of data we deal with, that's usually all they need to know. :)

> pgBackRest is probably strictly better than wal-g for our use-case, but I already knew wal-g was at least satisfactory in solving our problems (from previous usage in implementing something Heroku-Dataclips-alike before), so I threw it on and filed an issue for doing a compare-and-contrast down the line. So far it hasn't caused too many issues, and does indeed solve our main issue: allowing truly-async replication (e.g. replication to DCs in other countries, where the link has a huge RTT, high potential packet loss, and might even go down for hours at a time due to, say, politics — without that causing any kind of master-side buffering or slowdown.)

I always thought Dataclips was just them running on ZFS, but honestly this would make more sense. I don't use Heroku much, so I never saw any limitations that jumped out at me as "oh, they have that limitation because they're doing pgbackrest," though.

> Not yet, but we've certainly been considering it. Probably something we'll do when we have a bit more breathing room (i.e. when we get out from under all the scaling we're doing right now.)

Yup, sounds reasonable. I'm certainly not a customer (nor really in a position to be one, as I don't do any blockchain), so satisfying me brings you almost negative ROI. Scaling sounds fun in and of itself (when it isn't terrifying) -- thanks for sharing so much about the setup!

> Nothing like that per se (though it'd very much be a data-layer match for how we're doing infra management using k8s "migrations" through GitOps. Imagine SQL schema-definition k8s CRDs.)

Now this is fascinating -- I'm also very much sold on k8s, and SQL-schemas-as-CRDs sounds like it'd be relatively easy to implement and very high ROI. Migrations have always been a sort of weird infrastructure-mixed-with-dev thing, and being able to tie the CRDs for migrations to the StatefulSet running the DB and the Deployment running the application seems like it could open up a whole new world of more intelligent deployment. Cool stuff.

> But that's mostly because — to our constant surprise — our PG cluster's query plans improve the more we normalize our data. We hardly build matviews or other kinds of denormalized data representations at all (and when we do, we almost always eventually figure out a partial computed-expression GIN index or something of the sort we can apply instead.)

I don't know how people use open source DBs that are not postgres. I mean I do, but I can't wait until zheap removes the most common reason for people choosing MySQL to begin with.

> We are planning to offer our users something like dbt, and we may or may not use dbt under the covers to manage it. (Analogous to how GitHub created their own git library to use in their Rails backend.)

dbt does look pretty good; it seems like just enough structure. I also want to bring up ddbt[0] in case y'all do a research/exploratory sprint on it.

> One thing we do want to enable is avoiding the recomputation of intermediate "steps" within long chained CTE queries (imagine: taking 10 seconds to compute A, and then using it in a join to quickly select some Bs; then, in another query, taking another 10 seconds to compute A again, because now you need to select Cs instead), without keeping around any sort of persistent explicitly-materialized intermediate, or writing ugly procedural logic. https://materialize.com/ seems promising for that. (Just wish it was a library I could shove inside PG, rather than an external system that is apparently going to grow to having its own storage engine et al.)

Interesting -- would Incremental View Maintenance[2] work for this? Don't know if this is useful to you, but the dbt conference I got sucked into watching most of had a talk with the Materialize CEO and some other people. I don't remember it being particularly high signal (I mean, let's be real, it's basically a three-way ad for the CEOs involved), but it might be worth a watch[1].
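My (untested) understanding of the pgsql-ivm syntax is roughly this (made-up table names):

    -- pgsql-ivm patch syntax, as I understand it.
    CREATE INCREMENTAL MATERIALIZED VIEW a AS
        SELECT key, sum(amount) AS total
        FROM big_fact_table
        GROUP BY key;
    -- After this, inserts/updates/deletes on big_fact_table keep "a" current
    -- without a manual REFRESH MATERIALIZED VIEW.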

> We've always mentioned to people that Tableau has first-class support for connecting to/querying Postgres. Given the type of data we deal with, that's usually all they need to know. :)

I really underestimate how much Tableau is the standard (probably because I'm not a data person) -- between that, Looker, and Power BI, I'm not sure people use much else, even if the alternative is open source/free.

[0]: https://github.com/monzo/ddbt

[1]: https://www.youtube.com/watch?v=0DDNjB4O0As

[2]: https://github.com/sraoss/pgsql-ivm