Question for anyone who knows a lot about Postgres connection pooling:

We want to move to using a connection pooling infrastructure component, but so far, we haven't been able to figure out how to do so given our setup.

We've got a wacky data model where we have N Postgres schemas, each of which has the same tables (but with different data, obviously.) They're not tenants; more like separate "datasets." Picture something like: each schema is a digraph, with tables like "vertices", "edges", "edge_labels", etc. (Not what we're doing, but it's close.)

In theory, we could solve our problem by just sticking all the "vertices" tables together as list-partitions of one mega "vertices" table keyed by dataset_id; but 1. it's kind of incoherent — the data makes no sense when combined together like that, it's only valid when considered in isolation; and 2. it's very operationally convenient for us to manage datasets as separate schemas, re: security, naming, onlining/offlining data, etc.
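
In sketch form, that alternative would look something like this (hypothetical DDL, reusing the digraph example from above):

    -- all datasets' vertices in one list-partitioned table, keyed by dataset_id
    CREATE TABLE vertices (
        dataset_id  integer NOT NULL,
        vertex_id   bigint  NOT NULL,
        properties  jsonb,
        PRIMARY KEY (dataset_id, vertex_id)
    ) PARTITION BY LIST (dataset_id);

    -- one partition per dataset, instead of one schema per dataset
    CREATE TABLE vertices_dataset_42 PARTITION OF vertices FOR VALUES IN (42);

    -- queries could then parameterize the dataset with an ordinary bind variable
    SELECT * FROM vertices WHERE dataset_id = $1;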

Also, while each user request is only going to interact with a single dataset (schema), a given user might care about a lot of datasets, and make requests about many of them at arbitrary times—so, as far as I can tell, we don't have the sort of stable load per dataset that would allow us to hold separate connection pools per dataset.

And, since there's no way to make a schema name into a bind variable, fully qualifying our queries means generating dynamic SQL strings, which is expensive both for the business layer and for the SQL query planner, which can't recognize and reuse the plans it has already built for the same query against other schemas.

Right now, the way we query the data is that, in our business layer, before each query, we have a middleware that injects a statement like this:

    SET search_path TO 'name_of_dataset', 'public';
We then run a non-fully-schema-qualified query, and it finds whichever tables have been made visible to it.

Our queries currently don't run in transactions (i.e. we're using JDBC auto-commit on the client side), because they're by and large just single (complex) SELECT statements. So the middleware-generated statement above (for now) runs as a separate statement, in a separate single-statement transaction, on the same connection as the query itself. Per-statement connection pooling would therefore break everything, as our queries could get routed to backend connections "primed" with the wrong search_path.
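
In other words, each request currently amounts to two separately auto-committed statements that only make sense together on one connection:

    -- statement 1 (auto-committed): primes session state on this connection
    SET search_path TO 'name_of_dataset', 'public';

    -- statement 2 (auto-committed): resolves unqualified table names via the
    -- search_path above, so it's only correct if it lands on the same backend
    -- connection as statement 1 (column names here are made up)
    SELECT v.*
    FROM vertices v
    JOIN edges e ON e.src_vertex_id = v.vertex_id;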

And I get the feeling that transaction-level connection pooling won't work for us either (at least for now), because our whole operational problem currently is that our HTTP requests acquire DB connections and then sit on them while other things are processed, depleting the connection pool. If we turned off JDBC auto-commit, our search_path-injecting middleware (which injects its statement on connection-pool checkout) would just end up starting a DB transaction at the beginning of each of those HTTP requests; the business layer would then sit around with an idle-in-transaction DB connection, mapping directly to an open pgBouncer backend connection, completely destroying any wins connection pooling would gain us. We'd be right back where we started.

So, in other words, we want/need per-statement pooling; but we need it to allow us to also specify the search path per statement.

I've seen that at least pgBouncer has an explicit WONTFIX for this requirement, since in their minds it conflicts with their "can't tell it apart from a regular Postgres session" session-state semantics.

Should we bite the bullet and move to putting everything in one schema + list-partitioning + an explicit dataset_id column, so that we can parameterize out the dataset per-query using a bind variable? Or something else?

We are building a solution for this problem at Splitgraph [0] – it sounds like we could probably help with your use case, assuming this is for analytical (OLAP) data. You can get it to work yourself with our open source code [1], but our (private beta, upcoming public) SaaS will put all your schemas on a more scalable “data delivery network,” which incidentally, happens to be implemented with PgBouncer + rewriting + ephemeral instances. We also have private repositories / fine-grained ACL shipping shortly.

In a local engine (just a Postgres DB managed by the Splitgraph client to add extra stuff), there is no PgBouncer or batteries-included authN/Z, but we use Foreign Data Wrappers to accomplish the same query resolution. Our goal is for you to be able to do everything locally as an individual, with the SaaS becoming useful for “multiplayer” (teams and orgs).

On Splitgraph, every dataset – and every version of every dataset – has an address. Think of it like tagged Docker images. The address either points to an immutable “data image” (in which case we can optionally download objects required to resolve a query on-the-fly, although loading up-front is possible too) or to a live data source (in which case we proxy directly to it via FDW translation). This simple idea of _addressable data products_ goes a long way – for example, it means that computing a diff is now as simple as joining across two tables (one with the previous version, one with the new).
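
As a rough illustration of the diff-as-a-join point (schema and column names here are hypothetical, standing in for two checked-out versions of one dataset):

    -- rows present in the new version of the dataset but not in the previous one
    SELECT v2.*
    FROM dataset_v2.vertices AS v2
    LEFT JOIN dataset_v1.vertices AS v1 USING (vertex_id)
    WHERE v1.vertex_id IS NULL;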

Please excuse the Frankenstein marketing site – we’re in the midst of a redesign / rework of the info architecture while we build out our SaaS product.

Feel free to reach out if you’ve got questions. And if you have a business case, we have spots available in our private pilot. My email is in my profile – mention HN :)

[0] https://www.splitgraph.com/connect

[1] examples: https://github.com/splitgraph/splitgraph/tree/master/example...

We're already sharding our datasets onto separate PG nodes using FDWs (mostly due to us running PG on GCP, and GCE VMs having inherent vertical scaling limitations on node-local storage, which we rely 100% on for our workloads.)

Also, our "datasets" are all live data. They aren't modified by the users querying our API, but they are constantly appended to (and I mean constantly, i.e. every few milliseconds.)

For us, PG is in our stack at this point because of its hybrid OLTP/OLAP nature: it can do realtime, row-at-a-time ingestion of data without degradation, like an OLTP store / time-series DB; but it can then perform intensive OLAP workloads against that up-to-the-moment data, involving joins, CTEs, partial computed-expression indices, etc.

(The use-case for this sort of mixed workload? Think "realtime monitoring/alerting on custom-per-user financial Business-Intelligence queries from a common financial transaction stream." We can't pre-denormalize the data, because each client wants something different. Instead, we need a normalized representation with tons of indices that serves all potential queries equally well, in roughly real time.)
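
By "partial computed-expression indices" above, I mean roughly this kind of thing (table and column names are invented):

    -- an index on a computed expression, partial over just the rows a given
    -- class of user queries actually touches
    CREATE INDEX txns_settled_usd_idx
        ON transactions (account_id, (amount * fx_rate_usd))
        WHERE status = 'settled';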

For non-realtime analysis of "data at rest" (i.e. data that can be ingested in at-most-hourly batches), we can just use Snowflake. We already do, for logs and other things.

To be honest, our fondest dream would be to have a two-tiered DB setup:

1. a cold layer, where we get the SQL semantics of Postgres's querying engine, but where the storage engine is similar to those of scale-out DW services like Snowflake, with elastic per-workload compute warehouse nodes fetching compressed-columnar data from object storage into local per-worker caches (it sounds like this is similar to what you're building?)

2. a hot layer, using traditional Postgres storage, which serves as a sort of writeback cache for the cold layer:

• where all INSERTs hit the hot layer and then get async-batched to the cold layer;

• where the hot layer keeps its own indices to enable fancy OLAP querying of the hot data;

• where those queries see the cold data as "part of" the hot data, in a UNION ALL sense (probably through a FDW);

• where those queries will constraint-exclude the cold data (and thus trigger no workload on the cold layer) if the cold data isn't relevant to resolving the query — like PG 11's partition pruning.

AFAIK, you can't currently use a foreign table as a partition of a local parent table, so that'd be the first thing needing to be solved there. The next problem after that would be resolving the need for an AccessExclusiveLock to modify the range constraint on each of the partitions, since 1. the split-point between "historical" and "timely" data would be sliding forward every hour-or-so, but 2. the table would be just saturated with long-running SELECTs, so much so that it would never get a moment to lock itself to make a modification like that.

(Really, the magic wand there would be to allow partitions to have multiple parents and for "a partition" — the metadata that contains the list/range/hash constraint — to be a separate DB object from the DB table it refers to, where multiple "partitions" can reference one table-with-the-partition's-data-in-it, as long as the data meets the union of all their constraints. With those abstractions, you could build a new parent table that also references the old child tables through new partition metadata, and then just do an atomic swap of the parent tables' names when you're ready, where old queries would go on using the tables they had already dereferenced to their OIDs, and new queries would start using the new tables. Then queue up an effectively-async DROP TABLE on the old parent table, that would resolve when the DB runs out of old queries locking it open.)

The other founder of Splitgraph here! I've been experimenting recently with this, since we want users to be able to write to Splitgraph images efficiently without having to turn them into PG tables and then re-snapshot them.

My setup is similar to your cold-hot idea (diagram [1]), where the cold layer uses our "layered querying" FDW. A scan through it is basically a scan through a UNION of cstore_fdw files (columnar format) and we can use the object metadata to determine which objects to download/scan through. If the constraints don't match the object boundaries at all, the scan returns empty. This emulates PG partitions but more dynamically (you can change them without a full table lock).

This means that we can have a view on top of the "hot" and the "cold" blocks that will query both of them, but exclude the "cold" OLAP layer most of the time. We can also redirect writes to the view to hit the "hot" OLTP layer instead using `INSTEAD OF` triggers. In my PoC, I record inserts/updates/deletes in that table and then collapse them into one action per PK using a window function, but if the table is append-only, this should be much easier.
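
A stripped-down sketch of that shape (names are hypothetical, and the real "cold" relation is our layered-querying foreign table rather than a plain table):

    -- "hot": an ordinary PG table; "cold": the layered-querying foreign table
    CREATE VIEW measurements AS
        SELECT * FROM measurements_hot
        UNION ALL
        SELECT * FROM measurements_cold;

    -- redirect writes against the view into the hot layer
    CREATE FUNCTION measurements_redirect_insert() RETURNS trigger AS $$
    BEGIN
        INSERT INTO measurements_hot VALUES (NEW.*);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER measurements_insert
        INSTEAD OF INSERT ON measurements
        FOR EACH ROW EXECUTE FUNCTION measurements_redirect_insert();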

I'm not completely sure re: the locks at changeover/flush time. The idea is that we can create a new object (up to some bookmark) in one transaction, attach it to the Splitgraph table in the second transaction, then truncate the "hot" table up to the same bookmark in the final transaction. Duplicate rows can be eliminated by the view, so the final table will look consistent between transactions.

Re: "using a foreign table as a partition of a local parent table", I think it's actually possible, even though we don't use it here: there's a cool setup in [3] where they use it to shard a table onto multiple worker nodes (the coordinator node has the partitioned table, each partition points to a postgres_fdw shim on the same machine that then points to the actual physical partition on the worker).
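
For reference, the attach-a-foreign-table-as-a-partition part looks roughly like this (hypothetical names; user mapping omitted) — if I remember right, this works from PG 10, with insert tuple routing into foreign partitions arriving in PG 11:

    -- on the coordinator: a partitioned parent whose partitions are foreign tables
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER worker1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'worker1.internal', dbname 'events');
    -- (CREATE USER MAPPING omitted for brevity)

    CREATE TABLE events (
        shard_id integer NOT NULL,
        payload  jsonb
    ) PARTITION BY LIST (shard_id);

    CREATE FOREIGN TABLE events_shard_1
        PARTITION OF events FOR VALUES IN (1)
        SERVER worker1
        OPTIONS (table_name 'events_shard_1');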

We've had some ideas around using this for distributed querying: in our case, each node responsible for a given partition of a dataset would be able to download just the objects in that partition on the fly (through constraint pruning), so we wouldn't need to knowingly seed each worker with data. Interesting to think about, though at some point it feels like reinventing Presto/Snowflake/Spark.

Hope this helps!

[1] https://imgur.com/a/1p394PI

[2] https://www.splitgraph.com/docs/large-datasets/layered-query...

[3] https://swarm64.com/post/scaling-elastic-postgres-cluster/

> We've had some ideas around using this for distributed querying: in our case, each node responsible for a given partition of a dataset would be able to download just the objects in that partition on the fly (through constraint pruning), so we wouldn't need to knowingly seed each worker with data.

IMHO, if you're going to do this, I'd recommend doing it not in Postgres itself, but at the filesystem level. It's effectively just a tiered-storage read-through cache, and filesystems have those all figured out already.

You know how pgBackRest does "partial restore" (https://pgbackrest.org/user-guide.html#restore/option-db-inc...), by making all the heap files seem to be there, when actually the ones you don't need are sparse files ftruncate(2)'d to the right length to make PG happy? And how this works because PG only cares about DB objects it's not actively querying insofar as making sure they show up under readdir(3) with the expected metadata?

Well, an object-storage FUSE filesystem, e.g. https://github.com/kahing/goofys, would make PG just as happy, because PG could see all the right files as "being there" under readdir(3), even though the files aren't really "there", and PG would block on the first open(2) of each file while goofys fetched the actual object to back the file.

(IIRC PG might open(2) all its files once on startup, just to ensure it can; you can hack around this by modding the origin-object-storage filesystem library to not eagerly "push down" its open(2)s into object fetches — instead just returning a file descriptor connected to a lazy promise for the object — and then have read(2) and write(2) thunk that lazy promise, such that the first real IO done against the virtual file is what ends up blocking to fetch the object.)

So you could just make your Postgres base/ dir into an overlayfs mountpoint for:

• top layer: tmpfs (only necessary if you don't give temp tables their own tablespace)

• middle layer: https://github.com/kahing/catfs

• bottom layer: goofys mount of the shared heap-file origin-storage bucket

Note that catfs here does better than just "fetching objects and holding onto them" — it does LRU cache eviction of origin objects when your disk gets full!

(Of course, this setup doesn't allow writes to the tables held under it. So maybe don't make this your default tablespace, but instead a secondary tablespace that "closed" partitions live in, while "open" partitions live in a node-local tablespace, with something like pg_partman creating new hourly tables, and then pg_cron running a session to note down the old ones and do a VACUUM FREEZE ?; ALTER TABLE ? SET TABLESPACE ?; on them to shove them into the secondary tablespace — which will write-through the catfs cache, pushing them down into object storage.)
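
The hourly "close a partition" step itself is then just something like this (hypothetical partition/tablespace names; schedule it with pg_cron or whatever you like):

    -- one-time setup: a tablespace whose directory sits on the catfs/goofys mount
    CREATE TABLESPACE cold_object_storage LOCATION '/mnt/pg_cold_heap';

    -- per closed hourly partition: freeze it, then move it into that tablespace,
    -- which rewrites its heap files onto the mount and pushes them down to object storage
    VACUUM (FREEZE) measurements_2020_06_01_13;
    ALTER TABLE measurements_2020_06_01_13 SET TABLESPACE cold_object_storage;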