What does HackerNews think of duckdb_fdw?

DuckDB Foreign Data Wrapper for PostgreSQL

Language: PLpgSQL

#32 in PostgreSQL
Unofficially, maybe? There's a third-party duckdb_fdw extension for Postgres that allows querying DuckDB from Postgres: https://github.com/alitrack/duckdb_fdw
You can use the DuckDB FDW (linked in the blog post) for this! https://github.com/alitrack/duckdb_fdw
Interesting thought! I have not tried this yet, so I only have a guess as an answer. Could you export the data as SQL statements and then run those statements on DuckDB? That may be easier to set up, but may take longer to run...
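For what it's worth, here's a rough sketch of what that export-and-replay could look like (untested; the table name and columns are made up):

    -- On the Postgres side, dump the data as plain INSERT statements:
    --   pg_dump --data-only --inserts --table=events mydb > events.sql
    -- Then, in the DuckDB CLI, recreate the schema and replay the dump:
    CREATE TABLE events (id BIGINT, region VARCHAR, amount DOUBLE);
    .read events.sql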

DuckDB also has the ability to read Postgres data directly, and there is a Postgres FDW that can read from DuckDB!

https://github.com/duckdblabs/postgresscanner

https://github.com/alitrack/duckdb_fdw
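As a quick (hypothetical) illustration of the first direction, here's DuckDB reading a live Postgres table with the scanner extension; the connection string, schema, and table name are placeholders:

    INSTALL postgres_scanner;
    LOAD postgres_scanner;
    SELECT *
    FROM postgres_scan('host=localhost dbname=mydb', 'public', 'events')
    LIMIT 10;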

Another similar option is cstore_fdw [0] -- it's now part of Citus but can still be used standalone as a foreign data wrapper. We use it at my startup to do OLAP on Postgres. It has some advantages over parquet_fdw:

* Supports writes, so you can write directly to the foreign table by running INSERT INTO ... SELECT FROM ... (see the sketch after this list). Actually generating Parquet files was also difficult in my testing: I used odbc2parquet [1], but thanks for the ogr2ogr tip!

* Supports all PG datatypes (including types from extensions, like PostGIS)

* Performance was basically comparable in my limited testing (faster for a single-row SELECT with cstore_fdw in our case, since we do partition pruning; the same for a full-table scan-and-aggregation).
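For reference, a minimal cstore_fdw setup along the lines of its README (the table and columns here are made up):

    CREATE EXTENSION cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;
    CREATE FOREIGN TABLE events (
        id     bigint,
        region text,
        amount numeric
    ) SERVER cstore_server OPTIONS (compression 'pglz');
    -- Writes go straight into the columnar store:
    INSERT INTO events SELECT id, region, amount FROM staging_events;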

Re: performance overhead, with FDWs we have to munge the data back into PostgreSQL's internal row-oriented TupleSlot format. Postgres also doesn't run aggregations that can take advantage of the columnar format (e.g. CPU vectorization). Citus had some experimental code to get that working [2], but that was before FDWs supported aggregation pushdown. Nowadays it might be possible to have an FDW that hooks into the GROUP BY execution and runs a faster version of the aggregation that's optimized for columnar storage. We have a blog post series [3] about how we added agg pushdown support to Multicorn -- a similar idea.
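To illustrate what pushdown looks like in practice (foreign_events is a hypothetical postgres_fdw table; PG 11+ can push aggregates down):

    EXPLAIN VERBOSE
    SELECT region, count(*) FROM foreign_events GROUP BY region;
    -- Pushed down: a single Foreign Scan whose plan shows
    --   Relations: Aggregate on (public.foreign_events)
    -- Not pushed down: a local GroupAggregate on top of a full Foreign Scan.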

There's also DuckDB, which obliterates both of these options when it comes to performance. In my (again limited, not very scientific) benchmarking on a customer's 3M-row table [4] (278MB in cstore_fdw, 140MB in Parquet), I see a 10-20x speedup (1-2s -> 0.1-0.2s) on some basic aggregation queries when querying a Parquet file with DuckDB as opposed to using cstore_fdw/parquet_fdw.
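The DuckDB side of that comparison is just a query straight off the Parquet file (column names here are illustrative):

    SELECT region, sum(amount) AS total
    FROM read_parquet('events.parquet')
    GROUP BY region;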

I think the dream is being able to use DuckDB from within an FDW as an OLAP query engine for PostgreSQL. duckdb_fdw [5] exists, but it basically took sqlite_fdw and connected it to DuckDB's SQLite interface, which means that a lot of operations get lost in translation and aren't pushed down to DuckDB, so it's not much better than plain parquet_fdw. I had a complex query in the PG dialect, generated with dbt, that used joins, CTEs and window functions. I don't remember the exact timings, but it was even slower on duckdb_fdw than with cstore_fdw, whereas I could take the same query and run it on DuckDB verbatim, only replacing the foreign table name with the Parquet filename.
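For the curious, a hypothetical duckdb_fdw setup following the sqlite_fdw-style interface it inherits (the database path and schema names are placeholders):

    CREATE EXTENSION duckdb_fdw;
    CREATE SERVER duckdb_server FOREIGN DATA WRAPPER duckdb_fdw
        OPTIONS (database '/path/to/analytics.duckdb');
    IMPORT FOREIGN SCHEMA public FROM SERVER duckdb_server INTO public;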

This comment is already getting too long, but FDWs can indeed participate in partitions! There's this blog post [6] that I keep meaning to implement, where the setup is: a "coordinator" PG instance has a partitioned table where each partition is a postgres_fdw foreign table that proxies to a "data" PG instance. The "coordinator" node doesn't store any data and only gathers execution results from the "data" nodes. In the article, the "data" nodes store plain old PG tables, but I don't think there's anything preventing them from being parquet_fdw/cstore_fdw tables instead.
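A sketch of that coordinator setup (PG 11+ supports foreign tables as partitions; server and table names are placeholders):

    -- On the "coordinator": a partitioned table whose partitions are
    -- postgres_fdw foreign tables pointing at the "data" nodes
    CREATE TABLE measurements (ts timestamptz, value double precision)
        PARTITION BY RANGE (ts);

    CREATE FOREIGN TABLE measurements_2023
        PARTITION OF measurements
        FOR VALUES FROM ('2023-01-01') TO ('2024-01-01')
        SERVER data_node_1
        OPTIONS (schema_name 'public', table_name 'measurements_2023');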

[0] https://github.com/citusdata/cstore_fdw

[1] https://github.com/pacman82/odbc2parquet

[2] https://github.com/citusdata/postgres_vectorization_test

[3] https://www.splitgraph.com/blog/postgresql-fdw-aggregation-p...

[4] https://www.splitgraph.com/trase/supply-chains

[5] https://github.com/alitrack/duckdb_fdw

[6] https://swarm64.com/post/scaling-elastic-postgres-cluster/

After reading a few comments here, here's the experiment I'm going to run this weekend to couple Postgres (AWS RDS) and DuckDB:

There's already an FDW for DuckDB [1]! But it's not available from within AWS RDS because they only allow a few extensions. So I would launch an EC2 instance and run a Postgres instance on it, then install the DuckDB FDW on that Postgres instance. Then, on the AWS RDS Postgres instance, I would use the postgres_fdw extension to connect to the Postgres instance running on EC2 and indirectly gain access to its DuckDB FDW (sketched below, after the list). I'm not sure if this chained FDW strategy will work out. But here's why I think it's a great idea:

1. Any data that I am willing to put into DuckDB is guaranteed to have backups and/or an ETL already set up to sync data. Therefore I don't need this EC2 setup to be "reliable". If it goes down, I'll just rerun the ETL.

2. I need to pay a small DevOps cost to get some performance gain, but the analytics complexity remains unchanged because at the end of the day it's just Postgres. I get to continue to use dbt against the original AWS RDS.

The only thing I need to see is if there are any performance gains to be had using this setup. I won't take that for granted until it's tested.
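Here's the sketch of the RDS side, for anyone curious (host and credentials are placeholders):

    CREATE EXTENSION postgres_fdw;
    CREATE SERVER ec2_pg FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'ec2-xx-xx.compute.amazonaws.com', dbname 'analytics');
    CREATE USER MAPPING FOR CURRENT_USER SERVER ec2_pg
        OPTIONS (user 'analytics', password 'secret');
    -- Proxy whatever foreign tables the EC2 Postgres exposes via duckdb_fdw:
    CREATE SCHEMA duck;
    IMPORT FOREIGN SCHEMA public FROM SERVER ec2_pg INTO duck;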

[1] https://github.com/alitrack/duckdb_fdw

I should also add that there is a DuckDB FDW, so you could have DuckDB read from your Parquet files and do faster transformations before you pull your data into Postgres!

https://github.com/alitrack/duckdb_fdw
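For example, a hypothetical two-step pipeline (file paths and columns are made up):

    -- In DuckDB: pre-aggregate the raw Parquet and emit a compact CSV
    COPY (
        SELECT region, sum(amount) AS total
        FROM read_parquet('raw/*.parquet')
        GROUP BY region
    ) TO 'summary.csv' (HEADER);

    -- In Postgres: load the result (or use \copy from psql)
    COPY summary FROM '/path/to/summary.csv' CSV HEADER;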

Here is a DuckDB FDW for Postgres! I have not used it, but it sounds like what you need! https://github.com/alitrack/duckdb_fdw