What does HackerNews think of fugue?
A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark, Dask and Ray without any rewrites.
When the tests pass, we can change from DuckDB to Spark. This helps decouple testing Spark pipelines from the SparkSession and infrastructure, which saves a lot of compute resources during the iteration process.
This setup requires an abstraction layer to make the SQL execution agnostic to platforms and to make the data sources mockable. We use the open source Fugue layer to define the business logic once, and have it be compatible with DuckDB and Spark.
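The pattern described above can be sketched with the standard library alone. This is only an illustration of the idea, not Fugue's API: `sqlite3` stands in for the local test engine (DuckDB in the comment), and the `sqlite_runner`, `total_per_customer`, and `BUSINESS_LOGIC` names are hypothetical. The SQL is defined once, and the runner and the input table are both passed in, so a test can supply mock data and a lightweight engine.

```python
import sqlite3
from typing import Callable, Dict, List, Sequence, Tuple

# A table is (column names, rows). A runner executes SQL against named tables.
Table = Tuple[Sequence[str], List[tuple]]
SqlRunner = Callable[[str, Dict[str, Table]], List[tuple]]

# Business logic written once, independent of the engine that runs it.
BUSINESS_LOGIC = """
SELECT customer, SUM(amount) AS total
FROM orders
GROUP BY customer
ORDER BY customer
"""


def sqlite_runner(sql: str, tables: Dict[str, Table]) -> List[tuple]:
    """Hypothetical local runner: loads the tables into in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        for name, (columns, rows) in tables.items():
            conn.execute(f"CREATE TABLE {name} ({', '.join(columns)})")
            placeholders = ", ".join("?" for _ in columns)
            conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def total_per_customer(runner: SqlRunner, orders: Table) -> List[tuple]:
    """Run the shared SQL with whichever runner and data source are supplied."""
    return runner(BUSINESS_LOGIC, {"orders": orders})


# Unit test style usage: mock data plus the lightweight local runner.
mock_orders: Table = (["customer", "amount"], [("a", 10), ("b", 5), ("a", 7)])
result = total_per_customer(sqlite_runner, mock_orders)
print(result)
```

A production runner with the same signature could hand the SQL to a cluster engine; the query and the test would stay unchanged.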
It is also worth noting that support for warehouses like BigQuery and Snowflake is on the FugueSQL roadmap for the near future. Once that lands, you can unit test SQL logic locally and then bring it to BigQuery/Snowflake when ready.
For more information, see this talk from PyData NYC (the link starts at the SQL testing part): https://www.youtube.com/watch?v=yQHksEh1GCs&t=1766s
Fugue project repo: https://github.com/fugue-project/fugue/
Also:
I am one of the contributors of Fugue. Fugue is an open-source abstraction layer that ports Python/Pandas/SQL code to Spark or Dask. This article covers the programming interface and benefits Fugue provides, specifically:
* Handling inconsistent behavior between different compute frameworks (Pandas, Spark, and Dask)
* Allowing reusability of code across Pandas-sized and Spark-sized data
* Dramatically speeding up testing and iteration cycles
* Enabling new users to be productive with Spark much faster
* Providing a SQL interface capable of handling end-to-end workflows
There was a previous post on here about our SQL interface that lets you use SQL on top of Pandas, Spark and Dask. This post talks about the broader project. https://news.ycombinator.com/item?id=28830243
Our repo can be found here: https://github.com/fugue-project/fugue
Happy to answer any questions!
Fugue is a higher-level abstraction compared to Ray. It provides unified and non-invasive interfaces for people to use Spark, Dask and Pandas. Ray/Modin is also on our roadmap.
It provides both a Python interface (not pandas-like) and Fugue SQL (standard SQL plus extra features). The two are equivalent, so users can choose whichever they are most comfortable with as the semantic layer for distributed computing.
With Fugue, most of your logic will be in simple Python/SQL that is framework and scale agnostic. From the mindset to the code, Fugue minimizes your dependency on any specific computing frameworks including Fugue itself.
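As a rough sketch of what "simple Python that is framework and scale agnostic" can look like, the function below uses only built-in types and type annotations and imports no compute framework. The function name and sample data are hypothetical. Fugue's documented `transform` function is designed to accept functions written in this style, but that call is left out here so the snippet runs with nothing installed.

```python
from typing import Any, Dict, Iterable, List


def add_total(rows: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    """Plain Python logic: no Pandas, Spark, Dask, or Fugue imports."""
    for row in rows:
        # Copy the row and add a derived column.
        yield {**row, "total": row["qty"] * row["unit_price"]}


# Hypothetical sample data, processed locally with no framework at all.
sample = [
    {"item": "pen", "qty": 3, "unit_price": 2},
    {"item": "book", "qty": 1, "unit_price": 10},
]
local_result = list(add_total(sample))
print(local_result)
```

Because the function depends on nothing but the standard library, it can be unit tested as ordinary Python and is not tied to any one engine.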
Please let me know if you want to learn more. A link to our Slack is in the README of the Fugue repo.
Fugue repo: https://github.com/fugue-project/fugue
Tutorials: https://fugue-project.github.io/tutorials/