What does HackerNews think of fugue?

A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark, Dask and Ray without any rewrites.

Language: Python

#18 in SQL
Please integrate it with Fugue, a unified interface that lets you swap out execution engines such as pandas for Spark.

https://github.com/fugue-project/fugue

The hard part about testing SQL is decoupling it from infrastructure and big data sources. We use DuckDB as the engine and pandas DataFrames as mock data sources to unit test SQL. Python testing frameworks (or simple assert statements) can then be used to compare the outputs against expected results.

When the tests pass, we can change from DuckDB to Spark. This helps decouple testing Spark pipelines from the SparkSession and infrastructure, which saves a lot of compute resources during the iteration process.

This setup requires an abstraction layer that makes SQL execution platform-agnostic and the data sources mockable. We use the open-source Fugue layer to define the business logic once and run it on both DuckDB and Spark.
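To make the testing idea concrete, here is a minimal sketch of unit testing a SQL query against mock data in a local in-memory engine. It is an illustration under stated assumptions, not the setup described above: the standard library's `sqlite3` stands in for DuckDB so the example has no dependencies, and the `orders` table, `QUERY`, and `run_query` are hypothetical names.

```python
import sqlite3

# Business logic under test: a plain SQL string with no engine-specific setup.
# The table and column names are hypothetical.
QUERY = """
SELECT customer, SUM(amount) AS total
FROM orders
GROUP BY customer
ORDER BY customer
"""


def run_query(rows, query=QUERY):
    """Load mock rows into an in-memory table and run the query against it."""
    # sqlite3 is a stand-in for DuckDB here; no external infrastructure is needed.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


def test_totals_per_customer():
    # Small hand-written input replaces the real (large) data source.
    mock_orders = [("a", 1.0), ("a", 2.5), ("b", 4.0)]
    assert run_query(mock_orders) == [("a", 3.5), ("b", 4.0)]


test_totals_per_customer()
```

The query itself never mentions the engine, which is what lets the same logic move to a larger backend once the tests pass.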

It is also worth noting that FugueSQL plans to support warehouses like BigQuery and Snowflake in the near future as part of its roadmap. You would then be able to unit test SQL logic locally and bring it to BigQuery/Snowflake when ready.

For more information, see this talk from PyData NYC (the link starts at the SQL testing part): https://www.youtube.com/watch?v=yQHksEh1GCs&t=1766s

Fugue project repo: https://github.com/fugue-project/fugue/

Fugue is an interesting library in this space, though I haven't tried it.

https://github.com/fugue-project/fugue


All of the keynotes were great and satisfied totally different needs. I think they should have led with PyScript though.

Also:

- https://github.com/fugue-project/fugue

- https://github.com/pyodide/pyodide

Hi HN,

I am one of the contributors to Fugue. Fugue is an open-source abstraction layer that ports Python/Pandas/SQL code to Spark or Dask. This article covers the programming interface and benefits Fugue provides, specifically:

* Handling inconsistent behavior between different compute frameworks (Pandas, Spark, and Dask)
* Allowing reusability of code across Pandas-sized and Spark-sized data
* Dramatically speeding up testing and iteration cycles
* Enabling new users to be productive with Spark much faster
* Providing a SQL interface capable of handling end-to-end workflows

There was a previous post on here about our SQL interface that lets you use SQL on top of Pandas, Spark and Dask. This post talks about the broader project. https://news.ycombinator.com/item?id=28830243

Our repo can be found here: https://github.com/fugue-project/fugue

Happy to answer any questions!

Hey, I am the author of Fugue.

Fugue is a higher level abstraction compared to Ray. It provides unified and non-invasive interfaces for people to use Spark, Dask and Pandas. Ray/Modin is also on our roadmap.

It provides both a Python interface (not pandas-like) and Fugue SQL (standard SQL plus extra features). The two are equivalent, so users can choose whichever they are most comfortable with as the semantic layer for distributed computing.

With Fugue, most of your logic will be in simple Python/SQL that is framework- and scale-agnostic. From the mindset to the code, Fugue minimizes your dependency on any specific computing framework, including Fugue itself.
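As a rough illustration of what "framework- and scale-agnostic" logic can look like, here is a sketch of a transformation written with only native Python types, so it can be unit tested with no engine at all. The function, column names, and threshold are hypothetical. The `fugue.transform` call mentioned in the comment is my assumption about how such a function would be handed to Fugue, so check it against the tutorials before relying on it.

```python
from typing import Any, Dict, List


def add_size_label(
    rows: List[Dict[str, Any]], threshold: int = 10
) -> List[Dict[str, Any]]:
    """Tag each row as 'large' or 'small' based on its 'qty' value."""
    # No pandas, Spark, or Dask imports: the logic depends only on Python itself.
    return [
        {**row, "size": "large" if row["qty"] >= threshold else "small"}
        for row in rows
    ]


# The function can be tested directly on plain Python data.
sample = [{"id": 1, "qty": 3}, {"id": 2, "qty": 25}]
labeled = add_size_label(sample)
assert [r["size"] for r in labeled] == ["small", "large"]

# Assumption, not executed here: with Fugue installed, a call along the lines of
#   fugue.transform(df, add_size_label, schema="*, size:str", engine=...)
# is how the same function would be applied to a pandas, Spark, or Dask DataFrame.
```

Because the function carries no framework imports, swapping or dropping the execution layer later does not require rewriting it.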

Please let me know if you want to learn more. Our Slack link is in the README of the Fugue repo.

Fugue repo: https://github.com/fugue-project/fugue
Tutorials: https://fugue-project.github.io/tutorials/