What does HackerNews think of dremio-oss?

Dremio - the missing link in modern data

Language: Java

I have been using Dremio to query large volumes of CSV files: https://docs.dremio.com/software/data-sources/files-and-dire...

That said, having the data in a columnar format is much better for fast responses.

GitHub: https://github.com/dremio/dremio-oss

For my home projects I generate Parquet files (columnar, and very well suited to DW-like queries) with pyarrow, and use Dremio (direct SQL on the data lake): https://github.com/dremio/dremio-oss (https://www.dremio.com/on-prem/) to query them (on MinIO, S3, or just local disk), with Apache Superset for quick charts and dashboards.
Another SQL engine on the data lake that heavily uses Arrow is Dremio.

https://www.dremio.com/webinars/apache-arrow-calcite-parquet...

https://github.com/dremio/dremio-oss

If you have Parquet on S3, an engine like Dremio (or any engine based on Arrow) can give you some impressive performance. Some key OSS innovations in data analytics / data lakes:

Arrow - columnar in-memory format; Gandiva - LLVM-based execution kernel; Arrow Flight - wire protocol based on Arrow; Project Nessie - a Git-like workflow for data lakes

https://arrow.apache.org/

https://arrow.apache.org/docs/format/Flight.html

https://arrow.apache.org/blog/2018/12/05/gandiva-donation/

https://github.com/projectnessie/nessie

Please correct me if I have this wrong, but my vague understanding is that the data representation heart of Apache Drill lives on in the rather active Apache Arrow project.

https://stackoverflow.com/questions/53533506/what-is-the-dif...

https://github.com/apache/arrow/commit/e6905effbb9383afd2423...

And the platform/tools side of Drill now lives on as Dremio, which uses Apache Arrow.

https://github.com/dremio/dremio-oss

So the essence of Drill still lives on, but it split into half an Apache project and half a vendor-controlled and -supported product, and the root of that split is now orphaned.

I'm also a developer on Arrow (https://github.com/jacques-n/), similar to WesM. It is always rewarding (and also sometimes challenging) to hear how people understand or value something you're working on.

I think Dan's analysis is evaluating Arrow from one particular and fairly constrained perspective of "if using Arrow and Parquet for RDBMS purposes, should they exist separately". I'm glad that Dan comes to a supportive conclusion even with a pretty narrow set of criteria.

If you broaden the criteria to all the different reasons people are consuming/leveraging/contributing to Arrow, the case for its existence and use only becomes clearer. As someone who uses Arrow extensively in my own work and professionally (https://github.com/dremio/dremio-oss), I find many benefits, including two biggies: processing speed AND interoperability (two different apps can now share in-memory data without serialization/deserialization or a duplicate memory footprint). And best of all, the community is composed of collaborators trying to solve similar problems. When you combine all of these, Arrow is a no-brainer as an independent community, and it is developing quickly because of that (80+ contributors, many language bindings (6+), and more than 1300 GitHub stars in a short amount of time).