Have you seen TileDB? https://tiledb.com/data-types/dataframes My team is currently transitioning from HDF5 to TileDB for genomics data.
Similar to Parquet:
* TileDB is columnar and comes with many compression, checksum, and encryption filters
* TileDB is built in C++ with multi-threading and vectorization in mind
* TileDB integrates with Arrow, using zero-copy techniques
* TileDB has numerous optimized APIs (C, C++, C#, Python, R, Java, Go)
* TileDB pushes compute down to storage, similar to what Arrow does
Better than Parquet:
* TileDB is multi-dimensional, allowing rapid multi-column conditions
* TileDB builds versioning and time travel into the format (no need for Delta Lake, Iceberg, etc.)
* TileDB allows for lock-free parallel writes / parallel reads with ACID properties (no need for Delta Lake, Iceberg, etc.)
* TileDB can handle more than tables, for example n-dimensional dense arrays (e.g., for imaging, video, etc.)
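
To make the versioning and time-travel bullets a bit more concrete, here is a toy sketch in plain Python of the general idea as I understand it: every write lands as a new immutable, timestamped fragment, and a read overlays the fragments up to a chosen timestamp. `ToyArray`, `Fragment`, `write`, and `read` are made-up names for illustration only, not the TileDB API, and the real format handles far more than this (concurrent writers, consolidation, compression, and so on).

```python
# Toy model of timestamped, immutable write fragments.
# This is NOT the TileDB API; all names here are hypothetical.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fragment:
    timestamp: int  # logical time of the write
    cells: dict     # (row, col) -> value written in this batch


@dataclass
class ToyArray:
    fragments: list = field(default_factory=list)

    def write(self, timestamp, cells):
        # A write only appends a new fragment; earlier fragments are never modified.
        self.fragments.append(Fragment(timestamp, dict(cells)))

    def read(self, at=None):
        # Overlay fragments in timestamp order, ignoring any newer than `at`.
        view = {}
        for frag in sorted(self.fragments, key=lambda f: f.timestamp):
            if at is None or frag.timestamp <= at:
                view.update(frag.cells)
        return view


arr = ToyArray()
arr.write(1, {(0, 0): "A", (0, 1): "C"})
arr.write(2, {(0, 1): "G"})  # a later write overrides one cell

latest = arr.read()       # view after all writes
as_of_1 = arr.read(at=1)  # view as it was at timestamp 1
```

Because a write never touches existing fragments in this model, writers do not need to lock each other out, and reading "as of" an older timestamp is just a matter of skipping the newer fragments.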
Useful links:
* GitHub repo (https://github.com/TileDB-Inc/TileDB)
* TileDB Embedded overview (https://tiledb.com/products/tiledb-embedded/)
* Docs (https://docs.tiledb.com/)
* Webinar on why arrays work as a universal data model (https://tiledb.com/blog/why-arrays-as-a-universal-data-model)
Happy to hear everyone’s thoughts.