I'm curious whether there will be any appreciable performance gains here. FWIW, last I checked[0], Polars still smokes Pandas in basically every way.
I’ve moved entirely to Polars (essentially Pandas rewritten in Rust with new design decisions), with DuckDB as my SQL query engine. Since both are backed by Arrow, data moves between them with zero copies, and performance on large datasets is very fast (due to vectorization, not just parallelization).
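A minimal sketch of that zero-copy hand-off, assuming recent duckdb and polars versions (DuckDB's replacement scans let its SQL reference a Polars frame by its Python variable name; the data here is made up):

```python
import duckdb
import polars as pl

df = pl.DataFrame({"uid": ["a", "b", "a"], "amount": [3, 5, 7]})

# DuckDB scans the Polars frame via Arrow (no copy of the column data),
# and .pl() hands the result back as a Polars DataFrame.
totals = duckdb.sql(
    "SELECT uid, SUM(amount) AS total FROM df GROUP BY uid"
).pl()
```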
I keep Pandas around for quick plots and legacy code. I will always be grateful for Pandas because there truly was no good dataframe library during its time. It has enabled an entire generation of data scientists to do what they do and built a foundation — a foundation which Polars and DuckDB are now building on and have surpassed.
How well does Polars play with many of the other standard data tools in Python (scikit learn, etc.)? Do the performance gains carry over?
It works as well as Pandas (note that scikit-learn doesn't actually support Pandas; you have to cast a dataframe into a NumPy array first).
I generally work in Polars and DuckDB until the final step, when I cast the result into the data structure I need (a Pandas dataframe, Parquet, etc.).
All the expensive intermediate operations are taken care of in Polars and DuckDB.
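A minimal sketch of that pattern with recent Polars versions (the "events.parquet" file name is made up):

```python
import polars as pl

# All the expensive work runs in Polars' lazy engine; only the final,
# aggregated result is materialized and handed to Pandas.
out = (
    pl.scan_parquet("events.parquet")  # hypothetical input file
    .filter(pl.col("amount") > 0)
    .group_by("uid")
    .agg(pl.col("amount").sum().alias("total"))
    .collect()
)
pdf = out.to_pandas()  # pay the conversion cost once, at the end
```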
Also, a Polars dataframe, although it has somewhat different semantics (an expression API instead of an index, for one), behaves like a Pandas dataframe for the most part. I haven't had much trouble moving between it and Pandas.
You have not had to cast pandas DataFrames when using scikit-learn for many years now. Additionally, in recent versions there has been increasing support for returning DataFrames as well, at least from transformers, along with checking column names/order.
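For instance, since scikit-learn 1.2 transformers can be asked to return DataFrames directly (a minimal sketch):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# set_output(transform="pandas") makes the transformer return a
# DataFrame with the original column names instead of a bare ndarray.
scaler = StandardScaler().set_output(transform="pandas")
out = scaler.fit_transform(df)
print(type(out))  # <class 'pandas.core.frame.DataFrame'>
```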
Yes, but that support is still not complete. When you pass a Pandas dataframe into scikit-learn you are implicitly doing df.values, which loses all the dataframe metadata.
There is a library called sklearn-pandas, but it doesn't seem to be mainstream, and development appears to have stopped in 2022.
> What should be the API for working with pandas, pyarrow, and dataclasses and/or pydantic?
> Pandas 2.0 supports pyarrow for so many things now, and pydantic does data validation with a drop-in dataclasses.dataclass replacement at pydantic.dataclasses.dataclass.
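To illustrate the drop-in replacement the quote mentions (a sketch; `Row` is a made-up model):

```python
from pydantic.dataclasses import dataclass

# pydantic's drop-in for dataclasses.dataclass validates (and coerces)
# field values at construction time.
@dataclass
class Row:
    uid: int
    amount: float

row = Row(uid="3", amount="1.5")  # coerced to int/float
# Row(uid="abc", amount=1.0)      # would raise a ValidationError
```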
Model output may or may not converge to the same result given the enumeration ordering of Categorical (CSVW) columns, for example, since the integer codes a model sees depend on that ordering; so consistent round-trip (Linked Data) schema tool support would be essential.
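A small illustration of the ordering issue in plain Pandas:

```python
import pandas as pd

# The integer codes a model ultimately sees depend entirely on the
# order in which the categories are enumerated:
a = pd.Categorical(["low", "high"], categories=["low", "high"])
b = pd.Categorical(["low", "high"], categories=["high", "low"])
print(a.codes)  # [0 1]
print(b.codes)  # [1 0]
```

Same data, different codes, unless the schema pins the category order.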
CuML is scikit-learn API compatible and can use Dask for distributed and/or multi-GPU workloads. CuML is built on CuDF and CuPy; CuPy is a replacement for NumPy arrays on GPUs, with speedups of over 100x on some operations.
CuPy: https://github.com/cupy/cupy :
> CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python. CuPy acts as a drop-in replacement to run existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms.
> CuPy is an open-source array library for GPU-accelerated computing with Python. CuPy utilizes CUDA Toolkit libraries including cuBLAS, cuRAND, cuSOLVER, cuSPARSE, cuFFT, cuDNN and NCCL to make full use of the GPU architecture.
> The figure shows CuPy speedup over NumPy. Most operations perform well on a GPU using CuPy out of the box. CuPy speeds up some operations more than 100X. Read the original benchmark article Single-GPU CuPy Speedups on the RAPIDS AI Medium blog
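In practice the drop-in claim means existing NumPy code often only needs the import swapped (a minimal sketch, assuming a CUDA or ROCm GPU is available):

```python
import cupy as cp  # swap for "import numpy as np" in existing code

x = cp.random.rand(4096, 4096)  # array allocated on the GPU
y = x @ x.T                     # matrix multiply runs on the GPU (cuBLAS)
host = cp.asnumpy(y)            # explicit copy back to a host NumPy array
```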
CuDF: https://github.com/rapidsai/cudf
CuML: https://github.com/rapidsai/cuml :
> cuML is a suite of libraries that implement machine learning algorithms and mathematical primitives functions that share compatible APIs with other RAPIDS projects.
> cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn.
> For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook.
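Because of that API match, swapping the import is often most of the port (a sketch assuming a RAPIDS install and a supported GPU; the data here is made up):

```python
import cudf
from cuml.linear_model import LinearRegression  # mirrors sklearn's layout

X = cudf.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0], "x2": [0.0, 1.0, 0.0, 1.0]})
y = cudf.Series([2.0, 4.1, 5.9, 8.2])

# Same estimator API as sklearn.linear_model.LinearRegression,
# but fit/predict execute on the GPU against cuDF containers.
model = LinearRegression().fit(X, y)
preds = model.predict(X)
```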
FWICS there's now a ROCm build of CuPy, so although the name says CUDA (NVIDIA only), it also compiles for AMD. IDK whether there are plans to support Intel oneAPI, too.
Which of the non-Arrow parts of other DataFrame libraries, pandas-compatible or not, could be ported back to Pandas (and R)?