Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars

Florian Jetter, Patrick Hoefler

Tuesday 14:10 in B05-B06

Type: Talk

Track: pydata-data-handling-engineering

Dask is a library for distributed computing with Python that integrates tightly with pandas. Historically, Dask was the easiest choice to use (it's just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). The re-implementation of the DataFrame API addresses all of the pain points that users ran into. We will look at how Dask has become a lot faster, how it performs on benchmarks that it struggled with in the past, and how it compares to other tools like Spark, DuckDB and Polars.

Domain Expertise: Novice

Python Skill Level: Novice

Florian Jetter

Patrick Hoefler

Affiliation: Coiled

Patrick Hoefler is a member of the pandas core team and a Dask maintainer. He currently works at Coiled, where he focuses on Dask development and the integration of a logical query planning layer into Dask. He holds an MSc in Mathematics and is working towards an MSc in Software Engineering at the University of Oxford.
