# Dask — Parallel Computing and Scalable DataFrames for Python

> Dask is a flexible parallel computing library for Python that scales NumPy, pandas, and scikit-learn workflows from a laptop to a distributed cluster with minimal code changes.

## Quick Use

```bash
pip install "dask[complete]"
python -c "
import dask.dataframe as dd
df = dd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
print(df.groupby('species').sepal_length.mean().compute())
"
```

## Introduction

Dask extends the familiar NumPy and pandas APIs to datasets that do not fit in memory. It builds task graphs that execute in parallel on a single machine or across a cluster, letting data scientists scale their workflows without rewriting them in a new framework.

## What Dask Does

- Provides `dask.dataframe`, a partitioned pandas-compatible DataFrame for larger-than-memory data
- Offers `dask.array` for blocked NumPy arrays with the same slicing and math API
- Includes `dask.bag` for parallel processing of unstructured or semi-structured data
- Runs a dynamic task scheduler that optimizes execution graphs and manages memory
- Scales from threads on a laptop to thousands of workers via Dask Distributed

## Architecture Overview

Dask operations lazily build a directed acyclic graph (DAG) of tasks. When `.compute()` is called, a scheduler executes the graph. The default threaded scheduler works well for workloads that release the GIL (NumPy, pandas, IO); the multiprocessing scheduler suits pure-Python, CPU-bound work. For distributed computing, `dask.distributed` provides a `Client` that connects to a `Scheduler` process coordinating multiple `Worker` processes. Workers exchange data over TCP and spill to disk when memory is tight.
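The lazy task-graph model described above can be sketched with `dask.delayed` (a minimal example, assuming only that `dask` is installed):

```python
import dask

# Each delayed call records a task in the graph instead of running it.
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Builds a small DAG: two independent inc tasks feeding one add task.
total = add(inc(1), inc(2))

# Nothing has executed yet; compute() hands the graph to a scheduler,
# which can run the two inc tasks in parallel.
print(total.compute())  # 5
```

The same pattern underlies `dask.dataframe` and `dask.array`: every operation extends the graph, and work only happens at `.compute()`.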
## Self-Hosting & Configuration

- Install via pip: `pip install "dask[complete]"` (includes distributed, diagnostics, and DataFrame extras)
- Single machine: just call `.compute()` — Dask uses threads or processes automatically
- Distributed cluster: start with the `dask scheduler` and `dask worker` commands, or use `LocalCluster()` for testing
- Deploy on Kubernetes with `dask-kubernetes`, on YARN with `dask-yarn`, or on HPC batch systems with `dask-jobqueue`
- Monitor with the built-in dashboard at `http://scheduler:8787/status`

## Key Features

- Familiar API: Dask DataFrames and arrays mirror pandas and NumPy, so there is almost no learning curve
- Lazy evaluation with automatic task-graph optimization (fusion, culling)
- Built-in real-time dashboard for monitoring task progress and worker memory
- Integrates with the PyData ecosystem: scikit-learn (via Dask-ML), XGBoost, and more
- Scales from a single laptop core to clusters with thousands of workers

## Comparison with Similar Tools

- **Apache Spark (PySpark)** — JVM-based distributed engine; Dask is pure Python with lighter overhead
- **Ray** — general-purpose distributed runtime; Dask focuses on data-parallel collections (DataFrames, arrays)
- **Polars** — fast single-machine DataFrame library; Dask adds distribution and larger-than-memory support
- **Vaex** — memory-mapped DataFrames for out-of-core analytics; Dask has broader compute primitives
- **Modin** — pandas acceleration layer; Dask offers more control over scheduling and deployment

## FAQ

**Q: When should I use Dask instead of pandas?**
A: When your data exceeds available RAM, when you want to parallelize pandas operations across cores, or when you need to distribute work to a cluster.

**Q: Can Dask run on a single machine?**
A: Yes. `LocalCluster()` or the default scheduler parallelizes work across cores without any external infrastructure.
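A minimal sketch of that single-machine setup, assuming the `distributed` extra is installed (`processes=False` is chosen here only to keep the example self-contained; omit it to get separate worker processes):

```python
import dask.array as da
from dask.distributed import Client

# Client() with no address starts a LocalCluster behind the scenes;
# processes=False keeps the workers in the current process, which is
# convenient for quick local experiments.
client = Client(processes=False)

# A blocked array: four 500x500 chunks scheduled in parallel.
x = da.random.random((1000, 1000), chunks=(500, 500))
result = x.mean().compute()
print(result)

client.close()
```

Once a `Client` exists, subsequent `.compute()` calls route through it, and its dashboard (by default on port 8787) shows task progress and worker memory in real time.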
**Q: How does Dask compare to Spark for Python workloads?**
A: Dask has lower overhead, uses native Python objects (pandas DataFrames), and requires no JVM. Spark is more mature for very large clusters and SQL-heavy workloads.

**Q: Does Dask support GPU computing?**
A: Yes. Dask integrates with RAPIDS cuDF for GPU-accelerated DataFrames and with Dask-CUDA for multi-GPU scheduling.

## Sources

- https://github.com/dask/dask
- https://docs.dask.org/