# Dask — Parallel Computing and Scalable DataFrames for Python

> Dask is a flexible parallel computing library for Python that scales NumPy, pandas, and scikit-learn workflows from a laptop to a distributed cluster with minimal code changes.

## Quick Use

```bash
pip install "dask[complete]"
python -c "
import dask.dataframe as dd
df = dd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
print(df.groupby('species').sepal_length.mean().compute())
"
```

## Introduction

Dask extends the familiar NumPy and pandas APIs to datasets that do not fit in memory. It builds task graphs that execute in parallel on a single machine or across a cluster, letting data scientists scale their workflows without rewriting them in a new framework.

## What Dask Does

- Provides `dask.dataframe`, a partitioned pandas-compatible DataFrame for larger-than-memory data
- Offers `dask.array` for blocked NumPy arrays with the same slicing and math API
- Includes `dask.bag` for parallel processing of unstructured or semi-structured data
- Runs a dynamic task scheduler that optimizes execution graphs and manages memory
- Scales from threads on a laptop to thousands of workers via Dask Distributed

## Architecture Overview

Dask operations lazily build a directed acyclic graph (DAG) of tasks. When `.compute()` is called, a scheduler executes the graph. The default threaded scheduler works well for workloads that release the GIL (NumPy, pandas, IO); the multiprocessing scheduler suits pure-Python, CPU-bound work. For distributed computing, `dask.distributed` provides a `Client` that connects to a `Scheduler` process coordinating multiple `Worker` processes. Workers exchange data over TCP and spill to disk when memory is tight.
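The lazy task-graph model described above can be sketched with `dask.delayed` (a minimal example, assuming only that `dask` is installed):

```python
import dask

# Each delayed call records a task in the graph instead of running it.
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Builds a small DAG: two independent inc tasks feeding one add task.
total = add(inc(1), inc(2))

# Nothing has executed yet; compute() hands the graph to a scheduler,
# which can run the two inc tasks in parallel.
print(total.compute())  # 5
```

The same pattern underlies `dask.dataframe` and `dask.array`: every operation extends the graph, and work only happens at `.compute()`.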
## Self-Hosting & Configuration

- Install via pip: `pip install "dask[complete]"` (includes distributed, diagnostics, and DataFrame extras)
- Single machine: just call `.compute()` — Dask uses threads or processes automatically
- Distributed cluster: start with the `dask scheduler` and `dask worker` commands, or use `LocalCluster()` for testing
- Deploy on Kubernetes with `dask-kubernetes`, on YARN with `dask-yarn`, or on HPC batch systems with `dask-jobqueue`
- Monitor with the built-in dashboard at `http://scheduler:8787/status`

## Key Features

- Familiar API: Dask DataFrames and arrays mirror pandas and NumPy, so there is almost no learning curve
- Lazy evaluation with automatic task-graph optimization (fusion, culling)
- Built-in real-time dashboard for monitoring task progress and worker memory
- Integrates with the PyData ecosystem: scikit-learn (via Dask-ML), XGBoost, and more
- Scales from a single laptop core to clusters with thousands of workers

## Comparison with Similar Tools

- **Apache Spark (PySpark)** — JVM-based distributed engine; Dask is pure Python with lighter overhead
- **Ray** — general-purpose distributed runtime; Dask focuses on data-parallel collections (DataFrames, arrays)
- **Polars** — fast single-machine DataFrame library; Dask adds distribution and larger-than-memory support
- **Vaex** — memory-mapped DataFrames for out-of-core analytics; Dask has broader compute primitives
- **Modin** — pandas acceleration layer; Dask offers more control over scheduling and deployment

## FAQ

**Q: When should I use Dask instead of pandas?**
A: When your data exceeds available RAM, when you want to parallelize pandas operations across cores, or when you need to distribute work to a cluster.

**Q: Can Dask run on a single machine?**
A: Yes. `LocalCluster()` or the default scheduler parallelizes work across cores without any external infrastructure.
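A minimal sketch of that single-machine setup, assuming the `distributed` extra is installed (`processes=False` is chosen here only to keep the example self-contained; omit it to get separate worker processes):

```python
import dask.array as da
from dask.distributed import Client

# Client() with no address starts a LocalCluster behind the scenes;
# processes=False keeps the workers in the current process, which is
# convenient for quick local experiments.
client = Client(processes=False)

# A blocked array: four 500x500 chunks scheduled in parallel.
x = da.random.random((1000, 1000), chunks=(500, 500))
result = x.mean().compute()
print(result)

client.close()
```

Once a `Client` exists, subsequent `.compute()` calls route through it, and its dashboard (by default on port 8787) shows task progress and worker memory in real time.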
**Q: How does Dask compare to Spark for Python workloads?**
A: Dask has lower overhead, uses native Python objects (pandas DataFrames), and requires no JVM. Spark is more mature for very large clusters and SQL-heavy workloads.

**Q: Does Dask support GPU computing?**
A: Yes. Dask integrates with RAPIDS cuDF for GPU-accelerated DataFrames and with Dask-CUDA for multi-GPU scheduling.

## Sources

- https://github.com/dask/dask
- https://docs.dask.org/