Introduction
cuDF is an open-source GPU DataFrame library that is part of the NVIDIA RAPIDS ecosystem. It provides a familiar pandas-like API while executing operations on NVIDIA GPUs, delivering dramatic speedups for data loading, transformation, and aggregation tasks common in data science and feature engineering workflows.
What cuDF Does
- Accelerates DataFrame operations (filter, join, groupby, sort) on NVIDIA GPUs
- Provides a pandas-compatible API so existing code runs with minimal changes
- Reads and writes Parquet, CSV, ORC, JSON, and Apache Arrow formats on GPU
- Offers a `cudf.pandas` accelerator that automatically dispatches operations to the GPU
- Integrates with Dask for multi-GPU and multi-node distributed processing
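The pandas-like API can be sketched as below. Since cuDF requires an NVIDIA GPU, this example falls back to plain pandas when cuDF is unavailable; the calls shown are identical in both libraries.

```python
# Sketch of cuDF's pandas-like API. cuDF needs an NVIDIA GPU, so this
# snippet degrades to pandas when cuDF is not installed; the DataFrame
# calls below are the same in both libraries.
try:
    import cudf as xdf  # GPU-backed DataFrames
except ImportError:
    import pandas as xdf  # CPU fallback with the same API

df = xdf.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Filter, then group and aggregate -- executed as CUDA kernels under cuDF.
filtered = df[df["value"] > 1]
totals = filtered.groupby("key").agg({"value": "sum"}).reset_index()

print(totals.sort_values("key"))
```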
Architecture Overview
cuDF stores columnar data in GPU memory using the Apache Arrow format. Operations are executed as CUDA kernels optimized for GPU parallelism. The library includes a JIT compiler that fuses custom UDFs into efficient GPU code. For multi-GPU workflows, cuDF integrates with Dask-cuDF to partition DataFrames across GPUs and coordinate shuffles. The cudf.pandas proxy layer intercepts pandas calls at runtime and routes supported operations to the GPU while falling back to pandas for unsupported ones.
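The intercept-and-fall-back behavior of the proxy layer can be illustrated with a toy dispatcher. This is purely illustrative of the idea, not cuDF's actual implementation; all class and method names here are invented.

```python
# Toy illustration of the cudf.pandas dispatch idea: try a GPU
# implementation first, fall back to the CPU one when the operation is
# unsupported. A simplification -- not cuDF's real proxy code.

class ToyProxy:
    def __init__(self, gpu_impl, cpu_impl):
        self._gpu = gpu_impl
        self._cpu = cpu_impl

    def __getattr__(self, name):
        def dispatch(*args, **kwargs):
            gpu_fn = getattr(self._gpu, name, None)
            if gpu_fn is not None:
                return gpu_fn(*args, **kwargs)  # supported: GPU path
            return getattr(self._cpu, name)(*args, **kwargs)  # fallback
        return dispatch


class GpuBackend:
    def add(self, a, b):
        return a + b  # stands in for a CUDA kernel


class CpuBackend:
    def add(self, a, b):
        return a + b

    def title_case(self, s):
        return s.title()  # "unsupported" on the toy GPU backend


proxy = ToyProxy(GpuBackend(), CpuBackend())
print(proxy.add(2, 3))           # dispatched to the GPU backend
print(proxy.title_case("cudf"))  # falls back to the CPU backend
```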
Self-Hosting & Configuration
- Install via pip: `pip install cudf-cu12` for CUDA 12, or use conda from the RAPIDS channel
- Requires an NVIDIA GPU with compute capability 7.0+ (Volta or newer)
- Use `cudf.pandas` as a drop-in accelerator: `python -m cudf.pandas script.py`
- Configure the RMM memory manager for custom GPU memory pool strategies
- Scale to multiple GPUs with Dask: `dask.dataframe.read_parquet()` using the cuDF backend
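The RMM pool configuration mentioned above can be sketched as follows. `rmm.reinitialize` is RMM's documented entry point; the guard lets the snippet run on machines without RAPIDS, and the 1 GiB pool size is just an example value.

```python
# Sketch: configure an RMM pool allocator so cuDF allocations come from a
# pre-reserved GPU memory pool instead of many small cudaMalloc calls.
# The broad except lets this degrade gracefully on machines without a GPU
# or without RMM installed; the pool size below is an arbitrary example.
try:
    import rmm

    rmm.reinitialize(
        pool_allocator=True,        # use a pooled allocator
        initial_pool_size=2 ** 30,  # reserve 1 GiB up front
    )
    pool_configured = True
except Exception:
    pool_configured = False  # no RMM / no GPU available here

print("RMM pool configured:", pool_configured)
```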
Key Features
- 10-100x speedup over pandas for large-scale data manipulation
- `cudf.pandas` provides zero-code-change GPU acceleration for existing scripts
- Native Parquet and ORC readers that decompress and decode directly on the GPU
- String processing, regex, and datetime operations fully GPU-accelerated
- Seamless interop with CuPy, cuML, and other RAPIDS libraries via `__cuda_array_interface__`
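The string and datetime acceleration can be exercised with ordinary pandas code, since the same script runs unchanged under `python -m cudf.pandas`. A small sketch (written against pandas so it also runs on CPU-only machines; the sample data is invented):

```python
import pandas as pd

# String and datetime operations like these are routed to GPU kernels
# when the same script is launched via `python -m cudf.pandas`.
df = pd.DataFrame({
    "email": ["alice@example.com", "bob@test.org", "carol@example.com"],
    "ts": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-15"]),
})

df["domain"] = df["email"].str.extract(r"@(.+)$")[0]   # regex extract
df["month"] = df["ts"].dt.month                        # datetime accessor
example_rows = df[df["domain"].str.contains("example")]

print(example_rows[["email", "month"]])
```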
Comparison with Similar Tools
- pandas — CPU-only; cuDF provides the same API with GPU acceleration
- Polars — Fast CPU DataFrame in Rust; cuDF leverages GPU parallelism for even larger speedups
- PySpark — Distributed CPU processing; cuDF + Dask provides GPU-accelerated distributed DataFrames
- Modin — Parallelizes pandas on CPU cores; cuDF parallelizes on GPU cores
- Vaex — Out-of-core CPU DataFrames; cuDF processes in-GPU-memory for lower latency
FAQ
Q: Do I need to rewrite my pandas code to use cuDF?
A: No. Use `python -m cudf.pandas script.py` to accelerate existing pandas scripts without code changes. For new code, the cuDF API mirrors pandas closely.
Q: How much GPU memory do I need?
A: Your dataset must fit in GPU memory. For larger-than-memory workloads, use Dask-cuDF to partition across multiple GPUs.
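The Dask-cuDF route for larger-than-memory data can be sketched as below. The guarded block keeps the snippet inspectable on machines without RAPIDS or Dask; `data/*.parquet` and the `key`/`value` columns are hypothetical placeholders.

```python
# Sketch: partition a larger-than-GPU-memory dataset across GPUs with
# Dask using the cuDF DataFrame backend. The broad except lets this run
# on machines without Dask/RAPIDS; the glob path and column names are
# hypothetical placeholders.
try:
    import dask
    import dask.dataframe as dd

    dask.config.set({"dataframe.backend": "cudf"})  # cuDF-backed partitions
    ddf = dd.read_parquet("data/*.parquet")         # hypothetical path
    result = ddf.groupby("key")["value"].sum().compute()
    dask_available = True
except Exception:
    dask_available = False  # Dask / cuDF / the data not available here

print("Dask-cuDF pipeline ran:", dask_available)
```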
Q: Can cuDF handle string and text data?
A: Yes, cuDF provides GPU-accelerated string operations including regex, split, replace, and contains.
Q: Which file formats are supported?
A: Parquet, CSV, ORC, JSON, and Apache Arrow IPC, all with GPU-accelerated readers and writers.
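A file round trip looks the same as in pandas; run under `python -m cudf.pandas`, the same read/write calls go through the GPU-accelerated readers and writers. A CSV sketch (CSV chosen so it runs without extra dependencies; the sample data is invented):

```python
import os
import tempfile

import pandas as pd

# CSV round trip with the pandas API. Launched via `python -m cudf.pandas`,
# the same to_csv/read_csv calls use cuDF's GPU-accelerated CSV code paths.
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.75, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scores.csv")
    df.to_csv(path, index=False)          # write
    round_tripped = pd.read_csv(path)     # read back

print(round_tripped)
```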