Configs · May 3, 2026 · 3 min read

cuDF — GPU-Accelerated DataFrame Library by NVIDIA RAPIDS

cuDF is a GPU-accelerated DataFrame library from the NVIDIA RAPIDS suite that provides a pandas-like API for data manipulation, typically running 10-100x faster than CPU pandas on NVIDIA GPUs.

Introduction

cuDF is an open-source GPU DataFrame library that is part of the NVIDIA RAPIDS ecosystem. It provides a familiar pandas-like API while executing operations on NVIDIA GPUs, delivering dramatic speedups for data loading, transformation, and aggregation tasks common in data science and feature engineering workflows.

What cuDF Does

  • Accelerates DataFrame operations (filter, join, groupby, sort) on NVIDIA GPUs
  • Provides a pandas-compatible API so existing code runs with minimal changes
  • Reads and writes Parquet, CSV, ORC, JSON, and Apache Arrow formats on GPU
  • Offers a cudf.pandas accelerator that automatically dispatches operations to GPU
  • Integrates with Dask for multi-GPU and multi-node distributed processing
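The pandas-like API described above can be sketched as follows. Because cuDF mirrors pandas, the same code runs on CPU pandas when no GPU is available; the import fallback here is an illustrative convenience for portability, not part of cuDF itself.

```python
# Filter, group, and aggregate with the pandas-style API that cuDF mirrors.
try:
    import cudf as xdf  # GPU-backed DataFrames (requires an NVIDIA GPU)
except ImportError:
    import pandas as xdf  # CPU fallback with the same API

df = xdf.DataFrame({
    "key": ["a", "b", "a", "c", "b"],
    "value": [1, 2, 3, 4, 5],
})

# Filter rows, then group and sum -- identical calls in cuDF and pandas.
filtered = df[df["value"] > 1]
agg = filtered.groupby("key").agg({"value": "sum"}).reset_index()
print(agg.sort_values("key"))
```

On a GPU with cuDF installed, every step above executes as CUDA kernels; on a CPU-only machine the fallback runs the same logic in pandas.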

Architecture Overview

cuDF stores columnar data in GPU memory using the Apache Arrow format. Operations are executed as CUDA kernels optimized for GPU parallelism. The library includes a JIT compiler that fuses custom UDFs into efficient GPU code. For multi-GPU workflows, cuDF integrates with Dask-cuDF to partition DataFrames across GPUs and coordinate shuffles. The cudf.pandas proxy layer intercepts pandas calls at runtime and routes supported operations to the GPU while falling back to pandas for unsupported ones.
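The cudf.pandas proxy layer can also be enabled programmatically. A minimal sketch, assuming cuDF is installed: cudf.pandas.install() must run before pandas is imported, after which "import pandas" yields the GPU-dispatching proxy. The try/except fallback keeps the sketch runnable on CPU-only machines.

```python
# Enable the cudf.pandas accelerator before importing pandas.
try:
    import cudf.pandas
    cudf.pandas.install()  # subsequent "import pandas" returns the GPU proxy
except ImportError:
    pass  # no cuDF available: plain CPU pandas is used below

import pandas as pd

df = pd.DataFrame({"x": range(6)})
# With the proxy active, supported operations run on the GPU;
# unsupported ones transparently fall back to CPU pandas.
print(df["x"].mean())
```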

Self-Hosting & Configuration

  • Install via pip: pip install cudf-cu12 for CUDA 12 or use conda from the RAPIDS channel
  • Requires an NVIDIA GPU with compute capability 7.0+ (Volta or newer)
  • Use cudf.pandas as a drop-in accelerator: python -m cudf.pandas script.py
  • Configure the RMM memory manager for custom GPU memory pool strategies
  • Scale to multiple GPUs with Dask: dask.dataframe.read_parquet() using the cuDF backend
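Configuring the RMM memory pool mentioned above can look like the sketch below. The 1 GiB pool size is illustrative, not a recommendation; the helper function and its fallback are conveniences for this sketch, while rmm.reinitialize itself is the real RMM entry point.

```python
# Sketch: enable RMM's pooled allocator before the first cuDF allocation.
# Pooling avoids a cudaMalloc/cudaFree round trip per operation.
def configure_rmm_pool(initial_pool_size: int = 2**30) -> bool:
    """Enable an RMM memory pool; returns False when RMM is unavailable."""
    try:
        import rmm
    except ImportError:
        return False  # CPU-only environment: nothing to configure
    rmm.reinitialize(
        pool_allocator=True,
        initial_pool_size=initial_pool_size,  # 1 GiB up front (illustrative)
    )
    return True

enabled = configure_rmm_pool()
print("RMM pool enabled:", enabled)
```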

Key Features

  • 10-100x speedup over pandas for large-scale data manipulation
  • cudf.pandas provides zero-code-change GPU acceleration for existing scripts
  • Native Parquet and ORC readers that decompress and decode directly on GPU
  • String processing, regex, and datetime operations fully GPU-accelerated
  • Seamless interop with CuPy, CuML, and other RAPIDS libraries via __cuda_array_interface__
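The __cuda_array_interface__ interop in the last bullet can be sketched as follows: a cuDF column hands its device buffer to CuPy without a host round trip. The NumPy/pandas branch is an illustrative CPU analog so the sketch stays runnable without a GPU.

```python
# Zero-copy handoff from a cuDF column to a CuPy array (GPU path),
# with a NumPy/pandas analog as the CPU fallback.
try:
    import cudf
    import cupy as xp
    ser = cudf.Series([1.0, 2.0, 3.0])
    arr = ser.to_cupy()  # shares GPU memory via __cuda_array_interface__
except ImportError:
    import numpy as xp
    import pandas as pd
    ser = pd.Series([1.0, 2.0, 3.0])
    arr = ser.to_numpy()

# Downstream array math stays on the same device as the DataFrame.
print(xp.sqrt(arr).sum())
```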

Comparison with Similar Tools

  • pandas — CPU-only; cuDF provides the same API with GPU acceleration
  • Polars — Fast CPU DataFrame in Rust; cuDF leverages GPU parallelism for even larger speedups
  • PySpark — Distributed CPU processing; cuDF + Dask provides GPU-accelerated distributed DataFrames
  • Modin — Parallelizes pandas on CPU cores; cuDF parallelizes on GPU cores
  • Vaex — Out-of-core CPU DataFrames; cuDF processes in-GPU-memory for lower latency

FAQ

Q: Do I need to rewrite my pandas code to use cuDF? A: No. Use python -m cudf.pandas to accelerate existing pandas scripts without code changes. For new code, the cuDF API mirrors pandas closely.

Q: How much GPU memory do I need? A: Your dataset must fit in GPU memory. For larger-than-memory workloads, use Dask-cuDF to partition across multiple GPUs.

Q: Can cuDF handle string and text data? A: Yes, cuDF provides GPU-accelerated string operations including regex, split, replace, and contains.

Q: Which file formats are supported? A: Parquet, CSV, ORC, JSON, and Apache Arrow IPC, all with GPU-accelerated readers and writers.
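A file-format round trip, using CSV as an example, can be sketched as follows. The fallback import keeps the snippet runnable without a GPU; with cuDF installed, the same calls encode and decode on the GPU.

```python
# CSV round trip with the shared cuDF/pandas API.
import os
import tempfile

try:
    import cudf as xdf  # GPU-accelerated readers and writers
except ImportError:
    import pandas as xdf  # same API on CPU

df = xdf.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
path = os.path.join(tempfile.mkdtemp(), "data.csv")
df.to_csv(path, index=False)
back = xdf.read_csv(path)
print(len(back), list(back.columns))
```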

