# cuDF: GPU-Accelerated DataFrame Library by NVIDIA RAPIDS

> cuDF is a GPU-accelerated DataFrame library from the NVIDIA RAPIDS suite that provides a pandas-like API for data manipulation and can run 10-100x faster than pandas on NVIDIA GPUs.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

```bash
pip install cudf-cu12
python -c "
import cudf
gdf = cudf.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(gdf.groupby('a').sum())
"

# Or use cudf.pandas for zero-code-change acceleration
python -m cudf.pandas your_script.py
```

## Introduction

cuDF is an open-source GPU DataFrame library that is part of the NVIDIA RAPIDS ecosystem. It provides a familiar pandas-like API while executing operations on NVIDIA GPUs, which can substantially speed up the data loading, transformation, and aggregation tasks common in data science and feature engineering workflows.

## What cuDF Does

- Accelerates DataFrame operations (filter, join, groupby, sort) on NVIDIA GPUs
- Provides a pandas-compatible API so existing code runs with minimal changes
- Reads and writes Parquet, CSV, ORC, JSON, and Apache Arrow formats on GPU
- Offers a `cudf.pandas` accelerator that automatically dispatches operations to GPU
- Integrates with Dask for multi-GPU and multi-node distributed processing

## Architecture Overview

cuDF stores columnar data in GPU memory using the Apache Arrow columnar format. Operations are executed as CUDA kernels optimized for GPU parallelism. Custom UDFs are JIT-compiled (via Numba) into GPU code. For multi-GPU workflows, cuDF integrates with Dask-cuDF to partition DataFrames across GPUs and coordinate shuffles. The `cudf.pandas` proxy layer intercepts pandas calls at runtime and routes supported operations to the GPU, falling back to pandas for unsupported ones.
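The fallback behavior described for the `cudf.pandas` proxy layer can be pictured with a small toy dispatcher. This is only an illustrative sketch in plain Python, not cuDF's actual implementation: the `FastBackend`, `SlowBackend`, `Proxy`, and `Unsupported` names are hypothetical stand-ins for the GPU path, the pandas path, and the routing layer.

```python
class Unsupported(Exception):
    """Raised by the fast backend when it cannot handle an operation."""


class FastBackend:
    """Hypothetical stand-in for the GPU path; supports only some operations."""

    def sum(self, values):
        return sum(values)

    def median(self, values):
        # Pretend this operation has no accelerated implementation.
        raise Unsupported("median")


class SlowBackend:
    """Hypothetical stand-in for the CPU (pandas) path; supports everything."""

    def sum(self, values):
        return sum(values)

    def median(self, values):
        ordered = sorted(values)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2


class Proxy:
    """Tries the fast backend first and falls back when it declines."""

    def __init__(self, fast, slow):
        self._fast = fast
        self._slow = slow
        self.log = []  # records which backend served each call

    def __getattr__(self, name):
        # Only reached for names not set in __init__, i.e. operation names.
        def call(*args, **kwargs):
            try:
                result = getattr(self._fast, name)(*args, **kwargs)
                self.log.append((name, "fast"))
            except Unsupported:
                result = getattr(self._slow, name)(*args, **kwargs)
                self.log.append((name, "slow"))
            return result

        return call


proxy = Proxy(FastBackend(), SlowBackend())
total = proxy.sum([1, 2, 3])       # served by the fast backend
middle = proxy.median([5, 1, 3])   # fast backend declines, slow backend answers
print(total, middle, proxy.log)
```

The caller never chooses a backend; the proxy decides per call, which is the property that lets an unmodified script keep working when an operation has no accelerated path.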
## Self-Hosting & Configuration

- Install via pip (`pip install cudf-cu12` for CUDA 12) or via conda from the RAPIDS channel
- Requires an NVIDIA GPU with compute capability 7.0+ (Volta or newer)
- Use `cudf.pandas` as a drop-in accelerator: `python -m cudf.pandas script.py`
- Configure the RMM memory manager for custom GPU memory pool strategies
- Scale to multiple GPUs with Dask: `dask.dataframe.read_parquet()` using the cuDF backend

## Key Features

- Speedups of 10-100x over pandas for large-scale data manipulation, depending on workload and hardware
- `cudf.pandas` provides zero-code-change GPU acceleration for existing scripts
- Native Parquet and ORC readers that decompress and decode directly on GPU
- String processing, regex, and datetime operations run on the GPU
- Interop with CuPy, cuML, and other RAPIDS libraries via `__cuda_array_interface__`

## Comparison with Similar Tools

- **pandas**: CPU-only; cuDF mirrors its API and adds GPU acceleration
- **Polars**: Fast CPU DataFrame library written in Rust; cuDF relies on GPU parallelism, which can give larger speedups on big datasets
- **PySpark**: Distributed CPU processing; cuDF + Dask provides GPU-accelerated distributed DataFrames
- **Modin**: Parallelizes pandas on CPU cores; cuDF parallelizes on GPU cores
- **Vaex**: Out-of-core CPU DataFrames; cuDF processes data held in GPU memory for lower latency

## FAQ

**Q: Do I need to rewrite my pandas code to use cuDF?**
A: No. Use `python -m cudf.pandas` to accelerate existing pandas scripts without code changes. For new code, the cuDF API mirrors pandas closely.

**Q: How much GPU memory do I need?**
A: Your dataset must fit in GPU memory. For larger-than-memory workloads, use Dask-cuDF to partition across multiple GPUs.

**Q: Can cuDF handle string and text data?**
A: Yes, cuDF provides GPU-accelerated string operations including regex, split, replace, and contains.

**Q: Which file formats are supported?**
A: Parquet, CSV, ORC, JSON, and Apache Arrow IPC, all with GPU-accelerated readers and writers.
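For the GPU memory question, a rough back-of-the-envelope check can help decide whether a table is likely to fit on one GPU or should be partitioned. The sketch below is plain Python and makes several assumptions: it counts only fixed-width columns (no strings, no null masks), and the `headroom` fraction reserved for temporaries is a hypothetical placeholder, not an official RAPIDS recommendation. The helper names `estimate_bytes` and `partitions_needed` are illustrative and not part of cuDF.

```python
import math

# Bytes per value for a few fixed-width dtypes.
DTYPE_BYTES = {"int32": 4, "int64": 8, "float32": 4, "float64": 8}


def estimate_bytes(n_rows, dtypes):
    """Approximate size of fixed-width columns, ignoring null masks and strings."""
    return n_rows * sum(DTYPE_BYTES[d] for d in dtypes)


def partitions_needed(data_bytes, gpu_bytes, headroom=0.5):
    """Number of equal partitions so each fits in the usable share of one GPU.

    `headroom` is an assumed fraction of GPU memory kept free for temporaries
    created by joins, sorts, and similar operations.
    """
    usable = gpu_bytes * (1 - headroom)
    return max(1, math.ceil(data_bytes / usable))


# 500 million rows with one int64, one float64, and one int32 column.
size = estimate_bytes(500_000_000, ["int64", "float64", "int32"])
# A GPU with 16 GiB of memory, half of it assumed usable for the data itself.
parts = partitions_needed(size, gpu_bytes=16 * 1024**3)
print(f"{size / 1024**3:.1f} GiB -> {parts} partition(s)")
```

A result above one partition suggests the table is a candidate for Dask-cuDF; real memory use depends on the operations performed, so treat the estimate as a starting point only.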
## Sources

- https://github.com/rapidsai/cudf
- https://docs.rapids.ai/api/cudf/stable/

---

Source: https://tokrepo.com/en/workflows/cudf-gpu-accelerated-dataframe-library-nvidia-rapids-8fee0711

Author: AI Open Source