Introduction
cuDF is an open-source GPU DataFrame library that is part of the NVIDIA RAPIDS ecosystem. It provides a familiar pandas-like API while executing operations on NVIDIA GPUs, delivering dramatic speedups for data loading, transformation, and aggregation tasks common in data science and feature engineering workflows.
What cuDF Does
- Accelerates DataFrame operations (filter, join, groupby, sort) on NVIDIA GPUs
- Provides a pandas-compatible API so existing code runs with minimal changes
- Reads and writes Parquet, CSV, ORC, JSON, and Apache Arrow formats on GPU
- Offers a `cudf.pandas` accelerator that automatically dispatches operations to the GPU
- Integrates with Dask for multi-GPU and multi-node distributed processing
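The pandas-like API can be sketched as below. Since cuDF requires an NVIDIA GPU, this example falls back to plain pandas when cuDF is unavailable; the calls shown are identical in both libraries.

```python
# Sketch of cuDF's pandas-like API. cuDF needs an NVIDIA GPU, so this
# snippet degrades to pandas when cuDF is not installed; the DataFrame
# calls below are the same in both libraries.
try:
    import cudf as xdf  # GPU-backed DataFrames
except ImportError:
    import pandas as xdf  # CPU fallback with the same API

df = xdf.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Filter, then group and aggregate -- executed as CUDA kernels under cuDF.
filtered = df[df["value"] > 1]
totals = filtered.groupby("key").agg({"value": "sum"}).reset_index()

print(totals.sort_values("key"))
```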
Architecture Overview
cuDF stores columnar data in GPU memory using the Apache Arrow format. Operations are executed as CUDA kernels optimized for GPU parallelism. The library includes a JIT compiler that fuses custom UDFs into efficient GPU code. For multi-GPU workflows, cuDF integrates with Dask-cuDF to partition DataFrames across GPUs and coordinate shuffles. The cudf.pandas proxy layer intercepts pandas calls at runtime and routes supported operations to the GPU while falling back to pandas for unsupported ones.
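The intercept-and-fall-back behavior of the proxy layer can be illustrated with a toy dispatcher. This is purely illustrative of the idea, not cuDF's actual implementation; all class and method names here are invented.

```python
# Toy illustration of the cudf.pandas dispatch idea: try a GPU
# implementation first, fall back to the CPU one when the operation is
# unsupported. A simplification -- not cuDF's real proxy code.

class ToyProxy:
    def __init__(self, gpu_impl, cpu_impl):
        self._gpu = gpu_impl
        self._cpu = cpu_impl

    def __getattr__(self, name):
        def dispatch(*args, **kwargs):
            gpu_fn = getattr(self._gpu, name, None)
            if gpu_fn is not None:
                return gpu_fn(*args, **kwargs)  # supported: GPU path
            return getattr(self._cpu, name)(*args, **kwargs)  # fallback
        return dispatch


class GpuBackend:
    def add(self, a, b):
        return a + b  # stands in for a CUDA kernel


class CpuBackend:
    def add(self, a, b):
        return a + b

    def title_case(self, s):
        return s.title()  # "unsupported" on the toy GPU backend


proxy = ToyProxy(GpuBackend(), CpuBackend())
print(proxy.add(2, 3))           # dispatched to the GPU backend
print(proxy.title_case("cudf"))  # falls back to the CPU backend
```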
Self-Hosting & Configuration
- Install via pip: `pip install cudf-cu12` for CUDA 12, or use conda from the RAPIDS channel
- Requires an NVIDIA GPU with compute capability 7.0+ (Volta or newer)
- Use `cudf.pandas` as a drop-in accelerator: `python -m cudf.pandas script.py`
- Configure the RMM memory manager for custom GPU memory pool strategies
- Scale to multiple GPUs with Dask: `dask.dataframe.read_parquet()` using the cuDF backend
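The RMM pool configuration mentioned above can be sketched as follows. `rmm.reinitialize` is RMM's documented entry point; the guard lets the snippet run on machines without RAPIDS, and the 1 GiB pool size is just an example value.

```python
# Sketch: configure an RMM pool allocator so cuDF allocations come from a
# pre-reserved GPU memory pool instead of many small cudaMalloc calls.
# The broad except lets this degrade gracefully on machines without a GPU
# or without RMM installed; the pool size below is an arbitrary example.
try:
    import rmm

    rmm.reinitialize(
        pool_allocator=True,        # use a pooled allocator
        initial_pool_size=2 ** 30,  # reserve 1 GiB up front
    )
    pool_configured = True
except Exception:
    pool_configured = False  # no RMM / no GPU available here

print("RMM pool configured:", pool_configured)
```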
Key Features
- 10-100x speedup over pandas for large-scale data manipulation
- `cudf.pandas` provides zero-code-change GPU acceleration for existing scripts
- Native Parquet and ORC readers that decompress and decode directly on the GPU
- String processing, regex, and datetime operations fully GPU-accelerated
- Seamless interop with CuPy, cuML, and other RAPIDS libraries via `__cuda_array_interface__`
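The string and datetime acceleration can be exercised with ordinary pandas code, since the same script runs unchanged under `python -m cudf.pandas`. A small sketch (written against pandas so it also runs on CPU-only machines; the sample data is invented):

```python
import pandas as pd

# String and datetime operations like these are routed to GPU kernels
# when the same script is launched via `python -m cudf.pandas`.
df = pd.DataFrame({
    "email": ["alice@example.com", "bob@test.org", "carol@example.com"],
    "ts": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-15"]),
})

df["domain"] = df["email"].str.extract(r"@(.+)$")[0]   # regex extract
df["month"] = df["ts"].dt.month                        # datetime accessor
example_rows = df[df["domain"].str.contains("example")]

print(example_rows[["email", "month"]])
```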
Comparison with Similar Tools
- pandas — CPU-only; cuDF provides the same API with GPU acceleration
- Polars — Fast CPU DataFrame in Rust; cuDF leverages GPU parallelism for even larger speedups
- PySpark — Distributed CPU processing; cuDF + Dask provides GPU-accelerated distributed DataFrames
- Modin — Parallelizes pandas on CPU cores; cuDF parallelizes on GPU cores
- Vaex — Out-of-core CPU DataFrames; cuDF processes in-GPU-memory for lower latency
FAQ
Q: Do I need to rewrite my pandas code to use cuDF?
A: No. Use `python -m cudf.pandas script.py` to accelerate existing pandas scripts without code changes. For new code, the cuDF API mirrors pandas closely.
Q: How much GPU memory do I need?
A: Your dataset must fit in GPU memory. For larger-than-memory workloads, use Dask-cuDF to partition across multiple GPUs.
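The Dask-cuDF route for larger-than-memory data can be sketched as below. The guarded block keeps the snippet inspectable on machines without RAPIDS or Dask; `data/*.parquet` and the `key`/`value` columns are hypothetical placeholders.

```python
# Sketch: partition a larger-than-GPU-memory dataset across GPUs with
# Dask using the cuDF DataFrame backend. The broad except lets this run
# on machines without Dask/RAPIDS; the glob path and column names are
# hypothetical placeholders.
try:
    import dask
    import dask.dataframe as dd

    dask.config.set({"dataframe.backend": "cudf"})  # cuDF-backed partitions
    ddf = dd.read_parquet("data/*.parquet")         # hypothetical path
    result = ddf.groupby("key")["value"].sum().compute()
    dask_available = True
except Exception:
    dask_available = False  # Dask / cuDF / the data not available here

print("Dask-cuDF pipeline ran:", dask_available)
```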
Q: Can cuDF handle string and text data?
A: Yes, cuDF provides GPU-accelerated string operations including regex, split, replace, and contains.
Q: Which file formats are supported?
A: Parquet, CSV, ORC, JSON, and Apache Arrow IPC, all with GPU-accelerated readers and writers.
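A file round trip looks the same as in pandas; run under `python -m cudf.pandas`, the same read/write calls go through the GPU-accelerated readers and writers. A CSV sketch (CSV chosen so it runs without extra dependencies; the sample data is invented):

```python
import os
import tempfile

import pandas as pd

# CSV round trip with the pandas API. Launched via `python -m cudf.pandas`,
# the same to_csv/read_csv calls use cuDF's GPU-accelerated CSV code paths.
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.75, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scores.csv")
    df.to_csv(path, index=False)          # write
    round_tripped = pd.read_csv(path)     # read back

print(round_tripped)
```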