Skills · May 3, 2026 · 1 min read

cuDF — GPU-Accelerated DataFrame Library by NVIDIA RAPIDS

cuDF is a GPU-accelerated DataFrame library from the NVIDIA RAPIDS suite. It provides a pandas-like API for data manipulation on NVIDIA GPUs, with speedups over pandas often quoted in the 10-100x range for large workloads.

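Agent-ready

This asset can be read and installed directly by an Agent.

TokRepo provides a universal CLI command, an install contract, metadata JSON, adapter-specific install plans, and raw content links, so an Agent can assess fit, risk, and next steps.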

Native · 98/100 · Policy: allowed
Agent entry: any MCP/CLI Agent
Type: Skill
Install: Single
Trust level: Established
Entry: cuDF GPU DataFrames
Universal CLI install command:
npx tokrepo install 8fee0711-470d-11f1-9bc6-00163e2b0d79

Introduction

cuDF is an open-source GPU DataFrame library that is part of the NVIDIA RAPIDS ecosystem. It provides a familiar pandas-like API while executing operations on NVIDIA GPUs, delivering dramatic speedups for data loading, transformation, and aggregation tasks common in data science and feature engineering workflows.

What cuDF Does

  • Accelerates DataFrame operations (filter, join, groupby, sort) on NVIDIA GPUs
  • Provides a pandas-compatible API so existing code runs with minimal changes
  • Reads and writes Parquet, CSV, ORC, JSON, and Apache Arrow formats on GPU
  • Offers a cudf.pandas accelerator that automatically dispatches operations to GPU
  • Integrates with Dask for multi-GPU and multi-node distributed processing
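The operations listed above can be sketched as a small filter, join, groupby, and sort pipeline. This is a minimal illustration with made-up data, not a benchmark. It assumes cuDF is installed and falls back to pandas otherwise so that it still runs on a CPU-only machine; the `to_host` helper and the table contents are hypothetical.

```python
# Fall back to pandas when cuDF is not installed, so the sketch still runs
# on a CPU-only machine. Every call used below exists in both libraries.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf


def to_host(obj):
    """Copy a cuDF object to pandas; pass pandas objects through unchanged."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


orders = xdf.DataFrame({
    "customer_id": [1, 2, 1, 3, 2],
    "amount": [20.0, 35.5, 12.5, 80.0, 4.5],
})
customers = xdf.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
})

large = orders[orders["amount"] > 10.0]                          # filter
joined = large.merge(customers, on="customer_id", how="inner")   # join
totals = (
    joined.groupby("region")["amount"].sum()                     # groupby
    .reset_index()
    .sort_values("amount", ascending=False)                      # sort
    .reset_index(drop=True)
)

totals_host = to_host(totals)
result = dict(zip(totals_host["region"], totals_host["amount"]))
print(result)
```

Results are copied to the host only at the end; on a GPU, keeping intermediate frames on the device avoids repeated transfers.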

Architecture Overview

cuDF stores columnar data in GPU memory using the Apache Arrow format. Operations are executed as CUDA kernels optimized for GPU parallelism. User-defined functions (UDFs) are JIT-compiled into GPU code through Numba rather than interpreted row by row. For multi-GPU workflows, cuDF integrates with Dask-cuDF to partition DataFrames across GPUs and coordinate shuffles. The cudf.pandas proxy layer intercepts pandas calls at runtime and routes supported operations to the GPU, falling back to pandas for unsupported ones.
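The fallback behaviour of the proxy layer can be illustrated with a toy dispatcher in plain Python. This is a conceptual sketch only and does not reflect how cudf.pandas is actually implemented; `FallbackProxy`, `FastBackend`, and `SlowBackend` are hypothetical names standing in for the GPU and CPU paths.

```python
class FallbackProxy:
    """Try a fast backend first; use the slow backend if the fast one cannot serve the call."""

    def __init__(self, fast, slow):
        self._fast = fast
        self._slow = slow
        self.log = []  # records which backend served each call

    def __getattr__(self, name):
        # Only reached for attributes not found normally, i.e. backend methods.
        def dispatch(*args, **kwargs):
            fast_fn = getattr(self._fast, name, None)
            if fast_fn is not None:
                try:
                    result = fast_fn(*args, **kwargs)
                    self.log.append((name, "fast"))
                    return result
                except NotImplementedError:
                    pass  # unsupported on the fast path: fall through
            result = getattr(self._slow, name)(*args, **kwargs)
            self.log.append((name, "slow"))
            return result

        return dispatch


class FastBackend:
    """Hypothetical stand-in for the GPU path; supports only some operations."""

    def total(self, values):
        return sum(values)

    def median(self, values):
        raise NotImplementedError("median is not supported on the fast path")


class SlowBackend:
    """Hypothetical stand-in for the CPU path; supports everything."""

    def total(self, values):
        return sum(values)

    def median(self, values):
        ordered = sorted(values)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2


proxy = FallbackProxy(FastBackend(), SlowBackend())
total_result = proxy.total([1, 2, 3])    # served by the fast backend
median_result = proxy.median([3, 1, 2])  # falls back to the slow backend
print(proxy.log)
```

The caller sees one interface and never chooses a backend; a real proxy would also have to move data between device and host when it falls back, a cost this sketch ignores.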

Self-Hosting & Configuration

  • Install via pip: pip install cudf-cu12 for CUDA 12 or use conda from the RAPIDS channel
  • Requires an NVIDIA GPU with compute capability 7.0+ (Volta or newer)
  • Use cudf.pandas as a drop-in accelerator: python -m cudf.pandas script.py
  • Configure the RMM memory manager for custom GPU memory pool strategies
  • Scale to multiple GPUs with Dask-cuDF: set Dask's dataframe.backend option to cudf, then call dask.dataframe.read_parquet() as usual
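The RMM and Dask settings above might be wired up roughly as follows. This is a sketch under assumptions: the helper names `gpu_stack_available`, `setup_rmm_pool`, and `read_parquet_on_gpus` are hypothetical, the 1 GiB pool size is an arbitrary example, and the GPU-specific calls are skipped when the packages are not installed.

```python
import importlib.util


def gpu_stack_available() -> bool:
    """Return True only if both rmm and cudf are importable."""
    return all(importlib.util.find_spec(name) is not None for name in ("rmm", "cudf"))


def setup_rmm_pool(initial_pool_bytes: int = 2**30) -> bool:
    """Switch RMM to a pooled allocator; return False when the GPU stack is absent."""
    if not gpu_stack_available():
        return False
    import rmm

    # A pool pre-allocates GPU memory once and sub-allocates from it afterwards.
    rmm.reinitialize(pool_allocator=True, initial_pool_size=initial_pool_bytes)
    return True


def read_parquet_on_gpus(path: str):
    """Read Parquet as a Dask collection with cuDF partitions (needs dask-cudf)."""
    import dask
    import dask.dataframe as dd

    dask.config.set({"dataframe.backend": "cudf"})
    return dd.read_parquet(path)


pool_enabled = setup_rmm_pool()
print("RMM pool enabled:", pool_enabled)
```

Configure the pool before creating any cuDF objects, since reinitializing RMM invalidates existing device allocations.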

Key Features

  • Speedups of roughly 10-100x over pandas on large-scale data manipulation, depending on workload and hardware
  • cudf.pandas provides zero-code-change GPU acceleration for existing scripts
  • Native Parquet and ORC readers that decompress and decode directly on GPU
  • GPU-accelerated string processing, regex, and datetime operations
  • Interop with CuPy, cuML, and other RAPIDS libraries via __cuda_array_interface__
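As a rough illustration of the CuPy interop mentioned above, a cuDF column can be handed to CuPy for array math. The sketch assumes both cuDF and CuPy are installed and falls back to pandas and NumPy otherwise; the values are made up.

```python
# Use the GPU pair (cuDF + CuPy) when available, else the CPU pair.
try:
    import cudf as xdf
    import cupy as xp
except ImportError:
    import pandas as xdf
    import numpy as xp

s = xdf.Series([1.0, 2.0, 3.0, 4.0])

# On the GPU path, CuPy reads the column through __cuda_array_interface__.
arr = xp.asarray(s)

# Euclidean norm computed with array operations; float() brings the scalar to the host.
norm = float(xp.sqrt((arr ** 2).sum()))
print(norm)
```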

Comparison with Similar Tools

  • pandas: CPU-only; cuDF offers a largely compatible API with GPU acceleration
  • Polars: fast CPU DataFrame library written in Rust; cuDF relies on GPU parallelism, which tends to pay off on larger datasets
  • PySpark: distributed CPU processing; cuDF with Dask provides GPU-accelerated distributed DataFrames
  • Modin: parallelizes pandas across CPU cores; cuDF parallelizes across GPU cores
  • Vaex: out-of-core CPU DataFrames; cuDF keeps data in GPU memory for lower latency

FAQ

Q: Do I need to rewrite my pandas code to use cuDF? A: No. Use python -m cudf.pandas to accelerate existing pandas scripts without code changes. For new code, the cuDF API mirrors pandas closely.
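Besides the command-line form, the accelerator can be enabled from inside a script by calling cudf.pandas.install() before pandas is imported. The sketch below assumes that order of imports and simply runs plain pandas when cuDF is not installed; the data is illustrative.

```python
# Enable the accelerator before importing pandas; skip it if cuDF is missing.
try:
    import cudf.pandas

    cudf.pandas.install()
    gpu_accelerated = True
except ImportError:
    gpu_accelerated = False

import pandas as pd  # unchanged pandas code from here on

df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
sums = df.groupby("key")["value"].sum()
total_a = int(sums["a"])
print("accelerated:", gpu_accelerated, "sum for 'a':", total_a)
```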

Q: How much GPU memory do I need? A: By default, the data you operate on must fit in GPU memory, with headroom for intermediate results such as joins and sorts. For larger-than-memory workloads, use Dask-cuDF to partition the data across multiple GPUs.

Q: Can cuDF handle string and text data? A: Yes, cuDF provides GPU-accelerated string operations including regex, split, replace, and contains.
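A short sketch of the string operations named in that answer, using the pandas-style .str accessor. It assumes cuDF is installed and falls back to pandas otherwise; the sample values and the `to_host` helper are hypothetical.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf


def to_host(obj):
    """Copy a cuDF object to pandas; pass pandas objects through unchanged."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


s = xdf.Series(["gpu-001", "cpu-002", "gpu-003"])

is_gpu = s.str.contains("^gpu-", regex=True)   # regex match
ids = s.str.replace("-", "_", regex=False)     # literal replace
parts = s.str.split("-", expand=True)          # split into columns

flags = list(to_host(is_gpu))
renamed = list(to_host(ids))
suffixes = list(to_host(parts).iloc[:, 1])
print(flags, renamed, suffixes)
```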

Q: Which file formats are supported? A: Parquet, CSV, ORC, JSON, and Apache Arrow IPC, all with GPU-accelerated readers and writers.
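A minimal round trip through one of these formats might look like the sketch below. CSV is used so the pandas fallback needs no extra dependency; Parquet follows the same pattern with to_parquet() and read_parquet(). The file name and data are illustrative.

```python
import os
import tempfile

try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

df = xdf.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.25, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scores.csv")
    df.to_csv(path, index=False)   # write without the index column
    loaded = xdf.read_csv(path)    # read back into a DataFrame

row_count = len(loaded)
columns = list(loaded.columns)
score_sum = float(loaded["score"].sum())
print(row_count, columns, score_sum)
```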
