Configs · April 15, 2026 · 1 min read

Apache Arrow — Columnar In-Memory Format and Compute Runtime

A cross-language columnar format, zero-copy IPC mechanism and compute library that has become the common data plane for DuckDB, Polars, Pandas 2.x, Spark, Snowflake clients and most modern analytics tools.

Introduction

Apache Arrow defines a language-independent columnar memory layout for flat and hierarchical data, plus libraries in C++, Java, Rust, Go, Python, R, JS, C# and more that read, write, transport and process it without serialization. It turns analytics tools from data-copying islands into a cooperative ecosystem.

What Arrow Does

  • Specifies a CPU-friendly columnar format that all implementations share bit-for-bit.
  • Provides readers/writers for Parquet, Feather/IPC, CSV, JSON, ORC and Avro.
  • Ships Arrow Compute: vectorised, SIMD-accelerated kernels for filter, math, string, temporal and aggregate ops.
  • Exposes Arrow Flight, a gRPC-based high-throughput RPC protocol for moving Arrow batches.
  • Supplies DataFusion, a full SQL query engine in Rust (now the top-level Apache DataFusion project) used by Ballista, dask-sql, InfluxDB 3 and more.

Architecture Overview

Arrays are stored in contiguous buffers per column with separate validity bitmaps; nested types compose by reference, and dictionaries dedupe repeated values. A RecordBatch bundles columns; a Table is a virtual concatenation. IPC uses the same in-memory layout on the wire, so producers and consumers share buffers via memory-mapping or zero-copy gRPC. Compute kernels iterate these buffers with SIMD and bit-packing tricks.

Self-Hosting & Configuration

  • Install per language: pip install pyarrow, cargo add arrow, npm i apache-arrow, etc.
  • Use pyarrow.dataset to read partitioned Parquet/CSV lakes without loading everything.
  • Point Arrow Flight SQL clients at DuckDB, Dremio, InfluxDB 3 or Ballista for remote SQL.
  • Share memory between processes with Arrow IPC files (often memory-mapped) or shared-memory buffers; the older Plasma object store is deprecated and has been removed from recent releases.
  • Enable memory pools (jemalloc/mimalloc) to keep allocation overhead low on large batches.

Key Features

  • Zero-copy hand-off between Pandas 2.x, Polars, DuckDB, Spark 3.4+, and Rust tools.
  • SIMD-accelerated compute kernels on AVX2/AVX-512/Neon.
  • Arrow Flight: often an order of magnitude faster than ODBC/JDBC for moving tabular data across the network, since batches travel in the same layout clients already use in memory.
  • Extension types and user-defined scalar/aggregate functions in compute.
  • First-class support for nested and list columns — no JSON escape hatch needed.

Comparison with Similar Tools

  • Parquet / ORC — on-disk columnar formats; Arrow is the in-memory complement.
  • Protocol Buffers / Avro — row-oriented, not vectorised compute-friendly.
  • Pandas pre-2.0 — NumPy-backed; Arrow-backed Pandas removes copies and null headaches.
  • NumPy — numeric and fixed-width arrays only; no variable-length strings, structs, unions or native nulls.
  • Feather v1 — early Arrow format, now superseded by Arrow IPC v2.

FAQ

Q: Is Arrow a database?
A: No. It is a format plus compute library that databases and DataFrame tools share.

Q: How is Arrow different from Parquet?
A: Parquet is an on-disk, compressed layout; Arrow is an uncompressed, CPU-optimised, in-memory one.

Q: Can I use Arrow for streaming?
A: Yes. The Arrow IPC streaming format and Flight both support unbounded RecordBatch streams.

Q: Do Polars and DuckDB really share buffers?
A: Yes. df.to_arrow() in Polars, or calling .arrow() on a DuckDB result, hands back the same memory without copying.
