# Apache Arrow — Columnar In-Memory Format and Compute Runtime

> A cross-language columnar format, zero-copy IPC, and compute library that has become the common data plane for DuckDB, Polars, Pandas 2.x, Spark, Snowflake clients, and most modern analytics tools.

## Install

```bash
pip install "pyarrow>=16"
```

## Quick Use

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Build a table in memory
tbl = pa.table({"id": [1, 2, 3], "price": [9.99, 19.5, 4.25]})

# Vectorised math and filter — no Python loops
tbl = tbl.append_column("tax", pc.multiply(tbl["price"], 0.07))
tbl = tbl.filter(pc.greater(tbl["tax"], 0.5))

# Write to Parquet (encoded and compressed on disk, so not zero-copy)
pq.write_table(tbl, "/tmp/orders.parquet")
```

## Introduction

Apache Arrow defines a language-independent columnar memory layout for flat and hierarchical data, plus libraries in C++, Java, Rust, Go, Python, R, JavaScript, C#, and more that read, write, transport, and process that layout without serialization. It turns analytics tools from data-copying islands into a cooperative ecosystem.

## What Arrow Does

- Specifies a CPU-friendly columnar format that all implementations share bit-for-bit.
- Provides readers/writers for Parquet, Feather/IPC, CSV, JSON, and ORC (Avro support varies by implementation).
- Ships Arrow Compute: vectorised, SIMD-accelerated kernels for filter, math, string, temporal, and aggregate operations.
- Exposes Arrow Flight, a gRPC-based, high-throughput RPC protocol for moving Arrow record batches.
- Supplies DataFusion (originally Arrow DataFusion, now a top-level Apache project), a Rust query engine used by Ballista, dask-sql, InfluxDB 3, and more.

## Architecture Overview

Each array stores its values in contiguous per-column buffers with a separate validity bitmap for nulls; nested types compose by reference, and dictionary encoding dedupes repeated values. A RecordBatch bundles equal-length columns under one schema; a Table is a virtual concatenation of record batches.
Arrow IPC uses the same layout on the wire as in memory, so producers and consumers can share buffers via memory mapping or zero-copy gRPC. Compute kernels iterate these buffers directly with SIMD and bit-packing tricks.

## Self-Hosting & Configuration

- Install per language: `pip install pyarrow`, `cargo add arrow`, `npm i apache-arrow`, etc.
- Use `pyarrow.dataset` to read partitioned Parquet/CSV lakes without loading everything into memory.
- Point Arrow Flight SQL clients at DuckDB, Dremio, InfluxDB 3, or Ballista for remote SQL.
- Share memory between processes with Arrow IPC files or memory-mapped buffers (the Plasma object store is deprecated).
- Enable alternative memory pools (jemalloc/mimalloc) to keep allocation overhead low on large batches.

## Key Features

- Zero-copy hand-off between Pandas 2.x, Polars, DuckDB, Spark 3.4+, and Rust tools.
- SIMD-accelerated compute kernels on AVX2, AVX-512, and Neon.
- Arrow Flight: often an order of magnitude faster than ODBC/JDBC in published benchmarks for bulk tabular transfer.
- Extension types and user-defined scalar/aggregate functions in compute.
- First-class support for nested struct and list columns — no JSON escape hatch needed.

## Comparison with Similar Tools

- **Parquet / ORC** — on-disk columnar formats; Arrow is their in-memory complement.
- **Protocol Buffers / Avro** — row-oriented serialization formats, not friendly to vectorised compute.
- **Pandas pre-2.0** — NumPy-backed; Arrow-backed Pandas removes copies and null-handling headaches.
- **NumPy** — numeric arrays only; no strings, structs, unions, or nulls.
- **Feather v1** — early Arrow file format, now superseded by Arrow IPC (Feather v2).

## FAQ

**Q:** Is Arrow a database?
**A:** No, it is a format plus compute library that databases and DataFrame tools share.

**Q:** How is Arrow different from Parquet?
**A:** Parquet is an on-disk, compressed layout; Arrow is an uncompressed, CPU-optimised, in-memory layout.

**Q:** Can I use Arrow for streaming?
**A:** Yes — the Arrow IPC streaming format and Flight both support unbounded RecordBatch streams.

**Q:** Do Polars and DuckDB really share buffers?
**A:** Yes — Polars' `df.to_arrow()` and DuckDB's result `.arrow()` hand back Arrow data that downstream tools consume without copying.

## Sources

- https://github.com/apache/arrow
- https://arrow.apache.org/docs/