# Apache Arrow — Columnar In-Memory Format and Compute Runtime

> A cross-language columnar format, zero-copy IPC, and compute library that has become the common data plane for DuckDB, Polars, Pandas 2.x, Spark, Snowflake clients, and most modern analytics tools.

## Install

```bash
pip install "pyarrow>=16"
```

## Quick Use

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Build a table in memory
tbl = pa.table({"id": [1, 2, 3], "price": [9.99, 19.5, 4.25]})

# Vectorised math and filter — no Python loops
tbl = tbl.append_column("tax", pc.multiply(tbl["price"], 0.07))
tbl = tbl.filter(pc.greater(tbl["tax"], 0.5))

# Write to Parquet (encoded and compressed on disk, so not zero-copy)
pq.write_table(tbl, "/tmp/orders.parquet")
```

## Introduction

Apache Arrow defines a language-independent columnar memory layout for flat and hierarchical data, plus libraries in C++, Java, Rust, Go, Python, R, JavaScript, C#, and more that read, write, transport, and process that layout without serialization. It turns analytics tools from data-copying islands into a cooperative ecosystem.

## What Arrow Does

- Specifies a CPU-friendly columnar format that all implementations share bit-for-bit.
- Provides readers/writers for Parquet, Feather/IPC, CSV, JSON, and ORC (Avro support varies by implementation).
- Ships Arrow Compute: vectorised, SIMD-accelerated kernels for filter, math, string, temporal, and aggregate operations.
- Exposes Arrow Flight, a gRPC-based, high-throughput RPC protocol for moving Arrow record batches.
- Supplies DataFusion (originally Arrow DataFusion, now a top-level Apache project), a Rust query engine used by Ballista, dask-sql, InfluxDB 3, and more.

## Architecture Overview

Each array stores its values in contiguous per-column buffers with a separate validity bitmap for nulls; nested types compose by reference, and dictionary encoding dedupes repeated values. A RecordBatch bundles equal-length columns under one schema; a Table is a virtual concatenation of record batches.
Arrow IPC uses the same layout on the wire as in memory, so producers and consumers can share buffers via memory mapping or zero-copy gRPC. Compute kernels iterate these buffers directly with SIMD and bit-packing tricks.

## Self-Hosting & Configuration

- Install per language: `pip install pyarrow`, `cargo add arrow`, `npm i apache-arrow`, etc.
- Use `pyarrow.dataset` to read partitioned Parquet/CSV lakes without loading everything into memory.
- Point Arrow Flight SQL clients at DuckDB, Dremio, InfluxDB 3, or Ballista for remote SQL.
- Share memory between processes with Arrow IPC files or memory-mapped buffers (the Plasma object store is deprecated).
- Enable alternative memory pools (jemalloc/mimalloc) to keep allocation overhead low on large batches.

## Key Features

- Zero-copy hand-off between Pandas 2.x, Polars, DuckDB, Spark 3.4+, and Rust tools.
- SIMD-accelerated compute kernels on AVX2, AVX-512, and Neon.
- Arrow Flight: often an order of magnitude faster than ODBC/JDBC in published benchmarks for bulk tabular transfer.
- Extension types and user-defined scalar/aggregate functions in compute.
- First-class support for nested struct and list columns — no JSON escape hatch needed.

## Comparison with Similar Tools

- **Parquet / ORC** — on-disk columnar formats; Arrow is their in-memory complement.
- **Protocol Buffers / Avro** — row-oriented serialization formats, not friendly to vectorised compute.
- **Pandas pre-2.0** — NumPy-backed; Arrow-backed Pandas removes copies and null-handling headaches.
- **NumPy** — numeric arrays only; no strings, structs, unions, or nulls.
- **Feather v1** — early Arrow file format, now superseded by Arrow IPC (Feather v2).

## FAQ

**Q:** Is Arrow a database?
**A:** No, it is a format plus compute library that databases and DataFrame tools share.

**Q:** How is Arrow different from Parquet?
**A:** Parquet is an on-disk, compressed layout; Arrow is an uncompressed, CPU-optimised, in-memory layout.

**Q:** Can I use Arrow for streaming?
**A:** Yes — the Arrow IPC streaming format and Flight both support unbounded RecordBatch streams.

**Q:** Do Polars and DuckDB really share buffers?
**A:** Yes — Polars' `df.to_arrow()` and DuckDB's result `.arrow()` hand back Arrow data that downstream tools consume without copying.

## Sources

- https://github.com/apache/arrow
- https://arrow.apache.org/docs/