# Vaex — Out-of-Core DataFrames for Billion-Row Datasets

> Vaex is a Python library for lazy, out-of-core DataFrames that processes billion-row datasets on a single machine using memory-mapped files, an expression system, and just-in-time compilation.

## Install

```bash
pip install vaex
```

## Quick Use

```python
import vaex

df = vaex.open('large_dataset.hdf5')
print(df.count())
print(df.mean(df.column_a))
filtered = df[df.column_b > 100]
filtered.export_hdf5('filtered.hdf5')
```

## Introduction

Vaex is a Python DataFrame library designed for datasets that do not fit in memory. It uses memory-mapped files and lazy evaluation to compute statistics, filter, and transform billions of rows on a laptop without loading the full dataset into RAM.

## What Vaex Does

- Opens multi-gigabyte HDF5, Arrow, Parquet, and CSV files without loading them into memory
- Computes aggregations such as mean, sum, count, and histograms using streaming evaluation
- Applies virtual columns and expressions that are evaluated lazily on access
- Filters and groups data with a pandas-like API that stays out-of-core
- Renders interactive visualizations of billion-point datasets in Jupyter notebooks

## Architecture Overview

Vaex memory-maps data files so the OS pages data in and out as needed. Expressions form a computation graph that is evaluated lazily in chunks using C++ and NumPy. Aggregations process data in fixed-size blocks, accumulating results without materializing intermediate arrays. JIT compilation via Pythran or LLVM accelerates custom expressions. The result is near-instant opening of large files and memory usage that stays roughly constant regardless of dataset size.
## Self-Hosting & Configuration

- Install via pip; `vaex-core` is the base package, and `vaex-hdf5` and `vaex-arrow` add format support
- Convert CSV to HDF5 with `vaex.from_csv()` for optimal memory-mapped performance
- Use `vaex.open()` to lazily open HDF5, Parquet, or Arrow files
- Set `chunk_size` on aggregation calls to control the memory/speed trade-off
- Deploy the vaex-server component for remote DataFrame access over HTTP

## Key Features

- Opens a 1 TB HDF5 file in under a second with negligible memory overhead, since data is memory-mapped rather than read up front
- Lazy evaluation means no computation runs until results are explicitly requested
- String operations run in C++ for fast text processing on large columns
- Built-in plotting renders density maps of billion-point datasets interactively
- Converts between HDF5, Arrow, Parquet, and CSV formats seamlessly

## Comparison with Similar Tools

- **pandas** — loads everything into memory; Vaex stays out-of-core for datasets that exceed RAM
- **Polars** — fast in-memory engine; Vaex targets datasets larger than available memory via memory mapping
- **Dask** — distributed lazy DataFrames; Vaex runs on a single machine with simpler setup
- **PySpark** — cluster-scale processing; Vaex is a lightweight single-node alternative
- **DuckDB** — SQL-based analytics; Vaex uses a Python DataFrame API with lazy evaluation

## FAQ

**Q: What file format works best with Vaex?**
A: HDF5 gives the fastest memory-mapped access. Apache Arrow and Parquet are also well supported.

**Q: Can Vaex handle joins?**
A: Yes. Vaex supports inner, left, and right joins. For very large joins, ensure the key column is sorted for the best performance.

**Q: Does Vaex work in Jupyter notebooks?**
A: Yes. It integrates with Jupyter and provides interactive histogram and scatter plot widgets for visual exploration.

**Q: Is Vaex still actively maintained?**
A: The core library is stable. Check the repository for recent commits and the changelog for the latest release status.
## Sources

- https://github.com/vaexio/vaex
- https://vaex.io/docs/