Introduction
Vaex is a Python DataFrame library designed for datasets that do not fit in memory. It uses memory-mapped files and lazy evaluation to compute statistics, filter, and transform billions of rows on a laptop without loading the full dataset into RAM.
What Vaex Does
- Opens multi-gigabyte HDF5, Arrow, Parquet, and CSV files without loading them into memory
- Computes aggregations like mean, sum, count, and histograms using streaming evaluation
- Applies virtual columns and expressions that are evaluated lazily on access
- Filters and groups data with a pandas-like API that stays out-of-core
- Renders interactive visualizations of billion-point datasets in Jupyter notebooks
Architecture Overview
Vaex memory-maps data files so the OS pages data in and out as needed. Expressions form a computation graph that is evaluated lazily in chunks using C++ and NumPy. Aggregations process data in fixed-size blocks, accumulating results without materializing intermediate arrays. Custom expressions can be JIT-compiled with Numba (which is LLVM-based), Pythran, or CUDA. The result is near-instant opening of large memory-mappable files and memory usage that stays roughly constant regardless of dataset size.
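The idea of memory mapping combined with block-wise aggregation can be illustrated without Vaex. The sketch below uses plain NumPy and is not Vaex's actual implementation; the function name chunked_mean, the flat binary file layout, and the chunk_rows parameter are assumptions made for the example.

```python
import os
import tempfile

import numpy as np


def chunked_mean(path, dtype=np.float64, chunk_rows=1_000_000):
    """Mean of a flat binary column, touching one block at a time."""
    # np.memmap maps the file into virtual memory; the OS loads pages on access.
    column = np.memmap(path, dtype=dtype, mode="r")
    total = 0.0
    count = 0
    for start in range(0, column.shape[0], chunk_rows):
        block = column[start:start + chunk_rows]  # a view, not a copy
        total += float(block.sum())               # accumulate a running sum
        count += block.shape[0]
    return total / count if count else float("nan")


# Write a small demo column to disk, then aggregate it in blocks of 1,000 rows.
demo_path = os.path.join(tempfile.mkdtemp(), "column.f64")
np.arange(10_000, dtype=np.float64).tofile(demo_path)

result = chunked_mean(demo_path, chunk_rows=1_000)
print(result)
```

Only the running sum and count persist between blocks, so peak memory depends on chunk_rows rather than on the file size.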
Self-Hosting & Configuration
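- Install via pip; vaex-core is the base and vaex-hdf5 adds HDF5 support (Arrow support ships in vaex-core as of version 4, so a separate vaex-arrow install is only relevant to older releases)
- Convert CSV to HDF5 with vaex.from_csv(..., convert=True) for optimal memory-mapped performance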
- Use vaex.open() to lazily open HDF5, Parquet, or Arrow files
- Pass chunk_size to chunked calls such as vaex.from_csv(), df.to_pandas_df(), or df.evaluate_iterator() to control memory-speed trade-offs
- Deploy the vaex-server component for remote DataFrame access over HTTP
Key Features
- Opens a 1 TB HDF5 file in under a second with almost no memory overhead, because only metadata is read up front
- Lazy evaluation means no computation runs until results are explicitly requested
- String operations run in C++ for fast text processing on large columns
- Built-in plotting renders density maps of billion-point datasets interactively
- Converts between HDF5, Arrow, Parquet, and CSV formats seamlessly
Comparison with Similar Tools
- pandas — loads everything into memory; Vaex stays out-of-core for datasets that exceed RAM
- Polars — fast in-memory engine; Vaex targets datasets larger than available memory via memory mapping
- Dask — distributed lazy DataFrames; Vaex runs on a single machine with simpler setup
- PySpark — cluster-scale processing; Vaex is a lightweight single-node solution
- DuckDB — SQL-based analytics; Vaex uses a Python DataFrame API with lazy evaluation
FAQ
Q: What file format works best with Vaex? A: HDF5 gives the fastest memory-mapped access. Apache Arrow and Parquet are also well supported.
Q: Can Vaex handle joins? A: Yes. Vaex supports inner, left, and right joins. For very large joins, ensure the key column is sorted for optimal performance.
Q: Does Vaex work in Jupyter notebooks? A: Yes. It integrates with Jupyter and provides interactive histogram and scatter plot widgets for visual exploration.
Q: Is Vaex still actively maintained? A: The core library is stable. Check the repository for recent commits and the changelog for the latest release status.