Scripts · May 10, 2026 · 3 min read

Vaex — Out-of-Core DataFrames for Billion-Row Datasets

Vaex is a Python library for lazy, out-of-core DataFrames. It processes billion-row datasets on a single machine using memory-mapped files, lazily evaluated expressions, and just-in-time compilation.

Introduction

Vaex is a Python DataFrame library designed for datasets that do not fit in memory. It uses memory-mapped files and lazy evaluation to compute statistics, filter, and transform billions of rows on a laptop without loading the full dataset into RAM.

What Vaex Does

  • Opens multi-gigabyte HDF5, Arrow, Parquet, and CSV files without loading them into memory
  • Computes aggregations like mean, sum, count, and histograms using streaming evaluation
  • Applies virtual columns and expressions that are evaluated lazily on access
  • Filters and groups data with a pandas-like API that stays out-of-core
  • Renders interactive visualizations of billion-point datasets in Jupyter notebooks

Architecture Overview

Vaex memory-maps data files so the OS pages data in and out as needed. Expressions form a computation graph that is evaluated lazily, chunk by chunk, in C++ and NumPy. Aggregations process data in blocks and accumulate results without materializing full intermediate arrays. Custom expressions can be JIT-compiled with Numba (which builds on LLVM), Pythran, or CUDA. The result is near-instant opening of large memory-mappable files and memory usage that stays roughly constant as the dataset grows.
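
The block-wise idea can be illustrated without Vaex at all. The following is a simplified sketch in plain NumPy, not Vaex's actual implementation: it memory-maps a raw binary column and accumulates a mean block by block, so only one block's worth of pages needs to be resident at a time. The function name chunked_mean, the block_rows parameter, and the raw file layout are assumptions made for the example.

```python
import os
import tempfile

import numpy as np


def chunked_mean(path, dtype=np.float64, block_rows=1_000_000):
    """Mean of a raw binary column, read block by block via a memory map."""
    data = np.memmap(path, dtype=dtype, mode="r")  # no bulk read happens here
    total = 0.0
    count = 0
    for start in range(0, data.shape[0], block_rows):
        block = data[start:start + block_rows]  # the OS pages this slice in on access
        total += float(block.sum())             # accumulate, keep no intermediates
        count += block.shape[0]
    return total / count if count else float("nan")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")
    np.arange(1000, dtype=np.float64).tofile(path)
    result = chunked_mean(path, block_rows=128)
```

Only the running total and count survive between blocks, which is why peak memory depends on the block size rather than on the file size.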

Self-Hosting & Configuration

  • Install via pip; vaex-core is the base package and vaex-hdf5 adds HDF5 support (Arrow and Parquet support is part of vaex-core as of Vaex 4)
  • Convert CSV to HDF5 with vaex.from_csv(path, convert=True) for optimal memory-mapped performance
  • Use vaex.open() to lazily open HDF5, Parquet, or Arrow files
  • Pass chunk_size to vaex.from_csv() and to export or conversion methods such as to_pandas_df() to control the memory-speed trade-off
  • Deploy the vaex-server component for remote DataFrame access over HTTP

Key Features

  • Opens a memory-mapped 1 TB HDF5 file almost instantly, because only metadata is read up front and data is paged in on demand
  • Lazy evaluation means no computation runs until results are explicitly requested
  • String operations run in C++ for fast text processing on large columns
  • Built-in plotting renders density maps of billion-point datasets interactively
  • Converts between HDF5, Arrow, Parquet, and CSV formats seamlessly

Comparison with Similar Tools

  • pandas — loads everything into memory; Vaex stays out-of-core for datasets that exceed RAM
  • Polars — fast in-memory engine; Vaex targets datasets larger than available memory via memory mapping
  • Dask — distributed lazy DataFrames; Vaex runs on a single machine with simpler setup
  • PySpark — cluster-scale processing; Vaex is a lightweight single-node solution
  • DuckDB — SQL-based analytics; Vaex uses a Python DataFrame API with lazy evaluation

FAQ

Q: What file format works best with Vaex? A: HDF5 gives the fastest memory-mapped access. Apache Arrow and Parquet are also well supported.

Q: Can Vaex handle joins? A: Yes. Vaex supports inner, left, and right joins. For very large joins, ensure the key column is sorted for optimal performance.

Q: Does Vaex work in Jupyter notebooks? A: Yes. It integrates with Jupyter and provides interactive histogram and scatter plot widgets for visual exploration.

Q: Is Vaex still actively maintained? A: The core library is stable. Check the repository for recent commits and the changelog for the latest release status.
