May 10, 2026 · 3 min read

Vaex — Out-of-Core DataFrames for Billion-Row Datasets

Vaex is a Python library for lazy, out-of-core DataFrames. It processes billion-row datasets on a single machine using memory-mapped files, lazily evaluated expressions, and optional just-in-time compilation.

Introduction

Vaex is a Python DataFrame library designed for datasets that do not fit in memory. It uses memory-mapped files and lazy evaluation to compute statistics, filter, and transform billions of rows on a laptop without loading the full dataset into RAM.

What Vaex Does

  • Opens multi-gigabyte HDF5, Arrow, Parquet, and CSV files without loading them into memory
  • Computes aggregations like mean, sum, count, and histograms using streaming evaluation
  • Applies virtual columns and expressions that are evaluated lazily on access
  • Filters and groups data with a pandas-like API that stays out-of-core
  • Renders interactive visualizations of billion-point datasets in Jupyter notebooks

Architecture Overview

Vaex memory-maps data files so the OS pages data in and out as needed. Expressions form a computation graph that is evaluated lazily in chunks using C++ and NumPy. Aggregations process data in fixed-size blocks, accumulating results without materializing intermediate arrays. Custom expressions can optionally be JIT-compiled with Numba (which is LLVM-based), Pythran, or CUDA. The result is near-instant opening of memory-mappable files and memory usage that stays roughly bounded by the chunk size rather than growing with the dataset.
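
The idea of memory-mapping plus block-wise accumulation can be illustrated with the Python standard library alone. The sketch below is not Vaex's implementation (which is written in C++); the function name `chunked_mean`, the raw float64 file layout, and the `chunk_items` parameter are all assumptions made for the illustration.

```python
import mmap
import os
import struct
import tempfile


def chunked_mean(path, chunk_items=1024):
    """Mean of little-endian float64 values in a raw binary file.

    The file is memory-mapped and decoded one fixed-size block at a time,
    so only a running sum and count are kept between blocks.
    """
    item_size = 8  # bytes per float64
    total = 0.0
    count = 0
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n_items = len(mm) // item_size
        for start in range(0, n_items, chunk_items):
            stop = min(start + chunk_items, n_items)
            # Decode just this block; the OS pages the bytes in on demand.
            block = struct.unpack_from(f"<{stop - start}d", mm, start * item_size)
            total += sum(block)
            count += len(block)
    return total / count if count else float("nan")


# Demo: write 10,000 floats to a temporary file and aggregate them in blocks.
values = [float(i) for i in range(10_000)]
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(struct.pack(f"<{len(values)}d", *values))
    demo_path = tmp.name

demo_mean = chunked_mean(demo_path, chunk_items=1000)
os.remove(demo_path)
print(demo_mean)
```

Peak memory here depends on `chunk_items`, not on the file size, which is the property the architecture relies on.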

Self-Hosting & Configuration

  • Install via pip; vaex-core is the base, and add-on packages such as vaex-hdf5 add format support (Arrow support shipped as a separate vaex-arrow package in older releases)
  • Convert CSV to HDF5 with vaex.from_csv(path, convert=True) for optimal memory-mapped performance
  • Use vaex.open() to lazily open HDF5, Parquet, or Arrow files
  • Pass chunk_size when reading or converting CSV (for example to vaex.from_csv()) to control the memory-speed trade-off
  • Deploy the vaex-server component for remote DataFrame access over HTTP

Key Features

  • Opens very large memory-mappable files (HDF5 or Arrow) almost instantly with negligible memory overhead, because opening maps the file instead of reading its data
  • Lazy evaluation means no computation runs until results are explicitly requested
  • String operations run in C++ for fast text processing on large columns
  • Built-in plotting renders density maps of billion-point datasets interactively
  • Converts between HDF5, Arrow, Parquet, and CSV formats seamlessly

Comparison with Similar Tools

  • pandas — loads everything into memory; Vaex stays out-of-core for datasets that exceed RAM
  • Polars — fast in-memory engine; Vaex targets datasets larger than available memory via memory mapping
  • Dask — distributed lazy DataFrames; Vaex runs on a single machine with simpler setup
  • PySpark — cluster-scale processing; Vaex is a lightweight single-node solution
  • DuckDB — SQL-based analytics; Vaex uses a Python DataFrame API with lazy evaluation

FAQ

Q: What file format works best with Vaex? A: HDF5 gives the fastest memory-mapped access. Apache Arrow and Parquet are also well supported.

Q: Can Vaex handle joins? A: Yes. Vaex supports inner, left, and right joins. For very large joins, ensure the key column is sorted for optimal performance.

Q: Does Vaex work in Jupyter notebooks? A: Yes. It integrates with Jupyter and provides interactive histogram and scatter plot widgets for visual exploration.

Q: Is Vaex still actively maintained? A: The core library is stable. Check the repository for recent commits and the changelog for the latest release status.
