# Vaex — Out-of-Core DataFrames for Billion-Row Datasets

> Vaex is a Python library for lazy, out-of-core DataFrames that processes billion-row datasets on a single machine using memory-mapped files, an expression system, and just-in-time compilation.

## Install

```bash
pip install vaex
```

## Quick Use

```python
import vaex

df = vaex.open('large_dataset.hdf5')
print(df.count())
print(df.mean(df.column_a))
filtered = df[df.column_b > 100]
filtered.export_hdf5('filtered.hdf5')
```

## Introduction

Vaex is a Python DataFrame library designed for datasets that do not fit in memory. It uses memory-mapped files and lazy evaluation to compute statistics, filter, and transform billions of rows on a laptop without loading the full dataset into RAM.

## What Vaex Does

- Opens multi-gigabyte HDF5, Arrow, Parquet, and CSV files without loading them into memory
- Computes aggregations such as mean, sum, count, and histograms using streaming evaluation
- Applies virtual columns and expressions that are evaluated lazily on access
- Filters and groups data with a pandas-like API that stays out-of-core
- Renders interactive visualizations of billion-point datasets in Jupyter notebooks

## Architecture Overview

Vaex memory-maps data files so the OS pages data in and out as needed. Expressions form a computation graph that is evaluated lazily in chunks using C++ and NumPy. Aggregations process data in fixed-size blocks, accumulating results without materializing intermediate arrays. JIT compilation via Pythran or LLVM accelerates custom expressions. The result is near-instant opening of large files and memory usage that stays roughly constant regardless of dataset size.
## Self-Hosting & Configuration

- Install via pip; `vaex-core` is the base package, and `vaex-hdf5` and `vaex-arrow` add format support
- Convert CSV to HDF5 with `vaex.from_csv()` for optimal memory-mapped performance
- Use `vaex.open()` to lazily open HDF5, Parquet, or Arrow files
- Set `chunk_size` on aggregation calls to control the memory/speed trade-off
- Deploy the vaex-server component for remote DataFrame access over HTTP

## Key Features

- Opens a 1 TB HDF5 file in under a second with negligible memory overhead, since data is memory-mapped rather than read up front
- Lazy evaluation means no computation runs until results are explicitly requested
- String operations run in C++ for fast text processing on large columns
- Built-in plotting renders density maps of billion-point datasets interactively
- Converts between HDF5, Arrow, Parquet, and CSV formats seamlessly

## Comparison with Similar Tools

- **pandas** — loads everything into memory; Vaex stays out-of-core for datasets that exceed RAM
- **Polars** — fast in-memory engine; Vaex targets datasets larger than available memory via memory mapping
- **Dask** — distributed lazy DataFrames; Vaex runs on a single machine with simpler setup
- **PySpark** — cluster-scale processing; Vaex is a lightweight single-node alternative
- **DuckDB** — SQL-based analytics; Vaex uses a Python DataFrame API with lazy evaluation

## FAQ

**Q: What file format works best with Vaex?**
A: HDF5 gives the fastest memory-mapped access. Apache Arrow and Parquet are also well supported.

**Q: Can Vaex handle joins?**
A: Yes. Vaex supports inner, left, and right joins. For very large joins, ensure the key column is sorted for the best performance.

**Q: Does Vaex work in Jupyter notebooks?**
A: Yes. It integrates with Jupyter and provides interactive histogram and scatter plot widgets for visual exploration.

**Q: Is Vaex still actively maintained?**
A: The core library is stable. Check the repository for recent commits and the changelog for the latest release status.
## Sources

- https://github.com/vaexio/vaex
- https://vaex.io/docs/