Scripts · Apr 16, 2026 · 3 min read

Apache DataFusion — Fast In-Process SQL Query Engine in Rust

An extensible query engine written in Rust that uses Apache Arrow as its in-memory format, enabling fast analytical SQL queries inside any application.

TL;DR
DataFusion embeds a fast SQL query engine in your Rust application using Apache Arrow for columnar in-memory processing.
§01

What it is

Apache DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory columnar format. It provides a SQL interface for running analytical queries directly inside your application, without a separate database server. DataFusion handles query parsing, optimization, and execution with support for Parquet, CSV, JSON, and Avro file formats.

DataFusion targets developers building data-intensive applications in Rust who need SQL capabilities without the overhead of an external database. It suits embedded analytics, data lake query engines, and custom database products.
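Queries do not have to be SQL strings: the same engine is exposed through a DataFrame API. A minimal sketch, assuming a sales.csv file with category and amount columns (file and column names are illustrative):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read the CSV directly into a DataFrame instead of registering a named table.
    let df = ctx.read_csv("sales.csv", CsvReadOptions::new()).await?;

    // Build the query programmatically: filter, project, then sort descending.
    let df = df
        .filter(col("amount").gt(lit(100)))?
        .select(vec![col("category"), col("amount")])?
        .sort(vec![col("amount").sort(false, true)])?;

    df.show().await?;
    Ok(())
}

Each method returns a new DataFrame, so plans compose lazily and nothing executes until show() or collect() is called.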

§02

How it saves time or tokens

This workflow provides the Cargo dependency and a working Rust example that reads a CSV file and runs SQL queries. Instead of setting up a database server for analytical queries, you add the DataFusion crate (plus Tokio for its async runtime) to your Cargo.toml and query data files directly.
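The manifest change is correspondingly small. A sketch of what cargo add produces (version numbers are illustrative; pin to the current releases):

[dependencies]
datafusion = "43"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }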

§03

How to use

  1. Add DataFusion and Tokio to your Cargo.toml (the macros and rt-multi-thread features are what #[tokio::main] needs):
cargo add datafusion
cargo add tokio --features macros,rt-multi-thread
  2. Write a query:
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Create a session and register the CSV file as a queryable table.
    let ctx = SessionContext::new();
    ctx.register_csv("sales", "sales.csv", CsvReadOptions::new()).await?;

    // Run SQL against the registered table.
    let df = ctx.sql("SELECT category, SUM(amount) as total FROM sales GROUP BY category ORDER BY total DESC").await?;
    df.show().await?;
    Ok(())
}
  3. Run your application:
cargo run
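Files are not the only source: data already in memory can be registered as a table through MemTable, built from Arrow arrays. A minimal sketch (table name and values are illustrative):

use std::sync::Arc;

use datafusion::arrow::array::{Int64Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Describe the table shape, then build one RecordBatch of in-memory data.
    let schema = Arc::new(Schema::new(vec![
        Field::new("category", DataType::Utf8, false),
        Field::new("amount", DataType::Int64, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec!["books", "games", "books"])),
            Arc::new(Int64Array::from(vec![10, 25, 5])),
        ],
    )?;

    // A MemTable wraps the batches and acts like any other table provider.
    let table = MemTable::try_new(schema, vec![vec![batch]])?;
    let ctx = SessionContext::new();
    ctx.register_table("sales", Arc::new(table))?;

    let df = ctx.sql("SELECT category, SUM(amount) as total FROM sales GROUP BY category").await?;
    df.show().await?;
    Ok(())
}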
§04

Example

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register a Parquet file as a table
    ctx.register_parquet("logs", "access_logs.parquet", ParquetReadOptions::default()).await?;

    // Run analytical queries
    let top_paths = ctx.sql("
        SELECT path, COUNT(*) as hits, AVG(response_time_ms) as avg_rt
        FROM logs
        WHERE status_code = 200
        GROUP BY path
        ORDER BY hits DESC
        LIMIT 10
    ").await?;

    top_paths.show().await?;
    Ok(())
}
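When the results feed further Rust code instead of a terminal, collect() materializes them as Arrow RecordBatches. A short sketch in the same spirit as the example above:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("logs", "access_logs.parquet", ParquetReadOptions::default()).await?;

    let df = ctx.sql("SELECT path, COUNT(*) as hits FROM logs GROUP BY path").await?;

    // collect() executes the plan and buffers every result batch in memory.
    let batches = df.collect().await?;
    let rows: usize = batches.iter().map(|b| b.num_rows()).sum();
    println!("{} groups across {} record batches", rows, batches.len());
    Ok(())
}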
§05


Common pitfalls

  • DataFusion is an in-process engine, not a database server. It does not persist data or manage transactions. Use it for analytical queries on files or for building custom database products.
  • Large datasets that exceed available memory cause OOM errors. DataFusion streams results but still needs memory for intermediate aggregations.
  • The async API requires a Tokio runtime. Make sure your application uses #[tokio::main] or creates a Tokio runtime explicitly, as sketched below.
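A minimal sketch of the explicit-runtime variant, for applications whose main cannot be async:

use datafusion::prelude::*;

fn main() -> datafusion::error::Result<()> {
    // Build a Tokio runtime by hand and drive DataFusion's async calls with block_on.
    let rt = tokio::runtime::Runtime::new().expect("failed to build Tokio runtime");
    rt.block_on(async {
        let ctx = SessionContext::new();
        ctx.register_csv("sales", "sales.csv", CsvReadOptions::new()).await?;
        ctx.sql("SELECT COUNT(*) FROM sales").await?.show().await?;
        Ok(())
    })
}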

Frequently Asked Questions

How does DataFusion compare to DuckDB?

Both are in-process analytical engines. DataFusion is written in Rust and designed as an embeddable library for building custom query systems. DuckDB is written in C++ and positioned as a standalone analytical database. DataFusion offers more extensibility; DuckDB offers more out-of-the-box features.

What file formats does DataFusion support?

DataFusion natively reads Parquet, CSV, newline-delimited JSON, and Avro files. It also supports registering custom table providers for any data source. The Arrow format enables efficient columnar processing regardless of the source format.
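A sketch that registers one table per built-in format (file names are hypothetical; note that Avro support sits behind DataFusion's avro cargo feature):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Each register_* call maps a file to a SQL table name.
    ctx.register_parquet("events", "events.parquet", ParquetReadOptions::default()).await?;
    ctx.register_csv("users", "users.csv", CsvReadOptions::new()).await?;
    ctx.register_json("clicks", "clicks.ndjson", NdJsonReadOptions::default()).await?;
    // Requires building DataFusion with the non-default `avro` feature.
    ctx.register_avro("orders", "orders.avro", AvroReadOptions::default()).await?;

    // Tables from different formats can be queried (and joined) together.
    ctx.sql("SELECT COUNT(*) FROM events").await?.show().await?;
    Ok(())
}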

Can I use DataFusion from Python?

Yes. The datafusion-python package provides Python bindings. Install with pip install datafusion. The Python API mirrors the Rust API with SessionContext, DataFrame, and SQL execution.

Is DataFusion production-ready?

Yes. DataFusion is an Apache Software Foundation project used in production by companies building data infrastructure. It powers parts of InfluxDB IOx, Comet (Spark accelerator), and other data products.

Does DataFusion support joins?

Yes. DataFusion supports inner, left, right, outer, cross, and semi joins. The query optimizer chooses between hash join and sort-merge join based on data size and available memory.
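A minimal join sketch over two registered CSV files (table and column names are hypothetical):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("orders", "orders.csv", CsvReadOptions::new()).await?;
    ctx.register_csv("customers", "customers.csv", CsvReadOptions::new()).await?;

    // The optimizer picks the physical join strategy; the SQL stays declarative.
    let df = ctx.sql("
        SELECT c.name, COUNT(*) as order_count
        FROM orders o
        JOIN customers c ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY order_count DESC
    ").await?;
    df.show().await?;
    Ok(())
}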
