Scripts · Apr 16, 2026 · 3 min read

Apache DataFusion — Fast In-Process SQL Query Engine in Rust

An extensible query engine written in Rust that uses Apache Arrow as its in-memory format, enabling fast analytical SQL queries inside any application.

TL;DR
DataFusion embeds a fast SQL query engine in your Rust application using Apache Arrow for columnar in-memory processing.
§01

What it is

Apache DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory columnar format. It provides a SQL interface for running analytical queries directly inside your application, without a separate database server. DataFusion handles query parsing, optimization, and execution with support for Parquet, CSV, JSON, and Avro file formats.

DataFusion targets developers building data-intensive applications in Rust who need SQL capabilities without the overhead of an external database. It suits embedded analytics, data lake query engines, and custom database products.
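Queries do not have to be SQL strings: the same engine is exposed through a DataFrame API. A minimal sketch, assuming a sales.csv file with category and amount columns (file and column names are illustrative):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read the CSV directly into a DataFrame instead of registering a named table.
    let df = ctx.read_csv("sales.csv", CsvReadOptions::new()).await?;

    // Build the query programmatically: filter, project, then sort descending.
    let df = df
        .filter(col("amount").gt(lit(100)))?
        .select(vec![col("category"), col("amount")])?
        .sort(vec![col("amount").sort(false, true)])?;

    df.show().await?;
    Ok(())
}

Each method returns a new DataFrame, so plans compose lazily and nothing executes until show() or collect() is called.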

§02

How it saves time or tokens

This workflow provides the Cargo dependency and a working Rust example that reads a CSV file and runs SQL queries. Instead of setting up a database server for analytical queries, you add the DataFusion crate (plus Tokio for its async runtime) to your Cargo.toml and query data files directly.
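The manifest change is correspondingly small. A sketch of what cargo add produces (version numbers are illustrative; pin to the current releases):

[dependencies]
datafusion = "43"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }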

§03

How to use

  1. Add DataFusion and Tokio to your Cargo.toml (the macros and rt-multi-thread features are what #[tokio::main] needs):
cargo add datafusion
cargo add tokio --features macros,rt-multi-thread
  2. Write a query:
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Create a session and register the CSV file as a queryable table.
    let ctx = SessionContext::new();
    ctx.register_csv("sales", "sales.csv", CsvReadOptions::new()).await?;

    // Run SQL against the registered table.
    let df = ctx.sql("SELECT category, SUM(amount) as total FROM sales GROUP BY category ORDER BY total DESC").await?;
    df.show().await?;
    Ok(())
}
  3. Run your application:
cargo run
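Files are not the only source: data already in memory can be registered as a table through MemTable, built from Arrow arrays. A minimal sketch (table name and values are illustrative):

use std::sync::Arc;

use datafusion::arrow::array::{Int64Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Describe the table shape, then build one RecordBatch of in-memory data.
    let schema = Arc::new(Schema::new(vec![
        Field::new("category", DataType::Utf8, false),
        Field::new("amount", DataType::Int64, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec!["books", "games", "books"])),
            Arc::new(Int64Array::from(vec![10, 25, 5])),
        ],
    )?;

    // A MemTable wraps the batches and acts like any other table provider.
    let table = MemTable::try_new(schema, vec![vec![batch]])?;
    let ctx = SessionContext::new();
    ctx.register_table("sales", Arc::new(table))?;

    let df = ctx.sql("SELECT category, SUM(amount) as total FROM sales GROUP BY category").await?;
    df.show().await?;
    Ok(())
}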
§04

Example

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register a Parquet file as a table
    ctx.register_parquet("logs", "access_logs.parquet", ParquetReadOptions::default()).await?;

    // Run analytical queries
    let top_paths = ctx.sql("
        SELECT path, COUNT(*) as hits, AVG(response_time_ms) as avg_rt
        FROM logs
        WHERE status_code = 200
        GROUP BY path
        ORDER BY hits DESC
        LIMIT 10
    ").await?;

    top_paths.show().await?;
    Ok(())
}
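When the results feed further Rust code instead of a terminal, collect() materializes them as Arrow RecordBatches. A short sketch in the same spirit as the example above:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("logs", "access_logs.parquet", ParquetReadOptions::default()).await?;

    let df = ctx.sql("SELECT path, COUNT(*) as hits FROM logs GROUP BY path").await?;

    // collect() executes the plan and buffers every result batch in memory.
    let batches = df.collect().await?;
    let rows: usize = batches.iter().map(|b| b.num_rows()).sum();
    println!("{} groups across {} record batches", rows, batches.len());
    Ok(())
}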
§05


Common pitfalls

  • DataFusion is an in-process engine, not a database server. It does not persist data or manage transactions. Use it for analytical queries on files or for building custom database products.
  • Large datasets that exceed available memory cause OOM errors. DataFusion streams results but still needs memory for intermediate aggregations.
  • The async API requires a Tokio runtime. Make sure your application uses #[tokio::main] or creates a Tokio runtime explicitly, as sketched below.
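A minimal sketch of the explicit-runtime variant, for applications whose main cannot be async:

use datafusion::prelude::*;

fn main() -> datafusion::error::Result<()> {
    // Build a Tokio runtime by hand and drive DataFusion's async calls with block_on.
    let rt = tokio::runtime::Runtime::new().expect("failed to build Tokio runtime");
    rt.block_on(async {
        let ctx = SessionContext::new();
        ctx.register_csv("sales", "sales.csv", CsvReadOptions::new()).await?;
        ctx.sql("SELECT COUNT(*) FROM sales").await?.show().await?;
        Ok(())
    })
}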

Frequently Asked Questions

How does DataFusion compare to DuckDB?

Both are in-process analytical engines. DataFusion is written in Rust and designed as an embeddable library for building custom query systems. DuckDB is written in C++ and positioned as a standalone analytical database. DataFusion offers more extensibility; DuckDB offers more out-of-the-box features.

What file formats does DataFusion support?

DataFusion natively reads Parquet, CSV, newline-delimited JSON, and Avro files. It also supports registering custom table providers for any data source. The Arrow format enables efficient columnar processing regardless of the source format.
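A sketch that registers one table per built-in format (file names are hypothetical; note that Avro support sits behind DataFusion's avro cargo feature):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Each register_* call maps a file to a SQL table name.
    ctx.register_parquet("events", "events.parquet", ParquetReadOptions::default()).await?;
    ctx.register_csv("users", "users.csv", CsvReadOptions::new()).await?;
    ctx.register_json("clicks", "clicks.ndjson", NdJsonReadOptions::default()).await?;
    // Requires building DataFusion with the non-default `avro` feature.
    ctx.register_avro("orders", "orders.avro", AvroReadOptions::default()).await?;

    // Tables from different formats can be queried (and joined) together.
    ctx.sql("SELECT COUNT(*) FROM events").await?.show().await?;
    Ok(())
}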

Can I use DataFusion from Python?

Yes. The datafusion-python package provides Python bindings. Install with pip install datafusion. The Python API mirrors the Rust API with SessionContext, DataFrame, and SQL execution.

Is DataFusion production-ready?

Yes. DataFusion is an Apache Software Foundation project used in production by companies building data infrastructure. It powers parts of InfluxDB IOx, Comet (Spark accelerator), and other data products.

Does DataFusion support joins?

Yes. DataFusion supports inner, left, right, outer, cross, and semi joins. The query optimizer chooses between hash join and sort-merge join based on data size and available memory.
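A minimal join sketch over two registered CSV files (table and column names are hypothetical):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("orders", "orders.csv", CsvReadOptions::new()).await?;
    ctx.register_csv("customers", "customers.csv", CsvReadOptions::new()).await?;

    // The optimizer picks the physical join strategy; the SQL stays declarative.
    let df = ctx.sql("
        SELECT c.name, COUNT(*) as order_count
        FROM orders o
        JOIN customers c ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY order_count DESC
    ").await?;
    df.show().await?;
    Ok(())
}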
