Apache DataFusion — Fast In-Process SQL Query Engine in Rust
An extensible query engine written in Rust that uses Apache Arrow as its in-memory format, enabling fast analytical SQL queries embeddable in any application.
What it is
Apache DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory columnar format. It provides a SQL interface for running analytical queries directly inside your application, without a separate database server. DataFusion handles query parsing, optimization, and execution with support for Parquet, CSV, JSON, and Avro file formats.
DataFusion targets developers building data-intensive applications in Rust who need SQL capabilities without the overhead of an external database. It suits embedded analytics, data lake query engines, and custom database products.
How it saves time or tokens
This workflow provides the Cargo dependency and a working Rust example that reads a CSV file and runs SQL queries. Instead of setting up a database server for analytical queries, you add a single crate to your Cargo.toml and query data files directly.
How to use
- Add DataFusion to your Cargo.toml:
cargo add datafusion
cargo add tokio --features rt-multi-thread,macros
- Write a query:
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("sales", "sales.csv", CsvReadOptions::new()).await?;
    let df = ctx
        .sql("SELECT category, SUM(amount) AS total FROM sales GROUP BY category ORDER BY total DESC")
        .await?;
    df.show().await?;
    Ok(())
}
- Run your application:
cargo run
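The query in the second step assumes a sales.csv file with category and amount columns in the working directory. For a first run, a small file can be generated with the standard library alone. This is a minimal sketch: the write_sample_sales helper and the rows it writes are hypothetical sample data, not part of DataFusion.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Writes a small, made-up sales file with the two columns the query uses.
/// The rows are illustrative sample data, not real figures.
fn write_sample_sales(path: &Path) -> io::Result<()> {
    let rows = [
        ("books", 12.50),
        ("games", 59.99),
        ("books", 7.25),
        ("music", 9.99),
    ];
    // The first line is a header; CsvReadOptions::new() reads it as column names.
    let mut csv = String::from("category,amount\n");
    for (category, amount) in rows {
        csv.push_str(&format!("{category},{amount:.2}\n"));
    }
    fs::write(path, csv)
}

fn main() -> io::Result<()> {
    write_sample_sales(Path::new("sales.csv"))
}
```

Run this once before the query example so that register_csv finds the file.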
Example
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register a Parquet file as a table
    ctx.register_parquet("logs", "access_logs.parquet", ParquetReadOptions::default()).await?;

    // Run analytical queries
    let top_paths = ctx.sql("
        SELECT path, COUNT(*) AS hits, AVG(response_time_ms) AS avg_rt
        FROM logs
        WHERE status_code = 200
        GROUP BY path
        ORDER BY hits DESC
        LIMIT 10
    ").await?;
    top_paths.show().await?;
    Ok(())
}
Common pitfalls
- DataFusion is an in-process engine, not a database server. It does not persist data or manage transactions. Use it for analytical queries on files or for building custom database products.
- Large datasets that exceed available memory can cause out-of-memory (OOM) errors. DataFusion streams results in batches, but operators such as aggregations, sorts, and joins still hold intermediate state in memory.
- The async API requires a Tokio runtime. Make sure your application uses #[tokio::main] or creates a Tokio runtime explicitly.
Frequently Asked Questions
How does DataFusion compare to DuckDB?
Both are in-process analytical engines. DataFusion is written in Rust and designed as an embeddable library for building custom query systems. DuckDB is written in C++ and positioned as a standalone analytical database. DataFusion offers more extensibility; DuckDB offers more out-of-the-box features.
Which data formats does DataFusion support?
DataFusion natively reads Parquet, CSV, JSON, and Avro files. It also supports registering custom table providers for any data source. The Arrow format enables efficient columnar processing regardless of the source format.
Can I use DataFusion from Python?
Yes. The datafusion-python package provides Python bindings. Install with pip install datafusion. The Python API mirrors the Rust API with SessionContext, DataFrame, and SQL execution.
Is DataFusion production-ready?
Yes. DataFusion is an Apache Software Foundation project used in production by companies building data infrastructure. It powers parts of InfluxDB IOx, Comet (a Spark accelerator), and other data products.
Does DataFusion support joins?
Yes. DataFusion supports inner, left, right, full outer, cross, semi, and anti joins. The planner uses hash joins by default. Sort-merge joins are also implemented and can be preferred through the datafusion.optimizer.prefer_hash_join configuration setting.