DuckDB — Fast In-Process Analytical SQL Database
DuckDB is a lightning-fast, in-process analytical database. Query CSV, Parquet, and JSON files with SQL — SQLite for analytics, zero setup, embedded in your application.
What it is
DuckDB is an in-process analytical database designed for fast SQL queries over local data files. It reads CSV, Parquet, JSON, and Excel files directly without loading them into a separate database server. DuckDB embeds into your application like SQLite but is optimized for analytical queries (aggregations, joins, window functions) rather than transactional workloads.
It targets data scientists, analysts, and developers who work with local data files and want SQL query capabilities without setting up a database server. DuckDB runs in Python, R, Node.js, Java, Go, and as a standalone CLI.
How it saves time or tokens
DuckDB eliminates the load-transform-query cycle. Instead of importing a CSV into PostgreSQL to run SQL queries, query the CSV directly: `SELECT * FROM 'data.csv' WHERE amount > 100`. No schema definition, no import step, no server to start.
For data science workflows, DuckDB integrates with pandas DataFrames and Apache Arrow. Query DataFrames with SQL and get results back as DataFrames. This bridges the gap between SQL and Python data manipulation.
How to use
- Install DuckDB: `pip install duckdb` for Python, `brew install duckdb` for the CLI, or add the dependency to your project.
- Query files directly with SQL: `duckdb -c "SELECT * FROM 'sales.parquet' LIMIT 10"`. No database creation or table definition needed.
- For persistent storage, create a database file: `duckdb mydata.db`. Tables and indexes persist between sessions.
Example
import duckdb

# Query a CSV file directly
result = duckdb.sql("""
    SELECT
        category,
        count(*) as orders,
        sum(amount) as total_revenue,
        avg(amount) as avg_order
    FROM 'orders.csv'
    WHERE order_date >= '2026-01-01'
    GROUP BY category
    ORDER BY total_revenue DESC
""").fetchdf()  # Returns a pandas DataFrame

# Query a Parquet file from S3
result = duckdb.sql("""
    SELECT * FROM read_parquet('s3://my-bucket/data/*.parquet')
    WHERE region = 'US'
""")

# Query a pandas DataFrame with SQL
import pandas as pd

df = pd.read_csv('users.csv')
duckdb.sql("SELECT * FROM df WHERE age > 25").show()
Related on TokRepo
- AI tools for database — Database and analytical tools
- AI tools for research — Data analysis and research tools
Common pitfalls
- Trying to use DuckDB for concurrent transactional workloads. DuckDB supports a single writer at a time and is optimized for analytical queries. For multi-user OLTP workloads, use PostgreSQL or SQLite.
- Not leveraging Parquet format for large datasets. DuckDB queries Parquet files much faster than CSV because Parquet is columnar and compressed. Convert large CSV files to Parquet for repeated analysis.
- Assuming DuckDB needs a server. DuckDB runs entirely in-process. There is no server to start, no port to configure, and no connection string. It loads as a library in your application or runs as a CLI tool.
Frequently Asked Questions
How is DuckDB different from SQLite?
SQLite is optimized for transactional (OLTP) workloads with many small reads and writes. DuckDB is optimized for analytical (OLAP) workloads with complex aggregations over large datasets. Both are embedded (no server) and store data in a single file. Choose DuckDB for data analysis and SQLite for application state.
Can DuckDB query remote files on S3 or over HTTP?
Yes. DuckDB queries files from S3, GCS, Azure Blob Storage, and HTTP URLs directly. Install the httpfs extension and query remote Parquet or CSV files without downloading them first. DuckDB uses predicate pushdown and column pruning to minimize data transfer.
Can DuckDB handle datasets larger than memory?
DuckDB handles datasets larger than available RAM using disk-based spilling. It processes data in chunks, so datasets of hundreds of gigabytes work on machines with modest memory. For truly massive datasets (terabytes), use ClickHouse or a distributed query engine.
Does DuckDB support extensions?
Yes. DuckDB has extensions for spatial queries (PostGIS-compatible), full-text search, JSON, Excel, Parquet, Iceberg, Delta Lake, httpfs (remote file access), and more. Install extensions with `INSTALL extension_name; LOAD extension_name;`.
Does DuckDB work with pandas?
Yes. DuckDB integrates tightly with pandas. Query DataFrames directly with SQL (just reference the variable name in the FROM clause), and get results back as DataFrames with `.fetchdf()`. This enables SQL for complex transformations while staying in the pandas ecosystem.
Citations (3)
- DuckDB GitHub Repository — DuckDB is an in-process analytical database that queries CSV, Parquet, and JSON …
- DuckDB Python API Documentation — DuckDB integrates with pandas DataFrames and Apache Arrow for Python data scienc…
- DuckDB Extensions Documentation — DuckDB supports extensions for spatial, full-text search, and remote file access