Scripts · Apr 10, 2026 · 3 min read

DuckDB — Fast In-Process Analytical SQL Database

DuckDB is a lightning-fast, in-process analytical database. Query CSV, Parquet, and JSON files with SQL — SQLite for analytics, zero setup, embedded in your application.

TL;DR
DuckDB is an in-process analytical SQL database that queries CSV, Parquet, and JSON files directly with zero setup and no server.
§01

What it is

DuckDB is an in-process analytical database designed for fast SQL queries over local data files. It reads CSV, Parquet, JSON, and Excel files directly without loading them into a separate database server. DuckDB embeds into your application like SQLite but is optimized for analytical queries (aggregations, joins, window functions) rather than transactional workloads.

It targets data scientists, analysts, and developers who work with local data files and want SQL query capabilities without setting up a database server. DuckDB runs in Python, R, Node.js, Java, Go, and as a standalone CLI.

§02

How it saves time or tokens

DuckDB eliminates the load-transform-query cycle. Instead of importing a CSV into PostgreSQL to run SQL queries, query the CSV directly: `SELECT * FROM 'data.csv' WHERE amount > 100`. No schema definition, no import step, no server to start.

For data science workflows, DuckDB integrates with pandas DataFrames and Apache Arrow. Query DataFrames with SQL and get results back as DataFrames. This bridges the gap between SQL and Python data manipulation.

§03

How to use

  1. Install DuckDB: `pip install duckdb` for Python, `brew install duckdb` for the CLI, or add the dependency to your project.
  2. Query files directly with SQL: `duckdb -c "SELECT * FROM 'sales.parquet' LIMIT 10"`. No database creation or table definition needed.
  3. For persistent storage, create a database file: `duckdb mydata.db`. Tables and indexes persist between sessions.
§04

Example

import duckdb

# Query a CSV file directly
result = duckdb.sql("""
    SELECT 
        category,
        count(*) as orders,
        sum(amount) as total_revenue,
        avg(amount) as avg_order
    FROM 'orders.csv'
    WHERE order_date >= '2026-01-01'
    GROUP BY category
    ORDER BY total_revenue DESC
""").fetchdf()  # Returns a pandas DataFrame

# Query a Parquet file from S3
result = duckdb.sql("""
    SELECT * FROM read_parquet('s3://my-bucket/data/*.parquet')
    WHERE region = 'US'
""")

# Query a pandas DataFrame with SQL
import pandas as pd
df = pd.read_csv('users.csv')
duckdb.sql("SELECT * FROM df WHERE age > 25").show()
§05

Common pitfalls

  • Trying to use DuckDB for concurrent transactional workloads. DuckDB supports a single writer at a time and is optimized for analytical queries. For multi-user OLTP workloads, use PostgreSQL or SQLite.
  • Not leveraging Parquet format for large datasets. DuckDB queries Parquet files much faster than CSV because Parquet is columnar and compressed. Convert large CSV files to Parquet for repeated analysis.
  • Assuming DuckDB needs a server. DuckDB runs entirely in-process. There is no server to start, no port to configure, and no connection string. It loads as a library in your application or runs as a CLI tool.

Frequently Asked Questions

How does DuckDB compare to SQLite?

SQLite is optimized for transactional (OLTP) workloads with many small reads and writes. DuckDB is optimized for analytical (OLAP) workloads with complex aggregations over large datasets. Both are embedded (no server) and store data in a single file. Choose DuckDB for data analysis and SQLite for application state.

Can DuckDB query remote files?

Yes. DuckDB queries files from S3, GCS, Azure Blob Storage, and HTTP URLs directly. Install the httpfs extension and query remote Parquet or CSV files without downloading them first. DuckDB uses predicate pushdown and column pruning to minimize data transfer.

How large a dataset can DuckDB handle?

DuckDB handles datasets larger than available RAM using disk-based spilling. It processes data in chunks, so datasets of hundreds of gigabytes work on machines with modest memory. For truly massive datasets (terabytes), use ClickHouse or a distributed query engine.

Does DuckDB support extensions?

Yes. DuckDB has extensions for spatial queries (PostGIS-compatible), full-text search, JSON, Excel, Parquet, Iceberg, Delta Lake, httpfs (remote file access), and more. Install extensions with `INSTALL extension_name; LOAD extension_name;`.

Can I use DuckDB with pandas?

Yes. DuckDB integrates tightly with pandas. Query DataFrames directly with SQL (just reference the variable name in the FROM clause), and get results back as DataFrames with .fetchdf(). This enables SQL for complex transformations while staying in the pandas ecosystem.
