Apr 10, 2026

DuckDB — Fast In-Process Analytical SQL Database

DuckDB is a lightning-fast, in-process analytical database. Query CSV, Parquet, and JSON files with SQL — SQLite for analytics, zero setup, embedded in your application.

Script Depot · Community
Quick Use

Use it first, then decide how deep to go

Copy, install, and run these commands first — everything below is optional depth.

# Install CLI
brew install duckdb            # macOS
curl https://install.duckdb.org | sh  # Linux

# Start interactive shell
duckdb

# Query a CSV directly — no import needed
SELECT * FROM 'data.csv' LIMIT 10;
SELECT COUNT(*) FROM 'https://example.com/data.parquet';
Intro

DuckDB is an open-source, in-process SQL database designed for analytical queries. Think of it as "SQLite for analytics" — a single binary that embeds directly into your application, requires zero configuration, and queries data files (CSV, Parquet, JSON) directly at lightning speed.

With 37.3K+ GitHub stars and an MIT license, DuckDB has revolutionized data analysis workflows by bringing database-class performance to laptops and scripts. It's used by data scientists, analysts, and developers for local analytics, ETL, and data exploration.

What DuckDB Does

  • In-Process: Embedded in your application like SQLite (no server)
  • Columnar: Column-oriented storage for fast analytical queries
  • Query Files Directly: Read CSV, Parquet, JSON, Excel without importing
  • SQL: Full ANSI SQL + modern extensions
  • Zero Setup: Single binary, no configuration, no daemon
  • Python/R Integration: First-class DataFrame (Pandas, Polars, Arrow) support
  • Vectorized Execution: Process data in batches for CPU cache efficiency
  • Extensions: Geospatial, HTTP, S3, full-text search, and more
  • ACID: Full ACID transactions with MVCC
  • Cross-Platform: Runs on macOS, Linux, Windows, WASM (browser!)

Architecture

┌─────────────────────────────────────┐
│          Your Application           │
│  ┌─────────────┐  ┌─────────────┐   │
│  │   Python    │  │   Node.js   │   │
│  │   duckdb    │  │   duckdb    │   │
│  │  .execute() │  │    .all()   │   │
│  └──────┬──────┘  └──────┬──────┘   │
│         │                │          │
│         ▼                ▼          │
│  ┌───────────────────────────────┐  │
│  │  DuckDB Engine (embedded)     │  │
│  │   - Query Optimizer           │  │
│  │   - Vectorized Executor       │  │
│  │   - Columnar Storage          │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
          │
          ▼
    File System (CSV, Parquet, DuckDB file)

Basic Usage

CLI

-- Query a CSV file directly
SELECT
    customer_id,
    SUM(amount) AS total
FROM 'orders.csv'
WHERE date >= '2024-01-01'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;

-- Query a Parquet file
SELECT * FROM 's3://bucket/data/year=2024/*.parquet' LIMIT 100;

-- Query JSON
SELECT * FROM read_json_auto('events.json');

-- Create persistent database
.open mydb.duckdb

-- Create table from query
CREATE TABLE sales AS
SELECT * FROM 'sales.csv';

-- Query the table
SELECT region, SUM(revenue) FROM sales GROUP BY region;

Python

import duckdb

# In-memory database
con = duckdb.connect()

# Query a CSV directly
result = con.execute("""
    SELECT category, SUM(price) as total
    FROM 'products.csv'
    GROUP BY category
    ORDER BY total DESC
""").fetchall()

# Query a Pandas DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
result = con.execute("SELECT * FROM df WHERE value > 100").df()

# Join multiple files
con.execute("""
    SELECT o.order_id, c.customer_name, p.product_name
    FROM 'orders.csv' o
    JOIN 'customers.parquet' c ON o.customer_id = c.id
    JOIN 'products.json' p ON o.product_id = p.id
""").df()

# Persistent database
con = duckdb.connect('mydb.duckdb')
con.execute("CREATE TABLE events AS SELECT * FROM 'events.csv'")

Node.js

import { DuckDBInstance } from '@duckdb/node-api';

const instance = await DuckDBInstance.create(':memory:');
const connection = await instance.connect();

const result = await connection.run(`
    SELECT * FROM 'data.csv'
    WHERE price > 100
`);

const rows = await result.getRows();
console.log(rows);

Key Features

Query Any File Format

-- CSV with auto-detection
SELECT * FROM 'data.csv';

-- CSV with options
SELECT * FROM read_csv('data.csv',
    delim=',',
    header=true,
    columns={'id': 'INTEGER', 'name': 'VARCHAR'}
);

-- Parquet (extremely fast)
SELECT * FROM 'data.parquet';
SELECT * FROM read_parquet('s3://bucket/year=*/month=*/*.parquet');

-- JSON (nested supported)
SELECT * FROM 'data.json';

-- Excel
INSTALL spatial;
LOAD spatial;
SELECT * FROM st_read('data.xlsx');

DataFrame Integration

# Seamless Pandas/Polars/Arrow integration
import pandas as pd
import polars as pl
import pyarrow as pa

pd_df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
pl_df = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
arrow_table = pa.table({'a': [1,2,3], 'b': [4,5,6]})

# Query them all the same way
duckdb.query("SELECT SUM(a+b) FROM pd_df").df()
duckdb.query("SELECT SUM(a+b) FROM pl_df").pl()
duckdb.query("SELECT SUM(a+b) FROM arrow_table").arrow()

Advanced SQL

-- Window functions
SELECT
    product,
    date,
    revenue,
    SUM(revenue) OVER (PARTITION BY product ORDER BY date) AS cumulative,
    AVG(revenue) OVER (PARTITION BY product ORDER BY date ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS rolling_7day_avg
FROM sales;

-- List/array operations
SELECT
    user_id,
    list_aggregate(events, 'count') AS event_count,
    list_filter(events, x -> x = 'click') AS clicks
FROM user_events;

-- Struct / Map types
SELECT
    user_id,
    profile.email,
    profile.address.city
FROM users;

-- Sampling
SELECT * FROM big_table USING SAMPLE 1% (bernoulli);

Performance

Benchmark: Scan 100M row Parquet file, aggregate by category

DuckDB (single laptop):  1.2 seconds
PostgreSQL (with index): 15 seconds
SQLite:                  45 seconds
Pandas (in-memory):      8 seconds (OOM on large files)
Spark (1-node):          25 seconds

DuckDB often outperforms distributed databases on a single machine for workloads under 1TB.

DuckDB vs Alternatives

| Feature         | DuckDB                 | SQLite      | ClickHouse     | Pandas        |
|-----------------|------------------------|-------------|----------------|---------------|
| In-process      | Yes                    | Yes         | No (server)    | Yes           |
| Analytics speed | Extremely fast         | Slow        | Very fast      | Medium        |
| Storage         | Columnar               | Row         | Columnar       | In-memory     |
| File formats    | CSV/Parquet/JSON/Excel | CSV         | CSV/Parquet/S3 | All Python    |
| ACID            | Yes                    | Yes         | Limited        | No            |
| Max data size   | Out-of-core            | RAM limited | Petabytes      | RAM limited   |
| SQL             | Full ANSI              | Simple      | ANSI + ext     | Query via SQL |

FAQ

Q: Which is faster, DuckDB or SQLite? A: For analytical queries (aggregations, grouping, joins over large tables), DuckDB is 10-100x faster thanks to its columnar storage. For OLTP (single-row inserts, updates, and deletes), SQLite is faster. DuckDB is the analytical counterpart to SQLite.

Q: Do I need an ETL step? A: No. DuckDB queries CSV, Parquet, and JSON files directly, with no import into the database first. That is its biggest advantage: zero-ETL data analysis.

Q: How much data can it handle? A: DuckDB supports out-of-core execution, so it can process datasets larger than RAM. In practice, a machine with 16GB of RAM can work through a 100GB Parquet file — far more than Pandas can manage.
