What DuckDB Does
- In-Process: Embedded in your application like SQLite (no server)
- Columnar: Column-oriented storage for fast analytical queries
- Query Files Directly: Read CSV, Parquet, JSON, Excel without importing
- SQL: Full ANSI SQL + modern extensions
- Zero Setup: Single binary, no configuration, no daemon
- Python/R Integration: First-class DataFrame (Pandas, Polars, Arrow) support
- Vectorized Execution: Process data in batches for CPU cache efficiency
- Extensions: Geospatial, HTTP, S3, full-text search, and more
- ACID: Full ACID transactions with MVCC
- Cross-Platform: Runs on macOS, Linux, Windows, WASM (browser!)
Architecture
┌─────────────────────────────────────┐
│ Your Application │
│ ┌────────────┐ ┌──────────────┐ │
│ │ Python │ │ Node.js │ │
│ │ duckdb │ │ duckdb │ │
│ │ .execute()│ │ .all() │ │
│ └──────┬─────┘ └──────┬───────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────────┐ │
│ │ DuckDB Engine (embedded) │ │
│ │ - Query Optimizer │ │
│ │ - Vectorized Executor │ │
│ │ - Columnar Storage │ │
│ └────────────────────────────┘ │
└─────────────────────────────────────┘
│
▼
File System (CSV, Parquet, DuckDB file)
Basic Usage
CLI
-- Query a CSV file directly
SELECT
customer_id,
SUM(amount) AS total
FROM 'orders.csv'
WHERE date >= '2024-01-01'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;
-- Query a Parquet file
SELECT * FROM 's3://bucket/data/year=2024/*.parquet' LIMIT 100;
-- Query JSON
SELECT * FROM read_json_auto('events.json');
-- Create persistent database
.open mydb.duckdb
-- Create table from query
CREATE TABLE sales AS
SELECT * FROM 'sales.csv';
-- Query the table
SELECT region, SUM(revenue) FROM sales GROUP BY region;
Python
import duckdb
# In-memory database
con = duckdb.connect()
# Query a CSV directly
result = con.execute("""
SELECT category, SUM(price) as total
FROM 'products.csv'
GROUP BY category
ORDER BY total DESC
""").fetchall()
# Query a Pandas DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
result = con.execute("SELECT * FROM df WHERE value > 100").df()
# Join multiple files
con.execute("""
SELECT o.order_id, c.customer_name, p.product_name
FROM 'orders.csv' o
JOIN 'customers.parquet' c ON o.customer_id = c.id
JOIN 'products.json' p ON o.product_id = p.id
""").df()
# Persistent database
con = duckdb.connect('mydb.duckdb')
con.execute("CREATE TABLE events AS SELECT * FROM 'events.csv'")Node.js
import { DuckDBInstance } from '@duckdb/node-api';
const instance = await DuckDBInstance.create(':memory:');
const connection = await instance.connect();
const result = await connection.run(`
SELECT * FROM 'data.csv'
WHERE price > 100
`);
const rows = await result.getRows();
console.log(rows);
Key Features
Query Any File Format
-- CSV with auto-detection
SELECT * FROM 'data.csv';
-- CSV with options
SELECT * FROM read_csv('data.csv',
delim=',',
header=true,
columns={'id': 'INTEGER', 'name': 'VARCHAR'}
);
-- Parquet (extremely fast)
SELECT * FROM 'data.parquet';
SELECT * FROM read_parquet('s3://bucket/year=*/month=*/*.parquet');
-- JSON (nested supported)
SELECT * FROM 'data.json';
-- Excel
INSTALL spatial;
LOAD spatial;
SELECT * FROM st_read('data.xlsx');
DataFrame Integration
# Seamless Pandas/Polars/Arrow integration
import pandas as pd
import polars as pl
import pyarrow as pa
pd_df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
pl_df = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
arrow_table = pa.table({'a': [1,2,3], 'b': [4,5,6]})
# Query them all the same way
duckdb.query("SELECT SUM(a+b) FROM pd_df").df()
duckdb.query("SELECT SUM(a+b) FROM pl_df").pl()
duckdb.query("SELECT SUM(a+b) FROM arrow_table").arrow()Advanced SQL
-- Window functions
SELECT
product,
date,
revenue,
SUM(revenue) OVER (PARTITION BY product ORDER BY date) AS cumulative,
AVG(revenue) OVER (PARTITION BY product ORDER BY date ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS rolling_7day_avg
FROM sales;
-- List/array operations
SELECT
user_id,
list_aggregate(events, 'count') AS event_count,
list_filter(events, x -> x = 'click') AS clicks
FROM user_events;
-- Struct / Map types
SELECT
user_id,
profile.email,
profile.address.city
FROM users;
-- Sampling
SELECT * FROM big_table USING SAMPLE 1% (bernoulli);
Performance
Illustrative benchmark: scan a 100M-row Parquet file and aggregate by category (figures vary with hardware)
DuckDB (single laptop): 1.2 seconds
PostgreSQL (with index): 15 seconds
SQLite: 45 seconds
Pandas (in-memory): 8 seconds (OOM on large files)
Spark (1-node): 25 seconds
DuckDB often outperforms distributed databases on a single machine for workloads under 1TB.
DuckDB vs Alternatives
| Feature | DuckDB | SQLite | ClickHouse | Pandas |
|---|---|---|---|---|
| In-process | Yes | Yes | No (server) | Yes |
| Analytics speed | Extremely fast | Slow | Very fast | Medium |
| Storage | Columnar | Row | Columnar | In-memory |
| File formats | CSV/Parquet/JSON/Excel | CSV | CSV/Parquet/S3 | All Python |
| ACID | Yes | Yes | Limited | No |
| Max data size | Out-of-core | RAM limited | Petabytes | RAM limited |
| SQL | Full ANSI | Simple | ANSI + ext | None (API-based) |
FAQ
Q: Which is faster, DuckDB or SQLite? A: For analytical queries (aggregation, grouping, joins over large tables), DuckDB is 10-100x faster thanks to its columnar storage. For OLTP (single-row inserts, updates, deletes, lookups), SQLite is faster. DuckDB is the analytical counterpart to SQLite.
Q: Do I need an ETL step? A: No. DuckDB queries CSV, Parquet, and JSON files directly, without importing them into a database first. This is its biggest advantage: zero-ETL data analysis.
Q: How much data can it handle? A: DuckDB supports out-of-core execution, so it can process datasets larger than RAM. In practice it has handled 100GB Parquet files on a machine with 16GB of RAM, well beyond what Pandas can manage.
Sources & Credits
- GitHub: duckdb/duckdb (37.3K+ ⭐, MIT license)
- Website: duckdb.org