What DuckDB Does
- In-Process: Embedded in your application like SQLite (no server)
- Columnar: Column-oriented storage for fast analytical queries
- Query Files Directly: Read CSV, Parquet, JSON, Excel without importing
- SQL: Full ANSI SQL + modern extensions
- Zero Setup: Single binary, no configuration, no daemon
- Python/R Integration: First-class DataFrame (Pandas, Polars, Arrow) support
- Vectorized Execution: Process data in batches for CPU cache efficiency
- Extensions: Geospatial, HTTP, S3, full-text search, and more
- ACID: Full ACID transactions with MVCC
- Cross-Platform: Runs on macOS, Linux, Windows, WASM (browser!)
Architecture
┌─────────────────────────────────────┐
│ Your Application │
│ ┌────────────┐ ┌──────────────┐ │
│ │ Python │ │ Node.js │ │
│ │ duckdb │ │ duckdb │ │
│ │ .execute()│ │ .all() │ │
│ └──────┬─────┘ └──────┬───────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────────┐ │
│ │ DuckDB Engine (embedded) │ │
│ │ - Query Optimizer │ │
│ │ - Vectorized Executor │ │
│ │ - Columnar Storage │ │
│ └────────────────────────────┘ │
└─────────────────────────────────────┘
│
▼
File System (CSV, Parquet, DuckDB file)
Basic Usage
CLI
-- Query a CSV file directly
SELECT
customer_id,
SUM(amount) AS total
FROM 'orders.csv'
WHERE date >= '2024-01-01'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;
-- Query a Parquet file
SELECT * FROM 's3://bucket/data/year=2024/*.parquet' LIMIT 100;
-- Query JSON
SELECT * FROM read_json_auto('events.json');
-- Create persistent database
.open mydb.duckdb
-- Create table from query
CREATE TABLE sales AS
SELECT * FROM 'sales.csv';
-- Query the table
SELECT region, SUM(revenue) FROM sales GROUP BY region;
Python
import duckdb
# In-memory database
con = duckdb.connect()
# Query a CSV directly
result = con.execute("""
SELECT category, SUM(price) as total
FROM 'products.csv'
GROUP BY category
ORDER BY total DESC
""").fetchall()
# Query a Pandas DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
result = con.execute("SELECT * FROM df WHERE value > 100").df()
# Join multiple files
con.execute("""
SELECT o.order_id, c.customer_name, p.product_name
FROM 'orders.csv' o
JOIN 'customers.parquet' c ON o.customer_id = c.id
JOIN 'products.json' p ON o.product_id = p.id
""").df()
# Persistent database
con = duckdb.connect('mydb.duckdb')
con.execute("CREATE TABLE events AS SELECT * FROM 'events.csv'")Node.js
import { DuckDBInstance } from '@duckdb/node-api';
const instance = await DuckDBInstance.create(':memory:');
const connection = await instance.connect();
const result = await connection.run(`
SELECT * FROM 'data.csv'
WHERE price > 100
`);
const rows = await result.getRows();
console.log(rows);
Key Features
Query Any File Format
-- CSV with auto-detection
SELECT * FROM 'data.csv';
-- CSV with options
SELECT * FROM read_csv('data.csv',
delim=',',
header=true,
columns={'id': 'INTEGER', 'name': 'VARCHAR'}
);
-- Parquet (extremely fast)
SELECT * FROM 'data.parquet';
SELECT * FROM read_parquet('s3://bucket/year=*/month=*/*.parquet');
-- JSON (nested supported)
SELECT * FROM 'data.json';
-- Excel
INSTALL spatial;
LOAD spatial;
SELECT * FROM st_read('data.xlsx');
DataFrame Integration
# Seamless Pandas/Polars/Arrow integration
import pandas as pd
import polars as pl
import pyarrow as pa
pd_df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
pl_df = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
arrow_table = pa.table({'a': [1,2,3], 'b': [4,5,6]})
# Query them all the same way
duckdb.query("SELECT SUM(a+b) FROM pd_df").df()
duckdb.query("SELECT SUM(a+b) FROM pl_df").pl()
duckdb.query("SELECT SUM(a+b) FROM arrow_table").arrow()Advanced SQL
-- Window functions
SELECT
product,
date,
revenue,
SUM(revenue) OVER (PARTITION BY product ORDER BY date) AS cumulative,
AVG(revenue) OVER (PARTITION BY product ORDER BY date ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS rolling_7day_avg
FROM sales;
-- List/array operations
SELECT
user_id,
list_aggregate(events, 'count') AS event_count,
list_filter(events, x -> x = 'click') AS clicks
FROM user_events;
-- Struct / Map types
SELECT
user_id,
profile.email,
profile.address.city
FROM users;
-- Sampling
SELECT * FROM big_table USING SAMPLE 1% (bernoulli);
Performance
Illustrative benchmark: scan a 100M-row Parquet file and aggregate by category (figures vary with hardware)
DuckDB (single laptop): 1.2 seconds
PostgreSQL (with index): 15 seconds
SQLite: 45 seconds
Pandas (in-memory): 8 seconds (OOM on large files)
Spark (1-node): 25 seconds
DuckDB often outperforms distributed databases on a single machine for workloads under 1TB.
DuckDB vs Alternatives
| Feature | DuckDB | SQLite | ClickHouse | Pandas |
|---|---|---|---|---|
| In-process | Yes | Yes | No (server) | Yes |
| Analytics speed | Extremely fast | Slow | Very fast | Medium |
| Storage | Columnar | Row | Columnar | In-memory |
| File formats | CSV/Parquet/JSON/Excel | CSV | CSV/Parquet/S3 | All Python |
| ACID | Yes | Yes | Limited | No |
| Max data size | Out-of-core | RAM limited | Petabytes | RAM limited |
| SQL | Full ANSI | Simple | ANSI + ext | None (API-based) |
FAQ
Q: Which is faster, DuckDB or SQLite? A: For analytical queries (aggregation, grouping, joins over large tables), DuckDB is 10-100x faster thanks to its columnar storage. For OLTP (single-row inserts, updates, deletes, lookups), SQLite is faster. DuckDB is the analytical counterpart to SQLite.
Q: Do I need an ETL step? A: No. DuckDB queries CSV, Parquet, and JSON files directly, without importing them into a database first. This is its biggest advantage: zero-ETL data analysis.
Q: How much data can it handle? A: DuckDB supports out-of-core execution, so it can process datasets larger than RAM. In practice it has handled 100GB Parquet files on a machine with 16GB of RAM, well beyond what Pandas can manage.
Sources & Credits
- GitHub: duckdb/duckdb (37.3K+ ⭐, MIT license)
- Website: duckdb.org