# DuckDB — Fast In-Process Analytical SQL Database

> DuckDB is a lightning-fast, in-process analytical database. Query CSV, Parquet, and JSON files with SQL — SQLite for analytics, zero setup, embedded in your application.

## Install

```bash
# Install CLI
brew install duckdb                    # macOS
curl https://install.duckdb.org | sh   # Linux
```

## Quick Use

```bash
# Start interactive shell
duckdb
```

```sql
-- Query a CSV directly — no import needed
SELECT * FROM 'data.csv' LIMIT 10;
SELECT COUNT(*) FROM 'https://example.com/data.parquet';
```

## Intro

**DuckDB** is an open-source, in-process SQL database designed for analytical queries. Think of it as "SQLite for analytics" — a single binary that embeds directly into your application, requires zero configuration, and queries data files (CSV, Parquet, JSON) directly at lightning speed.

With 37.3K+ GitHub stars and an MIT license, DuckDB has revolutionized data analysis workflows by bringing database-class performance to laptops and scripts. It's used by data scientists, analysts, and developers for local analytics, ETL, and data exploration.

## What DuckDB Does

- **In-Process**: Embedded in your application like SQLite (no server)
- **Columnar**: Column-oriented storage for fast analytical queries
- **Query Files Directly**: Read CSV, Parquet, JSON, and Excel without importing
- **SQL**: Full ANSI SQL plus modern extensions
- **Zero Setup**: Single binary, no configuration, no daemon
- **Python/R Integration**: First-class DataFrame (Pandas, Polars, Arrow) support
- **Vectorized Execution**: Processes data in batches for CPU cache efficiency
- **Extensions**: Geospatial, HTTP, S3, full-text search, and more
- **ACID**: Full ACID transactions with MVCC
- **Cross-Platform**: Runs on macOS, Linux, Windows, and WASM (in the browser!)
## Architecture

```
┌─────────────────────────────────────┐
│          Your Application           │
│  ┌────────────┐  ┌──────────────┐   │
│  │   Python   │  │   Node.js    │   │
│  │   duckdb   │  │   duckdb     │   │
│  │ .execute() │  │   .all()     │   │
│  └──────┬─────┘  └──────┬───────┘   │
│         │               │           │
│         ▼               ▼           │
│  ┌────────────────────────────┐     │
│  │  DuckDB Engine (embedded)  │     │
│  │  - Query Optimizer         │     │
│  │  - Vectorized Executor     │     │
│  │  - Columnar Storage        │     │
│  └────────────────────────────┘     │
└─────────────────────────────────────┘
                  │
                  ▼
   File System (CSV, Parquet, DuckDB file)
```

## Basic Usage

### CLI

```sql
-- Query a CSV file directly
SELECT customer_id, SUM(amount) AS total
FROM 'orders.csv'
WHERE date >= '2024-01-01'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;

-- Query a Parquet file
SELECT * FROM 's3://bucket/data/year=2024/*.parquet' LIMIT 100;

-- Query JSON
SELECT * FROM read_json_auto('events.json');

-- Create persistent database
.open mydb.duckdb

-- Create table from query
CREATE TABLE sales AS SELECT * FROM 'sales.csv';

-- Query the table
SELECT region, SUM(revenue) FROM sales GROUP BY region;
```

### Python

```python
import duckdb

# In-memory database
con = duckdb.connect()

# Query a CSV directly
result = con.execute("""
    SELECT category, SUM(price) AS total
    FROM 'products.csv'
    GROUP BY category
    ORDER BY total DESC
""").fetchall()

# Query a Pandas DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
result = con.execute("SELECT * FROM df WHERE value > 100").df()

# Join multiple files
con.execute("""
    SELECT o.order_id, c.customer_name, p.product_name
    FROM 'orders.csv' o
    JOIN 'customers.parquet' c ON o.customer_id = c.id
    JOIN 'products.json' p ON o.product_id = p.id
""").df()

# Persistent database
con = duckdb.connect('mydb.duckdb')
con.execute("CREATE TABLE events AS SELECT * FROM 'events.csv'")
```

### Node.js

```javascript
import { DuckDBInstance } from '@duckdb/node-api';

const instance = await DuckDBInstance.create(':memory:');
const connection = await instance.connect();

const result = await
connection.run(`
  SELECT * FROM 'data.csv' WHERE price > 100
`);
const rows = await result.getRows();
console.log(rows);
```

## Key Features

### Query Any File Format

```sql
-- CSV with auto-detection
SELECT * FROM 'data.csv';

-- CSV with options
SELECT * FROM read_csv('data.csv',
    delim=',',
    header=true,
    columns={'id': 'INTEGER', 'name': 'VARCHAR'}
);

-- Parquet (extremely fast)
SELECT * FROM 'data.parquet';
SELECT * FROM read_parquet('s3://bucket/year=*/month=*/*.parquet');

-- JSON (nested supported)
SELECT * FROM 'data.json';

-- Excel
INSTALL spatial;
LOAD spatial;
SELECT * FROM st_read('data.xlsx');
```

### DataFrame Integration

```python
# Seamless Pandas/Polars/Arrow integration
import duckdb
import pandas as pd
import polars as pl
import pyarrow as pa

pd_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
pl_df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
arrow_table = pa.table({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Query them all the same way
duckdb.query("SELECT SUM(a + b) FROM pd_df").df()
duckdb.query("SELECT SUM(a + b) FROM pl_df").pl()
duckdb.query("SELECT SUM(a + b) FROM arrow_table").arrow()
```

### Advanced SQL

```sql
-- Window functions
SELECT
    product,
    date,
    revenue,
    SUM(revenue) OVER (PARTITION BY product ORDER BY date) AS cumulative,
    AVG(revenue) OVER (
        PARTITION BY product ORDER BY date
        ROWS BETWEEN 7 PRECEDING AND CURRENT ROW
    ) AS rolling_7day_avg
FROM sales;

-- List/array operations
SELECT
    user_id,
    list_aggregate(events, 'count') AS event_count,
    list_filter(events, x -> x = 'click') AS clicks
FROM user_events;

-- Struct / Map types
SELECT user_id, profile.email, profile.address.city FROM users;

-- Sampling
SELECT * FROM big_table USING SAMPLE 1% (bernoulli);
```

## Performance

```
Benchmark: Scan 100M row Parquet file, aggregate by category

DuckDB (single laptop):   1.2 seconds
PostgreSQL (with index):  15 seconds
SQLite:                   45 seconds
Pandas (in-memory):       8 seconds (OOM on large files)
Spark (1-node):           25 seconds
```

DuckDB often outperforms distributed databases on a single machine
for workloads under 1TB.

## DuckDB vs Alternatives

| Feature | DuckDB | SQLite | ClickHouse | Pandas |
|---------|--------|--------|------------|--------|
| In-process | Yes | Yes | No (server) | Yes |
| Analytics speed | Extremely fast | Slow | Very fast | Medium |
| Storage | Columnar | Row | Columnar | In-memory |
| File formats | CSV/Parquet/JSON/Excel | CSV | CSV/Parquet/S3 | All Python |
| ACID | Yes | Yes | Limited | No |
| Max data size | Out-of-core | RAM limited | Petabytes | RAM limited |
| SQL | Full ANSI | Simple | ANSI + extensions | No (DataFrame API) |

## FAQ

**Q: Which is faster, DuckDB or SQLite?**
A: For analytical queries (aggregations, grouping, joins over large tables), DuckDB is 10-100x faster thanks to its columnar storage. For OLTP (single-row inserts, updates, deletes, and lookups), SQLite is faster. DuckDB is the analytical counterpart to SQLite.

**Q: Do I need an ETL step?**
A: No. DuckDB queries CSV, Parquet, and JSON files directly, without importing them into a database first. This zero-ETL workflow is its biggest advantage for data analysis.

**Q: How much data can it handle?**
A: DuckDB supports out-of-core execution, so it can process datasets larger than RAM. In practice it can work through a 100GB Parquet file on a machine with 16GB of RAM, far beyond what Pandas can handle.

## Sources & Credits

- GitHub: [duckdb/duckdb](https://github.com/duckdb/duckdb) — 37.3K+ ⭐ | MIT
- Website: [duckdb.org](https://duckdb.org)

---

Source: https://tokrepo.com/en/workflows/2fefa271-3535-11f1-9bc6-00163e2b0d79
Author: Script Depot