# DuckDB — Fast In-Process Analytical SQL Database

> DuckDB is a lightning-fast, in-process analytical database. Query CSV, Parquet, and JSON files with SQL — SQLite for analytics, zero setup, embedded in your application.

## Install

```bash
# Install CLI
brew install duckdb                    # macOS
curl https://install.duckdb.org | sh   # Linux
```

## Quick Use

```bash
# Start interactive shell
duckdb
```

```sql
-- Query a CSV directly — no import needed
SELECT * FROM 'data.csv' LIMIT 10;
SELECT COUNT(*) FROM 'https://example.com/data.parquet';
```

## Intro

**DuckDB** is an open-source, in-process SQL database designed for analytical queries. Think of it as "SQLite for analytics" — a single binary that embeds directly into your application, requires zero configuration, and queries data files (CSV, Parquet, JSON) directly at lightning speed.

With 37.3K+ GitHub stars and an MIT license, DuckDB has revolutionized data analysis workflows by bringing database-class performance to laptops and scripts. It's used by data scientists, analysts, and developers for local analytics, ETL, and data exploration.

## What DuckDB Does

- **In-Process**: Embedded in your application like SQLite (no server)
- **Columnar**: Column-oriented storage for fast analytical queries
- **Query Files Directly**: Read CSV, Parquet, JSON, and Excel without importing
- **SQL**: Full ANSI SQL plus modern extensions
- **Zero Setup**: Single binary, no configuration, no daemon
- **Python/R Integration**: First-class DataFrame (Pandas, Polars, Arrow) support
- **Vectorized Execution**: Processes data in batches for CPU cache efficiency
- **Extensions**: Geospatial, HTTP, S3, full-text search, and more
- **ACID**: Full ACID transactions with MVCC
- **Cross-Platform**: Runs on macOS, Linux, Windows, and WASM (in the browser!)
## Architecture

```
┌─────────────────────────────────────┐
│          Your Application           │
│  ┌────────────┐  ┌──────────────┐   │
│  │   Python   │  │   Node.js    │   │
│  │   duckdb   │  │   duckdb     │   │
│  │ .execute() │  │   .all()     │   │
│  └──────┬─────┘  └──────┬───────┘   │
│         │               │           │
│         ▼               ▼           │
│  ┌────────────────────────────┐     │
│  │  DuckDB Engine (embedded)  │     │
│  │  - Query Optimizer         │     │
│  │  - Vectorized Executor     │     │
│  │  - Columnar Storage        │     │
│  └────────────────────────────┘     │
└─────────────────────────────────────┘
                  │
                  ▼
   File System (CSV, Parquet, DuckDB file)
```

## Basic Usage

### CLI

```sql
-- Query a CSV file directly
SELECT customer_id, SUM(amount) AS total
FROM 'orders.csv'
WHERE date >= '2024-01-01'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;

-- Query a Parquet file
SELECT * FROM 's3://bucket/data/year=2024/*.parquet' LIMIT 100;

-- Query JSON
SELECT * FROM read_json_auto('events.json');

-- Create persistent database
.open mydb.duckdb

-- Create table from query
CREATE TABLE sales AS SELECT * FROM 'sales.csv';

-- Query the table
SELECT region, SUM(revenue) FROM sales GROUP BY region;
```

### Python

```python
import duckdb

# In-memory database
con = duckdb.connect()

# Query a CSV directly
result = con.execute("""
    SELECT category, SUM(price) AS total
    FROM 'products.csv'
    GROUP BY category
    ORDER BY total DESC
""").fetchall()

# Query a Pandas DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
result = con.execute("SELECT * FROM df WHERE value > 100").df()

# Join multiple files
con.execute("""
    SELECT o.order_id, c.customer_name, p.product_name
    FROM 'orders.csv' o
    JOIN 'customers.parquet' c ON o.customer_id = c.id
    JOIN 'products.json' p ON o.product_id = p.id
""").df()

# Persistent database
con = duckdb.connect('mydb.duckdb')
con.execute("CREATE TABLE events AS SELECT * FROM 'events.csv'")
```

### Node.js

```javascript
import { DuckDBInstance } from '@duckdb/node-api';

const instance = await DuckDBInstance.create(':memory:');
const connection = await instance.connect();

const result = await
connection.run(`
  SELECT * FROM 'data.csv' WHERE price > 100
`);
const rows = await result.getRows();
console.log(rows);
```

## Key Features

### Query Any File Format

```sql
-- CSV with auto-detection
SELECT * FROM 'data.csv';

-- CSV with options
SELECT * FROM read_csv('data.csv',
    delim=',',
    header=true,
    columns={'id': 'INTEGER', 'name': 'VARCHAR'}
);

-- Parquet (extremely fast)
SELECT * FROM 'data.parquet';
SELECT * FROM read_parquet('s3://bucket/year=*/month=*/*.parquet');

-- JSON (nested supported)
SELECT * FROM 'data.json';

-- Excel
INSTALL spatial;
LOAD spatial;
SELECT * FROM st_read('data.xlsx');
```

### DataFrame Integration

```python
# Seamless Pandas/Polars/Arrow integration
import duckdb
import pandas as pd
import polars as pl
import pyarrow as pa

pd_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
pl_df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
arrow_table = pa.table({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Query them all the same way
duckdb.query("SELECT SUM(a + b) FROM pd_df").df()
duckdb.query("SELECT SUM(a + b) FROM pl_df").pl()
duckdb.query("SELECT SUM(a + b) FROM arrow_table").arrow()
```

### Advanced SQL

```sql
-- Window functions
SELECT
    product,
    date,
    revenue,
    SUM(revenue) OVER (PARTITION BY product ORDER BY date) AS cumulative,
    AVG(revenue) OVER (
        PARTITION BY product ORDER BY date
        ROWS BETWEEN 7 PRECEDING AND CURRENT ROW
    ) AS rolling_7day_avg
FROM sales;

-- List/array operations
SELECT
    user_id,
    list_aggregate(events, 'count') AS event_count,
    list_filter(events, x -> x = 'click') AS clicks
FROM user_events;

-- Struct / Map types
SELECT user_id, profile.email, profile.address.city FROM users;

-- Sampling
SELECT * FROM big_table USING SAMPLE 1% (bernoulli);
```

## Performance

```
Benchmark: Scan 100M row Parquet file, aggregate by category

DuckDB (single laptop):   1.2 seconds
PostgreSQL (with index):  15 seconds
SQLite:                   45 seconds
Pandas (in-memory):       8 seconds (OOM on large files)
Spark (1-node):           25 seconds
```

DuckDB often outperforms distributed databases on a single machine
for workloads under 1TB.

## DuckDB vs Alternatives

| Feature | DuckDB | SQLite | ClickHouse | Pandas |
|---------|--------|--------|------------|--------|
| In-process | Yes | Yes | No (server) | Yes |
| Analytics speed | Extremely fast | Slow | Very fast | Medium |
| Storage | Columnar | Row | Columnar | In-memory |
| File formats | CSV/Parquet/JSON/Excel | CSV | CSV/Parquet/S3 | All Python |
| ACID | Yes | Yes | Limited | No |
| Max data size | Out-of-core | RAM limited | Petabytes | RAM limited |
| SQL | Full ANSI | Simple | ANSI + extensions | No (DataFrame API) |

## FAQ

**Q: Which is faster, DuckDB or SQLite?**
A: For analytical queries (aggregations, grouping, joins over large tables), DuckDB is 10-100x faster thanks to its columnar storage. For OLTP (single-row inserts, updates, deletes, and lookups), SQLite is faster. DuckDB is the analytical counterpart to SQLite.

**Q: Do I need an ETL step?**
A: No. DuckDB queries CSV, Parquet, and JSON files directly, without importing them into a database first. This zero-ETL workflow is its biggest advantage for data analysis.

**Q: How much data can it handle?**
A: DuckDB supports out-of-core execution, so it can process datasets larger than RAM. In practice it can work through a 100GB Parquet file on a machine with 16GB of RAM, far beyond what Pandas can handle.

## Sources & Credits

- GitHub: [duckdb/duckdb](https://github.com/duckdb/duckdb) — 37.3K+ ⭐ | MIT
- Website: [duckdb.org](https://duckdb.org)

---

Source: https://tokrepo.com/en/workflows/2fefa271-3535-11f1-9bc6-00163e2b0d79
Author: Script Depot