Scripts · Apr 10, 2026 · 3 min read

DuckDB — Fast In-Process Analytical SQL Database

DuckDB is a lightning-fast, in-process analytical database. Query CSV, Parquet, and JSON files with SQL — SQLite for analytics, zero setup, embedded in your application.

Introduction

DuckDB is an open-source, in-process SQL database designed for analytical queries. Think of it as "SQLite for analytics" — a single binary that embeds directly into your application, requires zero configuration, and queries data files (CSV, Parquet, JSON) directly at lightning speed.

With 37.3K+ GitHub stars and MIT license, DuckDB has revolutionized data analysis workflows by bringing database-class performance to laptops and scripts. It's used by data scientists, analysts, and developers for local analytics, ETL, and data exploration.

What DuckDB Does

  • In-Process: Embedded in your application like SQLite (no server)
  • Columnar: Column-oriented storage for fast analytical queries
  • Query Files Directly: Read CSV, Parquet, JSON, Excel without importing
  • SQL: Full ANSI SQL + modern extensions
  • Zero Setup: Single binary, no configuration, no daemon
  • Python/R Integration: First-class DataFrame (Pandas, Polars, Arrow) support
  • Vectorized Execution: Process data in batches for CPU cache efficiency
  • Extensions: Geospatial, HTTP, S3, full-text search, and more
  • ACID: Full ACID transactions with MVCC
  • Cross-Platform: Runs on macOS, Linux, Windows, WASM (browser!)

Architecture

┌─────────────────────────────────────┐
│         Your Application            │
│  ┌────────────┐   ┌────────────┐    │
│  │  Python    │   │  Node.js   │    │
│  │  duckdb    │   │  duckdb    │    │
│  │  .execute()│   │  .all()    │    │
│  └──────┬─────┘   └─────┬──────┘    │
│         │               │           │
│         ▼               ▼           │
│  ┌───────────────────────────────┐  │
│  │   DuckDB Engine (embedded)    │  │
│  │   - Query Optimizer           │  │
│  │   - Vectorized Executor       │  │
│  │   - Columnar Storage          │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
          │
          ▼
    File System (CSV, Parquet, DuckDB file)

Basic Usage

CLI

-- Query a CSV file directly
SELECT
    customer_id,
    SUM(amount) AS total
FROM 'orders.csv'
WHERE date >= '2024-01-01'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;

-- Query a Parquet file on S3 (remote access needs the httpfs extension)
INSTALL httpfs;
LOAD httpfs;
SELECT * FROM 's3://bucket/data/year=2024/*.parquet' LIMIT 100;

-- Query JSON
SELECT * FROM read_json_auto('events.json');

-- Create persistent database
.open mydb.duckdb

-- Create table from query
CREATE TABLE sales AS
SELECT * FROM 'sales.csv';

-- Query the table
SELECT region, SUM(revenue) FROM sales GROUP BY region;

Python

import duckdb

# In-memory database
con = duckdb.connect()

# Query a CSV directly
result = con.execute("""
    SELECT category, SUM(price) as total
    FROM 'products.csv'
    GROUP BY category
    ORDER BY total DESC
""").fetchall()

# Query a Pandas DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
result = con.execute("SELECT * FROM df WHERE value > 100").df()

# Join multiple files
con.execute("""
    SELECT o.order_id, c.customer_name, p.product_name
    FROM 'orders.csv' o
    JOIN 'customers.parquet' c ON o.customer_id = c.id
    JOIN 'products.json' p ON o.product_id = p.id
""").df()

# Persistent database
con = duckdb.connect('mydb.duckdb')
con.execute("CREATE TABLE events AS SELECT * FROM 'events.csv'")

Node.js

import { DuckDBInstance } from '@duckdb/node-api';

const instance = await DuckDBInstance.create(':memory:');
const connection = await instance.connect();

const result = await connection.run(`
    SELECT * FROM 'data.csv'
    WHERE price > 100
`);

const rows = await result.getRows();
console.log(rows);

Key Features

Query Any File Format

-- CSV with auto-detection
SELECT * FROM 'data.csv';

-- CSV with options
SELECT * FROM read_csv('data.csv',
    delim=',',
    header=true,
    columns={'id': 'INTEGER', 'name': 'VARCHAR'}
);

-- Parquet (extremely fast)
SELECT * FROM 'data.parquet';
-- Remote globs need the httpfs extension (INSTALL httpfs; LOAD httpfs;)
SELECT * FROM read_parquet('s3://bucket/year=*/month=*/*.parquet');

-- JSON (nested supported)
SELECT * FROM 'data.json';

-- Excel
INSTALL spatial;
LOAD spatial;
SELECT * FROM st_read('data.xlsx');

DataFrame Integration

# Seamless Pandas/Polars/Arrow integration
import pandas as pd
import polars as pl
import pyarrow as pa

pd_df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
pl_df = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
arrow_table = pa.table({'a': [1,2,3], 'b': [4,5,6]})

# Query them all the same way
duckdb.query("SELECT SUM(a+b) FROM pd_df").df()
duckdb.query("SELECT SUM(a+b) FROM pl_df").pl()
duckdb.query("SELECT SUM(a+b) FROM arrow_table").arrow()

Advanced SQL

-- Window functions
SELECT
    product,
    date,
    revenue,
    SUM(revenue) OVER (PARTITION BY product ORDER BY date) AS cumulative,
    AVG(revenue) OVER (PARTITION BY product ORDER BY date ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS rolling_7day_avg
FROM sales;

-- List/array operations
SELECT
    user_id,
    list_aggregate(events, 'count') AS event_count,
    list_filter(events, x -> x = 'click') AS clicks
FROM user_events;

-- Struct / Map types
SELECT
    user_id,
    profile.email,
    profile.address.city
FROM users;

-- Sampling
SELECT * FROM big_table USING SAMPLE 1% (bernoulli);

Performance

Illustrative benchmark: scan a 100M-row Parquet file and aggregate by category (indicative timings; results depend heavily on hardware and data)

DuckDB (single laptop):  1.2 seconds
PostgreSQL (with index): 15 seconds
SQLite:                  45 seconds
Pandas (in-memory):      8 seconds (OOM on large files)
Spark (1-node):          25 seconds

DuckDB often outperforms distributed databases on a single machine for workloads under 1TB.

DuckDB vs Alternatives

Feature          DuckDB                  SQLite       ClickHouse       Pandas
In-process       Yes                     Yes          No (server)      Yes
Analytics speed  Extremely fast          Slow         Very fast        Medium
Storage          Columnar                Row          Columnar         In-memory
File formats     CSV/Parquet/JSON/Excel  CSV          CSV/Parquet/S3   All Python
ACID             Yes                     Yes          Limited          No
Max data size    Out-of-core             RAM limited  Petabytes        RAM limited
SQL              Full ANSI               Simple       ANSI + ext       Query via SQL

FAQ

Q: Which is faster, DuckDB or SQLite? A: For analytical queries (aggregations, grouping, joining large tables), DuckDB is 10-100x faster thanks to columnar storage. For OLTP (single-row CRUD), SQLite is faster. DuckDB is the analytical counterpart to SQLite.

Q: Do I need an ETL step? A: No! DuckDB can query CSV, Parquet, and JSON files directly without first importing them into a database. This is its biggest strength — zero-ETL data analysis.

Q: How much data can it handle? A: DuckDB supports out-of-core execution and can handle datasets larger than RAM. In real-world tests, it can process 100GB Parquet files on a 16GB RAM machine — far beyond what Pandas can do.


