Skills · Apr 10, 2026 · 3 min read

DuckDB — Fast In-Process Analytical SQL Database

DuckDB is a lightning-fast, in-process analytical database. Query CSV, Parquet, and JSON files with SQL — SQLite for analytics, zero setup, embedded in your application.

Agent-ready

Install with prior review

This asset requires review. The copied prompt requests a dry run, shows the writes, and continues only after confirmation.

Needs Confirmation · 64/100 · Policy: confirm
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry: step-1.md
Command with prior review:
npx -y tokrepo@latest install 2fefa271-3535-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first, confirm the writes, and then run this command.

TL;DR
DuckDB is an in-process analytical SQL database that queries CSV, Parquet, and JSON files directly with zero setup and no server.
§01

What it is

DuckDB is an in-process analytical database designed for fast SQL queries over local data files. It reads CSV, Parquet, JSON, and Excel files directly without loading them into a separate database server. DuckDB embeds into your application like SQLite but is optimized for analytical queries (aggregations, joins, window functions) rather than transactional workloads.

It targets data scientists, analysts, and developers who work with local data files and want SQL query capabilities without setting up a database server. DuckDB runs in Python, R, Node.js, Java, Go, and as a standalone CLI.

§02

How it saves time or tokens

DuckDB eliminates the load-transform-query cycle. Instead of importing a CSV into PostgreSQL to run SQL queries, query the CSV directly: `SELECT * FROM 'data.csv' WHERE amount > 100`. No schema definition, no import step, no server to start.

For data science workflows, DuckDB integrates with pandas DataFrames and Apache Arrow. Query DataFrames with SQL and get results back as DataFrames. This bridges the gap between SQL and Python data manipulation.

§03

How to use

  1. Install DuckDB: `pip install duckdb` for Python, `brew install duckdb` for CLI, or add the dependency to your project.
  2. Query files directly with SQL: `duckdb -c "SELECT * FROM 'sales.parquet' LIMIT 10"`. No database creation or table definition needed.
  3. For persistent storage, create a database file: `duckdb mydata.db`. Tables and indexes persist between sessions.
§04

Example

import duckdb
import pandas as pd

# Query a CSV file directly
result = duckdb.sql("""
    SELECT
        category,
        count(*) AS orders,
        sum(amount) AS total_revenue,
        avg(amount) AS avg_order
    FROM 'orders.csv'
    WHERE order_date >= '2026-01-01'
    GROUP BY category
    ORDER BY total_revenue DESC
""").fetchdf()  # Returns a pandas DataFrame

# Query Parquet files on S3 (needs the httpfs extension and S3 credentials)
remote = duckdb.sql("""
    SELECT * FROM read_parquet('s3://my-bucket/data/*.parquet')
    WHERE region = 'US'
""")  # Returns a lazy relation; call .fetchdf() to materialize it

# Query a pandas DataFrame with SQL by referencing the variable name
df = pd.read_csv('users.csv')
duckdb.sql("SELECT * FROM df WHERE age > 25").show()

§06

Common pitfalls

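  • Trying to use DuckDB for concurrent transactional workloads. DuckDB allows only one process to write to a database file at a time and is optimized for analytical queries. For multi-user OLTP workloads, use a client-server database such as PostgreSQL; SQLite suits embedded transactional use but also serializes writes.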
  • Not leveraging Parquet format for large datasets. DuckDB queries Parquet files much faster than CSV because Parquet is columnar and compressed. Convert large CSV files to Parquet for repeated analysis.
  • Assuming DuckDB needs a server. DuckDB runs entirely in-process. There is no server to start, no port to configure, and no connection string. It loads as a library in your application or runs as a CLI tool.

Frequently asked questions

How does DuckDB compare to SQLite?

SQLite is optimized for transactional (OLTP) workloads with many small reads and writes. DuckDB is optimized for analytical (OLAP) workloads with complex aggregations over large datasets. Both are embedded (no server) and store data in a single file. Choose DuckDB for data analysis and SQLite for application state.

Can DuckDB query remote files?

Yes. DuckDB queries files from S3, GCS, Azure Blob Storage, and HTTP URLs directly. Install the httpfs extension (Azure Blob Storage uses the separate azure extension) and query remote Parquet or CSV files without downloading them first. For Parquet, DuckDB uses predicate pushdown and column pruning to minimize data transfer.

How large a dataset can DuckDB handle?

DuckDB handles datasets larger than available RAM using disk-based spilling. It processes data in chunks, so datasets of hundreds of gigabytes work on machines with modest memory. For truly massive datasets (terabytes), use ClickHouse or a distributed query engine.

Does DuckDB support extensions?

Yes. DuckDB has extensions for spatial queries (PostGIS-compatible), full-text search, JSON, Excel, Parquet, Iceberg, Delta Lake, httpfs (remote file access), and more. Install extensions with `INSTALL extension_name; LOAD extension_name;`.

Can I use DuckDB with pandas?

Yes. DuckDB integrates tightly with pandas. Query DataFrames directly with SQL (just reference the variable name in FROM clause), and get results back as DataFrames with .fetchdf(). This enables SQL for complex transformations while staying in the pandas ecosystem.
