
Skills · Apr 10, 2026 · 3 min read

DuckDB — Fast In-Process Analytical SQL Database

DuckDB is a lightning-fast, in-process analytical database. Query CSV, Parquet, and JSON files with SQL — SQLite for analytics, zero setup, embedded in your application.

Script Depot · Community

Agent ready

Install with prior review

This asset requires review. The copied prompt asks for a dry run, shows the writes, and continues only after confirmation.

Needs Confirmation · 64/100 · Policy: confirm

Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry: step-1.md

Command with prior review

npx -y tokrepo@latest install 2fefa271-3535-11f1-9bc6-00163e2b0d79 --target codex

Run a dry run first, confirm the writes, and then run this command.

TL;DR

DuckDB is an in-process analytical SQL database that queries CSV, Parquet, and JSON files directly with zero setup and no server.

§01

What it is

DuckDB is an in-process analytical database designed for fast SQL queries over local data files. It reads CSV, Parquet, and JSON files directly, and Excel files through an extension, without loading them into a separate database server. DuckDB embeds into your application like SQLite but is optimized for analytical queries (aggregations, joins, window functions) rather than transactional workloads.

It targets data scientists, analysts, and developers who work with local data files and want SQL query capabilities without setting up a database server. DuckDB runs in Python, R, Node.js, Java, Go, and as a standalone CLI.

§02

How it saves time or tokens

DuckDB eliminates the load-transform-query cycle. Instead of importing a CSV into PostgreSQL to run SQL queries, query the CSV directly: SELECT * FROM 'data.csv' WHERE amount > 100. No schema definition, no import step, no server to start.

For data science workflows, DuckDB integrates with pandas DataFrames and Apache Arrow. Query DataFrames with SQL and get results back as DataFrames. This bridges the gap between SQL and Python data manipulation.

§03

How to use

Install DuckDB: pip install duckdb for Python, brew install duckdb for CLI, or add the dependency to your project.
Query files directly with SQL: duckdb -c "SELECT * FROM 'sales.parquet' LIMIT 10". No database creation or table definition needed.
For persistent storage, create a database file: duckdb mydata.db. Tables and indexes persist between sessions.

§04

Example

import duckdb
import pandas as pd

# Query a CSV file directly; the schema is inferred from the file
result = duckdb.sql("""
    SELECT
        category,
        count(*) AS orders,
        sum(amount) AS total_revenue,
        avg(amount) AS avg_order
    FROM 'orders.csv'
    WHERE order_date >= '2026-01-01'
    GROUP BY category
    ORDER BY total_revenue DESC
""").fetchdf()  # Returns a pandas DataFrame

# Query Parquet files from S3 (needs the httpfs extension and S3 credentials)
result = duckdb.sql("""
    SELECT * FROM read_parquet('s3://my-bucket/data/*.parquet')
    WHERE region = 'US'
""")  # Returns a lazy relation; call .fetchdf() or .show() to run it

# Query a pandas DataFrame with SQL by referencing the variable name
df = pd.read_csv('users.csv')
duckdb.sql("SELECT * FROM df WHERE age > 25").show()

§05

Related on TokRepo

AI tools for database — Database and analytical tools
AI tools for research — Data analysis and research tools

§06

Common pitfalls

Trying to use DuckDB for concurrent transactional workloads. DuckDB allows only one process to write to a database file at a time and is optimized for analytical queries. For multi-user OLTP workloads, use a client-server database such as PostgreSQL; for embedded transactional storage, use SQLite.
Not leveraging Parquet format for large datasets. DuckDB queries Parquet files much faster than CSV because Parquet is columnar and compressed. Convert large CSV files to Parquet for repeated analysis.
Assuming DuckDB needs a server. DuckDB runs entirely in-process. There is no server to start, no port to configure, and no connection string. It loads as a library in your application or runs as a CLI tool.

Frequently asked questions

How does DuckDB compare to SQLite?

SQLite is optimized for transactional (OLTP) workloads with many small reads and writes. DuckDB is optimized for analytical (OLAP) workloads with complex aggregations over large datasets. Both are embedded (no server) and store data in a single file. Choose DuckDB for data analysis and SQLite for application state.

Can DuckDB query remote files?

Yes. DuckDB queries files on S3, GCS, Azure Blob Storage, and HTTP(S) URLs directly. Use the httpfs extension for S3, GCS, and HTTP(S), or the azure extension for Azure Blob Storage, then query remote Parquet or CSV files without downloading them first. For Parquet files, DuckDB uses predicate pushdown and column pruning to minimize data transfer.

How large a dataset can DuckDB handle?

DuckDB handles datasets larger than available RAM using disk-based spilling. It processes data in chunks, so datasets of hundreds of gigabytes work on machines with modest memory. For truly massive datasets (terabytes), use ClickHouse or a distributed query engine.

Does DuckDB support extensions?

Yes. DuckDB has extensions for spatial queries (PostGIS-style functions), full-text search, JSON, Excel, Parquet, Iceberg, Delta Lake, httpfs (remote file access), and more. Install and load an extension with `INSTALL extension_name; LOAD extension_name;`.

Can I use DuckDB with pandas?

Yes. DuckDB integrates tightly with pandas. Query DataFrames directly with SQL (reference the variable name in the FROM clause) and get results back as DataFrames with .fetchdf(). This lets you use SQL for complex transformations while staying in the pandas ecosystem.

