Skills · April 10, 2026 · 1 min read

DuckDB — Fast In-Process Analytical SQL Database

DuckDB is a fast, in-process analytical database. Query CSV, Parquet, and JSON files with SQL: think SQLite for analytics, with zero setup and embedded in your application.

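Agent ready

Review before installing

This asset needs review first. The copied instructions ask the agent to do a dry run, list what will be written, and continue only after confirmation.

Needs Confirmation · 64/100 · Policy: needs confirmation
Agent entry: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry point: step-1.md
Review-first command:
npx -y tokrepo@latest install 2fefa271-3535-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first, and run this command only after confirming what will be written.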

TL;DR
DuckDB is an in-process analytical SQL database that queries CSV, Parquet, and JSON files directly with zero setup and no server.
§01

What it is

DuckDB is an in-process analytical database designed for fast SQL queries over local data files. It reads CSV, Parquet, and JSON files directly, and Excel files through an extension, without loading them into a separate database server. DuckDB embeds into your application like SQLite but is optimized for analytical queries (aggregations, joins, window functions) rather than transactional workloads.

It targets data scientists, analysts, and developers who work with local data files and want SQL query capabilities without setting up a database server. DuckDB runs in Python, R, Node.js, Java, Go, and as a standalone CLI.

§02

How it saves time or tokens

DuckDB eliminates the load-transform-query cycle. Instead of importing a CSV into PostgreSQL to run SQL queries, query the CSV directly: SELECT * FROM 'data.csv' WHERE amount > 100. No schema definition, no import step, no server to start.

For data science workflows, DuckDB integrates with pandas DataFrames and Apache Arrow. Query DataFrames with SQL and get results back as DataFrames. This bridges the gap between SQL and Python data manipulation.

§03

How to use

  1. Install DuckDB: pip install duckdb for Python, brew install duckdb for CLI, or add the dependency to your project.
  2. Query files directly with SQL: duckdb -c "SELECT * FROM 'sales.parquet' LIMIT 10". No database creation or table definition needed.
  3. For persistent storage, create a database file: duckdb mydata.db. Tables and indexes persist between sessions.
§04

Example

import duckdb

# Query a CSV file directly
result = duckdb.sql("""
    SELECT 
        category,
        count(*) as orders,
        sum(amount) as total_revenue,
        avg(amount) as avg_order
    FROM 'orders.csv'
    WHERE order_date >= '2026-01-01'
    GROUP BY category
    ORDER BY total_revenue DESC
""").fetchdf()  # Returns a pandas DataFrame

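# Query Parquet files from S3 (the glob can match many files; this needs
# the httpfs extension and S3 credentials, and 'my-bucket' is a placeholder)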
result = duckdb.sql("""
    SELECT * FROM read_parquet('s3://my-bucket/data/*.parquet')
    WHERE region = 'US'
""")

# Query a pandas DataFrame with SQL
import pandas as pd
df = pd.read_csv('users.csv')
duckdb.sql("SELECT * FROM df WHERE age > 25").show()
§06

Common pitfalls

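  • Trying to use DuckDB for concurrent transactional workloads. Only one process at a time can open a DuckDB database file for writing, and the engine is optimized for analytical queries. For multi-user OLTP workloads, use a client-server database such as PostgreSQL; for embedded transactional storage, use SQLite.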
  • Not leveraging Parquet format for large datasets. DuckDB queries Parquet files much faster than CSV because Parquet is columnar and compressed. Convert large CSV files to Parquet for repeated analysis.
  • Assuming DuckDB needs a server. DuckDB runs entirely in-process. There is no server to start, no port to configure, and no connection string. It loads as a library in your application or runs as a CLI tool.

FAQ

How does DuckDB compare to SQLite?

SQLite is optimized for transactional (OLTP) workloads with many small reads and writes. DuckDB is optimized for analytical (OLAP) workloads with complex aggregations over large datasets. Both are embedded (no server) and store data in a single file. Choose DuckDB for data analysis and SQLite for application state.

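Can DuckDB query remote files?

Yes. With the httpfs extension, DuckDB reads files over HTTP(S) and from S3-compatible object storage, including GCS through its S3-compatible API; Azure Blob Storage is handled by the separate azure extension. For Parquet, DuckDB fetches only the byte ranges it needs, using predicate pushdown and column pruning to minimize data transfer. Remote CSV files can also be queried, though they generally have to be read in full.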

How large a dataset can DuckDB handle?

DuckDB handles datasets larger than available RAM using disk-based spilling. It processes data in chunks, so datasets of hundreds of gigabytes work on machines with modest memory. For truly massive datasets (terabytes), use ClickHouse or a distributed query engine.

Does DuckDB support extensions?

Yes. DuckDB has extensions for spatial queries (with PostGIS-style ST_ functions), full-text search, JSON, Excel, Parquet, Iceberg, Delta Lake, httpfs (remote file access), and more. Install extensions with `INSTALL extension_name; LOAD extension_name;`.

Can I use DuckDB with pandas?

Yes. DuckDB integrates tightly with pandas. Query DataFrames directly with SQL (just reference the variable name in the FROM clause), and get results back as DataFrames with .fetchdf(). This lets you use SQL for complex transformations while staying in the pandas ecosystem.
