TOKREPO · Arsenal IA
Nouveau · cette semaine

Boîte à Outils Agent du Data Engineer

Dix actifs pour apprendre aux agents IA votre dialecte SQL, structure dbt, sémantique du warehouse et orchestration — pour qu'ils arrêtent d'halluciner des colonnes et livrent vraiment des modèles.

10 ressources

What's in this pack

This is the stack a working data engineer wires up so an AI agent can actually help — not the parade of "text-to-SQL demos" that fall over the moment they meet a real warehouse with 800 tables, three dialects, and a dbt project with macros nobody documented.

Every pick here does one specific job in the agent loop: give the agent schema awareness (the MCPs), give it model awareness (dbt-mcp on top of dbt/SQLMesh), give it orchestration awareness (Airflow), give it lineage awareness (DataHub), and finally give it the one tool it desperately needs but never has — dialect awareness (SQLGlot). When all five are in place, the agent stops inventing STRING_AGG against Snowflake and starts producing models that compile on the first try.

The ordering matters because each layer scaffolds the next. Connect the warehouse MCP before you touch dbt-mcp — there's no point teaching an agent your dbt manifest if it can't even read the underlying tables.

Install in this order (raw SQL → models → orchestration → quality/lineage)

  1. Postgres MCP Pro — Postgres MCP with safe SQL, EXPLAIN plans, and index recommendations. Start here even if Postgres isn't your warehouse, because your operational DB almost certainly is Postgres and that's where most agent-driven exploration begins.
  2. BigQuery MCP — BigQuery MCP with a protected mode for PHI/PII guardrails. If you're on GCP, this is the warehouse connection. The guardrails matter the moment a non-engineer asks the agent "show me top customers."
  3. Trino MCP — Trino/Presto MCP with OAuth 2.1, useful when you have a federated warehouse setup (Iceberg + S3 + the legacy MySQL nobody's killed yet) and want one MCP that fans out.
  4. MCP Toolbox for Databases — Google's open-source MCP that fronts a configurable set of databases for an agent in one server. Use this when you'd rather declare "these five sources, these tools" once than wire five separate MCPs.
  5. dbt — the modeling layer. Even if you already use it, install the asset for the agent's reference and so the agent knows your project follows dbt conventions (models/, schema.yml, dbt_project.yml).
  6. dbt-mcp — the MCP that exposes your dbt project context (models, semantic layer, docs) to the agent. This is the unlock: now the agent answers "what columns does fct_orders have?" by reading your manifest instead of guessing.
  7. SQLMesh — the dbt alternative worth knowing about. Virtual data environments, real column-level lineage, semantic versioning. Install if you're greenfield or hitting dbt's limits; skip if your team is happily on dbt and stable.
  8. Apache Airflow — orchestration. Even with Dagster, Prefect, and the rest competing, Airflow is still where most production data pipelines land. Give the agent the asset so it generates DAGs in the dialect your team actually runs.
  9. DataHub — discovery + lineage + governance. Once the agent can write SQL and dbt models, the next failure mode is "agent broke a downstream dashboard because it didn't know that table fed it." DataHub solves the visibility gap.
  10. SQLGlot — pure-Python SQL parser and transpiler. The unsexy hero of this pack. When the agent learned SQL on Stack Overflow (read: half Snowflake, half Postgres) and you're on BigQuery, SQLGlot transpiles the query and your agent stops failing on dialect differences.

How they fit together

                  ┌── Postgres MCP Pro ──┐
                  │  (op DB + safe SQL)    │
                  └────────┬──────────────┘
                           │
   BigQuery MCP ──┐        │       ┌── Trino MCP
   (warehouse)    │        │       │  (federated)
                  └────────┼───────┘
                           ▼
              MCP Toolbox for Databases
                  (one server, many sources)
                           │
                           ▼
                  ┌─── dbt / SQLMesh ───┐
                  │  (your model layer)   │
                  └──────────┬───────────┘
                             │
                             ▼
                         dbt-mcp
              (agent reads manifest + semantic layer)
                             │
                             ▼
                    ┌──── Airflow ────┐
                    │ (DAGs schedule it)│
                    └────────┬─────────┘
                             │
                             ▼
                         DataHub
                  (lineage + impact analysis)
                             │
                             ▼
                         SQLGlot
        (transpiles between dialects whenever the agent guesses wrong)

The critical join is MCP layer + dbt-mcp: the warehouse MCPs expose physical schemas, dbt-mcp exposes the logical model layer on top. With both, the agent can answer "which model produces this column, and what's the underlying SQL?" — the question that turns an agent from a toy into a teammate.

Tradeoffs you'll hit

  • Airflow vs Dagster vs Prefect — Airflow has the deepest deployment footprint and the broadest hiring pool, so if you're a one-person data team in a larger org, you'll inherit it. Dagster has cleaner asset-oriented semantics and a much better dev loop. Prefect is the lightest weight. Default to Airflow unless you have a green field; the agent works fine with all three.
  • dbt vs SQLMesh — dbt has the ecosystem (every BI tool integrates), but its incremental-model story is fragile, and column-level lineage requires paid Cloud or third-party tools. SQLMesh has virtual environments, true lineage, and semantic versioning out of the box, but smaller community. New project: try SQLMesh. Existing dbt project at scale: stay.
  • ClickHouse vs BigQuery vs Snowflake — ClickHouse for real-time analytics where latency matters more than schema flexibility. BigQuery if you're on GCP and want zero-ops at the cost of slow joins. Snowflake if you want flexible compute and don't mind paying for it. The agent doesn't care, but the SQL it writes does — which is why SQLGlot is in this pack.
  • One warehouse MCP vs MCP Toolbox — MCP Toolbox is one server for many sources, which is operationally simpler but a bigger blast radius if it breaks. Direct per-warehouse MCPs are more failure-isolated but more services to babysit. Start with the toolbox; split out when you need stricter access boundaries.

Common pitfalls

  • LLM SQL hallucinations against real schemas — the failure mode is the agent inventing customer_id instead of cust_id. The fix is connecting the MCP first, then asking the agent to query — never the other way round. If the MCP isn't connected, the agent will guess; if it is, the agent will read.
  • Dialect drift across copy-pasted Stack Overflow examples — agent generates Postgres-flavored SQL on a BigQuery project (SUBSTRING vs SUBSTR, || vs CONCAT, window function syntax). Add SQLGlot to your pipeline as a transpile step and the noise drops dramatically.
  • Treating dbt-mcp as a write tool too early — start in read-only. Let the agent read your manifest, answer questions, propose changes in PR form. Granting write/CLI permissions to an agent against a production dbt project before you've watched it work for a week is how you get 200 broken models on a Tuesday.
  • Skipping lineage until something breaks — DataHub feels like overhead until the day an agent changes a column upstream and three dashboards silently start reporting wrong numbers. Install it before you need it; the cost of operating it is much smaller than the cost of finding out without it.
  • Forgetting PHI/PII guardrails — the BigQuery MCP's protected mode exists for a reason. If your warehouse has any regulated data and the agent is exposed to anyone outside the data team, turn on column-level masking from day one. Don't wait for a compliance ticket.
INSTALLER · UNE COMMANDE
$ tokrepo install pack/data-engineer-agent-toolbox
passez-la à votre agent — ou collez-la dans votre terminal
Ce qu'il contient

10 ressources prêtes à installer

MCP#01
dbt-mcp — dbt Context MCP Server for Agents

Give AI agents structured access to dbt project context and tools (SQL, semantic layer, docs search). Ships an experimental MCP bundle in releases.

by MCP Hub·107 views
$ tokrepo install dbt-mcp-dbt-context-mcp-server-for-agents
MCP#02
Postgres MCP Pro — Index Tuning + Safe SQL Tools

Postgres MCP Pro is an MCP server for PostgreSQL that runs safe SQL, explains plans, and recommends indexes so agents can tune databases faster.

by MCP Hub·68 views
$ tokrepo install postgres-mcp-pro-index-tuning-safe-sql-tools
MCP#03
BigQuery MCP — Protected Mode for PHI/PII Guardrails

BigQuery MCP runs BigQuery queries from Claude Desktop and can block sensitive columns in Protected Mode so PHI/PII never enters the LLM context.

by MCP Hub·56 views
$ tokrepo install bigquery-mcp-protected-mode-for-phi-pii-guardrails
MCP#04
Trino MCP — OAuth 2.1 + Query CLI for Data Warehouses

Trino MCP is a Go MCP server and CLI for Trino with OAuth 2.1 support, letting agents query catalogs and schemas with traceable user identity.

by MCP Hub·63 views
$ tokrepo install trino-mcp-oauth-2-1-query-cli-for-data-warehouses
Skill#05
MCP Toolbox for Databases — AI Agent Database Server

MCP Toolbox for Databases is an open-source MCP server by Google that gives AI agents secure, governed access to databases including PostgreSQL, MySQL, Spanner, BigQuery, and more.

by Script Depot·136 views
$ tokrepo install mcp-toolbox-databases-ai-agent-database-server-94e852dc
Skill#06
dbt — Data Build Tool for SQL Transformations

Open-source framework for modeling, testing, and documenting SQL transformations in the modern data warehouse.

by Script Depot·77 views
$ tokrepo install dbt-data-build-tool-sql-transformations-894f7271
Skill#07
SQLMesh — Scalable Data Transformation Framework for SQL

SQLMesh is an open-source data transformation framework that provides efficient, incremental builds, built-in data validation, and a virtual data environment system. It is backwards-compatible with dbt and designed to scale data pipelines without full table rebuilds.

by AI Open Source·70 views
$ tokrepo install sqlmesh-scalable-data-transformation-framework-sql-39a2e67f
Skill#08
Apache Airflow — Programmatic Workflow Orchestration Platform

Apache Airflow is the industry-standard platform for authoring, scheduling, and monitoring data workflows. Define DAGs in Python to orchestrate ETL pipelines, ML training, data processing, and any complex workflow with dependencies.

by Apache Software Foundation·130 views
$ tokrepo install apache-airflow-programmatic-workflow-orchestration-platform-00a6152f
Skill#09
DataHub — Open-Source Data Discovery & Governance Platform

DataHub is a modern metadata platform for discovering, governing, and observing your data stack. Built by LinkedIn and now a top-level project at Acryl Data, it unifies metadata from warehouses, lakes, dashboards, and ML pipelines into one searchable catalog.

by Script Depot·72 views
$ tokrepo install datahub-open-source-data-discovery-governance-platform-904ed27e
Skill#10
SQLGlot — SQL Parser, Transpiler & Optimizer in Pure Python

SQLGlot is a no-dependency Python library that parses, transpiles, and optimizes SQL across 20+ dialects. Convert queries between Snowflake, BigQuery, DuckDB, Spark, Postgres, and more without touching the database.

by Script Depot·156 views
$ tokrepo install sqlglot-sql-parser-transpiler-optimizer-pure-python-16b7b083
Questions fréquentes

Questions fréquentes

Do I need all 10 of these, or is there a minimum viable subset to start with?

Minimum viable is three: one warehouse MCP (Postgres MCP Pro if your op DB is Postgres, otherwise the BigQuery MCP if you're on GCP), dbt-mcp on top of your existing dbt project, and SQLGlot for dialect translation. With those three an agent can read real schemas, understand your models, and translate when it guesses wrong. Airflow, DataHub, SQLMesh, and the rest layer in as your stack grows — install them on the weeks they become necessary, not preemptively.

Why install both a warehouse MCP and MCP Toolbox for Databases — isn't that redundant?

Slightly, but intentionally. Direct per-warehouse MCPs (Postgres MCP Pro, BigQuery MCP, Trino MCP) give you tight per-source controls and clearer failure isolation. MCP Toolbox for Databases gives you one declarative server that fronts several sources for the agent. Most teams end up running the toolbox for read-heavy exploration and a dedicated MCP for the one warehouse where you care most about safety guardrails (usually the production OLTP Postgres). Pick based on operational preference.

How does dbt-mcp actually reduce hallucinations — what does it expose to the agent?

dbt-mcp gives the agent structured access to your dbt project artifacts: model definitions, schema.yml docs, semantic layer queries, and (optionally) compiled SQL. Instead of guessing what columns a model has, the agent reads the manifest. Instead of inventing metric names, it reads your semantic layer. The agent's answer is grounded in real entities you've already defined — which is the whole game when an LLM is querying data it wasn't trained on.

If I'm starting greenfield, should I just skip dbt and go straight to SQLMesh?

Maybe. SQLMesh has cleaner semantics for virtual environments, column-level lineage, and incremental models — areas where dbt has known sharp edges. The tradeoff is ecosystem: dbt integrates with essentially every BI tool, hiring is much easier, and the community is enormous. For a small team that's data-savvy and starting fresh, SQLMesh is a reasonable bet. For a team that needs to onboard analysts quickly or integrate with existing tooling, stay on dbt and adopt SQLMesh later if you hit its limits.

Is connecting an agent to a production warehouse via MCP actually safe?

It can be, but the defaults matter. Use read-only credentials by default. Use the warehouse MCP's safety features (Postgres MCP Pro's safe SQL mode, BigQuery MCP's protected PHI/PII mode). Separate dev and prod targets so the agent can't accidentally write to prod. Require human confirmation for any DDL or DML. Audit what the agent queries — most warehouses already log this, you just need to look. The risk is not zero, but it's manageable if you treat the agent like a new junior engineer who hasn't earned write access yet.

PLUS DANS L'ARSENAL

12 packs · 80+ ressources sélectionnées

Découvrez tous les packs curatés sur la page d'accueil

Retour à tous les packs