Boîte à Outils Agent du Data Engineer
Dix actifs pour apprendre aux agents IA votre dialecte SQL, structure dbt, sémantique du warehouse et orchestration — pour qu'ils arrêtent d'halluciner des colonnes et livrent vraiment des modèles.
What's in this pack
This is the stack a working data engineer wires up so an AI agent can actually help — not the parade of "text-to-SQL demos" that fall over the moment they meet a real warehouse with 800 tables, three dialects, and a dbt project with macros nobody documented.
Every pick here does one specific job in the agent loop: give the agent schema awareness (the MCPs), give it model awareness (dbt-mcp on top of dbt/SQLMesh), give it orchestration awareness (Airflow), give it lineage awareness (DataHub), and finally give it the one tool it desperately needs but never has — dialect awareness (SQLGlot). When all five are in place, the agent stops inventing STRING_AGG against Snowflake and starts producing models that compile on the first try.
The ordering matters because each layer scaffolds the next. Connect the warehouse MCP before you touch dbt-mcp — there's no point teaching an agent your dbt manifest if it can't even read the underlying tables.
Install in this order (raw SQL → models → orchestration → quality/lineage)
- Postgres MCP Pro — Postgres MCP with safe SQL, EXPLAIN plans, and index recommendations. Start here even if Postgres isn't your warehouse, because your operational DB almost certainly is Postgres and that's where most agent-driven exploration begins.
- BigQuery MCP — BigQuery MCP with a protected mode for PHI/PII guardrails. If you're on GCP, this is the warehouse connection. The guardrails matter the moment a non-engineer asks the agent "show me top customers."
- Trino MCP — Trino/Presto MCP with OAuth 2.1, useful when you have a federated warehouse setup (Iceberg + S3 + the legacy MySQL nobody's killed yet) and want one MCP that fans out.
- MCP Toolbox for Databases — Google's open-source MCP that fronts a configurable set of databases for an agent in one server. Use this when you'd rather declare "these five sources, these tools" once than wire five separate MCPs.
- dbt — the modeling layer. Even if you already use it, install the asset for the agent's reference and so the agent knows your project follows dbt conventions (
models/,schema.yml,dbt_project.yml). - dbt-mcp — the MCP that exposes your dbt project context (models, semantic layer, docs) to the agent. This is the unlock: now the agent answers "what columns does
fct_ordershave?" by reading your manifest instead of guessing. - SQLMesh — the dbt alternative worth knowing about. Virtual data environments, real column-level lineage, semantic versioning. Install if you're greenfield or hitting dbt's limits; skip if your team is happily on dbt and stable.
- Apache Airflow — orchestration. Even with Dagster, Prefect, and the rest competing, Airflow is still where most production data pipelines land. Give the agent the asset so it generates DAGs in the dialect your team actually runs.
- DataHub — discovery + lineage + governance. Once the agent can write SQL and dbt models, the next failure mode is "agent broke a downstream dashboard because it didn't know that table fed it." DataHub solves the visibility gap.
- SQLGlot — pure-Python SQL parser and transpiler. The unsexy hero of this pack. When the agent learned SQL on Stack Overflow (read: half Snowflake, half Postgres) and you're on BigQuery, SQLGlot transpiles the query and your agent stops failing on dialect differences.
How they fit together
┌── Postgres MCP Pro ──┐
│ (op DB + safe SQL) │
└────────┬──────────────┘
│
BigQuery MCP ──┐ │ ┌── Trino MCP
(warehouse) │ │ │ (federated)
└────────┼───────┘
▼
MCP Toolbox for Databases
(one server, many sources)
│
▼
┌─── dbt / SQLMesh ───┐
│ (your model layer) │
└──────────┬───────────┘
│
▼
dbt-mcp
(agent reads manifest + semantic layer)
│
▼
┌──── Airflow ────┐
│ (DAGs schedule it)│
└────────┬─────────┘
│
▼
DataHub
(lineage + impact analysis)
│
▼
SQLGlot
(transpiles between dialects whenever the agent guesses wrong)
The critical join is MCP layer + dbt-mcp: the warehouse MCPs expose physical schemas, dbt-mcp exposes the logical model layer on top. With both, the agent can answer "which model produces this column, and what's the underlying SQL?" — the question that turns an agent from a toy into a teammate.
Tradeoffs you'll hit
- Airflow vs Dagster vs Prefect — Airflow has the deepest deployment footprint and the broadest hiring pool, so if you're a one-person data team in a larger org, you'll inherit it. Dagster has cleaner asset-oriented semantics and a much better dev loop. Prefect is the lightest weight. Default to Airflow unless you have a green field; the agent works fine with all three.
- dbt vs SQLMesh — dbt has the ecosystem (every BI tool integrates), but its incremental-model story is fragile, and column-level lineage requires paid Cloud or third-party tools. SQLMesh has virtual environments, true lineage, and semantic versioning out of the box, but smaller community. New project: try SQLMesh. Existing dbt project at scale: stay.
- ClickHouse vs BigQuery vs Snowflake — ClickHouse for real-time analytics where latency matters more than schema flexibility. BigQuery if you're on GCP and want zero-ops at the cost of slow joins. Snowflake if you want flexible compute and don't mind paying for it. The agent doesn't care, but the SQL it writes does — which is why SQLGlot is in this pack.
- One warehouse MCP vs MCP Toolbox — MCP Toolbox is one server for many sources, which is operationally simpler but a bigger blast radius if it breaks. Direct per-warehouse MCPs are more failure-isolated but more services to babysit. Start with the toolbox; split out when you need stricter access boundaries.
Common pitfalls
- LLM SQL hallucinations against real schemas — the failure mode is the agent inventing
customer_idinstead ofcust_id. The fix is connecting the MCP first, then asking the agent to query — never the other way round. If the MCP isn't connected, the agent will guess; if it is, the agent will read. - Dialect drift across copy-pasted Stack Overflow examples — agent generates Postgres-flavored SQL on a BigQuery project (
SUBSTRINGvsSUBSTR,||vsCONCAT, window function syntax). Add SQLGlot to your pipeline as a transpile step and the noise drops dramatically. - Treating dbt-mcp as a write tool too early — start in read-only. Let the agent read your manifest, answer questions, propose changes in PR form. Granting write/CLI permissions to an agent against a production dbt project before you've watched it work for a week is how you get 200 broken models on a Tuesday.
- Skipping lineage until something breaks — DataHub feels like overhead until the day an agent changes a column upstream and three dashboards silently start reporting wrong numbers. Install it before you need it; the cost of operating it is much smaller than the cost of finding out without it.
- Forgetting PHI/PII guardrails — the BigQuery MCP's protected mode exists for a reason. If your warehouse has any regulated data and the agent is exposed to anyone outside the data team, turn on column-level masking from day one. Don't wait for a compliance ticket.
10 ressources prêtes à installer
Questions fréquentes
Do I need all 10 of these, or is there a minimum viable subset to start with?
Minimum viable is three: one warehouse MCP (Postgres MCP Pro if your op DB is Postgres, otherwise the BigQuery MCP if you're on GCP), dbt-mcp on top of your existing dbt project, and SQLGlot for dialect translation. With those three an agent can read real schemas, understand your models, and translate when it guesses wrong. Airflow, DataHub, SQLMesh, and the rest layer in as your stack grows — install them on the weeks they become necessary, not preemptively.
Why install both a warehouse MCP and MCP Toolbox for Databases — isn't that redundant?
Slightly, but intentionally. Direct per-warehouse MCPs (Postgres MCP Pro, BigQuery MCP, Trino MCP) give you tight per-source controls and clearer failure isolation. MCP Toolbox for Databases gives you one declarative server that fronts several sources for the agent. Most teams end up running the toolbox for read-heavy exploration and a dedicated MCP for the one warehouse where you care most about safety guardrails (usually the production OLTP Postgres). Pick based on operational preference.
How does dbt-mcp actually reduce hallucinations — what does it expose to the agent?
dbt-mcp gives the agent structured access to your dbt project artifacts: model definitions, schema.yml docs, semantic layer queries, and (optionally) compiled SQL. Instead of guessing what columns a model has, the agent reads the manifest. Instead of inventing metric names, it reads your semantic layer. The agent's answer is grounded in real entities you've already defined — which is the whole game when an LLM is querying data it wasn't trained on.
If I'm starting greenfield, should I just skip dbt and go straight to SQLMesh?
Maybe. SQLMesh has cleaner semantics for virtual environments, column-level lineage, and incremental models — areas where dbt has known sharp edges. The tradeoff is ecosystem: dbt integrates with essentially every BI tool, hiring is much easier, and the community is enormous. For a small team that's data-savvy and starting fresh, SQLMesh is a reasonable bet. For a team that needs to onboard analysts quickly or integrate with existing tooling, stay on dbt and adopt SQLMesh later if you hit its limits.
Is connecting an agent to a production warehouse via MCP actually safe?
It can be, but the defaults matter. Use read-only credentials by default. Use the warehouse MCP's safety features (Postgres MCP Pro's safe SQL mode, BigQuery MCP's protected PHI/PII mode). Separate dev and prod targets so the agent can't accidentally write to prod. Require human confirmation for any DDL or DML. Audit what the agent queries — most warehouses already log this, you just need to look. The risk is not zero, but it's manageable if you treat the agent like a new junior engineer who hasn't earned write access yet.
12 packs · 80+ ressources sélectionnées
Découvrez tous les packs curatés sur la page d'accueil
Retour à tous les packs