Configs · May 21, 2026 · 1 min read

Amundsen — Open-Source Data Discovery and Metadata Platform

A data discovery and metadata engine by LF AI & Data that helps data teams find, understand, and trust their data assets.

Universal CLI install command
npx tokrepo install bb17c50e-5551-11f1-9bc6-00163e2b0d79

Introduction

Amundsen is a data discovery and metadata platform originally built at Lyft and now maintained under the LF AI & Data Foundation. It helps data engineers, analysts, and scientists find the right datasets by providing a search interface, data lineage, ownership tracking, and usage statistics across an organization's data warehouses and data lakes.

What Amundsen Does

  • Indexes metadata from databases, warehouses, dashboards, and feature stores into a searchable catalog
  • Ranks search results by usage popularity and relevance signals
  • Tracks table and column-level lineage across data pipelines
  • Displays data owners, descriptions, tags, and freshness badges
  • Integrates with Airflow, dbt, Spark, and other tools to ingest metadata automatically

Architecture Overview

Amundsen consists of three microservices: a frontend service (a Flask application), a search service backed by Elasticsearch, and a metadata service backed by either Neo4j (a graph database) or Apache Atlas. Databuilder is a separate ETL framework that extracts metadata from source systems and loads it into the metadata and search stores. The frontend never reads those stores directly; it calls the search and metadata services over their REST APIs.
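The service layout described above can be sketched in a few lines of Python. This is only an illustration: the localhost ports, the `AmundsenServices` class, and the `/table/<key>` path are assumptions made for this example, so check them against the API documentation of the version you deploy. The `database://cluster.schema/table` key format is the convention Amundsen uses to identify tables.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class AmundsenServices:
    """Base URLs for the three services (hypothetical local defaults)."""
    frontend: str = "http://localhost:5000"
    search: str = "http://localhost:5001"
    metadata: str = "http://localhost:5002"


def table_key(database: str, cluster: str, schema: str, table: str) -> str:
    """Build a table key in the database://cluster.schema/table form."""
    return f"{database}://{cluster}.{schema}/{table}"


def table_detail_url(services: AmundsenServices, key: str) -> str:
    """URL the frontend might request from the metadata service.

    The /table/<key> path is an assumption for illustration.
    """
    return f"{services.metadata}/table/{quote(key, safe=':/')}"


services = AmundsenServices()
key = table_key("hive", "gold", "core", "rides")
print(table_detail_url(services, key))
```

Keeping the base URLs in one small object mirrors the idea that the frontend only needs to know where the two backend services live, not which store sits behind them.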

Self-Hosting & Configuration

  • Deploy with Docker Compose for quick evaluation or Helm charts for Kubernetes production setups
  • Configure Databuilder extractors to connect to your Hive, PostgreSQL, BigQuery, Snowflake, or Redshift sources
  • Choose Neo4j or Apache Atlas as the metadata graph backend depending on your infrastructure
  • Set up Airflow DAGs to run Databuilder jobs on a schedule for continuous metadata ingestion
  • Customize the frontend with environment variables for branding, authentication, and feature flags
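To make the extract-and-load flow behind a Databuilder job more concrete, here is a minimal standard-library sketch. It does not use the Databuilder API: `extract_tables`, `load`, and the in-memory stores are hypothetical stand-ins, and an in-memory SQLite database plays the role of the source system. A real job would use Databuilder's own extractor, loader, and publisher classes instead.

```python
import sqlite3
from typing import Dict, Iterator, List


def extract_tables(conn: sqlite3.Connection) -> Iterator[Dict[str, object]]:
    """Extractor stand-in: read table and column names from the source catalog."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (name,) in rows:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        columns = [col[1] for col in conn.execute(f"PRAGMA table_info('{name}')")]
        yield {"table": name, "columns": columns}


def load(records: Iterator[Dict[str, object]],
         metadata_store: Dict[str, dict],
         search_index: List[dict]) -> None:
    """Loader stand-in: write each record to both the metadata and search stores."""
    for record in records:
        key = f"sqlite://memory.main/{record['table']}"
        metadata_store[key] = record
        search_index.append(
            {"key": key, "terms": [record["table"], *record["columns"]]}
        )


# A toy source system with two tables.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE rides (ride_id INTEGER, city TEXT)")
source.execute("CREATE TABLE drivers (driver_id INTEGER)")

metadata_store: Dict[str, dict] = {}
search_index: List[dict] = []
load(extract_tables(source), metadata_store, search_index)
print(sorted(metadata_store))
```

Scheduling this kind of job from Airflow amounts to running the extract-and-load step on a timer so the catalog stays in sync with the sources.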

Key Features

  • Popularity-based search ranking surfaces the most-used tables first
  • Column-level descriptions and tags help analysts understand schema semantics
  • Data preview shows sample rows without leaving the catalog UI
  • Programmatic descriptions allow dbt or Airflow to push documentation automatically
  • Badge system highlights certified, deprecated, or PII-containing datasets
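The idea behind popularity-based ranking can be shown with a toy scorer. This is not Amundsen's actual scoring, which is configured in its Elasticsearch-backed search service; the `TableDoc` type, the `score` formula, and the sample usage counts below are invented for illustration. The sketch boosts text matches by a logarithm of usage so that heavily queried tables rise to the top without drowning out relevance entirely.

```python
import math
from typing import List, NamedTuple


class TableDoc(NamedTuple):
    name: str
    description: str
    usage_count: int  # e.g. number of queries against the table (made-up data)


def score(doc: TableDoc, query: str) -> float:
    """Combine a simple term-match count with a log-scaled usage boost."""
    terms = query.lower().split()
    text = f"{doc.name} {doc.description}".lower()
    matches = sum(1 for term in terms if term in text)
    if matches == 0:
        return 0.0
    return matches * (1.0 + math.log1p(doc.usage_count))


def search(docs: List[TableDoc], query: str) -> List[str]:
    """Return names of matching tables, highest score first."""
    scored = [(score(doc, query), doc) for doc in docs]
    hits = [(s, doc) for s, doc in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc.name for _, doc in hits]


docs = [
    TableDoc("rides_tmp", "scratch copy of rides", 3),
    TableDoc("rides_daily", "daily rides aggregate", 1200),
    TableDoc("drivers", "driver profiles", 500),
]
print(search(docs, "rides"))
```

The logarithm is a design choice in this sketch: it keeps a table with a thousand times more usage from scoring a thousand times higher.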

Comparison with Similar Tools

  • DataHub: Originally built at LinkedIn, DataHub is a more recent metadata platform with a richer UI and fine-grained access control; Amundsen is lighter, simpler to deploy, and focused on search and discovery
  • Apache Atlas: Atlas focuses on governance and lineage for Hadoop; Amundsen adds a discovery-first search experience
  • OpenMetadata: OpenMetadata is a newer all-in-one platform; Amundsen has a longer production track record at Lyft scale
  • Marquez: Marquez is a lineage-focused metadata service; Amundsen provides a full search and catalog UI

FAQ

Q: What databases can Amundsen index? A: Amundsen supports Hive, PostgreSQL, MySQL, Redshift, BigQuery, Snowflake, Presto, Delta Lake, and many others through Databuilder extractors.

Q: Does Amundsen support data lineage? A: Yes. Amundsen displays table-level and column-level lineage when the metadata is ingested from tools like Airflow or dbt.

Q: Can I add custom metadata to tables? A: Yes. You can add tags, descriptions, owners, and badges both through the UI and programmatically via the metadata API.
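As a rough sketch of what a programmatic update might look like, the snippet below builds (but does not send) two HTTP requests with the standard library. The base URL, the `/tag/<tag>` and `/description` paths, and the JSON body shape are assumptions modeled on a typical REST layout; confirm them against the metadata service's API documentation for your version before relying on them.

```python
import json
from urllib.parse import quote
from urllib.request import Request

METADATA_BASE = "http://localhost:5002"  # assumed local metadata service address


def tag_request(table_key: str, tag: str) -> Request:
    """Build a PUT request that would attach a tag to a table (path is assumed)."""
    url = f"{METADATA_BASE}/table/{quote(table_key, safe=':/')}/tag/{quote(tag)}"
    return Request(url, method="PUT")


def description_request(table_key: str, description: str) -> Request:
    """Build a PUT request that would set a table description (path is assumed)."""
    url = f"{METADATA_BASE}/table/{quote(table_key, safe=':/')}/description"
    body = json.dumps({"description": description}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


tag_req = tag_request("hive://gold.core/rides", "pii")
desc_req = description_request("hive://gold.core/rides", "One row per completed ride.")
print(tag_req.get_method(), tag_req.full_url)
print(desc_req.get_method(), desc_req.full_url)
```

Passing either object to `urllib.request.urlopen` would actually send it; that step is left out here so the example runs without a live service.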

Q: How does Amundsen handle authentication? A: Amundsen supports OIDC-based authentication and can integrate with your existing SSO provider.
