Amundsen — Open-Source Data Discovery and Metadata Platform

Introduction

Amundsen is a data discovery and metadata platform originally built at Lyft and now maintained under the LF AI & Data Foundation. It helps data engineers, analysts, and scientists find the right datasets by providing a search interface, data lineage, ownership tracking, and usage statistics across an organization's data warehouses and data lakes.

What Amundsen Does

Indexes metadata from databases, warehouses, dashboards, and feature stores into a searchable catalog
Ranks search results by usage popularity and relevance signals
Tracks table and column-level lineage across data pipelines
Displays data owners, descriptions, tags, and freshness badges
Integrates with Airflow, dbt, Spark, and other tools to ingest metadata automatically
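
To make the lineage item above more concrete, here is a minimal sketch of how upstream lineage could be resolved once edges between tables are known. It is not Amundsen's implementation: the edge list, the table names, and the upstream helper are all hypothetical, and a real deployment would answer this question from the graph backend.

```python
from collections import deque


def upstream(edges, target):
    """Return every table that feeds `target`, directly or transitively."""
    # Build a child -> parents lookup from (source, destination) pairs.
    parents = {}
    for source, dest in edges:
        parents.setdefault(dest, []).append(source)

    # Breadth-first walk against the direction of data flow.
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


# Hypothetical lineage edges: (source table, destination table).
edges = [
    ("raw.events", "staging.rides"),
    ("staging.rides", "core.rides"),
    ("raw.drivers", "core.rides"),
    ("core.rides", "reports.daily_revenue"),
]
print(upstream(edges, "core.rides"))
```

Column-level lineage follows the same idea with columns as nodes instead of tables.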

Architecture Overview

Amundsen consists of three microservices: a frontend service (Flask), a search service backed by Elasticsearch, and a metadata service backed by a graph database (Neo4j or Apache Atlas). Databuilder is a separate ETL framework that extracts metadata from source systems and loads it into the metadata and search stores. The frontend communicates with the backend services via REST APIs.
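
The data flow described above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Databuilder's API: the TableMetadata record, the key format, and the extract, load, and search functions are hypothetical stand-ins, and two dictionaries play the roles of the metadata store and the search index.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """One table record as an extractor might emit it (hypothetical shape)."""
    database: str
    schema: str
    name: str
    description: str = ""
    columns: list = field(default_factory=list)

    @property
    def key(self) -> str:
        # Illustrative key format, not necessarily the one Amundsen uses.
        return f"{self.database}://{self.schema}/{self.name}"


def extract(source_rows):
    """Stand-in for an extractor: turn raw catalog rows into records."""
    for row in source_rows:
        yield TableMetadata(**row)


def load(records, metadata_store, search_index):
    """Stand-in for a loader: write each record to both stores."""
    for record in records:
        # The metadata store keeps the full record, keyed by table.
        metadata_store[record.key] = record
        # The search index keeps only the text that should be searchable.
        search_index[record.key] = " ".join(
            [record.name, record.description, *record.columns]
        ).lower()


def search(search_index, term):
    """Naive substring search standing in for the search service."""
    term = term.lower()
    return sorted(key for key, text in search_index.items() if term in text)


rows = [
    {"database": "hive", "schema": "core", "name": "rides",
     "description": "One row per completed ride", "columns": ["ride_id", "fare"]},
    {"database": "hive", "schema": "core", "name": "drivers",
     "description": "Driver profiles", "columns": ["driver_id", "rating"]},
]
metadata_store, search_index = {}, {}
load(extract(rows), metadata_store, search_index)
print(search(search_index, "fare"))
```

The point of the sketch is the separation of concerns: one ingestion pass feeds two stores, and the frontend would query each store through its own service.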

Self-Hosting & Configuration

Deploy with Docker Compose for quick evaluation or Helm charts for Kubernetes production setups
Configure Databuilder extractors to connect to your Hive, PostgreSQL, BigQuery, Snowflake, or Redshift sources
Choose Neo4j or Apache Atlas as the metadata graph backend depending on your infrastructure
Set up Airflow DAGs to run Databuilder jobs on a schedule for continuous metadata ingestion
Customize the frontend with environment variables for branding, authentication, and feature flags

Key Features

Popularity-based search ranking surfaces the most-used tables first
Column-level descriptions and tags help analysts understand schema semantics
Data preview shows sample rows without leaving the catalog UI
Programmatic descriptions allow dbt or Airflow to push documentation automatically
Badge system highlights certified, deprecated, or PII-containing datasets
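
As an illustration of popularity-based ranking, the sketch below multiplies a crude text-match score by a dampened usage count. The scoring formula, the field names, and the sample tables are assumptions made for this example; Amundsen's actual ranking is handled by its search service and is not reproduced here.

```python
import math


def score(table, query):
    """Combine a crude text match with a dampened usage count."""
    q = query.lower()
    name_hit = 1.0 if q in table["name"].lower() else 0.0
    desc_hit = 0.5 if q in table["description"].lower() else 0.0
    relevance = name_hit + desc_hit
    if relevance == 0:
        return 0.0
    # log1p dampens very large usage counts so they do not swamp relevance.
    return relevance * (1.0 + math.log1p(table["query_count"]))


def rank(tables, query):
    """Return matching table names, highest score first, ties broken by name."""
    scored = [(score(t, query), t["name"]) for t in tables]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for s, name in ordered if s > 0]


# Hypothetical catalog entries with usage counts.
tables = [
    {"name": "rides_raw", "description": "Unprocessed ride events", "query_count": 3},
    {"name": "rides", "description": "Cleaned ride facts", "query_count": 950},
    {"name": "drivers", "description": "Driver profiles", "query_count": 400},
]
print(rank(tables, "ride"))
```

Both rides tables match the query equally well on text, so the heavily queried one is listed first, which is the behavior the first feature above describes.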

Comparison with Similar Tools

DataHub — Originally built at LinkedIn, DataHub is a more recent metadata platform with a richer UI and fine-grained access control; Amundsen is lighter, simpler to deploy, and focused on search and discovery
Apache Atlas — Atlas focuses on governance and lineage for Hadoop; Amundsen adds a discovery-first search experience
OpenMetadata — OpenMetadata is a newer all-in-one platform; Amundsen has a longer production track record at Lyft-scale
Marquez — Marquez is a lineage-focused metadata service; Amundsen provides a full search and catalog UI

FAQ

Q: What databases can Amundsen index? A: Amundsen supports Hive, PostgreSQL, MySQL, Redshift, BigQuery, Snowflake, Presto, Delta Lake, and many others through Databuilder extractors.

Q: Does Amundsen support data lineage? A: Yes. Amundsen displays table-level and column-level lineage once lineage metadata has been ingested from tools like Airflow or dbt.

Q: Can I add custom metadata to tables? A: Yes. You can add tags, descriptions, owners, and badges both through the UI and programmatically via the metadata API.

Q: How does Amundsen handle authentication? A: Amundsen supports OIDC-based authentication and can integrate with your existing SSO provider.
