# Amundsen: Open-Source Data Discovery and Metadata Platform

> A data discovery and metadata engine by LF AI & Data that helps data teams find, understand, and trust their data assets.

## Quick Use

```bash
git clone https://github.com/amundsen-io/amundsen.git
cd amundsen
docker-compose -f docker-amundsen.yml up -d
# Access the UI at http://localhost:5000
# Search for tables, dashboards, and data owners
```

## Introduction

Amundsen is a data discovery and metadata platform originally built at Lyft and now maintained under the LF AI & Data Foundation. It helps data engineers, analysts, and scientists find the right datasets by providing a search interface, data lineage, ownership tracking, and usage statistics across an organization's data warehouse and data lake.

## What Amundsen Does

- Indexes metadata from databases, warehouses, dashboards, and feature stores into a searchable catalog
- Ranks search results by usage popularity and relevance signals
- Tracks table-level and column-level lineage across data pipelines
- Displays data owners, descriptions, tags, and freshness badges
- Integrates with Airflow, dbt, Spark, and other tools to ingest metadata automatically

## Architecture Overview

Amundsen consists of three microservices: a frontend service (Flask), a search service backed by Elasticsearch, and a metadata service backed by either Neo4j or Apache Atlas. Databuilder is a separate ETL framework that extracts metadata from source systems and loads it into the metadata and search stores. The frontend communicates with the search and metadata services via REST APIs.
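
The data flow described in this section can be sketched in plain Python. This is a toy, in-memory illustration and not Databuilder's actual API: `TableMetadata`, `MetadataStore`, `SearchIndex`, and `run_ingestion` are hypothetical names invented for the example, and the ranking here sorts by a single usage count, whereas the real search service combines popularity with other relevance signals.

```python
from dataclasses import dataclass


@dataclass
class TableMetadata:
    """One table record, shaped roughly as an extractor might emit it (hypothetical)."""
    name: str
    description: str
    owner: str
    usage_count: int  # stand-in for a popularity signal, e.g. recent query count


class MetadataStore:
    """Stand-in for the metadata service's backend (Neo4j or Apache Atlas)."""

    def __init__(self):
        self.tables = {}

    def load(self, table):
        # Keyed by table name so a re-run overwrites the previous record.
        self.tables[table.name] = table


class SearchIndex:
    """Stand-in for the search service's Elasticsearch index."""

    def __init__(self):
        self.documents = []

    def load(self, table):
        self.documents.append(table)

    def search(self, term):
        # Naive substring match on name and description,
        # with the most-used tables returned first.
        term = term.lower()
        hits = [
            t for t in self.documents
            if term in t.name.lower() or term in t.description.lower()
        ]
        return sorted(hits, key=lambda t: t.usage_count, reverse=True)


def run_ingestion(records, metadata_store, search_index):
    """Mimic an ingestion job: load every extracted record into both stores."""
    for record in records:
        metadata_store.load(record)
        search_index.load(record)


# Example records (made-up data for illustration only).
records = [
    TableMetadata("staging.rides_raw", "Raw rides events", "bob", 8),
    TableMetadata("core.rides", "Completed rides per day", "alice", 120),
    TableMetadata("core.drivers", "Driver dimension table", "carol", 45),
]

store = MetadataStore()
index = SearchIndex()
run_ingestion(records, store, index)

for table in index.search("rides"):
    print(table.name, table.owner, table.usage_count)
```

The point of the sketch is the separation of concerns: one ingestion pass feeds two stores, one holding the full record for detail pages and one optimized for ranked search.
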
## Self-Hosting & Configuration

- Deploy with Docker Compose for quick evaluation or Helm charts for Kubernetes production setups
- Configure Databuilder extractors to connect to your Hive, PostgreSQL, BigQuery, Snowflake, or Redshift sources
- Choose Neo4j or Apache Atlas as the metadata backend depending on your infrastructure
- Set up Airflow DAGs to run Databuilder jobs on a schedule for continuous metadata ingestion
- Customize the frontend with environment variables for branding, authentication, and feature flags

## Key Features

- Popularity-based search ranking surfaces the most-used tables first
- Column-level descriptions and tags help analysts understand schema semantics
- Data preview shows sample rows without leaving the catalog UI
- Programmatic descriptions allow dbt or Airflow to push documentation automatically
- Badge system highlights certified, deprecated, or PII-containing datasets

## Comparison with Similar Tools

- **DataHub** (originally from LinkedIn): a more recent metadata platform with a richer UI and fine-grained access control; Amundsen is lighter, simpler to deploy, and focused on search and discovery
- **Apache Atlas**: focuses on governance and lineage for Hadoop; Amundsen adds a discovery-first search experience
- **OpenMetadata**: a newer all-in-one platform; Amundsen has a longer production track record, including at Lyft's scale
- **Marquez**: a lineage-focused metadata service; Amundsen provides a full search and catalog UI

## FAQ

**Q: What databases can Amundsen index?**
A: Amundsen supports Hive, PostgreSQL, MySQL, Redshift, BigQuery, Snowflake, Presto, Delta Lake, and many others through Databuilder extractors.

**Q: Does Amundsen support data lineage?**
A: Yes. Amundsen displays table-level and column-level lineage when the metadata is ingested from tools like Airflow or dbt.

**Q: Can I add custom metadata to tables?**
A: Yes.
You can add tags, descriptions, owners, and badges both through the UI and programmatically via the metadata API.

**Q: How does Amundsen handle authentication?**
A: Amundsen supports OIDC-based authentication and can integrate with your existing SSO provider.

## Sources

- https://github.com/amundsen-io/amundsen
- https://www.amundsen.io

---

Source: https://tokrepo.com/en/workflows/asset-bb17c50e
Author: AI Open Source