# DataHub — Open-Source Data Discovery & Governance Platform

> DataHub is a modern metadata platform for discovering, governing, and observing your data stack. Created at LinkedIn and now maintained by Acryl Data and its open-source community, it unifies metadata from warehouses, lakes, dashboards, and ML pipelines into one searchable catalog.

## Quick Use

```bash
python3 -m pip install --upgrade acryl-datahub
datahub docker quickstart
# UI at http://localhost:9002, default login: datahub / datahub
datahub ingest -c my_recipe.yml
```

## Introduction

DataHub gives every data team a single pane of glass for metadata. It was born inside LinkedIn to manage thousands of datasets and was open-sourced to solve the same problem everywhere. It ingests metadata automatically, tracks lineage end-to-end, and enforces governance policies at scale.

## What DataHub Does

- Automatically ingests metadata from 50+ sources (Snowflake, BigQuery, dbt, Airflow, Spark, Kafka, and more)
- Renders column-level lineage across warehouses, ETL jobs, and dashboards
- Enables fine-grained access policies, tags, glossary terms, and domains
- Provides a powerful search experience with faceted filters and ranking
- Supports real-time metadata changes via its stream-first architecture

## Architecture Overview

DataHub is built on a stream-first design. A metadata change log (MCL) powered by Kafka feeds an Elasticsearch index for search, a Neo4j or Elasticsearch graph for lineage, and a MySQL or PostgreSQL store for primary persistence. The frontend is a React app that talks to a GraphQL gateway backed by the Java-based GMS (Generalized Metadata Service). Ingestion runs as pluggable Python recipes that push metadata to GMS over REST or through Kafka.
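The ingestion flow described above can be sketched as a minimal recipe: a source block describing where to read metadata from, and a sink block pointing at the DataHub REST endpoint. This is a hedged example, not a full reference — the database name, username, and environment variable below are hypothetical placeholders:

```yaml
# Minimal ingestion recipe: pull metadata from a Postgres source
# and push it to the local DataHub instance over REST.
# Connection values below are placeholders, not real credentials.
source:
  type: postgres
  config:
    host_port: localhost:5432
    database: analytics            # hypothetical database name
    username: datahub_reader       # hypothetical read-only user
    password: "${POSTGRES_PASSWORD}"
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080  # GMS endpoint from the quickstart
```

Saved as `my_recipe.yml`, this runs with `datahub ingest -c my_recipe.yml`.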
## Self-Hosting & Configuration

- Deploy via Docker Compose for evaluation, or the Helm chart for production Kubernetes clusters
- Configure ingestion recipes in YAML, one per source (warehouse, BI tool, orchestrator)
- Customize authentication with OIDC providers (Okta, Azure AD, Google)
- Tune Elasticsearch and Kafka settings for large-scale deployments
- Set up actions (Slack alerts, auto-tagging) through the Actions framework

## Key Features

- Real-time metadata ingestion with a change-based architecture
- Column-level lineage across the entire data stack
- Business glossary and domain management for governance
- Impact analysis that shows downstream effects before a schema change lands
- Fine-grained RBAC with policy-based access control

## Comparison with Similar Tools

- **Amundsen** — simpler catalog UI, but lacks real-time ingestion and governance features
- **OpenMetadata** — similar scope, but a younger community with fewer production references
- **Atlan** — commercial catalog with a polished UX; DataHub is the open-source alternative
- **Apache Atlas** — Hadoop-centric and less actively maintained outside Cloudera
- **Marquez** — focused on lineage only; DataHub also covers search, governance, and observability

## FAQ

**Q: How does DataHub differ from a data catalog?**
A: DataHub is a metadata platform that includes cataloging, lineage, governance, and observability. Traditional catalogs focus on search alone.

**Q: Can DataHub handle thousands of datasets?**
A: Yes. It was designed at LinkedIn scale with millions of metadata entities and is used in production by companies like Optum, Saxo Bank, and Grofers.

**Q: Does it replace dbt docs?**
A: No. It complements dbt by ingesting dbt metadata and rendering it alongside warehouse, dashboard, and pipeline context.

**Q: What languages are ingestion recipes written in?**
A: Recipes are YAML configs. Custom sources and transformers are written in Python.
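The auto-tagging and transformer behavior mentioned above can also live inside a recipe. A hedged sketch using the built-in `simple_add_dataset_tags` transformer to tag every ingested dataset — the Snowflake account and credential values are hypothetical placeholders:

```yaml
# Recipe with a transformer stage: every dataset ingested from
# Snowflake is tagged urn:li:tag:pii before reaching the sink.
# Account and credential values are placeholders.
source:
  type: snowflake
  config:
    account_id: my_account         # hypothetical account identifier
    username: datahub_reader       # hypothetical read-only user
    password: "${SNOWFLAKE_PASSWORD}"
transformers:
  - type: simple_add_dataset_tags
    config:
      tag_urns:
        - "urn:li:tag:pii"
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

Transformers run between source and sink, so the tags are attached before any metadata is written to GMS.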
## Sources

- https://github.com/datahub-project/datahub
- https://datahubproject.io/docs/