Introduction
DataHub gives data teams one place to search, understand, and govern their metadata. It was built at LinkedIn to manage thousands of datasets and was later open-sourced so other organizations could use it for the same problem. It ingests metadata automatically, tracks lineage end to end, and enforces governance policies at scale.
What DataHub Does
- Automatically ingests metadata from 50+ sources (Snowflake, BigQuery, dbt, Airflow, Spark, Kafka, and more)
- Renders column-level lineage across warehouses, ETL jobs, and dashboards
- Supports fine-grained access policies, tags, glossary terms, and domains
- Provides search with faceted filters and relevance ranking
- Supports real-time metadata changes via its stream-first architecture
Architecture Overview
DataHub is built on a stream-first design. A metadata change log (MCL) carried over Kafka feeds an Elasticsearch index for search, a Neo4j or Elasticsearch graph for lineage, and a MySQL or PostgreSQL store for primary persistence. The frontend is a React app that talks to a GraphQL gateway backed by the Java-based GMS (Generalized Metadata Service). Ingestion runs through a pluggable Python framework driven by recipes, which push metadata to GMS through its REST API or through Kafka.
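The fan-out described above can be illustrated with a toy model. This is a minimal sketch, not DataHub code: `MetadataChange`, `ChangeLog`, the handler functions, and the shortened URNs are all hypothetical names, plain Python containers stand in for Kafka, Elasticsearch, the graph store, and the SQL database, and the real ordering of persistence and log emission is not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataChange:
    """Hypothetical change event: one aspect of one entity was updated."""
    urn: str
    aspect: str
    value: dict


class ChangeLog:
    """In-memory stand-in for the Kafka-backed metadata change log."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, change):
        # Every subscriber sees every change, in publication order.
        for handler in self._subscribers:
            handler(change)


primary_store = {}                # stands in for MySQL/PostgreSQL
search_index = defaultdict(set)   # stands in for Elasticsearch: token -> URNs
lineage_graph = defaultdict(set)  # stands in for the graph: upstream -> downstreams


def persist(change):
    primary_store[(change.urn, change.aspect)] = change.value


def index_for_search(change):
    # Index each word of the description, if the aspect carries one.
    for token in str(change.value.get("description", "")).lower().split():
        search_index[token].add(change.urn)


def update_lineage(change):
    # Only lineage aspects contribute edges to the graph.
    if change.aspect == "upstreamLineage":
        for upstream in change.value.get("upstreams", []):
            lineage_graph[upstream].add(change.urn)


log = ChangeLog()
for handler in (persist, index_for_search, update_lineage):
    log.subscribe(handler)

log.publish(MetadataChange("urn:orders", "datasetProperties", {"description": "Daily orders"}))
log.publish(MetadataChange("urn:revenue", "upstreamLineage", {"upstreams": ["urn:orders"]}))
```

After the two `publish` calls, each store has picked up only the part of the change it cares about. That is the main appeal of the design: a new consumer can be added by subscribing to the log, without touching the producers.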
Self-Hosting & Configuration
- Deploy with Docker Compose for evaluation or with the Helm charts for production Kubernetes clusters
- Configure ingestion recipes in YAML pointing to each source (warehouse, BI tool, orchestrator)
- Customize authentication with OIDC providers (Okta, Azure AD, Google)
- Tune Elasticsearch and Kafka settings for large-scale deployments
- Set up actions (Slack alerts, auto-tagging) through the Actions framework
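To make the recipe step more concrete, here is a minimal sketch of a recipe's overall shape, written as a Python dictionary and serialized to JSON (which is also valid YAML). The top-level `source` and `sink` sections, each with a `type` and a `config`, follow the usual recipe layout. The specific Snowflake option names, the placeholder credentials, the server URL, and the `validate_recipe` helper are assumptions for illustration and should be checked against the documentation for the source in use.

```python
import json

# Hypothetical values throughout: account, credentials, and server URL are placeholders.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            "account_id": "example_account",      # assumed option name
            "username": "${SNOWFLAKE_USER}",      # resolved from the environment
            "password": "${SNOWFLAKE_PASSWORD}",
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}


def validate_recipe(candidate):
    """Check only the top-level shape; source-specific options are not validated."""
    problems = []
    for section in ("source", "sink"):
        block = candidate.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing section: {section}")
        elif "type" not in block:
            problems.append(f"{section} has no type")
    return problems


if __name__ == "__main__":
    print(validate_recipe(recipe))
    print(json.dumps(recipe, indent=2))
```

A recipe file of this shape would typically be run with `datahub ingest -c <recipe file>`. Keeping secrets as environment-variable references rather than literal values means the recipe itself can be committed to version control.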
Key Features
- Real-time metadata ingestion through the stream-first, change-log architecture
- Column-level lineage across the entire data stack
- Business glossary and domain management for governance
- Impact analysis that shows downstream effects before a schema change
- Fine-grained RBAC with policy-based access control
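Impact analysis comes down to a traversal of the lineage graph: start at the asset about to change and collect everything reachable downstream. The sketch below shows the idea with a breadth-first search over a hand-written edge map. The dataset names and the `impacted_by` helper are hypothetical, and this is not how DataHub implements the feature internally.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets in its list.
downstreams = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "marts.customers"],
    "marts.revenue": ["dashboard.finance"],
    "marts.customers": [],
}


def impacted_by(node, edges):
    """Return everything reachable downstream of node, in breadth-first order."""
    seen = {node}
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:  # guards against cycles and duplicate paths
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    print(impacted_by("staging.orders", downstreams))
```

Breadth-first order lists the nearest dependents first, which is usually the order in which a schema change would surface as breakage.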
Comparison with Similar Tools
- Amundsen — simpler catalog UI but lacks real-time ingestion and governance features
- OpenMetadata — similar scope but younger community and fewer production references
- Atlan — commercial catalog with polished UX; DataHub is the open-source alternative
- Apache Atlas — Hadoop-centric and less actively maintained outside Cloudera
- Marquez — focused on lineage only; DataHub covers search, governance, and observability too
FAQ
Q: How does DataHub differ from a data catalog? A: DataHub is a metadata platform that includes cataloging, lineage, governance, and observability. Traditional catalogs focus mainly on search and discovery.
Q: Can DataHub handle thousands of datasets? A: Yes. It was designed for LinkedIn-scale deployments with millions of metadata entities and is used in production by companies like Optum, Saxo Bank, and Grofers.
Q: Does it replace dbt docs? A: No, it complements dbt by ingesting dbt metadata and rendering it alongside warehouse, dashboard, and pipeline context.
Q: What languages are ingestion recipes written in? A: Recipes are YAML configs. Custom sources and transformers are written in Python.