Scripts · Apr 16, 2026 · 3 min read

DataHub — Open-Source Data Discovery & Governance Platform

DataHub is a modern metadata platform for discovering, governing, and observing your data stack. Originally built at LinkedIn and now maintained as an open-source project with Acryl Data as its main commercial backer, it unifies metadata from warehouses, lakes, dashboards, and ML pipelines into one searchable catalog.

Introduction

DataHub gives a data team one place to look up metadata for the whole stack. It was built inside LinkedIn to manage thousands of datasets and was later open-sourced so other organizations could apply it to the same problem. It ingests metadata automatically, tracks lineage end to end, and lets teams apply governance policies across the catalog.

What DataHub Does

  • Automatically ingests metadata from 50+ sources (Snowflake, BigQuery, dbt, Airflow, Spark, Kafka, and more)
  • Renders column-level lineage across warehouses, ETL jobs, and dashboards
  • Enables fine-grained access policies, tags, glossary terms, and domains
  • Provides search with faceted filters and relevance ranking
  • Supports real-time metadata changes via its stream-first architecture

Architecture Overview

DataHub is built on a stream-first design. The Java-based GMS (Generalized Metadata Service) writes each change to a MySQL or PostgreSQL primary store and emits a metadata change log (MCL) event to Kafka. Consumers of that log update an Elasticsearch index for search and a Neo4j or Elasticsearch graph for lineage. The frontend is a React app that talks to a GraphQL gateway backed by GMS. Ingestion runs as pluggable Python recipes that push metadata through the REST API or Kafka.
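
The stream-first idea can be illustrated with a toy model in plain Python. This is a minimal sketch, not DataHub code: `ChangeEvent`, `ToyMetadataService`, and every field name are hypothetical, a list stands in for the Kafka topic, and dictionaries stand in for the SQL store, the search index, and the lineage graph. The order in which the store and the log are written is a simplification made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    """One entry in a toy change log (hypothetical shape, not DataHub's schema)."""
    entity: str                                      # e.g. a dataset name
    description: str = ""                            # free text to make searchable
    upstreams: list = field(default_factory=list)    # entities this one reads from


class ToyMetadataService:
    """Records a change, then lets indexers catch up by reading the log."""

    def __init__(self):
        self.primary_store = {}                  # stands in for MySQL/PostgreSQL
        self.change_log = []                     # stands in for a Kafka topic
        self.search_index = defaultdict(set)     # token -> entity names
        self.lineage_graph = defaultdict(set)    # upstream -> downstream names
        self._offset = 0                         # next log position to index

    def write(self, event):
        # Store the latest state and append the change to the log.
        self.primary_store[event.entity] = event
        self.change_log.append(event)

    def run_indexers(self):
        # Process only the log entries that have not been indexed yet.
        for event in self.change_log[self._offset:]:
            text = f"{event.entity} {event.description}".lower()
            for token in text.split():
                self.search_index[token].add(event.entity)
            for upstream in event.upstreams:
                self.lineage_graph[upstream].add(event.entity)
        self._offset = len(self.change_log)

    def search(self, term):
        # Exact-token lookup; a real search index also does ranking and facets.
        return sorted(self.search_index.get(term.lower(), set()))
```

In this sketch a write is not searchable until `run_indexers()` has consumed it from the log. That is the trade-off a log-driven design makes: the indexes are derived views that lag the log slightly, and they can be rebuilt by replaying it.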

Self-Hosting & Configuration

  • Deploy via Docker Compose for evaluation or Helm chart for production Kubernetes clusters
  • Configure ingestion recipes in YAML pointing to each source (warehouse, BI tool, orchestrator)
  • Customize authentication with OIDC providers (Okta, Azure AD, Google)
  • Tune Elasticsearch and Kafka settings for large-scale deployments
  • Set up actions (Slack alerts, auto-tagging) through the Actions framework
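
An ingestion recipe pairs one source with one sink. The sketch below builds that structure as a Python dictionary and prints it. `build_recipe` is a hypothetical helper, the Snowflake config keys are illustrative placeholders that have not been checked against the connector's schema, and `http://localhost:8080` is an assumed GMS address for a local evaluation deployment.

```python
import json


def build_recipe(source_type, source_config, gms_server="http://localhost:8080"):
    """Return a recipe dict with one source and a REST sink (hypothetical helper)."""
    if not source_type:
        raise ValueError("source_type must be a non-empty string")
    return {
        "source": {"type": source_type, "config": dict(source_config)},
        "sink": {"type": "datahub-rest", "config": {"server": gms_server}},
    }


recipe = build_recipe(
    "snowflake",
    {
        # Placeholder keys and values; consult the connector docs for real ones.
        "account_id": "my_account",
        "username": "${SNOWFLAKE_USER}",
        "password": "${SNOWFLAKE_PASSWORD}",
    },
)

# Print the structure for inspection; the same nesting applies in a YAML file.
print(json.dumps(recipe, indent=2))
```

Keeping credentials as `${...}` references rather than literal values is a common way to keep secrets out of a recipe that is checked into version control. Whether a given deployment resolves them from the environment is worth confirming before relying on it.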

Key Features

  • Real-time metadata ingestion with change-based architecture
  • Column-level lineage across the entire data stack
  • Business glossary and domain management for governance
  • Impact analysis that shows downstream effects before a schema change
  • Fine-grained RBAC with policy-based access control
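
Impact analysis amounts to walking the lineage graph downstream from the entity that is about to change. The sketch below does this with a breadth-first traversal over a hand-written graph. It is a minimal illustration, not DataHub's implementation: `downstream_impact` and the entity names are hypothetical.

```python
from collections import deque


def downstream_impact(lineage, changed):
    """Return every entity reachable downstream of `changed`, nearest first."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:      # skip nodes already reached via another path
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage: upstream entity -> list of direct downstream entities.
lineage = {
    "warehouse.orders": ["dbt.orders_clean"],
    "dbt.orders_clean": ["dbt.revenue_daily", "dashboard.ops"],
    "dbt.revenue_daily": ["dashboard.finance"],
}

print(downstream_impact(lineage, "warehouse.orders"))
```

Breadth-first order lists the closest dependents first, which is usually the order in which owners should be notified about a change.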

Comparison with Similar Tools

  • Amundsen — simpler catalog UI but lacks real-time ingestion and governance features
  • OpenMetadata — similar scope but younger community and fewer production references
  • Atlan — commercial catalog with polished UX; DataHub is the open-source alternative
  • Apache Atlas — Hadoop-centric and less actively maintained outside Cloudera
  • Marquez — focused on lineage only; DataHub covers search, governance, and observability too

FAQ

Q: How does DataHub differ from a data catalog? A: DataHub is a metadata platform that includes cataloging, lineage, governance, and observability. Traditional catalogs focus on search alone.

Q: Can DataHub handle thousands of datasets? A: Yes. It was designed for LinkedIn's scale, with millions of metadata entities, and is used in production by companies like Optum, Saxo Bank, and Grofers.

Q: Does it replace dbt docs? A: No, it complements dbt by ingesting dbt metadata and rendering it alongside warehouse, dashboard, and pipeline context.

Q: What languages are ingestion recipes written in? A: Recipes are YAML configs. Custom sources and transformers are written in Python.
