Scripts · April 16, 2026 · 1 min read

DataHub — Open-Source Data Discovery & Governance Platform

DataHub is a modern metadata platform for discovering, governing, and observing your data stack. Originally built at LinkedIn and now maintained as an open-source project with Acryl Data as its primary commercial backer, it unifies metadata from warehouses, lakes, dashboards, and ML pipelines into one searchable catalog.

Introduction

DataHub gives every data team a single pane of glass for metadata. It was born inside LinkedIn to manage thousands of datasets and was open-sourced to solve the same problem everywhere. It ingests metadata automatically, tracks lineage end-to-end, and enforces governance policies at scale.

What DataHub Does

  • Automatically ingests metadata from 50+ sources (Snowflake, BigQuery, dbt, Airflow, Spark, Kafka, and more)
  • Renders column-level lineage across warehouses, ETL jobs, and dashboards
  • Enables fine-grained access policies, tags, glossary terms, and domains
  • Provides full-text search with faceted filters and relevance ranking
  • Supports real-time metadata changes via its stream-first architecture

Architecture Overview

DataHub is built on a stream-first design. Writes land in a MySQL or PostgreSQL store for primary persistence, and each change is published to a Kafka-backed metadata change log (MCL). Consumers of that log update an Elasticsearch index for search and a Neo4j or Elasticsearch graph for lineage. The frontend is a React app that talks to a GraphQL gateway backed by the Java-based GMS (Generalized Metadata Service). Ingestion runs as pluggable Python sources, configured through YAML recipes, that push metadata through the REST API or Kafka.
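The stream-first idea can be illustrated with a toy, in-memory model. This sketch is not DataHub code: the `ChangeEvent` record, the `MetadataStore` class, and the URN strings are hypothetical stand-ins. A plain Python list plays the role of the Kafka-backed change log, and dictionaries play the roles of the primary store, the search index, and the lineage graph.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical stand-in for one entry in the metadata change log."""
    urn: str
    aspect: str      # "description" or "upstreams" in this toy model
    value: object


@dataclass
class MetadataStore:
    """Toy model: one append-only log and the views derived from it."""
    log: list = field(default_factory=list)
    primary: dict = field(default_factory=dict)  # stands in for MySQL/PostgreSQL
    search_index: dict = field(default_factory=lambda: defaultdict(set))
    upstreams: dict = field(default_factory=lambda: defaultdict(set))

    def emit(self, event):
        # Persist the latest value, record the change, then update derived views.
        self.primary[(event.urn, event.aspect)] = event.value
        self.log.append(event)
        if event.aspect == "description":
            for word in str(event.value).lower().split():
                self.search_index[word].add(event.urn)
        elif event.aspect == "upstreams":
            self.upstreams[event.urn].update(event.value)

    def search(self, term):
        # Exact-word lookup only; a real search index does far more.
        return sorted(self.search_index.get(term.lower(), set()))


store = MetadataStore()
store.emit(ChangeEvent("urn:example:orders", "description", "Daily orders fact table"))
store.emit(ChangeEvent("urn:example:orders", "upstreams", ("urn:example:raw_orders",)))
print(store.search("orders"))
```

The point of the design is that search and lineage are derived views: each one could be rebuilt by replaying the log, which is why new consumers can be added without changing the write path.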

Self-Hosting & Configuration

  • Deploy via Docker Compose for evaluation or Helm chart for production Kubernetes clusters
  • Configure ingestion recipes in YAML pointing to each source (warehouse, BI tool, orchestrator)
  • Customize authentication with OIDC providers (Okta, Azure AD, Google)
  • Tune Elasticsearch and Kafka settings for large-scale deployments
  • Set up actions (Slack alerts, auto-tagging) through the Actions framework
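As a rough sketch of what an ingestion recipe contains, the snippet below builds the same source/sink structure a YAML recipe would hold, as a Python dictionary. The helper names `build_recipe` and `run_recipe` are hypothetical, and the Postgres connection values and the `http://localhost:8080` server address are placeholder assumptions. Check the configuration keys against the documentation for the source you actually use. Running the recipe in-process assumes the `acryl-datahub` package, with the relevant source plugin, is installed.

```python
import os


def build_recipe(source_type, source_config, gms_server="http://localhost:8080"):
    """Return a recipe dict with the same shape as a YAML recipe file."""
    return {
        "source": {"type": source_type, "config": dict(source_config)},
        "sink": {"type": "datahub-rest", "config": {"server": gms_server}},
    }


def run_recipe(recipe):
    """Run a recipe in-process; needs the acryl-datahub package installed."""
    # Imported lazily so the recipe can be built and inspected without DataHub.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


# Placeholder connection details; replace with real values before running.
recipe = build_recipe(
    "postgres",
    {
        "host_port": "localhost:5432",
        "database": "analytics",
        "username": "datahub_reader",
        "password": os.environ.get("POSTGRES_PASSWORD", ""),
    },
)
```

Calling `run_recipe(recipe)` would then attempt the ingestion against a running GMS. Keeping the builder separate from the runner makes it easy to generate one recipe per source and review it before anything is pushed.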

Key Features

  • Real-time metadata ingestion built on a change-event architecture
  • Column-level lineage across the entire data stack
  • Business glossary and domain management for governance
  • Impact analysis that shows downstream effects before a schema change is made
  • Fine-grained RBAC with policy-based access control
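Impact analysis amounts to walking the lineage graph downstream from the asset you plan to change. The following is a minimal sketch of that idea using a breadth-first traversal over a hand-written adjacency map. It is not DataHub's implementation, and the node names and the `downstream_impact` function are made up for illustration.

```python
from collections import deque


def downstream_impact(lineage, start):
    """Breadth-first walk over downstream edges; returns affected nodes in visit order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, ()):
            if child not in seen:  # a node reachable by two paths is reported once
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage: each key maps to its direct downstream dependents.
lineage = {
    "warehouse.raw_orders": ["dbt.stg_orders"],
    "dbt.stg_orders": ["dbt.fct_orders", "dbt.dim_customers"],
    "dbt.fct_orders": ["dashboard.revenue"],
    "dbt.dim_customers": ["dashboard.revenue"],
}

affected = downstream_impact(lineage, "warehouse.raw_orders")
print(affected)
# ['dbt.stg_orders', 'dbt.fct_orders', 'dbt.dim_customers', 'dashboard.revenue']
```

Breadth-first order lists the nearest dependents first, which is usually the order in which owners need to be notified.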

Comparison with Similar Tools

  • Amundsen — simpler catalog UI but lacks real-time ingestion and governance features
  • OpenMetadata — similar scope but younger community and fewer production references
  • Atlan — commercial catalog with polished UX; DataHub is the open-source alternative
  • Apache Atlas — Hadoop-centric and less actively maintained outside Cloudera
  • Marquez — focused on lineage only; DataHub covers search, governance, and observability too

FAQ

Q: How does DataHub differ from a data catalog? A: DataHub is a metadata platform that includes cataloging, lineage, governance, and observability. Traditional catalogs focus mainly on search and discovery.

Q: Can DataHub handle thousands of datasets? A: Yes. It was designed at LinkedIn scale with millions of metadata entities and is used in production by companies like Optum, Saxo Bank, and Grofers.

Q: Does it replace dbt docs? A: No, it complements dbt by ingesting dbt metadata and rendering it alongside warehouse, dashboard, and pipeline context.

Q: What languages are ingestion recipes written in? A: Recipes are YAML configs. Custom sources and transformers are written in Python.
