DataHub — Open-Source Data Discovery & Governance Platform

Introduction

DataHub gives data teams one place to search, understand, and govern their metadata. It was built at LinkedIn to manage thousands of datasets and was later open-sourced so other organizations could use it for the same problem. It ingests metadata automatically, tracks lineage end to end, and enforces governance policies at scale.

What DataHub Does

Automatically ingests metadata from 50+ sources (Snowflake, BigQuery, dbt, Airflow, Spark, Kafka, and more)
Renders column-level lineage across warehouses, ETL jobs, and dashboards
Enables fine-grained access policies, tags, glossary terms, and domains
Provides a powerful search experience with faceted filters and ranking
Supports real-time metadata changes via its stream-first architecture

Architecture Overview

DataHub is built on a stream-first design. Writes are persisted to a MySQL or PostgreSQL primary store, and each change is published to a metadata change log (MCL) on Kafka. Consumers of that log update an Elasticsearch index for search and a Neo4j or Elasticsearch graph index for lineage. The frontend is a React app that talks to a GraphQL API backed by the Java-based GMS (Generalized Metadata Service). Ingestion runs as pluggable Python sources driven by recipes, which push metadata to GMS over REST or through Kafka.
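
The stream-first idea can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not DataHub's implementation: the class names, aspect names, and short identifiers are hypothetical, and in-memory dictionaries stand in for the SQL store, the Kafka-backed change log, the search index, and the lineage graph.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MetadataChange:
    """One change-log entry: an aspect of an entity was written."""
    urn: str        # entity identifier (shortened, illustrative form)
    aspect: str     # "properties" or "upstreams" (hypothetical aspect names)
    value: object


@dataclass
class ToyMetadataPlatform:
    # In-memory stand-ins for the real components (all assumptions).
    primary_store: dict = field(default_factory=dict)
    change_log: list = field(default_factory=list)
    search_index: dict = field(default_factory=lambda: defaultdict(set))
    lineage_graph: dict = field(default_factory=lambda: defaultdict(set))
    offset: int = 0  # position of the next unread change-log entry

    def write(self, change: MetadataChange) -> None:
        """Persist the change, then record it in the change log."""
        self.primary_store[(change.urn, change.aspect)] = change.value
        self.change_log.append(change)

    def consume(self) -> None:
        """Apply unread change-log entries to the derived indexes."""
        for change in self.change_log[self.offset:]:
            if change.aspect == "properties":
                description = str(change.value.get("description", ""))
                for token in description.lower().split():
                    self.search_index[token].add(change.urn)
            elif change.aspect == "upstreams":
                for upstream in change.value:
                    # Edge direction: upstream -> downstream.
                    self.lineage_graph[upstream].add(change.urn)
        self.offset = len(self.change_log)


platform = ToyMetadataPlatform()
platform.write(MetadataChange("urn:orders", "properties",
                              {"description": "Daily orders fact table"}))
platform.write(MetadataChange("urn:orders", "upstreams", ["urn:raw_orders"]))
platform.consume()
```

Because search and lineage are derived from the log rather than written directly, they can be rebuilt by replaying the log from offset zero, which is one reason a log-centric design suits real-time metadata updates.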

Self-Hosting & Configuration

Deploy via Docker Compose for evaluation or the Helm chart for production Kubernetes clusters
Configure ingestion recipes in YAML pointing to each source (warehouse, BI tool, orchestrator)
Customize authentication with OIDC providers (Okta, Azure AD, Google)
Tune Elasticsearch and Kafka settings for large-scale deployments
Set up actions (Slack alerts, auto-tagging) through the Actions framework
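
As a rough illustration of what an ingestion recipe contains, the sketch below builds one as a Python mapping and prints it as JSON, which YAML parsers generally accept. The top-level `source` and `sink` layout and the `datahub-rest` sink type follow DataHub's recipe format; the helper `build_recipe`, the Snowflake field names, the placeholder values, and the `http://localhost:8080` address are assumptions to check against the documentation for your source and deployment.

```python
import json


def build_recipe(source_type, source_config, gms_url="http://localhost:8080"):
    """Return a recipe mapping with one source and a REST sink.

    `build_recipe` is a hypothetical helper for illustration only.
    """
    return {
        "source": {"type": source_type, "config": dict(source_config)},
        "sink": {"type": "datahub-rest", "config": {"server": gms_url}},
    }


recipe = build_recipe(
    "snowflake",
    {
        # Field names are assumptions; verify them in the source's docs.
        "account_id": "my_account",           # placeholder value
        "username": "${SNOWFLAKE_USER}",      # expected to resolve from the environment
        "password": "${SNOWFLAKE_PASSWORD}",  # keep secrets out of the file
    },
)

# Print the recipe so it can be saved to a file and reviewed.
print(json.dumps(recipe, indent=2))
```

Saved to a file, a recipe like this would typically be run with the DataHub CLI (`datahub ingest -c <file>`). Keeping one recipe per source makes it easier to schedule and debug each ingestion independently.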

Key Features

Real-time metadata ingestion with a change-based architecture
Column-level lineage across the entire data stack
Business glossary and domain management for governance
Impact analysis shows downstream effects before schema changes
Fine-grained RBAC with policy-based access control
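
Impact analysis amounts to a traversal of the lineage graph: starting from the asset about to change, collect everything reachable downstream. The sketch below shows that idea with a breadth-first search over a small hand-written graph. The function name and asset names are hypothetical, and it does not call DataHub's API.

```python
from collections import deque


def downstream_impact(lineage, changed):
    """Return every asset reachable downstream of `changed`, breadth-first."""
    seen = {changed}
    impacted = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # guard against revisiting shared or cyclic paths
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Edges point from an upstream asset to its direct downstream consumers.
lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.revenue", "mart.customer_ltv"],
    "mart.revenue": ["dashboard.weekly_revenue"],
}

impacted = downstream_impact(lineage, "raw.orders")
print(impacted)
# ['staging.orders', 'mart.revenue', 'mart.customer_ltv', 'dashboard.weekly_revenue']
```

With column-level lineage the same traversal applies, with nodes representing individual columns rather than whole tables.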

Comparison with Similar Tools

Amundsen — simpler catalog UI but lacks real-time ingestion and governance features
OpenMetadata — similar scope but younger community and fewer production references
Atlan — commercial catalog with polished UX; DataHub is the open-source alternative
Apache Atlas — Hadoop-centric and less actively maintained outside Cloudera
Marquez — focused on lineage only; DataHub covers search, governance, and observability too

FAQ

Q: How does DataHub differ from a data catalog? A: DataHub is a metadata platform that combines cataloging with lineage, governance, and observability. Traditional catalogs focus mainly on search and discovery.

Q: Can DataHub handle thousands of datasets? A: Yes. It was designed for LinkedIn's scale, with millions of metadata entities, and is used in production by companies such as Optum, Saxo Bank, and Grofers.

Q: Does it replace dbt docs? A: No, it complements dbt by ingesting dbt metadata and rendering it alongside warehouse, dashboard, and pipeline context.

Q: What languages are ingestion recipes written in? A: Recipes are YAML configs. Custom sources and transformers are written in Python.
