# DataHub — Open-Source Data Discovery & Governance Platform

> DataHub is a modern metadata platform for discovering, governing, and observing your data stack. Created at LinkedIn and now maintained by Acryl Data and its open-source community, it unifies metadata from warehouses, lakes, dashboards, and ML pipelines into one searchable catalog.

## Quick Use

```bash
python3 -m pip install --upgrade acryl-datahub
datahub docker quickstart
# UI at http://localhost:9002, default login: datahub / datahub
datahub ingest -c my_recipe.yml
```

## Introduction

DataHub gives every data team a single pane of glass for metadata. It was born inside LinkedIn to manage thousands of datasets and was open-sourced to solve the same problem everywhere. It ingests metadata automatically, tracks lineage end-to-end, and enforces governance policies at scale.

## What DataHub Does

- Automatically ingests metadata from 50+ sources (Snowflake, BigQuery, dbt, Airflow, Spark, Kafka, and more)
- Renders column-level lineage across warehouses, ETL jobs, and dashboards
- Enables fine-grained access policies, tags, glossary terms, and domains
- Provides a powerful search experience with faceted filters and ranking
- Supports real-time metadata changes via its stream-first architecture

## Architecture Overview

DataHub is built on a stream-first design. A metadata change log (MCL) powered by Kafka feeds an Elasticsearch index for search, a Neo4j or Elasticsearch graph for lineage, and a MySQL or PostgreSQL store for primary persistence. The frontend is a React app that talks to a GraphQL gateway backed by the Java-based GMS (Generalized Metadata Service). Ingestion runs as pluggable Python recipes that push metadata to GMS over REST or through Kafka.
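The ingestion flow described above can be sketched as a minimal recipe: a source block describing where to read metadata from, and a sink block pointing at the DataHub REST endpoint. This is a hedged example, not a full reference — the database name, username, and environment variable below are hypothetical placeholders:

```yaml
# Minimal ingestion recipe: pull metadata from a Postgres source
# and push it to the local DataHub instance over REST.
# Connection values below are placeholders, not real credentials.
source:
  type: postgres
  config:
    host_port: localhost:5432
    database: analytics            # hypothetical database name
    username: datahub_reader       # hypothetical read-only user
    password: "${POSTGRES_PASSWORD}"
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080  # GMS endpoint from the quickstart
```

Saved as `my_recipe.yml`, this runs with `datahub ingest -c my_recipe.yml`.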
## Self-Hosting & Configuration

- Deploy via Docker Compose for evaluation, or the Helm chart for production Kubernetes clusters
- Configure ingestion recipes in YAML, one per source (warehouse, BI tool, orchestrator)
- Customize authentication with OIDC providers (Okta, Azure AD, Google)
- Tune Elasticsearch and Kafka settings for large-scale deployments
- Set up actions (Slack alerts, auto-tagging) through the Actions framework

## Key Features

- Real-time metadata ingestion with a change-based architecture
- Column-level lineage across the entire data stack
- Business glossary and domain management for governance
- Impact analysis that shows downstream effects before a schema change lands
- Fine-grained RBAC with policy-based access control

## Comparison with Similar Tools

- **Amundsen** — simpler catalog UI, but lacks real-time ingestion and governance features
- **OpenMetadata** — similar scope, but a younger community with fewer production references
- **Atlan** — commercial catalog with a polished UX; DataHub is the open-source alternative
- **Apache Atlas** — Hadoop-centric and less actively maintained outside Cloudera
- **Marquez** — focused on lineage only; DataHub also covers search, governance, and observability

## FAQ

**Q: How does DataHub differ from a data catalog?**
A: DataHub is a metadata platform that includes cataloging, lineage, governance, and observability. Traditional catalogs focus on search alone.

**Q: Can DataHub handle thousands of datasets?**
A: Yes. It was designed at LinkedIn scale with millions of metadata entities and is used in production by companies like Optum, Saxo Bank, and Grofers.

**Q: Does it replace dbt docs?**
A: No. It complements dbt by ingesting dbt metadata and rendering it alongside warehouse, dashboard, and pipeline context.

**Q: What languages are ingestion recipes written in?**
A: Recipes are YAML configs. Custom sources and transformers are written in Python.
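The auto-tagging and transformer behavior mentioned above can also live inside a recipe. A hedged sketch using the built-in `simple_add_dataset_tags` transformer to tag every ingested dataset — the Snowflake account and credential values are hypothetical placeholders:

```yaml
# Recipe with a transformer stage: every dataset ingested from
# Snowflake is tagged urn:li:tag:pii before reaching the sink.
# Account and credential values are placeholders.
source:
  type: snowflake
  config:
    account_id: my_account         # hypothetical account identifier
    username: datahub_reader       # hypothetical read-only user
    password: "${SNOWFLAKE_PASSWORD}"
transformers:
  - type: simple_add_dataset_tags
    config:
      tag_urns:
        - "urn:li:tag:pii"
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

Transformers run between source and sink, so the tags are attached before any metadata is written to GMS.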
## Sources

- https://github.com/datahub-project/datahub
- https://datahubproject.io/docs/