
DataHub — Open-Source Data Discovery & Governance Platform

DataHub is a modern metadata platform for discovering, governing, and observing your data stack. Originally built at LinkedIn and now maintained by Acryl Data, it unifies metadata from warehouses, lakes, dashboards, and ML pipelines into one searchable catalog.

TL;DR
Open-source metadata platform by LinkedIn for data discovery, governance, and lineage across 50+ sources.
§01

What it is

DataHub is a modern metadata platform originally built inside LinkedIn and now maintained by Acryl Data. It provides a single pane of glass for discovering, governing, and observing data assets across your entire stack. The platform ingests metadata automatically from 50+ sources including Snowflake, BigQuery, dbt, Airflow, Spark, and Kafka.

DataHub is built for data engineers, analysts, and platform teams who need to understand what data exists, where it flows, and who owns it. It replaces tribal knowledge with a searchable catalog that tracks lineage, enforces policies, and surfaces data quality signals.

§02

How it saves time or tokens

DataHub eliminates the 'ask around' pattern for finding data. Instead of messaging five people to locate a table or understand a column's meaning, you search the catalog. Column-level lineage tells you exactly which upstream jobs feed a dashboard metric, cutting incident investigation from hours to minutes. Automated ingestion recipes run on schedule, keeping the catalog current without manual effort.
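These lookups are also scriptable. As a rough sketch, assuming the DataHubGraph client and its get_urns_by_filter helper shipped in recent acryl-datahub releases:

from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

# Point the client at the metadata service (the quickstart's GMS port).
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Full-text search for datasets matching "orders", printing their URNs.
for urn in graph.get_urns_by_filter(entity_types=["dataset"], query="orders"):
    print(urn)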

§03

How to use

  1. Install the CLI and start DataHub locally:
python3 -m pip install --upgrade acryl-datahub
datahub docker quickstart
  2. Access the UI at http://localhost:9002 with the default credentials (datahub/datahub).
  3. Create an ingestion recipe YAML for your data source and run it:
datahub ingest -c my_recipe.yml
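
The same recipe can also run from Python, which is convenient inside an orchestrator such as Airflow. A minimal sketch using the Pipeline API from acryl-datahub, with the connection details as placeholders:

import os

from datahub.ingestion.run.pipeline import Pipeline

# The contents of my_recipe.yml, expressed as a config dict.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": "localhost:5432",
                "database": "analytics",
                "username": "datahub",
                # Read the secret from the environment instead of hard-coding it.
                "password": os.environ["POSTGRES_PASSWORD"],
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if the run reported errors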
§04

Example

A basic ingestion recipe for a PostgreSQL database:

source:
  type: postgres
  config:
    host_port: 'localhost:5432'
    database: 'analytics'
    username: 'datahub'
    password: '${POSTGRES_PASSWORD}'
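    # Optional but recommended on large databases: scope the run with regex
    # allow/deny patterns (schema_pattern and table_pattern on SQL sources).
    schema_pattern:
      allow:
        - 'public'
    table_pattern:
      deny:
        - '.*_tmp$'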

sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'

This recipe connects to a Postgres instance, extracts table schemas, column descriptions, and relationships, then pushes the metadata to your DataHub instance.
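
Pull-based recipes aren't the only way in; the same metadata service also accepts pushes from the Python emitter. A minimal sketch against the quickstart endpoint (analytics.public.orders is a stand-in table name):

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# URN for one of the Postgres tables the recipe above ingested.
urn = make_dataset_urn(platform="postgres", name="analytics.public.orders", env="PROD")

# Attach a human-written description on top of the harvested schema.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=urn,
        aspect=DatasetPropertiesClass(description="Orders fact table, loaded nightly."),
    )
)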

§05

Common pitfalls

  • The Docker quickstart requires Docker Compose and allocates significant memory. Ensure at least 8GB available RAM before starting.
  • Ingestion recipes need explicit schema filtering for large warehouses; without filters, an initial run can take hours on databases with thousands of tables. The schema_pattern and table_pattern lines in the example recipe show how to scope it.
  • DataHub uses Kafka internally for its stream-first architecture. In production deployments, monitoring Kafka lag is critical for metadata freshness.
  • Check the release notes and migration guides for version-specific changes before upgrading a production deployment.
  • For team deployments, keep ingestion recipes in version control and agree on ownership, tagging, and naming conventions so the catalog stays consistent across teams.

Frequently Asked Questions

What data sources does DataHub support?

DataHub supports 50+ sources including Snowflake, BigQuery, Redshift, dbt, Airflow, Spark, Kafka, Tableau, Looker, PostgreSQL, MySQL, and many more. Each source is a dedicated ingestion plugin, installed as a Python extra (for example, pip install 'acryl-datahub[snowflake]'), that extracts schemas, lineage, and usage statistics.

Can DataHub track column-level lineage?

Yes. DataHub renders column-level lineage across warehouses, ETL jobs, and dashboards. This means you can trace a specific column in a dashboard back through transformation layers to its original source table.
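
As a rough sketch of how lineage edges are written programmatically: the Python SDK's make_lineage_mce helper emits table-level lineage, while column-level edges go through the fineGrainedLineages field of the same upstreamLineage aspect. Table names below are stand-ins:

from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Declare that two staging tables feed one reporting table.
lineage = make_lineage_mce(
    upstream_urns=[
        make_dataset_urn("postgres", "analytics.public.stg_orders"),
        make_dataset_urn("postgres", "analytics.public.stg_customers"),
    ],
    downstream_urn=make_dataset_urn("postgres", "analytics.public.fct_orders"),
)
emitter.emit(lineage)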

How does DataHub compare to open-source alternatives like Amundsen?

Both are open-source data catalogs, but only DataHub originated at LinkedIn; Amundsen was created at Lyft. DataHub has become the more actively maintained project, offering a stream-first architecture, richer lineage capabilities, and a larger plugin ecosystem for data source integration.

Is DataHub suitable for small teams?

Yes. The Docker quickstart runs the full platform locally. Small teams can start with a single-node deployment and scale to production when needed. The CLI-based ingestion makes it straightforward to add sources incrementally.

What is the architecture behind DataHub?

DataHub uses a stream-first design with Kafka for metadata change events, Elasticsearch for search, Neo4j or Elasticsearch for graph queries, and MySQL or PostgreSQL for primary storage. The frontend is React with a GraphQL gateway backed by a Java metadata service.
