DataHub — Open-Source Data Discovery & Governance Platform
DataHub is a modern metadata platform for discovering, governing, and observing your data stack. Originally built at LinkedIn and now maintained by Acryl Data, it unifies metadata from warehouses, lakes, dashboards, and ML pipelines into one searchable catalog.
What it is
DataHub is a modern metadata platform originally built inside LinkedIn and now maintained by Acryl Data. It provides a single pane of glass for discovering, governing, and observing data assets across your entire stack. The platform ingests metadata automatically from 50+ sources including Snowflake, BigQuery, dbt, Airflow, Spark, and Kafka.
DataHub is built for data engineers, analysts, and platform teams who need to understand what data exists, where it flows, and who owns it. It replaces tribal knowledge with a searchable catalog that tracks lineage, enforces policies, and surfaces data quality signals.
How it saves time or tokens
DataHub eliminates the 'ask around' pattern for finding data. Instead of messaging five people to locate a table or understand a column's meaning, you search the catalog. Column-level lineage tells you exactly which upstream jobs feed a dashboard metric, cutting incident investigation from hours to minutes. Automated ingestion recipes run on schedule, keeping the catalog current without manual effort.
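Because recipes are plain CLI invocations, a basic scheduler is enough to keep the catalog fresh. As a minimal sketch, assuming a recipe stored at a hypothetical path, a crontab entry could rerun ingestion nightly (newer DataHub versions also offer scheduled ingestion from the UI):
# Hypothetical crontab entry: refresh Postgres metadata at 02:00 every night
0 2 * * * datahub ingest -c /etc/datahub/recipes/postgres.yml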
How to use
- Install the CLI and start DataHub locally:
python3 -m pip install --upgrade acryl-datahub
datahub docker quickstart
- Access the UI at http://localhost:9002 with default credentials (datahub/datahub).
- Create an ingestion recipe YAML for your data source and run it:
datahub ingest -c my_recipe.yml
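- Optionally, confirm the run landed by fetching an entity back through the CLI. The URN below is a hypothetical example for a table named analytics.public.orders:
datahub get --urn 'urn:li:dataset:(urn:li:dataPlatform:postgres,analytics.public.orders,PROD)'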
Example
A basic ingestion recipe for a PostgreSQL database:
source:
  type: postgres
  config:
    host_port: 'localhost:5432'
    database: 'analytics'
    username: 'datahub'
    password: '${POSTGRES_PASSWORD}'
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
This recipe connects to a Postgres instance, extracts table schemas, column descriptions, and relationships, then pushes the metadata to your DataHub instance.
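On large databases, an unfiltered run can crawl thousands of tables (see Common pitfalls below), so it is worth scoping the source up front. As a sketch, SQL sources such as Postgres accept allow/deny regex filters nested under the source's config block; the schema and table patterns here are hypothetical examples:
schema_pattern:
  allow:
    - 'public'
table_pattern:
  deny:
    - '.*_backup'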
Related on TokRepo
- AI Tools for Database — Database management and query tools that pair well with DataHub
- AI Tools for Documentation — Documentation generators for data catalogs and API references
Common pitfalls
- The Docker quickstart requires Docker Compose and allocates significant memory. Ensure at least 8 GB of RAM is available before starting.
- Ingestion recipes need explicit schema filtering for large warehouses. Without filters, initial ingestion can take hours on databases with thousands of tables.
- DataHub uses Kafka internally for its stream-first architecture. In production deployments, monitor Kafka consumer lag, since lag directly delays metadata freshness (see the lag-check sketch after this list).
- Check the release notes and migration guides for version-specific changes before upgrading a production deployment.
- For team deployments, agree on ingestion recipe conventions (naming, filtering, scheduling) so the catalog stays consistent as more sources and engineers are added.
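As a sketch of the lag check mentioned above, the standard tooling that ships with any Kafka broker can describe DataHub's consumer groups; the group name below is a placeholder, since exact names vary by deployment and version:
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group generic-mae-consumer-job-client
A steadily growing LAG column on the metadata-change topics means the catalog is falling behind the events being produced.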
Frequently Asked Questions
Which data sources does DataHub support?
DataHub supports 50+ sources including Snowflake, BigQuery, Redshift, dbt, Airflow, Spark, Kafka, Tableau, Looker, PostgreSQL, MySQL, and many more. Each source has a dedicated ingestion plugin that extracts schemas, lineage, and usage statistics.
Does DataHub support column-level lineage?
Yes. DataHub renders column-level lineage across warehouses, ETL jobs, and dashboards. This means you can trace a specific column in a dashboard back through transformation layers to its original source table.
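Lineage can also be pushed programmatically when an ingestion connector cannot infer it. Below is a minimal sketch using the acryl-datahub Python SDK to declare table-level lineage between two hypothetical Postgres tables; column-level (fine-grained) lineage uses the same emitter with a richer aspect:
# Minimal sketch: declare that orders_daily is derived from orders.
# The dataset names and server address are hypothetical examples.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

# Point the emitter at the metadata service from the quickstart.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Build the upstream reference and the lineage aspect.
upstream = UpstreamClass(
    dataset=make_dataset_urn(platform="postgres", name="analytics.public.orders", env="PROD"),
    type=DatasetLineageTypeClass.TRANSFORMED,
)
lineage = UpstreamLineageClass(upstreams=[upstream])

# Attach the lineage aspect to the downstream dataset.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="postgres", name="analytics.public.orders_daily", env="PROD"),
        aspect=lineage,
    )
)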
How does DataHub compare to Amundsen?
DataHub originated at LinkedIn and Amundsen at Lyft; both tackle data discovery, but DataHub has become the more actively maintained project. DataHub offers a stream-first architecture, richer lineage capabilities, and a larger plugin ecosystem for data source integration.
Can a small team run DataHub?
Yes. The Docker quickstart runs the full platform locally. Small teams can start with a single-node deployment and scale to production when needed. The CLI-based ingestion makes it straightforward to add sources incrementally.
What does DataHub's architecture look like?
DataHub uses a stream-first design with Kafka for metadata change events, Elasticsearch for search, Neo4j or Elasticsearch for graph queries, and MySQL or PostgreSQL for primary storage. The frontend is React with a GraphQL gateway backed by a Java metadata service.
Citations (3)
- DataHub GitHub — DataHub is maintained by Acryl Data, originally built at LinkedIn
- DataHub Documentation — 50+ source connectors for automated metadata ingestion
- DataHub Architecture — Stream-first architecture with Kafka, Elasticsearch, and Neo4j