Introduction
Druid was built at Metamarkets (later acquired by Snap) starting in 2011 to answer a very specific need: interactive exploration of multi-billion-row event streams with sub-second response times. It powers Netflix's internal dashboards and Airbnb's Superset deployments, and runs in production at Target, Salesforce Marketing Cloud, and many others.
With over 14,000 GitHub stars, Druid is a go-to choice whenever sub-second response times matter, at any scale. It combines real-time streaming ingestion (Kafka/Kinesis) with historical batch ingestion (S3/HDFS) behind a single query layer.
What Druid Does
Druid splits a cluster into roles: Broker (routes queries), Router (optional front door), Coordinator/Overlord (cluster and ingestion management), Historical (serves stored segments), and MiddleManager/Indexer (ingestion). Data is stored as immutable segments: columnar, time-partitioned, and highly compressed. Queries exploit segment locality and bitmap indexes for speed.
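The bitmap indexes mentioned above are worth a concrete sketch: each (dimension, value) pair maps to the set of rows containing it, so an equality filter becomes a set intersection. The toy Python below illustrates the idea only; Druid's real implementation uses compressed Roaring/Concise bitmaps inside segment files.

```python
# Toy illustration of answering a filter with bitmap-style indexes:
# one posting set per (dimension, value) pair; AND = intersection.
# Plain Python sets stand in for Druid's compressed bitmaps.

rows = [
    {"service": "api", "status_code": "500", "bytes": 120},
    {"service": "api", "status_code": "200", "bytes": 80},
    {"service": "web", "status_code": "500", "bytes": 300},
    {"service": "api", "status_code": "500", "bytes": 50},
]

# Build the index: (dimension, value) -> set of row ids.
index = {}
for rid, row in enumerate(rows):
    for dim in ("service", "status_code"):
        index.setdefault((dim, row[dim]), set()).add(rid)

# "WHERE service = 'api' AND status_code = '500'" is an intersection.
hits = index[("service", "api")] & index[("status_code", "500")]
total_bytes = sum(rows[rid]["bytes"] for rid in hits)
print(sorted(hits), total_bytes)  # -> [0, 3] 170
```

Because the matching row IDs are resolved before any column data is read, only the `bytes` column for the hit rows is ever touched, which is where the columnar layout pays off.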
Architecture Overview
 Streaming                          Batch
 Kafka / Kinesis            S3 / HDFS / local files
          \                        /
           \                      /
        [Indexer / MiddleManager]
                    |
          [Immutable Segments]
            time-partitioned
            columnar + compressed
            bitmap indexes
                    |
          [Historical servers]
                    |
              [Broker]  <-- client SQL queries
                    |
        [Coordinator / Overlord]
  segment balancing + ingestion supervision
                    |
  [Deep Storage: S3 / HDFS / GCS / Azure]
         canonical segment storage

Self-Hosting & Configuration
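Registering a streaming ingestion job (such as the Kafka supervisor spec below) is a single POST to the Overlord's supervisor endpoint, `/druid/indexer/v1/supervisor`. A minimal sketch, assuming a quickstart cluster with the Overlord reachable at `localhost:8081` (the host and the trimmed spec are placeholders):

```python
# Submit a supervisor spec to Druid's Overlord.
# POST /druid/indexer/v1/supervisor registers (or updates) a supervisor;
# the host/port below is an assumption for a local quickstart cluster.
import json
import urllib.request

OVERLORD = "http://localhost:8081"  # placeholder; point at your Overlord/Router

def submit_supervisor(spec: dict) -> urllib.request.Request:
    """Build the HTTP request that registers or updates a supervisor."""
    return urllib.request.Request(
        url=f"{OVERLORD}/druid/indexer/v1/supervisor",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = submit_supervisor({"type": "kafka", "spec": {}})  # trimmed spec for brevity
# urllib.request.urlopen(req)  # uncomment against a live cluster
print(req.full_url, req.get_method())
```

Re-posting the same spec updates the running supervisor in place, which is how schema changes are usually rolled out.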
// Kafka supervisor spec: real-time ingest from a topic
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": ["service", "status_code", "user_id", "region"]
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "longSum", "name": "bytes", "fieldName": "bytes" }
      ],
      "granularitySpec": {
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      }
    },
    "ioConfig": {
      "topic": "events",
      "consumerProperties": { "bootstrap.servers": "kafka:9092" },
      "taskCount": 2,
      "replicas": 1
    },
    "tuningConfig": { "type": "kafka", "maxRowsPerSegment": 5000000 }
  }
}

Key Features
- Sub-second queries — columnar store + bitmap indexes + segment pruning
- Real-time + batch — Kafka/Kinesis streaming and S3/HDFS batch in one source
- Roll-up at ingest — pre-aggregate events for massive storage savings
- Approximate algorithms — HLL, theta sketches for fast distinct counts
- Time-series optimizations — everything is partitioned and indexed by time
- SQL + native JSON — SQL for BI tools, native API for custom clients
- Horizontal scale — add Historicals for storage, Brokers for query concurrency
- Deep storage — segments persist in cheap object storage
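The roll-up feature above deserves a concrete illustration: at ingest time, Druid collapses raw events that share a truncated timestamp and identical dimension values into one pre-aggregated row. A toy Python version of that behavior, mirroring the spec's `queryGranularity: MINUTE` and `longSum` on `bytes` (Druid does this inside ingestion tasks, not like this):

```python
# Toy version of Druid's ingest-time roll-up: events sharing a
# minute-truncated timestamp and identical dimension values collapse
# into one stored row carrying a count and pre-summed metrics.
from collections import defaultdict

events = [
    {"ts": "2024-01-01T00:00:05", "service": "api", "bytes": 100},
    {"ts": "2024-01-01T00:00:42", "service": "api", "bytes": 200},
    {"ts": "2024-01-01T00:00:59", "service": "web", "bytes": 50},
    {"ts": "2024-01-01T00:01:10", "service": "api", "bytes": 300},
]

def rollup(events, dims=("service",)):
    agg = defaultdict(lambda: {"count": 0, "bytes": 0})
    for e in events:
        minute = e["ts"][:16]  # truncate to MINUTE queryGranularity
        key = (minute,) + tuple(e[d] for d in dims)
        agg[key]["count"] += 1
        agg[key]["bytes"] += e["bytes"]
    return dict(agg)

rows = rollup(events)
print(len(events), "events ->", len(rows), "stored rows")
# -> 4 events -> 3 stored rows
```

The trade-off is that individual events are gone after roll-up; that is why high-cardinality dimensions like `user_id` sharply reduce the storage savings.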
Comparison with Similar Tools
| Feature | Druid | Pinot | ClickHouse | StarRocks | Snowflake |
|---|---|---|---|---|---|
| Streaming ingest | Yes | Yes | Via Kafka engine | Yes | Via Snowpipe |
| SQL | Subset | Subset | Full | MySQL-compatible | Full |
| Concurrency | Very High | Very High | Moderate | Very High | Very High |
| Updates | Limited | Limited | Limited | Yes | Yes |
| Roll-up at ingest | Yes | Yes | Via MV | Via MV | Via MV |
| Best For | User-facing analytics | User-facing analytics | Raw speed | Self-serve BI | Managed DW |
FAQ
Q: Druid vs Pinot — aren't they nearly identical? A: Both target the user-facing analytics niche. Pinot's star-tree index is a unique advantage for pre-aggregated filter-and-group queries; Druid has a larger, more mature community. Benchmarks often show similar P99 latencies.
Q: Druid vs ClickHouse? A: ClickHouse is faster per-node and simpler to operate. Druid has better streaming ingest, better concurrency, and better approximate-aggregate support. For user-facing dashboards serving thousands of concurrent users, Druid often wins.
Q: Is Druid hard to operate? A: Historically, yes: multiple server roles, segment management, and deep-storage wiring all need configuring. Imply (the commercial company founded by Druid's creators) and managed cloud services exist to reduce the ops burden.
Q: Does Druid support joins? A: Limited joins (broadcast or lookup tables). For heavy join workloads, prefer ClickHouse or StarRocks. Druid's sweet spot is denormalized event data.
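To make the lookup-join point concrete: Druid SQL exposes registered lookup tables through the LOOKUP() function, which maps a key column through a small broadcast dictionary instead of performing a general join. A hedged sketch of the payload shape for the `/druid/v2/sql` endpoint; the lookup name `region_names` and the `events` datasource columns are illustrative assumptions:

```python
# Build a Druid SQL payload that enriches a dimension via a registered
# lookup table using LOOKUP(); "region_names" is a hypothetical lookup.
import json

query = """
SELECT
  LOOKUP(region, 'region_names') AS region_name,
  SUM("bytes") AS total_bytes
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1
ORDER BY total_bytes DESC
"""

# Payload shape accepted by POST /druid/v2/sql on the Broker/Router.
payload = json.dumps({"query": query, "resultFormat": "object"})
print(payload)
```

For anything beyond this key-to-value enrichment pattern, denormalizing at ingest remains the idiomatic Druid answer.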
Sources
- GitHub: https://github.com/apache/druid
- Docs: https://druid.apache.org
- Foundation: Apache Software Foundation
- License: Apache-2.0