# Apache Druid — Real-Time Analytics Database for Event-Driven Data

> Apache Druid powers interactive analytics on real-time event data. With column-oriented storage, time-based partitioning, and a distributed architecture, it serves sub-second queries on trillions of events per day — the OLAP engine behind Netflix and Airbnb.

## Quick Use

```bash
# Download and run the Druid single-server quickstart
curl -O https://dlcdn.apache.org/druid/30.0.1/apache-druid-30.0.1-bin.tar.gz
tar -xzf apache-druid-30.0.1-bin.tar.gz
cd apache-druid-30.0.1
./bin/start-druid
# UI: http://localhost:8888
```

```sql
-- Druid speaks a subset of SQL via the /druid/v2/sql HTTP endpoint or web console
SELECT
  __time,
  service,
  status_code,
  COUNT(*) AS events,
  APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2, 3
ORDER BY events DESC;
```

## Introduction

Druid was built at Metamarkets (later acquired by Snap) starting in 2011 to answer a very specific need: interactive exploration of multi-billion-row event streams with sub-second response times. It's the OLAP engine behind Netflix's internal dashboards, Airbnb's Superset installations, Target, Salesforce Marketing Cloud, and many more. With over 14,000 GitHub stars, Druid is used wherever response times matter at scale. It combines real-time streaming ingest (Kafka/Kinesis) with historical batch ingest (S3/HDFS) behind one query layer.

## What Druid Does

Druid splits a cluster into roles: **Broker** (routes queries), **Router** (optional front door), **Coordinator/Overlord** (cluster and ingestion management), **Historical** (serves segments), and **MiddleManager/Indexer** (runs ingestion tasks). Data is stored as immutable **segments** — columnar, time-partitioned, and highly compressed. Queries exploit segment locality and bitmap indexes for speed.
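Beyond the web console, the same SQL can be POSTed to the `/druid/v2/sql` endpoint mentioned above. A minimal Python sketch — the `druid_sql` helper name is illustrative, and the URL assumes the local single-server quickstart:

```python
import json
from urllib import request

# Host/port assume the local quickstart (./bin/start-druid);
# /druid/v2/sql is Druid's SQL-over-HTTP endpoint.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

def build_payload(sql: str) -> bytes:
    """Encode a SQL statement as the JSON body /druid/v2/sql expects."""
    return json.dumps({"query": sql, "resultFormat": "object"}).encode("utf-8")

def druid_sql(sql: str, url: str = DRUID_SQL_URL) -> list:
    """POST a query to Druid and return the result rows as a list of dicts."""
    req = request.Request(
        url,
        data=build_payload(sql),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # Print the request body; actually running the query needs a live cluster.
    print(build_payload("SELECT COUNT(*) AS events FROM events").decode("utf-8"))
```

With `resultFormat: "object"`, Druid returns one JSON object per row, which maps naturally onto Python dicts.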
## Architecture Overview

```
      Streaming                      Batch
  Kafka / Kinesis           S3 / HDFS / local files
          \                         /
           \                       /
        [Indexer / MiddleManager]
                   |
          [Immutable Segments]
            time-partitioned
            columnar + compressed
            bitmap indexes
                   |
          [Historical servers]
                   |
               [Broker]   <-- client SQL queries
                   |
       [Coordinator / Overlord]
  segment balancing + ingestion supervision
                   |
  [Deep Storage: S3 / HDFS / GCS / Azure]
        canonical segment storage
```

## Self-Hosting & Configuration

```json
// Kafka supervisor spec: real-time ingest from a topic
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": ["service", "status_code", "user_id", "region"]
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "longSum", "name": "bytes", "fieldName": "bytes" }
      ],
      "granularitySpec": {
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      }
    },
    "ioConfig": {
      "topic": "events",
      "consumerProperties": { "bootstrap.servers": "kafka:9092" },
      "taskCount": 2,
      "replicas": 1
    },
    "tuningConfig": { "type": "kafka", "maxRowsPerSegment": 5000000 }
  }
}
```

## Key Features

- **Sub-second queries** — columnar store + bitmap indexes + segment pruning
- **Real-time + batch** — Kafka/Kinesis streaming and S3/HDFS batch in one datasource
- **Roll-up at ingest** — pre-aggregate events for massive storage savings
- **Approximate algorithms** — HLL and theta sketches for fast distinct counts
- **Time-series optimizations** — everything is partitioned and indexed by time
- **SQL + native JSON** — SQL for BI tools, native JSON API for custom clients
- **Horizontal scale** — add Historicals for storage, Brokers for query concurrency
- **Deep storage** — segments persist in cheap object storage

## Comparison with Similar Tools

| Feature | Druid | Pinot | ClickHouse | StarRocks | Snowflake |
|---|---|---|---|---|---|
| Streaming ingest | Yes | Yes | Via Kafka engine | Yes | Via Snowpipe |
| SQL dialect | Subset | Subset | Full | MySQL-compatible | Full |
| Concurrency | Very high | Very high | Moderate | Very high | Very high |
| Updates | Limited | Limited | Limited | Yes | Yes |
| Roll-up at ingest | Yes | Yes | Via MV | Via MV | Via MV |
| Best for | User-facing analytics | User-facing analytics | Raw speed | Self-serve BI | Managed DW |

## FAQ

**Q: Druid vs Pinot — aren't they nearly the same?**
A: Both target the user-facing-analytics niche. Pinot's star-tree index is a unique advantage for pre-aggregated queries; Druid has a larger, more mature community. Benchmarks often show similar P99 latencies.

**Q: Druid vs ClickHouse?**
A: ClickHouse is faster per node and simpler to operate. Druid has better streaming ingest, better query concurrency, and better approximate-aggregate support. For user-facing dashboards serving thousands of concurrent users, Druid often wins.

**Q: Is Druid hard to operate?**
A: Historically, yes — multiple server roles, segment management, and deep-storage wiring. Imply (the commercial Druid vendor) and managed cloud services exist to reduce the ops burden.

**Q: Does Druid support joins?**
A: Only limited joins (broadcast joins and lookup tables). For join-heavy workloads, prefer ClickHouse or StarRocks; Druid's sweet spot is denormalized event data.

## Sources

- GitHub: https://github.com/apache/druid
- Docs: https://druid.apache.org
- Foundation: Apache Software Foundation
- License: Apache-2.0

---

Source: https://tokrepo.com/en/workflows/0963f669-37d2-11f1-9bc6-00163e2b0d79
Author: Script Depot