Apache Druid — Real-Time Analytics Database for Event-Driven Data
Apache Druid powers interactive analytics on real-time event data. With column-oriented storage, time-based partitioning, and a distributed architecture, it serves sub-second queries on trillions of events per day — the OLAP engine behind Netflix and Airbnb.
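Review before installing
This asset should be reviewed before installation. The copied instructions ask the agent to do a dry run, list the files it would write, and continue only after confirmation.
npx -y tokrepo@latest install 0963f669-37d2-11f1-9bc6-00163e2b0d79 --target codex
Do a dry run first and confirm the write list before running this command.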
What it is
Apache Druid is an open-source, distributed analytics database designed for real-time event data. It uses column-oriented storage, time-based partitioning, and a shared-nothing architecture to serve sub-second queries on large-scale event streams.
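To make the combination of time-based partitioning and column-oriented storage more concrete, here is a minimal conceptual sketch. It is not Druid's implementation: the hourly chunk size, the `partition_by_hour` helper, and the sample events are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def time_chunk(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its hour (the assumed chunk size)."""
    return ts.replace(minute=0, second=0, microsecond=0)


def partition_by_hour(events):
    """Group events by hourly time chunk and store each field as its own column."""
    chunks = defaultdict(lambda: defaultdict(list))
    for event in events:
        columns = chunks[time_chunk(event["__time"])]
        for field, value in event.items():
            columns[field].append(value)
    return chunks


# Hypothetical sample events; field names mirror the SQL example in this page.
events = [
    {"__time": datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc), "service": "api", "duration_ms": 12},
    {"__time": datetime(2024, 1, 1, 10, 40, tzinfo=timezone.utc), "service": "web", "duration_ms": 30},
    {"__time": datetime(2024, 1, 1, 11, 2, tzinfo=timezone.utc), "service": "api", "duration_ms": 7},
]
chunks = partition_by_hour(events)
```

A query filtered to one hour only needs to touch the matching chunk, and an aggregate over `duration_ms` only needs that one column, which is the intuition behind the layout described above.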
Druid targets data engineers and analytics teams who need interactive dashboards and slice-and-dice analytics on high-volume event data such as clickstreams, logs, or IoT telemetry.
How it saves time or tokens
Druid ingests data in real time from Kafka, Kinesis, or batch sources and makes it queryable within seconds. Unlike traditional data warehouses that require ETL pipelines with minutes-to-hours latency, Druid provides near-instant query results on fresh data, reducing the feedback loop for operational analytics.
How to use
- Download and start Druid:
curl -O https://dlcdn.apache.org/druid/30.0.1/apache-druid-30.0.1-bin.tar.gz
tar -xzf apache-druid-30.0.1-bin.tar.gz
cd apache-druid-30.0.1
./bin/start-druid
- Open the web console at http://localhost:8888.
- Load data via the console wizard or submit an ingestion spec via the API.
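As a rough illustration of the last step, the sketch below builds a small native batch (`index_parallel`) task with inline JSON rows and posts it to the task endpoint on the quickstart Router. The datasource name, the `timestamp`, `service`, and `duration_ms` fields, and the helper names are assumptions for illustration; adapt the spec to your own data before using it.

```python
import json
import urllib.request

ROUTER_URL = "http://localhost:8888"  # quickstart address from the steps above


def build_batch_spec(datasource, rows):
    """Build a native batch task spec that ingests the given rows inline."""
    data = "\n".join(json.dumps(row) for row in rows)
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Assumed column names; change them to match your events.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {
                    "dimensions": ["service", {"type": "long", "name": "duration_ms"}]
                },
                "granularitySpec": {
                    "segmentGranularity": "day",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "inline", "data": data},
                "inputFormat": {"type": "json"},
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }


def submit_task(spec, base_url=ROUTER_URL):
    """POST the task spec to Druid and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/druid/indexer/v1/task",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical sample row; submit_task(spec) needs a running Druid instance.
spec = build_batch_spec(
    "request_logs",
    [{"timestamp": "2024-01-01T10:05:00Z", "service": "api", "duration_ms": 12}],
)
```

Inline data keeps the example self-contained; for real datasets the `inputSource` would more likely point at files or object storage.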
Example
-- Query via Druid SQL (web console or /druid/v2/sql endpoint)
-- Hourly event counts and total duration per service over the last 24 hours
SELECT
  TIME_FLOOR(__time, 'PT1H') AS hour_bucket,
  service,
  COUNT(*) AS events,
  SUM(duration_ms) AS total_duration
FROM request_logs
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '24' HOUR
GROUP BY 1, 2
ORDER BY events DESC
LIMIT 20
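A query like this can also be sent programmatically to the /druid/v2/sql endpoint. The following is a minimal sketch using only the Python standard library; the helper names and the localhost address are assumptions, and the `request_logs` table must already exist for the query to return rows.

```python
import json
import urllib.request


def build_sql_request(sql, base_url="http://localhost:8888"):
    """Build a POST request for Druid's SQL endpoint with object-style results."""
    payload = {"query": sql, "resultFormat": "object"}
    return urllib.request.Request(
        f"{base_url}/druid/v2/sql",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(sql, base_url="http://localhost:8888"):
    """Send the query and return the parsed rows (requires a running Druid)."""
    with urllib.request.urlopen(build_sql_request(sql, base_url)) as response:
        return json.load(response)


# Building the request does not contact the server; run_sql() does.
request = build_sql_request("SELECT COUNT(*) AS events FROM request_logs")
```

With `resultFormat` set to `object`, each row comes back as a JSON object keyed by column name, which tends to be the easiest shape to work with in scripts.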
Common pitfalls
- Druid requires a minimum of 8 GB RAM for the single-server quickstart. Production clusters typically run services on dedicated servers grouped by role: Master (Coordinator, Overlord), Query (Broker, Router), and Data (Historical, MiddleManager).
- Druid SQL covers a subset of standard SQL. Complex joins and subqueries may not be supported. Check the SQL compatibility matrix before migrating queries.
- Real-time ingestion from Kafka requires careful tuning of task count and segment granularity to balance latency versus segment size.
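For the Kafka pitfall, the two knobs mentioned live in different parts of the supervisor spec: `taskCount` under `ioConfig` and `segmentGranularity` under `granularitySpec`. The sketch below shows where they sit; the topic, broker address, column names, and default values are assumptions for illustration, not recommended settings.

```python
import json


def build_kafka_supervisor(datasource, topic, bootstrap_servers,
                           task_count=1, segment_granularity="hour"):
    """Build a Kafka supervisor spec exposing the two tuning knobs as arguments."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Assumed column names; change them to match your events.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["service"]},
                "granularitySpec": {
                    # Coarser granularity yields fewer, larger segments.
                    "segmentGranularity": segment_granularity,
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "type": "kafka",
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                # Number of parallel reading tasks; more than the topic's
                # partition count adds no extra read parallelism.
                "taskCount": task_count,
                "taskDuration": "PT1H",
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


# Hypothetical values for a local test setup.
supervisor = build_kafka_supervisor(
    "request_logs", "request_logs", "localhost:9092", task_count=2
)
print(json.dumps(supervisor, indent=2))
```

The resulting JSON would be posted to /druid/indexer/v1/supervisor. Starting with a small `taskCount` and adjusting while watching ingestion lag and segment sizes is one reasonable way to approach the trade-off.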
FAQ
Druid supports real-time ingestion from Apache Kafka and Amazon Kinesis, plus batch ingestion from HDFS, S3, GCS, Azure Blob, and local files. Ingestion tasks and specs can also be submitted through its HTTP API.
Druid and ClickHouse are both column-oriented analytics databases. Druid excels at real-time ingestion with sub-second query latency on time-series data. ClickHouse supports more SQL features and ad-hoc queries. The choice depends on whether real-time ingestion or SQL completeness matters more.
Druid supports SQL through the /druid/v2/sql endpoint and the web console. It supports SELECT, WHERE, GROUP BY, ORDER BY, and common aggregation functions. Some advanced SQL features, such as window functions, have limited support.
The single-server quickstart requires at least 8 GB RAM and runs all Druid services on one machine. Production deployments typically use dedicated nodes for each service role, with 16 GB or more RAM per node.
Druid is optimized for time-series and event analytics with real-time ingestion. It is not a general-purpose data warehouse. Use it alongside a warehouse like BigQuery or Snowflake, with Druid handling real-time operational dashboards.