Apache Doris — Modern MPP Analytical Database for Real-Time Reporting
Apache Doris is a high-performance real-time analytical database. It combines MySQL-compatible SQL, sub-second query latency, and support for federated queries across data lakes, Hive, Iceberg, and Hudi — the open-source answer to Snowflake and BigQuery.
Install with prior review
This asset requires review. The copied prompt requests a dry run, shows the writes it would make, and continues only after confirmation.
Do a dry run first, confirm the writes, then run this command:
npx -y tokrepo@latest install 0906d4d6-37d2-11f1-9bc6-00163e2b0d79 --target codex
What it is
Apache Doris is a high-performance, real-time analytical database built for online analytical processing (OLAP). It provides MySQL-compatible SQL, sub-second query latency on large datasets, and federated queries across data lakes including Hive, Iceberg, and Hudi.
Apache Doris targets data engineers and analysts who need fast dashboards, ad-hoc reporting, and real-time analytics without the complexity of a separate ETL pipeline.
How it saves time or tokens
Doris ingests data in real time and serves analytical queries immediately, eliminating the batch ETL window. You can query fresh data within seconds of ingestion. The MySQL protocol compatibility means existing BI tools (Grafana, Superset, Metabase) connect without custom drivers.
The built-in materialized views and rollup tables pre-aggregate common queries, reducing compute time for repeated dashboard requests.
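To make the pre-aggregation idea concrete, here is a minimal Python sketch of what a rollup does conceptually. It is not Doris code: the sample rows and the names `build_rollup` and `total_views_by_page` are hypothetical, and the real engine does this work internally.

```python
from collections import defaultdict

# Hypothetical base rows: (event_date, page_url, user_id, view_count)
BASE_ROWS = [
    ("2026-04-01", "/home", 1, 3),
    ("2026-04-01", "/home", 2, 5),
    ("2026-04-01", "/docs", 1, 2),
    ("2026-04-02", "/home", 1, 4),
]

def build_rollup(rows):
    """Pre-aggregate view_count by (event_date, page_url), dropping user_id."""
    rollup = defaultdict(int)
    for event_date, page_url, _user_id, view_count in rows:
        rollup[(event_date, page_url)] += view_count
    return dict(rollup)

def total_views_by_page(rollup):
    """Answer a per-page dashboard query from the rollup, not the base rows."""
    totals = defaultdict(int)
    for (_event_date, page_url), views in rollup.items():
        totals[page_url] += views
    return dict(totals)

ROLLUP = build_rollup(BASE_ROWS)
print(len(BASE_ROWS), "base rows ->", len(ROLLUP), "rollup rows")
print(total_views_by_page(ROLLUP))
```

The repeated dashboard query scans the smaller rollup rather than every base row, which is where the compute saving comes from.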
How to use
- Deploy Doris: download the binary or use Docker
- Start the Frontend (FE) and Backend (BE) nodes
- Connect with any MySQL client:
  mysql -h 127.0.0.1 -P 9030 -u root
- Create tables and load data via Stream Load or Routine Load from Kafka
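The Stream Load step above can be sketched as a small Python helper that only assembles the curl command line and does not execute it. The helper name `build_stream_load_command`, the database name `demo`, the label value, and the file name are assumptions made for illustration. The port 8030 and the `root` user with an empty password follow the example on this page.

```python
import shlex

def build_stream_load_command(fe_host, database, table, data_file, label,
                              user="root", password="", fe_http_port=8030):
    """Return the curl argv for one Stream Load request (nothing is executed)."""
    url = f"http://{fe_host}:{fe_http_port}/api/{database}/{table}/_stream_load"
    return [
        "curl", "--location-trusted",   # follow the FE redirect and keep credentials
        "-u", f"{user}:{password}",
        "-H", "Expect: 100-continue",
        "-H", "format: json",
        "-H", f"label: {label}",        # lets Doris reject a repeated submission
        "-T", data_file,
        "-X", "PUT", url,
    ]

# Hypothetical values for illustration only.
cmd = build_stream_load_command(
    "127.0.0.1", "demo", "page_views", "data.json", "page_views_batch_001"
)
print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run`. If `data.json` contains a single JSON array rather than one object per line, check the Stream Load documentation for the `strip_outer_array` header.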
Example
-- Create an aggregate table for web analytics
CREATE TABLE page_views (
    event_date DATE,
    page_url   VARCHAR(512),
    user_id    BIGINT,
    view_count BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(event_date, page_url, user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 16
PROPERTIES ('replication_num' = '1');
-- Load data via Stream Load (replace "fe" and "db" with your FE host and database name)
-- curl --location-trusted -u root: -H 'Expect: 100-continue' -H 'format: json' \
--      -T data.json http://fe:8030/api/db/page_views/_stream_load
-- Query with standard SQL
SELECT page_url, SUM(view_count) AS total_views
FROM page_views
WHERE event_date >= '2026-04-01'
GROUP BY page_url
ORDER BY total_views DESC
LIMIT 10;
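The `page_views` table above uses the Aggregate model, so rows that share the same key columns are merged and `view_count` is summed. The following Python sketch contrasts that behavior with the Unique and Duplicate models on the same input. It is a simplified illustration under stated assumptions, not how Doris stores data. The function names and sample rows are hypothetical.

```python
def load_duplicate(rows):
    """Duplicate model: every loaded row is kept as-is."""
    return list(rows)

def load_unique(rows, key_len):
    """Unique model: a later row replaces an earlier row with the same key."""
    table = {}
    for row in rows:
        table[row[:key_len]] = row[key_len:]
    return table

def load_aggregate_sum(rows, key_len):
    """Aggregate model with one SUM value column, as in page_views."""
    table = {}
    for row in rows:
        key, value = row[:key_len], row[key_len]
        table[key] = table.get(key, 0) + value
    return table

# Hypothetical rows: (event_date, page_url, user_id, view_count)
ROWS = [
    ("2026-04-01", "/home", 42, 1),
    ("2026-04-01", "/home", 42, 4),
    ("2026-04-01", "/docs", 42, 2),
]
KEY = ("2026-04-01", "/home", 42)

print(len(load_duplicate(ROWS)))         # all three rows are kept
print(load_unique(ROWS, 3)[KEY])         # only the latest value survives
print(load_aggregate_sum(ROWS, 3)[KEY])  # the values are summed
```

The same three input rows produce three different stored results, which is why the model choice has to match how the data will be queried.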
Related on TokRepo
- Database tools -- Database management and analytics
- Monitoring tools -- Dashboards and real-time monitoring
Common pitfalls
- Doris requires separate FE and BE processes; minimum production setup is 3 FE nodes and 3 BE nodes for high availability
- Choosing the wrong data model (Aggregate, Unique, or Duplicate) affects query performance significantly; read the model guide before designing tables
- Stream Load has a 10 GB default limit per request; batch large imports into smaller chunks
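For the last pitfall, one way to batch a large import is to group records into chunks that stay under a chosen size and send each chunk as its own Stream Load request. The sketch below shows only the grouping logic. The function name `chunk_lines` and the tiny 25-byte limit are assumptions chosen so the behavior is easy to see, and a real import would use a limit far below the per-request cap.

```python
def chunk_lines(lines, max_bytes):
    """Group byte lines into batches whose total size stays within max_bytes.

    A single line larger than max_bytes is emitted as its own batch.
    """
    batch, size = [], 0
    for line in lines:
        # Close the current batch before it would exceed the limit.
        if batch and size + len(line) > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += len(line)
    if batch:
        yield batch

# Hypothetical newline-delimited JSON records, 10 bytes each.
records = [b'{"id": 1}\n', b'{"id": 2}\n', b'{"id": 3}\n']
batches = list(chunk_lines(records, max_bytes=25))
print([len(b) for b in batches])
```

Splitting on record boundaries keeps every batch a valid set of complete rows, so a failed batch can be retried on its own.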
Frequently asked questions
Doris and ClickHouse are both columnar OLAP databases. Doris provides MySQL protocol compatibility and simpler operations, while ClickHouse offers more analytical functions and typically faster raw query performance. Doris is often preferred when MySQL compatibility and operational simplicity matter more.
Doris supports real-time ingestion: Stream Load for HTTP-based ingestion, Routine Load for continuous Kafka consumption, and Broker Load for batch imports from HDFS or S3. Data is queryable within seconds of ingestion.
Doris supports federated queries across Hive, Iceberg, Hudi, and Delta Lake catalogs. You register external catalogs and query them with standard SQL alongside Doris internal tables.
For testing, a single FE node and a single BE node can run on one machine with 4 cores and 16 GB of RAM. Production deployments should have at least 3 FE nodes and 3 BE nodes, with SSD storage.
Apache Doris is open source. It is an Apache Software Foundation top-level project released under the Apache License 2.0. Commercial distributions such as SelectDB offer managed hosting and enterprise support.
Similar assets
Apache Druid — Real-Time Analytics Database for Event-Driven Data
Apache Druid powers interactive analytics on real-time event data. With column-oriented storage, time-based partitioning, and a distributed architecture, it serves sub-second queries on trillions of events per day — the OLAP engine behind Netflix and Airbnb.
Apache Pinot — Real-Time Distributed OLAP Datastore
Apache Pinot is a real-time distributed OLAP datastore designed to deliver low-latency analytical queries at high throughput. It powers user-facing analytics at companies like LinkedIn, Uber, and Stripe by ingesting data from Kafka and batch sources.
Apache Flink — Stream Processing Framework for Real-Time Data
Apache Flink is the leading open-source framework for stateful stream processing. It processes unbounded data streams with exactly-once semantics, low latency, and high throughput — powering real-time analytics, fraud detection, and event-driven applications.
Apache Hudi — Incremental Data Processing for Data Lakehouses
Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse platform that provides record-level insert, update, and delete capabilities on data lakes. It powers incremental pipelines, CDC ingestion, and near-real-time analytics on S3, GCS, and HDFS.