# Apache Doris — Modern MPP Analytical Database for Real-Time Reporting

> Apache Doris is a high-performance real-time analytical database. It combines MySQL-compatible SQL, sub-second query latency, and support for federated queries across data lakes, Hive, Iceberg, and Hudi — the open-source answer to Snowflake and BigQuery.

## Quick Use
```bash
# Docker Compose for local cluster
# (one FE frontend + 3 BEs)
mkdir doris && cd doris
wget https://raw.githubusercontent.com/apache/doris/master/docker/runtime/docker-compose.yaml
docker-compose up -d

# Connect via MySQL client
mysql -h 127.0.0.1 -P 9030 -uroot
```

```sql
-- Feels like MySQL, runs on distributed MPP
CREATE DATABASE analytics;
USE analytics;

CREATE TABLE events (
  event_date DATE,
  user_id    BIGINT,
  event_type VARCHAR(32),
  amount     DECIMAL(18, 4)
)
DUPLICATE KEY(event_date, user_id)
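-- Empty partition list: daily partitions are created by dynamic partitioning
PARTITION BY RANGE(event_date) ()
DISTRIBUTED BY HASH(user_id) BUCKETS 32
PROPERTIES (
  "replication_num" = "3",
  "dynamic_partition.enable" = "true",
  "dynamic_partition.time_unit" = "DAY",
  "dynamic_partition.end" = "3",
  "dynamic_partition.prefix" = "p",
  "dynamic_partition.buckets" = "32"
);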

SELECT event_type, SUM(amount)
FROM events
WHERE event_date >= '2026-04-01'
GROUP BY event_type
ORDER BY 2 DESC;
```

## Introduction
Apache Doris (originally Palo, developed at Baidu) is an open-source MPP analytical database positioned as a lower-cost alternative to Snowflake and BigQuery. It speaks the MySQL wire protocol, so most BI tools, MySQL clients, and ORMs connect without a dedicated driver, and it targets sub-second queries over billions of rows on modest hardware.

With over 15,000 GitHub stars, Doris is used by Xiaomi, JD.com, Meituan, and more than 4,000 companies. The federated query engine also reads Hive, Iceberg, Hudi, and object storage — letting you run one SQL across your warehouse and data lake.

## What Doris Does
Doris runs a classic MPP architecture: **Frontend (FE)** nodes handle metadata, SQL parsing, and planning; **Backend (BE)** nodes store data in columnar form and execute queries in parallel. Hot data lives in Doris's own columnar format, while federated connectors query Hive and Iceberg tables in place.
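
As a rough sketch of what querying "in place" looks like, the statement below compares row counts between a native Doris table and a lake table in a single query. It assumes an external catalog named `iceberg` has already been registered with `CREATE CATALOG`, that `internal.analytics.events` is the native table from the Quick Use section, and that the Iceberg table has an `event_type` column (an assumption made for illustration; only `dt` appears elsewhere in this document).

```sql
-- `internal` is the built-in catalog that holds native Doris tables.
-- `iceberg.db.events` and its event_type column are assumed for this example.
SELECT w.event_type, w.warehouse_rows, l.lake_rows
FROM (
  SELECT event_type, COUNT(*) AS warehouse_rows
  FROM internal.analytics.events
  WHERE event_date = '2026-04-14'
  GROUP BY event_type
) AS w
JOIN (
  SELECT event_type, COUNT(*) AS lake_rows
  FROM iceberg.db.events
  WHERE dt = '2026-04-14'
  GROUP BY event_type
) AS l ON l.event_type = w.event_type;
```

Fully qualified `catalog.database.table` names let both sources appear in one statement without switching the session's current catalog.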

## Architecture Overview
```
Clients (MySQL protocol)
        |
  [FE Frontend]
   Metadata, parser, planner, scheduler
   Highly available via BDBJE replication (Paxos-like)
        |
  +--------+--------+--------+
  |        |        |        |
 [BE]    [BE]    [BE]    [BE]
   Columnar storage
   Distributed execution
   Segment files per tablet
        |
  [Federated Connectors]
   Hive, Iceberg, Hudi,
   JDBC (Postgres/MySQL/Oracle),
   Elasticsearch, object storage
```
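
The FE and BE roles in the diagram can be inspected from any MySQL client connected to an FE. The following is a small sketch, not a full operations guide; the hostname `be-host-4` is hypothetical, and port 9050 is assumed to be the BE heartbeat port left at its default.

```sql
-- List FE nodes; the output includes each node's role and liveness.
SHOW FRONTENDS;

-- List BE nodes with liveness and the number of tablets each one holds.
SHOW BACKENDS;

-- Register an additional BE with the cluster.
-- "be-host-4" is a placeholder; 9050 is the default heartbeat_service_port.
ALTER SYSTEM ADD BACKEND "be-host-4:9050";

-- Show how the tablets of one table are spread across BEs.
SHOW TABLETS FROM analytics.events;
```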

## Self-Hosting & Configuration
```sql
-- Multi-model table types
-- Duplicate (raw event log): keep every row
-- Aggregate (pre-aggregated metrics): auto-rollup
-- Unique (primary-key style): upsert semantics
CREATE TABLE daily_stats (
  dt DATE,
  dim1 VARCHAR(64),
  cnt BIGINT    SUM,
  users BITMAP  BITMAP_UNION
)
AGGREGATE KEY(dt, dim1)
DISTRIBUTED BY HASH(dim1) BUCKETS 10;

-- Query Iceberg table via catalog
CREATE CATALOG iceberg PROPERTIES (
  "type" = "iceberg",
  "iceberg.catalog.type" = "hms",
  "hive.metastore.uris" = "thrift://metastore:9083"
);

SELECT COUNT(*)
FROM iceberg.db.events
WHERE dt = '2026-04-14';
```
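
To make the Aggregate model's roll-up concrete, here is a minimal sketch against the `daily_stats` table defined above. The user IDs and the `web` dimension value are made up for illustration.

```sql
-- Three loads for the same (dt, dim1) key; user 1001 appears twice.
INSERT INTO daily_stats VALUES ('2026-04-14', 'web', 1, to_bitmap(1001));
INSERT INTO daily_stats VALUES ('2026-04-14', 'web', 1, to_bitmap(1002));
INSERT INTO daily_stats VALUES ('2026-04-14', 'web', 1, to_bitmap(1001));

-- SUM adds up cnt; BITMAP_UNION_COUNT counts distinct IDs in the merged bitmap.
SELECT dt, dim1,
       SUM(cnt)                  AS total_events,
       BITMAP_UNION_COUNT(users) AS distinct_users
FROM daily_stats
GROUP BY dt, dim1;
```

Because rows sharing a key are merged with the declared aggregate functions, the query should report three events and two distinct users for the single `('2026-04-14', 'web')` key.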

## Key Features
- **MySQL compatible**: wire protocol plus a subset of MySQL syntax, so existing MySQL drivers work
- **Sub-second OLAP**: vectorized execution, cost-based optimizer
- **Real-time ingest**: Stream Load, Routine Load (Kafka), Flink CDC
- **Federated queries**: Hive, Iceberg, Hudi, JDBC, ES, object storage
- **High availability**: replicated FE metadata (BDBJE) plus BE tablet replication
- **Materialized views**: queries are rewritten automatically to use pre-aggregates
- **Row- and column-level security**: row filtering and column masking for BI tools
- **Apache top-level project**: neutral governance, active community
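
As an illustration of the real-time ingest path, the sketch below defines a Routine Load job that continuously consumes a Kafka topic into the `events` table from the Quick Use section. The job name, broker addresses, topic, and consumer group are hypothetical, and the messages are assumed to be comma-separated values in the listed column order.

```sql
-- Continuous Kafka -> Doris ingestion job (all names below are placeholders).
CREATE ROUTINE LOAD analytics.load_events ON events
COLUMNS TERMINATED BY ",",
COLUMNS(event_date, user_id, event_type, amount)
PROPERTIES (
  "desired_concurrent_number" = "3"
)
FROM KAFKA (
  "kafka_broker_list" = "broker1:9092,broker2:9092",
  "kafka_topic" = "events",
  "property.group.id" = "doris_events"
);

-- Check job state, consumed offsets, and error rows.
SHOW ROUTINE LOAD FOR analytics.load_events;
```

A job can be suspended and resumed with `PAUSE ROUTINE LOAD FOR` and `RESUME ROUTINE LOAD FOR`, which is handy while adjusting the target table.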

## Comparison with Similar Tools
| Feature | Doris | StarRocks | ClickHouse | Apache Pinot | Druid |
|---|---|---|---|---|---|
| Dialect | MySQL SQL | MySQL SQL | Own SQL | SQL | SQL |
| Transactions | Limited | Limited | Limited | No | No |
| Federated queries | Yes | Yes | Yes | Limited | Limited |
| Concurrency | Very High | Very High | Moderate | Very High | Very High |
| Real-time ingest | Yes | Yes | Yes (async) | Yes | Yes |
| Operational complexity | Low-Moderate | Moderate | Moderate | High | High |
| Best For | Self-serve BI + data lake | Self-serve BI | Raw speed analytics | User-facing analytics | User-facing analytics |

## FAQ
**Q: Doris vs StarRocks — they look identical?**
A: They share history: StarRocks began as a fork of Doris, and the two are now independent projects. Doris is governed by the Apache Software Foundation, while StarRocks reports faster query performance in many benchmarks. Evaluate both against your own workload.

**Q: Doris vs ClickHouse?**
A: ClickHouse is known for raw single-server speed; Doris is stronger on high-concurrency and join-heavy workloads and adds MySQL compatibility. For dashboards with many concurrent users, Doris is often easier; for log analytics, ClickHouse often wins.

**Q: Can Doris replace my data warehouse?**
A: For many mid-sized setups, yes. Doris handles ingestion, OLAP queries, and federated lake queries in one engine. For petabyte-scale deployments that would otherwise need heavy custom engineering, Snowflake and BigQuery still lead.

**Q: How does Doris handle upserts?**
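A: Use the Unique Key model: a write with an existing key replaces the old row. Doris offers two implementations, merge-on-read (MoR) and merge-on-write (MoW), selected per table with the `enable_unique_key_merge_on_write` property. MoW does more work at write time in exchange for faster reads.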
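
A minimal sketch of upsert behavior with a Unique Key table follows. The table and column names are invented for illustration, and `replication_num` is set to 1 on the assumption of a small test cluster.

```sql
-- Hypothetical table: one row per user_id.
CREATE TABLE user_profile (
  user_id    BIGINT,
  plan       VARCHAR(32),
  updated_at DATETIME
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 8
PROPERTIES (
  "replication_num" = "1",
  "enable_unique_key_merge_on_write" = "true"
);

-- Two separate writes with the same key: the later one replaces the earlier.
INSERT INTO user_profile VALUES (42, 'free', '2026-04-01 09:00:00');
INSERT INTO user_profile VALUES (42, 'pro',  '2026-04-14 12:30:00');

-- Only the most recent version of the row for user 42 is visible.
SELECT * FROM user_profile WHERE user_id = 42;
```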

## Sources
- GitHub: https://github.com/apache/doris
- Docs: https://doris.apache.org/docs
- Foundation: Apache Software Foundation
- License: Apache-2.0

---
Source: https://tokrepo.com/en/workflows/0906d4d6-37d2-11f1-9bc6-00163e2b0d79
Author: AI Open Source