# StarRocks — High-Performance Analytical Database with MySQL Protocol

> StarRocks is an open-source MPP analytical database built for fast queries on large datasets. It often ranks at or near the top of published OLAP benchmarks, speaks the MySQL wire protocol, and can query data lakes directly.

## Quick Use
```bash
# All-in-one Docker for a quick test
docker run -d --name starrocks \
  -p 9030:9030 -p 8030:8030 -p 8040:8040 \
  starrocks/allin1-ubuntu:latest

# Connect via MySQL client (port 9030)
mysql -h 127.0.0.1 -P 9030 -uroot
```

```sql
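-- A fresh root session has no database selected, so create one first
-- ("demo" is just an example name)
CREATE DATABASE IF NOT EXISTS demo;
USE demo;

-- Create a primary key table (supports upserts)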
CREATE TABLE orders (
  order_id  BIGINT   NOT NULL,
  user_id   BIGINT,
  amount    DECIMAL(18,2),
  status    VARCHAR(32),
  ts        DATETIME
)
PRIMARY KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 32
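-- The all-in-one container runs a single BE, so only one replica is possible here;
-- clusters with three or more BEs typically use 3
PROPERTIES ("replication_num" = "1");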

-- Create an asynchronous materialized view for sub-second dashboards
CREATE MATERIALIZED VIEW orders_hourly
REFRESH ASYNC START('2026-04-14 00:00:00') EVERY (INTERVAL 10 MINUTE)
AS
SELECT date_trunc('hour', ts) AS bucket,
       status,
       COUNT(*) AS cnt,
       SUM(amount) AS revenue
FROM orders
GROUP BY 1, 2;

-- Query the base table; the optimizer can route a matching aggregate to the MV
SELECT date_trunc('hour', ts) AS bucket,
       SUM(amount) AS revenue
FROM orders
WHERE date_trunc('hour', ts) >= '2026-04-01 00:00:00'
GROUP BY date_trunc('hour', ts);
```
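
To check that the materialized view is populated and actually picked up by the optimizer, something like the following can help. It is a sketch that assumes the `orders` table and `orders_hourly` view from the example above; the exact `EXPLAIN` output varies between StarRocks versions.

```sql
-- Submit a refresh task by hand instead of waiting for the 10-minute schedule
REFRESH MATERIALIZED VIEW orders_hourly;

-- Check the view's refresh state
SHOW MATERIALIZED VIEWS WHERE NAME = "orders_hourly";

-- Inspect the plan: if the rewrite applied, the scan node should
-- reference orders_hourly rather than orders
EXPLAIN
SELECT date_trunc('hour', ts) AS bucket,
       SUM(amount) AS revenue
FROM orders
GROUP BY date_trunc('hour', ts);
```

If the plan still scans `orders`, the view may be stale relative to the base table, or the query shape may not match what the view stores.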

## Introduction
StarRocks originated at a startup in China as a fork of an early Apache Doris release, and has since become one of the strongest performers among open-source OLAP engines in published benchmarks. With a fully vectorized execution engine, a cost-based optimizer (CBO), and query-time MV rewriting, it frequently outperforms ClickHouse, Doris, and Presto on TPC-DS and on real workloads.

With over 11,000 GitHub stars, StarRocks is used by Trip.com, Airbnb, Pinterest, and Tencent. It speaks the MySQL protocol, so BI tools that ship a MySQL connector can connect to it. It also supports federated queries across Iceberg, Hive, and Hudi, and is available both self-hosted and as a managed service from CelerData.

## What StarRocks Does
StarRocks ingests data into its own columnar format (native tables) or queries external sources in place (Iceberg, Hive, Hudi, JDBC, object storage). Its CBO chooses join orders and materialized view rewrites automatically. Vectorized execution processes data in column batches so it can exploit CPU SIMD instructions. Primary Key tables allow real upserts, which many OLAP engines support only partially.
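
As a minimal sketch of the upsert behavior, assuming the `orders` Primary Key table from Quick Use exists (the row values here are invented for illustration):

```sql
-- First write creates the row
INSERT INTO orders VALUES (1001, 42, 19.99, 'created', '2026-04-14 10:00:00');

-- A second write with the same order_id replaces the row instead of duplicating it
INSERT INTO orders VALUES (1001, 42, 19.99, 'paid', '2026-04-14 10:05:00');

-- Change a single column in place
UPDATE orders SET status = 'shipped' WHERE order_id = 1001;

-- Expect exactly one row for order_id 1001
SELECT order_id, status, ts FROM orders WHERE order_id = 1001;
```

On a Duplicate Key table the two inserts would instead leave two rows, which is why the table type matters for mutable data.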

## Architecture Overview
```
BI tools (Tableau, Superset, Looker) -> MySQL wire protocol
        |
  [FE — Frontend nodes]
   SQL parsing, CBO, metadata
   HA via BDBJE
        |
  +--------+--------+
  |        |        |
 [BE]    [BE]    [BE]
  Vectorized engine
  Columnar storage (native)
  Tablet replication
        |
  [External Catalogs]
   Iceberg, Hive, Hudi,
   JDBC, object storage
        |
  [Materialized Views]
   async refresh
   optimizer auto-rewrites matching queries
```
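
The FE and BE nodes in the diagram can be inspected from any MySQL client session. A short sketch, assuming you are connected as a user with cluster admin privileges such as `root`:

```sql
-- List FE nodes with their role (leader, follower, observer) and liveness
SHOW FRONTENDS;

-- List BE nodes with liveness, tablet counts, and disk usage
SHOW BACKENDS;
```

On the all-in-one Docker image this should report one FE and one BE.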

## Self-Hosting & Configuration
```sql
-- Federated query: join StarRocks native table with Iceberg lake table
CREATE EXTERNAL CATALOG iceberg PROPERTIES (
  "type" = "iceberg",
  "iceberg.catalog.type" = "hive",
  "hive.metastore.uris" = "thrift://metastore:9083"
);

SELECT o.user_id, l.country, SUM(o.amount)
FROM orders o
JOIN iceberg.db.users u ON o.user_id = u.id
JOIN iceberg.db.locations l ON u.location_id = l.id
WHERE o.ts >= '2026-04-01'
GROUP BY o.user_id, l.country;

-- Stream ingest from Kafka (Routine Load)
CREATE ROUTINE LOAD orders_stream ON orders
COLUMNS (order_id, user_id, amount, status, ts)
PROPERTIES ("format" = "json", "jsonpaths" = '["$.order_id","$.user_id","$.amount","$.status","$.ts"]')
FROM KAFKA (
  "kafka_broker_list" = "kafka:9092",
  "kafka_topic" = "orders"
);
```
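
Once the catalog and the load job exist, a few statements help confirm they work. This sketch assumes the `iceberg` catalog and `orders_stream` job defined above:

```sql
-- Confirm the catalog is registered and browse its databases
SHOW CATALOGS;
SHOW DATABASES FROM iceberg;

-- Check job state, consumed offsets, and error rows
SHOW ROUTINE LOAD FOR orders_stream;

-- Pause and resume the job, for example around a schema change
PAUSE ROUTINE LOAD FOR orders_stream;
RESUME ROUTINE LOAD FOR orders_stream;
```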

## Key Features
- **Vectorized + CBO** — strong TPC-DS results and fast real-world analytical queries
- **MySQL protocol compatible** — most BI tools and MySQL clients connect without a custom driver
- **Materialized view rewrites** — optimizer uses MVs transparently
- **Primary Key tables** — fast upserts + partial updates
- **Federated queries** — native + lake tables in one SQL
- **Real-time ingest** — Routine Load (Kafka), Flink CDC, Stream Load
- **Storage-compute separation** (3.0+) — elastic compute on object storage
- **Active development** — monthly releases, fast bug-fix cadence

## Comparison with Similar Tools
| Feature | StarRocks | Doris | ClickHouse | Snowflake | Presto/Trino |
|---|---|---|---|---|---|
| Dialect | MySQL | MySQL | Own | Snowflake SQL | ANSI SQL |
| Upserts | Yes (PK) | Yes (Unique) | Limited | Yes | No |
| MV rewrite | Yes (async MV) | Yes | Manual | Yes | No |
| Federated queries | Yes | Yes | Yes (via engines) | Yes (Iceberg) | Yes (focus) |
| Storage/compute separation | Yes (3.x) | Partial | Limited | Yes | Yes (compute only) |
| Best For | Real-time + lake OLAP | Self-serve BI | Raw scan speed | Managed DW | Federated SQL |

## FAQ
**Q: StarRocks vs Apache Doris — same project?**
A: StarRocks forked from an early Doris version and diverged significantly. StarRocks usually wins performance benchmarks; Doris has the ASF governance and broader community. Try both on your workload.

**Q: StarRocks vs ClickHouse?**
A: ClickHouse is simpler to run on a single node and often wins pure scan benchmarks. StarRocks has better concurrency, better join performance, MV rewrites, MySQL protocol, and federated lake queries.

**Q: Is storage-compute separation important?**
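A: For many deployments, yes. From 3.0 onward StarRocks can keep primary data in object storage such as S3 or GCS and scale compute nodes elastically. The design is similar to Snowflake's and can cut costs noticeably for spiky workloads, because compute can shrink when the cluster is idle.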
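
A rough sketch of pointing a cluster at object storage is shown below. It assumes a cluster that was started in shared-data mode on a version that supports storage volumes; the volume name, bucket, region, and credentials are placeholders, not values from this document.

```sql
-- Hypothetical bucket, region, and credentials; replace with your own
CREATE STORAGE VOLUME s3_volume
TYPE = S3
LOCATIONS = ("s3://my-bucket/starrocks/")
PROPERTIES (
  "enabled" = "true",
  "aws.s3.region" = "us-west-2",
  "aws.s3.endpoint" = "https://s3.us-west-2.amazonaws.com",
  "aws.s3.use_aws_sdk_default_behavior" = "false",
  "aws.s3.use_instance_profile" = "false",
  "aws.s3.access_key" = "<access_key>",
  "aws.s3.secret_key" = "<secret_key>"
);

-- New tables are stored on this volume unless they name another one
SET s3_volume AS DEFAULT STORAGE VOLUME;
```

Check the documentation for your release before relying on this, since the available properties differ by cloud provider and version.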

**Q: Is StarRocks open source?**
A: Yes, Apache-2.0. CelerData provides a managed/cloud version for those who prefer not to self-host.

## Sources
- GitHub: https://github.com/StarRocks/starrocks
- Docs: https://docs.starrocks.io
- Company: CelerData
- License: Apache-2.0

---
Source: https://tokrepo.com/en/workflows/0982a4ff-37d2-11f1-9bc6-00163e2b0d79
Author: AI Open Source