# ClickHouse — Open Source Real-Time Analytics Database

> ClickHouse is a lightning-fast, open-source column-oriented database for real-time analytics. Query billions of rows in milliseconds with SQL. Used by Cloudflare, Uber, eBay.

## Quick Use

Start a server with Docker:

```bash
docker run -d --name clickhouse \
  -p 8123:8123 -p 9000:9000 \
  -v clickhouse-data:/var/lib/clickhouse \
  clickhouse/clickhouse-server:latest
```

Connect via HTTP or the native client:

```bash
curl "http://localhost:8123/?query=SELECT+version()"
# Or
docker exec -it clickhouse clickhouse-client
```

## Intro

**ClickHouse** is an open-source, column-oriented database management system built for real-time analytical processing (OLAP) of huge datasets. Originally developed at Yandex, it can query billions of rows in milliseconds using SQL, making it a go-to choice for analytics, time-series data, logs, and metrics at scale.

With 46.8K+ GitHub stars and an Apache-2.0 license, ClickHouse is used by Cloudflare, Uber, eBay, Spotify, and thousands of other companies to power real-time dashboards, analytics platforms, and observability stacks.
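To make the "column-oriented" idea concrete, here is a toy sketch in plain Python. It is purely illustrative (this is not how ClickHouse stores data internally); it only shows why summing one column is cheap when values are laid out column by column.

```python
# Toy sketch (plain Python lists, not ClickHouse internals) contrasting
# row-oriented and column-oriented layouts for an aggregation query.
rows = [  # row-oriented: one tuple per event
    ("2024-04-10", 1001, "page_view", 0.00),
    ("2024-04-10", 1002, "purchase", 99.99),
    ("2024-04-11", 1001, "purchase", 10.00),
]

# Column-oriented: one array per column; values of a single type sit
# together, which is also what makes compression codecs so effective
columns = {
    "date":    [r[0] for r in rows],
    "user_id": [r[1] for r in rows],
    "event":   [r[2] for r in rows],
    "revenue": [r[3] for r in rows],
}

# Row store: every row (and every field of it) is touched just to
# aggregate one column
row_total = sum(r[3] for r in rows)

# Column store: only the single relevant array is scanned
col_total = sum(columns["revenue"])

print(f"total revenue: {col_total:.2f}")  # total revenue: 109.99
```

At real scale the column layout also keeps the scan sequential in memory and skips the other columns' bytes entirely, which is where the orders-of-magnitude speedups come from.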
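The HTTP interface used by the `curl` command above is easy to drive from any language. A minimal Python sketch, assuming the default port 8123 and a hypothetical helper name `clickhouse_url`:

```python
# Build a ClickHouse HTTP-interface query URL from Python.
# `clickhouse_url` is a hypothetical helper (not part of any client
# library); it targets the server started by the Docker command above.
from urllib.parse import urlencode


def clickhouse_url(query: str, host: str = "localhost", port: int = 8123) -> str:
    """URL-encode a SQL query for ClickHouse's HTTP interface."""
    return f"http://{host}:{port}/?{urlencode({'query': query})}"


url = clickhouse_url("SELECT version()")
print(url)  # http://localhost:8123/?query=SELECT+version%28%29

# With a live server, fetch the result as plain text:
# urllib.request.urlopen(url).read().decode()
```

Any HTTP client works the same way; the native protocol on port 9000 requires a dedicated client library instead.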
## What ClickHouse Does

- **Columnar Storage**: Data is stored column by column, enabling heavy compression and fast aggregations
- **SQL**: Familiar SQL interface with extensions for analytics
- **Real-time Inserts**: Ingest millions of rows per second
- **Distributed**: Horizontal scaling across a cluster with sharding and replication
- **Materialized Views**: Pre-compute aggregations for instant queries
- **MergeTree Engines**: Multiple specialized table engines for different use cases
- **Compression**: 10x+ compression ratios with LZ4, ZSTD, and custom codecs
- **Data Import**: Read from Kafka, S3, HDFS, MySQL, PostgreSQL, and more
- **Time-Series**: Optimized for time-series workloads with partitioning
- **Vector Search**: Built-in vector similarity search

## Architecture

```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Apps /     │─────▶│  ClickHouse  │─────▶│   Columnar   │
│   BI Tools   │ SQL  │    Server    │      │   Storage    │
└──────────────┘      │    (C++)     │      │ (MergeTree)  │
                      └──────┬───────┘      └──────────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
        ┌─────┴────┐   ┌─────┴────┐   ┌─────┴─────┐
        │  Kafka   │   │    S3    │   │PostgreSQL │
        │ (Stream) │   │(Parquet) │   │  (Sync)   │
        └──────────┘   └──────────┘   └───────────┘
```

## Self-Hosting

### Docker Compose

```yaml
services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # Native client
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./config.xml:/etc/clickhouse-server/config.d/custom.xml
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    restart: unless-stopped

volumes:
  clickhouse-data:
  clickhouse-logs:
```

## Basic Usage

### Create Table

```sql
CREATE TABLE events (
    timestamp    DateTime,
    user_id      UInt64,
    event_name   String,
    page_url     String,
    properties   String,
    country_code FixedString(2),
    revenue      Decimal(10, 2)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id)
TTL timestamp + INTERVAL 90 DAY;
```

### Insert Data

```sql
-- Single insert
-- (batch inserts are preferred for throughput)
INSERT INTO events VALUES
    ('2024-04-10 12:30:00', 1001, 'page_view', '/home', '{}', 'US', 0),
    ('2024-04-10 12:31:00', 1002, 'purchase', '/checkout', '{"items":3}', 'GB', 99.99);

-- Insert from a local CSV file (very fast)
INSERT INTO events FROM INFILE '/data/events.csv' FORMAT CSV;

-- Insert from S3
INSERT INTO events
SELECT * FROM s3('https://bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');
```

### Fast Analytics Queries

```sql
-- Count by country (scans billions of rows in seconds)
SELECT
    country_code,
    count() AS events,
    uniq(user_id) AS unique_users,
    sum(revenue) AS total_revenue
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY country_code
ORDER BY total_revenue DESC
LIMIT 10;

-- Time series with 1-minute buckets
SELECT
    toStartOfMinute(timestamp) AS minute,
    event_name,
    count() AS event_count
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute, event_name
ORDER BY minute;

-- Funnel analysis
SELECT
    sum(has_view) AS views,
    sum(has_click) AS clicks,
    sum(has_purchase) AS purchases,
    round(sum(has_click) / sum(has_view) * 100, 2) AS click_rate_pct,
    round(sum(has_purchase) / sum(has_click) * 100, 2) AS conversion_rate_pct
FROM (
    SELECT
        user_id,
        max(event_name = 'page_view') AS has_view,
        max(event_name = 'click') AS has_click,
        max(event_name = 'purchase') AS has_purchase
    FROM events
    WHERE timestamp >= today()
    GROUP BY user_id
);
```

## Key Features

### Materialized Views (Pre-computed Aggregates)

```sql
-- Create a materialized view that updates in real time
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, country_code)
AS SELECT
    toStartOfHour(timestamp) AS hour,
    country_code,
    count() AS events,
    sum(revenue) AS revenue
FROM events
GROUP BY hour, country_code;

-- Querying it is now near-instant
SELECT * FROM hourly_stats_mv WHERE hour >= today() - 7;
```

### Compression

```sql
CREATE TABLE events_compressed (
    timestamp DateTime CODEC(DoubleDelta, ZSTD),
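    -- DoubleDelta encodes the difference between successive deltas, so
    -- evenly spaced values like timestamps shrink to almost nothing
    -- before the general-purpose ZSTD/LZ4 pass runs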
    user_id UInt64 CODEC(DoubleDelta, LZ4),
    event_name LowCardinality(String),
    data String CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY timestamp;
```

Typical compression ratios:

- Timestamps: 100x+ (DoubleDelta)
- Integers: 10-50x (Delta + LZ4)
- Strings: 5-20x (LZ4/ZSTD)
- LowCardinality strings: 100x+

### Table Engines

```
MergeTree            — Default, general-purpose
ReplacingMergeTree   — Deduplicate during background merges
SummingMergeTree     — Auto-sum rows with the same key
AggregatingMergeTree — Advanced aggregation states
CollapsingMergeTree  — Handle updates/deletes
ReplicatedMergeTree  — Multi-node replication
Distributed          — Query across cluster shards
Kafka                — Consume from Kafka topics
S3                   — Read/write S3 files
```

### Kafka Integration

```sql
-- Stream data from Kafka
CREATE TABLE kafka_events (
    timestamp DateTime,
    user_id UInt64,
    event_name String
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse',
    kafka_format = 'JSONEachRow';

-- Materialized view that persists the stream into the events table
-- (columns are matched by name; missing columns get their defaults)
CREATE MATERIALIZED VIEW events_consumer TO events AS
SELECT * FROM kafka_events;
```

## ClickHouse vs Alternatives

| Feature | ClickHouse | PostgreSQL | BigQuery | Snowflake |
|---------|-----------|-----------|----------|-----------|
| Open Source | Yes (Apache-2.0) | Yes | No | No |
| Storage | Columnar | Row-based | Columnar | Columnar |
| Query Speed (analytics) | Extremely fast | Slow (large data) | Fast | Fast |
| Cost | Free (self-host) | Free | $$ per query | $$ compute + storage |
| SQL | ANSI + extensions | Full ANSI | Standard | Standard |
| Real-time inserts | Yes (millions/sec) | OK | Limited | Streaming |
| Best for | Analytics, logs | OLTP | Analytics | Analytics |

## FAQ

**Q: Is ClickHouse a replacement for PostgreSQL?**
A: No; it is a complement. PostgreSQL is built for transactional workloads (OLTP), ClickHouse for analytical queries (OLAP). A common architecture keeps primary business data in PostgreSQL and syncs it into ClickHouse for analytics.

**Q: How fast is it?**
A: A single node can sustain 1M+ row inserts per second and aggregate billions of rows within seconds; a cluster scales to petabytes. Cloudflare uses ClickHouse to process tens of millions of HTTP request logs per second.

**Q: What is the learning curve like?**
A: The basic SQL is familiar if you know PostgreSQL.
Getting started is easy; advanced features (table engine choice, materialized views, partitioning strategy) require deeper study. The official documentation is excellent.

## Sources & Credits

- GitHub: [ClickHouse/ClickHouse](https://github.com/ClickHouse/ClickHouse) — 46.8K+ ⭐ | Apache-2.0
- Official site: [clickhouse.com](https://clickhouse.com)

---

Source: https://tokrepo.com/en/workflows/2fce985b-3535-11f1-9bc6-00163e2b0d79
Author: AI Open Source