Apr 10, 2026

ClickHouse — Open Source Real-Time Analytics Database

ClickHouse is a lightning-fast, open-source column-oriented database for real-time analytics. Query billions of rows in milliseconds with SQL. Used by Cloudflare, Uber, eBay.

Quick Use

Use it first, then decide how deep to go

Copy and run the commands below to get a server running before digging into the details.

docker run -d --name clickhouse \
  -p 8123:8123 -p 9000:9000 \
  -v clickhouse-data:/var/lib/clickhouse \
  clickhouse/clickhouse-server:latest

Connect via HTTP or native client:

curl "http://localhost:8123/?query=SELECT+version()"
# Or
docker exec -it clickhouse clickhouse-client
Intro

ClickHouse is an open-source, column-oriented database management system built for real-time analytical processing (OLAP) of huge datasets. Originally developed at Yandex, it can query billions of rows in milliseconds using SQL, making it the go-to choice for analytics, time-series data, logs, and metrics at scale.

With 46.8K+ GitHub stars and Apache-2.0 license, ClickHouse is used by Cloudflare, Uber, eBay, Spotify, and thousands of other companies to power real-time dashboards, analytics platforms, and observability stacks.

What ClickHouse Does

  • Columnar Storage: Data stored column-by-column, enabling massive compression and fast aggregations
  • SQL: Familiar SQL interface with extensions for analytics
  • Real-time Inserts: Ingest millions of rows per second
  • Distributed: Horizontal scaling across cluster with sharding and replication
  • Materialized Views: Pre-compute aggregations for instant queries
  • MergeTree Engines: Multiple specialized table engines for different use cases
  • Compression: 10x+ compression ratios with LZ4, ZSTD, and custom codecs
  • Data Import: Read from Kafka, S3, HDFS, MySQL, PostgreSQL, and more
  • Time-Series: Optimized for time-series workloads with partitioning
  • Vector Search: Built-in vector similarity search
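The vector-search support in the last bullet is exposed through ordinary SQL distance functions. A minimal sketch, assuming a hypothetical items table with an embedding Array(Float32) column:

```sql
-- Hypothetical table: each row stores an embedding as Array(Float32)
SELECT
    id,
    L2Distance(embedding, [0.1, 0.2, 0.3]) AS dist
FROM items
ORDER BY dist ASC
LIMIT 5;
```

cosineDistance() can be substituted for L2Distance() when vectors are normalized for semantic similarity.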

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Apps /      │────▶│  ClickHouse  │────▶│  Columnar    │
│  BI Tools    │ SQL │  Server      │     │  Storage     │
└──────────────┘     │  (C++)       │     │  (MergeTree) │
                     └──────┬───────┘     └──────────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
       ┌──────┴──┐   ┌─────┴───┐   ┌─────┴───┐
       │ Kafka   │   │   S3    │   │PostgreSQL│
       │ (Stream)│   │(Parquet)│   │  (Sync)  │
       └─────────┘   └─────────┘   └──────────┘
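The PostgreSQL arm of the diagram can be reached directly from SQL via the postgresql() table function; a sketch with placeholder host, database, and credentials:

```sql
-- Host, database, table, and credentials below are placeholders
SELECT count()
FROM postgresql('pg-host:5432', 'appdb', 'orders', 'pg_user', 'pg_password');
```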

Self-Hosting

Docker Compose

services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"    # HTTP interface
      - "9000:9000"    # Native client
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./config.xml:/etc/clickhouse-server/config.d/custom.xml
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    restart: unless-stopped

volumes:
  clickhouse-data:
  clickhouse-logs:
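Out of the box the default user has no password, so it is worth locking it down before exposing the ports. A minimal sketch of a users.d override (filename and password are placeholders), mounted alongside the config above:

```xml
<!-- ./users-override.xml, mounted to /etc/clickhouse-server/users.d/override.xml -->
<clickhouse>
    <users>
        <default>
            <!-- "changeme" is a placeholder; prefer password_sha256_hex in production -->
            <password>changeme</password>
        </default>
    </users>
</clickhouse>
```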

Basic Usage

Create Table

CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_name String,
    page_url String,
    properties String,
    country_code FixedString(2),
    revenue Decimal(10, 2)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id)
TTL timestamp + INTERVAL 90 DAY;

Insert Data

-- Multi-row insert (batch many rows per INSERT for best performance)
INSERT INTO events VALUES
  ('2024-04-10 12:30:00', 1001, 'page_view', '/home', '{}', 'US', 0),
  ('2024-04-10 12:31:00', 1002, 'purchase', '/checkout', '{"items":3}', 'GB', 99.99);

-- Insert from CSV (super fast)
INSERT INTO events FROM INFILE '/data/events.csv' FORMAT CSV;

-- Insert from S3
INSERT INTO events SELECT * FROM s3('https://bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');
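Applications typically push batches over the HTTP interface using the JSONEachRow format (one JSON object per line). A sketch of building such a payload for the events table above; the rows are illustrative:

```python
import json

# Rows matching the events schema above; values are illustrative
rows = [
    {"timestamp": "2024-04-10 12:30:00", "user_id": 1001,
     "event_name": "page_view", "page_url": "/home", "properties": "{}",
     "country_code": "US", "revenue": 0},
    {"timestamp": "2024-04-10 12:31:00", "user_id": 1002,
     "event_name": "purchase", "page_url": "/checkout",
     "properties": '{"items":3}', "country_code": "GB", "revenue": 99.99},
]

# JSONEachRow: one JSON object per line, newline-separated
payload = "\n".join(json.dumps(r) for r in rows)

# POST the payload to the HTTP interface, e.g.:
#   POST http://localhost:8123/?query=INSERT%20INTO%20events%20FORMAT%20JSONEachRow
```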

Fast Analytics Queries

-- Count by country (scans billions of rows in seconds)
SELECT
    country_code,
    count() AS events,
    uniq(user_id) AS unique_users,
    sum(revenue) AS total_revenue
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY country_code
ORDER BY total_revenue DESC
LIMIT 10;

-- Time series with 1-minute buckets
SELECT
    toStartOfMinute(timestamp) AS minute,
    event_name,
    count() AS event_count
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute, event_name
ORDER BY minute;

-- Funnel analysis
SELECT
    sum(has_view) AS views,
    sum(has_click) AS clicks,
    sum(has_purchase) AS purchases,
    round(sum(has_click) / sum(has_view) * 100, 2) AS click_rate_pct,
    round(sum(has_purchase) / sum(has_click) * 100, 2) AS conversion_rate_pct
FROM (
    SELECT
        user_id,
        max(event_name = 'page_view') AS has_view,
        max(event_name = 'click') AS has_click,
        max(event_name = 'purchase') AS has_purchase
    FROM events
    WHERE timestamp >= today()
    GROUP BY user_id
);
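Note that uniq() in the first query is an approximate distinct count (typical error around 1%). When exact numbers matter, uniqExact() trades speed and memory for precision:

```sql
-- Approximate vs exact distinct counts over the same window
SELECT
    uniq(user_id) AS approx_users,
    uniqExact(user_id) AS exact_users
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY;
```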

Key Features

Materialized Views (Pre-computed Aggregates)

-- Create materialized view that updates in real-time
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, country_code)
AS SELECT
    toStartOfHour(timestamp) AS hour,
    country_code,
    count() AS events,
    sum(revenue) AS revenue
FROM events
GROUP BY hour, country_code;

-- Query is now instant
SELECT * FROM hourly_stats_mv WHERE hour >= today() - 7;
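One caveat: SummingMergeTree collapses rows with the same key during background merges, which run asynchronously, so a raw SELECT may still see not-yet-merged partial rows. Aggregating at query time guarantees exact totals:

```sql
-- Re-aggregate to get exact totals regardless of merge state
SELECT
    hour,
    country_code,
    sum(events) AS events,
    sum(revenue) AS revenue
FROM hourly_stats_mv
WHERE hour >= today() - 7
GROUP BY hour, country_code;
```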

Compression

CREATE TABLE events_compressed (
    timestamp DateTime CODEC(DoubleDelta, ZSTD),
    user_id UInt64 CODEC(DoubleDelta, LZ4),
    event_name LowCardinality(String),
    data String CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY timestamp;

Typical compression ratios:

  • Timestamps: 100x+ (DoubleDelta)
  • Integers: 10-50x (Delta + LZ4)
  • Strings: 5-20x (LZ4/ZSTD)
  • LowCardinality strings: 100x+
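The intuition behind the DoubleDelta numbers above: for regularly spaced values like timestamps, storing the change-of-change turns almost everything into zeros, which compress extremely well. A toy sketch of the idea (not ClickHouse's actual codec implementation):

```python
def double_delta_encode(values):
    """Toy double-delta: keep the first value and first delta,
    then store only the change between consecutive deltas."""
    if len(values) < 2:
        return list(values)
    deltas = [b - a for a, b in zip(values, values[1:])]
    dd = [deltas[i] - deltas[i - 1] for i in range(1, len(deltas))]
    return [values[0], deltas[0]] + dd

# Evenly spaced timestamps (60 s apart) collapse to trailing zeros
ts = [1712750400, 1712750460, 1712750520, 1712750580]
encoded = double_delta_encode(ts)
```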

Table Engines

MergeTree — Default, general-purpose
ReplacingMergeTree — Deduplicate on insert
SummingMergeTree — Auto-sum rows with same key
AggregatingMergeTree — Advanced aggregations
CollapsingMergeTree — Handle updates/deletes
ReplicatedMergeTree — Multi-node replication
Distributed — Query across cluster shards
Kafka — Consume from Kafka topics
S3 — Read/write S3 files
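As an example of choosing from the list above, ReplacingMergeTree keeps only the latest row per sorting key. A sketch with a hypothetical users_latest table:

```sql
CREATE TABLE users_latest (
    user_id UInt64,
    email String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- Deduplication happens on background merges;
-- FINAL forces it at query time at some performance cost
SELECT * FROM users_latest FINAL WHERE user_id = 42;
```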

Kafka Integration

-- Stream data from Kafka
CREATE TABLE kafka_events (
    timestamp DateTime,
    user_id UInt64,
    event String
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse',
    kafka_format = 'JSONEachRow';

-- Materialized view to persist
CREATE MATERIALIZED VIEW events_consumer
TO events AS
SELECT * FROM kafka_events;

ClickHouse vs Alternatives

| Feature                  | ClickHouse          | PostgreSQL        | BigQuery     | Snowflake          |
|--------------------------|---------------------|-------------------|--------------|--------------------|
| Open Source              | Yes (Apache-2.0)    | Yes               | No           | No                 |
| Storage                  | Columnar            | Row-based         | Columnar     | Columnar           |
| Query speed (analytics)  | Extremely fast      | Slow (large data) | Fast         | Fast               |
| Cost                     | Free (self-host)    | Free              | $$ per query | $$ compute+storage |
| SQL                      | ANSI + extensions   | Full ANSI         | Standard     | Standard           |
| Real-time inserts        | Yes (millions/sec)  | OK                | Limited      | Streaming          |
| Best for                 | Analytics, logs     | OLTP              | Analytics    | Analytics          |

FAQ

Q: Is ClickHouse a replacement for PostgreSQL? A: No, it's a complement. PostgreSQL is built for transactional workloads (OLTP); ClickHouse is built for analytical queries (OLAP). A common architecture keeps PostgreSQL as the primary business database and syncs data into ClickHouse for analytics.

Q: How fast is it? A: A single node can sustain 1M+ row inserts per second and aggregate billions of rows in seconds. A cluster scales to petabyte-level data. Cloudflare uses ClickHouse to process tens of millions of HTTP request logs per second.

Q: What is the learning curve like? A: Basic SQL works much as it does in PostgreSQL, so getting started is easy. Advanced features (choosing table engines, materialized views, partitioning strategies) take deeper study. The official documentation is very thorough.
