What ClickHouse Does
- Columnar Storage: Data stored column-by-column, enabling massive compression and fast aggregations
- SQL: Familiar SQL interface with extensions for analytics
- Real-time Inserts: Ingest millions of rows per second
- Distributed: Horizontal scaling across cluster with sharding and replication
- Materialized Views: Pre-compute aggregations for instant queries
- MergeTree Engines: Multiple specialized table engines for different use cases
- Compression: 10x+ compression ratios with LZ4, ZSTD, and custom codecs
- Data Import: Read from Kafka, S3, HDFS, MySQL, PostgreSQL, and more
- Time-Series: Optimized for time-series workloads with partitioning
- Vector Search: Built-in vector similarity search
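The first bullet is the core idea. A toy Python sketch (illustrative only, not ClickHouse code; all names made up) of why column-oriented layout makes aggregations cheap:

```python
# Row layout: one record per event. An aggregation over "revenue"
# must touch every field of every record.
rows = [
    {"user_id": 1, "revenue": 10.0, "country": "US"},
    {"user_id": 2, "revenue": 99.99, "country": "GB"},
    {"user_id": 1, "revenue": 5.0, "country": "US"},
]
row_sum = sum(r["revenue"] for r in rows)

# Column layout: one array per field. The same aggregation reads only
# the revenue column, a contiguous array that also compresses far better.
columns = {
    "user_id": [1, 2, 1],
    "revenue": [10.0, 99.99, 5.0],
    "country": ["US", "GB", "US"],
}
col_sum = sum(columns["revenue"])

print(round(col_sum, 2))  # prints 114.99
```

Same answer either way; the columnar version simply reads a fraction of the bytes, which is where ClickHouse's speed on wide tables comes from.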
Architecture
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Apps / │────▶│ ClickHouse │────▶│ Columnar │
│ BI Tools │SQL │ Server │ │ Storage │
└──────────────┘ │ (C++) │ │ (MergeTree) │
└──────┬───────┘ └──────────────┘
│
┌─────────────┼─────────────┐
│ │ │
┌──────┴──┐ ┌─────┴───┐ ┌─────┴───┐
│ Kafka │ │ S3 │ │PostgreSQL│
│ (Stream)│ │(Parquet)│ │ (Sync) │
└─────────┘ └─────────┘ └──────────┘
Self-Hosting
Docker Compose
services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"  # HTTP interface
      - "9000:9000"  # Native client
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./config.xml:/etc/clickhouse-server/config.d/custom.xml
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    restart: unless-stopped
volumes:
  clickhouse-data:
  clickhouse-logs:
Basic Usage
Create Table
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_name String,
    page_url String,
    properties String,
    country_code FixedString(2),
    revenue Decimal(10, 2)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id)
TTL timestamp + INTERVAL 90 DAY;
Insert Data
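ClickHouse ingests fastest when rows arrive in large batches rather than one row per INSERT. A hypothetical Python sketch (the `events` rows and tiny batch size are made up; real batches would be tens of thousands of rows) that packs rows into CSV payloads suitable for `INSERT ... FORMAT CSV`:

```python
import csv
import io

def to_csv(batch):
    """Serialize one batch of tuples to a CSV payload string."""
    buf = io.StringIO()
    csv.writer(buf).writerows(batch)
    return buf.getvalue()

def csv_batches(rows, batch_size):
    """Yield CSV payloads, each covering at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield to_csv(rows[start:start + batch_size])

events = [
    ("2024-04-10 12:30:00", 1001, "page_view", "/home", "{}", "US", 0),
    ("2024-04-10 12:31:00", 1002, "purchase", "/checkout", '{"items":3}', "GB", 99.99),
    ("2024-04-10 12:32:00", 1003, "click", "/home", "{}", "DE", 0),
]
payloads = list(csv_batches(events, batch_size=2))
print(len(payloads))  # 2 payloads: rows 1-2, then row 3
```

Each payload could then be sent as one bulk insert, for example as the body of a POST to the HTTP interface on port 8123.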
-- Inline VALUES insert (prefer large batches of thousands of rows for throughput)
INSERT INTO events VALUES
    ('2024-04-10 12:30:00', 1001, 'page_view', '/home', '{}', 'US', 0),
    ('2024-04-10 12:31:00', 1002, 'purchase', '/checkout', '{"items":3}', 'GB', 99.99);
-- Insert from a local CSV file (via clickhouse-client, very fast)
INSERT INTO events FROM INFILE '/data/events.csv' FORMAT CSV;
-- Insert from S3
INSERT INTO events SELECT * FROM s3('https://bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');
Fast Analytics Queries
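The funnel query later in this section relies on `max(event_name = 'x')` per user, which reads as "did this user ever do x". A small Python mirror of that logic on made-up events, to make the boolean trick concrete:

```python
from collections import defaultdict

# Hypothetical (user_id, event_name) pairs.
events = [
    (1001, "page_view"), (1001, "click"), (1001, "purchase"),
    (1002, "page_view"), (1002, "click"),
    (1003, "page_view"),
]

# max(event_name = 'x') per user in SQL == "did this user ever do x".
stages = {"page_view": 0, "click": 1, "purchase": 2}
flags = defaultdict(lambda: [0, 0, 0])  # [has_view, has_click, has_purchase]
for user_id, name in events:
    flags[user_id][stages[name]] = 1

# Summing the per-user flags counts users, not raw events.
views = sum(f[0] for f in flags.values())
clicks = sum(f[1] for f in flags.values())
purchases = sum(f[2] for f in flags.values())
click_rate_pct = round(clicks / views * 100, 2)
print(views, clicks, purchases, click_rate_pct)  # 3 2 1 66.67
```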
-- Count by country (scans billions of rows in seconds)
SELECT
    country_code,
    count() AS events,
    uniq(user_id) AS unique_users,
    sum(revenue) AS total_revenue
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY country_code
ORDER BY total_revenue DESC
LIMIT 10;
-- Time series with 1-minute buckets
SELECT
    toStartOfMinute(timestamp) AS minute,
    event_name,
    count() AS event_count
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute, event_name
ORDER BY minute;
-- Funnel analysis
SELECT
    sum(has_view) AS views,
    sum(has_click) AS clicks,
    sum(has_purchase) AS purchases,
    round(sum(has_click) / sum(has_view) * 100, 2) AS click_rate_pct,
    round(sum(has_purchase) / sum(has_click) * 100, 2) AS conversion_rate_pct
FROM (
    SELECT
        user_id,
        max(event_name = 'page_view') AS has_view,
        max(event_name = 'click') AS has_click,
        max(event_name = 'purchase') AS has_purchase
    FROM events
    WHERE timestamp >= today()
    GROUP BY user_id
);
Key Features
Materialized Views (Pre-computed Aggregates)
-- Create a materialized view that is updated on every insert
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, country_code)
AS SELECT
    toStartOfHour(timestamp) AS hour,
    country_code,
    count() AS events,
    sum(revenue) AS revenue
FROM events
GROUP BY hour, country_code;
-- Queries now read the small pre-aggregated table. Re-aggregate at read time:
-- SummingMergeTree collapses duplicate keys only during background merges
SELECT hour, country_code, sum(events) AS events, sum(revenue) AS revenue
FROM hourly_stats_mv
WHERE hour >= today() - 7
GROUP BY hour, country_code;
Compression
CREATE TABLE events_compressed (
    timestamp DateTime CODEC(DoubleDelta, ZSTD),
    user_id UInt64 CODEC(Delta, LZ4),
    event_name LowCardinality(String),
    data String CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY timestamp;
Typical compression ratios:
- Timestamps: 100x+ (DoubleDelta)
- Integers: 10-50x (Delta + LZ4)
- Strings: 5-20x (LZ4/ZSTD)
- LowCardinality strings: 100x+
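A toy sketch of why DoubleDelta works so well on timestamps: second-order differences of regularly sampled times are almost all zero, leaving next to nothing to store (illustrative only, not the real codec):

```python
def double_delta(values):
    """Second-order differences: roughly what a DoubleDelta-style codec stores."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

# Unix timestamps sampled every 60 seconds.
ts = [1_712_750_400 + 60 * i for i in range(6)]
dd = double_delta(ts)
print(dd)  # [0, 0, 0, 0] -- a run of zeros compresses extremely well
```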
Table Engines
- MergeTree — Default, general-purpose
- ReplacingMergeTree — Deduplicates rows with the same key during background merges
- SummingMergeTree — Sums numeric columns for rows with the same key
- AggregatingMergeTree — Stores partial aggregation states
- CollapsingMergeTree — Models updates/deletes via sign rows
- ReplicatedMergeTree — Multi-node replication
- Distributed — Queries across cluster shards
- Kafka — Consumes from Kafka topics
- S3 — Reads/writes S3 files
Kafka Integration
-- Stream data from Kafka
CREATE TABLE kafka_events (
    timestamp DateTime,
    user_id UInt64,
    event String
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse',
    kafka_format = 'JSONEachRow';
-- Materialized view that persists each consumed batch into the events table
-- (selected columns are matched to the target by name; unselected columns get defaults)
CREATE MATERIALIZED VIEW events_consumer
TO events AS
SELECT timestamp, user_id, event AS event_name
FROM kafka_events;
ClickHouse vs Alternatives
| Feature | ClickHouse | PostgreSQL | BigQuery | Snowflake |
|---|---|---|---|---|
| Open Source | Yes (Apache-2.0) | Yes | No | No |
| Storage | Columnar | Row-based | Columnar | Columnar |
| Query Speed (analytics) | Extremely fast | Slow (large data) | Fast | Fast |
| Cost | Free (self-host) | Free | $$ per query | $$ compute+storage |
| SQL | ANSI + extensions | Full ANSI | Standard | Standard |
| Real-time inserts | Yes (millions/sec) | OK | Limited | Streaming |
| Best for | Analytics, logs | OLTP | Analytics | Analytics |
FAQ
Q: Is ClickHouse a replacement for PostgreSQL? A: No, it is a complement. PostgreSQL is built for transactional workloads (OLTP), ClickHouse for analytical queries (OLAP). A common architecture keeps the primary application on PostgreSQL and syncs data into ClickHouse for analytics.
Q: How fast is it? A: A single node can sustain 1M+ row inserts per second and aggregate billions of rows in seconds; clusters scale to petabytes. Cloudflare uses ClickHouse to process tens of millions of HTTP request logs per second.
Q: What is the learning curve like? A: Basic SQL works much like PostgreSQL, so getting started is easy. Advanced features (table engine selection, materialized views, partitioning strategy) take deeper study. The official documentation is thorough.
Sources & Credits
- GitHub: ClickHouse/ClickHouse — 46.8K+ ⭐ | Apache-2.0
- Website: clickhouse.com