Configs · 2026-04-10 · 1 min read

ClickHouse — Open Source Real-Time Analytics Database

ClickHouse is a lightning-fast, open-source column-oriented database for real-time analytics. Query billions of rows in milliseconds with SQL. Used by Cloudflare, Uber, eBay.

AI · Open Source · Community
Quick Start

Try it first, then decide whether to dig deeper.

This section should let both users and agents know what to copy as a first step, what to install, and where it goes.

docker run -d --name clickhouse \
  -p 8123:8123 -p 9000:9000 \
  -v clickhouse-data:/var/lib/clickhouse \
  clickhouse/clickhouse-server:latest

Connect via HTTP or native client:

curl "http://localhost:8123/?query=SELECT+version()"
# Or
docker exec -it clickhouse clickhouse-client
Introduction

ClickHouse is an open-source, column-oriented database management system built for real-time analytical processing (OLAP) of huge datasets. Originally developed at Yandex, it can query billions of rows in milliseconds using SQL, making it the go-to choice for analytics, time-series data, logs, and metrics at scale.

With 46.8K+ GitHub stars and Apache-2.0 license, ClickHouse is used by Cloudflare, Uber, eBay, Spotify, and thousands of other companies to power real-time dashboards, analytics platforms, and observability stacks.

What ClickHouse Does

  • Columnar Storage: Data stored column-by-column, enabling massive compression and fast aggregations
  • SQL: Familiar SQL interface with extensions for analytics
  • Real-time Inserts: Ingest millions of rows per second
  • Distributed: Horizontal scaling across cluster with sharding and replication
  • Materialized Views: Pre-compute aggregations for instant queries
  • MergeTree Engines: Multiple specialized table engines for different use cases
  • Compression: 10x+ compression ratios with LZ4, ZSTD, and custom codecs
  • Data Import: Read from Kafka, S3, HDFS, MySQL, PostgreSQL, and more
  • Time-Series: Optimized for time-series workloads with partitioning
  • Vector Search: Built-in vector similarity search
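As one illustration of the features above, vector similarity search can be expressed with ordinary SQL distance functions. A minimal sketch; the `docs` table and `embedding` column are hypothetical:

```sql
-- Hypothetical table with small embedding vectors, for illustration
CREATE TABLE docs (
    id UInt64,
    title String,
    embedding Array(Float32)
) ENGINE = MergeTree()
ORDER BY id;

-- Brute-force nearest neighbours by cosine distance
SELECT id, title, cosineDistance(embedding, [0.1, 0.9, 0.2]) AS dist
FROM docs
ORDER BY dist ASC
LIMIT 5;
```

For large tables, ClickHouse also offers approximate vector indexes, but a plain `ORDER BY distance LIMIT n` scan is often fast enough to start with.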

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Apps /      │────▶│  ClickHouse  │────▶│  Columnar    │
│  BI Tools    │ SQL │  Server      │     │  Storage     │
└──────────────┘     │  (C++)       │     │  (MergeTree) │
                     └──────┬───────┘     └──────────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
        ┌──────┴──┐   ┌─────┴───┐   ┌─────┴────┐
       │ Kafka   │   │   S3    │   │PostgreSQL│
       │ (Stream)│   │(Parquet)│   │  (Sync)  │
       └─────────┘   └─────────┘   └──────────┘
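The integration arms of the diagram can be exercised directly from SQL with table functions. A sketch; the hostnames, bucket, and credentials are placeholders:

```sql
-- Read Parquet files straight from S3 (bucket and path are placeholders)
SELECT count()
FROM s3('https://bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');

-- Query a live PostgreSQL table without copying it first
SELECT *
FROM postgresql('pg-host:5432', 'shop', 'orders', 'reader', 'secret')
LIMIT 10;
```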

Self-Hosting

Docker Compose

services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"    # HTTP interface
      - "9000:9000"    # Native client
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./config.xml:/etc/clickhouse-server/config.d/custom.xml
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    restart: unless-stopped

volumes:
  clickhouse-data:
  clickhouse-logs:

Basic Usage

Create Table

CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_name String,
    page_url String,
    properties String,
    country_code FixedString(2),
    revenue Decimal(10, 2)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id)
TTL timestamp + INTERVAL 90 DAY;
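Once data has been inserted, the effect of `PARTITION BY` can be verified through the `system.parts` table (a sketch; one row appears per active data part):

```sql
-- Active parts for the events table, grouped by monthly partition
SELECT partition, name, rows, formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE table = 'events' AND active
ORDER BY partition;
```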

Insert Data

-- Inline insert; batch many rows per INSERT for best performance
INSERT INTO events VALUES
  ('2024-04-10 12:30:00', 1001, 'page_view', '/home', '{}', 'US', 0),
  ('2024-04-10 12:31:00', 1002, 'purchase', '/checkout', '{"items":3}', 'GB', 99.99);

-- Insert from a local CSV file (clickhouse-client)
INSERT INTO events FROM INFILE '/data/events.csv' FORMAT CSV;

-- Insert from S3
INSERT INTO events SELECT * FROM s3('https://bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');

Fast Analytics Queries

-- Count by country (scans billions of rows in seconds)
SELECT
    country_code,
    count() AS events,
    uniq(user_id) AS unique_users,
    sum(revenue) AS total_revenue
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY country_code
ORDER BY total_revenue DESC
LIMIT 10;

-- Time series with 1-minute buckets
SELECT
    toStartOfMinute(timestamp) AS minute,
    event_name,
    count() AS event_count
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute, event_name
ORDER BY minute;

-- Funnel analysis
SELECT
    sum(has_view) AS views,
    sum(has_click) AS clicks,
    sum(has_purchase) AS purchases,
    round(sum(has_click) / sum(has_view) * 100, 2) AS click_rate_pct,
    round(sum(has_purchase) / sum(has_click) * 100, 2) AS conversion_rate_pct
FROM (
    SELECT
        user_id,
        max(event_name = 'page_view') AS has_view,
        max(event_name = 'click') AS has_click,
        max(event_name = 'purchase') AS has_purchase
    FROM events
    WHERE timestamp >= today()
    GROUP BY user_id
);

Key Features

Materialized Views (Pre-computed Aggregates)

-- Create materialized view that updates in real-time
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, country_code)
AS SELECT
    toStartOfHour(timestamp) AS hour,
    country_code,
    count() AS events,
    sum(revenue) AS revenue
FROM events
GROUP BY hour, country_code;

-- SummingMergeTree sums rows during background merges, so re-aggregate at read time
SELECT hour, country_code, sum(events) AS events, sum(revenue) AS revenue
FROM hourly_stats_mv
WHERE hour >= today() - 7
GROUP BY hour, country_code;
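For aggregates that do not sum cleanly, such as unique users, a similar view can use AggregatingMergeTree with aggregate-function states. A sketch against the same `events` table:

```sql
CREATE MATERIALIZED VIEW hourly_uniques_mv
ENGINE = AggregatingMergeTree()
ORDER BY hour
AS SELECT
    toStartOfHour(timestamp) AS hour,
    uniqState(user_id) AS users_state
FROM events
GROUP BY hour;

-- Merge the partial states at query time
SELECT hour, uniqMerge(users_state) AS unique_users
FROM hourly_uniques_mv
GROUP BY hour
ORDER BY hour;
```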

Compression

CREATE TABLE events_compressed (
    timestamp DateTime CODEC(DoubleDelta, ZSTD),
    user_id UInt64 CODEC(DoubleDelta, LZ4),
    event_name LowCardinality(String),
    data String CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY timestamp;

Typical compression ratios:

  • Timestamps: 100x+ (DoubleDelta)
  • Integers: 10-50x (Delta + LZ4)
  • Strings: 5-20x (LZ4/ZSTD)
  • LowCardinality strings: 100x+
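The actual per-column ratios can be checked in `system.columns` once data is loaded (a sketch):

```sql
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE table = 'events_compressed';
```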

Table Engines

MergeTree — Default, general-purpose
ReplacingMergeTree — Deduplicates rows during background merges
SummingMergeTree — Auto-sum rows with same key
AggregatingMergeTree — Advanced aggregations
CollapsingMergeTree — Handle updates/deletes
ReplicatedMergeTree — Multi-node replication
Distributed — Query across cluster shards
Kafka — Consume from Kafka topics
S3 — Read/write S3 files
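For example, ReplacingMergeTree keeps the latest row per sorting key, with deduplication happening asynchronously during merges. A sketch; `FINAL` forces a fully deduplicated read:

```sql
CREATE TABLE users (
    user_id UInt64,
    email String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- FINAL collapses duplicates at read time (slower, but exact)
SELECT * FROM users FINAL WHERE user_id = 1001;
```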

Kafka Integration

-- Stream data from Kafka
CREATE TABLE kafka_events (
    timestamp DateTime,
    user_id UInt64,
    event String
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse',
    kafka_format = 'JSONEachRow';

-- Materialized view to persist
CREATE MATERIALIZED VIEW events_consumer
TO events AS
SELECT * FROM kafka_events;

ClickHouse vs Alternatives

| Feature | ClickHouse | PostgreSQL | BigQuery | Snowflake |
|---|---|---|---|---|
| Open source | Yes (Apache-2.0) | Yes | No | No |
| Storage | Columnar | Row-based | Columnar | Columnar |
| Query speed (analytics) | Extremely fast | Slow on large data | Fast | Fast |
| Cost | Free (self-host) | Free | $$ per query | $$ compute + storage |
| SQL | ANSI + extensions | Full ANSI | Standard | Standard |
| Real-time inserts | Yes (millions/sec) | OK | Limited | Streaming |
| Best for | Analytics, logs | OLTP | Analytics | Analytics |

FAQ

Q: Can ClickHouse replace PostgreSQL?
A: It complements rather than replaces it. PostgreSQL is built for transactional workloads (OLTP); ClickHouse is built for analytical queries (OLAP). A common architecture keeps primary business data in PostgreSQL and syncs it into ClickHouse for analytics.
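That PostgreSQL-plus-ClickHouse architecture can be wired up with the PostgreSQL table engine. A sketch; the host, database, and credentials are placeholders:

```sql
-- A ClickHouse-side proxy table backed by a live PostgreSQL table
CREATE TABLE pg_orders (
    order_id UInt64,
    amount Decimal(10, 2)
) ENGINE = PostgreSQL('pg-host:5432', 'shop', 'orders', 'reader', 'secret');

-- Snapshot it into a native MergeTree table for fast analytics
CREATE TABLE orders ENGINE = MergeTree() ORDER BY order_id
AS SELECT * FROM pg_orders;
```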

Q: How fast is it?
A: A single node can ingest 1M+ rows per second and aggregate over billions of rows within seconds. Clusters scale to petabytes. Cloudflare uses ClickHouse to process logs for tens of millions of HTTP requests per second.

Q: What is the learning curve like?
A: Basic SQL usage is similar to PostgreSQL, so getting started is cheap. Advanced topics (choosing table engines, materialized views, partitioning strategy) take deeper study. The official documentation is very thorough.
