Configs · 2026-04-10 · 1 min read

ClickHouse — Open Source Real-Time Analytics Database

ClickHouse is a lightning-fast, open-source column-oriented database for real-time analytics. Query billions of rows in milliseconds with SQL. Used by Cloudflare, Uber, eBay.

AI · Open Source · Community
Quick Start

Try it first, then decide whether to dig deeper.

This section should let both users and agents know what to copy as a first step, what to install, and where it goes.

docker run -d --name clickhouse \
  -p 8123:8123 -p 9000:9000 \
  -v clickhouse-data:/var/lib/clickhouse \
  clickhouse/clickhouse-server:latest

Connect via HTTP or native client:

curl "http://localhost:8123/?query=SELECT+version()"
# Or
docker exec -it clickhouse clickhouse-client
Introduction

ClickHouse is an open-source, column-oriented database management system built for real-time analytical processing (OLAP) of huge datasets. Originally developed at Yandex, it can query billions of rows in milliseconds using SQL, making it the go-to choice for analytics, time-series data, logs, and metrics at scale.

With 46.8K+ GitHub stars and Apache-2.0 license, ClickHouse is used by Cloudflare, Uber, eBay, Spotify, and thousands of other companies to power real-time dashboards, analytics platforms, and observability stacks.

What ClickHouse Does

  • Columnar Storage: Data stored column-by-column, enabling massive compression and fast aggregations
  • SQL: Familiar SQL interface with extensions for analytics
  • Real-time Inserts: Ingest millions of rows per second
  • Distributed: Horizontal scaling across cluster with sharding and replication
  • Materialized Views: Pre-compute aggregations for instant queries
  • MergeTree Engines: Multiple specialized table engines for different use cases
  • Compression: 10x+ compression ratios with LZ4, ZSTD, and custom codecs
  • Data Import: Read from Kafka, S3, HDFS, MySQL, PostgreSQL, and more
  • Time-Series: Optimized for time-series workloads with partitioning
  • Vector Search: Built-in vector similarity search
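As one illustration of the features above, vector similarity search can be expressed with ordinary SQL distance functions. A minimal sketch; the `docs` table and `embedding` column are hypothetical:

```sql
-- Hypothetical table with small embedding vectors, for illustration
CREATE TABLE docs (
    id UInt64,
    title String,
    embedding Array(Float32)
) ENGINE = MergeTree()
ORDER BY id;

-- Brute-force nearest neighbours by cosine distance
SELECT id, title, cosineDistance(embedding, [0.1, 0.9, 0.2]) AS dist
FROM docs
ORDER BY dist ASC
LIMIT 5;
```

For large tables, ClickHouse also offers approximate vector indexes, but a plain `ORDER BY distance LIMIT n` scan is often fast enough to start with.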

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Apps /      │────▶│  ClickHouse  │────▶│  Columnar    │
│  BI Tools    │ SQL │  Server      │     │  Storage     │
└──────────────┘     │  (C++)       │     │  (MergeTree) │
                     └──────┬───────┘     └──────────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
        ┌──────┴──┐   ┌─────┴───┐   ┌─────┴────┐
       │ Kafka   │   │   S3    │   │PostgreSQL│
       │ (Stream)│   │(Parquet)│   │  (Sync)  │
       └─────────┘   └─────────┘   └──────────┘
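The integration arms of the diagram can be exercised directly from SQL with table functions. A sketch; the hostnames, bucket, and credentials are placeholders:

```sql
-- Read Parquet files straight from S3 (bucket and path are placeholders)
SELECT count()
FROM s3('https://bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');

-- Query a live PostgreSQL table without copying it first
SELECT *
FROM postgresql('pg-host:5432', 'shop', 'orders', 'reader', 'secret')
LIMIT 10;
```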

Self-Hosting

Docker Compose

services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"    # HTTP interface
      - "9000:9000"    # Native client
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./config.xml:/etc/clickhouse-server/config.d/custom.xml
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    restart: unless-stopped

volumes:
  clickhouse-data:
  clickhouse-logs:

Basic Usage

Create Table

CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_name String,
    page_url String,
    properties String,
    country_code FixedString(2),
    revenue Decimal(10, 2)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id)
TTL timestamp + INTERVAL 90 DAY;
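Once data has been inserted, the effect of `PARTITION BY` can be verified through the `system.parts` table (a sketch; one row appears per active data part):

```sql
-- Active parts for the events table, grouped by monthly partition
SELECT partition, name, rows, formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE table = 'events' AND active
ORDER BY partition;
```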

Insert Data

-- Inline insert; batch many rows per INSERT for best performance
INSERT INTO events VALUES
  ('2024-04-10 12:30:00', 1001, 'page_view', '/home', '{}', 'US', 0),
  ('2024-04-10 12:31:00', 1002, 'purchase', '/checkout', '{"items":3}', 'GB', 99.99);

-- Insert from a local CSV file (clickhouse-client)
INSERT INTO events FROM INFILE '/data/events.csv' FORMAT CSV;

-- Insert from S3
INSERT INTO events SELECT * FROM s3('https://bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');

Fast Analytics Queries

-- Count by country (scans billions of rows in seconds)
SELECT
    country_code,
    count() AS events,
    uniq(user_id) AS unique_users,
    sum(revenue) AS total_revenue
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY country_code
ORDER BY total_revenue DESC
LIMIT 10;

-- Time series with 1-minute buckets
SELECT
    toStartOfMinute(timestamp) AS minute,
    event_name,
    count() AS event_count
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute, event_name
ORDER BY minute;

-- Funnel analysis
SELECT
    sum(has_view) AS views,
    sum(has_click) AS clicks,
    sum(has_purchase) AS purchases,
    round(sum(has_click) / sum(has_view) * 100, 2) AS click_rate_pct,
    round(sum(has_purchase) / sum(has_click) * 100, 2) AS conversion_rate_pct
FROM (
    SELECT
        user_id,
        max(event_name = 'page_view') AS has_view,
        max(event_name = 'click') AS has_click,
        max(event_name = 'purchase') AS has_purchase
    FROM events
    WHERE timestamp >= today()
    GROUP BY user_id
);

Key Features

Materialized Views (Pre-computed Aggregates)

-- Create materialized view that updates in real-time
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, country_code)
AS SELECT
    toStartOfHour(timestamp) AS hour,
    country_code,
    count() AS events,
    sum(revenue) AS revenue
FROM events
GROUP BY hour, country_code;

-- SummingMergeTree sums rows during background merges, so re-aggregate at read time
SELECT hour, country_code, sum(events) AS events, sum(revenue) AS revenue
FROM hourly_stats_mv
WHERE hour >= today() - 7
GROUP BY hour, country_code;
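For aggregates that do not sum cleanly, such as unique users, a similar view can use AggregatingMergeTree with aggregate-function states. A sketch against the same `events` table:

```sql
CREATE MATERIALIZED VIEW hourly_uniques_mv
ENGINE = AggregatingMergeTree()
ORDER BY hour
AS SELECT
    toStartOfHour(timestamp) AS hour,
    uniqState(user_id) AS users_state
FROM events
GROUP BY hour;

-- Merge the partial states at query time
SELECT hour, uniqMerge(users_state) AS unique_users
FROM hourly_uniques_mv
GROUP BY hour
ORDER BY hour;
```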

Compression

CREATE TABLE events_compressed (
    timestamp DateTime CODEC(DoubleDelta, ZSTD),
    user_id UInt64 CODEC(DoubleDelta, LZ4),
    event_name LowCardinality(String),
    data String CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY timestamp;

Typical compression ratios:

  • Timestamps: 100x+ (DoubleDelta)
  • Integers: 10-50x (Delta + LZ4)
  • Strings: 5-20x (LZ4/ZSTD)
  • LowCardinality strings: 100x+
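The actual per-column ratios can be checked in `system.columns` once data is loaded (a sketch):

```sql
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE table = 'events_compressed';
```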

Table Engines

MergeTree — Default, general-purpose
ReplacingMergeTree — Deduplicates rows during background merges
SummingMergeTree — Auto-sum rows with same key
AggregatingMergeTree — Advanced aggregations
CollapsingMergeTree — Handle updates/deletes
ReplicatedMergeTree — Multi-node replication
Distributed — Query across cluster shards
Kafka — Consume from Kafka topics
S3 — Read/write S3 files
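For example, ReplacingMergeTree keeps the latest row per sorting key, with deduplication happening asynchronously during merges. A sketch; `FINAL` forces a fully deduplicated read:

```sql
CREATE TABLE users (
    user_id UInt64,
    email String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- FINAL collapses duplicates at read time (slower, but exact)
SELECT * FROM users FINAL WHERE user_id = 1001;
```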

Kafka Integration

-- Stream data from Kafka
CREATE TABLE kafka_events (
    timestamp DateTime,
    user_id UInt64,
    event String
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse',
    kafka_format = 'JSONEachRow';

-- Materialized view to persist
CREATE MATERIALIZED VIEW events_consumer
TO events AS
SELECT * FROM kafka_events;

ClickHouse vs Alternatives

| Feature | ClickHouse | PostgreSQL | BigQuery | Snowflake |
|---|---|---|---|---|
| Open source | Yes (Apache-2.0) | Yes | No | No |
| Storage | Columnar | Row-based | Columnar | Columnar |
| Query speed (analytics) | Extremely fast | Slow on large data | Fast | Fast |
| Cost | Free (self-host) | Free | $$ per query | $$ compute + storage |
| SQL | ANSI + extensions | Full ANSI | Standard | Standard |
| Real-time inserts | Yes (millions/sec) | OK | Limited | Streaming |
| Best for | Analytics, logs | OLTP | Analytics | Analytics |

FAQ

Q: Can ClickHouse replace PostgreSQL?
A: It complements rather than replaces it. PostgreSQL is built for transactional workloads (OLTP); ClickHouse is built for analytical queries (OLAP). A common architecture keeps primary business data in PostgreSQL and syncs it into ClickHouse for analytics.
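That PostgreSQL-plus-ClickHouse architecture can be wired up with the PostgreSQL table engine. A sketch; the host, database, and credentials are placeholders:

```sql
-- A ClickHouse-side proxy table backed by a live PostgreSQL table
CREATE TABLE pg_orders (
    order_id UInt64,
    amount Decimal(10, 2)
) ENGINE = PostgreSQL('pg-host:5432', 'shop', 'orders', 'reader', 'secret');

-- Snapshot it into a native MergeTree table for fast analytics
CREATE TABLE orders ENGINE = MergeTree() ORDER BY order_id
AS SELECT * FROM pg_orders;
```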

Q: How fast is it?
A: A single node can ingest 1M+ rows per second and aggregate over billions of rows within seconds. Clusters scale to petabytes. Cloudflare uses ClickHouse to process logs for tens of millions of HTTP requests per second.

Q: What is the learning curve like?
A: Basic SQL usage is similar to PostgreSQL, so getting started is cheap. Advanced topics (choosing table engines, materialized views, partitioning strategy) take deeper study. The official documentation is very thorough.
