What ClickHouse Does
- Columnar Storage: Data stored column-by-column, enabling massive compression and fast aggregations
- SQL: Familiar SQL interface with extensions for analytics
- Real-time Inserts: Ingest millions of rows per second
- Distributed: Horizontal scaling across a cluster with sharding and replication
- Materialized Views: Pre-compute aggregations for instant queries
- MergeTree Engines: Multiple specialized table engines for different use cases
- Compression: 10x+ compression ratios with LZ4, ZSTD, and custom codecs
- Data Import: Read from Kafka, S3, HDFS, MySQL, PostgreSQL, and more
- Time-Series: Optimized for time-series workloads with partitioning
- Vector Search: Vector distance functions plus approximate nearest-neighbor search
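The columnar-storage point above can be sketched in a few lines of Python. This is a toy illustration of the layout idea, not ClickHouse internals, and all names in it are made up:

```python
# Toy illustration: the same rows stored row-wise vs column-wise.
rows = [
    (1001, "page_view", 0.0),
    (1002, "purchase", 99.99),
    (1003, "page_view", 0.0),
]

# Row store: aggregating one field still walks every full row.
total_row_store = sum(r[2] for r in rows)

# Column store: each column is a contiguous array, so an aggregation
# touches only the column it needs -- and same-typed neighboring
# values compress far better than mixed row data.
columns = {
    "user_id": [1001, 1002, 1003],
    "event_name": ["page_view", "purchase", "page_view"],
    "revenue": [0.0, 99.99, 0.0],
}
total_column_store = sum(columns["revenue"])

assert total_row_store == total_column_store == 99.99
```

The same principle is why an analytics query over a handful of columns can ignore the rest of a wide table entirely.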
Architecture
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Apps / │────▶│ ClickHouse │────▶│ Columnar │
│ BI Tools │SQL │ Server │ │ Storage │
└──────────────┘ │ (C++) │ │ (MergeTree) │
└──────┬───────┘ └──────────────┘
│
┌─────────────┼─────────────┐
│ │ │
┌──────┴──┐ ┌─────┴───┐ ┌─────┴───┐
│ Kafka │ │ S3 │ │PostgreSQL│
│ (Stream)│ │(Parquet)│ │ (Sync) │
└─────────┘ └─────────┘ └──────────┘
Self-Hosting
Docker Compose
services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # Native client
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./config.xml:/etc/clickhouse-server/config.d/custom.xml
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    restart: unless-stopped
volumes:
  clickhouse-data:
  clickhouse-logs:
Basic Usage
Create Table
CREATE TABLE events (
timestamp DateTime,
user_id UInt64,
event_name String,
page_url String,
properties String,
country_code FixedString(2),
revenue Decimal(10, 2)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id)
TTL timestamp + INTERVAL 90 DAY;
Insert Data
-- Inline insert (batch many rows per INSERT for best performance)
INSERT INTO events VALUES
('2024-04-10 12:30:00', 1001, 'page_view', '/home', '{}', 'US', 0),
('2024-04-10 12:31:00', 1002, 'purchase', '/checkout', '{"items":3}', 'GB', 99.99);
-- Insert from a local CSV file (FROM INFILE is executed by clickhouse-client)
INSERT INTO events FROM INFILE '/data/events.csv' FORMAT CSV;
-- Insert from S3
INSERT INTO events SELECT * FROM s3('https://bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');
Fast Analytics Queries
-- Count by country (scans billions of rows in seconds)
SELECT
country_code,
count() AS events,
uniq(user_id) AS unique_users, -- approximate; use uniqExact() for an exact count
sum(revenue) AS total_revenue
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY country_code
ORDER BY total_revenue DESC
LIMIT 10;
-- Time series with 1-minute buckets
SELECT
toStartOfMinute(timestamp) AS minute,
event_name,
count() AS event_count
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute, event_name
ORDER BY minute;
-- Funnel analysis
SELECT
sum(has_view) AS views,
sum(has_click) AS clicks,
sum(has_purchase) AS purchases,
round(sum(has_click) / sum(has_view) * 100, 2) AS click_rate_pct,
round(sum(has_purchase) / sum(has_click) * 100, 2) AS conversion_rate_pct
FROM (
SELECT
user_id,
max(event_name = 'page_view') AS has_view,
max(event_name = 'click') AS has_click,
max(event_name = 'purchase') AS has_purchase
FROM events
WHERE timestamp >= today()
GROUP BY user_id
);
Key Features
Materialized Views (Pre-computed Aggregates)
-- Create materialized view that updates in real-time
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, country_code)
AS SELECT
toStartOfHour(timestamp) AS hour,
country_code,
count() AS events,
sum(revenue) AS revenue
FROM events
GROUP BY hour, country_code;
-- Query is now near-instant (re-aggregate: SummingMergeTree collapses rows only at merge time)
SELECT hour, country_code, sum(events) AS events, sum(revenue) AS revenue
FROM hourly_stats_mv
WHERE hour >= today() - 7
GROUP BY hour, country_code;
Compression
CREATE TABLE events_compressed (
timestamp DateTime CODEC(DoubleDelta, ZSTD),
user_id UInt64 CODEC(DoubleDelta, LZ4),
event_name LowCardinality(String),
data String CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY timestamp;
Typical compression ratios:
- Timestamps: 100x+ (DoubleDelta)
- Integers: 10-50x (Delta + LZ4)
- Strings: 5-20x (LZ4/ZSTD)
- LowCardinality strings: 100x+
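The outsized timestamp ratios come from double-delta encoding: a regular time series collapses to a near-constant delta stream that general-purpose codecs then squeeze almost to nothing. A toy Python sketch of the idea (not ClickHouse's actual codec implementation):

```python
# Toy double-delta: store the first value, the first delta, and then
# the deltas-of-deltas. For evenly spaced timestamps the tail is all
# zeros, which is why DoubleDelta + LZ4/ZSTD compresses them so well.
def double_delta(values):
    # Assumes at least two values, as in any real timestamp column.
    deltas = [b - a for a, b in zip(values, values[1:])]
    return [values[0], deltas[0]] + [
        b - a for a, b in zip(deltas, deltas[1:])
    ]

# Events arriving once per second, as epoch seconds.
timestamps = [1712750000 + i for i in range(10)]
encoded = double_delta(timestamps)

# Only the first two entries carry information; the rest are zeros.
assert encoded[0] == 1712750000 and encoded[1] == 1
assert all(x == 0 for x in encoded[2:])
```

Irregular but slowly drifting series (e.g. jittered event times) still produce small deltas-of-deltas, so the codec degrades gracefully rather than falling off a cliff.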
Table Engines
MergeTree — Default, general-purpose
ReplacingMergeTree — Deduplicates rows with the same sorting key (at background merge time, not on insert)
SummingMergeTree — Auto-sum rows with same key
AggregatingMergeTree — Advanced aggregations
CollapsingMergeTree — Handle updates/deletes
ReplicatedMergeTree — Multi-node replication
Distributed — Query across cluster shards
Kafka — Consume from Kafka topics
S3 — Read/write S3 files
Kafka Integration
-- Stream data from Kafka
CREATE TABLE kafka_events (
timestamp DateTime,
user_id UInt64,
event String
) ENGINE = Kafka()
SETTINGS
kafka_broker_list = 'kafka:9092',
kafka_topic_list = 'events',
kafka_group_name = 'clickhouse',
kafka_format = 'JSONEachRow';
-- Materialized view to persist
CREATE MATERIALIZED VIEW events_consumer
TO events AS
SELECT * FROM kafka_events;
ClickHouse vs Alternatives
| Feature | ClickHouse | PostgreSQL | BigQuery | Snowflake |
|---|---|---|---|---|
| Open Source | Yes (Apache-2.0) | Yes | No | No |
| Storage | Columnar | Row-based | Columnar | Columnar |
| Query Speed (analytics) | Extremely fast | Slow on large data | Fast | Fast |
| Cost | Free (self-host) | Free | $$ per query | $$ compute+storage |
| SQL | ANSI + extensions | Full ANSI | Standard | Standard |
| Real-time inserts | Yes (millions/sec) | OK | Limited | Streaming |
| Best for | Analytics, logs | OLTP | Analytics | Analytics |
FAQ
Q: Can ClickHouse replace PostgreSQL? A: It's a complement, not a replacement. PostgreSQL is for transactional workloads (OLTP); ClickHouse is for analytical queries (OLAP). A common architecture is PostgreSQL for the core business and ClickHouse for analytics (with data synced from PG).
Q: How fast is it? A: A single node can ingest over a million rows per second and run aggregate queries over billions of rows in seconds; a cluster scales to petabyte datasets. Cloudflare uses ClickHouse to process tens of millions of HTTP request logs per second.
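Most of that insert throughput comes from batching: send a few large INSERTs rather than many tiny ones. A hedged sketch over the HTTP interface on port 8123 (the endpoint and JSONEachRow format are real ClickHouse features; the host, table, and row fields below are illustrative assumptions):

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

def jsoneachrow(rows):
    # JSONEachRow = one JSON object per line, a format ClickHouse
    # accepts for streaming inserts.
    return "\n".join(json.dumps(r) for r in rows).encode()

def insert_batch(rows, url="http://localhost:8123/"):
    # One large batch per request: ClickHouse strongly prefers a few
    # big inserts over many row-at-a-time ones.
    query = quote("INSERT INTO events FORMAT JSONEachRow")
    return urlopen(Request(url + "?query=" + query, data=jsoneachrow(rows)))

batch = [
    {"timestamp": "2024-04-10 12:30:00", "user_id": 1001,
     "event_name": "page_view", "page_url": "/home",
     "properties": "{}", "country_code": "US", "revenue": 0},
]
payload = jsoneachrow(batch)  # safe to run without a server
assert b'"user_id": 1001' in payload
# insert_batch(batch)         # requires a running ClickHouse
```

Batches in the tens of thousands of rows per INSERT are a commonly suggested starting point; tune from there against your own part-merge load.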
Q: What's the learning curve like? A: Basic SQL operations are similar to PostgreSQL and easy to pick up. Advanced features (choosing table engines, materialized views, partitioning strategies) take some study. The official documentation is very thorough.
Sources & Credits
- GitHub: ClickHouse/ClickHouse — 46.8K+ ⭐ | Apache-2.0
- Official site: clickhouse.com