# ClickHouse — Open Source Real-Time Analytics Database

> ClickHouse is a lightning-fast, open-source column-oriented database for real-time analytics. Query billions of rows in milliseconds with SQL. Used by Cloudflare, Uber, eBay.

## Quick Use

Start a server with Docker:

```bash
docker run -d --name clickhouse \
  -p 8123:8123 -p 9000:9000 \
  -v clickhouse-data:/var/lib/clickhouse \
  clickhouse/clickhouse-server:latest
```

Connect via HTTP or the native client:

```bash
curl "http://localhost:8123/?query=SELECT+version()"
# Or
docker exec -it clickhouse clickhouse-client
```

## Intro

**ClickHouse** is an open-source, column-oriented database management system built for real-time analytical processing (OLAP) of huge datasets. Originally developed at Yandex, it can query billions of rows in milliseconds using SQL, making it a go-to choice for analytics, time-series data, logs, and metrics at scale.

With 46.8K+ GitHub stars and an Apache-2.0 license, ClickHouse is used by Cloudflare, Uber, eBay, Spotify, and thousands of other companies to power real-time dashboards, analytics platforms, and observability stacks.
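To make the "column-oriented" idea concrete, here is a toy sketch in plain Python. It is purely illustrative (this is not how ClickHouse stores data internally); it only shows why summing one column is cheap when values are laid out column by column.

```python
# Toy sketch (plain Python lists, not ClickHouse internals) contrasting
# row-oriented and column-oriented layouts for an aggregation query.
rows = [  # row-oriented: one tuple per event
    ("2024-04-10", 1001, "page_view", 0.00),
    ("2024-04-10", 1002, "purchase", 99.99),
    ("2024-04-11", 1001, "purchase", 10.00),
]

# Column-oriented: one array per column; values of a single type sit
# together, which is also what makes compression codecs so effective
columns = {
    "date":    [r[0] for r in rows],
    "user_id": [r[1] for r in rows],
    "event":   [r[2] for r in rows],
    "revenue": [r[3] for r in rows],
}

# Row store: every row (and every field of it) is touched just to
# aggregate one column
row_total = sum(r[3] for r in rows)

# Column store: only the single relevant array is scanned
col_total = sum(columns["revenue"])

print(f"total revenue: {col_total:.2f}")  # total revenue: 109.99
```

At real scale the column layout also keeps the scan sequential in memory and skips the other columns' bytes entirely, which is where the orders-of-magnitude speedups come from.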
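The HTTP interface used by the `curl` command above is easy to drive from any language. A minimal Python sketch, assuming the default port 8123 and a hypothetical helper name `clickhouse_url`:

```python
# Build a ClickHouse HTTP-interface query URL from Python.
# `clickhouse_url` is a hypothetical helper (not part of any client
# library); it targets the server started by the Docker command above.
from urllib.parse import urlencode


def clickhouse_url(query: str, host: str = "localhost", port: int = 8123) -> str:
    """URL-encode a SQL query for ClickHouse's HTTP interface."""
    return f"http://{host}:{port}/?{urlencode({'query': query})}"


url = clickhouse_url("SELECT version()")
print(url)  # http://localhost:8123/?query=SELECT+version%28%29

# With a live server, fetch the result as plain text:
# urllib.request.urlopen(url).read().decode()
```

Any HTTP client works the same way; the native protocol on port 9000 requires a dedicated client library instead.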
## What ClickHouse Does

- **Columnar Storage**: Data is stored column by column, enabling heavy compression and fast aggregations
- **SQL**: Familiar SQL interface with extensions for analytics
- **Real-time Inserts**: Ingest millions of rows per second
- **Distributed**: Horizontal scaling across a cluster with sharding and replication
- **Materialized Views**: Pre-compute aggregations for instant queries
- **MergeTree Engines**: Multiple specialized table engines for different use cases
- **Compression**: 10x+ compression ratios with LZ4, ZSTD, and custom codecs
- **Data Import**: Read from Kafka, S3, HDFS, MySQL, PostgreSQL, and more
- **Time-Series**: Optimized for time-series workloads with partitioning
- **Vector Search**: Built-in vector similarity search

## Architecture

```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Apps /     │─────▶│  ClickHouse  │─────▶│   Columnar   │
│   BI Tools   │ SQL  │    Server    │      │   Storage    │
└──────────────┘      │    (C++)     │      │ (MergeTree)  │
                      └──────┬───────┘      └──────────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
        ┌─────┴────┐   ┌─────┴────┐   ┌─────┴─────┐
        │  Kafka   │   │    S3    │   │PostgreSQL │
        │ (Stream) │   │(Parquet) │   │  (Sync)   │
        └──────────┘   └──────────┘   └───────────┘
```

## Self-Hosting

### Docker Compose

```yaml
services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # Native client
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./config.xml:/etc/clickhouse-server/config.d/custom.xml
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    restart: unless-stopped

volumes:
  clickhouse-data:
  clickhouse-logs:
```

## Basic Usage

### Create Table

```sql
CREATE TABLE events (
    timestamp    DateTime,
    user_id      UInt64,
    event_name   String,
    page_url     String,
    properties   String,
    country_code FixedString(2),
    revenue      Decimal(10, 2)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id)
TTL timestamp + INTERVAL 90 DAY;
```

### Insert Data

```sql
-- Single insert
-- (batch inserts are preferred for throughput)
INSERT INTO events VALUES
    ('2024-04-10 12:30:00', 1001, 'page_view', '/home', '{}', 'US', 0),
    ('2024-04-10 12:31:00', 1002, 'purchase', '/checkout', '{"items":3}', 'GB', 99.99);

-- Insert from a local CSV file (very fast)
INSERT INTO events FROM INFILE '/data/events.csv' FORMAT CSV;

-- Insert from S3
INSERT INTO events
SELECT * FROM s3('https://bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');
```

### Fast Analytics Queries

```sql
-- Count by country (scans billions of rows in seconds)
SELECT
    country_code,
    count() AS events,
    uniq(user_id) AS unique_users,
    sum(revenue) AS total_revenue
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY country_code
ORDER BY total_revenue DESC
LIMIT 10;

-- Time series with 1-minute buckets
SELECT
    toStartOfMinute(timestamp) AS minute,
    event_name,
    count() AS event_count
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute, event_name
ORDER BY minute;

-- Funnel analysis
SELECT
    sum(has_view) AS views,
    sum(has_click) AS clicks,
    sum(has_purchase) AS purchases,
    round(sum(has_click) / sum(has_view) * 100, 2) AS click_rate_pct,
    round(sum(has_purchase) / sum(has_click) * 100, 2) AS conversion_rate_pct
FROM (
    SELECT
        user_id,
        max(event_name = 'page_view') AS has_view,
        max(event_name = 'click') AS has_click,
        max(event_name = 'purchase') AS has_purchase
    FROM events
    WHERE timestamp >= today()
    GROUP BY user_id
);
```

## Key Features

### Materialized Views (Pre-computed Aggregates)

```sql
-- Create a materialized view that updates in real time
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, country_code)
AS SELECT
    toStartOfHour(timestamp) AS hour,
    country_code,
    count() AS events,
    sum(revenue) AS revenue
FROM events
GROUP BY hour, country_code;

-- Querying it is now near-instant
SELECT * FROM hourly_stats_mv WHERE hour >= today() - 7;
```

### Compression

```sql
CREATE TABLE events_compressed (
    timestamp DateTime CODEC(DoubleDelta, ZSTD),
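    -- DoubleDelta encodes the difference between successive deltas, so
    -- evenly spaced values like timestamps shrink to almost nothing
    -- before the general-purpose ZSTD/LZ4 pass runs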
    user_id UInt64 CODEC(DoubleDelta, LZ4),
    event_name LowCardinality(String),
    data String CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY timestamp;
```

Typical compression ratios:

- Timestamps: 100x+ (DoubleDelta)
- Integers: 10-50x (Delta + LZ4)
- Strings: 5-20x (LZ4/ZSTD)
- LowCardinality strings: 100x+

### Table Engines

```
MergeTree            — Default, general-purpose
ReplacingMergeTree   — Deduplicate during background merges
SummingMergeTree     — Auto-sum rows with the same key
AggregatingMergeTree — Advanced aggregation states
CollapsingMergeTree  — Handle updates/deletes
ReplicatedMergeTree  — Multi-node replication
Distributed          — Query across cluster shards
Kafka                — Consume from Kafka topics
S3                   — Read/write S3 files
```

### Kafka Integration

```sql
-- Stream data from Kafka
CREATE TABLE kafka_events (
    timestamp DateTime,
    user_id UInt64,
    event_name String
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse',
    kafka_format = 'JSONEachRow';

-- Materialized view that persists the stream into the events table
-- (columns are matched by name; missing columns get their defaults)
CREATE MATERIALIZED VIEW events_consumer TO events AS
SELECT * FROM kafka_events;
```

## ClickHouse vs Alternatives

| Feature | ClickHouse | PostgreSQL | BigQuery | Snowflake |
|---------|-----------|-----------|----------|-----------|
| Open Source | Yes (Apache-2.0) | Yes | No | No |
| Storage | Columnar | Row-based | Columnar | Columnar |
| Query Speed (analytics) | Extremely fast | Slow (large data) | Fast | Fast |
| Cost | Free (self-host) | Free | $$ per query | $$ compute + storage |
| SQL | ANSI + extensions | Full ANSI | Standard | Standard |
| Real-time inserts | Yes (millions/sec) | OK | Limited | Streaming |
| Best for | Analytics, logs | OLTP | Analytics | Analytics |

## FAQ

**Q: Is ClickHouse a replacement for PostgreSQL?**
A: No; it is a complement. PostgreSQL is built for transactional workloads (OLTP), ClickHouse for analytical queries (OLAP). A common architecture keeps primary business data in PostgreSQL and syncs it into ClickHouse for analytics.

**Q: How fast is it?**
A: A single node can sustain 1M+ row inserts per second and aggregate billions of rows within seconds; a cluster scales to petabytes. Cloudflare uses ClickHouse to process tens of millions of HTTP request logs per second.

**Q: What is the learning curve like?**
A: The basic SQL is familiar if you know PostgreSQL.
Getting started is easy; advanced features (table engine choice, materialized views, partitioning strategy) require deeper study. The official documentation is excellent.

## Sources & Credits

- GitHub: [ClickHouse/ClickHouse](https://github.com/ClickHouse/ClickHouse) — 46.8K+ ⭐ | Apache-2.0
- Official site: [clickhouse.com](https://clickhouse.com)

---

Source: https://tokrepo.com/en/workflows/2fce985b-3535-11f1-9bc6-00163e2b0d79
Author: AI Open Source