Configs · Apr 10, 2026 · 3 min read

ClickHouse — Open Source Real-Time Analytics Database

ClickHouse is a lightning-fast, open-source column-oriented database for real-time analytics. Query billions of rows in milliseconds with SQL. Used by Cloudflare, Uber, eBay.

Introduction

ClickHouse is an open-source, column-oriented database management system built for real-time analytical processing (OLAP) of huge datasets. Originally developed at Yandex, it can query billions of rows in milliseconds using SQL, making it the go-to choice for analytics, time-series data, logs, and metrics at scale.

With 46.8K+ GitHub stars and Apache-2.0 license, ClickHouse is used by Cloudflare, Uber, eBay, Spotify, and thousands of other companies to power real-time dashboards, analytics platforms, and observability stacks.

What ClickHouse Does

  • Columnar Storage: Data stored column-by-column, enabling massive compression and fast aggregations
  • SQL: Familiar SQL interface with extensions for analytics
  • Real-time Inserts: Ingest millions of rows per second
  • Distributed: Horizontal scaling across cluster with sharding and replication
  • Materialized Views: Pre-compute aggregations for instant queries
  • MergeTree Engines: Multiple specialized table engines for different use cases
  • Compression: 10x+ compression ratios with LZ4, ZSTD, and custom codecs
  • Data Import: Read from Kafka, S3, HDFS, MySQL, PostgreSQL, and more
  • Time-Series: Optimized for time-series workloads with partitioning
  • Vector Search: Built-in vector similarity search

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Apps /      │────▶│  ClickHouse  │────▶│  Columnar    │
│  BI Tools    │ SQL │  Server      │     │  Storage     │
└──────────────┘     │  (C++)       │     │  (MergeTree) │
                     └──────┬───────┘     └──────────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
       ┌──────┴──┐    ┌─────┴───┐   ┌─────┴────┐
       │ Kafka   │    │   S3    │   │PostgreSQL│
       │ (Stream)│    │(Parquet)│   │  (Sync)  │
       └─────────┘    └─────────┘   └──────────┘

Self-Hosting

Docker Compose

services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"    # HTTP interface
      - "9000:9000"    # Native client
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./config.xml:/etc/clickhouse-server/config.d/custom.xml
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    restart: unless-stopped

volumes:
  clickhouse-data:
  clickhouse-logs:

Basic Usage

Create Table

CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_name String,
    page_url String,
    properties String,
    country_code FixedString(2),
    revenue Decimal(10, 2)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id)
TTL timestamp + INTERVAL 90 DAY;
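To see why `PARTITION BY toYYYYMM(timestamp)` matters, here is a minimal Python sketch (not ClickHouse internals) of partition pruning: rows are grouped by month, so a date-range filter only touches the partitions whose month overlaps the range.

```python
from collections import defaultdict
from datetime import datetime

def to_yyyymm(ts: datetime) -> int:
    """Mimic ClickHouse's toYYYYMM(): 2024-04-10 -> 202404."""
    return ts.year * 100 + ts.month

rows = [
    (datetime(2024, 3, 5), 1001, "page_view"),
    (datetime(2024, 4, 10), 1002, "purchase"),
    (datetime(2024, 4, 11), 1003, "click"),
]

# Group rows into per-month "partitions".
partitions = defaultdict(list)
for row in rows:
    partitions[to_yyyymm(row[0])].append(row)

# A query filtered to April 2024 only needs to scan partition 202404;
# the March partition is skipped entirely.
scanned = partitions[202404]
```

The real engine additionally keeps sparse primary-key indexes per part, but partition pruning alone is often the biggest win for time-range queries.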

Insert Data

-- Batch insert (batching many rows per INSERT is key for performance)
INSERT INTO events VALUES
  ('2024-04-10 12:30:00', 1001, 'page_view', '/home', '{}', 'US', 0),
  ('2024-04-10 12:31:00', 1002, 'purchase', '/checkout', '{"items":3}', 'GB', 99.99);

-- Insert from a local CSV file (clickhouse-client syntax)
INSERT INTO events FROM INFILE '/data/events.csv' FORMAT CSV;

-- Insert from S3
INSERT INTO events SELECT * FROM s3('https://bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');

Fast Analytics Queries

-- Count by country (scans billions of rows in seconds)
SELECT
    country_code,
    count() AS events,
    uniq(user_id) AS unique_users,
    sum(revenue) AS total_revenue
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY country_code
ORDER BY total_revenue DESC
LIMIT 10;

-- Time series with 1-minute buckets
SELECT
    toStartOfMinute(timestamp) AS minute,
    event_name,
    count() AS event_count
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute, event_name
ORDER BY minute;
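What `toStartOfMinute` plus `GROUP BY` computes can be sketched in a few lines of Python (illustrative only; ClickHouse evaluates this vectorized over whole columns, not row by row):

```python
from collections import Counter
from datetime import datetime

def to_start_of_minute(ts: datetime) -> datetime:
    """Truncate a timestamp to its minute bucket, like toStartOfMinute()."""
    return ts.replace(second=0, microsecond=0)

events = [
    (datetime(2024, 4, 10, 12, 30, 5), "page_view"),
    (datetime(2024, 4, 10, 12, 30, 42), "page_view"),
    (datetime(2024, 4, 10, 12, 31, 1), "click"),
]

# count() ... GROUP BY minute, event_name
counts = Counter((to_start_of_minute(ts), name) for ts, name in events)
```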

-- Funnel analysis
SELECT
    sum(has_view) AS views,
    sum(has_click) AS clicks,
    sum(has_purchase) AS purchases,
    round(sum(has_click) / sum(has_view) * 100, 2) AS click_rate_pct,
    round(sum(has_purchase) / sum(has_click) * 100, 2) AS conversion_rate_pct
FROM (
    SELECT
        user_id,
        max(event_name = 'page_view') AS has_view,
        max(event_name = 'click') AS has_click,
        max(event_name = 'purchase') AS has_purchase
    FROM events
    WHERE timestamp >= today()
    GROUP BY user_id
);
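The funnel query's inner `max(event_name = '...')` per user is just a boolean "did this step ever happen"; the outer sums count users per step. A pure-Python sketch of the same logic, using made-up event data:

```python
from collections import defaultdict

events = [
    (1001, "page_view"), (1001, "click"), (1001, "purchase"),
    (1002, "page_view"), (1002, "click"),
    (1003, "page_view"),
]

# Inner query: per user, flag each step (equivalent to max(event_name = '...')).
steps = defaultdict(lambda: {"page_view": 0, "click": 0, "purchase": 0})
for user_id, name in events:
    steps[user_id][name] = 1

# Outer query: sum flags across users and derive rates.
views = sum(u["page_view"] for u in steps.values())
clicks = sum(u["click"] for u in steps.values())
purchases = sum(u["purchase"] for u in steps.values())
click_rate = round(clicks / views * 100, 2)
```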

Key Features

Materialized Views (Pre-computed Aggregates)

-- Create materialized view that updates in real-time
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, country_code)
AS SELECT
    toStartOfHour(timestamp) AS hour,
    country_code,
    count() AS events,
    sum(revenue) AS revenue
FROM events
GROUP BY hour, country_code;

-- Querying the view is now near-instant; aggregate on read, since
-- SummingMergeTree only sums duplicate keys during background merges
SELECT hour, country_code, sum(events) AS events, sum(revenue) AS revenue
FROM hourly_stats_mv
WHERE hour >= today() - 7
GROUP BY hour, country_code;
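The effect of SummingMergeTree can be sketched as an incremental rollup keyed by the sorting key (hour, country_code): rows sharing a key are summed together, so the view stays small and pre-aggregated. This is an illustrative sketch, not the engine's actual merge machinery:

```python
from collections import defaultdict

# key (hour, country) -> [event_count, revenue_sum]
rollup = defaultdict(lambda: [0, 0.0])

def ingest(hour: str, country: str, revenue: float) -> None:
    """Fold one source row into the rollup, like a merged SummingMergeTree part."""
    agg = rollup[(hour, country)]
    agg[0] += 1
    agg[1] += revenue

ingest("2024-04-10 12:00", "US", 0.0)
ingest("2024-04-10 12:00", "US", 99.99)
ingest("2024-04-10 12:00", "GB", 10.0)
```

In ClickHouse this folding happens lazily during background merges, which is why the SQL above still aggregates on read.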

Compression

CREATE TABLE events_compressed (
    timestamp DateTime CODEC(DoubleDelta, ZSTD),
    user_id UInt64 CODEC(DoubleDelta, LZ4),
    event_name LowCardinality(String),
    data String CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY timestamp;

Typical compression ratios:

  • Timestamps: 100x+ (DoubleDelta)
  • Integers: 10-50x (Delta + LZ4)
  • Strings: 5-20x (LZ4/ZSTD)
  • LowCardinality strings: 100x+

Table Engines

MergeTree — Default, general-purpose
ReplacingMergeTree — Deduplicate rows during background merges
SummingMergeTree — Auto-sum rows with same key
AggregatingMergeTree — Advanced aggregations
CollapsingMergeTree — Handle updates/deletes
ReplicatedMergeTree — Multi-node replication
Distributed — Query across cluster shards
Kafka — Consume from Kafka topics
S3 — Read/write S3 files
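ReplacingMergeTree's semantics can be sketched as follows: when parts merge, rows sharing the same sorting key collapse to one, keeping the row with the highest version. Illustrative only (real merges are incremental and asynchronous, so duplicates may exist until a merge runs):

```python
def merge_replacing(rows, key=lambda r: r[0], version=lambda r: r[1]):
    """Collapse rows with the same key, keeping the highest-version row."""
    latest = {}
    for row in rows:
        k = key(row)
        if k not in latest or version(row) >= version(latest[k]):
            latest[k] = row
    return sorted(latest.values())

# (user_id, version, email) — the later row supersedes the earlier one.
rows = [
    (1001, 1, "alice@old.example"),
    (1002, 1, "bob@example.com"),
    (1001, 2, "alice@new.example"),
]
merged = merge_replacing(rows)
```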

Kafka Integration

-- Stream data from Kafka
CREATE TABLE kafka_events (
    timestamp DateTime,
    user_id UInt64,
    event String
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse',
    kafka_format = 'JSONEachRow';

-- Materialized view to persist
CREATE MATERIALIZED VIEW events_consumer
TO events AS
SELECT * FROM kafka_events;

ClickHouse vs Alternatives

Feature                 | ClickHouse         | PostgreSQL        | BigQuery     | Snowflake
Open Source             | Yes (Apache-2.0)   | Yes               | No           | No
Storage                 | Columnar           | Row-based         | Columnar     | Columnar
Query speed (analytics) | Extremely fast     | Slow (large data) | Fast         | Fast
Cost                    | Free (self-host)   | Free              | $$ per query | $$ compute+storage
SQL                     | ANSI + extensions  | Full ANSI         | Standard     | Standard
Real-time inserts       | Yes (millions/sec) | OK                | Limited      | Streaming
Best for                | Analytics, logs    | OLTP              | Analytics    | Analytics

FAQ

Q: Can ClickHouse replace PostgreSQL? A: It's a complement, not a replacement. PostgreSQL is for transactional workloads (OLTP); ClickHouse is for analytical queries (OLAP). A common architecture is PostgreSQL for the core business and ClickHouse for analytics (with data synced from PG).

Q: How fast is it? A: A single node can ingest over a million rows per second and run aggregate queries over billions of rows in seconds. A cluster scales to petabyte-size datasets. Cloudflare uses ClickHouse to process tens of millions of HTTP request logs per second.

Q: What's the learning curve like? A: Basic SQL operations are similar to PostgreSQL and easy to pick up. Advanced features (choosing table engines, materialized views, partitioning strategies) take some study. The official documentation is very thorough.
