Configs · Apr 10, 2026 · 3 min read

ClickHouse — Open Source Real-Time Analytics Database

ClickHouse is a lightning-fast, open-source column-oriented database for real-time analytics. Query billions of rows in milliseconds with SQL. Used by Cloudflare, Uber, eBay.

Introduction

ClickHouse is an open-source, column-oriented database management system built for real-time analytical processing (OLAP) of huge datasets. Originally developed at Yandex, it can query billions of rows in milliseconds using SQL, making it the go-to choice for analytics, time-series data, logs, and metrics at scale.

With 46.8K+ GitHub stars and Apache-2.0 license, ClickHouse is used by Cloudflare, Uber, eBay, Spotify, and thousands of other companies to power real-time dashboards, analytics platforms, and observability stacks.

What ClickHouse Does

  • Columnar Storage: Data stored column-by-column, enabling massive compression and fast aggregations
  • SQL: Familiar SQL interface with extensions for analytics
  • Real-time Inserts: Ingest millions of rows per second
  • Distributed: Horizontal scaling across cluster with sharding and replication
  • Materialized Views: Pre-compute aggregations for instant queries
  • MergeTree Engines: Multiple specialized table engines for different use cases
  • Compression: 10x+ compression ratios with LZ4, ZSTD, and custom codecs
  • Data Import: Read from Kafka, S3, HDFS, MySQL, PostgreSQL, and more
  • Time-Series: Optimized for time-series workloads with partitioning
  • Vector Search: Built-in vector similarity search

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Apps /      │────▶│  ClickHouse  │────▶│  Columnar    │
│  BI Tools    │ SQL │  Server      │     │  Storage     │
└──────────────┘     │  (C++)       │     │  (MergeTree) │
                     └──────┬───────┘     └──────────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
       ┌──────┴──┐    ┌─────┴───┐   ┌─────┴────┐
       │ Kafka   │    │   S3    │   │PostgreSQL│
       │ (Stream)│    │(Parquet)│   │  (Sync)  │
       └─────────┘    └─────────┘   └──────────┘

Self-Hosting

Docker Compose

services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"    # HTTP interface
      - "9000:9000"    # Native client
    volumes:
      - clickhouse-data:/var/lib/clickhouse
      - clickhouse-logs:/var/log/clickhouse-server
      - ./config.xml:/etc/clickhouse-server/config.d/custom.xml
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    restart: unless-stopped

volumes:
  clickhouse-data:
  clickhouse-logs:

Basic Usage

Create Table

CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_name String,
    page_url String,
    properties String,
    country_code FixedString(2),
    revenue Decimal(10, 2)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id)
TTL timestamp + INTERVAL 90 DAY;
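To see why `PARTITION BY toYYYYMM(timestamp)` matters, here is a minimal Python sketch (not ClickHouse internals) of partition pruning: rows are grouped by month, so a date-range filter only touches the partitions whose month overlaps the range.

```python
from collections import defaultdict
from datetime import datetime

def to_yyyymm(ts: datetime) -> int:
    """Mimic ClickHouse's toYYYYMM(): 2024-04-10 -> 202404."""
    return ts.year * 100 + ts.month

rows = [
    (datetime(2024, 3, 5), 1001, "page_view"),
    (datetime(2024, 4, 10), 1002, "purchase"),
    (datetime(2024, 4, 11), 1003, "click"),
]

# Group rows into per-month "partitions".
partitions = defaultdict(list)
for row in rows:
    partitions[to_yyyymm(row[0])].append(row)

# A query filtered to April 2024 only needs to scan partition 202404;
# the March partition is skipped entirely.
scanned = partitions[202404]
```

The real engine additionally keeps sparse primary-key indexes per part, but partition pruning alone is often the biggest win for time-range queries.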

Insert Data

-- Batch insert (batching many rows per INSERT is key for performance)
INSERT INTO events VALUES
  ('2024-04-10 12:30:00', 1001, 'page_view', '/home', '{}', 'US', 0),
  ('2024-04-10 12:31:00', 1002, 'purchase', '/checkout', '{"items":3}', 'GB', 99.99);

-- Insert from a local CSV file (clickhouse-client syntax)
INSERT INTO events FROM INFILE '/data/events.csv' FORMAT CSV;

-- Insert from S3
INSERT INTO events SELECT * FROM s3('https://bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');

Fast Analytics Queries

-- Count by country (scans billions of rows in seconds)
SELECT
    country_code,
    count() AS events,
    uniq(user_id) AS unique_users,
    sum(revenue) AS total_revenue
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY country_code
ORDER BY total_revenue DESC
LIMIT 10;

-- Time series with 1-minute buckets
SELECT
    toStartOfMinute(timestamp) AS minute,
    event_name,
    count() AS event_count
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute, event_name
ORDER BY minute;
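What `toStartOfMinute` plus `GROUP BY` computes can be sketched in a few lines of Python (illustrative only; ClickHouse evaluates this vectorized over whole columns, not row by row):

```python
from collections import Counter
from datetime import datetime

def to_start_of_minute(ts: datetime) -> datetime:
    """Truncate a timestamp to its minute bucket, like toStartOfMinute()."""
    return ts.replace(second=0, microsecond=0)

events = [
    (datetime(2024, 4, 10, 12, 30, 5), "page_view"),
    (datetime(2024, 4, 10, 12, 30, 42), "page_view"),
    (datetime(2024, 4, 10, 12, 31, 1), "click"),
]

# count() ... GROUP BY minute, event_name
counts = Counter((to_start_of_minute(ts), name) for ts, name in events)
```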

-- Funnel analysis
SELECT
    sum(has_view) AS views,
    sum(has_click) AS clicks,
    sum(has_purchase) AS purchases,
    round(sum(has_click) / sum(has_view) * 100, 2) AS click_rate_pct,
    round(sum(has_purchase) / sum(has_click) * 100, 2) AS conversion_rate_pct
FROM (
    SELECT
        user_id,
        max(event_name = 'page_view') AS has_view,
        max(event_name = 'click') AS has_click,
        max(event_name = 'purchase') AS has_purchase
    FROM events
    WHERE timestamp >= today()
    GROUP BY user_id
);
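The funnel query's inner `max(event_name = '...')` per user is just a boolean "did this step ever happen"; the outer sums count users per step. A pure-Python sketch of the same logic, using made-up event data:

```python
from collections import defaultdict

events = [
    (1001, "page_view"), (1001, "click"), (1001, "purchase"),
    (1002, "page_view"), (1002, "click"),
    (1003, "page_view"),
]

# Inner query: per user, flag each step (equivalent to max(event_name = '...')).
steps = defaultdict(lambda: {"page_view": 0, "click": 0, "purchase": 0})
for user_id, name in events:
    steps[user_id][name] = 1

# Outer query: sum flags across users and derive rates.
views = sum(u["page_view"] for u in steps.values())
clicks = sum(u["click"] for u in steps.values())
purchases = sum(u["purchase"] for u in steps.values())
click_rate = round(clicks / views * 100, 2)
```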

Key Features

Materialized Views (Pre-computed Aggregates)

-- Create materialized view that updates in real-time
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, country_code)
AS SELECT
    toStartOfHour(timestamp) AS hour,
    country_code,
    count() AS events,
    sum(revenue) AS revenue
FROM events
GROUP BY hour, country_code;

-- Querying the view is now near-instant; aggregate on read, since
-- SummingMergeTree only sums duplicate keys during background merges
SELECT hour, country_code, sum(events) AS events, sum(revenue) AS revenue
FROM hourly_stats_mv
WHERE hour >= today() - 7
GROUP BY hour, country_code;
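The effect of SummingMergeTree can be sketched as an incremental rollup keyed by the sorting key (hour, country_code): rows sharing a key are summed together, so the view stays small and pre-aggregated. This is an illustrative sketch, not the engine's actual merge machinery:

```python
from collections import defaultdict

# key (hour, country) -> [event_count, revenue_sum]
rollup = defaultdict(lambda: [0, 0.0])

def ingest(hour: str, country: str, revenue: float) -> None:
    """Fold one source row into the rollup, like a merged SummingMergeTree part."""
    agg = rollup[(hour, country)]
    agg[0] += 1
    agg[1] += revenue

ingest("2024-04-10 12:00", "US", 0.0)
ingest("2024-04-10 12:00", "US", 99.99)
ingest("2024-04-10 12:00", "GB", 10.0)
```

In ClickHouse this folding happens lazily during background merges, which is why the SQL above still aggregates on read.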

Compression

CREATE TABLE events_compressed (
    timestamp DateTime CODEC(DoubleDelta, ZSTD),
    user_id UInt64 CODEC(DoubleDelta, LZ4),
    event_name LowCardinality(String),
    data String CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY timestamp;

Typical compression ratios:

  • Timestamps: 100x+ (DoubleDelta)
  • Integers: 10-50x (Delta + LZ4)
  • Strings: 5-20x (LZ4/ZSTD)
  • LowCardinality strings: 100x+

Table Engines

MergeTree — Default, general-purpose
ReplacingMergeTree — Deduplicate rows during background merges
SummingMergeTree — Auto-sum rows with same key
AggregatingMergeTree — Advanced aggregations
CollapsingMergeTree — Handle updates/deletes
ReplicatedMergeTree — Multi-node replication
Distributed — Query across cluster shards
Kafka — Consume from Kafka topics
S3 — Read/write S3 files
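ReplacingMergeTree's semantics can be sketched as follows: when parts merge, rows sharing the same sorting key collapse to one, keeping the row with the highest version. Illustrative only (real merges are incremental and asynchronous, so duplicates may exist until a merge runs):

```python
def merge_replacing(rows, key=lambda r: r[0], version=lambda r: r[1]):
    """Collapse rows with the same key, keeping the highest-version row."""
    latest = {}
    for row in rows:
        k = key(row)
        if k not in latest or version(row) >= version(latest[k]):
            latest[k] = row
    return sorted(latest.values())

# (user_id, version, email) — the later row supersedes the earlier one.
rows = [
    (1001, 1, "alice@old.example"),
    (1002, 1, "bob@example.com"),
    (1001, 2, "alice@new.example"),
]
merged = merge_replacing(rows)
```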

Kafka Integration

-- Stream data from Kafka
CREATE TABLE kafka_events (
    timestamp DateTime,
    user_id UInt64,
    event String
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse',
    kafka_format = 'JSONEachRow';

-- Materialized view to persist
CREATE MATERIALIZED VIEW events_consumer
TO events AS
SELECT * FROM kafka_events;

ClickHouse vs Alternatives

Feature                 | ClickHouse         | PostgreSQL        | BigQuery     | Snowflake
Open Source             | Yes (Apache-2.0)   | Yes               | No           | No
Storage                 | Columnar           | Row-based         | Columnar     | Columnar
Query speed (analytics) | Extremely fast     | Slow (large data) | Fast         | Fast
Cost                    | Free (self-host)   | Free              | $$ per query | $$ compute+storage
SQL                     | ANSI + extensions  | Full ANSI         | Standard     | Standard
Real-time inserts       | Yes (millions/sec) | OK                | Limited      | Streaming
Best for                | Analytics, logs    | OLTP              | Analytics    | Analytics

FAQ

Q: Can ClickHouse replace PostgreSQL? A: It's a complement, not a replacement. PostgreSQL is for transactional workloads (OLTP); ClickHouse is for analytical queries (OLAP). A common architecture is PostgreSQL for the core business and ClickHouse for analytics (with data synced from PG).

Q: How fast is it? A: A single node can ingest over a million rows per second and run aggregate queries over billions of rows in seconds. A cluster scales to petabyte-size datasets. Cloudflare uses ClickHouse to process tens of millions of HTTP request logs per second.

Q: What's the learning curve like? A: Basic SQL operations are similar to PostgreSQL and easy to pick up. Advanced features (choosing table engines, materialized views, partitioning strategies) take some study. The official documentation is very thorough.
