# Databend — Cloud-Native Open-Source Data Warehouse Built in Rust

> Databend is a modern cloud data warehouse with separation of storage and compute on object storage. Written in Rust for high performance, it is a self-hostable alternative to Snowflake with broadly Snowflake-compatible SQL.

## Quick Use

```bash
# Single-binary Docker image for a quick test
docker run -d --name databend \
  -p 8000:8000 -p 3307:3307 \
  datafuselabs/databend

# Connect via a MySQL-compatible client (port 3307)
mysql -h 127.0.0.1 -P 3307 -uroot

# Or the HTTP REST API (port 8000)
curl -u root: -XPOST http://localhost:8000/v1/query \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT 1+1"}'
```

```sql
-- Tables live on object storage (S3/GCS/Azure/MinIO)
CREATE TABLE events (
  ts      TIMESTAMP,
  user_id BIGINT,
  event   STRING,
  payload VARIANT
);

-- Ingest from S3 in one line
COPY INTO events FROM 's3://my-bucket/events/'
  FILE_FORMAT = (TYPE = PARQUET);

-- Query with standard SQL + Snowflake-style functions
SELECT event, COUNT(*)
FROM events
WHERE ts >= '2026-04-01'
GROUP BY event
ORDER BY 2 DESC;
```

## Introduction

Databend is a modern analytical data warehouse built from scratch in Rust, with an architecture inspired by Snowflake: **stateless compute nodes** that read and write data in object storage. That means you pay for bytes in S3 (cheap) and scale compute up and down on demand, with no local storage to manage.

With over 9,000 GitHub stars, Databend is used by teams looking for an open-source Snowflake alternative. SQL compatibility is strong enough that many Snowflake queries move over unchanged.

## What Databend Does

Databend stores tables as open-format Parquet/ORC files in object storage and uses its own metadata service (backed by FoundationDB, MySQL, or Postgres) as the catalog.
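The HTTP REST endpoint shown in Quick Use can also be scripted. As a minimal sketch, the helper below builds (but does not send) the same `POST /v1/query` request the `curl` one-liner issues, using only the Python standard library; the function name `build_query_request` is illustrative, not part of any Databend client library.

```python
import base64
import json
from urllib import request

def build_query_request(sql: str,
                        host: str = "localhost",
                        port: int = 8000,
                        user: str = "root",
                        password: str = "") -> request.Request:
    """Build a POST to Databend's /v1/query endpoint (not sent here)."""
    url = f"http://{host}:{port}/v1/query"
    body = json.dumps({"sql": sql}).encode("utf-8")
    # Basic auth matching `curl -u root:` (empty password by default)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )

req = build_query_request("SELECT 1+1")
print(req.full_url)       # http://localhost:8000/v1/query
print(req.data.decode())  # {"sql": "SELECT 1+1"}
```

Against a running container, `urllib.request.urlopen(req)` would return the query result as JSON.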
Queries run through a vectorized engine that has been heavily optimized for late materialization, predicate pushdown, and filter caching.

## Architecture Overview

```
Clients (MySQL, HTTP, ClickHouse protocol)
        |
[Query Nodes (Databend)]   stateless, scale elastically
        |
[Meta Service]             table schemas, versions, auth
                           (FoundationDB / Postgres / MySQL as backend)
        |
[Object Storage]           S3 / GCS / Azure / OCI / MinIO / HDFS
                           Parquet/ORC + small index files
        |
[External Data Sources]    Iceberg, Hive, CSV/JSON/TSV, Kafka, Snowflake
```

## Self-Hosting & Configuration

```sql
-- Databend "warehouses" provide compute isolation per workload
CREATE WAREHOUSE etl WITH WAREHOUSE_SIZE = 'Large';
CREATE WAREHOUSE bi  WITH WAREHOUSE_SIZE = 'Medium';

-- Connect to a specific warehouse
USE WAREHOUSE etl;

-- Streaming-style COPY + transform + MERGE (CDC pattern)
COPY INTO raw_events FROM 's3://bucket/cdc/*.parquet'
  FORCE = true
  FILE_FORMAT = (TYPE = PARQUET);

MERGE INTO events AS tgt
USING (SELECT * FROM raw_events WHERE is_delete = false) AS src
  ON tgt.user_id = src.user_id AND tgt.ts = src.ts
WHEN MATCHED THEN
  UPDATE SET tgt.event = src.event
WHEN NOT MATCHED THEN
  INSERT (user_id, ts, event) VALUES (src.user_id, src.ts, src.event);
```

## Key Features

- **Stateless compute** — scale warehouses elastically, pay per use
- **Object storage first** — S3/GCS/Azure/MinIO; data is just Parquet files
- **MySQL wire protocol** — BI tools connect unchanged
- **Snowflake-like SQL** — VARIANT, time travel, MERGE, COPY INTO
- **Rust-native performance** — vectorized execution, modern CPU features
- **Streaming ingest** — CDC from Kafka, TableStreams API
- **Time travel** — query historical table versions by timestamp
- **Data sharing** — cross-account table shares like Snowflake

## Comparison with Similar Tools

| Feature | Databend | Snowflake | ClickHouse Cloud | BigQuery | DuckDB |
|---|---|---|---|---|---|
| Storage | Object storage | Proprietary | Object storage (cloud) | Proprietary | Local files |
| Compute model | Elastic warehouses | Elastic warehouses | Cloud-native clusters | Serverless | Single-process |
| SQL dialect | Snowflake-like | Snowflake | ClickHouse | BigQuery | DuckDB |
| Self-host | Yes | No | No (self-host core) | No | N/A (embedded) |
| Time travel | Yes | Yes | Limited | Yes (snapshots) | No |
| Best for | Open-source Snowflake alternative | Managed DW | Log-heavy analytics | GCP shops | Single-machine analytics |

## FAQ

**Q: Databend vs ClickHouse?**
A: Databend is architected for cloud-native storage-compute separation; ClickHouse is a high-performance local/cluster engine. Databend's SQL and storage model are closer to Snowflake's; ClickHouse is closer to columnar MPP systems. Pick Databend for S3-first workflows, ClickHouse for raw speed on local disks.

**Q: How mature is Databend?**
A: v1.x since 2023, with active monthly releases. It is used in production by several Chinese and international teams, and the SQL surface is broad enough for real analytics workloads.

**Q: Can it replace Snowflake?**
A: For many mid-sized analytical workloads, yes — and you own the infrastructure. For teams deeply integrated with Snowflake's ecosystem (Snowpark, etc.), the transition is harder.

**Q: Is Databend truly open source?**
A: The core is Apache-2.0. Databend Cloud (managed) is a paid service. The project is actively maintained by Datafuse Labs.

## Sources

- GitHub: https://github.com/databendlabs/databend
- Docs: https://docs.databend.com
- Company: Datafuse Labs
- License: Apache-2.0