What is Apache Paimon — Streaming Data Lake Storage?

Apache Paimon is a streaming data lake platform that supports both real-time streaming writes and high-performance batch reads using a lake format with changelog tracking.

Is Apache Paimon — Streaming Data Lake Storage free to use?

Yes. Apache Paimon — Streaming Data Lake Storage is freely available on TokRepo. Check the Source & Thanks section on the asset page for the specific open-source license.

How do I install Apache Paimon — Streaming Data Lake Storage?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

Apache Paimon — Streaming Data Lake Storage

Introduction

Apache Paimon is a lake format that brings real-time streaming capabilities to data lakes. Originally developed as Flink Table Store, it enables continuous data ingestion with changelog tracking while maintaining compatibility with batch query engines like Spark, Hive, and Trino.

What Apache Paimon Does

Provides a table format for data lakes that supports streaming writes and batch reads
Tracks row-level changelogs for incremental processing and CDC pipelines
Supports primary key tables with upsert semantics and append-only tables
Integrates natively with Apache Flink, Spark, Hive, Trino, and StarRocks
Stores data in columnar formats on S3, HDFS, or any Hadoop-compatible filesystem

Architecture Overview

Paimon organizes data into snapshots. Each snapshot references manifest lists, whose manifest files in turn record the data files and changelog files that belong to it. Primary key tables use an LSM-tree structure: incoming records are buffered in memory, flushed to disk as sorted runs, and merged during compaction. Changelog files record INSERT, UPDATE, and DELETE operations, enabling downstream consumers to process only incremental changes. The catalog layer integrates with Hive Metastore or a filesystem catalog for metadata management.
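
The merge-and-changelog behaviour described above can be illustrated with a toy model. This is a minimal Python sketch, not Paimon's implementation or API: the names Record and compact are hypothetical, sorted runs are plain lists, and the changelog uses the simplified INSERT, UPDATE, and DELETE kinds from the text rather than Paimon's actual row kinds.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record shape: (primary_key, value). A value of None is a delete marker.
Record = Tuple[str, Optional[dict]]
Change = Tuple[str, str, dict]


def compact(state: Dict[str, dict], runs: List[List[Record]]) -> Tuple[Dict[str, dict], List[Change]]:
    """Merge sorted runs (oldest first) into the table state.

    Returns the new state plus a changelog describing what changed.
    """
    # Newest write per key wins, as in an LSM-tree merge.
    latest: Dict[str, Optional[dict]] = {}
    for run in runs:
        for key, value in run:
            latest[key] = value

    new_state = dict(state)
    changelog: List[Change] = []
    for key in sorted(latest):
        value = latest[key]
        before = state.get(key)
        if value is None:
            # A delete only produces a change if the key existed before.
            if before is not None:
                changelog.append(("DELETE", key, before))
                del new_state[key]
        elif before is None:
            changelog.append(("INSERT", key, value))
            new_state[key] = value
        elif before != value:
            changelog.append(("UPDATE", key, value))
            new_state[key] = value
    return new_state, changelog


if __name__ == "__main__":
    table = {"a": {"v": 1}, "b": {"v": 2}}
    sorted_runs = [
        [("a", {"v": 10}), ("c", {"v": 3})],  # older run
        [("b", None), ("c", {"v": 4})],       # newer run
    ]
    table, changes = compact(table, sorted_runs)
    for change in changes:
        print(change)
```

A downstream consumer that reads only the returned changelog sees three entries (an update to a, a delete of b, an insert of c) instead of rescanning the whole table, which is the point of incremental processing.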

Self-Hosting & Configuration

Add the Paimon connector JAR to your Flink or Spark installation
Create a Paimon catalog pointing to an S3, HDFS, or local warehouse directory
Configure a Hive Metastore catalog for cross-engine metadata sharing
Tune compaction settings based on write throughput and query latency requirements
Set up snapshot expiration policies to manage storage growth
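
The configuration steps above mostly come down to a few Flink SQL statements. The following is a hedged Python sketch that only renders those statements as strings, so they can be reviewed or pasted into a SQL client. The helper names, the bucket, the metastore host, and the table name are hypothetical, and the option keys are recalled from the Paimon documentation, so verify them against the version you deploy.

```python
from typing import Dict


def render_options(options: Dict[str, str]) -> str:
    """Render a dict as the body of a Flink SQL option list."""
    for key, value in options.items():
        if "'" in key or "'" in value:
            raise ValueError("single quotes are not escaped in this sketch")
    return ",\n".join(f"    '{k}' = '{v}'" for k, v in options.items())


def create_catalog_sql(name: str, options: Dict[str, str]) -> str:
    """Build a CREATE CATALOG statement."""
    return f"CREATE CATALOG {name} WITH (\n{render_options(options)}\n);"


def alter_table_sql(table: str, options: Dict[str, str]) -> str:
    """Build an ALTER TABLE ... SET statement for table options."""
    return f"ALTER TABLE {table} SET (\n{render_options(options)}\n);"


# Filesystem catalog on object storage (hypothetical bucket).
FILESYSTEM_CATALOG = create_catalog_sql("paimon_fs", {
    "type": "paimon",
    "warehouse": "s3://example-bucket/warehouse",
})

# Hive Metastore catalog for cross-engine metadata sharing (hypothetical host).
HIVE_CATALOG = create_catalog_sql("paimon_hive", {
    "type": "paimon",
    "metastore": "hive",
    "uri": "thrift://metastore.example:9083",
    "warehouse": "hdfs:///warehouse/paimon",
})

# Compaction and snapshot expiration settings; values are placeholders, not recommendations.
TUNING = alter_table_sql("orders", {
    "num-sorted-run.compaction-trigger": "5",
    "snapshot.num-retained.min": "10",
    "snapshot.time-retained": "1 h",
})

if __name__ == "__main__":
    print(FILESYSTEM_CATALOG, HIVE_CATALOG, TUNING, sep="\n\n")
```

Keeping the options in dictionaries makes it easy to diff settings between environments before any statement is executed.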

Key Features

Unified batch and streaming storage with changelog tracking for incremental reads
LSM-tree-based primary key tables with efficient upsert and partial update support
Time travel queries via snapshot management for reproducible analytics
Schema evolution support including column additions and type widening
Cross-engine compatibility with Flink, Spark, Hive, Trino, and StarRocks
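
To make the partial update feature concrete, here is a small Python sketch of the idea, assuming the commonly documented semantics in which a newer record overwrites a field only when its value is not null. The function name partial_update is hypothetical and this is a model of the behaviour, not Paimon code.

```python
from typing import Any, Dict, List


def partial_update(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge rows that share a primary key, oldest first.

    A None value means "no update for this field" and never overwrites
    a value written earlier.
    """
    merged: Dict[str, Any] = {}
    for row in rows:
        for field, value in row.items():
            if value is not None:
                merged[field] = value
            else:
                # Keep the field present without clobbering an earlier value.
                merged.setdefault(field, None)
    return merged


if __name__ == "__main__":
    # Two upstream sources each fill in part of the same row (id = 1).
    from_profile = {"id": 1, "name": "Ada", "email": None}
    from_contacts = {"id": 1, "name": None, "email": "ada@example.com"}
    print(partial_update([from_profile, from_contacts]))
```

This pattern lets several streams each contribute a subset of columns to one wide row without a streaming join.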

Comparison with Similar Tools

Delta Lake: Spark-centric with a batch focus; Paimon is designed for streaming-first workloads with Flink
Apache Iceberg: general-purpose table format; Paimon adds native changelog tracking for CDC
Apache Hudi: also supports incremental processing; Paimon builds on LSM-trees with the aim of high streaming write throughput
Apache Kafka: streaming transport; Paimon provides persistent lake storage with SQL query support
ClickHouse: OLAP engine; Paimon is a storage format that feeds into multiple query engines

FAQ

Q: How does Paimon differ from Apache Iceberg? A: Paimon is optimized for streaming writes through its LSM-tree storage and has built-in changelog tracking, while Iceberg focuses on batch-oriented table management. Paimon natively produces changelogs that downstream jobs can consume incrementally.

Q: Can I query Paimon tables with Spark? A: Yes. Paimon provides Spark connectors for both reading and writing. Tables created by Flink are fully readable by Spark and vice versa.

Q: What storage backends does Paimon support? A: Paimon supports S3, HDFS, Azure Blob Storage, Google Cloud Storage, OSS, and local filesystems via the Hadoop FileSystem interface.

Q: Is Paimon production-ready? A: Apache Paimon graduated as a top-level Apache project and is used in production by organizations processing streaming data at scale.
