Introduction
Apache Paimon is a data lake table format built for real-time streaming ingestion. Originally developed as Flink Table Store, it supports continuous data ingestion with changelog tracking while remaining readable by batch query engines such as Spark, Hive, and Trino.
What Apache Paimon Does
- Provides a table format for data lakes with support for streaming writes and batch reads
- Tracks row-level changelogs for incremental processing and CDC pipelines
- Supports primary key tables with upsert semantics and append-only tables
- Integrates natively with Apache Flink, Spark, Hive, Trino, and StarRocks
- Stores data in columnar file formats such as Parquet and ORC on S3, HDFS, or any Hadoop-compatible filesystem
Architecture Overview
Paimon organizes table state into snapshots. Each snapshot references manifest lists, which point to manifest files that in turn track the data files and changelog files belonging to that snapshot. Primary key tables use an LSM-tree structure: incoming records are buffered in memory, flushed to disk as sorted runs, and merged during compaction. Changelog files record INSERT, UPDATE, and DELETE operations, so downstream consumers can process only the incremental changes. For metadata management, the catalog layer can use either Hive Metastore or a filesystem catalog.
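The merge step and the resulting changelog can be illustrated with a small toy model. This is a minimal sketch, not Paimon's implementation: the record layout (key, sequence number, value), the use of None as a delete marker, and the function names are all assumptions made for illustration. The event labels +I, -U, +U, and -D follow the row-kind shorthand used by Flink.

```python
import heapq
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed record layout: (primary_key, sequence_number, value).
# A value of None marks a delete. Sequence numbers are unique.
Record = Tuple[str, int, Optional[str]]


def merge_sorted_runs(runs: Iterable[List[Record]]) -> List[Record]:
    """Merge sorted runs, keeping only the newest record per primary key."""
    merged: List[Record] = []
    # Each run is already sorted by (key, sequence), so heapq.merge can
    # stream them in global order without concatenating and re-sorting.
    for record in heapq.merge(*runs, key=lambda r: (r[0], r[1])):
        if merged and merged[-1][0] == record[0]:
            merged[-1] = record  # same key: the higher sequence number wins
        else:
            merged.append(record)
    return merged


def changelog(before: Dict[str, str], merged: List[Record]) -> List[Tuple[str, str, str]]:
    """Derive row-level change events by comparing merged rows with prior state."""
    events: List[Tuple[str, str, str]] = []
    for key, _seq, value in merged:
        old = before.get(key)
        if value is None:
            if old is not None:
                events.append(("-D", key, old))  # delete of an existing row
        elif old is None:
            events.append(("+I", key, value))  # insert of a new row
        elif old != value:
            events.append(("-U", key, old))  # update: old image
            events.append(("+U", key, value))  # update: new image
    return events


before = {"a": "1", "b": "2"}
run_1 = [("a", 1, "10"), ("c", 2, "3")]
run_2 = [("a", 3, "11"), ("b", 4, None)]

merged = merge_sorted_runs([run_1, run_2])
events = changelog(before, merged)
for event in events:
    print(event)
```

In this model a consumer that already holds the earlier state only needs the events list to catch up, which is the idea behind incremental reads.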
Self-Hosting & Configuration
- Add the Paimon connector JAR to your Flink or Spark installation
- Create a Paimon catalog pointing to an S3, HDFS, or local warehouse directory
- Configure a Hive Metastore catalog for cross-engine metadata sharing
- Tune compaction settings based on write throughput and query latency requirements
- Set up snapshot expiration policies to manage storage growth
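As a rough illustration of the steps above, the following sketch renders the Flink SQL statements for a Hive Metastore-backed catalog and a primary key table with compaction and snapshot-expiration options. The helper functions are hypothetical, as are the catalog name, table name, metastore URI, and warehouse path. The option keys are written as recalled from the Paimon documentation and the values are arbitrary, so both should be checked against the version you deploy.

```python
from typing import Dict


def with_clause(options: Dict[str, str]) -> str:
    """Render options as the body of a SQL WITH (...) clause."""
    return ",\n".join(f"    '{key}' = '{value}'" for key, value in options.items())


def create_catalog_sql(name: str, options: Dict[str, str]) -> str:
    """Build a Flink SQL CREATE CATALOG statement."""
    return f"CREATE CATALOG {name} WITH (\n{with_clause(options)}\n);"


def create_table_sql(name: str, columns: str, options: Dict[str, str]) -> str:
    """Build a Flink SQL CREATE TABLE statement with table-level options."""
    return f"CREATE TABLE {name} (\n{columns}\n) WITH (\n{with_clause(options)}\n);"


# Hypothetical endpoints: replace with your own metastore URI and warehouse path.
catalog_sql = create_catalog_sql("paimon_catalog", {
    "type": "paimon",
    "metastore": "hive",
    "uri": "thrift://metastore-host:9083",
    "warehouse": "s3://example-bucket/warehouse",
})

# Option keys and values are assumptions to check against your Paimon version.
table_sql = create_table_sql(
    "orders",
    "    order_id BIGINT,\n    amount DOUBLE,\n    PRIMARY KEY (order_id) NOT ENFORCED",
    {
        "bucket": "4",
        "num-sorted-run.compaction-trigger": "5",
        "snapshot.time-retained": "1 h",
        "snapshot.num-retained.min": "10",
    },
)

print(catalog_sql)
print(table_sql)
```

The printed statements can be pasted into a Flink SQL client once the connector JAR is on the classpath. Keeping the options in a dictionary makes it easy to vary compaction and retention settings per table.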
Key Features
- Unified batch and streaming storage with changelog tracking for incremental reads
- LSM-tree-based primary key tables with efficient upsert and partial update support
- Time travel queries via snapshot management for reproducible analytics
- Schema evolution support including column additions and type widening
- Cross-engine compatibility with Flink, Spark, Hive, Trino, and StarRocks
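Partial update is the least obvious of these features, so here is a toy model of it. This is a minimal sketch under the assumption that a later record overwrites a column only when it carries a non-null value for that column. The function name and row layout are hypothetical and do not reflect Paimon's internal code.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def partial_update(rows: List[Row], key: str) -> Dict[object, Row]:
    """Merge rows sharing a primary key; later non-null fields overwrite earlier ones."""
    table: Dict[object, Row] = {}
    for row in rows:
        current = table.setdefault(row[key], {})
        for column, value in row.items():
            # A null never overwrites an existing value; it is only stored
            # when the column has not been seen for this key yet.
            if value is not None or column not in current:
                current[column] = value
    return table


# Two writers each fill in a different column of the row with id 1.
rows = [
    {"id": 1, "name": "Ada", "city": None},
    {"id": 1, "name": None, "city": "London"},
    {"id": 2, "name": "Lin", "city": None},
]

result = partial_update(rows, key="id")
print(result)
```

This pattern lets several upstream streams each contribute a subset of columns to one wide row without joining them before the write.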
Comparison with Similar Tools
- Delta Lake — Spark-centric with batch focus; Paimon is designed for Flink streaming-first workloads
- Apache Iceberg — general-purpose table format; Paimon adds native changelog tracking for CDC
- Apache Hudi — supports incremental processing; Paimon uses LSM-trees for higher streaming write throughput
- Apache Kafka — streaming transport; Paimon provides persistent lake storage with SQL query support
- ClickHouse — OLAP engine; Paimon is a storage format that feeds into multiple query engines
FAQ
Q: How does Paimon differ from Apache Iceberg? A: Paimon is optimized for streaming writes: it stores primary key tables in LSM-trees and natively produces changelogs for incremental downstream processing. Iceberg focuses on batch-oriented table management.
Q: Can I query Paimon tables with Spark? A: Yes. Paimon provides Spark connectors for both reading and writing. Tables created by Flink are fully readable by Spark and vice versa.
Q: What storage backends does Paimon support? A: Paimon supports S3, HDFS, Azure Blob Storage, Google Cloud Storage, Alibaba Cloud OSS, and local filesystems via the Hadoop FileSystem interface.
Q: Is Paimon production-ready? A: Yes. Apache Paimon has graduated from the Apache Incubator to a top-level Apache project and is used in production by organizations processing streaming data at scale.