Apache Avro — Schema-Based Data Serialization System

Introduction

Apache Avro is a data serialization system that uses JSON-defined schemas to produce compact binary data. It is a common serialization format for Apache Kafka, typically paired with a schema registry, and is widely used throughout the Hadoop ecosystem for data storage and RPC, where its schema evolution rules let data formats change over time.

What Apache Avro Does

Serializes structured data into a compact binary format using JSON-defined schemas
Supports forward and backward schema evolution without breaking consumers
Provides libraries for Java, Python, C, C++, C#, and other languages, with optional code generation for statically typed languages such as Java
Includes an RPC framework for building schema-aware network services
Is supported in Kafka, Hadoop, Spark, Flink, and Hive through serializers, connectors, or file-format support

Architecture Overview

Avro schemas are defined in JSON and describe record types with named fields, each with a type. The binary encoding writes field values in schema-declared order without field names or tags, which typically yields smaller payloads than tagged formats but means the data cannot be decoded without the writer schema. At deserialization time the writer schema is resolved against the reader schema, which is what enables schema evolution. Container files embed the writer schema in the file header, so the files are self-describing and a reader needs nothing else to decode them. The Schema Registry pattern (used with Kafka) stores schemas centrally and embeds only a schema ID in each message.
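
The tag-free encoding described above can be illustrated with a minimal sketch in plain Python. It assumes only what the Avro specification states for these types: int and long values are zigzag-encoded and written as variable-length integers, strings are a length followed by UTF-8 bytes, and a record is its field values concatenated in schema order. The User schema and the helper names are hypothetical, and only the long, int, and string types are handled.

```python
import json

def encode_long(n: int) -> bytes:
    """Encode an int/long: zigzag first, then a base-128 variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negative numbers to small positive ones
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s: str) -> bytes:
    """Encode a string as its UTF-8 byte length (a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

# Only the types used in this sketch; a real encoder covers the whole type system.
ENCODERS = {"int": encode_long, "long": encode_long, "string": encode_string}

def encode_record(schema: dict, record: dict) -> bytes:
    """Concatenate field values in schema-declared order, with no names or tags."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]]) for field in schema["fields"]
    )

# Hypothetical schema used for illustration only.
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}

user = {"id": 64, "name": "foo"}
payload = encode_record(USER_SCHEMA, user)
print(payload.hex(" "))                       # 80 01 06 66 6f 6f
print(len(payload), len(json.dumps(user)))    # binary size vs. JSON text size
```

Because the payload carries no field names, a decoder has to walk the same schema in the same order, which is why the writer schema must travel with the data or be retrievable by ID.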

Self-Hosting & Configuration

Define schemas as JSON files with record types, fields, and types
Generate language-specific classes using the avro-tools CLI or Maven/Gradle plugin
Use GenericRecord for dynamic schema handling without code generation
Deploy a Schema Registry (like Confluent Schema Registry) alongside Kafka for centralized schema management
Configure compatibility rules (BACKWARD, FORWARD, FULL) to enforce safe evolution
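
As a minimal sketch of the first step in the list above, the following defines a record schema as JSON text and loads it with Python's standard json module. The record name User, the namespace com.example, and the field names are hypothetical, and the helper only inspects the JSON structure; it is not an Avro validator.

```python
import json

# Hypothetical schema; in a project this text would usually live in a .avsc file.
USER_SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""

schema = json.loads(USER_SCHEMA_JSON)

def fields_without_default(record_schema: dict) -> list:
    """List the fields that declare no default value."""
    return [f["name"] for f in record_schema["fields"] if "default" not in f]

required = fields_without_default(schema)
print(schema["namespace"] + "." + schema["name"])
print(required)
```

Fields without a default are the ones a reader cannot fill in on its own during schema resolution, so they are worth reviewing before a compatibility rule such as BACKWARD is enforced.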

Key Features

Compact binary format with no per-field tags reduces payload size
Schema evolution with backward and forward compatibility, provided changes follow the schema resolution rules
Self-describing container files embed the schema for standalone use
Language-neutral: libraries exist for Java, Python, C, C++, C#, Ruby, and more
Widely adopted serialization format in the Apache Kafka and Hadoop ecosystems

Comparison with Similar Tools

Protocol Buffers: uses numbered field tags for evolution; Avro uses schema resolution instead and often produces smaller payloads because it omits per-field tags
JSON: human-readable but verbose; Avro is binary and significantly more compact
MessagePack: schema-less binary; Avro enforces schemas for type safety and evolution
Thrift: includes RPC and transport layers; Avro also defines an RPC framework but is used mostly for serialization, with evolution handled by schema resolution rather than field IDs
Parquet: columnar storage format; Avro is row-oriented and used for serialization and messaging

FAQ

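Q: Why is Avro so widely used with Kafka? A: Kafka itself stores messages as opaque bytes and does not mandate a format, but Avro combines compact binary encoding with schema evolution support, which suits long-lived topics. The Schema Registry pattern lets producers and consumers evolve independently while maintaining compatibility.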
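
The idea of embedding only a schema ID in each message can be sketched as follows. This assumes the framing that Confluent documents for its Schema Registry serializers (one zero magic byte, a 4-byte big-endian schema ID, then the Avro binary body); that framing is a registry convention, not part of the Avro specification. The function names and the schema ID 42 are hypothetical.

```python
import struct

MAGIC_BYTE = 0  # framing version marker (assumed: Confluent-style framing)

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro-encoded body with the magic byte and the registry schema ID."""
    return struct.pack(">BI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes):
    """Split a framed message into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">BI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]

# b"\x02" stands in for an Avro-encoded body (here, the long value 1).
message = frame(42, b"\x02")
schema_id, body = unframe(message)
print(message.hex(" "), schema_id, body)
```

A consumer would use the extracted ID to fetch the writer schema from the registry, then decode the body against its own reader schema.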

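Q: How does schema evolution work? A: Writers and readers can use different schema versions, which Avro reconciles field by field when data is read. A field present in the writer's data but absent from the reader schema is skipped, and a field the reader expects but the writer did not write is filled from the reader schema's default. Adding or removing a field is therefore safe only when every reader schema that still expects the field declares a default for it; the registry compatibility rules exist to enforce this.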
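
A simplified sketch of record resolution, working on already-decoded dictionaries rather than on binary data, may make the role of defaults concrete. It covers only field matching by name and reader defaults; aliases, type promotion, and union resolution are left out, and the schemas V1 and V2 and the function name are hypothetical.

```python
def resolve_record(writer_schema: dict, reader_schema: dict, datum: dict) -> dict:
    """Project a datum written with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            resolved[name] = datum[name]          # present in the written data
        elif "default" in field:
            resolved[name] = field["default"]     # filled from the reader default
        else:
            raise ValueError(f"reader field {name!r} is missing and has no default")
    return resolved                               # writer-only fields are dropped

# Hypothetical schema versions: V2 drops "name" and adds "email" with a default.
V1 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
]}
V2 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": None},
]}

# New reader, old data: "email" comes from the default, "name" is skipped.
upgraded = resolve_record(V1, V2, {"id": 1, "name": "Ada"})

# Old reader, new data: "name" has no default in V1, so resolution fails.
error = None
try:
    resolve_record(V2, V1, {"id": 2, "email": "ada@example.com"})
except ValueError as exc:
    error = str(exc)

print(upgraded)
print(error)
```

The second case shows why removing a field is only safe when existing readers declare a default for it.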

Q: Do I need code generation to use Avro? A: No. Avro supports GenericRecord for dynamic usage without generated classes. Code generation is optional but provides type-safe access in statically typed languages.

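Q: Can Avro schemas reference other schemas? A: Yes. Named types (records, enums, and fixed types) can be referenced by name once they have been defined, and schemas can be composed using unions and nested records. Sharing a named type across separate schema files depends on the parser or build tooling loading the definitions in the right order.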
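
Referencing a named type can be sketched with a schema that defines an Address record inline on first use and refers to it by name afterwards. The Order and Address schemas are hypothetical, and the checker below is a simplification that compares simple names only and ignores namespaces; it is not the validation an Avro library performs.

```python
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}

def check_refs(schema, known: set) -> set:
    """Walk a schema in declaration order; fail on a name used before definition."""
    if isinstance(schema, str):
        if schema not in PRIMITIVES and schema not in known:
            raise ValueError(f"unknown type reference: {schema}")
    elif isinstance(schema, list):                  # union: check every branch
        for branch in schema:
            check_refs(branch, known)
    elif schema["type"] in ("record", "enum", "fixed"):
        known.add(schema["name"])                   # named type is now referenceable
        for field in schema.get("fields", []):
            check_refs(field["type"], known)
    elif schema["type"] == "array":
        check_refs(schema["items"], known)
    elif schema["type"] == "map":
        check_refs(schema["values"], known)
    else:
        check_refs(schema["type"], known)           # e.g. {"type": "string"}
    return known

# Hypothetical schema: Address is defined once, then referenced by name.
ORDER = {"type": "record", "name": "Order", "fields": [
    {"name": "billing", "type": {
        "type": "record", "name": "Address",
        "fields": [{"name": "street", "type": "string"}],
    }},
    {"name": "shipping", "type": ["null", "Address"], "default": None},
]}

names = check_refs(ORDER, set())
print(sorted(names))
```

Swapping the two fields so that the reference comes before the definition makes the checker raise, which mirrors the define-before-use ordering that parsers generally expect.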
