Apache Storm — Distributed Real-Time Stream Processing Engine

Introduction

Apache Storm is a distributed real-time computation system for processing unbounded streams of data. Originally created at BackType (acquired by Twitter), Storm provides guaranteed message processing, horizontal scalability, and fault tolerance for applications that require low-latency analytics, continuous computation, and real-time ETL.

What Apache Storm Does

Processes over a million tuples per second per node in the project's published benchmark, with sub-second processing latency
Guarantees at-least-once message processing in core Storm, with exactly-once semantics available through the Trident API
Distributes computation across a cluster with automatic task reassignment on failures
Supports multiple programming languages through its multi-language protocol (Python, Ruby, JavaScript)
Integrates with Kafka, HDFS, HBase, Redis, Cassandra, and other data systems via connectors

Architecture Overview

Storm topologies consist of spouts (data sources) and bolts (processing units) connected in a directed acyclic graph. Nimbus (the master daemon) distributes topology code across the cluster and assigns tasks to Supervisors, which spawn worker processes on each node. ZooKeeper coordinates state between Nimbus and Supervisors. Tuples flow through the topology, and Storm's acker mechanism tracks the completion of each tuple tree to provide reliability guarantees.
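
The acker's bookkeeping described above can be illustrated with a small model. This is a minimal sketch, assuming the commonly documented XOR scheme: each tuple gets a random 64-bit id, and the acker keeps one running XOR value per spout tuple that returns to zero once every tuple in the tree has been both created and acked. The class and method names here are hypothetical and are not Storm's internal API.

```python
import random


def new_id():
    """Random non-zero 64-bit tuple id (zero is reserved to mean 'tree complete')."""
    return random.getrandbits(64) or 1


class Acker:
    """Toy model of XOR-based tuple-tree tracking; not Storm's actual code."""

    def __init__(self):
        # spout (root) tuple id -> running XOR of every id reported so far
        self.pending = {}

    def _update(self, root_id, value):
        ack_val = self.pending.get(root_id, 0) ^ value
        if ack_val == 0:
            # Every id was XORed in twice (once on creation, once on ack).
            self.pending.pop(root_id, None)
            return True
        self.pending[root_id] = ack_val
        return False

    def init(self, root_id, tuple_id):
        """The spout registers the first tuple of a new tree."""
        return self._update(root_id, tuple_id)

    def ack(self, root_id, acked_id, emitted_ids=()):
        """A bolt acks one tuple and reports the tuples it anchored on it."""
        value = acked_id
        for child_id in emitted_ids:
            value ^= child_id
        return self._update(root_id, value)


# One spout tuple fans out into two child tuples, which are then acked.
acker = Acker()
root, first = new_id(), new_id()
acker.init(root, first)
child_a, child_b = new_id(), new_id()
acker.ack(root, first, [child_a, child_b])   # tree still open
acker.ack(root, child_a)                     # tree still open
tree_done = acker.ack(root, child_b)         # last ack closes the tree
print("tree complete:", tree_done)
```

The appeal of this design is that the acker needs only a fixed amount of memory per spout tuple, regardless of how many downstream tuples the tree contains.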

Self-Hosting & Configuration

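Requires Java 11+, ZooKeeper 3.5+, and Python 3, which the storm command-line client and Python multi-language components rely on; check the release notes for the exact versions your Storm release supports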
Configure storm.yaml with nimbus.seeds, supervisor.slots.ports, and storm.zookeeper.servers
Set worker heap size and parallelism hints based on workload and available cluster resources
Deploy topologies via storm jar and manage them through the Storm UI on port 8080
Enable Kerberos authentication and SSL for production cluster security
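
To make the storm.yaml settings listed above concrete, here is a small sketch that renders a minimal configuration as text. The keys (storm.zookeeper.servers, nimbus.seeds, storm.local.dir, supervisor.slots.ports) are standard Storm settings; the hostnames, the local directory, and the helper function name are placeholders invented for illustration. Each port under supervisor.slots.ports corresponds to one worker slot on that Supervisor node.

```python
def render_storm_yaml(zookeeper_servers, nimbus_seeds, slot_ports, local_dir):
    """Render a minimal storm.yaml as text. Hypothetical helper, not part of Storm."""
    lines = ["storm.zookeeper.servers:"]
    lines += [f'  - "{host}"' for host in zookeeper_servers]
    seeds = ", ".join(f'"{host}"' for host in nimbus_seeds)
    lines.append(f"nimbus.seeds: [{seeds}]")
    lines.append(f'storm.local.dir: "{local_dir}"')
    lines.append("supervisor.slots.ports:")
    lines += [f"  - {port}" for port in slot_ports]
    return "\n".join(lines) + "\n"


# Placeholder hosts and paths; substitute the values for your own cluster.
config_text = render_storm_yaml(
    zookeeper_servers=["zk1.example.com", "zk2.example.com", "zk3.example.com"],
    nimbus_seeds=["nimbus1.example.com"],
    slot_ports=[6700, 6701, 6702, 6703],  # four worker slots per Supervisor
    local_dir="/var/storm",
)
print(config_text)
```

The same file is typically placed on the Nimbus and Supervisor nodes so that every daemon agrees on where ZooKeeper and Nimbus live.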

Key Features

Horizontal scalability with dynamic rebalancing of topology parallelism
Guaranteed message processing: at-least-once in core Storm through tuple acking, or exactly-once through Trident
Trident API provides high-level abstractions for stateful stream processing and micro-batching
Multi-language support allows writing spouts and bolts in Python, Ruby, or any other language that can exchange JSON messages over stdin and stdout
Fault tolerant with automatic worker restart and task reassignment on node failures
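
As an illustration of the multi-language support listed above, the sketch below models the message framing a non-JVM component would use: each message is a JSON document followed by a line containing only "end". The bolt logic, the sample field values, and the function names are assumptions made for this example, and the initial handshake that a real component performs is omitted. In practice you would normally use the adapter library shipped with Storm rather than hand-rolling the protocol.

```python
import io
import json


def read_message(stream):
    """Read one message: JSON text terminated by a line containing only 'end'."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    raise EOFError("stream closed before the 'end' marker")


def write_message(stream, message):
    """Write one message as JSON followed by the 'end' marker line."""
    stream.write(json.dumps(message) + "\nend\n")


def handle_tuple(tup, out):
    """Toy bolt: emit one word per tuple, anchored on the input, then ack the input."""
    for word in tup["tuple"][0].split():
        write_message(out, {"command": "emit", "anchors": [tup["id"]], "tuple": [word]})
    write_message(out, {"command": "ack", "id": tup["id"]})


# In-memory streams stand in for the stdin/stdout pipes to the worker process.
incoming = io.StringIO()
write_message(incoming, {"id": "t-1", "comp": "sentence-spout", "stream": "default",
                         "task": 3, "tuple": ["storm is fast"]})
incoming.seek(0)

outgoing = io.StringIO()
handle_tuple(read_message(incoming), outgoing)
outgoing.seek(0)
replies = [read_message(outgoing) for _ in range(4)]
for reply in replies:
    print(reply)
```

Anchoring each emitted word on the input tuple's id is what lets the acker tie the new tuples back to the original spout tuple.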

Comparison with Similar Tools

Apache Flink — modern stream processor with event-time semantics and exactly-once state consistency through checkpointing; Storm is simpler but less feature-rich for stateful processing
Apache Kafka Streams — library-based stream processing tied to Kafka; Storm is a standalone cluster with broader source support
Apache Spark Streaming — micro-batch approach with higher latency; Storm provides true per-tuple processing
Apache Samza — stream processor integrated with Kafka and YARN; Storm uses its own resource management
Amazon Kinesis Data Analytics (since renamed Amazon Managed Service for Apache Flink) — managed streaming service on AWS; Storm is self-hosted and vendor-neutral

FAQ

Q: Is Apache Storm still actively maintained? A: Yes. Storm continues to receive releases under the Apache Software Foundation, though Flink and Kafka Streams have become more popular for new deployments.

Q: What is the Trident API? A: Trident is a high-level abstraction on top of Storm that provides exactly-once processing, stateful operations, and micro-batching for use cases that need stronger consistency guarantees.
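
One way to picture how micro-batching enables exactly-once state updates is the transaction-id check sketched here. It assumes that every batch carries a transaction id, that a replayed batch has the same id and the same contents, and that batches are committed in order. The class is a hypothetical illustration of that idea, not Trident's API.

```python
class TransactionalCounter:
    """Toy state store that remembers the last transaction id applied to each key."""

    def __init__(self):
        self.store = {}  # key -> (last applied txid, running count)

    def apply_batch(self, txid, batch_counts):
        """Apply per-key increments from one batch, skipping keys already updated by it."""
        for key, delta in batch_counts.items():
            last_txid, count = self.store.get(key, (None, 0))
            if last_txid == txid:
                continue  # this batch was already applied: a replay, so do nothing
            self.store[key] = (txid, count + delta)

    def get(self, key):
        return self.store.get(key, (None, 0))[1]


state = TransactionalCounter()
state.apply_batch(1, {"storm": 2, "kafka": 1})
state.apply_batch(1, {"storm": 2, "kafka": 1})  # replay after a simulated failure
state.apply_batch(2, {"storm": 3})
print(state.get("storm"), state.get("kafka"))
```

The batch may be processed more than once, but because the stored transaction id is checked before each write, the state reflects each batch only once.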

Q: How does Storm handle backpressure? A: Storm implements backpressure by monitoring executor queue sizes. When a bolt falls behind, upstream spouts are throttled to prevent memory exhaustion.
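
The throttling behaviour in the answer above can be modelled with a bounded queue and two watermarks. This is only an illustrative sketch: the class name, the capacity, and the watermark values are assumptions chosen for the example, not Storm's implementation or its default settings.

```python
from collections import deque


class BoundedExecutorQueue:
    """Toy receive queue that asks producers to back off between two watermarks."""

    def __init__(self, capacity, high_mark, low_mark):
        self.capacity = capacity
        self.high_mark = high_mark  # start throttling at this queue length
        self.low_mark = low_mark    # stop throttling once drained to this length
        self.items = deque()
        self.throttled = False

    def _refresh(self):
        if len(self.items) >= self.high_mark:
            self.throttled = True
        elif len(self.items) <= self.low_mark:
            self.throttled = False

    def offer(self, item):
        """Called by the upstream component; False means 'back off and retry later'."""
        if self.throttled or len(self.items) >= self.capacity:
            return False
        self.items.append(item)
        self._refresh()
        return True

    def poll(self):
        """Called by the slow bolt; returns None when the queue is empty."""
        item = self.items.popleft() if self.items else None
        self._refresh()
        return item


queue = BoundedExecutorQueue(capacity=10, high_mark=9, low_mark=4)
accepted = sum(queue.offer(n) for n in range(12))  # producer outpaces the bolt
print("accepted:", accepted, "throttled:", queue.throttled)

for _ in range(5):  # the bolt catches up
    queue.poll()
print("throttled after draining:", queue.throttled)
```

Using two thresholds instead of one avoids rapid flapping between throttled and unthrottled states when the queue length hovers near a single limit.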

Q: Can Storm process data from Kafka? A: Yes. The storm-kafka-client module provides a KafkaSpout for consuming Kafka topics with configurable offset management and partition assignment.
