# Apache Storm — Distributed Real-Time Stream Processing Engine

> Apache Storm is a distributed real-time computation system for processing unbounded streams of data with guaranteed message processing and sub-second latency.

## Quick Use
```bash
# Install Storm (standalone development mode)
wget https://dlcdn.apache.org/storm/apache-storm-2.7.0/apache-storm-2.7.0.tar.gz
tar -xzf apache-storm-2.7.0.tar.gz
cd apache-storm-2.7.0

# Start local development cluster
bin/storm dev-zookeeper &
bin/storm nimbus &
bin/storm supervisor &
bin/storm ui &

# Submit a topology
bin/storm jar examples/storm-starter/storm-starter-*.jar org.apache.storm.starter.WordCountTopology
```

## Introduction
Apache Storm is a distributed real-time computation system for processing unbounded streams of data. Originally created at BackType (acquired by Twitter), Storm provides guaranteed message processing, horizontal scalability, and fault tolerance for applications that require low-latency analytics, continuous computation, and real-time ETL.

## What Apache Storm Does
- Processes over a million tuples per second per node in published benchmarks, with sub-second processing latency
- Guarantees at-least-once message processing in the core API and exactly-once semantics via Trident
- Distributes computation across a cluster with automatic task reassignment on failures
- Supports multiple programming languages through its multi-language protocol (Python, Ruby, JavaScript)
- Integrates with Kafka, HDFS, HBase, Redis, Cassandra, and other data systems via connectors

## Architecture Overview
Storm topologies consist of spouts (data sources) and bolts (processing units) connected in a directed acyclic graph. Nimbus (the master daemon) distributes topology code across the cluster and assigns tasks to Supervisors, which spawn worker processes on each node. ZooKeeper coordinates state between Nimbus and Supervisors. Tuples flow through the topology, and Storm's acker mechanism tracks the completion of each tuple tree to provide reliability guarantees.
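
The acker's tuple-tree tracking is usually explained as a running XOR over 64-bit tuple ids: each id is XORed into a per-tree value once when the tuple is created and once when it is acked, so the value returns to zero exactly when every tuple in the tree has been acked. The following is a minimal, self-contained sketch of that idea in plain Java. It does not use Storm's classes; `AckerSketch`, `init`, `ack`, and `isPending` are hypothetical names chosen for illustration, and the tuple ids are arbitrary constants rather than the random ids a real cluster would generate.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of XOR-based tuple-tree tracking (not Storm's actual acker class).
final class AckerSketch {
    // Root tuple id -> running XOR of every tuple id created or acked in that tree.
    private final Map<Long, Long> pending = new HashMap<>();

    /** Called when a spout emits a root tuple: start tracking its tree. */
    void init(long rootId, long tupleId) {
        pending.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    /**
     * Called when a bolt acks an input tuple and reports the child tuples it
     * anchored to that input. Returns true once the whole tree is complete.
     */
    boolean ack(long rootId, long ackedId, long... childIds) {
        long delta = ackedId;
        for (long child : childIds) {
            delta ^= child; // each child id enters the XOR once now, once when acked
        }
        long updated = pending.getOrDefault(rootId, 0L) ^ delta;
        if (updated == 0L) {
            pending.remove(rootId); // every id was XORed in twice: tree fully processed
            return true;
        }
        pending.put(rootId, updated);
        return false;
    }

    boolean isPending(long rootId) {
        return pending.containsKey(rootId);
    }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        long root = 1L;
        long a = 0x5DEECE66DL, b = 0x1234ABCDL, c = 0x0F0F0F0FL; // illustrative ids

        acker.init(root, a);                          // spout emits tuple a
        System.out.println(acker.ack(root, a, b, c)); // bolt acks a, emits b and c: false
        System.out.println(acker.ack(root, b));       // b acked, c still outstanding: false
        System.out.println(acker.ack(root, c));       // last tuple acked: true
    }
}
```

The appeal of this design is that memory per tree stays constant no matter how many tuples the tree contains, since only one 64-bit value is kept per root tuple.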

## Self-Hosting & Configuration
- Requires Java 11+, ZooKeeper 3.5+, and Python 3 (used by the `storm` command-line script and by Python multi-language components)
- Configure `storm.yaml` with `nimbus.seeds`, `supervisor.slots.ports`, and `storm.zookeeper.servers`
- Set worker heap size and parallelism hints based on workload and available cluster resources
- Deploy topologies via `storm jar` and manage them through the Storm UI on port 8080
- Enable Kerberos authentication and SSL for production cluster security

## Key Features
- Horizontal scalability with dynamic rebalancing of topology parallelism
- Guaranteed message processing with at-least-once semantics in the core API and exactly-once semantics through Trident
- Trident API provides high-level abstractions for stateful stream processing and micro-batching
- Multi-language support allows writing spouts and bolts in Python, Ruby, or any other language that can exchange JSON messages over stdin/stdout
- Fault tolerant with automatic worker restart and task reassignment on node failures

## Comparison with Similar Tools
- **Apache Flink**: modern stream processor with event-time semantics and exactly-once state consistency through checkpointing; Storm is simpler but less feature-rich for stateful processing
- **Apache Kafka Streams**: library-based stream processing tied to Kafka; Storm is a standalone cluster with broader source support
- **Apache Spark Streaming**: micro-batch approach with higher latency; Storm's core API processes each tuple individually
- **Apache Samza**: stream processor integrated with Kafka and YARN; Storm uses its own resource management
- **Amazon Kinesis Data Analytics** (since renamed Amazon Managed Service for Apache Flink): managed streaming service on AWS; Storm is self-hosted and vendor-neutral

## FAQ
**Q: Is Apache Storm still actively maintained?**
A: Yes. Storm continues to receive releases under the Apache Software Foundation, though Flink and Kafka Streams have become more popular for new deployments.

**Q: What is the Trident API?**
A: Trident is a high-level abstraction on top of Storm that provides exactly-once processing, stateful operations, and micro-batching for use cases that need stronger consistency guarantees.
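
One way to picture how micro-batching enables exactly-once state updates is the transactional-state pattern: each batch carries a transaction id, the store keeps the id of the last batch applied next to each value, and a replayed batch with the same id is skipped. The sketch below models that pattern in plain Java without Trident's classes. It assumes that a replayed batch has the same contents as the original and that batches are applied in transaction-id order; `TransactionalCountSketch`, `applyBatch`, and `get` are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of idempotent, batch-id-based state updates (not Trident's API).
final class TransactionalCountSketch {
    /** A stored count together with the id of the last batch applied to it. */
    private static final class Entry {
        long count;
        long lastTxId;

        Entry(long count, long lastTxId) {
            this.count = count;
            this.lastTxId = lastTxId;
        }
    }

    private final Map<String, Entry> store = new HashMap<>();

    /** Adds a batch's partial count for a key unless that batch was already applied. */
    void applyBatch(long txId, String key, long delta) {
        Entry entry = store.get(key);
        if (entry == null) {
            store.put(key, new Entry(delta, txId));
            return;
        }
        if (entry.lastTxId == txId) {
            return; // replay of a batch that already updated this key: skip it
        }
        entry.count += delta;
        entry.lastTxId = txId;
    }

    long get(String key) {
        Entry entry = store.get(key);
        return entry == null ? 0L : entry.count;
    }

    public static void main(String[] args) {
        TransactionalCountSketch state = new TransactionalCountSketch();
        state.applyBatch(1L, "storm", 3L);
        state.applyBatch(2L, "storm", 2L);
        state.applyBatch(2L, "storm", 2L); // batch 2 replayed after a failure
        System.out.println(state.get("storm")); // prints 5
    }
}
```

Under these assumptions, at-least-once delivery of batches still yields a count that reflects each batch exactly once, because a repeated delivery is detected by its transaction id.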

**Q: How does Storm handle backpressure?**
A: Storm implements backpressure by monitoring executor queue sizes. When a bolt falls behind, upstream spouts are throttled to prevent memory exhaustion.
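
The throttling idea in this answer can be illustrated with a queue guarded by two thresholds: the producer is paused once the queue fills to a high mark and resumed only after the consumer drains it to a low mark, which avoids rapid on/off flapping. The sketch below is a single-threaded simulation in plain Java, not Storm's implementation; the class name `BackpressureSketch`, its methods, and the marks of 9 and 4 are illustrative assumptions, and the exact mechanism differs between Storm versions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of watermark-based throttling between a spout and a bolt.
final class BackpressureSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private final int highMark; // queue size at which the spout is paused
    private final int lowMark;  // queue size at which the spout is resumed
    private boolean throttled = false;

    BackpressureSketch(int highMark, int lowMark) {
        if (lowMark < 0 || lowMark >= highMark) {
            throw new IllegalArgumentException("need 0 <= lowMark < highMark");
        }
        this.highMark = highMark;
        this.lowMark = lowMark;
    }

    /** Spout side: returns false, without enqueuing, while the throttle is on. */
    boolean offer(String tuple) {
        if (throttled) {
            return false;
        }
        queue.addLast(tuple);
        if (queue.size() >= highMark) {
            throttled = true; // the bolt has fallen behind: pause the spout
        }
        return true;
    }

    /** Bolt side: takes one tuple and lifts the throttle once the queue has drained. */
    String poll() {
        String tuple = queue.pollFirst();
        if (throttled && queue.size() <= lowMark) {
            throttled = false;
        }
        return tuple;
    }

    boolean isThrottled() {
        return throttled;
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        BackpressureSketch q = new BackpressureSketch(9, 4);
        int accepted = 0;
        for (int i = 0; i < 12; i++) {
            if (q.offer("tuple-" + i)) {
                accepted++;
            }
        }
        // prints "9 accepted, throttled=true"
        System.out.println(accepted + " accepted, throttled=" + q.isThrottled());

        for (int i = 0; i < 5; i++) {
            q.poll();
        }
        // prints "size=4, throttled=false"
        System.out.println("size=" + q.size() + ", throttled=" + q.isThrottled());
    }
}
```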

**Q: Can Storm process data from Kafka?**
A: Yes. The `storm-kafka-client` module provides a `KafkaSpout` for consuming Kafka topics with configurable offset management and partition assignment.

## Sources
- https://github.com/apache/storm
- https://storm.apache.org/releases/current/

---
Source: https://tokrepo.com/en/workflows/asset-6c1d4873
Author: AI Open Source