# Apache Spark — Unified Analytics Engine for Big Data

> Apache Spark is the most widely used engine for large-scale data processing. It provides in-memory computing for batch processing, SQL queries, machine learning, graph processing, and streaming — all through a unified API in Python, Scala, Java, and R.

## Quick Use

```bash
# Install PySpark
pip install pyspark

# Quick demo
python3 -c "
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('demo').getOrCreate()
df = spark.createDataFrame([
    ('Alice', 100), ('Bob', 200), ('Alice', 150)
], ['name', 'amount'])
df.groupBy('name').sum('amount').show()
spark.stop()
"

# Or download Spark standalone
# wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
# ./bin/pyspark  # interactive shell
```

## Introduction

Apache Spark is the de facto standard for big data processing. It superseded Hadoop MapReduce by providing in-memory computation that is 10-100x faster. Spark provides a unified engine for batch processing (Spark SQL), streaming (Structured Streaming), machine learning (MLlib), and graph processing (GraphX).

With over 43,000 GitHub stars, Spark is used by virtually every data-driven organization. Netflix processes petabytes with Spark, Uber runs thousands of Spark jobs daily, and all major cloud providers offer managed Spark services (Databricks, EMR, Dataproc, HDInsight).

## What Spark Does

Spark distributes data processing across a cluster of machines. You write transformations on DataFrames (similar to pandas), and Spark optimizes and distributes the execution. It handles data partitioning, task scheduling, fault recovery, and memory management automatically.
## Architecture Overview

```
[Spark Application]    Python, Scala, Java, R, SQL
         |
  [SparkSession]       Entry point for all Spark functionality
         |
[Catalyst Optimizer]   Query planning and optimization
         |
 [Tungsten Engine]     Code generation and memory management
         |
   +-------------+----------------------+
   |             |                      |
[Spark SQL]   [Structured Streaming]  [MLlib]
Batch queries  Real-time processing   Machine learning at scale
         |
[Cluster Manager]      Standalone, YARN, Mesos, Kubernetes
         |
 [Data Sources]        Parquet, ORC, CSV, JSON, JDBC, Delta Lake, Kafka, S3, HDFS
```

## Self-Hosting & Configuration

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create Spark session
spark = SparkSession.builder \
    .appName("DataPipeline") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read data from various sources
df = spark.read.parquet("s3://bucket/data/")
# df = spark.read.csv("data.csv", header=True, inferSchema=True)
# df = spark.read.json("data.json")

# Transformations (lazy — not executed until an action)
result = (
    df
    .filter(F.col("status") == "active")
    .groupBy("category")
    .agg(
        F.count("*").alias("total"),
        F.sum("revenue").alias("total_revenue"),
        F.avg("revenue").alias("avg_revenue")
    )
    .orderBy(F.desc("total_revenue"))
)

# Action — triggers execution
result.show()
result.write.parquet("output/summary/")

# SQL interface
df.createOrReplaceTempView("sales")
spark.sql("""
    SELECT category, COUNT(*) AS cnt, SUM(revenue) AS total
    FROM sales
    WHERE status = 'active'
    GROUP BY category
    ORDER BY total DESC
""").show()
```

## Key Features

- **In-Memory Computing** — 10-100x faster than Hadoop MapReduce
- **Unified Engine** — batch, streaming, ML, and graph in one framework
- **Spark SQL** — SQL queries on structured data with Catalyst optimizer
- **Structured Streaming** — real-time stream processing with exactly-once guarantees
- **MLlib** — distributed machine learning library
- **Multi-Language** — Python, Scala, Java, R, and SQL APIs
- **Delta Lake** — ACID transactions on data lakes
- **Cloud Native** — runs on Kubernetes, YARN, or standalone

## Comparison with Similar Tools

| Feature | Spark | Flink | Presto/Trino | Dask | Polars |
|---|---|---|---|---|---|
| Primary Use | Batch + Stream | Stream + Batch | SQL queries | Python parallel | DataFrame |
| Scale | Petabytes | Petabytes | Petabytes | Terabytes | Terabytes |
| Streaming | Structured Streaming | Native (best) | No | No | No |
| ML | MLlib | FlinkML | No | Dask-ML | No |
| Language | Python, Scala, Java, R | Java, Scala, Python | SQL | Python | Python, Rust |
| Managed Service | Databricks, EMR | Managed Flink | Starburst | Coiled | N/A |
| Best For | General big data | Real-time streaming | Ad-hoc SQL | Python scaling | Single-machine |

## FAQ

**Q: When should I use Spark vs pandas?**
A: Use pandas for data that fits in memory (up to ~10GB). Use Spark when data exceeds single-machine memory or when you need distributed processing. Consider Polars as a middle ground.

**Q: Spark vs Flink for streaming?**
A: Flink has better streaming semantics (true event-time processing, lower latency). Spark Structured Streaming is good enough for most use cases and has a larger ecosystem. Use Flink for mission-critical, low-latency streaming.

**Q: What is Databricks?**
A: Databricks is the managed Spark platform created by the Spark founders. It provides an optimized Spark runtime, collaborative notebooks, Delta Lake, and MLflow, and is the most popular way to run Spark in production.

**Q: How do I learn Spark?**
A: Start with PySpark and DataFrames (similar to pandas). Learn Spark SQL for familiar SQL queries. Then explore Structured Streaming and MLlib as needed.

## Sources

- GitHub: https://github.com/apache/spark
- Documentation: https://spark.apache.org/docs
- Created at UC Berkeley AMPLab; Apache Top-Level Project
- License: Apache-2.0

---

Source: https://tokrepo.com/en/workflows/8cd9fbc0-3734-11f1-9bc6-00163e2b0d79
Author: AI Open Source