Introduction
Apache Spark is the de facto standard for big data processing. It largely displaced Hadoop MapReduce by keeping working data in memory, which can make jobs up to 100x faster for in-memory workloads. Spark provides a unified engine for batch processing (Spark SQL), streaming (Structured Streaming), machine learning (MLlib), and graph processing (GraphX).
With over 43,000 GitHub stars, Spark is used by virtually every data-driven organization. Netflix processes petabytes with Spark, Uber runs thousands of Spark jobs daily, and every major cloud offers a managed Spark service (Amazon EMR, Google Dataproc, Azure HDInsight), alongside Databricks, the platform founded by Spark's creators.
What Spark Does
Spark distributes data processing across a cluster of machines. You write transformations on DataFrames (similar to pandas), and Spark optimizes and distributes the execution. It handles data partitioning, task scheduling, fault recovery, and memory management automatically.
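The partitioning idea can be sketched in plain Python. This is a toy illustration, not Spark's actual implementation: rows are assigned to partitions by hashing a key column, which is roughly what Spark's HashPartitioner does when you repartition a DataFrame by a column.

```python
# Toy sketch of hash partitioning, the scheme Spark uses to distribute
# rows across a cluster. Illustration only -- real Spark partitions
# distributed data and schedules one task per partition.

def partition_rows(rows, key, num_partitions):
    """Assign each row (a dict) to a partition by hashing its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        idx = hash(row[key]) % num_partitions
        partitions[idx].append(row)
    return partitions

rows = [
    {"category": "books", "revenue": 10},
    {"category": "games", "revenue": 25},
    {"category": "books", "revenue": 5},
]
parts = partition_rows(rows, "category", 4)
# All rows with the same key land in the same partition, so a later
# groupBy on that key needs no further data movement (shuffle).
```

Because equal keys always hash to the same partition, aggregations keyed on the partitioning column can run locally on each partition.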
Architecture Overview
```
[Spark Application]
  Python, Scala, Java, R, SQL
           |
    [SparkSession]
  Entry point for all Spark functionality
           |
  [Catalyst Optimizer]
  Query planning and optimization
           |
  [Tungsten Engine]
  Code generation and memory management
           |
    +-----------+------------+
    |           |            |
[Spark SQL] [Structured   [MLlib]
 Batch       Streaming]    Machine learning
 queries     Real-time     at scale
             processing
           |
  [Cluster Manager]
  Standalone, YARN, Mesos, Kubernetes
           |
  [Data Sources]
  Parquet, ORC, CSV, JSON, JDBC,
  Delta Lake, Kafka, S3, HDFS
```
Self-Hosting & Configuration
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create Spark session
spark = SparkSession.builder \
    .appName("DataPipeline") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read data from various sources
df = spark.read.parquet("s3://bucket/data/")
# df = spark.read.csv("data.csv", header=True, inferSchema=True)
# df = spark.read.json("data.json")

# Transformations (lazy — not executed until an action)
result = (
    df
    .filter(F.col("status") == "active")
    .groupBy("category")
    .agg(
        F.count("*").alias("total"),
        F.sum("revenue").alias("total_revenue"),
        F.avg("revenue").alias("avg_revenue"),
    )
    .orderBy(F.desc("total_revenue"))
)

# Actions — trigger execution
result.show()
result.write.parquet("output/summary/")

# SQL interface
df.createOrReplaceTempView("sales")
spark.sql("""
    SELECT category, COUNT(*) AS cnt, SUM(revenue) AS total
    FROM sales
    WHERE status = 'active'
    GROUP BY category
    ORDER BY total DESC
""").show()
```
Key Features
- In-Memory Computing — up to 100x faster than Hadoop MapReduce for in-memory workloads
- Unified Engine — batch, streaming, ML, and graph processing in one framework
- Spark SQL — SQL queries on structured data with the Catalyst optimizer
- Structured Streaming — real-time stream processing with exactly-once guarantees
- MLlib — distributed machine learning library
- Multi-Language — Python, Scala, Java, R, and SQL APIs
- Delta Lake integration — ACID transactions on data lakes (a separate open-source project)
- Cloud Native — runs on Kubernetes, YARN, or standalone clusters
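The lazy-execution model that makes Catalyst optimization possible can be sketched in plain Python. This is a toy model, nothing like Spark's real query planner: transformations only append to a plan, and no data is touched until an action runs the whole pipeline.

```python
# Toy model of Spark-style lazy evaluation. filter() and map() only
# record steps in a plan; collect() (the "action") executes the plan.
# Illustration only -- Spark builds a logical query plan that Catalyst
# rewrites and optimizes before execution.

class LazyFrame:
    def __init__(self, rows, plan=None):
        self._rows = rows
        self._plan = plan if plan is not None else []

    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + [("filter", predicate)])

    def map(self, fn):
        return LazyFrame(self._rows, self._plan + [("map", fn)])

    def collect(self):
        # The action: run every recorded step over the data, in order.
        rows = self._rows
        for op, fn in self._plan:
            if op == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows

data = [1, 2, 3, 4, 5]
pipeline = LazyFrame(data).filter(lambda x: x % 2 == 1).map(lambda x: x * 10)
# No work has happened yet -- only a two-step plan exists.
result = pipeline.collect()  # action: runs filter, then map
# result == [10, 30, 50]
```

Because the full plan is known before anything runs, an optimizer can rewrite it (reorder filters, prune columns, fuse steps), which is exactly the leverage Catalyst has over eagerly evaluated libraries like pandas.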
Comparison with Similar Tools
| Feature | Spark | Flink | Presto/Trino | Dask | Polars |
|---|---|---|---|---|---|
| Primary Use | Batch + Stream | Stream + Batch | SQL queries | Python parallel | DataFrame |
| Scale | Petabytes | Petabytes | Petabytes | Terabytes | Terabytes |
| Streaming | Structured Streaming | Native (best) | No | No | No |
| ML | MLlib | FlinkML | No | Dask-ML | No |
| Language | Python, Scala, Java, R | Java, Scala, Python | SQL | Python | Python, Rust |
| Managed Service | Databricks, EMR | Managed Flink | Starburst | Coiled | N/A |
| Best For | General big data | Real-time streaming | Ad-hoc SQL | Python scaling | Single-machine |
FAQ
Q: When should I use Spark vs pandas? A: Use pandas for data that fits in memory (up to ~10GB). Use Spark when data exceeds single-machine memory or when you need distributed processing. Consider Polars as a middle ground.
Q: Spark vs Flink for streaming? A: Flink has better streaming semantics (true event-time processing, lower latency). Spark Structured Streaming is good enough for most use cases and has a larger ecosystem. Use Flink for mission-critical, low-latency streaming.
Q: What is Databricks? A: Databricks is the managed Spark platform created by the Spark founders. It provides an optimized Spark runtime, collaborative notebooks, Delta Lake, and MLflow, and is one of the most popular ways to run Spark in production.
Q: How do I learn Spark? A: Start with PySpark and DataFrames (similar to pandas). Learn Spark SQL for familiar SQL queries. Then explore Structured Streaming and MLlib as needed.
Sources
- GitHub: https://github.com/apache/spark
- Documentation: https://spark.apache.org/docs
- Created at UC Berkeley AMPLab, Apache Top-Level Project
- License: Apache-2.0