# Apache Spark — Unified Analytics Engine for Big Data

> Apache Spark is the most widely used engine for large-scale data processing. It provides in-memory computing for batch processing, SQL queries, machine learning, graph processing, and streaming — all through a unified API in Python, Scala, Java, and R.

## Quick Use

```bash
# Install PySpark
pip install pyspark

# Quick demo
python3 -c "
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('demo').getOrCreate()
df = spark.createDataFrame([
    ('Alice', 100), ('Bob', 200), ('Alice', 150)
], ['name', 'amount'])
df.groupBy('name').sum('amount').show()
spark.stop()
"

# Or download Spark standalone
# wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
# ./bin/pyspark  # interactive shell
```

## Introduction

Apache Spark is the de facto standard for big data processing. It superseded Hadoop MapReduce by providing in-memory computation that is 10-100x faster. Spark provides a unified engine for batch processing (Spark SQL), streaming (Structured Streaming), machine learning (MLlib), and graph processing (GraphX).

With over 43,000 GitHub stars, Spark is used by virtually every data-driven organization. Netflix processes petabytes with Spark, Uber runs thousands of Spark jobs daily, and all major cloud providers offer managed Spark services (Databricks, EMR, Dataproc, HDInsight).

## What Spark Does

Spark distributes data processing across a cluster of machines. You write transformations on DataFrames (similar to pandas), and Spark optimizes and distributes the execution. It handles data partitioning, task scheduling, fault recovery, and memory management automatically.
## Architecture Overview

```
[Spark Application]    Python, Scala, Java, R, SQL
         |
  [SparkSession]       Entry point for all Spark functionality
         |
[Catalyst Optimizer]   Query planning and optimization
         |
 [Tungsten Engine]     Code generation and memory management
         |
   +-------------+----------------------+
   |             |                      |
[Spark SQL]   [Structured Streaming]  [MLlib]
Batch queries  Real-time processing   Machine learning at scale
         |
[Cluster Manager]      Standalone, YARN, Mesos, Kubernetes
         |
 [Data Sources]        Parquet, ORC, CSV, JSON, JDBC, Delta Lake, Kafka, S3, HDFS
```

## Self-Hosting & Configuration

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create Spark session
spark = SparkSession.builder \
    .appName("DataPipeline") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read data from various sources
df = spark.read.parquet("s3://bucket/data/")
# df = spark.read.csv("data.csv", header=True, inferSchema=True)
# df = spark.read.json("data.json")

# Transformations (lazy — not executed until an action)
result = (
    df
    .filter(F.col("status") == "active")
    .groupBy("category")
    .agg(
        F.count("*").alias("total"),
        F.sum("revenue").alias("total_revenue"),
        F.avg("revenue").alias("avg_revenue")
    )
    .orderBy(F.desc("total_revenue"))
)

# Action — triggers execution
result.show()
result.write.parquet("output/summary/")

# SQL interface
df.createOrReplaceTempView("sales")
spark.sql("""
    SELECT category, COUNT(*) AS cnt, SUM(revenue) AS total
    FROM sales
    WHERE status = 'active'
    GROUP BY category
    ORDER BY total DESC
""").show()
```

## Key Features

- **In-Memory Computing** — 10-100x faster than Hadoop MapReduce
- **Unified Engine** — batch, streaming, ML, and graph in one framework
- **Spark SQL** — SQL queries on structured data with Catalyst optimizer
- **Structured Streaming** — real-time stream processing with exactly-once guarantees
- **MLlib** — distributed machine learning library
- **Multi-Language** — Python, Scala, Java, R, and SQL APIs
- **Delta Lake** — ACID transactions on data lakes
- **Cloud Native** — runs on Kubernetes, YARN, or standalone

## Comparison with Similar Tools

| Feature | Spark | Flink | Presto/Trino | Dask | Polars |
|---|---|---|---|---|---|
| Primary Use | Batch + Stream | Stream + Batch | SQL queries | Python parallel | DataFrame |
| Scale | Petabytes | Petabytes | Petabytes | Terabytes | Terabytes |
| Streaming | Structured Streaming | Native (best) | No | No | No |
| ML | MLlib | FlinkML | No | Dask-ML | No |
| Language | Python, Scala, Java, R | Java, Scala, Python | SQL | Python | Python, Rust |
| Managed Service | Databricks, EMR | Managed Flink | Starburst | Coiled | N/A |
| Best For | General big data | Real-time streaming | Ad-hoc SQL | Python scaling | Single-machine |

## FAQ

**Q: When should I use Spark vs pandas?**
A: Use pandas for data that fits in memory (up to ~10GB). Use Spark when data exceeds single-machine memory or when you need distributed processing. Consider Polars as a middle ground.

**Q: Spark vs Flink for streaming?**
A: Flink has better streaming semantics (true event-time processing, lower latency). Spark Structured Streaming is good enough for most use cases and has a larger ecosystem. Use Flink for mission-critical, low-latency streaming.

**Q: What is Databricks?**
A: Databricks is the managed Spark platform created by the Spark founders. It provides an optimized Spark runtime, collaborative notebooks, Delta Lake, and MLflow, and is the most popular way to run Spark in production.

**Q: How do I learn Spark?**
A: Start with PySpark and DataFrames (similar to pandas). Learn Spark SQL for familiar SQL queries. Then explore Structured Streaming and MLlib as needed.

## Sources

- GitHub: https://github.com/apache/spark
- Documentation: https://spark.apache.org/docs
- Created at UC Berkeley AMPLab; Apache Top-Level Project
- License: Apache-2.0

---

Source: https://tokrepo.com/en/workflows/8cd9fbc0-3734-11f1-9bc6-00163e2b0d79
Author: AI Open Source