⚡ Apache Spark Deep Dive
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning workloads on single-node machines or clusters. It has become the de facto standard for large-scale data processing.
🟢 Level 1: Foundations (The Core Engine)
1. The Spark Architecture
Spark uses a Master-Worker architecture:
- Driver: The “brain” that coordinates the execution.
- Executors: The “workers” that run the actual tasks and store data in RAM.
- Cluster Manager: Allocates resources (YARN, Kubernetes, or Standalone).
2. RDDs vs. DataFrames
- RDD (Resilient Distributed Dataset): The low-level API. Immutable collection of objects distributed across nodes.
- DataFrame: The high-level API. Distributed collection of data organized into named columns (like a table). Prefer DataFrames: their queries go through the Catalyst optimizer and Tungsten execution engine, which raw RDD code cannot benefit from.
🟡 Level 2: Execution Mechanics
3. Lazy Evaluation
Spark does not run transformations immediately. It builds a Logical Plan (DAG). The code only executes when an Action (like .collect(), .save(), or .show()) is called.
4. Transformations vs. Actions
- Narrow Transformations: No data movement across nodes (e.g., filter, map).
- Wide Transformations (Shuffles): Data must move across the network (e.g., groupBy, join, orderBy). Shuffles are expensive!
🔴 Level 3: Advanced Optimization
5. Adaptive Query Execution (AQE)
Spark 3.x feature that re-optimizes the query plan during runtime based on actual data statistics.
6. Caching & Persistence
Store frequently used DataFrames in memory (or on disk) to avoid re-computing the entire DAG on every action. Note that caching is itself lazy: the data is materialized the first time an action runs on the cached DataFrame.
df.cache() # Store in memory
df.persist() # Configurable storage level (Memory, Disk, or both)

7. Partitioning & Bucketing
- Partitioning: Splitting data into folders on disk based on a column (e.g., date).
- Bucketing: Sorting and distributing data into a fixed number of "buckets" to speed up joins.