
Apache Spark Architecture: A Deep Dive into Big Data Processing

Prem Vishnoi (cloudvala) · Published in Towards Dev · 6 min read · Feb 6, 2025

Agenda

  1. Core Architecture
  2. Key Components
  3. Execution Model
  4. Best Practices
  5. Real-world Applications

What is Spark?

Apache Spark is a powerful framework for big data processing.

It helps process massive datasets by splitting the work across many computers (a cluster) and coordinating tasks to get results efficiently.
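To make this concrete, here is a minimal PySpark sketch (the input path data.txt is just a placeholder): Spark reads the file, splits it into partitions, and counts words in parallel across whatever cores or machines the session has available.

```python
from pyspark.sql import SparkSession

# Start a local SparkSession; on a real cluster the master URL would
# point at the cluster manager instead of local[*].
spark = (
    SparkSession.builder
    .appName("WordCount")
    .master("local[*]")
    .getOrCreate()
)

# Spark splits the input into partitions and processes them in parallel.
lines = spark.read.text("data.txt")                        # placeholder input path
words = lines.selectExpr("explode(split(value, ' ')) AS word")
counts = words.groupBy("word").count()
counts.show()

spark.stop()
```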

Spark’s Basic Architecture

Think of a single laptop or desktop computer: it is great for everyday tasks, but it struggles with huge amounts of data.

A cluster solves this problem by using multiple machines (or nodes) to share the load.

However, having a group of machines is not enough. We need a framework like Apache Spark to manage and assign tasks to these machines so they can work together seamlessly.

Key Components of Spark’s Architecture

1. Cluster Manager

A tool (like YARN, Mesos, or Spark’s standalone manager) that tracks and manages resources (e.g., CPUs and memory) across the cluster.
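The cluster manager is chosen through the master URL when the application starts. Below is a hedged sketch using SparkSession; the host names and resource settings are made-up examples, and the URL formats shown in the comments are Spark's standard master URLs.

```python
from pyspark.sql import SparkSession

# The master URL tells Spark which cluster manager to use:
#   local[*]            -> run locally on all cores (no cluster manager)
#   spark://host:7077   -> Spark's standalone cluster manager
#   yarn                -> Hadoop YARN (reads config from HADOOP_CONF_DIR)
#   mesos://host:5050   -> Apache Mesos
spark = (
    SparkSession.builder
    .appName("ClusterManagerDemo")
    .master("spark://master-host:7077")       # hypothetical standalone master host
    .config("spark.executor.memory", "4g")    # resources the manager will allocate
    .config("spark.executor.cores", "2")
    .getOrCreate()
)
```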

2. Spark Application

This is our program that Spark will run. It consists of:

Driver Process: The brain of the application that:

Runs our main program (a minimal driver-side sketch follows below).
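Here is a minimal sketch of what driver-side code looks like: everything in main() runs in the driver process, which creates the SparkSession, defines the computation, and receives the result, while the executors do the distributed work. The app name and numbers are illustrative.

```python
from pyspark.sql import SparkSession

def main():
    # Everything in this function executes inside the driver process.
    spark = SparkSession.builder.appName("DriverExample").getOrCreate()

    df = spark.range(1_000_000)                   # driver only defines the plan here
    total = df.selectExpr("sum(id)").first()[0]   # executors compute; driver receives the result
    print(f"Sum of ids: {total}")

    spark.stop()

if __name__ == "__main__":
    main()
```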



Written by Prem Vishnoi (cloudvala)

Head of Data and ML, experienced in designing, implementing, and managing large-scale data infrastructure. Skilled in ETL, data modeling, and cloud computing.
