
Apache Spark Architecture: A Deep Dive into Big Data Processing

Prem Vishnoi (cloudvala) · Published in Towards Dev · 6 min read · Feb 6, 2025

Agenda

  1. Core Architecture
  2. Key Components
  3. Execution Model
  4. Best Practices
  5. Real-world Applications

What is Spark?

Apache Spark is a powerful framework for big data processing.

It helps process massive datasets by splitting the work across many computers (a cluster) and coordinating tasks to get results efficiently.
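To make this concrete, here is a minimal PySpark sketch (the input path data.txt is just a placeholder): Spark reads the file, splits it into partitions, and counts words in parallel across whatever cores or machines the session has available.

```python
from pyspark.sql import SparkSession

# Start a local SparkSession; on a real cluster the master URL would
# point at the cluster manager instead of local[*].
spark = (
    SparkSession.builder
    .appName("WordCount")
    .master("local[*]")
    .getOrCreate()
)

# Spark splits the input into partitions and processes them in parallel.
lines = spark.read.text("data.txt")                        # placeholder input path
words = lines.selectExpr("explode(split(value, ' ')) AS word")
counts = words.groupBy("word").count()
counts.show()

spark.stop()
```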

Spark’s Basic Architecture

Think of a single laptop or desktop computer: it is great for everyday tasks, but it struggles with huge amounts of data.

A cluster solves this problem by using multiple machines (or nodes) to share the load.

However, having a group of machines is not enough. We need a framework like Apache Spark to manage and assign tasks to these machines so they can work together seamlessly.

Key Components of Spark’s Architecture

1. Cluster Manager

A tool (like YARN, Mesos, or Spark’s standalone manager) that tracks and manages resources (e.g., CPUs and memory) across the cluster.
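The cluster manager is chosen through the master URL when the application starts. Below is a hedged sketch using SparkSession; the host names and resource settings are made-up examples, and the URL formats shown in the comments are Spark's standard master URLs.

```python
from pyspark.sql import SparkSession

# The master URL tells Spark which cluster manager to use:
#   local[*]            -> run locally on all cores (no cluster manager)
#   spark://host:7077   -> Spark's standalone cluster manager
#   yarn                -> Hadoop YARN (reads config from HADOOP_CONF_DIR)
#   mesos://host:5050   -> Apache Mesos
spark = (
    SparkSession.builder
    .appName("ClusterManagerDemo")
    .master("spark://master-host:7077")       # hypothetical standalone master host
    .config("spark.executor.memory", "4g")    # resources the manager will allocate
    .config("spark.executor.cores", "2")
    .getOrCreate()
)
```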

2. Spark Application

This is our program that Spark will run. It consists of:

Driver Process: The brain of the application that:

Runs our main program (a minimal driver-side sketch follows below).
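Here is a minimal sketch of what driver-side code looks like: everything in main() runs in the driver process, which creates the SparkSession, defines the computation, and receives the result, while the executors do the distributed work. The app name and numbers are illustrative.

```python
from pyspark.sql import SparkSession

def main():
    # Everything in this function executes inside the driver process.
    spark = SparkSession.builder.appName("DriverExample").getOrCreate()

    df = spark.range(1_000_000)                   # driver only defines the plan here
    total = df.selectExpr("sum(id)").first()[0]   # executors compute; driver receives the result
    print(f"Sum of ids: {total}")

    spark.stop()

if __name__ == "__main__":
    main()
```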



Written by Prem Vishnoi (cloudvala)

Head of Data and ML, experienced in designing, implementing, and managing large-scale data infrastructure. Skilled in ETL, data modeling, and cloud computing.
