What is EMR

Prem Vishnoi(cloudvala)
Aug 19, 2023

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Additionally, you can use Amazon EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon S3 (Amazon Simple Storage Service) and Amazon DynamoDB.

If you are a first-time user of Amazon EMR, we recommend that you begin by reading the following:

In This Section

  • Overview of Amazon EMR
  • Benefits of Using Amazon EMR
  • Overview of Amazon EMR Architecture

Understanding Clusters and Nodes

The main component of Amazon EMR is the cluster. A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances.

Each instance in the cluster is called a node. Each node has a role within the cluster, referred to as the node type. Amazon EMR also installs different software components on each node type, giving each node a role in a distributed application like Apache Hadoop.

The node types in Amazon EMR are as follows:

  • Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes — collectively referred to as slave nodes — for processing. The master node tracks the status of tasks and monitors the health of the cluster.
  • Core node: A slave node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster.
  • Task node: A slave node with software components that only run tasks. Task nodes are optional.

The following diagram represents a cluster with one master node and four slave nodes.

Submitting Work to a Cluster

When you run your cluster on Amazon EMR, you have several options as to how you specify the work that needs to be done.

  • Create a long-running cluster and use the Amazon EMR console, the Amazon EMR API, or the AWS CLI to submit steps, which may contain one or more Hadoop jobs (see the sketch after this list). For more information, see Submit Work to a Cluster.
  • Provide the entire definition of the work to be done on the Map and Reduce functions. This is typically done for clusters that process a set amount of data and then terminate when processing is complete. For more information, see Apache Hadoop in the Amazon EMR Release Guide.
  • Create a cluster with a Hadoop application, such as Hive or Pig, installed and use the interface provided by the application to submit queries, either scripted or interactively. For more information, see the Amazon EMR Release Guide.
  • Create a long-running cluster, connect to it, and submit Hadoop jobs using the Hadoop API. For more information, go to Class JobClient in the Apache Hadoop API documentation.
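
As a concrete illustration of the first option above, the following sketch uses boto3 (the AWS SDK for Python) to submit a Spark step to a cluster that is already running. The cluster ID, bucket name, and script path are placeholders; adjust the region and IAM permissions for your own account.

```python
import boto3

# Submit a Spark step to an existing EMR cluster (placeholder IDs and paths).
emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "Example Spark step",
            "HadoopJarStep": {
                # command-runner.jar lets the step invoke spark-submit on the cluster
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-example-bucket/scripts/process_data.py",  # hypothetical script
                ],
            },
        }
    ],
)

print(response["StepIds"])  # IDs of the steps that were just submitted
```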

Processing Data

When you launch your cluster, you choose the frameworks and applications to install for your data processing needs. There are two ways to process data in your Amazon EMR cluster: by submitting jobs or queries directly to the applications that are installed on your cluster or by running steps in the cluster.

Submitting Jobs Directly to Applications

You can submit jobs and interact directly with the software that is installed in your Amazon EMR cluster. To do this, you typically connect to the master node over a secure connection and access the interfaces and tools that are available for the software that runs directly on your cluster. For more information, see Connect to the Cluster.
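
For example, once connected to the master node you might run a small PySpark job with spark-submit. The following is a minimal sketch, not a production job; the S3 input path is a placeholder.

```python
# example_job.py: a minimal PySpark job you might run on the master node with
#   spark-submit example_job.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example-direct-job").getOrCreate()

# Read raw text from S3 (placeholder path) and count lines containing "ERROR".
lines = spark.read.text("s3://my-example-bucket/logs/")
error_count = lines.filter(lines.value.contains("ERROR")).count()
print(f"ERROR lines: {error_count}")

spark.stop()
```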

Running Steps to Process Data

You can submit one or more ordered steps to an Amazon EMR cluster. Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster.

The following is an example process using four steps:

  1. Submit an input dataset for processing.
  2. Process the output of the first step by using a Pig program.
  3. Process a second input dataset by using a Hive program.
  4. Write an output dataset.

Generally, when you process data in Amazon EMR, the input is data stored as files in your chosen underlying file system, such as Amazon S3 or HDFS. This data passes from one step to the next in the processing sequence. The final step writes the output data to a specified location, such as an Amazon S3 bucket.

Steps are run in the following sequence:

  1. A request is submitted to begin processing steps.
  2. The state of all steps is set to PENDING.
  3. When the first step in the sequence starts, its state changes to RUNNING. The other steps remain in the PENDING state.
  4. After the first step completes, its state changes to COMPLETED.
  5. The next step in the sequence starts, and its state changes to RUNNING. When it completes, its state changes to COMPLETED.
  6. This pattern repeats for each step until they all complete and processing ends.
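
You can watch these state transitions yourself by polling a step with boto3. A minimal sketch, assuming placeholder cluster and step IDs:

```python
import time
import boto3

emr = boto3.client("emr", region_name="us-east-1")

cluster_id = "j-XXXXXXXXXXXXX"  # hypothetical cluster ID
step_id = "s-XXXXXXXXXXXXX"     # hypothetical step ID returned by add_job_flow_steps

# Poll until the step leaves the PENDING and RUNNING states.
while True:
    status = emr.describe_step(ClusterId=cluster_id, StepId=step_id)["Step"]["Status"]
    state = status["State"]
    print(state)  # moves from PENDING through RUNNING to a terminal state
    if state not in ("PENDING", "RUNNING"):
        break
    time.sleep(30)
```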

The following diagram represents the step sequence and change of state for the steps as they are processed.

If a step fails during processing, its state changes to TERMINATED_WITH_ERRORS. By default, any remaining steps in the sequence are set to CANCELLED and do not run, although you can choose to ignore processing failures and allow remaining steps to proceed.
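
The behavior on failure is controlled per step through the ActionOnFailure setting in the step definition. For example, a step defined with boto3 could allow the remaining steps to continue; the step name and script path below are placeholders.

```python
# A step definition that lets the remaining steps continue if this one fails.
# Other ActionOnFailure values include CANCEL_AND_WAIT and TERMINATE_CLUSTER.
step = {
    "Name": "Tolerant step",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-example-bucket/scripts/optional_job.py"],
    },
}
```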

The following diagram represents the step sequence and default change of state when a step fails during processing.

Understanding the Cluster Lifecycle

A successful Amazon EMR cluster follows this process:

  1. Amazon EMR first provisions a cluster with your chosen applications, such as Hadoop or Spark. During this phase, the cluster state is STARTING.
  2. Next, any user-defined actions (called bootstrap actions), such as installing additional applications, run on the cluster. During this phase, the cluster state is BOOTSTRAPPING.
  3. After all bootstrap actions are successfully completed, the cluster state is RUNNING. The cluster sequentially runs all steps during this phase.
  4. After steps run successfully, the cluster either goes into a WAITING state or a SHUTTING_DOWN state, described as follows.
  • If you configured your cluster as a long-running cluster by disabling auto-terminate or by setting the KeepJobFlowAliveWhenNoSteps parameter to true in the API, the cluster will go into a WAITING state after processing all submitted steps and wait for the next set of instructions. If you have more data to process, you can add more steps. You must manually terminate a long-running cluster when you no longer need it. After you manually terminate the cluster, it goes into the SHUTTING_DOWN state and then into the TERMINATED state.
  • If you configured your cluster as a transient cluster by enabling auto-terminate or by setting the KeepJobFlowAliveWhenNoSteps parameter to false in the API, it automatically goes into the SHUTTING_DOWN state after all of the steps complete. The instances are terminated, and all data stored in the cluster itself is deleted. Information stored in other locations, such as in your Amazon S3 bucket, persists. When the shutdown process completes, the cluster state is set to COMPLETED.

A failure during the cluster lifecycle terminates the cluster and all of its instances unless: a) you enable termination protection; or b) you supply an ActionOnFailure in the StepConfig for the step. Any data stored on the cluster is deleted. The cluster state is set to TERMINATED_WITH_ERRORS. If you have enabled termination protection so that you can retrieve data from your cluster in the event of a failure, the cluster is not terminated. Once you are done with the cluster, you can remove termination protection and terminate the cluster. For more information, see Managing Cluster Termination.
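
Whether a cluster is long-running or transient, and whether it is protected from termination, is configured when you create it. The following boto3 sketch shows where these settings live; the release label, instance types, and role names are assumptions that you would replace with your own values.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-cluster",
    ReleaseLabel="emr-6.15.0",          # assumed release label; pick a current one
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # True  -> long-running cluster that WAITs for more steps
        # False -> transient cluster that shuts down after its steps finish
        "KeepJobFlowAliveWhenNoSteps": True,
        # Protects the cluster from termination by step failures or API calls
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # default EC2 instance profile name
    ServiceRole="EMR_DefaultRole",       # default service role name
)

print(response["JobFlowId"])  # the new cluster's ID
```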

The following diagram represents the lifecycle of a cluster, and how each stage of the lifecycle maps to a particular cluster state.

Overview of Amazon EMR Architecture

Amazon EMR service architecture consists of several layers, each of which provides certain capabilities and functionality to the cluster. This section provides an overview of the layers and the components of each.

In This Topic

  • Storage
  • Cluster Resource Management
  • Data Processing Frameworks
  • Applications and Programs

Storage

The storage layer includes the different file systems that are used with your cluster. There are several different types of storage options as follows.

Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) is a distributed, scalable file system for Hadoop. HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster. HDFS is useful for caching intermediate results during MapReduce processing or for workloads that have significant random I/O.

For more information, go to HDFS Users Guide on the Apache Hadoop website.

EMR File System (EMRFS)

Using the EMR File System (EMRFS), Amazon EMR extends Hadoop to add the ability to directly access data stored in Amazon S3 as if it were a file system like HDFS. You can use either HDFS or Amazon S3 as the file system in your cluster. Most often, Amazon S3 is used to store input and output data, and intermediate results are stored in HDFS.
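
For example, a PySpark job on an EMR cluster can read its input from Amazon S3 through EMRFS, stage intermediate results in HDFS, and write the final output back to S3. The following sketch uses placeholder paths and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-and-hdfs-example").getOrCreate()

# Input and final output live in S3 (EMRFS); intermediate data lives in HDFS.
events = spark.read.json("s3://my-example-bucket/raw/events/")  # placeholder path
cleaned = events.dropna(subset=["user_id"]).dropDuplicates(["event_id"])

# Stage intermediate results in HDFS; this data disappears with the cluster.
cleaned.write.mode("overwrite").parquet("hdfs:///tmp/cleaned_events/")

# Re-read the intermediate data and write the final output back to S3.
cleaned = spark.read.parquet("hdfs:///tmp/cleaned_events/")
cleaned.groupBy("event_type").count().write.mode("overwrite") \
    .parquet("s3://my-example-bucket/curated/event_counts/")

spark.stop()
```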

Local File System

The local file system refers to a locally connected disk. When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with a preconfigured block of pre-attached disk storage called an instance store. Data on instance store volumes persists only during the lifecycle of its Amazon EC2 instance.

Cluster Resource Management

The resource management layer is responsible for managing cluster resources and scheduling the jobs for processing data.

By default, Amazon EMR uses YARN (Yet Another Resource Negotiator), which is a component introduced in Apache Hadoop 2.0 to centrally manage cluster resources for multiple data-processing frameworks. However, there are other frameworks and applications that are offered in Amazon EMR that do not use YARN as a resource manager. Amazon EMR also has an agent on each node that administers YARN components, keeps the cluster healthy, and communicates with the Amazon EMR services.

Data Processing Frameworks

The data processing framework layer is the engine used to process and analyze data. There are many frameworks available that run on YARN or have their own resource management. Different frameworks are available for different kinds of processing needs, such as batch, interactive, in-memory, and streaming. The framework that you choose depends on your use case. This impacts the languages and interfaces available from the application layer, which is the layer used to interact with the data you want to process. The main processing frameworks available for Amazon EMR are Hadoop MapReduce and Spark.

Hadoop MapReduce

Hadoop MapReduce is an open-source programming model for distributed computing. It simplifies the process of writing parallel distributed applications by handling all of the logic, while you provide the Map and Reduce functions. The Map function maps data to sets of key-value pairs called intermediate results. The Reduce function combines the intermediate results, applies additional algorithms, and produces the final output. There are multiple frameworks available for MapReduce, such as Hive, which automatically generates Map and Reduce programs.
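
As a toy illustration of the model (plain Python, not EMR-specific code), the following word count separates the work into a Map function that emits intermediate key-value pairs and a Reduce function that combines them, with the shuffle simulated in memory:

```python
# A minimal, local illustration of the MapReduce model: a word count where the
# "shuffle" is simulated in memory. On a cluster, Hadoop distributes these phases.
from collections import defaultdict

def map_phase(lines):
    """Map: turn each line into (word, 1) key-value pairs (intermediate results)."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: combine the intermediate results into a final count per word."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

if __name__ == "__main__":
    data = ["EMR runs Hadoop", "Hadoop runs MapReduce", "MapReduce maps and reduces"]
    print(reduce_phase(map_phase(data)))
    # {'emr': 1, 'runs': 2, 'hadoop': 2, 'mapreduce': 2, 'maps': 1, 'and': 1, 'reduces': 1}
```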

For more information, go to How Map and Reduce operations are actually carried out on the Apache Hadoop Wiki website.

Apache Spark

Apache Spark is an open-source, distributed computing system that provides a fast and general-purpose framework for processing and analyzing large datasets. It was developed to address the limitations of the traditional MapReduce model, offering improved speed, ease of use, and support for a wide range of data processing tasks. Spark is designed to work efficiently on clusters of computers and can process data in memory, leading to significantly faster performance compared to disk-based systems.

Key features of Apache Spark include:

In-Memory Processing: One of Spark’s defining features is its ability to store data in memory, reducing the need for repeated disk access. This makes Spark particularly well-suited for iterative algorithms and interactive data analysis.

Distributed Computing: Spark distributes data and computations across a cluster of machines, enabling parallel processing. This distributed nature allows for the processing of large datasets that would be impractical to handle on a single machine.

Ease of Use: Spark provides high-level APIs in multiple programming languages (such as Scala, Java, Python, and R), making it accessible to a wide range of developers. Its APIs cover a variety of data processing tasks, from batch processing to real-time streaming and machine learning.

Versatility: Spark is not limited to batch processing; it supports various processing modes like batch, interactive, iterative, and streaming. This flexibility allows developers to use a single framework for different types of data processing tasks.

Rich Ecosystem: Spark offers a comprehensive ecosystem of libraries and tools that extend its capabilities. These include Spark SQL for querying structured data using SQL, Spark Streaming for real-time data processing, MLlib for machine learning, and GraphX for graph processing.
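
For instance, with Spark SQL you can register a DataFrame as a temporary view and query it with SQL. A minimal PySpark sketch using made-up data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# A small in-memory DataFrame standing in for real input data.
sales = spark.createDataFrame(
    [("2023-08-01", "books", 120.0), ("2023-08-01", "games", 80.0),
     ("2023-08-02", "books", 95.5)],
    ["order_date", "category", "amount"],
)

# Query structured data with SQL via Spark SQL.
sales.createOrReplaceTempView("sales")
spark.sql("""
    SELECT category, SUM(amount) AS total_amount
    FROM sales
    GROUP BY category
    ORDER BY total_amount DESC
""").show()

spark.stop()
```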

Fault Tolerance: Spark provides fault tolerance through lineage information, which allows it to reconstruct lost data from the original source data and transformations. This ensures data reliability in the face of machine failures.

Integration with Big Data Sources: Spark can read data from various data sources, including Hadoop Distributed File System (HDFS), Apache HBase, Apache Cassandra, and more.

Community and Active Development: Apache Spark has a vibrant open-source community that contributes to its development and supports a wide range of use cases.

Apache Spark is commonly used for tasks such as data cleaning and transformation, real-time analytics, machine learning, graph processing, and more. Its ability to handle diverse workloads within a unified framework has made it a popular choice for organizations dealing with large-scale data processing needs.

Overall, Apache Spark’s power, speed, and flexibility have positioned it as a cornerstone technology in the field of big data analytics and distributed computing.

For more information, see Apache Spark on Amazon EMR Clusters in the Amazon EMR Release Guide.

Applications and Programs

Amazon EMR supports many applications, such as Hive, Pig, and the Spark Streaming library to provide capabilities such as using higher-level languages to create processing workloads, leveraging machine learning algorithms, making stream processing applications, and building data warehouses. In addition, Amazon EMR also supports open-source projects that have their own cluster management functionality instead of using YARN.

You use various libraries and languages to interact with the applications that you run on Amazon EMR. For example, you can use Java, Hive, or Pig with MapReduce or Spark Streaming, Spark SQL, MLlib, and GraphX with Spark.

Source: What Is Amazon EMR? — Amazon EMR

Written by Prem Vishnoi(cloudvala)

Head of Data and ML experienced in designing, implementing, and managing large-scale data infrastructure. Skilled in ETL, data modeling, and cloud computing
