This article is the first part of a series discussing AWS Elastic MapReduce (EMR).

EMR is the service that AWS provides to process large amounts of data.

Some of the use cases are:

  • Log processing
  • Large scientific data sets processing
  • Click stream data analysis

Virtual servers that are part of a cluster do the processing, and the work is distributed among them. A cluster is a set of servers that work together to analyze, store, or move and transform data. The cluster is managed by Hadoop, a framework that allows distributed processing of large data sets using simple programming models.

Hadoop uses MapReduce to allow applications to be written so that vast amounts of data are processed in parallel on large clusters. The results of the computation performed by the servers are returned as a single result. The cluster is composed of two types of nodes: the master node that controls the distribution of the tasks and the slave nodes that process the data.

The Hadoop clusters running in AWS use EC2 instances as the master and slave nodes of the cluster. S3 buckets store the input data and the output results. CloudWatch monitors cluster performance and, if needed, triggers alarms. All of this is managed by the Amazon EMR control software, which launches and manages the Hadoop cluster.

Let’s discuss the EMR features:

  • Pay as you go
  • Ability to use Spot instances
  • Ability to choose the instance type
  • Resizable clusters
  • Ability to store the data on S3 or HDFS
  • Parallel clusters
  • MapR support

Let’s discuss how EMR works.

As mentioned, EMR allows the user to run Hadoop clusters on AWS. Hadoop is a software framework that allows massive data processing with the help of server clusters. A cluster can be anything from a single server to tens of thousands of servers. The work is distributed among the servers using the MapReduce programming model, which simplifies the process of writing parallel, distributed applications. There are two functions: the Map function splits the data into sets of key/value pairs, called intermediate results; the Reduce function merges the intermediate results and produces the final result (the output).
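To make the Map and Reduce roles concrete, here is a minimal, single-process sketch of the model in Python. This is not Hadoop itself; the function names and the sample input are illustrative only. The in-memory sort between the two phases stands in for Hadoop's shuffle/sort stage.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: emit an intermediate (key, value) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values for one key into a final result.
    return (word, sum(counts))

def run_job(lines):
    # Shuffle/sort: group the intermediate pairs by key, as Hadoop does
    # between the map and reduce phases (here done in memory).
    intermediate = sorted(pair for line in lines for pair in map_fn(line))
    return dict(
        reduce_fn(word, (count for _, count in group))
        for word, group in groupby(intermediate, key=itemgetter(0))
    )

result = run_job(["the quick brown fox", "the lazy dog"])
# result["the"] == 2, result["fox"] == 1
```

In a real Hadoop cluster, the map and reduce functions run on many slave nodes in parallel, and the framework handles the shuffle, sort, and fault tolerance that this sketch performs trivially in one process.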

Hadoop Distributed File System (HDFS) is the file system for Hadoop and, as the name says, it is distributed and very scalable. HDFS replicates the same data across multiple servers in the cluster, so the data remains available even if one server is lost. Amazon EMR adds the ability to reference data stored in S3 buckets as if it were stored on HDFS.

In Hadoop, a unit of work is called a job. A job may be split into one or more tasks, and each task may be attempted multiple times if it does not succeed on the first try.

There are multiple applications that can be used with Hadoop to extend its functionality and features. Some of these applications are:

  • Hive – An open source data warehouse and analytics package
  • Pig – An open source analytics package
  • HBase – Allows storing large quantities of sparse data using column-based storage

In Amazon EMR, a server in a cluster can act in one of three roles. Each role is referred to as a node type, and the node types map to the Hadoop master and slave roles.

These are the node types:

  • Master node – acts as a coordinator, distributing the raw data to the core and task instance groups. It also monitors the status of each task and the health of the instance groups. There can be only one master node in a cluster; it is the Hadoop master node.
  • Core node – runs the tasks and stores the data in HDFS. This is a slave node.
  • Task node – this node is optional and only runs tasks. This is a slave node.

EMR introduces a new unit of work called a step. A step is nothing more than an instruction that handles data, and each step can contain one or more jobs. A cluster contains one or more steps, which are processed in the order in which they were configured.

When the cluster starts, all the steps are in the PENDING state. Then the first step runs and its state changes to RUNNING. When it completes, its state is marked COMPLETED, the next step runs and is set to RUNNING, and so on until there are no steps left. If a step fails, its state is set to FAILED, all remaining PENDING steps are marked CANCELLED, and step processing stops at that point. Data resulting from one step is passed to the next step through files stored in HDFS. The final step writes its result to an S3 bucket.
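The step lifecycle described above can be sketched as a small simulation. The step names below are hypothetical; the point is only the state transitions and the cancel-on-failure behavior.

```python
def run_steps(steps):
    """Simulate EMR step scheduling: every step starts PENDING, steps run
    in configured order, and a failure cancels everything still pending."""
    states = {name: "PENDING" for name, _ in steps}
    for name, succeeds in steps:
        states[name] = "RUNNING"
        states[name] = "COMPLETED" if succeeds else "FAILED"
        if not succeeds:
            # A failed step cancels all steps that have not started yet.
            for other, state in states.items():
                if state == "PENDING":
                    states[other] = "CANCELLED"
            break
    return states

# Second step fails, so the third never runs.
states = run_steps([("extract", True), ("transform", False), ("load", True)])
# states == {"extract": "COMPLETED", "transform": "FAILED", "load": "CANCELLED"}
```

Note that real EMR also lets a step's `ActionOnFailure` setting continue past a failure or terminate the whole cluster; this sketch models only the default cancel behavior described above.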

To conclude, these are the high-level steps to get started with EMR:

  • Develop the data processing application
  • Upload the data and the application to Amazon S3
  • Configure and launch the cluster
  • Optionally, monitor the cluster
  • Fetch the output
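The configure-and-launch step can be sketched with the AWS SDK for Python (boto3) and its `run_job_flow` call. The bucket name, script names, instance types, and EMR release label below are all hypothetical placeholders; the actual API call is left commented out since it requires AWS credentials and incurs cost.

```python
# Hypothetical bucket holding the application (mapper/reducer scripts)
# and the input data, per the "upload to S3" step above.
BUCKET = "my-emr-demo-bucket"

cluster_params = {
    "Name": "word-count-demo",
    "ReleaseLabel": "emr-5.36.0",          # example EMR release
    "Instances": {
        "MasterInstanceType": "m5.xlarge",  # the single master node
        "SlaveInstanceType": "m5.xlarge",   # core/task nodes
        "InstanceCount": 3,                 # one master + two slaves
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
    },
    "Steps": [
        {
            "Name": "word-count",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Application and data both live in S3; the output of the
                # final step is written back to S3.
                "Args": [
                    "hadoop-streaming",
                    "-files", f"s3://{BUCKET}/mapper.py,s3://{BUCKET}/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", f"s3://{BUCKET}/input/",
                    "-output", f"s3://{BUCKET}/output/",
                ],
            },
        }
    ],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# With credentials configured, the cluster would be launched with:
#   import boto3
#   boto3.client("emr").run_job_flow(**cluster_params)
```

After launch, the cluster can be monitored from the EMR console or CloudWatch, and the output fetched from the S3 output prefix once the steps complete.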

These are the basics of Amazon EMR. The purpose of this article was to introduce Amazon EMR and provide some details about Hadoop, how Amazon implemented it in AWS, and the enhancements Amazon added.

By this point in the article, you should be familiar with what Amazon Elastic MapReduce is, what it is based on (Hadoop), and the components of Amazon EMR. You should also be familiar with some of the high-level details of Hadoop.

If you want to know more about Amazon EMR, I strongly suggest going over the links in the reference section at the end of the article.

In the next part of the series, we will see Amazon EMR in action: we will process a large data set and count how many times each word appears.