In this second part of the series, we will see how an Elastic MapReduce (EMR) cluster is deployed, using the information from the first part as the foundation for the hands-on work.

So let’s get this started and see EMR in action.

In the AWS Management Console, under the “Analytics” section, click “EMR”:

If you have no existing clusters and none were terminated recently, the only option available is to create a new cluster. Click “Create Cluster” to start the process:

The next page is split into multiple sections, which we discuss below.

The first one is “Cluster Configuration”, where you provide the name of the cluster and choose whether to enable termination protection and logging. If logging is enabled, you also choose where the logs are written. Logs are always stored in an S3 bucket:

In the same section, you have a button that allows you to configure a sample application and this is what we are going to do. These applications are already set up for you so you can see an Elastic MapReduce cluster in action without needing to go through the steps of creating your own application:

There are multiple sample applications to choose from, but for simplicity we will use the word count application:

Once you select the application, you need to provide two things: the output location and, if logging is enabled, the location where the log data will be stored:

The output location and the logging location must both be in S3, which means the S3 bucket has to exist before you configure the EMR cluster. In this case, I created the S3 bucket before I started the EMR cluster creation process. As you can see, the output location and the logging location both point to this S3 bucket:
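As a rough illustration of the two locations, the sketch below builds folder-style S3 URIs for the output and the logs. The helper name `s3_uri` and the bucket name `my-emr-demo-bucket` are placeholders invented for this example; they are not the values used in the screenshots.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Build an s3:// URI that ends in a slash, like a folder prefix."""
    if not bucket or "/" in bucket:
        raise ValueError("bucket must be a bare bucket name, without slashes")
    # Drop empty segments and stray slashes so the URI has no double slashes.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "s3://" + "/".join([bucket, *cleaned]) + "/"


# Placeholder bucket name (an assumption, not the article's bucket).
BUCKET = "my-emr-demo-bucket"

output_uri = s3_uri(BUCKET, "output")
log_uri = s3_uri(BUCKET, "logs")

print(output_uri)
print(log_uri)
```

Keeping both prefixes in the same bucket mirrors the setup described above, where a single bucket holds the results and the logs.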

The next section is the “Tags” section where you can apply different tags to the cluster. I used only one just as an example:

The next section is about the Hadoop distribution that is used. Depending on the region where you create the EMR cluster, you might have different options here. You can use either the Amazon distribution or the MapR distribution. In this case, the cluster is created in the Central EU region and only the Amazon distribution is available. The Amazon Hadoop distribution is based on Apache Hadoop, but it has patches and improvements that make it run efficiently on AWS.

You can also choose the AMI version of the EC2 instances that are launched to run the cluster. These AMIs are purpose-built images that contain the OS, Hadoop and other software needed to run the cluster. They can be used only in the EMR context and are periodically updated to improve their efficiency.

The next section is “File System Configuration”, where you decide whether to encrypt the data and, if so, whether encryption happens on the client side or the server side:

The next section is an interesting one. Here you define where the EC2 instances are launched and which instance types are used. The more powerful the instances, the faster they can process the data:

The next section is about key pairs and IAM roles. It starts with the key pair used to access the master node and the setting that controls whether other users can access the cluster:

Then comes the IAM role configuration for the EMR cluster:

The next section lets you add bootstrap actions, which are scripts executed on every cluster node before Hadoop starts. This can be left at the default:

The last section lets you add steps. A step is a unit of work that can contain one or more Hadoop jobs.
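To make the idea of a step more concrete, here is a minimal sketch of how a word count step might be described as a Python dictionary in the shape that boto3's `run_job_flow` and `add_job_flow_steps` accept. Several things here are assumptions rather than values taken from this walkthrough: the function name `word_count_step`, the sample mapper and input paths (recalled from the public AWS word count sample), and the use of `command-runner.jar`, which applies to newer EMR release labels rather than the AMI versions shown in the console above.

```python
# Assumed locations of the public AWS word count sample (verify before use).
SAMPLE_MAPPER = "s3://elasticmapreduce/samples/wordcount/wordSplitter.py"
SAMPLE_INPUT = "s3://elasticmapreduce/samples/wordcount/input"


def word_count_step(output_uri: str) -> dict:
    """Describe a Hadoop streaming word count as a single EMR step."""
    return {
        "Name": "Word count",
        # Shut the cluster down if the step fails.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            # Assumption: command-runner.jar, as used on newer release labels.
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", SAMPLE_MAPPER,
                "-mapper", "wordSplitter.py",
                # "aggregate" is Hadoop streaming's built-in summing reducer.
                "-reducer", "aggregate",
                "-input", SAMPLE_INPUT,
                "-output", output_uri,
            ],
        },
    }


step = word_count_step("s3://my-emr-demo-bucket/output/")
print(step["HadoopJarStep"]["Args"])
```

The output URI passed in is a placeholder; the output folder should not already exist, since Hadoop refuses to overwrite an existing output directory.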

Finally, click “Create Cluster” to create the cluster:
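The same choices made in the console can also be expressed programmatically. The sketch below assembles a request for boto3's `run_job_flow` call, under several assumptions that are not taken from this walkthrough: the instance types, instance count, release label, tag, region name and the default role names are illustrative, and `build_cluster_request` and `launch` are hypothetical helper names.

```python
def build_cluster_request(name, log_uri, steps, key_name=None):
    """Assemble keyword arguments for boto3's EMR run_job_flow call."""
    instances = {
        # Illustrative hardware choices (assumptions).
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Terminate automatically once all steps have finished.
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
    }
    if key_name:
        # EC2 key pair used to log in to the master node.
        instances["Ec2KeyName"] = key_name
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-6.15.0",  # assumption: a recent release label
        "Instances": instances,
        "Steps": steps,
        "Tags": [{"Key": "purpose", "Value": "demo"}],
        # Default role names created by the EMR console or CLI.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def launch(request, region="eu-central-1"):
    """Submit the request; needs boto3 and AWS credentials (not run here)."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


request = build_cluster_request("wordcount-demo", "s3://my-emr-demo-bucket/logs/", [])
print(sorted(request))
```

Calling `launch(request)` would start a real cluster and incur charges, so the example only builds and prints the request.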

Once you create the cluster, it will move through different states:
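For readers who prefer to follow these states from a script, here is a minimal polling sketch. The state names are those reported by the EMR API; the function `wait_for_terminal_state` and the canned list of states are made up for this example, and in real use the `get_state` callable would wrap something like `describe_cluster(ClusterId=...)["Cluster"]["Status"]["State"]` from boto3.

```python
import time

# States after which the cluster will not change any more.
TERMINAL_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_for_terminal_state(get_state, poll_seconds=30, sleep=time.sleep):
    """Poll get_state() until a terminal state; return the distinct states seen."""
    seen = []
    while True:
        state = get_state()
        # Record a state only when it differs from the previous one.
        if not seen or seen[-1] != state:
            seen.append(state)
        if state in TERMINAL_STATES:
            return seen
        sleep(poll_seconds)


# Canned sequence standing in for real API responses (an assumption).
fake_states = iter(
    ["STARTING", "STARTING", "BOOTSTRAPPING", "RUNNING", "TERMINATING", "TERMINATED"]
)
history = wait_for_terminal_state(lambda: next(fake_states), sleep=lambda s: None)
print(" -> ".join(history))
```

Passing the `sleep` function as a parameter keeps the example instant to run while leaving the real polling interval configurable.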

By expanding the recently created cluster, you can see again the details that were selected during the cluster provisioning. You can also see the states of the master and core EC2 instances:

You can also get the details of the EC2 instances used to power the cluster by expanding the “Hardware” section:

Another monitoring section is “Steps”, where you can see the status of each step. In this case, because the screenshot was taken just after the cluster was created, the step had not been executed yet, hence the pending status:

You can see the details of a step by expanding it, which shows what was configured during cluster creation:

A few minutes later, the cluster is running:

And once the step has been executed, the cluster will slowly transition to terminating status:

And finally to terminated status:

You can check how long the cluster was alive and how long the steps took to execute. In this example, the cluster lived for nine minutes, whereas the execution of the step took one minute:
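The same durations can be derived from timestamps. The sketch below assumes timeline dictionaries shaped like the `Timeline` entries that boto3 returns for clusters and steps; the timestamps themselves are invented so that they reproduce the nine-minute and one-minute figures above, and `minutes_between` is a hypothetical helper.

```python
from datetime import datetime, timezone


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


# Invented timestamps chosen to match the durations quoted in the text.
cluster_timeline = {
    "CreationDateTime": datetime(2015, 1, 1, 12, 0, tzinfo=timezone.utc),
    "EndDateTime": datetime(2015, 1, 1, 12, 9, tzinfo=timezone.utc),
}
step_timeline = {
    "StartDateTime": datetime(2015, 1, 1, 12, 6, tzinfo=timezone.utc),
    "EndDateTime": datetime(2015, 1, 1, 12, 7, tzinfo=timezone.utc),
}

cluster_minutes = minutes_between(
    cluster_timeline["CreationDateTime"], cluster_timeline["EndDateTime"]
)
step_minutes = minutes_between(
    step_timeline["StartDateTime"], step_timeline["EndDateTime"]
)

print(f"cluster alive: {cluster_minutes:.0f} min, step ran: {step_minutes:.0f} min")
```

The gap between the two numbers is typical for short jobs: most of the cluster's lifetime goes into provisioning and bootstrapping rather than into the step itself.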

As you may remember, we had to create an S3 bucket where the results and logs would be stored. If we check its content now, we see two folders: one containing the logs and one containing the results:

By going through the folder hierarchy of the logs, you can find detailed information about what happened while the cluster was alive and when the steps were executed:

If you browse the output folder, you should see something like this:

The presence of the “_SUCCESS” file means that the steps were executed successfully. The other seven files are the results of the steps. In this specific case, each line of a file holds one word followed by the number of times that word appeared in the input data. This is just a tiny part of one file:

a	14716
abate	9
abaza	3
abc	6
abda	3
abdala	6

For instance, the word “a” has been counted 14,716 times.
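If you download one of the result files, a few lines of Python are enough to read it back. This is a small sketch that assumes each line is a word and a count separated by a tab, as in the excerpt above; `parse_counts` is a name chosen for this example.

```python
def parse_counts(lines):
    """Turn 'word<TAB>count' lines into a {word: count} dictionary."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, _, count = line.partition("\t")
        counts[word] = int(count)
    return counts


# The six lines shown in the excerpt above.
sample = "a\t14716\nabate\t9\nabaza\t3\nabc\t6\nabda\t3\nabdala\t6\n"
counts = parse_counts(sample.splitlines())

print(counts["a"])
print(len(counts))
```

To process a whole file instead of the inline sample, the open file object can be passed directly to `parse_counts`.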

To give you an idea of what happened: the input data contained 1,990,447 words in total, with repetitions. After performing the step, it turned out that there are 29,164 distinct words.
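The difference between total and distinct words is easy to see on a tiny input. The following sketch mimics the map and reduce phases locally in plain Python; it is an illustration of the idea, not the code that the EMR sample application runs, and the function names and sample sentence are made up.

```python
import re
from collections import Counter


def map_words(text):
    """Map phase: emit a (word, 1) pair for every word in the text."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, 1


def reduce_counts(pairs):
    """Reduce phase: sum the emitted values for each word."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return totals


text = "the cat and the hat and the bat"
totals = reduce_counts(map_words(text))

total_words = sum(totals.values())  # every occurrence counted
distinct_words = len(totals)        # each word counted once

print(total_words, distinct_words)
```

On a cluster, the map phase runs in parallel across the nodes and the pairs are grouped by word before the reduce phase, but the arithmetic is the same.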

And that is pretty much how you create an Elastic MapReduce cluster.

The creation is pretty straightforward. Amazon EMR allows you to spin up EC2 instances that give you the compute power required to process the data. The hard part, configuring the Hadoop applications, remains the user’s responsibility.

Having reached this point in the article, you should be familiar with Elastic MapReduce and what it can do for you. We also saw how to deploy a cluster to process large amounts of data.

You can find more details about Elastic MapReduce and its additional features in the reference section.

Reference