Using native Hadoop shell and UI on Amazon EMR

This post is carried over from my earlier blog site here. I am migrating posts that seem to have gathered most hits there.


Amazon’s Elastic MapReduce (EMR) is a popular Hadoop on the cloud service. Using EMR, users can provision a Hadoop cluster on Amazon AWS resources and run jobs on them. EMR defines an abstraction called the ‘jobflow’ to submit jobs to a provisioned cluster. A jobflow contains a set of ‘steps’. Each step runs a MapReduce job, a Hive script, a shell executable, and so on. Users can track the status of the jobs from the Amazon EMR console.

Users who have used a static Hadoop cluster are used to the Hadoop CLI for submitting jobs and also viewing the Hadoop JobTracker and NameNode user interfaces for tracking activity on the cluster. It is possible to use EMR in this mode, and is documented in the extensive EMR documentation. This blog collates the information for using these interfaces into one place for such a usage mode, along with some experience notes. The examples given here have been tested on OS X MountainLion. The details will vary for other operating systems, but should be similar.

When a cluster is provisioned, a node called the ‘master’ node is created on the EMR cluster, that runs the JobTracker and the NameNode. In short, the mechanism to access the Hadoop CLI is to ssh into the master node and use the installed Hadoop software. Likewise, for accessing the UI, an SSH tunnel needs to be set up to the web interfaces that also run on the master node.

Before we start

  • This blog assumes you already have signed up for an Amazon account.
  • Next, you need the Amazon EMR CLI. The CLI is a useful way to access Amazon EMR that provides a good balance between completeness of functionality and ease of use, compared to the Amazon EMR console.

Note: The EMR CLI is a Ruby application requiring Ruby 1.8.x. If you have installed Ruby 1.9 on your system, you have two options:

    • Use RVM, install Ruby 1.8.x. via RVM and switch the Ruby version. 
    • Alternatively, you can use a fork of the Amazon CLI that works with Ruby 1.9 called elastic-mapreduce-ruby.
  • Configure the EMR client by creating a credentials.json file. The file should be created in the home directory of your elastic-mapreduce client. A sample file’s contents is mentioned here:

  • In the example above, access_id and private_key are security credentials connected with your AWS account. The keypair is a resource associated to your Amazon EC2 service. The private key details of the keypair can be downloaded to a PEM file locally, whose path is specified in key-pair-file. You can refer to this link for details on how to get these various security parameters.

Launching a cluster

Once set up, we are now ready to launch an EMR cluster and access it to submit Hadoop jobs.

Typical deployments of a Hadoop cluster comprise of three types of nodes – the masters (JobTracker and NameNode), the slaves (TaskTrackers and DataNodes) and client nodes (typically called access nodes or gateways) from where users submit jobs. An EMR cluster too has three types of nodes – a single master node (that runs both the JobTracker and NameNode), core nodes (that run both TaskTrackers and DataNodes) and task nodes (that run only the TaskTrackers). As you can see, the categories of nodes are slightly different. But we could double up the master node in EMR to be a client node as well. Since the master node in an EMR cluster is used only by you, it is assumed it will have enough capacity to run the Hadoop CLI.

Start an EMR cluster with the CLI in this fashion:

The ‘elastic-mapreduce’ program will be in the home directory of the EMR client. The above command creates a cluster with 3 instances – 1 master and 2 slaves. The –alive flag ensures that the launched cluster stays alive until it is manually terminated by the user. This option is required to login into the master node and submit Hadoop jobs to the cluster directly. The output of this command will be a ‘jobflow’ ID – something that will look like j-FTW6FLQ1P2G0. Make a note of this ID, as we will use it in other commands below.

Note: You will be charged according to the EMR rates for your usage of the cluster, depending on the type and number of instances chosen.

You can also get more details about the launched cluster using the following command:

Of specific interest in the output of the above command will be the “State” attribute of the ExecutionStatusDetail node. You need to wait until the value of this attribute becomes “WAITING”. Once this state is reached, your cluster is ready for further action. Similarly, the “MasterPublicDnsName” attribute gives the DNS name of the machine running the Hadoop JobTacker and NameNode.

Browsing Hadoop web UI on the cluster

At this point, it will be useful to check out how the cluster looks like from the familiar JobTracker and NameNode web UI.

The JobTracker web server runs on port 9100 on the master node. In order to browse this UI, you can set up an SSH tunnel that will work as a SOCKS proxy server from your machine to the master node that dynamically forwards HTTP requests. The Amazon CLI provides a command to set up this proxy:

Note: This command helpfully prints out the SSH command that is invoked to set up the SOCKS proxy server. When facing problems with connectivity, I have been able to take this SSH command, modify it (for instance to add the -v option to ssh for enabling debug output) and run it directly for debugging / resolving issues.

After the socks proxy server is started, you can configure your browser’s or system’s proxy settings to use this SOCKS proxy for HTTP traffic. For Chrome, I did this by launching Settings > Change Proxy Settings > Select “SOCKS Proxy” > Enter the server address as 127.0.0.1 and the port as 8157.

socks-proxy

If things are fine here, you should be able to hit the URL http://<MasterPublicDnsName>:9100/ on the browser and see the JobTracker UI page and http://<MasterPublicDnsName>:9101/ for the NameNode UI.

The method described here routes all HTTP traffic from your browser through the tunnel. A better option would be to set up a rule that routes only traffic to the EMR clusters through the tunnel. Such a method is described in the EMR documentation, using the FoxyProxy Firefox browser plugin.

Submitting Jobs

The next step is to submit jobs to the EMR cluster using the Hadoop CLI. As described above, the master node doubles up as a Hadoop client node as well. So, we should SSH into the master node using the following command:

This command should drop you into the home directory of the ‘hadoop’ user on the master node. A quick listing of the home directory will show a full Hadoop installation, including bin and conf directories and all the jars that are part of the Hadoop distribution.

You can execute your Hadoop commands as usual from here. For e.g.

 

Using the UI that we set up in the previous step, you can also browse the job pages as usual.

Terminating the cluster

Once you are done with the usage, you must remember to terminate the cluster, as otherwise you will continue to accrue cost on an hourly basis irrespective of usage. The command to terminate is:

Note: Terminating the cluster will result in loss of data stored in the HDFS of the cluster. If you need to store data permanently, you can do so by providing Amazon S3 paths as output directories of your jobs.

In conclusion, tools like the elastic mapreduce CLI make it easy to not only provision a Hadoop cluster and run jobs on it, but also to provide the familiar look and feel of the Hadoop client and the Web UI.

 

facebooktwittergoogle_plusredditpinterestlinkedinmail

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">