Saturday, April 2, 2011

Running a Hadoop Cluster on Amazon EC2


This blog post describes the process of bootstrapping an Apache Hadoop MapReduce cluster on the cloud infrastructure provided by Amazon Web Services (AWS) Elastic Compute Cloud (EC2). AWS is an appealing platform because you can rent compute resources by the hour, for as little as $0.085 per server-hour for small EC2 instances.

Amazon Web Services

Amazon Web Services is a cloud-based service provided by Amazon that supplants the traditional hosted-server paradigm. AWS offers servers, disk storage, notification services, and other products billed on a 'pay for what you use' model. You can scale up when demand increases and scale down when it decreases. The flexibility and affordability of AWS are the reason many new startups choose to host on Amazon's infrastructure rather than procure and maintain their own.

Account Creation

Creating an account is easy and free. Simply go to the AWS website to sign up for an account. Amazon will want a valid credit card to bill for charges. New customers are eligible for a free tier of basic resources.

Security Credentials

Once your account has been created, you will need to generate and retrieve several types of security credentials. Navigate to the Access Credentials page of the AWS console to retrieve the information detailed in this section.

Account ID

The account ID is near the upper right-hand corner of the page. Retrieve this ID and remove the hyphens from it. For instance, if my account number were 1234-5678-9123, I would do the following:
josh@laptop:~$ echo 1234-5678-9123 | sed 's/-//g' > account-id.txt
Access Keys

If access keys don't already exist, generate them via the web interface and then retrieve the following information:
  • Access Key ID
  • Secret Access Key
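Many AWS-aware tools read these two values from environment variables. As a sketch (the exact variable names vary by tool, and the values below are placeholders, not real credentials), you might export them in your shell profile:

```shell
# Placeholder credentials -- substitute your own from the
# Access Credentials page. Avoid committing these to scripts.
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
```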
X509 Certificates

Create an X509 Certificate using the web interface and then retrieve the following files:
  • X.509 public certificate
  • X.509 private key
Save these keys to ~/.ec2/ and take care to chmod 600 any private/secret keys.
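For example (the certificate and key filenames shown in the comments are hypothetical; the console generates its own names for your downloads):

```shell
# Create a private directory for AWS credentials.
mkdir -p ~/.ec2
chmod 700 ~/.ec2
# After downloading, move the certificate and private key into it,
# e.g. (filenames are hypothetical):
#   mv ~/Downloads/cert-ABC123.pem ~/Downloads/pk-ABC123.pem ~/.ec2/
#   chmod 600 ~/.ec2/pk-ABC123.pem
```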
Command Line Tools

  • Download and install the EC2 command line tools
  • Set the appropriate EC2 environment variables (changing the values, of course, to suit your environment)
  • Add $EC2_HOME/bin to your PATH environment variable
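The environment variables the EC2 command line tools expect look roughly like the following. All paths here are examples, and the credential filenames are hypothetical; adjust them to where you installed the tools and saved your keys:

```shell
# Example settings for the EC2 CLI tools; adjust paths to your setup.
export EC2_HOME=/opt/ec2-api-tools                 # where the tools are unpacked
export EC2_PRIVATE_KEY=$HOME/.ec2/pk-ABC123.pem    # hypothetical filename
export EC2_CERT=$HOME/.ec2/cert-ABC123.pem         # hypothetical filename
export JAVA_HOME=/usr/lib/jvm/default-java
export PATH=$PATH:$EC2_HOME/bin
```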
SSH Keys

Use the CLI tools to generate the SSH key pair that will be used to log into your EC2 compute instances. Generate the key pair as follows:
josh@laptop:~$ ec2-add-keypair gsg-keypair
This command will generate an RSA public/private key pair and register it with AWS. Since it doesn't save any output to a file, you will need to copy the part starting with "-----BEGIN RSA PRIVATE KEY-----" through "-----END RSA PRIVATE KEY-----" (inclusive) into the following file:


For security purposes, make sure this file is owned appropriately and has 600 UNIX permissions. At this point, you have completely configured AWS. Next, you must configure Hadoop to know about your AWS account.
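Assuming you save the pasted key as ~/.ec2/id_rsa-gsg-keypair (the filename is an assumption; use whatever path you chose), the permissions can be locked down like so:

```shell
KEYFILE=$HOME/.ec2/id_rsa-gsg-keypair   # hypothetical path
mkdir -p "$HOME/.ec2"
touch "$KEYFILE"        # in practice, paste the private key block in here
chmod 600 "$KEYFILE"
ls -l "$KEYFILE"        # should show -rw------- ownership by you
```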


Install Hadoop Locally

Although you will be executing Hadoop on the AWS EC2 instances, you still need to install Hadoop locally, because the distribution contains some useful AWS-aware shell scripts. I am using version 0.19.2, installed to /opt/hadoop/current and referenced for the rest of this post as HADOOP_HOME. As I rely on the configuration defaults of this version of Hadoop, I strongly suggest that you use the contrib scripts from this version as well.


Add $HADOOP_HOME/src/contrib/ec2/bin/ to your PATH. Then edit the following environment variables in $HADOOP_HOME/src/contrib/ec2/bin/hadoop-ec2-env.sh with the values you obtained in the security credentials section:
  • AWS_ACCOUNT_ID (be sure to leave out the dashes)
There are other settings in this file, such as which version of Hadoop you'd like to run on EC2. Since this post is focused on getting a basic cluster up and running, we will not cover these advanced settings.
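As a sketch, the relevant assignments in the EC2 contrib environment script look something like the following. All values are placeholders, and the exact set of variables depends on your Hadoop version:

```shell
# Placeholder values -- substitute your own credentials and paths.
AWS_ACCOUNT_ID=123456789123        # account ID with the dashes removed
AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
KEY_NAME=gsg-keypair               # the keypair registered earlier
```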

Launch EC2 Instances

Now that everything is configured, we can launch our Hadoop cluster on the AWS infrastructure. To keep things inexpensive, we will launch a cluster with one master and two slaves using small EC2 instances ($0.085 per server-hour). Note that Amazon bills by the hour, so launching the cluster for an hour or less will cost about $0.26.

To launch a cluster named 'test-cluster' with 2 slaves, issue the following command:
josh@laptop:~$ hadoop-ec2 launch-cluster test-cluster 2
After the cluster is initialized, you can log into the cluster via SSH by using the command:
josh@laptop:~$ hadoop-ec2 login test-cluster

Execute Hadoop Examples

At this point, we can execute some of the stock examples that ship with Hadoop. For instance, let's use the cluster to estimate the value of pi via sampling:
# cd /usr/local/hadoop-*
# bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
As this post is concerned with getting a basic EC2 Hadoop cluster running, we will not cover executing custom MapReduce jobs against custom data sets on EC2. At this point, though, feel free to explore the Hadoop Amazon Machine Image (AMI) and run some of the other examples.

Terminate EC2 Instances

Remember that the meter is running until you shut down the EC2 instances. To shut them down, issue the following command from your local machine:
josh@laptop:~$ hadoop-ec2 terminate-cluster test-cluster
If you are paranoid, you can log in to the AWS web console dashboard to double-check that the EC2 instances are no longer running.


Amazon's web service platform is ideal for quickly provisioning Hadoop cluster nodes on a shoestring budget. Once the configuration steps in this post are executed, one can launch a Hadoop cluster of up to 20 nodes within minutes (you can ask Amazon to raise this limit if you need more nodes). This is perfect for experimenting with technologies like Hadoop or for running one-time MapReduce jobs.


1. Thanks for this thing, but I have a question :)
   Any guesses for this error?

      hadoop@testEnv:/opt/whirr/current$ hadoop-ec2 launch-cluster test-cluster 2
      Testing for existing master in group: test-cluster
      Starting master with AMI ami-fa6a8e93
      Unexpected error: Connection reset
      at org.apache.commons.httpclient.HttpParser.readRawLine(
      at org.apache.commons.httpclient.HttpParser.readLine(
      at org.apache.commons.httpclient.HttpConnection.readLine(
      at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(
      at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(
      at org.apache.commons.httpclient.HttpMethodBase.readResponse(
      at org.apache.commons.httpclient.HttpMethodBase.execute(
      at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(
      at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(
      at org.apache.commons.httpclient.HttpClient.executeMethod(
      at org.codehaus.xfire.transport.http.CommonsHttpMessageSender.send(
      at org.codehaus.xfire.transport.http.HttpChannel.sendViaClient(
      at org.codehaus.xfire.transport.http.HttpChannel.send(
      at org.codehaus.xfire.handler.OutMessageSender.invoke(
      at org.codehaus.xfire.handler.HandlerPipeline.invoke(
      at org.codehaus.xfire.client.Invocation.invoke(
      at org.codehaus.xfire.client.Invocation.invoke(
      at org.codehaus.xfire.client.Client.invoke(
      at org.codehaus.xfire.client.XFireProxy.handleRequest(
      at org.codehaus.xfire.client.XFireProxy.invoke(
      at $Proxy12.runInstances(Unknown Source)
      Waiting for instance to start

      1. Is your master running? Have you enabled any firewall exceptions?