This blog post describes the process of bootstrapping an Apache Hadoop MapReduce cluster on the cloud infrastructure provided by Amazon Web Service (aws) Elastic Cloud Compute (ec2). AWS is an appealing platform due to the fact that you can rent compute resources by the hour for as little as $.085 per server-hour for small ec2 instances.
Amazon Web Services
Amazon Web Services is a cloud-based service provided by Amazon that supplants the traditional hosted-server paradigm. AWS offers servers, disk storage, notification services and other products that are billable as a 'pay for what you use' model. You can scale up when demand increases and scale down when demand decreases. The flexibility and affordability of AWS is the reason why many new startup companies choose to host on Amazon's infrastructure rather than procuring and maintaining their own.
Creating an account is easy and free. Simply go to http://aws.amazon.com/ to sign up for an account. Amazon will want a valid credit card to bill for charges. New customers are eligible for a free tier for basic resources.
Once your account has been created, you will need to generate and retrieve several types of security credentials. Navigate to the access credentials page of aws to retrieve the information detailed in this section.
The account id is near the upper right hand corner of the page. Retrieve this id and remove the hyphens from it. For instance, if my account number were 1234-5678-9123, I would do the following:
josh@laptop:~$ echo 1234-5678-9123 | sed 's/\-//g' > account-id.txt
If access keys don't already exist, generate them via the web interface and then retrieve the following information:
- Access Key ID
- Secret Access Key
Create an X509 Certificate using the web interface and then retrieve the following files:
- X.509 public certificate
- X.509 private key
- Download and install the CLI tools
- Set the appropriate EC2 environment variables (changing the values, of course, to suit your environment)
EC2_HOME=/opt/aws/ec2/ec2-api-tools-22.214.171.124 EC2_URL=http://ec2.us-east-1.amazonaws.com EC2_PRIVATE_KEY=/home/hadoop/.ec2/pk-AGWHNHJGCWVK3YZOY5TTZNUHVIC6IZSC EC2_CERT=/home/hadoop/.ec2/cert-AGWHNHJGCWVK3YZOY5TTZNUHVIC6IZSC.pem
- Add $EC2_HOME/bin to your PATH environment variable
Use the CLI tools to generate the ssh keys that will be used to log into your ec2 compute instance. Generate the key pair as follows:
This command will generate an rsa public/private key pair and register it with aws. Since it doesn't save any output to a file, you will need to copy the part starting with "-----BEGIN RSA PRIVATE KEY" through "-----END RSA PRIVATE KEY" (inclusive) into the following file:
josh@laptop:~$ ec2-add-keypair gsg-keypair
For security purposes, make sure this file is owned appropriately and has 600 UNIX permissions. At this point, you have completely configured aws. Next you must configure hadoop to know about your aws account.
Although you will be executing hadoop on the AWS EC2 instances, you still need to install hadoop locally. This is because it contains some useful shell scripts that are aws-aware. I am using version .19.2 installed to /opt/hadoop/current and it is referenced for the rest of this post as HADOOP_HOME. As I rely on the defaults in the configuration of this version of hadoop, I strongly suggest that you use the contrib scripts from this version as well.
Add $HADOOP_HOME/src/contrib/ec2/bin/ to your PATH. Then edit the following environment variables in $HADOOP_HOME/src/contrib/ec2/bin/hadoopec2-env.sh with the values you obtained in the security credentials section:
- AWS_ACCOUNT_ID (be sure to leave out the dashes)
Launch EC2 Instances
Now that everything is configured, we can launch our hadoop cluster on the aws infrastructure. To keep things inexpensive, we will launch a hadoop cluster with one master and 2 slaves using small ec2 instances ($.085 per server-hour). Note that amazon bills by the hour, so launching the hadoop cluster for an hour or less will cost ~ $.26.
To launch a cluster named 'test-cluster' with 2 slaves, issue the following command:
After the cluster is initialized, you can log into the cluster via ssh by using the command:
josh@laptop:~$ hadoop-ec2 launch-cluster test-cluster 2
josh@laptop:~$ hadoop-ec2 login test-cluster
Execute Hadoop Examples
At this point, we can execute some of the stock examples that ship with hadoop. For instance, lets use the hadoop cluster to estimate the value of pi via sampling:
# cd /usr/local/hadoop-* # bin/hadoop jar hadoop-*-examples.jar pi 10 10000000As this post is concerned with getting a basic ec2 hadoop cluster running, we will not cover executing custom map-reduce jobs with custom data sets from ec2. At this point, though, feel free to explore the hadoop Amazon Machine Instance (AMI) and run some of the other examples.
Terminate EC2 Instances
Remember that the meter is running until you shut down the ec2 instances. To shut down the ec2 instances, issue the following command from your local machine:
If you are paranoid, you can log in to the dashboard of aws web console to double check that the ec2 instances are no longer running.
josh@laptop:~$ hadoop-ec2 terminate-cluster test-cluster
Amazon's web service platform is ideal for quickly provisioning hadoop cluster nodes on a shoe-string budget. Once the configuration steps in this post are executed, one can launch a hadoop cluster of up to 20 nodes within minutes (you can request that amazon allow you to launch more nodes, if needed). This is perfect for experimenting with technologies like hadoop or running one-time map reduce jobs.