
Using SpatialHadoop with Amazon Web Services (AWS)

Ahmed Eldawy edited this page Feb 27, 2016 · 8 revisions

This page describes how to use SpatialHadoop with Amazon Web Services (AWS). First, we describe how to set up SpatialHadoop on Amazon EC2, and then we describe how to use it with Amazon Elastic MapReduce (EMR).

Setting up SpatialHadoop on Amazon EC2

This tutorial describes how to set up a cluster on Amazon EC2 that runs SpatialHadoop. The process is very similar to installing Hadoop, with one extra step that installs SpatialHadoop.

  • The first step is to download and expand the latest SpatialHadoop binary folder on your local machine.
  • Edit the file '<spatialhadoop>/src/contrib/ec2/bin/hadoop-ec2-env.sh'. Set 'AWS_ACCOUNT_ID', 'AWS_ACCESS_KEY_ID' and 'AWS_SECRET_ACCESS_KEY' to the values for your Amazon account. This ensures that the script can access your Amazon account and start the instances there. For more details, check how to run Hadoop on Amazon EC2.
  • In the same file, set HADOOP_VERSION to '1.2.1' and S3_BUCKET to '512500806257'. This bucket contains a recent Amazon image with Hadoop 1.2.1 installed, which is used as the base image.
  • Edit the file '<spatialhadoop>/src/contrib/ec2/bin/hadoop-ec2-init-remote.sh'. Add the line marked below right after the line that starts with HADOOP_HOME.

hadoop-ec2-init-remote.sh

if [ "$IS_MASTER" == "true" ]; then
 MASTER_HOST=`wget -q -O - http://169.254.169.254/latest/meta-data/local-hostname`
fi

HADOOP_HOME=`ls -d /usr/local/hadoop-*`
#################### Add only the following line ###################
wget -qO- http://spatialhadoop.cs.umn.edu/downloads/spatialhadoop-2.3.tar.gz | tar --directory $HADOOP_HOME -xvz

################################################################################
# Hadoop configuration
# Modify this section to customize your Hadoop cluster.
################################################################################

Note: You can replace 'spatialhadoop-2.3.tar.gz' with 'spatialhadoop-latest.tar.gz'. This will install a more recent version of SpatialHadoop which has some new features and bug fixes. However, it might not be as stable as the release version.

  • Now your cluster is ready to start. You can launch a new cluster by typing:

bin/hadoop-ec2 launch-cluster test-cluster 2

For more details, check how to run Hadoop on Amazon EC2.
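
As a rough sketch, the edits to hadoop-ec2-env.sh described above might end up looking like the excerpt below. The variable names and the HADOOP_VERSION and S3_BUCKET values come from the steps above; the three credential values are made-up placeholders that you would replace with the ones from your own Amazon account.

```shell
# Excerpt of hadoop-ec2-env.sh after editing (sketch, not the full file).

# Placeholder credentials (assumed values): replace with your own.
AWS_ACCOUNT_ID="123456789012"
AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"

# Values given in the steps above: the base Hadoop version and the
# bucket that holds the Hadoop 1.2.1 image.
HADOOP_VERSION="1.2.1"
S3_BUCKET="512500806257"
```

The rest of the file can stay as shipped unless you want to customize other settings of the cluster.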

Using SpatialHadoop with Amazon Elastic MapReduce (EMR)

Amazon provides an alternative way to run MapReduce jobs through the Elastic MapReduce (EMR) service. The service takes on the burden of configuring and starting the Hadoop cluster, which you control through a simple web console or a command line interface. SpatialHadoop can run on EMR clusters by providing a bootstrap action that installs SpatialHadoop as the cluster is starting.

In this tutorial, we show how to install SpatialHadoop using the web console, but the same technique can be used from the command line interface.

  • Start the "New Cluster" wizard by clicking the "Create Cluster" button in the web console.

  • Choose the Hadoop distribution and version you want to start. In this tutorial, we use Amazon's distribution of Hadoop, which builds on Apache Hadoop 2.4.0. You can also choose an older version, but Amazon does not recommend it. We have not tested SpatialHadoop with the MapR distribution, so choosing it is at your own discretion.

  • In the "Bootstrap Actions" section, add a new bootstrap action, choose "Custom action" and click "Configure and add".

  • In the name field enter "Install SpatialHadoop", in the S3 location enter "s3://shadoop-emr/install-shadoop.rb" and leave the "Optional arguments" field blank. When you are done, click "Add".

Hint: Leaving the "Optional arguments" field blank will automatically install the most recent version of SpatialHadoop. If you would like to install a specific version, enter the download URL of the SpatialHadoop package as an argument. For example, if you would like to install SpatialHadoop 2.2, enter "http://spatialhadoop.cs.umn.edu/downloads/spatialhadoop-2.2.tar.gz" as an argument.

  • You can just start the cluster without specifying any steps and it will have SpatialHadoop installed on it. If you would also like to run some steps, choose the "Custom JAR" step and click "Configure and add".

Enter a suitable name for the step and specify the JAR location as /home/hadoop/spatialhadoop-main.jar. In the "arguments" field, specify the command you would like to run along with any arguments as shown in the figure below.

Figure: Specify the parameters for the SpatialHadoop command you want to run

Hint: Repeat the last step as many times as you want for each command you would like to run.

  • Finally, you can start the cluster using the "Create cluster" button at the bottom of the page.
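
For readers who prefer the command line interface, the console steps above could be approximated with the AWS CLI roughly as follows. This is only a sketch under several assumptions: the cluster name, instance type, instance count, step name, and the `<ami-version>`, `<command>` and `<arg1>` placeholders are all hypothetical, and the `--ami-version` flag applies to older AMI-based clusters (newer EMR releases use `--release-label` instead). The bootstrap script location and the JAR path are the ones given in the steps above. The script only prints the command; it does not launch anything.

```shell
#!/usr/bin/env bash
# Dry-run sketch: builds and prints an `aws emr create-cluster` command
# instead of executing it, so nothing is launched by accident.

# Prints the value for --bootstrap-actions.
# $1 (optional): download URL of a specific SpatialHadoop package;
# when omitted, the bootstrap script gets no arguments.
bootstrap_action() {
  local action='Path=s3://shadoop-emr/install-shadoop.rb,Name="Install SpatialHadoop"'
  if [ -n "${1:-}" ]; then
    action="${action},Args=[${1}]"
  fi
  printf '%s\n' "$action"
}

# Prints the value for --steps describing one Custom JAR step.
# $1: step name (keep it free of spaces and commas in this sketch).
# Remaining arguments: the SpatialHadoop command and its arguments.
custom_jar_step() {
  local name="$1"
  shift
  local args
  # Join the remaining arguments with commas inside a subshell.
  args="$(IFS=,; printf '%s' "$*")"
  printf 'Type=CUSTOM_JAR,Name=%s,Jar=/home/hadoop/spatialhadoop-main.jar,Args=[%s]\n' \
    "$name" "$args"
}

# Hypothetical cluster settings; adjust them to your account and needs.
# SHADOOP_URL may be set to pin a specific SpatialHadoop package.
echo aws emr create-cluster \
  --name spatialhadoop-test \
  --ami-version "${AMI_VERSION:-<ami-version>}" \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions "$(bootstrap_action "${SHADOOP_URL:-}")" \
  --steps "$(custom_jar_step my-step "<command>" "<arg1>")"
```

Setting SHADOOP_URL to the download URL of a package, as in the hint above, passes that URL to the bootstrap action as its only argument. To run several commands, one option is to call custom_jar_step once per command and pass all the results to `--steps`.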