Apache Spark on AWS EMR

This article walks through creating a Spark cluster on AWS EMR, running a user-defined JAR on it, and writing the analysis results back to AWS S3.

The AWS CLI is used heavily here, so all of the above tasks can be driven by a simple script. It took me quite some time to configure a working EMR cluster; I hope this article saves others the trouble.

# Install the AWS CLI; aws configure writes the result to ~/.aws/config
pip install --upgrade awscli
aws configure
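
For reference, the stored configuration looks roughly like this; the values below are placeholders, and newer awscli releases write the keys to ~/.aws/credentials instead:

# ~/.aws/config (placeholder values, not real credentials)
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
region = eu-west-1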
 
# Create Cluster
aws emr create-cluster --ami-version VERSION \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
                    InstanceGroupType=CORE,InstanceCount=1,InstanceType=m1.medium \
  --no-auto-terminate --name spark_cluster \
  --bootstrap-action Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb,Name=install_spark \
  --ec2-attributes KeyName=AWS_IAM
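
Scripting tip: create-cluster prints the new cluster's ID as JSON. A sketch for capturing it and blocking until the cluster is ready, assuming an awscli version that supports the global --query option and aws emr wait:

# Same call as above, with --query appended so the cluster ID lands in a variable
CLUSTER_ID=$(aws emr create-cluster --ami-version VERSION \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
                    InstanceGroupType=CORE,InstanceCount=1,InstanceType=m1.medium \
  --no-auto-terminate --name spark_cluster \
  --bootstrap-action Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb,Name=install_spark \
  --ec2-attributes KeyName=AWS_IAM \
  --query ClusterId --output text)

# Block until the cluster is up before SSHing in
aws emr wait cluster-running --cluster-id "$CLUSTER_ID"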
 
# SSH to master node
aws emr ssh --cluster-id JOB_ID --key-pair-file AWS_IAM.pem

# Copy the jar to be executed onto the master node
hadoop fs -get s3n://bucket/jar /tmp
cd spark
sudo mv /tmp/*.jar .
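
An alternative sketch, assuming the AWS CLI is also present on the master node: aws s3 cp can fetch the jar directly instead of going through the HDFS client (the bucket and jar names are placeholders):

# Alternative: copy the jar straight from S3 with the AWS CLI
aws s3 cp s3://bucket/YOUR_JAR.jar ~/spark/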
 
# Run spark job
./bin/spark-submit --master spark://MASTER_HOST:7077 --class "PACKAGE.CLASS" YOUR_JAR.jar JAR_PARAMS
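
As a concrete but entirely hypothetical example, an analysis job that reads input from S3 and writes its results back might be launched like this; the class name, jar, and bucket paths are made-up placeholders:

# Hypothetical job: read from S3, write analysis results back to S3
./bin/spark-submit --master spark://MASTER_HOST:7077 \
  --class com.example.Analysis \
  analysis.jar s3n://bucket/input s3n://bucket/output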
  
# Terminate Cluster
aws emr terminate-clusters --cluster-ids JOB_ID
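
If the whole flow lives in one script, it is worth confirming the teardown completed; a sketch, reusing the CLUSTER_ID captured at creation time in the snippet above:

# Terminate, wait for teardown to finish, and print the final state
aws emr terminate-clusters --cluster-ids "$CLUSTER_ID"
aws emr wait cluster-terminated --cluster-id "$CLUSTER_ID"
aws emr describe-cluster --cluster-id "$CLUSTER_ID" --query Cluster.Status.State --output text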

Passwordless SSH from Master to Slave on AWS EMR

Upload the pem file to the master node, then SSH to the master:

scp -i admin-key.pem admin-key.pem ec2-user@EC2_IP.eu-west-1.compute.amazonaws.com:~/
aws emr ssh --cluster-id JOB_ID --key-pair-file admin-key.pem

Configure passwordless SSH from master to slaves

# Identify the IPs of the slave nodes
hdfs dfsadmin -report | grep ^Name | cut -f2 -d: | cut -f2 -d' '
# Load the cluster key into an agent, then generate a local key pair
ssh-agent bash
ssh-add YOUR_AWS_IAM.pem
ssh-keygen    # accept the defaults (press Enter at each prompt)
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@IP_OF_SLAVE
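
To push the new public key to every slave in one pass, the IP listing from hdfs dfsadmin -report can feed a loop; a sketch, assuming the default id_rsa key generated above:

# Copy the public key to every datanode reported by HDFS
for ip in $(hdfs dfsadmin -report | grep ^Name | cut -f2 -d: | cut -f2 -d' '); do
  ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@$ip
done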

See the ssh-agent documentation for more background.

SSH Tunnel on AWS: Using the native Hadoop shell and UI on Amazon EMR

A SOCKS proxy is quite handy for browsing web content served from EC2 instances and similar cloud machines. Hadoop, Spark, and many other clustering solutions provide a portal page for monitoring the cluster and its jobs, but that content is almost always accessible only from localhost.

For AWS EMR, configuring a SOCKS proxy is about as simple as it gets. First look up the master node's public DNS name from the jobflow description:

./elastic-mapreduce --describe -j <jobflow-id>
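
What actually opens the tunnel is an SSH session with dynamic port forwarding; a minimal sketch, assuming the master's public DNS name from the describe output above (MASTER_PUBLIC_DNS is a placeholder) and the cluster's key pair:

# Open a SOCKS tunnel on local port 8157 to the EMR master node
ssh -i YOUR_AWS_IAM.pem -N -D 8157 hadoop@MASTER_PUBLIC_DNS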

Then simply configure a SOCKS proxy in your browser pointing at 127.0.0.1, port 8157. FoxyProxy is one browser plugin that handles this.
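
Once the proxy is active, the cluster's web UIs can be opened by the master's hostname. The exact ports depend on the AMI and Hadoop version, so treat these as illustrative:

# Typical UI addresses on Hadoop 1.x-era EMR AMIs (ports are version-dependent)
http://MASTER_PUBLIC_DNS:9100/   # JobTracker UI
http://MASTER_PUBLIC_DNS:9101/   # NameNode UI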

If vivid figures are needed, see this reblogged post:

YHemanth's Blog

Amazon’s Elastic MapReduce (EMR) is a popular Hadoop on the cloud service. Using EMR, users can provision a Hadoop cluster on Amazon AWS resources and run jobs on them. EMR defines an abstraction called the ‘jobflow’ to submit jobs to a provisioned cluster. A jobflow contains a set of ‘steps’. Each step runs a MapReduce job, a Hive script, a shell executable, and so on. Users can track the status of the jobs from the Amazon EMR console.

Users who have used a static Hadoop cluster are used to the Hadoop CLI for submitting jobs and also viewing the Hadoop JobTracker and NameNode user interfaces for tracking activity on the cluster. It is possible to use EMR in this mode, and is documented in the extensive EMR documentation. This blog collates the information for using these interfaces into one place for such a usage mode, along with some…
