Apache Spark on AWS EMR

This article walks through creating a Spark cluster on AWS EMR, running a user-defined jar on it, and writing the analysis results back to AWS S3.

The AWS CLI is used heavily here, so all of the above tasks can be defined in one simple script. It took me quite some time to configure a useful EMR cluster, and I hope this article helps others.

# Install the AWS CLI and configure it; see ~/.aws/config for the resulting settings
pip install --upgrade awscli
aws configure
# Create Cluster
aws emr create-cluster --ami-version VERSION \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=1,InstanceType=m1.medium \
  --no-auto-terminate --name spark_cluster \
  --bootstrap-actions Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb,Name=install_spark \
  --ec2-attributes KeyName=AWS_IAM
# SSH to master node
aws emr ssh --cluster-id JOB_ID --key-pair-file AWS_IAM.pem
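
The create and SSH steps above can be chained so that the cluster ID does not have to be copied by hand. The following is only a sketch, assuming a reasonably recent AWS CLI that provides the `aws emr wait cluster-running` waiter and the global `--query` and `--output` options. The function names `create_cluster` and `wait_and_ssh` are hypothetical helpers of my own, not part of the AWS CLI.

```shell
#!/usr/bin/env bash
# Sketch only; the function names are hypothetical, not part of the AWS CLI.

# Create the cluster and print only its ID. Pass the same options as in the
# create-cluster command above; --query and --output are global CLI options.
create_cluster() {
  aws emr create-cluster "$@" --query ClusterId --output text
}

# Block until the cluster reports it is running, then SSH to the master node.
wait_and_ssh() {
  local cluster_id="$1" key_file="$2"
  aws emr wait cluster-running --cluster-id "$cluster_id" &&
    aws emr ssh --cluster-id "$cluster_id" --key-pair-file "$key_file"
}
```

For example, `JOB_ID=$(create_cluster --ami-version VERSION ...)` followed by `wait_and_ssh "$JOB_ID" AWS_IAM.pem` would replace the two manual steps, with the elided options filled in as in the create-cluster command above.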

# Copy jar to be executed to master node
hadoop fs -get s3n://bucket/jar /tmp
cd spark
sudo mv /tmp/*.jar .
# Run spark job
./bin/spark-submit --master spark://MASTER_HOST:7077 --class "PACKAGE.CLASS" YOUR_JAR.jar JAR_PARAMs
# Terminate Cluster
aws emr terminate-clusters --cluster-ids JOB_ID
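
To make the placeholders in the spark-submit step easier to fill in, here is a small sketch that assembles the command line from its parts. The function name `spark_submit_cmd` is hypothetical, and port 7077 is simply the one used in the command above. It only prints the command, so it is safe to try anywhere.

```shell
#!/usr/bin/env bash
# Sketch only: print the spark-submit command for a given master host,
# main class, jar, and optional job arguments. The function name is hypothetical.
spark_submit_cmd() {
  if [ "$#" -lt 3 ]; then
    echo "usage: spark_submit_cmd MASTER_HOST CLASS JAR [ARGS...]" >&2
    return 1
  fi
  local master_host="$1" main_class="$2" jar="$3"
  shift 3
  # Collect the command in an array so each argument stays a separate word.
  local cmd=(./bin/spark-submit --master "spark://${master_host}:7077" \
    --class "$main_class" "$jar" "$@")
  # Join the words with spaces and print them.
  echo "${cmd[*]}"
}
```

To execute the job instead of printing the command, replace the final `echo` line with `"${cmd[@]}"` and call the function from the spark directory on the master node.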
