This article covers creating a Spark cluster on AWS EMR, running a user-defined JAR on it, and writing the analysis results back to AWS S3.
The AWS CLI is used heavily here, so all of the above tasks can be driven by a simple script. It took me quite some time to configure a working EMR cluster; I hope this article saves others that effort.
```shell
# Install the AWS CLI; see ~/.aws/config for the result
pip install --upgrade awscli
aws configure

# Create the cluster
aws emr create-cluster --ami-version VERSION \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
                    InstanceGroupType=CORE,InstanceCount=1,InstanceType=m1.medium \
  --no-auto-terminate --name spark_cluster \
  --bootstrap-action Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb,Name=install_spark \
  --ec2-attributes KeyName=AWS_IAM

# SSH to the master node
aws emr ssh --cluster-id JOB_ID --key-pair-file AWS_IAM.pem

# Copy the jar to be executed to the master node
hadoop fs -get s3n://bucket/jar /tmp
cd spark
sudo mv /tmp/*.jar .

# Run the Spark job
./bin/spark-submit --master spark://MASTER_HOST:7077 --class "PACKAGE.CLASS" YOUR_JAR.jar JAR_PARAMS

# Terminate the cluster
aws emr terminate-clusters --cluster-ids j-jobid
```
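Since `create-cluster` returns immediately while provisioning continues in the background, the steps above can be chained into one script that blocks until the cluster is actually up before SSHing in. A minimal sketch, assuming the same placeholder key and instance settings as above (`VERSION`, `AWS_IAM` are not real values):

```shell
#!/bin/bash
set -e

# Launch the cluster and capture its ID from the CLI output
CLUSTER_ID=$(aws emr create-cluster --ami-version VERSION \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
                    InstanceGroupType=CORE,InstanceCount=1,InstanceType=m1.medium \
  --no-auto-terminate --name spark_cluster \
  --ec2-attributes KeyName=AWS_IAM \
  --query ClusterId --output text)

# Block until the cluster reaches a running state
aws emr wait cluster-running --cluster-id "$CLUSTER_ID"

# Only now is it safe to SSH in and submit work
aws emr ssh --cluster-id "$CLUSTER_ID" --key-pair-file AWS_IAM.pem
```

`aws emr wait` polls `describe-cluster` under the hood, which avoids hand-rolling a sleep loop.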
Upload pem file to master node and SSH to master
```shell
scp -i admin-key.pem admin-key.pem ec2-user@ecIP.eu-west-1.compute.amazonaws.com:~/
aws emr ssh --cluster-id JOB_ID --key-pair-file admin-key.pem
```
Configure passwordless SSH from master to slaves
```shell
# Identify the IPs of the slave nodes
hdfs dfsadmin -report | grep ^Name | cut -f2 -d: | cut -f2 -d' '

# Load the cluster key into an agent, then generate and distribute a key pair
ssh-agent bash
ssh-add YOUR_AWS_IAM.pem
ssh-keygen            # press Enter at both prompts
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@IP_OF_SLAVE
```
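The per-slave `ssh-copy-id` step can be looped over the IP list produced by the `dfsadmin` report, rather than repeated by hand. A sketch, assuming it runs on the master as the `hadoop` user after the `ssh-keygen` step above:

```shell
# Collect the slave IPs from the HDFS datanode report
SLAVES=$(hdfs dfsadmin -report | grep ^Name | cut -f2 -d: | cut -f2 -d' ')

# Push the public key to each slave in turn
for ip in $SLAVES; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub "hadoop@$ip"
done
```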
A SOCKS proxy is quite handy for browsing web content served from an EC2 instance, or from any similar cloud machine. Hadoop, Spark, and many other cluster solutions provide a web portal for monitoring the cluster or its jobs, and that content is almost always accessible only from localhost.
For AWS EMR, configuring a SOCKS proxy is as simple as it gets:

```shell
./elastic-mapreduce --socks -j <jobflow-id>
```
Then simply configure a SOCKS proxy in your browser pointing at 127.0.0.1, port 8157. FoxyProxy is a popular browser plugin for this.
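Under the hood this is just SSH dynamic port forwarding, so the equivalent tunnel can also be opened by hand without the EMR tooling. A sketch, where the key file and master host name are placeholders:

```shell
# -N: run no remote command; -D 8157: open a local SOCKS proxy on port 8157
# that tunnels browser traffic through the EMR master node
ssh -i AWS_IAM.pem -N -D 8157 hadoop@MASTER_PUBLIC_DNS
```

Leave this running in a terminal while you browse; the Hadoop and Spark UIs then resolve as if you were on the master itself.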
Amazon’s Elastic MapReduce (EMR) is a popular Hadoop-in-the-cloud service. Using EMR, users can provision a Hadoop cluster on Amazon AWS resources and run jobs on it. EMR defines an abstraction called the ‘jobflow’ for submitting jobs to a provisioned cluster. A jobflow contains a set of ‘steps’; each step runs a MapReduce job, a Hive script, a shell executable, and so on. Users can track the status of the jobs from the Amazon EMR console.
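With the AWS CLI, a step can be appended to a running jobflow and tracked without the console. A sketch, where the cluster ID, jar path, and arguments are illustrative placeholders:

```shell
# Submit a custom JAR step to an existing cluster
aws emr add-steps --cluster-id j-JOBID \
  --steps Type=CUSTOM_JAR,Name=my_step,Jar=s3://bucket/my-job.jar,Args=arg1,arg2

# Check the step's status from the CLI instead of the console
aws emr list-steps --cluster-id j-JOBID
```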
Users who have worked with a static Hadoop cluster are used to the Hadoop CLI for submitting jobs, and to the Hadoop JobTracker and NameNode user interfaces for tracking activity on the cluster. It is possible to use EMR in this mode, and this is documented in the extensive EMR documentation. This blog collates the information for using these interfaces into one place for such a usage mode, along with some…