My Spark jobs on EMR were stopping for no apparent reason. After some investigation, it turned out the default value of spark.yarn.executor.memoryOverhead is way too small. I resolved the issue by passing `--conf spark.yarn.executor.memoryOverhead=2048` when starting spark-shell.
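For example, the overhead can be raised on the command line when launching the shell or submitting a job. The value of 2048 MB is the one that worked for me; tune it to your executor sizes. The class and jar names below are placeholders:

```shell
# Raise the off-heap memory YARN reserves per executor (value in MB)
spark-shell --conf spark.yarn.executor.memoryOverhead=2048

# The same flag works for spark-submit (my.Main and my-job.jar are placeholders)
spark-submit --conf spark.yarn.executor.memoryOverhead=2048 --class my.Main my-job.jar
```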
This article covers creating a Spark cluster on AWS EMR, executing a user-defined jar, and writing the analysis results back to AWS S3.
The AWS CLI is used heavily here, so all of the above tasks can be fully defined in a simple script. It took me quite some time to configure a working EMR cluster; I hope this article helps others.
```shell
# Install the AWS CLI; the configuration ends up in ~/.aws/config
pip install --upgrade awscli
aws configure

# Create the cluster
aws emr create-cluster --ami-version VERSION \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
                    InstanceGroupType=CORE,InstanceCount=1,InstanceType=m1.medium \
  --no-auto-terminate --name spark_cluster \
  --bootstrap-action Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb,Name=install_spark \
  --ec2-attributes KeyName=AWS_IAM

# SSH to the master node
aws emr ssh --cluster-id JOB_ID --key-pair-file AWS_IAM.pem

# Copy the jar to be executed to the master node
hadoop fs -get s3n://bucket/jar /tmp
cd spark
sudo mv /tmp/*.jar .

# Run the Spark job
./bin/spark-submit --master spark://MASTER_HOST:7077 --class "PACKAGE.CLASS" YOUR_JAR.jar JAR_PARAMS

# Terminate the cluster
aws emr terminate-clusters --cluster-ids j-jobid
```
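As a side note not in the script above, the AWS CLI can also poll until the cluster is up before you SSH in; the cluster id below is a placeholder:

```shell
# Show the current state of the cluster (STARTING, BOOTSTRAPPING, WAITING, ...)
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.Status.State' --output text

# Block until the cluster reaches a running state
aws emr wait cluster-running --cluster-id j-XXXXXXXXXXXXX
```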
Spark is an in-memory platform for fast computation (eBay uses Spark); Hazelcast is a data grid that is more about storage than computation; Cascading is a tool for building data processing pipelines.
Spark setup script for AWS EMR, provided by Amazon on S3: http://shrub.appspot.com/elasticmapreduce/samples/spark/1.0.0/
Best Practice for Testing Spark
Report Analysis Using Spark: Spark Report Patterns
Have you tried the Spark API yet, I mean all of it? Examples for all of the API functions.
Experience with Spark: the author has extensive hands-on experience with Spark and seems to have encountered many issues. One of the comments is from a Spark contributor, who points readers to ways of improving application performance when using Spark.
AWS and Hadoop related
`hdfs` and `hadoop fs` are client-side tools for working with a Hadoop cluster. They require proper configuration. The `hdfs` user is the superuser.
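A few typical client invocations, assuming the client is already configured to point at the cluster (the paths and user name below are examples):

```shell
# List the root of HDFS
hadoop fs -ls /

# Operations that need superuser rights are run as the hdfs user
sudo -u hdfs hadoop fs -mkdir /user/alice
sudo -u hdfs hadoop fs -chown alice /user/alice
```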
Configuring a cluster on several VMs using the likes of KVM/VirtualBox is a common use case, yet surprisingly I could not find any document on building such a Spark cluster. So I will write one on building a Spark standalone cluster, in case anyone else is trying to do the same thing. Admittedly the official Spark documentation is pretty good, but it still took me some time to configure everything properly. Here are the steps.
First, I downloaded the Hortonworks Sandbox V2.1 in OVA format. The default hostname is “sandbox.hortonworks.com”. Import the instance into VirtualBox.
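The import can also be done from the command line; the appliance filename and VM name below are assumptions, use whatever your download and import produced:

```shell
# Import the downloaded appliance (filename is an example)
VBoxManage import Hortonworks_Sandbox_2.1.ova

# Start it headless once imported (VM name is an example)
VBoxManage startvm "Hortonworks Sandbox" --type headless
```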
Download Apache Spark 1.0 prebuilt for Hadoop 2, which matches the environment of the Hortonworks Sandbox. Extract the package to /home/spark/, and the instance is ready to use. Importantly, Hadoop is not a prerequisite; it is also possible to build Spark locally using Maven. Hadoop 2 is used here only to act as a Hadoop data source.
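The download and extraction look roughly like this. The mirror URL is an assumption; any Apache mirror carrying the 1.0.0 Hadoop 2 build will do:

```shell
# Fetch the Hadoop 2 prebuilt package of Spark 1.0.0 (mirror URL is an assumption)
wget http://archive.apache.org/dist/spark/spark-1.0.0/spark-1.0.0-bin-hadoop2.tgz

# Extract it under /home/spark/
sudo mkdir -p /home/spark
sudo tar -xzf spark-1.0.0-bin-hadoop2.tgz -C /home/spark/
```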
Make a clone of the VM configured above and change the hostname of the clone to “slave.sandbox.hortonworks.com”. Add the IPs of both VMs to the hosts file of both VMs so they can route to each other; this is necessary because Spark workers reference each other by hostname.
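The hosts file entries end up looking like this on both VMs; the IP addresses below are examples, use the ones VirtualBox actually assigned:

```
# /etc/hosts on BOTH VMs (IPs are examples)
192.168.56.101  sandbox.hortonworks.com
192.168.56.102  slave.sandbox.hortonworks.com
```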
Configure Apache Spark and start the cluster from the master node.
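Concretely, with the standalone scripts that ship with Spark 1.0, this amounts to listing the worker in conf/slaves and running the start script on the master (hostnames as set up above):

```shell
# In the Spark directory on the master: register the worker
echo "slave.sandbox.hortonworks.com" >> conf/slaves

# Starts the master plus one worker per entry in conf/slaves (connects over SSH)
./sbin/start-all.sh
```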
The Spark master is now ready at “sandbox.hortonworks.com:7077”.
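A quick way to verify the cluster, assuming the setup above, is to point a shell at the master URL; the standalone master also serves a status web UI on port 8080:

```shell
# Attach an interactive shell to the standalone master
./bin/spark-shell --master spark://sandbox.hortonworks.com:7077
```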