Spark on Yarn: Set Yarn Memory Overhead

My Spark jobs on EMR were stopping for no apparent reason. After some investigation, it seems the default setting for spark.yarn.executor.memoryOverhead is way too small. I resolved the issue by passing --conf spark.yarn.executor.memoryOverhead=2048 when starting spark-shell.
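As a rough sketch of how one might size this setting: a commonly cited heuristic (close to the formula newer Spark releases use) is roughly 10% of executor memory with a 384 MB floor. The 4096 MB executor size below is just an example value, not a recommendation:

```shell
# Heuristic overhead calculation: ~10% of executor memory, 384 MB minimum.
# (These figures are a common rule of thumb, not an exact default for
# every Spark version.)
executor_mem_mb=4096          # example executor size in MB
overhead=$(( executor_mem_mb / 10 ))
if [ "$overhead" -lt 384 ]; then
  overhead=384
fi
echo "--conf spark.yarn.executor.memoryOverhead=${overhead}"
```

The printed flag can then be appended to the spark-shell or spark-submit command line.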


Apache Spark on AWS EMR

This article covers creating a Spark cluster on AWS EMR, executing a user-defined jar, and writing the analysis results back to AWS S3.

The AWS CLI is used heavily here, so all of the above tasks can be driven by a simple script. It took me quite some time to configure a usable EMR cluster; I hope this article will help others.

# Install AWS cli, reference ~/.aws/config for the result
pip install --upgrade awscli
aws configure
# Create Cluster
aws emr create-cluster --ami-version VERSION --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=1,InstanceType=m1.medium --no-auto-terminate --name spark_cluster --bootstrap-action Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb,Name=install_spark --ec2-attributes KeyName=AWS_IAM
# SSH to master node
aws emr ssh --cluster-id JOB_ID --key-pair-file AWS_IAM.pem

# Copy jar to be executed to master node
hadoop fs -get s3n://bucket/jar /tmp
cd spark
sudo mv /tmp/*.jar .
# Run spark job
./bin/spark-submit --master spark://MASTER_HOST:7077 --class "PACKAGE.CLASS" YOUR_JAR.jar JAR_PARAMs
# Terminate Cluster
aws emr terminate-clusters --cluster-ids j-jobid

Spark and Scala and Big Data in General


Processing JSON in Scala for Spark.

Effective Scala

A guide for advanced Scala users.

Apache Spark

Spark is an in-memory platform for fast computation (eBay uses Spark); Hazelcast is a data grid that is more about storage than computation; Cascading is a tool for building data processing pipelines.

Spark setup script for AWS EMR, provided by Amazon on S3.

Best Practice for Testing Spark

Report Analysis Using Spark: Spark Report Patterns

Have you tried the Spark API yet? I mean all of it: examples for all the API functions.

Experience with Spark: the author has extensive hands-on experience with Spark and seems to have run into many issues. One of the comments is from a Spark contributor, who points readers to some ways of improving application performance when using Spark.

AWS and Hadoop related

hdfs (or hadoop fs) is a client-side tool for working with a Hadoop cluster; it requires proper configuration. The hdfs user is the superuser.

Boost performance of S3

Hadoop for dummies by Yahoo!

Deploying a Hadoop 2.0 Cluster on EC2 with HDP2

Install Spark cluster on VMs

It is a common use case to configure a cluster on several VMs using the likes of KVM or VirtualBox. Surprisingly, I could not find any document on building such a Spark cluster, so I will write one on building a Spark standalone cluster in case anyone else is trying to do the same thing. Although the official Spark documentation is pretty good, it still took me some time to configure everything properly. The steps are as follows.

First, I downloaded the Hortonworks Sandbox V2.1 in OVA format. The default hostname is “”. Import the instance into VirtualBox.

Download Apache Spark 1.0 prebuilt for Hadoop 2, which matches the environment of the Hortonworks Sandbox. Extract the package to /home/spark/, and the instance is ready to use. Importantly, Hadoop is not a prerequisite; it is also possible to build Spark locally using Maven. Hadoop 2 is used here only to act as a Hadoop data source.

cd /tmp
mkdir -p /home/spark/
mv *.tgz /home/spark/
cd /home/spark/
tar xzvf *.tgz
cd spark-1.0.0-bin-hadoop2
# the spark instance is ready to use in /home/spark/spark-1.0.0-bin-hadoop2

Make a clone of the VM configured above and change the hostname of the second one to “”. Add the IPs of both VMs to the hosts file on both VMs so they can route to each other; the Spark workers reference each other by hostname.

# Edit /etc/hosts
localhost.localdomain localhost sandbox slavesandbox
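As an illustration only, the hosts file on both machines might end up looking like the sketch below. The private IP addresses are hypothetical; substitute the actual addresses of your VMs:

```
# /etc/hosts on both VMs (hypothetical private IPs, for illustration)
127.0.0.1        localhost.localdomain localhost
192.168.56.101   sandbox
192.168.56.102   slavesandbox
```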

Configure Apache Spark and start the cluster from master.

# Edit ${SparkHome}/conf/slaves
# Start cluster
$ /home/spark/spark-bin/sbin/
# The status of the cluster is visible from
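For completeness, a sketch of the slaves file referenced above: in a standalone deployment, ${SparkHome}/conf/slaves simply lists one worker hostname per line. Assuming the two hostnames used earlier, it might contain:

```
# ${SparkHome}/conf/slaves: one worker hostname per line
sandbox
slavesandbox
```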

The Spark master is ready at “”.