Install a Spark cluster on VMs

It is a common use case to configure a cluster on several VMs using the likes of KVM or VirtualBox. Surprisingly, I could not find any document on building such a Spark cluster, so I will write one on building a Spark standalone cluster in case anyone else is trying to do the same thing. Although the official Spark documentation is pretty good, it still took me some time to configure everything properly. The steps are as follows.

First, I downloaded the Hortonworks Sandbox V2.1 in OVA format. The default hostname is “sandbox.hortonworks.com”. Import the instance into VirtualBox.
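If you prefer the command line over the VirtualBox GUI, the import can be scripted. A minimal sketch, assuming the downloaded appliance is named Hortonworks_Sandbox_2.1.ova (adjust to the file name you actually downloaded):

# Import the OVA appliance into VirtualBox (the file name is an assumption)
VBoxManage import Hortonworks_Sandbox_2.1.ova
# Confirm the VM has been registered
VBoxManage list vms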

Download Apache Spark 1.0 prebuilt for Hadoop 2, which matches the Hadoop environment of the Hortonworks Sandbox. Extract the package to /home/spark/ and the instance is ready to use. Note that Hadoop is not a prerequisite; it is also possible to build Spark locally using Maven. The Hadoop 2 build is used here only so that Hadoop can act as a data source for Spark.

cd /tmp
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.0-bin-hadoop2.tgz
mkdir -p /home/spark/
mv spark-1.0.0-bin-hadoop2.tgz /home/spark/
# extract the tarball where it now lives
cd /home/spark/
tar xzvf spark-1.0.0-bin-hadoop2.tgz
cd spark-1.0.0-bin-hadoop2
# the spark instance is ready to use in /home/spark/spark-1.0.0-bin-hadoop2

Make a clone of the VM configured above and change the hostname of the clone to “slave.sandbox.hortonworks.com”. Add the IPs of both VMs to the /etc/hosts file on both VMs so they can reach each other, because Spark workers reference each other by hostname.
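Changing the hostname on the clone looks roughly like this; a minimal sketch, assuming the Sandbox is a CentOS 6 guest where the persistent hostname is kept in /etc/sysconfig/network:

# On the cloned VM: set the hostname for the current session
hostname slave.sandbox.hortonworks.com
# Persist it across reboots (CentOS 6 style; the guest OS is an assumption)
sed -i 's/^HOSTNAME=.*/HOSTNAME=slave.sandbox.hortonworks.com/' /etc/sysconfig/network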

# Edit /etc/hosts
127.0.0.1       localhost.localdomain localhost
10.149.2.111    sandbox.hortonworks.com sandbox
10.149.2.222    slave.sandbox.hortonworks.com slavesandbox
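To confirm that each VM can resolve and reach the other by hostname, a quick check:

# From the master VM
ping -c 1 slave.sandbox.hortonworks.com
# From the slave VM
ping -c 1 sandbox.hortonworks.com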

Configure Apache Spark and start the cluster from the master.
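Note that sbin/start-all.sh launches the workers over SSH, so the master needs passwordless SSH access to every host that will be listed in conf/slaves below (including itself). A minimal sketch, assuming everything runs as root:

# On the master: generate a key pair if one does not already exist
ssh-keygen -t rsa
# Copy the public key to every host listed in conf/slaves
ssh-copy-id root@sandbox.hortonworks.com
ssh-copy-id root@slave.sandbox.hortonworks.com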

# Edit ${SparkHome}/conf/slaves: one worker hostname per line
# (listing the master here starts a worker on it as well)
slave.sandbox.hortonworks.com
sandbox.hortonworks.com
 
# Start cluster
$ /home/spark/spark-1.0.0-bin-hadoop2/sbin/start-all.sh
# The status of the cluster is visible at http://sandbox.hortonworks.com:8080

The Spark master is now ready at “spark://sandbox.hortonworks.com:7077”.
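To sanity-check that the workers have registered, a minimal sketch: point spark-shell at the standalone master and run a trivial job.

# Connect an interactive shell to the standalone master
/home/spark/spark-1.0.0-bin-hadoop2/bin/spark-shell --master spark://sandbox.hortonworks.com:7077
# Inside the shell, run a small job that executes on the workers:
#   scala> sc.parallelize(1 to 1000).sum()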