1. Install Anaconda3 – installing this Python distribution straight away is much easier than assembling the packages individually.
wget [link to the latest version of anaconda]
2. Generate a password hash in IPython
from IPython.lib import passwd
passwd()  # type the password when prompted and save the generated sha1 hash
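An example session (the hash value shown is a truncated placeholder; save the full string for step 5):
In [1]: from IPython.lib import passwd
In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:...'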
3. Generate a notebook profile
ipython profile create nbserver
4. Generate an OpenSSL certificate
conda update openssl
openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
5. Edit ipython_notebook_config.py (under ~/.ipython/profile_nbserver/) as follows.
# Configuration file for ipython-notebook.
c = get_config()
# Kernel config
c.IPKernelApp.pylab = 'inline'
# Notebook config
c.NotebookApp.certfile = u'/PATH_TO_CERT/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'[GENERATED_CODE]'  # the sha1 hash from step 2
c.NotebookApp.port = 8888
6. Start the IPython notebook server
ipython notebook --profile=nbserver
Visit https://ec2-IP.compute-1.amazonaws.com:8888 to use the notebook.
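As a quick sanity check from the instance itself, assuming the server is running on the default port (the -k flag skips verification of the self-signed certificate):
# Expect an HTML response (the login page) rather than a connection error
curl -k https://localhost:8888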
This article covers creating a Spark cluster on AWS EMR, executing a user-defined jar, and writing the analysis results back to AWS S3.
The AWS CLI is used heavily here, so all of the above tasks can be driven by a simple script. It took me quite some time to configure a working EMR cluster; I hope this article saves others the trouble.
# Install the AWS CLI (its configuration lives under ~/.aws/)
pip install --upgrade awscli
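If the CLI has not been configured yet, credentials and a default region can be set interactively; the values are prompted for and written under ~/.aws/:
# One-time setup of access key, secret key, default region and output format
aws configure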
# Create Cluster
aws emr create-cluster --ami-version VERSION --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=1,InstanceType=m1.medium --no-auto-terminate --name spark_cluster --bootstrap-action Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb,Name=install_spark --ec2-attributes KeyName=AWS_IAM
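create-cluster prints a cluster id (j-XXXXXXXX below is a placeholder); a minimal sketch for polling its state before connecting, assuming the usual describe-cluster output shape:
# Wait until the state reaches WAITING before SSHing in
aws emr describe-cluster --cluster-id j-XXXXXXXX --query 'Cluster.Status.State'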
# SSH to master node
aws emr ssh --cluster-id JOB_ID --key-pair-file AWS_IAM.pem
# Copy the jar to be executed onto the master node
hadoop fs -get s3n://bucket/jar /tmp
sudo mv /tmp/*.jar .
# Run the Spark job
./bin/spark-submit --master spark://MASTER_HOST:7077 --class "PACKAGE.CLASS" YOUR_JAR.jar JAR_PARAMS
# Terminate Cluster
aws emr terminate-clusters --cluster-ids j-jobid
Upload the pem file to the master node and SSH to the master
scp -i admin-key.pem admin-key.pem ec2-user@ecIP.eu-west-1.compute.amazonaws.com:~/
aws emr ssh --cluster-id JOB_ID --key-pair-file admin-key.pem
Configure passwordless SSH from master to slaves
# Identify the IPs of the slave nodes
hdfs dfsadmin -report | grep ^Name | cut -f2 -d: | cut -f2 -d' '
ssh-keygen  # accept the defaults (press Enter at each prompt)
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@IP_OF_SLAVE
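To repeat the key copy for every slave in one go, a small sketch combining the two commands above (assumes the hadoop user on each slave, as above):
# Copy the public key to each slave reported by HDFS; expect one password prompt per node
for ip in $(hdfs dfsadmin -report | grep ^Name | cut -f2 -d: | cut -f2 -d' '); do
  ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@$ip
done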
See the ssh-agent documentation for an alternative that avoids copying keys around.
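A minimal ssh-agent sketch, assuming agent forwarding is acceptable and the cluster key pair (admin-key.pem) is also accepted on the slaves; MASTER_PUBLIC_DNS is a placeholder:
# Load the key into a local agent and forward it to the master (-A), avoiding pem copies
eval $(ssh-agent)
ssh-add admin-key.pem
ssh -A hadoop@MASTER_PUBLIC_DNS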
AWS S3 supports both the s3 and s3n file systems when communicating with HDFS.
s3n, the S3 native file system, stores files in their original format. s3 is the S3 block file system, a block-based store that is essentially an HDFS equivalent backed by S3. With the block file system, other S3 tools cannot recognise the original files; they only see a collection of block files.
However, s3n imposes a size limit of 5 GB per file, whereas s3 does not prevent users from storing files larger than 5 GB.
In addition, s3 writes its block files directly into an S3 bucket and occupies the whole bucket without any folder structure, whilst s3n keeps files in their original shape inside a folder under the bucket. In this sense s3n is the more flexible of the two.
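For example, files written via s3n keep their names and can be listed like any other path (bucket and folder names are placeholders):
# List files stored through the s3n file system
hadoop fs -ls s3n://bucket/folder/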
Since my test files are mostly smaller than 1 GB, hadoop fs -cp outperformed hadoop distcp in my tests. s3n also gave faster transfers than s3.
# Copy files from HDFS to S3 via s3n
hadoop fs -cp hdfs://namenode.company.com/logsfolder/logs s3n://awsid:awskey@bucket/folder
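For comparison, the same copy run as a distributed job; hadoop distcp executes it as a MapReduce job and is the usual choice for larger datasets:
# Distributed copy of the same folder (same placeholder credentials and paths as above)
hadoop distcp hdfs://namenode.company.com/logsfolder/logs s3n://awsid:awskey@bucket/folder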