IPython Notebook on AWS EC2

1. install Anaconda3 – it is much easier to install this Python distro straight away than to set up the packages individually.

wget [link to the latest version of anaconda]
bash Anaconda3-*.sh
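
After the installer finishes, reload the shell so Anaconda is on the PATH (assuming you accepted the default prefix ~/anaconda3 and let the installer update .bashrc), then sanity-check:

source ~/.bashrc
which python     # should point into ~/anaconda3
conda --version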

2. generate a password hash in IPython

from IPython.lib import passwd
passwd()
# type a password at the prompt and save the generated hash
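# passwd() prints a hash of the form 'sha1:<salt>:<digest>';
# copy the whole string -- it is the GENERATED_CODE used in step 5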

3. generate notebook profile

ipython profile create nbserver
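
The profile lands under ~/.ipython by default (assuming no custom IPYTHONDIR); the config file edited in step 5 is:

~/.ipython/profile_nbserver/ipython_notebook_config.py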

4. generate openssl cert

conda update openssl

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
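
Optionally, move the cert somewhere predictable and restrict its permissions; the path below is only an example and must match PATH_TO_CERT in step 5:

mkdir -p ~/certs
mv mycert.pem ~/certs/
chmod 600 ~/certs/mycert.pem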

5. edit ipython_notebook_config.py as follows.

# Configuration file for ipython-notebook.
c = get_config()
# Kernel config
c.IPKernelApp.pylab = 'inline'
# Notebook config
c.NotebookApp.certfile = u'/PATH_TO_CERT/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'[GENERATED_CODE]'
c.NotebookApp.port = 8888

6. start IPython notebook server

ipython notebook --profile=nbserver
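
To keep the server alive after the SSH session closes, one option is nohup (a minimal sketch; screen or tmux work just as well):

nohup ipython notebook --profile=nbserver > notebook.log 2>&1 &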

Visit https://ec2-IP.compute-1.amazonaws.com:8888 to use the notebook (make sure the instance's security group allows inbound traffic on port 8888).

Apache Spark on AWS EMR

This article covers creating a Spark cluster on AWS EMR, executing a user-defined jar, and writing the analysis results back to AWS S3.

The AWS CLI is used heavily here, so all of the above tasks can be driven by a simple script. It took me quite some time to configure a working EMR cluster; I hope this article helps others.

# Install the AWS CLI; aws configure writes the credentials to ~/.aws/config
pip install --upgrade awscli
aws configure
 
# Create Cluster
aws emr create-cluster --ami-version VERSION --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=1,InstanceType=m1.medium --no-auto-terminate --name spark_cluster --bootstrap-action Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb,Name=install_spark --ec2-attributes KeyName=AWS_IAM
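
# create-cluster should print the new cluster id as JSON, e.g. {"ClusterId": "j-..."}
# that id is the JOB_ID used below; poll the provisioning state until WAITING:
aws emr describe-cluster --cluster-id JOB_ID --query 'Cluster.Status.State'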
 
# SSH to master node
aws emr ssh --cluster-id JOB_ID --key-pair-file AWS_IAM.pem

# Copy jar to be executed to master node
hadoop fs -get s3n://bucket/jar /tmp
cd spark
sudo mv /tmp/*.jar .
 
# Run spark job
./bin/spark-submit --master spark://MASTER_HOST:7077 --class "PACKAGE.CLASS" YOUR_JAR.jar JAR_PARAMS
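
# a concrete illustration with the bundled Spark examples (the jar path varies
# by Spark version; adjust to whatever your distribution ships):
./bin/spark-submit --master spark://MASTER_HOST:7077 --class org.apache.spark.examples.SparkPi lib/spark-examples-*.jar 10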
  
# Terminate Cluster
aws emr terminate-clusters --cluster-ids j-jobid

Passwordless SSH from Master to Slave on AWS EMR

Upload the pem file to the master node and SSH to the master

scp -i admin-key.pem admin-key.pem ec2-user@ecIP.eu-west-1.compute.amazonaws.com:~/
aws emr ssh --cluster-id JOB_ID --key-pair-file admin-key.pem

Configure passwordless SSH from master to slaves

# Identify the IPs of the slave nodes
hdfs dfsadmin -report | grep ^Name | cut -f2 -d: | cut -f2 -d' '
ssh-agent bash
ssh-add YOUR_AWS_IAM.pem
ssh-keygen    # press Enter at each prompt to accept the defaults
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@IP_OF_SLAVE
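
To verify, loop over the slave IPs from above and run a trivial command on each (a minimal sketch, assuming ssh-copy-id was run for every slave; StrictHostKeyChecking is disabled only to skip the first-connection prompt):

for ip in $(hdfs dfsadmin -report | grep ^Name | cut -f2 -d: | cut -f2 -d' '); do
  ssh -o StrictHostKeyChecking=no hadoop@$ip hostname
done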

Reference for ssh-agent.

Choose S3n over S3 when sharing files with HDFS

AWS S3 supports both the s3 and s3n file systems when communicating with HDFS.

s3n, the S3 native file system, stores files in their original format. s3 is the S3 block file system: block-based storage, essentially an HDFS-like store implemented on top of S3. Other S3 tools cannot recognise the original file format of block storage; all they see is a bunch of block files.

However, s3n imposes a size limit of 5GB per file; s3 does not prevent users from storing files bigger than 5GB.

Besides, s3 puts block files directly into an S3 bucket and occupies the whole bucket without differentiating folders, whilst s3n puts files in their original shape into a folder under an S3 bucket. Hence s3n is more flexible in this sense.
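
The difference shows up directly in the URIs (bucket and path are placeholders):

s3://bucket/file             # block file system: the bucket holds opaque blocks
s3n://bucket/folder/file     # native: an ordinary object under a key prefix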

Since my test files are mostly smaller than 1GB, hadoop fs -cp outperforms hadoop distcp in my tests. Besides, s3n gives faster transfers than s3.

# copy files from hdfs to s3n
hadoop fs -cp hdfs://namenode.company.com/logsfolder/logs s3n://awsid:awskey@bucket/folder
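
For larger datasets, distcp runs the copy as a distributed MapReduce job, which should win once files are big or numerous (same placeholder paths as above):

hadoop distcp hdfs://namenode.company.com/logsfolder/logs s3n://awsid:awskey@bucket/folder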

SSH Tunnel on AWS : Using native Hadoop shell and UI on Amazon EMR

A SOCKS proxy is quite handy for browsing web content served from an EC2 instance or similar cloud machines. Hadoop, Spark, and many other cluster solutions provide a portal page for monitoring the cluster or its jobs, and that content is almost always accessible only from localhost.

For AWS EMR, configuring a SOCKS proxy is as simple as it gets:

./elastic-mapreduce --socks -j <jobflow-id>
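
If the legacy Ruby CLI is not available, the same tunnel can be opened by hand with a dynamic SSH forward, or with the newer AWS CLI (the DNS name, cluster id, and key file are placeholders):

ssh -i AWS_IAM.pem -ND 8157 hadoop@MASTER_PUBLIC_DNS
aws emr socks --cluster-id JOB_ID --key-pair-file AWS_IAM.pem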

Then simply configure a SOCKS proxy in your browser as 127.0.0.1, port 8157. FoxyProxy is one such browser plugin.

If illustrated figures are needed, see this reblogged post:

YHemanth's Blog

Amazon’s Elastic MapReduce (EMR) is a popular Hadoop on the cloud service. Using EMR, users can provision a Hadoop cluster on Amazon AWS resources and run jobs on them. EMR defines an abstraction called the ‘jobflow’ to submit jobs to a provisioned cluster. A jobflow contains a set of ‘steps’. Each step runs a MapReduce job, a Hive script, a shell executable, and so on. Users can track the status of the jobs from the Amazon EMR console.

Users who have used a static Hadoop cluster are used to the Hadoop CLI for submitting jobs and also viewing the Hadoop JobTracker and NameNode user interfaces for tracking activity on the cluster. It is possible to use EMR in this mode, and is documented in the extensive EMR documentation. This blog collates the information for using these interfaces into one place for such a usage mode, along with some…
