Notes: Mining Massive Data Sets

Chapter 3: Finding Similar Items

‘Jaccard Similarity’ measures the size of intersection between sets. While another approach is collaborative filtering.

Shingling of Documents

Shingling size: k=5 gives power(27, 5) = 14 M possible shingles. This is sufficient for emails as emails are usually shorter than 14 M characters. K = 9 is a common choice for long documents.

Stop-word based ‘shingles from words’ works better when used to find similar news articles. This is because news articles use much more stop words than ads and a like. This means the measured similarity is more sensitive to articles while not so much to the surrounding ads.

Reducing the size of shingles sets:
Hashing singles to 4 bytes each.
Computer signature from shingles using minhashing. This is to record the first non-0 element for each shingle in a permuted matrix representation. p 81. In this way, a N row * M column characteristic matrix is compressed to a k row * M column matrix, where k is the number of randomly selected permutations. “The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.”


SSH Tunnel on AWS : Using native Hadoop shell and UI on Amazon EMR

Socks Proxy is quite handy for browsing web content from EC2 instance and all the similar cloud machines. This is because hadoop, spark and many useful clustering solutions provide a portal page for users to monitor the cluster or jobs. The content is almost always only accessible from localhost.

For AWS EMR, to configure a socks proxy is as simple as it gets:

./elastic-mapreduce --describe -j <jobflow-id>

Then simply configure SOCKS proxy in your browser as port 8157. FoxyProxy is such a plugin for browsers.

If vivid figures are needed,

YHemanth's Blog

Amazon’s Elastic MapReduce (EMR) is a popular Hadoop on the cloud service. Using EMR, users can provision a Hadoop cluster on Amazon AWS resources and run jobs on them. EMR defines an abstraction called the ‘jobflow’ to submit jobs to a provisioned cluster. A jobflow contains a set of ‘steps’. Each step runs a MapReduce job, a Hive script, a shell executable, and so on. Users can track the status of the jobs from the Amazon EMR console.

Users who have used a static Hadoop cluster are used to the Hadoop CLI for submitting jobs and also viewing the Hadoop JobTracker and NameNode user interfaces for tracking activity on the cluster. It is possible to use EMR in this mode, and is documented in the extensive EMR documentation. This blog collates the information for using these interfaces into one place for such a usage mode, along with some…

View original post 1,207 more words

Spark and Scala and Big Data in General


Process JSON in scala for Spark.

Effective Scala

Guide to advanced Scala user.

Apache Spark

Spark is a in-mem platform for fast compute (ebay uses Spark); Hazelcast is data grid that is more about storage than computing; Cascading is a tool for building data processing pipline.

Spark set up script for AWS EMR from s3 provided by amazon

Best Practice for Testing Spark

Report Analysis Using Spark: Spark Report Patterns

Have you tried Spark API yet, I mean all of them? Examples for all the API functions.

Experience with Spark: The author has extensive hands on experiences with Spark. It seems he has encountered many issues. The one of the comments is from a Spark contributor, who points readers to some ways of improving app performance when using Spark.

AWS and Hadoop related

hdfs or hadoop fs is a client side tool to work with hadoop cluster. It requires proper config. hdfs user is the super user.

Boost performance of S3

Hadoop for dummies by Yahoo!

Deploying a Hadoop 2.0 Cluster on EC2 with HDP2