Notes: Mining Massive Data Sets

Chapter 3: Finding Similar Items

‘Jaccard Similarity’ measures the size of intersection between sets. While another approach is collaborative filtering.

Shingling of Documents

Shingling size: k=5 gives power(27, 5) = 14 M possible shingles. This is sufficient for emails as emails are usually shorter than 14 M characters. K = 9 is a common choice for long documents.

Stop-word based ‘shingles from words’ works better when used to find similar news articles. This is because news articles use much more stop words than ads and a like. This means the measured similarity is more sensitive to articles while not so much to the surrounding ads.

Reducing the size of shingles sets:
Hashing singles to 4 bytes each.
Computer signature from shingles using minhashing. This is to record the first non-0 element for each shingle in a permuted matrix representation. p 81. In this way, a N row * M column characteristic matrix is compressed to a k row * M column matrix, where k is the number of randomly selected permutations. “The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.”


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s