Ensemble Learning

Bagging (bootstrap aggregating)
A simple and straightforward way of ensembling: average the results of multiple models. Each model is trained on a bootstrap sample of the data, drawn with replacement. Each model votes with equal weight: averaging for regression, majority vote for classification (sketch below).
E.g. random forests
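
A minimal sketch of the idea in Python (names like make_model are illustrative, not from the notes); libraries such as scikit-learn package this as BaggingClassifier:

```python
import numpy as np

def train_bagging(make_model, X, y, n_models=10, seed=0):
    """Train n_models copies of a base model, each on a bootstrap sample (drawn with replacement)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample row indices with replacement
        model = make_model()
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def bagging_predict(models, X):
    """Equal-weight majority vote for classification (assumes integer class labels)."""
    preds = np.array([m.predict(X) for m in models])   # shape: (n_models, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```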

Boosting
Train models sequentially. Start with equally weighted data.
Increase weights on misclassified data for the next model.
So on and so forth…
E.g. AdaBoost (sketched below)
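
A minimal sketch of the AdaBoost weight update, assuming binary labels in {-1, +1} and decision stumps as the weak learners (scikit-learn's AdaBoostClassifier is the packaged version):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Minimal AdaBoost for labels y in {-1, +1}, with decision stumps as weak learners."""
    n = len(X)
    w = np.full(n, 1.0 / n)                      # start with equal weights on every example
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # more accurate stumps get a larger say
        w *= np.exp(-alpha * y * pred)           # increase weight on misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```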

Stacking
Train a meta-model that takes the outputs of multiple base models as input.
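
A minimal binary-classification sketch; using out-of-fold predictions (via cross_val_predict) as the meta-features is a common choice to avoid leaking training labels, not something the notes specify:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict

def fit_stacking(base_models, meta_model, X, y):
    """Fit base models and a meta-model on their out-of-fold predicted probabilities (binary case)."""
    meta_features = np.column_stack([
        cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
        for m in base_models
    ])
    meta_model.fit(meta_features, y)
    for m in base_models:
        m.fit(X, y)          # refit each base model on all data for use at prediction time
    return base_models, meta_model

def predict_stacking(base_models, meta_model, X):
    meta_features = np.column_stack([m.predict_proba(X)[:, 1] for m in base_models])
    return meta_model.predict(meta_features)
```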

Notes: Mining Massive Data Sets

Chapter 3: Finding Similar Items

‘Jaccard similarity’ measures the ratio of the size of the intersection to the size of the union of two sets. Another approach to finding similar items is collaborative filtering.
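
A small helper for computing it:

```python
def jaccard_similarity(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| for two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# jaccard_similarity({1, 2, 3, 4}, {3, 4, 5}) -> 2/5 = 0.4
```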

Shingling of Documents

Shingle size: k = 5 gives 27^5 ≈ 14 million possible shingles. This is sufficient for emails, since emails are usually much shorter than 14 million characters. k = 9 is a common choice for long documents.
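
A small sketch of character k-shingling (the example string is made up):

```python
def char_shingles(text, k=9):
    """Set of all length-k substrings (k-shingles) of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# char_shingles("abcdab", k=2) -> {"ab", "bc", "cd", "da"}
```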

Stop-word-based ‘shingles from words’ (a stop word followed by the next two words) work better for finding similar news articles. News articles use many more stop words than ads and the like, so the measured similarity is sensitive to the article text and much less so to the surrounding ads.
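
A sketch of the stop-word variant; the stop-word list here is a tiny illustrative one:

```python
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "that", "is"}  # tiny illustrative list

def stop_word_shingles(text):
    """Shingles of the form 'stop word + next two words'."""
    words = text.lower().split()
    return {" ".join(words[i:i + 3])
            for i in range(len(words) - 2)
            if words[i] in STOP_WORDS}
```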

Reducing the size of shingle sets:
Hash shingles to 4 bytes each.
Compute signatures from the shingles using minhashing: for each set (column of the characteristic matrix), record the row index of the first 1 under a random permutation of the rows (p. 81). In this way, an N-row × M-column characteristic matrix is compressed to a k-row × M-column signature matrix, where k is the number of randomly chosen permutations. “The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.”
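
A sketch of computing a signature matrix; instead of materializing random permutations, each permutation is simulated with a random hash function applied to the shingles (the specific hash form and prime here are my own choices):

```python
import random

def minhash_signature(shingle_sets, num_hashes=100, seed=0):
    """Compute a num_hashes x M signature matrix from M non-empty shingle sets.

    Each "permutation" is simulated by a random hash h(x) = (a*x + b) mod p;
    the minhash value for a set is the minimum h(x) over its members.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1                                           # a large prime
    hash_funcs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]

    signature = []
    for a, b in hash_funcs:
        signature.append([min((a * hash(x) + b) % p for x in s) for s in shingle_sets])
    return signature                                            # num_hashes rows, one column per set

# The fraction of rows on which two columns agree estimates the Jaccard similarity of the two sets.
```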