
MinHash

In computer science and data mining, MinHash (or the min-wise independent permutations locality-sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997) and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.

The Jaccard similarity coefficient is a commonly used indicator of the similarity between two sets. Let U be a set and let A and B be subsets of U; the Jaccard index is then defined as the ratio of the number of elements of their intersection to the number of elements of their union:

J(A,B) = |A ∩ B| / |A ∪ B|.

This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise. Two sets are more similar (i.e. have relatively more members in common) when their Jaccard index is closer to 1. The goal of MinHash is to estimate J(A,B) quickly, without explicitly computing the intersection and union.

Let h be a hash function that maps the members of U to distinct integers, let perm be a random permutation of the elements of U, and for any set S define hmin(S) to be the minimal member of S with respect to h ∘ perm, that is, the member x of S with the minimum value of h(perm(x)). Applying hmin to both A and B, and assuming no hash collisions, asking whether the two values are equal (hmin(A) = hmin(B)) is equivalent to viewing the entire procedure (drawing a random permutation, hashing, computing the minimum value, comparing the values) as drawing an element of the union A ∪ B uniformly at random and checking whether it belongs to the intersection A ∩ B. The probability of this event is exactly the Jaccard index, therefore

Pr[hmin(A) = hmin(B)] = J(A,B).

That is, the probability that hmin(A) = hmin(B) equals the similarity J(A,B), assuming perm is drawn from a uniform distribution. In other words, if r is the random variable that is one when hmin(A) = hmin(B) and zero otherwise, then r is an unbiased estimator of J(A,B). However, r has too high a variance to be a useful estimator of the Jaccard similarity on its own, because r is always zero or one. The idea of the MinHash scheme is to reduce this variance by averaging together several variables constructed in the same way.

The simplest version of the MinHash scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values of hmin(S) for these k functions. To estimate J(A,B) using this version of the scheme, let y be the number of hash functions for which hmin(A) = hmin(B), and use y/k as the estimate. This estimate is the average of k different 0-1 random variables, each of which is one when hmin(A) = hmin(B) and zero otherwise, and each of which is an unbiased estimator of J(A,B).
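As a concrete illustration, here is a minimal Python sketch of the k-hash-function scheme just described (it is not taken from Broder's paper). For simplicity it stands in the k independent hash functions with Python's built-in hash applied to (seed, element) pairs, which is not min-wise independent but is enough to show the mechanics; the MinHash estimate is compared against the exact Jaccard index.

```python
import random

def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B| of two sets."""
    return len(a & b) / len(a | b)

def minhash_signature(s, seeds):
    """For each seeded hash function, keep the minimum hash value seen over the set."""
    return [min(hash((seed, x)) for x in s) for seed in seeds]

def minhash_estimate(sig_a, sig_b):
    """Fraction of hash functions whose minima agree: the y/k estimate of J(A,B)."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)

# Example: compare two small sets of words.
A = set("the quick brown fox jumps over the lazy dog".split())
B = set("the quick brown fox leaps over a lazy cat".split())

k = 400                                        # number of hash functions
seeds = [random.randrange(2**32) for _ in range(k)]

sig_A = minhash_signature(A, seeds)
sig_B = minhash_signature(B, seeds)

print("exact Jaccard:   ", jaccard(A, B))       # 6/11 ≈ 0.545
print("MinHash estimate:", minhash_estimate(sig_A, sig_B))
```

With k = 400 the estimate typically lands within a few hundredths of the exact value, consistent with the error bound discussed next.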
The average of these k indicator variables is therefore also an unbiased estimator of J(A,B), and by the standard deviation for sums of 0-1 random variables, its expected error is O(1/√k). Consequently, for any constant ε > 0 there is a constant k = O(1/ε²) such that the expected error of the estimate is at most ε. For example, 400 hashes would be required to estimate J(A,B) with an expected error less than or equal to 0.05.
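Taking the hidden constant in the O(1/√k) bound to be 1 (an assumption made only to get a rough sizing rule), the figure of 400 hashes falls out directly:

```python
# Rough sizing rule, assuming expected error ≈ 1/sqrt(k):
# to reach a target error ε, choose k ≈ 1/ε².
eps = 0.05
k = round((1 / eps) ** 2)   # 400 hash functions for ε = 0.05
print(k)                    # 400
```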

[ "Cluster analysis", "Hash function", "Jaccard index" ]