Monday, August 27, 2012

Jaccard Coefficient: Calculating Document Similarity in XQuery and Python

How do you determine that two documents are near duplicates?

Here's one way
First, you calculate the w-shingles for your documents. Then you calculate the Jaccard Coefficient of the shingles: if the result exceeds some threshold, you declare them similar.

Calculating the Jaccard Coefficient in Python
Here's a technique to calculate the Jaccard Coeffecient in Python using sets. This method is adapted from a series of answers I found on stackoverflow.

a = frozenset(("Selling", "a", "beautiful", "house", "in", "California"))
b = frozenset(("Buying", "a", "beautiful", "crip", "in", "California"))
jaccardcoefficent = (float(len(a & b)) / float(len (a|b)))

Note that we have to force the values to floats, in order to avoid integer arithmetic. This is a Python gotcha.

(For clarity and concision, I used strings in the sets. But you'd want to use w-shingles instead).

Calculating the Jaccard Coefficient in XQuery
Now, let's use the same technique to calculate the Jaccard Coeffecient in XQuery. We're also using set operations in XQuery (specifically union and intersection on the XQuery sequences).

let $a := ("Selling", "a", "beautiful", "house", "in", "California")
let $b := ("Buying", "a", "beautiful", "crip", "in", "California")
return fn:count(fn:distinct-values(($a[. = $b]))) div fn:count(fn:distinct-values(($a, $b)))

(Again, you'd want to use w-shingles rather than collections of strings).

In the spirit of being a lazy programmer, you might want to check whether the XQuery engine you're using already supports this kind of calculation, without you having to re-implement.

For example, Zorba supports several different techniques for calculating string similarity, including a nice simple method for determining the Jaccard Coefficient.

Now that we can pick out the w-shingles for two documents and calculate the Jaccard Coefficient, you now have enough to do a pretty decent job of identifying near duplicates. And, if you have all the time in the world, you're probably all set. However, it is quite likely that either you have an archive of millions of documents that you want to assess or that you're processing a fast, real time feed of documents. In either case, it is probable that the technique above will be too slow. So, you need a way to more quickly (and less accurately) identify candidate near duplicates. And, there is a technique to do this, known as "sketches".

But this is enough, for this post.

No comments:

Post a Comment