*How do you determine that two documents are near duplicates?*

**Here's one way**

First, you calculate the w-shingles for your documents. Then you calculate the Jaccard Coefficient of the shingles: if the result exceeds some threshold, you declare them similar.

**Calculating the Jaccard Coefficient in Python**

Here's a technique to calculate the Jaccard Coeffecient in Python using sets. This method is adapted from a series of answers I found on stackoverflow.

a = frozenset(("Selling", "a", "beautiful", "house", "in", "California"))

b = frozenset(("Buying", "a", "beautiful", "crip", "in", "California"))

jaccardcoefficent = (float(len(a & b)) / float(len (a|b)))

Note that we have to force the values to floats, in order to avoid integer arithmetic. This is a Python gotcha.

(For clarity and concision, I used strings in the sets. But you'd want to use w-shingles instead).

**Calculating the Jaccard Coefficient in XQuery**

Now, let's use the same technique to calculate the Jaccard Coeffecient in XQuery. We're also using set operations in XQuery (specifically union and intersection on the XQuery sequences).

let $a := ("Selling", "a", "beautiful", "house", "in", "California")

let $b := ("Buying", "a", "beautiful", "crip", "in", "California")

return fn:count(fn:distinct-values(($a[. = $b]))) div fn:count(fn:distinct-values(($a, $b)))

let $b := ("Buying", "a", "beautiful", "crip", "in", "California")

return fn:count(fn:distinct-values(($a[. = $b]))) div fn:count(fn:distinct-values(($a, $b)))

(Again, you'd want to use w-shingles rather than collections of strings).

**Wait**

In the spirit of being a lazy programmer, you might want to check whether the XQuery engine you're using already supports this kind of calculation, without you having to re-implement.

For example, Zorba supports several different techniques for calculating string similarity, including a nice simple method for determining the Jaccard Coefficient.

**Almost**

Now that we can pick out the w-shingles for two documents and calculate the Jaccard Coefficient, you now have enough to do a pretty decent job of identifying near duplicates. And, if you have all the time in the world, you're probably all set. However, it is quite likely that either you have an archive of millions of documents that you want to assess or that you're processing a fast, real time feed of documents. In either case, it is probable that the technique above will be too slow. So, you need a way to more quickly (and less accurately) identify candidate near duplicates. And, there is a technique to do this, known as "sketches".

But this is enough, for this post.

Wow, great post.

ReplyDeleteThis comment has been removed by a blog administrator.

ReplyDeleteThanks for nice article.

ReplyDeleteThis comment has been removed by a blog administrator.

ReplyDeleteThis comment has been removed by a blog administrator.

ReplyDelete