smyles blog: Jaccard Coefficient: Calculating Document Similarity in XQuery and Python

Monday, August 27, 2012

Jaccard Coefficient: Calculating Document Similarity in XQuery and Python

How do you determine that two documents are near duplicates?

Here's one way
First, you calculate the w-shingles for your documents. Then you calculate the Jaccard Coefficient of the shingles: if the result exceeds some threshold, you declare them similar.

Calculating the Jaccard Coefficient in Python
Here's a technique to calculate the Jaccard Coeffecient in Python using sets. This method is adapted from a series of answers I found on stackoverflow.

a = frozenset(("Selling", "a", "beautiful", "house", "in", "California"))
b = frozenset(("Buying", "a", "beautiful", "crip", "in", "California"))
jaccardcoefficent = (float(len(a & b)) / float(len (a|b)))
Note that we have to force the values to floats, in order to avoid integer arithmetic. This is a Python gotcha.

(For clarity and concision, I used strings in the sets. But you'd want to use w-shingles instead).

Calculating the Jaccard Coefficient in XQuery
Now, let's use the same technique to calculate the Jaccard Coeffecient in XQuery. We're also using set operations in XQuery (specifically union and intersection on the XQuery sequences).

let $a := ("Selling", "a", "beautiful", "house", "in", "California")
let $b := ("Buying", "a", "beautiful", "crip", "in", "California")
return fn:count(fn:distinct-values(($a[. = $b]))) div fn:count(fn:distinct-values(($a, $b)))

(Again, you'd want to use w-shingles rather than collections of strings).

Wait
In the spirit of being a lazy programmer, you might want to check whether the XQuery engine you're using already supports this kind of calculation, without you having to re-implement.

For example, Zorba supports several different techniques for calculating string similarity, including a nice simple method for determining the Jaccard Coefficient.

Almost
Now that we can pick out the w-shingles for two documents and calculate the Jaccard Coefficient, you now have enough to do a pretty decent job of identifying near duplicates. And, if you have all the time in the world, you're probably all set. However, it is quite likely that either you have an archive of millions of documents that you want to assess or that you're processing a fast, real time feed of documents. In either case, it is probable that the technique above will be too slow. So, you need a way to more quickly (and less accurately) identify candidate near duplicates. And, there is a technique to do this, known as "sketches".

But this is enough, for this post.

5 comments:

UnknownDecember 26, 2017 at 10:29 PM
Wow, great post.
ReplyDelete
Replies
UnknownDecember 28, 2017 at 6:37 AM
This comment has been removed by a blog administrator.
ReplyDelete
Replies
AnonymousJanuary 29, 2018 at 8:56 PM
Thanks for nice article.
ReplyDelete
Replies
AnonymousJanuary 29, 2018 at 8:56 PM
This comment has been removed by a blog administrator.
ReplyDelete
Replies
tohan gibFebruary 14, 2018 at 7:36 AM
This comment has been removed by a blog administrator.
ReplyDelete
Replies

Add comment

smyles blog

SyntaxHighlighter

SyntaxHighlighter

Monday, August 27, 2012

Jaccard Coefficient: Calculating Document Similarity in XQuery and Python

5 comments:

Popular Posts

About Me

What the People I Follow on Twitter Are RT'ing

Blog Archive