## Tuesday, August 28, 2012

### Sketching

In this post, I'm not discussing drawing but instead a technique to efficiently identify candidate near duplicate news articles.

Shingling
In previous posts, I've discussed a technique called "w-shingling", which lets you determine that two news articles are near duplicates. Briefly, given two documents, you

1. calculate a set of w-shingles by identifying all sequences of three words that start with stop words (XQuery, Python)
2. determining the similarity of the w-shingles for the two documents by calculating their Jaccard Coefficient

This technique works well enough, but it requires you to compare every single pairwise combination of documents in your corpus. This can result in quite a few comparisons: even for a relatively small set of 100 documents, there are 4,950 permutations! For a large corpus of documents (such as a news archive with millions of documents) or in situations where you're trying to do real time comparisons of documents (such as when crawling websites) you need a way to efficiently narrow down the number of comparisons you need to perform.

Sketching
The technique of sketching builds upon the w-shingling technique to quickly identify sets of candidate duplicates, which can then be compared using more accurate methods.

To calculate a sketch for a document you:

1. calculate the w-shingles for the document
2. calculate the the hash of each w-shingle
2. group the hashes of w-shingles into n buckets by calculating the modulus (hash mod n)
3. take the smallest hash value as the sketch for this document

The resulting buckets are candidate near duplicates.

Efficient
The sketch technique described above is linear, i.e. you calculate one sketch per document, and requires simple functions built into every modern language (hashing and modulus arithmetic). It entails calculating the w-shingles per document, which we'll use again in the next step (calculating the Jaccard coefficient for candidate pairs). The sketches and the hashed w-shingles don't require nearly as much memory to store as the original text strings. And it results in *far* fewer pairwise comparisons.

If you sort our 100 example documents into 25 buckets, say, then you'd only need to do 150 pairwise comparisons (assuming that the 100 documents got sorted into 4 documents per bucket, which would mean 25 sets of 6 pairwise comparisons).

So, sketching and w-shingling are a very memory and time efficient way to find near duplicates.

More Precise Duplicate Detection
Can we expand this technique even further? If we regard sketches and w-shingling as being increasingly expensive and increasingly accurate ways to identify near duplicates, can we add a third step to provide even more precise results?

To some extent, it depends upon how you are defining duplicates. The w-shingling technique will generally identify very similar news articles that are "about" the same thing, since so many of the key phrases are identical. These might be news articles from different publications written about the same news events (which might be useful for clustering search results, for example).

However, I'm particularly interested in identifying articles that are variants of "the same" text. For example, two different versions of the same news article (perhaps published at different times on different websites, but from the same source, such as The Associated Press). That's why I would like to examine the technique of Longest Common Substring. But in another post...

## Monday, August 27, 2012

### Jaccard Coefficient: Calculating Document Similarity in XQuery and Python

How do you determine that two documents are near duplicates?

Here's one way
First, you calculate the w-shingles for your documents. Then you calculate the Jaccard Coefficient of the shingles: if the result exceeds some threshold, you declare them similar.

Calculating the Jaccard Coefficient in Python
Here's a technique to calculate the Jaccard Coeffecient in Python using sets. This method is adapted from a series of answers I found on stackoverflow.

a = frozenset(("Selling", "a", "beautiful", "house", "in", "California"))
b = frozenset(("Buying", "a", "beautiful", "crip", "in", "California"))
jaccardcoefficent = (float(len(a & b)) / float(len (a|b)))

Note that we have to force the values to floats, in order to avoid integer arithmetic. This is a Python gotcha.

(For clarity and concision, I used strings in the sets. But you'd want to use w-shingles instead).

Calculating the Jaccard Coefficient in XQuery
Now, let's use the same technique to calculate the Jaccard Coeffecient in XQuery. We're also using set operations in XQuery (specifically union and intersection on the XQuery sequences).

let \$a := ("Selling", "a", "beautiful", "house", "in", "California")
let \$b := ("Buying", "a", "beautiful", "crip", "in", "California")
return fn:count(fn:distinct-values((\$a[. = \$b]))) div fn:count(fn:distinct-values((\$a, \$b)))

(Again, you'd want to use w-shingles rather than collections of strings).

Wait
In the spirit of being a lazy programmer, you might want to check whether the XQuery engine you're using already supports this kind of calculation, without you having to re-implement.

For example, Zorba supports several different techniques for calculating string similarity, including a nice simple method for determining the Jaccard Coefficient.

Almost
Now that we can pick out the w-shingles for two documents and calculate the Jaccard Coefficient, you now have enough to do a pretty decent job of identifying near duplicates. And, if you have all the time in the world, you're probably all set. However, it is quite likely that either you have an archive of millions of documents that you want to assess or that you're processing a fast, real time feed of documents. In either case, it is probable that the technique above will be too slow. So, you need a way to more quickly (and less accurately) identify candidate near duplicates. And, there is a technique to do this, known as "sketches".

But this is enough, for this post.