I am a
lazy programmer. So, when I decided that I wanted to try detecting near duplicate news articles in
XQuery using the
shingling technique, naturally I
googled it.
Sadly, I couldn't find exactly what I was looking for, so I was forced to write it myself.
Shingling in XQuery for the Impatient
Here's the code I came up with:
let $string := "the quick brown fox jumps over the lazy dog. now is the time for all good men to come to the aid of the party"
let $shingle-length := 3
let $tokens := fn:tokenize ($string , " ")
for $w at $position in $tokens
let $shingle := subsequence($tokens, $position - $shingle-length, $shingle-length)
let $shingleanchor := $tokens[$position - $shingle-length]
where (($position gt $shingle-length) and (fn:string-length($shingleanchor) lt 4))
return
element shingle
{
fn:string-join($shingle, " ")
}
And here's what you get when you run it:
<shingle>the quick brown</shingle>
<shingle>fox jumps over</shingle>
<shingle>the lazy dog.</shingle>
<shingle>now is the</shingle>
<shingle>is the time</shingle>
<shingle>the time for</shingle>
<shingle>for all good</shingle>
<shingle>all good men</shingle>
<shingle>men to come</shingle>
<shingle>to come to</shingle>
<shingle>to the aid</shingle>
<shingle>the aid of</shingle>
<shingle>aid of the</shingle>
Some Explanation
Let's start with the problem I'd like to solve: detecting near duplicates. It is straightforward to determine that two strings of text are exact duplicates of one another. A much more interesting problem is to detect that two documents are
near duplicates. For example, consider that two web pages might differ in only small ways - such as placement of adverts or Facebook widgets. To a person looking at them, they would quickly surmise that these small differences are insignificant, if the main text is the same. For news in particular, most people would say that two text articles are the same if the only difference is the headline on the two.
There are many techniques that have been put forward for detecting near duplicates, including simhash and longest common subsquence. But I'm interested in using the shingling technique (I might try out the others, too, of course).
What is a "Shingle"?
In natural language processing, a "shingle" is a sequence of tokens from a document. The set of unique shingles from a document is known as a "
w-shingling" (where the
w indicates the length of the shingles). Shingles are typically either composed of non-whitespace characters (numbers and letters) or complete words.
For example, the 3-shingling of the phrase
If you think you can win, you can win.
Using characters for tokens would be
(ify, fyo, you, thi, hin, nky, kyo, ouc, uca, can, anw, nwi, win)
Whereas the 3-shingling of the phrase using words for tokens would be
(if you think, you think you, think you can, you can win)
And, for a final twist, I want to ensure that my shingles start with a
stop word. This idea comes from the SIGIR 2008 paper "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections" (
PDF). They found that near duplicate news articles (specifically articles from the Associated Press) could be identified just from the shingles that start with stop words. In my XQuery implementation, I approximate stop words by identifying words that are less than four characters long. (You could use a list of actual stop words, if you prefer. But using the 3 characters or less approximation results in more compact example code).
Determining Near Duplicates
But how do you actually use the shingles to determine that two documents are, in fact, near duplicates?
I'll leave that as an exercise for you to work through. But the essential technique revolves around the
Jaccard similarity co-efficient.