Monday, July 23, 2012

Shingling in XQuery

I am a lazy programmer. So, when I decided that I wanted to try detecting near duplicate news articles in XQuery using the shingling technique, naturally I googled it. Sadly, I couldn't find exactly what I was looking for, so I was forced to write it myself.
Shingle by M J M

Shingling in XQuery for the Impatient

Here's the code I came up with:

let $string := "the quick brown fox jumps over the lazy dog. now is the time for all good men to come to the aid of the party"
let $shingle-length := 3
let $tokens := fn:tokenize ($string , " ")
for $w at $position in $tokens
let $shingle := subsequence($tokens, $position - $shingle-length, $shingle-length)
let $shingleanchor := $tokens[$position - $shingle-length]
where (($position gt $shingle-length) and (fn:string-length($shingleanchor) lt 4))
element shingle 
    fn:string-join($shingle, " ")

And here's what you get when you run it:

<shingle>the quick brown</shingle>
<shingle>fox jumps over</shingle>
<shingle>the lazy dog.</shingle>
<shingle>now is the</shingle>
<shingle>is the time</shingle>
<shingle>the time for</shingle>
<shingle>for all good</shingle>
<shingle>all good men</shingle>
<shingle>men to come</shingle>
<shingle>to come to</shingle>
<shingle>to the aid</shingle>
<shingle>the aid of</shingle>
<shingle>aid of the</shingle>
Pink Shingle by Sue & Pip

Some Explanation

Let's start with the problem I'd like to solve: detecting near duplicates. It is straightforward to determine that two strings of text are exact duplicates of one another. A much more interesting problem is to detect that two documents are near duplicates. For example, consider that two web pages might differ in only small ways - such as placement of adverts or Facebook widgets. To a person looking at them, they would quickly surmise that these small differences are insignificant, if the main text is the same. For news in particular, most people would say that two text articles are the same if the only difference is the headline on the two.
There are many techniques that have been put forward for detecting near duplicates, including simhash and longest common subsquence. But I'm interested in using the shingling technique (I might try out the others, too, of course).

What is a "Shingle"?

In natural language processing, a "shingle" is a sequence of tokens from a document. The set of unique shingles from a document is known as a "w-shingling" (where the w indicates the length of the shingles). Shingles are typically either composed of non-whitespace characters (numbers and letters) or complete words.

For example, the 3-shingling of the phrase

If you think you can win, you can win.

Using characters for tokens would be

(ify, fyo, you, thi, hin, nky, kyo, ouc, uca, can, anw, nwi, win)

Whereas the 3-shingling of the phrase using words for tokens would be

(if you think, you think you, think you can, you can win)
Shingles by Roger Smith
And, for a final twist, I want to ensure that my shingles start with a stop word. This idea comes from the SIGIR 2008 paper "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections" (PDF). They found that near duplicate news articles (specifically articles from the Associated Press) could be  identified just from the shingles that start with stop words. In my XQuery implementation, I approximate stop words by identifying words that are less than four characters long. (You could use a list of actual stop words, if you prefer. But using the 3 characters or less approximation results in more compact example code).

Determining Near Duplicates

But how do you actually use the shingles to determine that two documents are, in fact, near duplicates?

I'll leave that as an exercise for you to work through. But the essential technique revolves around the Jaccard similarity co-efficient.

No comments:

Post a Comment