Tuesday, July 24, 2012

Shingling in Python

As a short companion piece to my post about Shingling in XQuery, and as an exercise to keep my Python skills sharp, I rewrote my XQuery example in Python.

Character Shingling in Python

It is slightly easier to find examples of shingling in Python. For example, I found this one line example of character shingling in this blog post

[word[i:i + n] for i in range(len(word) - n + 1)]

This makes use of Python's list comprehension technique to succinctly render n-shingles for a given word. I used that as inspiration for my word-based w-shingle method.

Word Shingling in Python

As before, I am only interested in shingles that start with what look like stop words (approximated as being words consisting of fewer than 4 characters).

theString = "the quick brown fox jumps over the lazy dog. now is the time for all good men to come to the aid of the party"
shingleLength = 3
tokens = theString.split()

print [tokens[i:i+shingleLength] for i in range(len(tokens) - shingleLength + 1) if len(tokens[i]) < 4]


Monday, July 23, 2012

Shingling in XQuery

I am a lazy programmer. So, when I decided that I wanted to try detecting near duplicate news articles in XQuery using the shingling technique, naturally I googled it. Sadly, I couldn't find exactly what I was looking for, so I was forced to write it myself.
Shingle by M J M

Shingling in XQuery for the Impatient

Here's the code I came up with:

let $string := "the quick brown fox jumps over the lazy dog. now is the time for all good men to come to the aid of the party"
let $shingle-length := 3
let $tokens := fn:tokenize ($string , " ")
for $w at $position in $tokens
let $shingle := subsequence($tokens, $position - $shingle-length, $shingle-length)
let $shingleanchor := $tokens[$position - $shingle-length]
where (($position gt $shingle-length) and (fn:string-length($shingleanchor) lt 4))
element shingle 
    fn:string-join($shingle, " ")

And here's what you get when you run it:

<shingle>the quick brown</shingle>
<shingle>fox jumps over</shingle>
<shingle>the lazy dog.</shingle>
<shingle>now is the</shingle>
<shingle>is the time</shingle>
<shingle>the time for</shingle>
<shingle>for all good</shingle>
<shingle>all good men</shingle>
<shingle>men to come</shingle>
<shingle>to come to</shingle>
<shingle>to the aid</shingle>
<shingle>the aid of</shingle>
<shingle>aid of the</shingle>
Pink Shingle by Sue & Pip

Some Explanation

Let's start with the problem I'd like to solve: detecting near duplicates. It is straightforward to determine that two strings of text are exact duplicates of one another. A much more interesting problem is to detect that two documents are near duplicates. For example, consider that two web pages might differ in only small ways - such as placement of adverts or Facebook widgets. To a person looking at them, they would quickly surmise that these small differences are insignificant, if the main text is the same. For news in particular, most people would say that two text articles are the same if the only difference is the headline on the two.
There are many techniques that have been put forward for detecting near duplicates, including simhash and longest common subsquence. But I'm interested in using the shingling technique (I might try out the others, too, of course).

What is a "Shingle"?

In natural language processing, a "shingle" is a sequence of tokens from a document. The set of unique shingles from a document is known as a "w-shingling" (where the w indicates the length of the shingles). Shingles are typically either composed of non-whitespace characters (numbers and letters) or complete words.

For example, the 3-shingling of the phrase

If you think you can win, you can win.

Using characters for tokens would be

(ify, fyo, you, thi, hin, nky, kyo, ouc, uca, can, anw, nwi, win)

Whereas the 3-shingling of the phrase using words for tokens would be

(if you think, you think you, think you can, you can win)
Shingles by Roger Smith
And, for a final twist, I want to ensure that my shingles start with a stop word. This idea comes from the SIGIR 2008 paper "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections" (PDF). They found that near duplicate news articles (specifically articles from the Associated Press) could be  identified just from the shingles that start with stop words. In my XQuery implementation, I approximate stop words by identifying words that are less than four characters long. (You could use a list of actual stop words, if you prefer. But using the 3 characters or less approximation results in more compact example code).

Determining Near Duplicates

But how do you actually use the shingles to determine that two documents are, in fact, near duplicates?

I'll leave that as an exercise for you to work through. But the essential technique revolves around the Jaccard similarity co-efficient.

Wednesday, July 18, 2012

XML Namespaces

If you only take away one thing from this blog posting, let it be this:

XML Protip:

If you can’t figure out why your XPath expression isn’t working, check the namespace.

XML Namespaces

It is possible to define XML elements and attributes in different namespaces.
This promotes reuse of existing vocabularies because, rather than copying someone else's elements and attributes into your schema, you can just directly incorporate their XML vocabulary into your instance documents.

Namespaces Look Like URLs

XML Namespaces look like URLs, although there is no requirement that there be anything in particular that you can retrieve at the end of the URL. But URLs are convenient because they are a well-established way to have a federated naming scheme. In other words, if I own an Internet domain, I can make up URLs within that domain without needing to consult any centralized authority. That means I can use URLs to mint new, guaranteed unique namespace identifiers.

Declaring Namespaces

You can use xmlns to declare a default namespace
<newsItem xmlns=''>
    <title>Pope Blesses Astronauts</title>

In the above example, newsItem is in the namespace.
itemMeta and title are also in the namespace, even though they don't have any xmlns listed on the elements. That's because child elements inherit the namespace from their parents.

XML Namespace Prefixes

You can use xmlns:prefix to declare a namespace and bind it to a prefix

<nar:newsItem xmlns:nar=''>
    <nar:title>Pope Blesses Astronauts</nar:title>

newsItem is in the namespace. And so are itemMeta and title: to an XML parser, this document and the previous one are identical.

Whether you use a prefix on each namespaced element or you leave them unadorned is a matter of style and personal preference.

Namespace are Useful, but can be Confusing

Although namespaces are important in constructing modular XML documents and avoiding re-inventing the wheel of vocabularies, they can be quite confusing. In particular, the non-prefix syntax (in which an element inherits its namespace from its parent) can catch you out in various ways.

For example, if you were to just copy and paste the "inner" markup from the first document above, you'd wind up with the document below - and lose the namespace!

    <title>Pope Blesses Astronauts</title>

This is bad.

Similarly, it is routine to look at an instance document and construct an XPath such as this to pick out the value of the title element:


And spend hours trying to figure out why it isn't working. Of course, you know that it is because we didn't specify that the itemMeta  and  title elements need to be in the namespace.

Hence my number one tip for debugging XML-related code:

XML Protip:

If you can’t figure out why your XPath expression isn’t working, check the namespace.


Paul Kelly adds his number one XPath tip: check the spelling first.

Thursday, July 5, 2012

Knight News Challenges

The Knight Foundation runs an annual "news challenge":

The Knight News Challenge accelerates media innovation by funding breakthrough ideas in news and information. Winners receive a share of $5 million in funding and support from Knight’s network of influential peers and advisors to help advance their ideas.
Numbers by Koen Vereeken
I was recently browsing through the many entries to the current stage of the challenge: "DATA"

We’re looking for new ways of collecting, understanding, visualizing and helping the public use the large amounts of information generated each day.

I encourage you to look through them, too. To whet your appetite, here are a handful that caught my eye:

This one might be good for a laugh?
Combine satire, open data, and software development to produce an informative, comedy hackathon centered around the 2012 election.

I've got an interest in rNews. It is nice to see it spread in the wild a little bit.
Enable publishers to automatically augment content with meta data in the rNews format, derived from natural language processing.

HTTP for Data / Python
And I've got a soft spot for Python, still one of my favourite programming languages of all time.
Improve Python, Numpy, Scipy for data science & big data. Lay groundwork for an open data web (“HTTP for Data”).

The fledgling Wikidata initiative is very interesting and has generated a lot of excitement in the semantic web world. Perhaps a little ambitious - a repository for all identifiers for everything - but the ambition is laudable.
Turn Wikidata into a common, free identifier repository for tagging, exploring, linking, and visualizing resources, news stories, and data everywhere.

News Event Service
And my name is on this one...
Create a news event registry and a curated event taxonomy based on both a daily crawl of global news and contributions from news providers.