Monday, October 15, 2012

Longest Common Substring in XQuery (Part Two)

In part one, I discussed a recursive implementation of LCS, used to identify near duplicates. However, the recursive approach is very expensive in both time and memory, so in this post I discuss a better alternative.
String Art by fotobydave
A More Efficient Solution
It isn't hard to find non-recursive implementations of LCS in Python. Here's a nice example via WikiBooks:

def LongestCommonSubstring(S1, S2):
    M = [[0]*(1+len(S2)) for i in xrange(1+len(S1))]
    longest, x_longest = 0, 0
    for x in xrange(1,1+len(S1)):
        for y in xrange(1,1+len(S2)):
            if S1[x-1] == S2[y-1]:
                M[x][y] = M[x-1][y-1] + 1
                if M[x][y] > longest:
                    longest = M[x][y]
                    x_longest = x
            else:
                M[x][y] = 0
    return S1[x_longest-longest: x_longest]

But it isn't straightforward to reimplement this in XQuery 1.0. That's because variables in XQuery are not, in fact, variable. XQuery is a functional language; one consequence of this is that variables are immutable, i.e. once they have been assigned a value, that value cannot be changed. This makes implementing the more efficient version of LCS in XQuery tricky, since it relies on the technique of memoization. (Basically, "memoization" means avoiding repeated function calls by keeping track of the results of previous calculations and reusing them.)
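To make the idea concrete, here's a minimal illustration of memoization in Python, caching previously computed results in a dict (the function and cache names here are my own, just for illustration):

```python
def make_fib():
    cache = {}  # remembers the results of earlier calls

    def fib(n):
        if n in cache:
            return cache[n]          # reuse a previous calculation
        result = n if n < 2 else fib(n - 1) + fib(n - 2)
        cache[n] = result            # record it for next time
        return result

    return fib

fib = make_fib()
print(fib(100))  # instant; the naive recursion would take ages
```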
map book by danstina
Maps to the Rescue
Although "pure" XQuery 1.0 variables cannot be altered, each real world implementation of XQuery tends to have its own, proprietary way of getting around this problem. Since I'm using MarkLogic, I used maps to solve the problem.

The MarkLogic map:map is similar to an Object in JavaScript or a hash in Perl: it is an associative array. (For more discussion of what maps are and how they can offer performance advantages, I recommend Map Maker, Map Maker, Make Me a Map! and Maps and Profiling Performance).

Get Me Rewrite!
So, here's the above Python rewritten in XQuery, using the map:map API.

Dissecting the map:map Memoization
As in part one, the function lcs is a convenience function: it simply turns the strings into codepoints and then calls the lcs_codepoints function. That function is the driver: it sets up the map data structures to capture the results and calls lcs_matchingcharacters to perform the pairwise comparison of the codepoints for the two strings.

It took me a while to grok the use of map:map. It helped me to think of it in terms of Python dictionaries, although the Python syntax is arguably more elegant. In particular, once I understood that what I needed was a two-dimensional matrix, I realized that I needed a map of maps! At this point, as I wrote in my notes, my head literally exploded.
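In Python terms, the "map of maps" is just a dict of dicts standing in for the two-dimensional matrix. Here's a sketch of the same idea (my own function name, not the code from the post): only matching positions are ever stored, and missing keys are read as zero.

```python
def lcs_substring_len(s1, s2):
    # M[x][y] holds the length of the common run ending at s1[x-1] and s2[y-1]
    M = {}
    longest = 0
    for x in range(1, len(s1) + 1):
        row = M.setdefault(x, {})
        for y in range(1, len(s2) + 1):
            if s1[x - 1] == s2[y - 1]:
                row[y] = M.get(x - 1, {}).get(y - 1, 0) + 1
                longest = max(longest, row[y])
    return longest

print(lcs_substring_len("the quick brown", "the quick brow"))  # 14
```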
Explode! by gainesp2003
However, I got it to work. And it now handles much larger string comparisons than the recursive method. It isn't fast, however.

Suggestions for improvement?

Tuesday, October 9, 2012

Longest Common Substring in XQuery (Part One)

I've been looking at ways to determine whether two documents are near duplicates. So far, I've looked at Shingling and Sketching, which are relatively inexpensive ways to calculate the probability that two news articles are near duplicates. My goal is to quickly and cheaply figure out which documents are very likely to be duplicates, so that I can then apply a relatively expensive algorithm for deciding whether they are actually duplicates: Longest Common Substring.
strings by missy and the universe
The Longest Common Substring
Essentially, this technique compares two strings and finds the longest run of characters that occurs in both of them. You can then declare the two documents as near duplicates if the ratio of the common substring length to the length of the documents exceeds some threshold.

That is probably a bit confusing. So, let's try to illustrate it with some examples. Take the following two strings:

Selling a beautiful house in California.
Buying a beautiful crip in California.

The longest common substring is " in California." (it is 15 characters long, whereas " a beautiful " comes in second at 13 characters long). The first string is 40 characters long. So, you could assess how similar the strings are by taking the ratio: 15/40 = 0.375.

We didn't say what the ratio threshold needs to be, but I think you'll agree that this seems like a low ratio and most people would agree that those two strings are not near duplicates. The shingling and sketching techniques could well have identified that these are candidate duplicates, since so many of the runs of characters are quite similar in the two. And, that's why it is key to combine the techniques: sketching and shingling are relatively inexpensive ways to identify candidate near duplicates, which helps us efficiently narrow down to pairs of documents that are worth checking. Calculating the longest common substring can be quite expensive (in both time and memory, as I will shortly illustrate) but it gives a much more accurate determination of whether two strings are near duplicates. (At least, I feel intuitively that it is more accurate. I plan to try out these techniques on large sets of documents and then have humans assess the results. It could turn out that I need a different technique or sets of techniques).
Recursion by alisdair
Implementing Longest Common Subsequence in XQuery Using Recursion
First, I implemented a recursive algorithm for Longest Common Subsequence. (Longest Common Subsequence is a problem allied to Longest Common Substring, but they are not quite the same.) In part, that's because I found a Longest Common Subsequence implementation in Python that used recursion, but also because recursion is often a natural algorithmic approach in XQuery: it is a functional language, and many algorithms functionally decompose into recursive solutions.

Here's the Python recursive implementation of Longest Common Subsequence, via Algorithmist:

def lcs_len(x, y):
    """This function returns length of longest common sequence of x and y."""
    if len(x) == 0 or len(y) == 0:
        return 0
    xx = x[:-1]   # xx = sequence x without its last element
    yy = y[:-1]
    if x[-1] == y[-1]:  # if last elements of x and y are equal
        return lcs_len(xx, yy) + 1
    else:
        return max(lcs_len(xx, y), lcs_len(x, yy))

And here is my re-implementation in XQuery.

declare function local:lcs_codepoints_len($x as xs:integer*, $y as xs:integer*)
as xs:unsignedLong
{
  let $len_x := count($x)
  let $len_y := count($y)
  let $xx := for $i in 1 to $len_x - 1 return $x[$i]
  let $yy := for $i in 1 to $len_y - 1 return $y[$i]
  return
    if ($len_x = 0 or $len_y = 0) then
      0
    else if ($x[$len_x] = $y[$len_y]) then
      local:lcs_codepoints_len($xx, $yy) + 1
    else
      fn:max((local:lcs_codepoints_len($xx, $y), local:lcs_codepoints_len($x, $yy)))
};

declare function local:lcs_len($x as xs:string, $y as xs:string)
as xs:unsignedLong
{
  let $xc := fn:string-to-codepoints($x)
  let $yc := fn:string-to-codepoints($y)
  return local:lcs_codepoints_len($xc, $yc)
};

let $a := "the quick brown"
let $b := "the quick brow"

return local:lcs_len($a, $b)

You will notice that the lcs_len function is really just a "helper" function that turns the two strings into sequences of codepoints. Codepoints are the integer values of the corresponding Unicode characters; this seems to be the easiest way in XQuery to deal with the individual characters within a string.

The function lcs_codepoints_len is the recursive function that does all the work. Essentially, it steps through the characters in the two strings and does pairwise comparisons. If the two characters match, the length is incremented. If not, the function calls itself recursively on each of the two shortened inputs and takes the max of the results.

This works - for really small strings. But I found that it quickly exceeds the available memory (on my wimpy laptop) for even moderately long strings. So, I needed a more memory efficient solution. I worked one out, and that is what I will discuss in Part Two.

Friday, October 5, 2012

Named Entity Recognition

Let's say you would like to automatically identify people, places, organizations or brands within blocks of text. What you need is a system that will perform Named Entity Recognition.

NER is a challenge that has been extensively studied over the last several years. And there are many software packages available that you can use to identify various different types of entities. Here's a list that I recently compiled, mainly open source, mainly English-centric, but in no particular order.

GATE
An open source Java framework for various different types of text processing, including NER.

ANNIE
Open source Java, from the University of Sheffield. Part of GATE.

Python NLTK
Very widely used open source Natural Language Tool Kit for Python.

LingPipe
Commercially supported Java, straight out of Brooklyn. Also, LingPipe has a nice list of competitors (some of which duplicate what I’ve listed in this post).

MinorThird
Java open source code from CMU.

Stanford Named Entity Recognizer
Java open source code from Stanford.

Illinois Named Entity Tagger
Java open source code from Illinois.

FACTORIE
Probabilistic programming with imperatively-defined factor graphs. I don’t know what this means, either. It is written in Scala, though, which is the latest hipster programming language.

Apache Mahout
Java, open source, classification. (I’m not actually certain it does NER, but it looks interesting).

A couple of academic papers I found that evaluate NER systems:

Tuesday, August 28, 2012

Sketching

In this post, I'm not discussing drawing but instead a technique to efficiently identify candidate near duplicate news articles.

In previous posts, I've discussed a technique called "w-shingling", which lets you determine that two news articles are near duplicates. Briefly, given two documents, you

1. calculate a set of w-shingles by identifying all sequences of three words that start with stop words (XQuery, Python)
2. determine the similarity of the w-shingles for the two documents by calculating their Jaccard Coefficient

This technique works well enough, but it requires you to compare every single pairwise combination of documents in your corpus. This can result in quite a few comparisons: even for a relatively small set of 100 documents, there are 4,950 pairs! For a large corpus of documents (such as a news archive with millions of documents), or in situations where you're trying to do real time comparisons of documents (such as when crawling websites), you need a way to efficiently narrow down the number of comparisons you need to perform.

The technique of sketching builds upon the w-shingling technique to quickly identify sets of candidate duplicates, which can then be compared using more accurate methods.

To calculate a sketch for a document you:

1. calculate the w-shingles for the document
2. calculate the hash of each w-shingle
3. group the hashes of the w-shingles into n buckets by calculating the modulus (hash mod n)
4. take the smallest hash value as the sketch for this document

The resulting buckets are candidate near duplicates.
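The steps above can be sketched in a few lines of Python. This is my own rough rendering, not code from the original posts: the function names, the choice of md5, and the bucket count are all illustrative assumptions (any stable hash would do).

```python
import hashlib

def w_shingles(text, w=3):
    """Shingles of w words whose first word looks like a stop word (< 4 chars)."""
    tokens = text.split()
    return {" ".join(tokens[i:i + w])
            for i in range(len(tokens) - w + 1)
            if len(tokens[i]) < 4}

def sketch(text, n_buckets=25, w=3):
    """The smallest bucketed shingle hash stands in as the document's sketch."""
    hashes = [int(hashlib.md5(s.encode("utf8")).hexdigest(), 16) % n_buckets
              for s in w_shingles(text, w)]
    return min(hashes) if hashes else None

# Documents with the same sketch value land in the same candidate bucket.
```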

The sketch technique described above is linear, i.e. you calculate one sketch per document, and requires simple functions built into every modern language (hashing and modulus arithmetic). It entails calculating the w-shingles per document, which we'll use again in the next step (calculating the Jaccard coefficient for candidate pairs). The sketches and the hashed w-shingles don't require nearly as much memory to store as the original text strings. And it results in *far* fewer pairwise comparisons.

If you sort our 100 example documents into 25 buckets, say, then you'd only need to do 150 pairwise comparisons (assuming that the 100 documents got sorted into 4 documents per bucket, which would mean 25 sets of 6 pairwise comparisons).
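That arithmetic is just binomial coefficients, which you can verify in Python:

```python
from math import comb

print(comb(100, 2))      # 4950 pairwise comparisons without sketching
print(25 * comb(4, 2))   # 150 comparisons with 25 buckets of 4 documents each
```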

So, sketching and w-shingling are a very memory and time efficient way to find near duplicates.

More Precise Duplicate Detection
Can we expand this technique even further? If we regard sketches and w-shingling as being increasingly expensive and increasingly accurate ways to identify near duplicates, can we add a third step to provide even more precise results?

To some extent, it depends upon how you are defining duplicates. The w-shingling technique will generally identify very similar news articles that are "about" the same thing, since so many of the key phrases are identical. These might be news articles from different publications written about the same news events (which might be useful for clustering search results, for example).

However, I'm particularly interested in identifying articles that are variants of "the same" text. For example, two different versions of the same news article (perhaps published at different times on different websites, but from the same source, such as The Associated Press). That's why I would like to examine the technique of Longest Common Substring. But in another post...

Monday, August 27, 2012

Jaccard Coefficient: Calculating Document Similarity in XQuery and Python

How do you determine that two documents are near duplicates?

Here's one way
First, you calculate the w-shingles for your documents. Then you calculate the Jaccard Coefficient of the shingles: if the result exceeds some threshold, you declare them similar.

Calculating the Jaccard Coefficient in Python
Here's a technique to calculate the Jaccard Coefficient in Python using sets. This method is adapted from a series of answers I found on Stack Overflow.

a = frozenset(("Selling", "a", "beautiful", "house", "in", "California"))
b = frozenset(("Buying", "a", "beautiful", "crip", "in", "California"))
jaccard_coefficient = float(len(a & b)) / float(len(a | b))

Note that we have to force the values to floats, in order to avoid integer arithmetic. This is a Python gotcha.
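(If you're on Python 3, this particular gotcha goes away, because / is always true division:)

```python
a = frozenset(("Selling", "a", "beautiful", "house", "in", "California"))
b = frozenset(("Buying", "a", "beautiful", "crip", "in", "California"))
print(len(a & b) / len(a | b))  # 0.5 — no float() casts needed on Python 3
```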

(For clarity and concision, I used strings in the sets. But you'd want to use w-shingles instead).

Calculating the Jaccard Coefficient in XQuery
Now, let's use the same technique to calculate the Jaccard Coefficient in XQuery. We're also using set operations in XQuery (specifically union and intersection on XQuery sequences).

let $a := ("Selling", "a", "beautiful", "house", "in", "California")
let $b := ("Buying", "a", "beautiful", "crip", "in", "California")
return fn:count(fn:distinct-values(($a[. = $b]))) div fn:count(fn:distinct-values(($a, $b)))

(Again, you'd want to use w-shingles rather than collections of strings).

In the spirit of being a lazy programmer, you might want to check whether the XQuery engine you're using already supports this kind of calculation, without you having to re-implement.

For example, Zorba supports several different techniques for calculating string similarity, including a nice simple method for determining the Jaccard Coefficient.

Now that we can pick out the w-shingles for two documents and calculate their Jaccard Coefficient, we have enough to do a pretty decent job of identifying near duplicates. And, if you have all the time in the world, you're probably all set. However, it is quite likely that you either have an archive of millions of documents to assess, or that you're processing a fast, real time feed of documents. In either case, the technique above will probably be too slow. So, you need a way to more quickly (and less accurately) identify candidate near duplicates. And there is a technique to do this, known as "sketches".

But this is enough, for this post.

Tuesday, July 24, 2012

Shingling in Python

As a short companion piece to my post about Shingling in XQuery, and as an exercise to keep my Python skills sharp, I rewrote my XQuery example in Python.

Character Shingling in Python

It is slightly easier to find examples of shingling in Python. For example, I found this one-line example of character shingling in a blog post:

[word[i:i + n] for i in range(len(word) - n + 1)]

This makes use of Python's list comprehension technique to succinctly render n-shingles for a given word. I used that as inspiration for my word-based w-shingle method.
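For example, running the one-liner with word = "shingle" and n = 3:

```python
word = "shingle"
n = 3
shingles = [word[i:i + n] for i in range(len(word) - n + 1)]
print(shingles)
# ['shi', 'hin', 'ing', 'ngl', 'gle']
```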

Word Shingling in Python

As before, I am only interested in shingles that start with what look like stop words (approximated as being words consisting of fewer than 4 characters).

theString = "the quick brown fox jumps over the lazy dog. now is the time for all good men to come to the aid of the party"
shingleLength = 3
tokens = theString.split()

print [tokens[i:i+shingleLength] for i in range(len(tokens) - shingleLength + 1) if len(tokens[i]) < 4]
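That's Python 2. Here's a Python 3 equivalent (variable names adjusted to my own taste), joining each shingle into a single string so the output is easier to compare with the XQuery version from the companion post:

```python
the_string = ("the quick brown fox jumps over the lazy dog. "
              "now is the time for all good men to come to the aid of the party")
shingle_length = 3
tokens = the_string.split()

shingles = [" ".join(tokens[i:i + shingle_length])
            for i in range(len(tokens) - shingle_length + 1)
            if len(tokens[i]) < 4]
print(shingles)  # 14 shingles, from 'the quick brown' to 'of the party'
```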


Monday, July 23, 2012

Shingling in XQuery

I am a lazy programmer. So, when I decided that I wanted to try detecting near duplicate news articles in XQuery using the shingling technique, naturally I googled it. Sadly, I couldn't find exactly what I was looking for, so I was forced to write it myself.
Shingle by M J M

Shingling in XQuery for the Impatient

Here's the code I came up with:

let $string := "the quick brown fox jumps over the lazy dog. now is the time for all good men to come to the aid of the party"
let $shingle-length := 3
let $tokens := fn:tokenize($string, " ")
for $w at $position in $tokens
let $shingle := subsequence($tokens, $position - $shingle-length, $shingle-length)
let $shingleanchor := $tokens[$position - $shingle-length]
where ($position gt $shingle-length) and (fn:string-length($shingleanchor) lt 4)
return
  element shingle {
    fn:string-join($shingle, " ")
  }

And here's what you get when you run it:

<shingle>the quick brown</shingle>
<shingle>fox jumps over</shingle>
<shingle>the lazy dog.</shingle>
<shingle>now is the</shingle>
<shingle>is the time</shingle>
<shingle>the time for</shingle>
<shingle>for all good</shingle>
<shingle>all good men</shingle>
<shingle>men to come</shingle>
<shingle>to come to</shingle>
<shingle>to the aid</shingle>
<shingle>the aid of</shingle>
<shingle>aid of the</shingle>
Pink Shingle by Sue & Pip

Some Explanation

Let's start with the problem I'd like to solve: detecting near duplicates. It is straightforward to determine that two strings of text are exact duplicates of one another. A much more interesting problem is to detect that two documents are near duplicates. For example, two web pages might differ in only small ways - such as the placement of adverts or Facebook widgets. A person looking at them would quickly surmise that these small differences are insignificant if the main text is the same. For news in particular, most people would say that two text articles are the same if the only difference between the two is the headline.
There are many techniques that have been put forward for detecting near duplicates, including simhash and longest common subsequence. But I'm interested in using the shingling technique (I might try out the others, too, of course).

What is a "Shingle"?

In natural language processing, a "shingle" is a sequence of tokens from a document. The set of unique shingles from a document is known as a "w-shingling" (where the w indicates the length of the shingles). Shingles are typically either composed of non-whitespace characters (numbers and letters) or complete words.

For example, the 3-shingling of the phrase

If you think you can win, you can win.

Using characters for tokens would be

(ify, fyo, you, out, uth, thi, hin, ink, nky, kyo, ouc, uca, can, anw, nwi, win)

Whereas the 3-shingling of the phrase using words for tokens would be

(if you think, you think you, think you can, you can win)
Shingles by Roger Smith
And, for a final twist, I want to ensure that my shingles start with a stop word. This idea comes from the SIGIR 2008 paper "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections" (PDF). They found that near duplicate news articles (specifically articles from the Associated Press) could be identified just from the shingles that start with stop words. In my XQuery implementation, I approximate stop words by identifying words that are less than four characters long. (You could use a list of actual stop words, if you prefer, but the three-characters-or-fewer approximation makes for more compact example code.)

Determining Near Duplicates

But how do you actually use the shingles to determine that two documents are, in fact, near duplicates?

I'll leave that as an exercise for you to work through. But the essential technique revolves around the Jaccard similarity coefficient.

Wednesday, July 18, 2012

XML Namespaces

If you only take away one thing from this blog posting, let it be this:

XML Protip:

If you can’t figure out why your XPath expression isn’t working, check the namespace.

XML Namespaces

It is possible to define XML elements and attributes in different namespaces.
This promotes reuse of existing vocabularies because, rather than copying someone else's elements and attributes into your schema, you can just directly incorporate their XML vocabulary into your instance documents.

Namespaces Look Like URLs

XML Namespaces look like URLs, although there is no requirement that there be anything in particular that you can retrieve at the end of the URL. But URLs are convenient because they are a well-established way to have a federated naming scheme. In other words, if I own an Internet domain, I can make up URLs within that domain without needing to consult any centralized authority. That means I can use URLs to mint new, guaranteed unique namespace identifiers.

Declaring Namespaces

You can use xmlns to declare a default namespace
<newsItem xmlns=''>
    <itemMeta>
        <title>Pope Blesses Astronauts</title>
    </itemMeta>
</newsItem>

In the above example, newsItem is in the declared default namespace. itemMeta and title are also in that namespace, even though they don't have any xmlns listed on the elements. That's because child elements inherit the default namespace from their parents.

XML Namespace Prefixes

You can use xmlns:prefix to declare a namespace and bind it to a prefix

<nar:newsItem xmlns:nar=''>
    <nar:itemMeta>
        <nar:title>Pope Blesses Astronauts</nar:title>
    </nar:itemMeta>
</nar:newsItem>

newsItem is in the same namespace as before. And so are itemMeta and title: to an XML parser, this document and the previous one are identical.

Whether you use a prefix on each namespaced element or you leave them unadorned is a matter of style and personal preference.

Namespace are Useful, but can be Confusing

Although namespaces are important in constructing modular XML documents and avoiding re-inventing the wheel of vocabularies, they can be quite confusing. In particular, the non-prefix syntax (in which an element inherits its namespace from its parent) can catch you out in various ways.

For example, if you were to just copy and paste the "inner" markup from the first document above, you'd wind up with the document below - and lose the namespace!

    <title>Pope Blesses Astronauts</title>

This is bad.

Similarly, it is routine to look at an instance document and construct an XPath such as this to pick out the value of the title element:

/newsItem/itemMeta/title

And then spend hours trying to figure out why it isn't working. Of course, you know that it is because we didn't specify that the itemMeta and title elements need to be in the namespace.
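This is easy to demonstrate with Python's standard library ElementTree, which makes you bind a prefix to the namespace URI explicitly. (The URI below is a made-up placeholder for illustration, not the actual namespace from the examples above.)

```python
import xml.etree.ElementTree as ET

# 'http://example.com/news' is a placeholder namespace URI, purely for illustration
doc = """<newsItem xmlns='http://example.com/news'>
    <itemMeta>
        <title>Pope Blesses Astronauts</title>
    </itemMeta>
</newsItem>"""

root = ET.fromstring(doc)

# The "obvious" path finds nothing, because every element is namespaced:
assert root.find("./itemMeta/title") is None

# Bind a prefix to the namespace, and the same path works:
ns = {"nar": "http://example.com/news"}
title = root.find("./nar:itemMeta/nar:title", ns)
print(title.text)  # Pope Blesses Astronauts
```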

Hence my number one tip for debugging XML-related code:

XML Protip:

If you can’t figure out why your XPath expression isn’t working, check the namespace.


Paul Kelly adds his number one XPath tip: check the spelling first.

Thursday, July 5, 2012

Knight News Challenges

The Knight Foundation runs an annual "news challenge":

The Knight News Challenge accelerates media innovation by funding breakthrough ideas in news and information. Winners receive a share of $5 million in funding and support from Knight’s network of influential peers and advisors to help advance their ideas.
Numbers by Koen Vereeken
I was recently browsing through the many entries to the current stage of the challenge: "DATA"

We’re looking for new ways of collecting, understanding, visualizing and helping the public use the large amounts of information generated each day.

I encourage you to look through them, too. To whet your appetite, here are a handful that caught my eye:

This one might be good for a laugh?
Combine satire, open data, and software development to produce an informative, comedy hackathon centered around the 2012 election.

I've got an interest in rNews. It is nice to see it spread in the wild a little bit.
Enable publishers to automatically augment content with meta data in the rNews format, derived from natural language processing.

HTTP for Data / Python
And I've got a soft spot for Python, still one of my favourite programming languages of all time.
Improve Python, Numpy, Scipy for data science & big data. Lay groundwork for an open data web (“HTTP for Data”).

The fledgling Wikidata initiative is very interesting and has generated a lot of excitement in the semantic web world. Perhaps a little ambitious - a repository for all identifiers for everything - but the ambition is laudable.
Turn Wikidata into a common, free identifier repository for tagging, exploring, linking, and visualizing resources, news stories, and data everywhere.

News Event Service
And my name is on this one...
Create a news event registry and a curated event taxonomy based on both a daily crawl of global news and contributions from news providers.

Thursday, May 31, 2012

SemTechBiz 2012

It has been four years since I last attended SemTech. I notice that it now has a longer name: SemTechBiz. I will be interested to see what else has changed. I'll find out next week (the conference starts on June 3rd).
Audience by thinkmedialabs


I will be speaking at the conference, along with Andreas Gebhard and Evan Sandhaus. We'll be giving an update on the state of rNews, the IPTC standard for embedding news metadata in web pages. Catch our panel session The State of rNews - One Year after its Release on Tuesday, June 5th, at 11.30-12.15 in room Imperial A. If you'd like to learn a bit more about rNews in the meantime, check out the IPTC's rNews site.
And Amy Sweigert, my boss, will also be presenting. She'll be unveiling the details of the AP's new metadata services - a platform for taxonomy and classification in the cloud. Catch her talk on AP Metadata Services on Wednesday, June 6, 2012 12:00 PM - 12:30 PM in room Yosemite A. And, even if you can't attend, you can look here to find out more about the AP's Metadata Services offerings.
learn by heycoach


As well as presenting, I hope to learn a little something at the conference. It looks like there are a lot of really interesting sessions, covering 17 different tracks, so I should have plenty of opportunity.
Hello my name is by mlebemle

Say Hi

So, if you're also going to be at SemTechBiz 2012, please say "Hi". And don't forget the hashtag: #semtechbiz.

Tuesday, May 8, 2012

Liking Links

I like the link.

Links by Clips
by RambergMediaImages
This may seem an uncontroversial position to take. But I'm specifically talking about the HTML link element. It is a nice model for expressing relationships between the current document and other resources. It has been imported into various XML standards (such as ATOM and NewsML-G2). But it isn't the only way to express links. Even in HTML, it is much more common to use

 <a href="">

than

<link href="" rel="related">.
Chain Links
by MyTudut
The Linky Landscape
In fact, there are a range of types of links. If you think about the img element within HTML, it includes a reference to a remote object to include in a web page (via the img/@src attribute). Even though the img element uses an HTTP reference, it isn't very linky - it is more about composing a document from multiple resources of different types.

 by Leo Reynolds
Anchors Aweigh!
The HTML a (anchor) element is a big step up in linkyness. It is generally used to specify an outbound navigational link from the current HTML page to some other resource (such as another HTML page, but it could also be a video clip, an image or anything else that can be addressed via a URL). But the HTML a element lets you specify how the remote resource relates to the current page: the a/@rel attribute lets you specify the relationship from the current document to the destination resource. Similarly, the a/@rev attribute lets you express the reverse relationship - how the remote resource relates to this document. (Remember, not all relationships are symmetrical). So, the anchor tag has some nice (though, in practice, rarely used) attributes for specifying in a machine-readable, controlled way the relationships between web resources. One limitation of the HTML a element is that it is only for use for "inline" links - such as marking up a bit of text within a paragraph.
by Profound Whatever
The link element has all the same attributes as the anchor element but it is designed to be purely semantic - it simply defines a relationship between the current document and the resource referred to in the link/@href attribute. The link element cannot contain content and so it is purely about referencing another addressable resource. As with the HTML a element, the relationship between the document containing the link and the resource being referenced can be specified using the @rel and @rev attributes.
by rubygold
Web Link Types
In an effort to provide some shared semantics about types of links, IANA put together a small, but growing, list of link types - values that you can use in your @rel or @rev attributes. Whenever you are adding a link to your document, check the existing list of link relations to see if you can provide some hints as to the relationship between your document and the linked resource.

Chain Links
by Eric M Martin
Spreading the Link
In fact, the HTML link element has proven so useful, that it has been adopted into other markup standards, notably IETF's ATOM and IPTC's NewsML-G2. The IETF issued RFC 5988 specifically about linking, not simply in the context of HTML documents.

So, if you need to establish relationships between resources, then consider using the link element, with @rel or @rev values. And if you happen to be designing a markup language, my advice is to make sure you include the full link element. I predict you'll find it useful over and over again.

Friday, January 6, 2012