Monday, October 15, 2012

Longest Common Substring in XQuery (Part Two)

In part one, I discussed a recursive implementation of LCS, used to identify near duplicates. However, the recursive implementation is very expensive in both time and memory, so in this post I discuss a better alternative.
String Art by fotobydave
http://www.flickr.com/photos/fotobydave/2218213257/
A More Efficient Solution
It isn't hard to find non-recursive implementations of LCS in Python. Here's a nice example via WikiBooks:

def LongestCommonSubstring(S1, S2):
    # M[x][y] holds the length of the common run ending at S1[x-1] and S2[y-1]
    M = [[0] * (1 + len(S2)) for i in xrange(1 + len(S1))]
    longest, x_longest = 0, 0
    for x in xrange(1, 1 + len(S1)):
        for y in xrange(1, 1 + len(S2)):
            if S1[x-1] == S2[y-1]:
                M[x][y] = M[x-1][y-1] + 1
                if M[x][y] > longest:
                    longest = M[x][y]
                    x_longest = x
            else:
                M[x][y] = 0
    # slice out the longest run, which ends at index x_longest in S1
    return S1[x_longest - longest: x_longest]


But it isn't straightforward to reimplement this in XQuery 1.0. That's because variables in XQuery are not, in fact, variable. XQuery is a functional language, and one consequence is that variables are immutable, i.e. once they have been assigned a value, that value cannot be changed. This makes implementing the more efficient version of LCS in XQuery tricky, since it relies on the technique of memoization. (Basically, "memoization" means avoiding repeated function calls by keeping track of the results of previous calculations and reusing them instead.)
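For example, the closest you can get to "changing" a variable in XQuery is to create a new binding that shadows the old one (a trivial illustration):

let $x := 1
let $x := $x + 1  (: a new binding that shadows the old $x; nothing is updated in place :)
return $x         (: 2 :)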
map book by danstina
http://www.flickr.com/photos/danstina/464164291/
Maps to the Rescue
Although "pure" XQuery 1.0 variables cannot be altered, each real world implementation of XQuery tends to have its own, proprietary way of getting around this problem. Since I'm using MarkLogic, I used maps to solve the problem.

The MarkLogic map:map is similar to an Object in JavaScript or a hash in Perl: a map:map is an associative array. (For more discussion of what maps are and how they can offer performance advantages, I recommend Map Maker, Map Maker, Make Me a Map! and Maps and Profiling Performance.)
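For instance, the basic operations look like this (a trivial sketch; note that map keys are always strings):

let $m := map:map()
let $_ := map:put($m, "color", "blue")  (: map:put modifies the map in place :)
return map:get($m, "color")             (: "blue" :)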

Get Me Rewrite!
So, here's the above Python rewritten in XQuery, using the map:map API.
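What follows is a condensed sketch rather than a drop-in implementation: it assumes MarkLogic's 1.0-ml dialect, where map:put updates a map in place and FLWOR clauses with side effects are evaluated in order. The function names local:lcs, local:lcs_codepoints, and local:lcs_matchingcharacters match the dissection below, but the details here are a simplification.

xquery version "1.0-ml";

(: Convenience wrapper: turn the strings into codepoints, as in part one. :)
declare function local:lcs($s1 as xs:string, $s2 as xs:string) as xs:string
{
  local:lcs_codepoints(fn:string-to-codepoints($s1), fn:string-to-codepoints($s2))
};

(: Driver: set up the map data structures, run the pairwise comparison,
   then read the longest run back out of $best. :)
declare function local:lcs_codepoints($x as xs:integer*, $y as xs:integer*) as xs:string
{
  let $matrix := map:map()  (: row index (as a string key) -> map of column -> run length :)
  let $best := map:map()    (: length of the longest run so far, and the row where it ends :)
  let $_ := (map:put($best, "longest", 0), map:put($best, "x-longest", 0))
  let $_ :=
    for $i in 1 to fn:count($x)
    return (
      map:put($matrix, xs:string($i), map:map()),
      for $j in 1 to fn:count($y)
      where $x[$i] eq $y[$j]
      return local:lcs_matchingcharacters($matrix, $best, $i, $j)
    )
  let $longest := map:get($best, "longest")
  return fn:codepoints-to-string(
    fn:subsequence($x, map:get($best, "x-longest") - $longest + 1, $longest))
};

(: Two matching characters: extend the diagonal run, M[i][j] = M[i-1][j-1] + 1,
   treating missing cells as 0, and keep track of the maximum seen. :)
declare function local:lcs_matchingcharacters(
  $matrix as map:map, $best as map:map, $i as xs:integer, $j as xs:integer
) as empty-sequence()
{
  let $prev-row := map:get($matrix, xs:string($i - 1))
  let $prev := if (fn:exists($prev-row)) then map:get($prev-row, xs:string($j - 1)) else ()
  let $len := (if (fn:exists($prev)) then $prev else 0) + 1
  return (
    map:put(map:get($matrix, xs:string($i)), xs:string($j), $len),
    if ($len > map:get($best, "longest"))
    then (map:put($best, "longest", $len), map:put($best, "x-longest", $i))
    else ()
  )
};

(: Returns " in California." :)
local:lcs("Selling a beautiful house in California.",
          "Buying a beautiful crib in California.")

One nice side effect of building the matrix out of maps is that only the non-zero cells ever get stored, whereas the Python version allocates the full matrix up front.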


Dissecting the map:map Memoization
As in part one, the function lcs is a convenience function: it simply turns the strings into codepoints and then calls lcs_codepoints. That function is the driver: it sets up the map data structures to capture the results and calls lcs_matchingcharacters to perform the pairwise comparison of the codepoints of the two strings.

It took me a while to grok the use of map:map. It helped to think of it in terms of Python dictionaries, although the syntax in Python is arguably more elegant. In particular, once I understood that what I needed was a two-dimensional matrix, I realized that I needed a map of maps! At this point, as I wrote in my notes, my head literally exploded.
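Here's that two-dimensional lookup in miniature (a toy example; the integer indexes have to be coerced to string keys):

let $matrix := map:map()
let $row := map:map()
let $_ := map:put($row, "7", 42)
let $_ := map:put($matrix, "3", $row)
return map:get(map:get($matrix, "3"), "7")  (: 42, i.e. M[3][7] :)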
Explode! by gainesp2003
http://www.flickr.com/photos/33403047@N00/4011759821/
But I got it to work, and it now handles much larger string comparisons than the recursive method could. It isn't fast, however.

Suggestions for improvement?

Tuesday, October 9, 2012

Longest Common Substring in XQuery (Part One)

I've been looking at ways to determine whether two documents are near duplicates. So far, I've looked at Shingling and Sketching, which are relatively inexpensive ways to calculate the probability that two news articles are near duplicates. My goal is to quickly and cheaply figure out which documents are very likely to be duplicates, so that I can then apply a relatively expensive algorithm for deciding whether they are actually duplicates: Longest Common Substring.
strings by missy and the universe
http://www.flickr.com/photos/missy-and-the-universe/117632001/
The Longest Common Substring
Essentially, this technique compares two strings and finds the longest run of characters that occurs in both of them. You can then declare the two documents as near duplicates if the ratio of the common substring length to the length of the documents exceeds some threshold.

That is probably a bit confusing, so let's illustrate with an example. Take the following two strings:

Selling a beautiful house in California.
Buying a beautiful crib in California.

The longest common substring is " in California." (it is 15 characters long, whereas " a beautiful " comes in second at 13 characters long). The first string is 40 characters long. So, you could assess how similar the strings are by taking the ratio: 15/40 = 0.375.

We didn't say what the ratio threshold needs to be, but I think you'll agree that this seems like a low ratio and most people would agree that those two strings are not near duplicates. The shingling and sketching techniques could well have identified that these are candidate duplicates, since so many of the runs of characters are quite similar in the two. And, that's why it is key to combine the techniques: sketching and shingling are relatively inexpensive ways to identify candidate near duplicates, which helps us efficiently narrow down to pairs of documents that are worth checking. Calculating the longest common substring can be quite expensive (in both time and memory, as I will shortly illustrate) but it gives a much more accurate determination of whether two strings are near duplicates. (At least, I feel intuitively that it is more accurate. I plan to try out these techniques on large sets of documents and then have humans assess the results. It could turn out that I need a different technique or sets of techniques).
Recursion by alisdair
http://www.flickr.com/photos/alisdair/4522805/
Implementing Longest Common Subsequence in XQuery Using Recursion
First, I implemented a recursive algorithm for Longest Common Subsequence. (Longest Common Subsequence is a problem allied to Longest Common Substring, but the two are not quite the same.) In part, that's because I found a Longest Common Subsequence implementation in Python that used recursion, but also because recursion is often a natural algorithmic approach in XQuery: it is a functional language, and many algorithms decompose naturally into recursive solutions.

Here's the Python recursive implementation of Longest Common Subsequence, via Algorithmist:

def lcs_len(x, y):
    """This function returns length of longest common sequence of x and y."""
 
    if len(x) == 0 or len(y) == 0:
        return 0
 
    xx = x[:-1]   # xx = sequence x without its last element    
    yy = y[:-1]
 
    if x[-1] == y[-1]:  # if last elements of x and y are equal
        return lcs_len(xx, yy) + 1
    else:
        return max(lcs_len(xx, y), lcs_len(x, yy))


And here is my re-implementation in XQuery.

declare function local:lcs_codepoints_len($x as xs:integer*, $y as xs:integer*)
as xs:unsignedLong
{
  let $len_x := count($x)
  let $len_y := count($y)
  let $xx := for $i in 1 to $len_x - 1 return $x[$i]
  let $yy := for $i in 1 to $len_y - 1 return $y[$i]
  return
    if ($len_x = 0 or $len_y = 0) then
      0
    else if ($x[$len_x] = $y[$len_y]) then
      local:lcs_codepoints_len($xx, $yy) + 1
    else
      fn:max((local:lcs_codepoints_len($xx, $y), local:lcs_codepoints_len($x, $yy)))
};

declare function local:lcs_len($x as xs:string, $y as xs:string)
as xs:unsignedLong
{
  let $xc := fn:string-to-codepoints($x)
  let $yc := fn:string-to-codepoints($y)
  return local:lcs_codepoints_len($xc, $yc)
};

let $a := "the quick brown"
let $b := "the quick brow"

return local:lcs_len($a, $b)

You will notice that lcs_len is really just a "helper" function that turns the two strings into sequences of codepoints. Codepoints are the integer values of the corresponding Unicode characters, and they seem to be the easiest way in XQuery to deal with the individual characters within a string.
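A quick illustration of the round trip:

let $cp := fn:string-to-codepoints("abc")  (: (97, 98, 99) :)
return fn:codepoints-to-string($cp)        (: "abc" :)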

The function lcs_codepoints_len is the recursive function that does all the real work. Essentially, it steps through the characters of the two strings and does pairwise comparisons. If the two last characters match, the length is incremented. If not, the function calls itself recursively on the two shortened inputs (dropping the last character of one string, then of the other) and takes the max of the results.

This works - for really small strings. But I found that it quickly exceeds the available memory (on my wimpy laptop) for even moderately long strings: the two recursive calls recompute the same subproblems over and over, so the work blows up exponentially. So, I needed a more memory-efficient solution. I worked one out, and that will be what I discuss in Part Two.

Friday, October 5, 2012

Named Entity Recognition

Let's say you would like to automatically identify people, places, organizations or brands within blocks of text. What you need is a system that will perform Named Entity Recognition.

NER is a challenge that has been studied extensively over the last several years, and there are many software packages available that you can use to identify various types of entities. Here's a list I recently compiled: mainly open source, mainly English-centric, and in no particular order.

GATE
An open source Java framework for many types of text processing, including NER.

ANNIE
Open source Java, from the University of Sheffield. Part of GATE.

Python NLTK
The very widely used open source Natural Language Toolkit for Python.

Lingpipe
Commercially supported Java, straight out of Brooklyn. Also, Lingpipe has a nice list of competitors (some of which duplicate what I’ve listed in this post).

MinorThird
Java open source code from CMU.

Stanford Named Entity Recognizer
Java open source code from Stanford.

Illinois Named Entity Tagger
Java open source code from Illinois.

Factorie
Probabilistic programming with imperatively-defined factor graphs. I don’t know what this means, either. It is written in Scala, though, which is the latest hipster programming language.

Apache Mahout
Java, open source, classification. (I’m not actually certain it does NER, but it looks interesting).

A couple of academic papers I found that evaluate NER systems: