In part one, I discussed implementing a recursive implementation of LCS, in order to identify near duplicates. However, the recursive implementation is very expensive in both time and memory, so in this post I discuss a better alternative.
A More Efficient Solution
It isn't hard to find non-recursive implementations of LCS in Python. Here's a nice example via WikiBooks:
But it isn't straightforward to reimplement this in XQuery 1.0. That's because variables in XQuery are not, in fact, variable. XQuery is a functional language; one consequence of this is that variables are immutable, i.e once they have been assigned a value, that value cannot be changed. This makes implementing the more efficient version of LCS in XQuery tricky, since it relies on the technique of memoization. (Basically, "memoization" means avoiding having to repeat function calls by keeping track of the results of previous calculations and using those instead).
Maps to the Rescue
Although "pure" XQuery 1.0 variables cannot be altered, each real world implementation of XQuery tends to have its own, proprietary way of getting around this problem. Since I'm using MarkLogic, I used maps to solve the problem.
The MarkLogic map:map is similar to Object in JavaScript or hashes in PERL. A map:map is an associative array.(For more discussion of what maps are and how they can offer performance advantages, I recommend Map Maker, Map Maker, Make Me a Map! and Maps and Profiling Performance).
Get Me Rewrite!
So, here's the above Python rewritten in XQuery, using the map:map API.
String Art by fotobydave http://www.flickr.com/photos/fotobydave/2218213257/ |
It isn't hard to find non-recursive implementations of LCS in Python. Here's a nice example via WikiBooks:
def LongestCommonSubstring(S1, S2): M = [[0]*(1+len(S2)) for i in xrange(1+len(S1))] longest, x_longest = 0, 0 for x in xrange(1,1+len(S1)): for y in xrange(1,1+len(S2)): if S1[x-1] == S2[y-1]: M[x][y] = M[x-1][y-1] + 1 if M[x][y]>longest: longest = M[x][y] x_longest = x else: M[x][y] = 0 return S1[x_longest-longest: x_longest]
But it isn't straightforward to reimplement this in XQuery 1.0. That's because variables in XQuery are not, in fact, variable. XQuery is a functional language; one consequence of this is that variables are immutable, i.e once they have been assigned a value, that value cannot be changed. This makes implementing the more efficient version of LCS in XQuery tricky, since it relies on the technique of memoization. (Basically, "memoization" means avoiding having to repeat function calls by keeping track of the results of previous calculations and using those instead).
map book by danstina http://www.flickr.com/photos/danstina/464164291/ |
Although "pure" XQuery 1.0 variables cannot be altered, each real world implementation of XQuery tends to have its own, proprietary way of getting around this problem. Since I'm using MarkLogic, I used maps to solve the problem.
The MarkLogic map:map is similar to Object in JavaScript or hashes in PERL. A map:map is an associative array.(For more discussion of what maps are and how they can offer performance advantages, I recommend Map Maker, Map Maker, Make Me a Map! and Maps and Profiling Performance).
Get Me Rewrite!
So, here's the above Python rewritten in XQuery, using the map:map API.
Dissecting the map:map Memoization
As in part one, the function lcs is a convenience function: it simply turns the strings into codepoints and then calls the lcs_codepoints function. This function is the driver: it sets up the map datastructures to capture the results and calls lcs_matchingcharacters to perform the pairwise comparison of the codepoints for the two strings.
It took me a while to grok the use of map:map. It helped me to think of it in terms of Python associative arrays. However, the syntax in Python is arguably more elegant. In particular, when I understood that what I needed was a two dimensional matrix, I realized that I need a map of maps! At this point, as I wrote in my notes, my head literally exploded.
However, I got it to work. And it now handles much larger string comparisons than the recursive method. It isn't fast, however.
Suggestions for improvement?
As in part one, the function lcs is a convenience function: it simply turns the strings into codepoints and then calls the lcs_codepoints function. This function is the driver: it sets up the map datastructures to capture the results and calls lcs_matchingcharacters to perform the pairwise comparison of the codepoints for the two strings.
It took me a while to grok the use of map:map. It helped me to think of it in terms of Python associative arrays. However, the syntax in Python is arguably more elegant. In particular, when I understood that what I needed was a two dimensional matrix, I realized that I need a map of maps! At this point, as I wrote in my notes, my head literally exploded.
Explode! by gainesp2003 http://www.flickr.com/photos/33403047@N00/4011759821/ |
Suggestions for improvement?