In part one, I discussed implementing a recursive implementation of LCS, in order to identify near duplicates. However, the recursive implementation is very expensive in both time and memory, so in this post I discuss a better alternative.
A More Efficient Solution
It isn't hard to find non-recursive implementations of LCS in Python. Here's a nice example via WikiBooks:
But it isn't straightforward to reimplement this in XQuery 1.0. That's because variables in XQuery are not, in fact, variable. XQuery is a functional language; one consequence of this is that variables are immutable, i.e once they have been assigned a value, that value cannot be changed. This makes implementing the more efficient version of LCS in XQuery tricky, since it relies on the technique of memoization. (Basically, "memoization" means avoiding having to repeat function calls by keeping track of the results of previous calculations and using those instead).
Maps to the Rescue
Although "pure" XQuery 1.0 variables cannot be altered, each real world implementation of XQuery tends to have its own, proprietary way of getting around this problem. Since I'm using MarkLogic, I used maps to solve the problem.
The MarkLogic map:map is similar to Object in JavaScript or hashes in PERL. A map:map is an associative array.(For more discussion of what maps are and how they can offer performance advantages, I recommend Map Maker, Map Maker, Make Me a Map! and Maps and Profiling Performance).
Get Me Rewrite!
So, here's the above Python rewritten in XQuery, using the map:map API.
![]() |
String Art by fotobydave http://www.flickr.com/photos/fotobydave/2218213257/ |
It isn't hard to find non-recursive implementations of LCS in Python. Here's a nice example via WikiBooks:
def LongestCommonSubstring(S1, S2): M = [[0]*(1+len(S2)) for i in xrange(1+len(S1))] longest, x_longest = 0, 0 for x in xrange(1,1+len(S1)): for y in xrange(1,1+len(S2)): if S1[x-1] == S2[y-1]: M[x][y] = M[x-1][y-1] + 1 if M[x][y]>longest: longest = M[x][y] x_longest = x else: M[x][y] = 0 return S1[x_longest-longest: x_longest]
But it isn't straightforward to reimplement this in XQuery 1.0. That's because variables in XQuery are not, in fact, variable. XQuery is a functional language; one consequence of this is that variables are immutable, i.e once they have been assigned a value, that value cannot be changed. This makes implementing the more efficient version of LCS in XQuery tricky, since it relies on the technique of memoization. (Basically, "memoization" means avoiding having to repeat function calls by keeping track of the results of previous calculations and using those instead).
![]() |
map book by danstina http://www.flickr.com/photos/danstina/464164291/ |
Although "pure" XQuery 1.0 variables cannot be altered, each real world implementation of XQuery tends to have its own, proprietary way of getting around this problem. Since I'm using MarkLogic, I used maps to solve the problem.
The MarkLogic map:map is similar to Object in JavaScript or hashes in PERL. A map:map is an associative array.(For more discussion of what maps are and how they can offer performance advantages, I recommend Map Maker, Map Maker, Make Me a Map! and Maps and Profiling Performance).
Get Me Rewrite!
So, here's the above Python rewritten in XQuery, using the map:map API.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
xquery version "1.0-ml"; | |
declare function local:lcs_matchingcharacters($xx as xs:integer, $yy as xs:integer, | |
$yentryprev as xs:integer, $xentry as map:map, $xpos as xs:integer, $ypos as xs:integer, | |
$longestmap as map:map) | |
as xs:integer | |
{ | |
let $matchlen := if ($xx = $yy) then | |
(: Since these two characters match, increment the match length by one :) | |
let $ml := $yentryprev + 1 | |
(: Store the new match length so we can retrieve it when we compare the next pair :) | |
let $me := map:put($xentry, xs:string($ypos), $ml) | |
let $incrementlongest := | |
(: Do we have the longest match length so far? :) | |
if ($ml gt map:get($longestmap, "longest_length")) | |
then | |
(: Record the new match length :) | |
let $ll := map:put($longestmap, "longest_length", $ml) | |
(: And track where the new longest matching substring ends :) | |
return map:put($longestmap, "longest_substring_end", $xpos) | |
else () | |
return $ml | |
else | |
(: There characters don't match, so record a zero length match at this position :) | |
let $me := map:put($xentry, xs:string($ypos), 0) | |
return 0 | |
return $matchlen | |
}; | |
declare function local:lcs_codepoints($x as xs:integer*, $y as xs:integer*) | |
as map:map | |
{ | |
let $len_x := count($x) | |
let $len_y := count($y) | |
let $map := map:map() | |
let $longestmap := map:map() | |
let $initlongestmap := map:put($longestmap, "longest_length", 0) | |
let $initlongestmap := map:put($longestmap, "longest_substring_end", 0) | |
(: Add all the x maps :) | |
let $build := | |
for $xx in 0 to $len_x + 1 | |
let $xmap := map:map() | |
return map:put($map, xs:string($xx), $xmap) | |
(: For each xmap, add a ymap and set its value to 0 :) | |
let $build := | |
for $xx in 0 to $len_x + 1 | |
for $yy in 0 to $len_y + 1 | |
let $xmap := map:get($map, xs:string($xx)) | |
(: Initialize all values to 0 :) | |
return map:put($xmap, xs:string($yy), 0) | |
(: Loop through both sets of codepoints and do a pairwise comparison of every combination :) | |
let $evalulate := | |
for $xx at $xpos in $x | |
for $yy at $ypos in $y | |
let $xentry := map:get($map, xs:string($xpos)) | |
let $xentryprev := map:get($map, xs:string($xpos - 1)) | |
(: $yentryprev is the length of the match between x and y so far :) | |
let $yentryprev := map:get($xentryprev, xs:string($ypos - 1)) | |
return local:lcs_matchingcharacters($xx, $yy, $yentryprev, $xentry, $xpos, $ypos, $longestmap) | |
return $longestmap | |
}; | |
declare function local:lcs($x as xs:string, $y as xs:string) | |
as map:map | |
{ | |
(: It is convenient to turn the strings into codepoints for easier character-by-character comparison :) | |
(: So this function is just a shim to take two strings and turn them into collections of codepoints :) | |
let $xc := fn:string-to-codepoints($x) | |
let $yc := fn:string-to-codepoints($y) | |
return local:lcs_codepoints($xc, $yc) | |
}; | |
let $x := "compare the quick brown fox" | |
let $y := "with the quick brown dog" | |
let $m := local:lcs($x, $y) | |
let $xlen := map:get($m, "longest_length") | |
let $xend := map:get($m, "longest_substring_end") | |
return fn:substring($x, $xend - $xlen + 1, $xlen) |
Dissecting the map:map Memoization
As in part one, the function lcs is a convenience function: it simply turns the strings into codepoints and then calls the lcs_codepoints function. This function is the driver: it sets up the map datastructures to capture the results and calls lcs_matchingcharacters to perform the pairwise comparison of the codepoints for the two strings.
It took me a while to grok the use of map:map. It helped me to think of it in terms of Python associative arrays. However, the syntax in Python is arguably more elegant. In particular, when I understood that what I needed was a two dimensional matrix, I realized that I need a map of maps! At this point, as I wrote in my notes, my head literally exploded.
However, I got it to work. And it now handles much larger string comparisons than the recursive method. It isn't fast, however.
Suggestions for improvement?
As in part one, the function lcs is a convenience function: it simply turns the strings into codepoints and then calls the lcs_codepoints function. This function is the driver: it sets up the map datastructures to capture the results and calls lcs_matchingcharacters to perform the pairwise comparison of the codepoints for the two strings.
It took me a while to grok the use of map:map. It helped me to think of it in terms of Python associative arrays. However, the syntax in Python is arguably more elegant. In particular, when I understood that what I needed was a two dimensional matrix, I realized that I need a map of maps! At this point, as I wrote in my notes, my head literally exploded.
![]() |
Explode! by gainesp2003 http://www.flickr.com/photos/33403047@N00/4011759821/ |
Suggestions for improvement?