I've been looking at ways to determine whether two documents are near duplicates. So far, I've looked at Shingling and Sketching, which are relatively inexpensive ways to estimate the probability that two news articles are near duplicates. My goal is to quickly and cheaply figure out which documents are very likely to be duplicates, so that I can then apply a relatively expensive algorithm for deciding whether they are **actually** duplicates: the Longest Common Substring.


*Photo: strings by missy and the universe, http://www.flickr.com/photos/missy-and-the-universe/117632001/*

**The Longest Common Substring**

Essentially, this technique compares two strings and finds the longest run of characters that occurs in both of them. You can then declare the two documents near duplicates if the ratio of the common substring length to the length of the documents exceeds some threshold.

That is probably a bit confusing. So, let's try to illustrate it with an example. Take the following two strings:

*Selling a beautiful house in California.*
*Buying a beautiful crip in California.*

The longest common substring is " in California." (it is 15 characters long, whereas "g a beautiful " comes in second at 14 characters long). The first string is 40 characters long. So, you could assess how similar the strings are by taking the ratio: 15/40 = 0.375.

We didn't say what the ratio threshold needs to be, but I think you'll agree that this is a low ratio, and most people would agree that those two strings are not near duplicates. The shingling and sketching techniques could well have identified these as candidate duplicates, since so many of the runs of characters in the two strings are quite similar. And that's why it is key to combine the techniques: sketching and shingling are relatively inexpensive ways to identify candidate near duplicates, which helps us efficiently narrow down to pairs of documents that are worth checking. Calculating the longest common substring can be quite expensive (in both time and memory, as I will shortly illustrate), but it gives a much more accurate determination of whether two strings are near duplicates. (At least, I feel intuitively that it is more accurate. I plan to try out these techniques on large sets of documents and then have humans assess the results. It could turn out that I need a different technique or set of techniques.)

*Photo: Recursion by alisdair, http://www.flickr.com/photos/alisdair/4522805/*

**Implementing Longest Common Subsequence in XQuery Using Recursion**

First, I implemented a recursive algorithm for Longest Common Subsequence. (Longest Common Subsequence is a problem allied to Longest Common Substring, but the two are not quite the same.) In part, that's because I found a Longest Common Subsequence implementation in Python which used recursion, but also because recursion is often a natural algorithmic approach in XQuery: it is a functional language, and many algorithms decompose naturally into recursive solutions.
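To make that distinction concrete, here is a quick Python sketch (mine, for illustration only) of both measures. For "abcde" versus "ace", the longest common subsequence is "ace" (length 3, since gaps are allowed), while the longest common substring has length 1 (the run must be contiguous):

```python
def lcs_subsequence_len(x, y):
    # subsequence: characters must appear in order, but gaps are allowed
    if not x or not y:
        return 0
    if x[-1] == y[-1]:
        return lcs_subsequence_len(x[:-1], y[:-1]) + 1
    return max(lcs_subsequence_len(x[:-1], y), lcs_subsequence_len(x, y[:-1]))

def lcs_substring_len(x, y):
    # substring: one contiguous run of characters common to both strings
    best = 0
    for i in range(len(x)):
        for j in range(len(y)):
            k = 0
            while i + k < len(x) and j + k < len(y) and x[i + k] == y[j + k]:
                k += 1
            best = max(best, k)
    return best

print(lcs_subsequence_len("abcde", "ace"))  # 3 ("ace", with gaps)
print(lcs_substring_len("abcde", "ace"))    # 1 (no common run longer than one character)
```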

Here's the Python recursive implementation of Longest Common Subsequence, via Algorithmist:

```python
def lcs_len(x, y):
    """This function returns length of longest common sequence of x and y."""
    if len(x) == 0 or len(y) == 0:
        return 0

    xx = x[:-1]   # xx = sequence x without its last element
    yy = y[:-1]

    if x[-1] == y[-1]:  # if last elements of x and y are equal
        return lcs_len(xx, yy) + 1
    else:
        return max(lcs_len(xx, y), lcs_len(x, yy))
```

And here is my re-implementation in XQuery.

```xquery
declare function local:lcs_codepoints_len($x as xs:integer*, $y as xs:integer*) as xs:unsignedLong
{
  let $len_x := count($x)
  let $len_y := count($y)
  let $xx := for $i in 1 to $len_x - 1 return $x[$i]
  let $yy := for $i in 1 to $len_y - 1 return $y[$i]
  return
    if ($len_x = 0 or $len_y = 0) then 0
    else if ($x[$len_x] = $y[$len_y]) then local:lcs_codepoints_len($xx, $yy) + 1
    else fn:max((local:lcs_codepoints_len($xx, $y), local:lcs_codepoints_len($x, $yy)))
};

declare function local:lcs_len($x as xs:string, $y as xs:string) as xs:unsignedLong
{
  let $xc := fn:string-to-codepoints($x)
  let $yc := fn:string-to-codepoints($y)
  return local:lcs_codepoints_len($xc, $yc)
};

let $a := "the quick brown"
let $b := "the quick brow"
return local:lcs_len($a, $b)
```

You will notice that the lcs_len function is really just a "helper" function that turns the two strings into sequences of code points. Code points are the integer values of the corresponding Unicode characters. This seems to be the easiest way in XQuery to deal with the individual characters within a string.
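For readers more at home in Python than XQuery, fn:string-to-codepoints behaves roughly like this sketch (my analogy, not part of the original code):

```python
def string_to_codepoints(s):
    # integer Unicode code point for each character,
    # analogous to XQuery's fn:string-to-codepoints
    return [ord(c) for c in s]

print(string_to_codepoints("abc"))  # [97, 98, 99]
```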

The function lcs_codepoints_len is the recursive function that does all the work. Essentially, it steps through the characters in the two strings and does pairwise comparisons, starting from the end. If the last two characters match, it recurses on both strings minus their last character and adds one to the length. If not, it recurses twice, once dropping the last character of each string, and takes the max of the two results.

This works, for really small strings. But I found that it quickly exceeds the available memory (on my wimpy laptop) for even moderately long strings. So, I needed a more memory-efficient solution. I worked one out, and that will be what I discuss in Part Two.
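To give a feel for why the naive recursion blows up, here is an instrumented version (my own sketch) of the Python function above that counts recursive calls. The same subproblems get recomputed over and over: even for a pair of 10-character strings, the call count far exceeds the 10 × 10 = 100 distinct subproblems that actually exist:

```python
calls = 0

def lcs_len_counting(x, y):
    # same recursion as the lcs_len above, but counting every invocation
    global calls
    calls += 1
    if len(x) == 0 or len(y) == 0:
        return 0
    if x[-1] == y[-1]:
        return lcs_len_counting(x[:-1], y[:-1]) + 1
    return max(lcs_len_counting(x[:-1], y), lcs_len_counting(x, y[:-1]))

lcs_len_counting("abcdeabcde", "edcbaedcba")
print(calls)  # far more calls than the 100 distinct subproblems
```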
