SyntaxHighlighter

SyntaxHighlighter

Monday, February 25, 2013

Mining for eBooks

In February 2013, the W3C, in partnership with IDPF and BISG, organized a workshop on eBooks, in conjunction with O'Reilly's TOC. I was invited to speak about AP's and IPTC's experience with implementing permissions and restrictions with machine readable rights. (ePub lets you include DRM statements; it seems that some publishers are using ODRL v1; IPTC have selected ODRL v2 for the foundation of RightsML). It was a great experience being on the panel and I got a lot of thoughtful and interesting questions.
eBook Readers Galore by libraryman
http://www.flickr.com/photos/libraryman/5052936803/
eBook Newbie
I'm a bit of a eBook neophyte. However, I learnt a lot from hearing the other publishers talking about their experience, hopes and frustrations with this digital publishing mechanism. And it struck me how similar the news industry is to the book industry. In his opening keynote, Bill McCoy talked about the three main ways that publisher deliver books these days: files, apps and websites. Of course, these are also the three main ways that news is delivered, today, too (not to mention dead trees in both cases for non digital publishing). As various other speakers presented at the workshop, they repeatedly used examples from newspapers and magazines (although, sometimes, as illustrations of what *not* to do). And, it has to be said, both book publishers and news publishers are in the same boat of trying to figure out their digital futures.

For more about both the eBook workshop and TOC, I recommend Ivan Herman's reflections.
Evolution of Readers by jblyberg
http://www.flickr.com/photos/jblyberg/4505413539/

Mining for eBooks
Given how easy it is to create and publish an eBook, it would seem that mining a news archive could yield some interesting books. Some news publishers are already conducting experiments with ebooks in this way. For example, the UK's Guardian have a series of Guardian Shorts. (Martin Belam wrote some quite interesting articles about how he worked with the Guardian archive to create ebooks on the Internet and the Olympics). Similarly, Vanity Fair have also started to play with ebooks.

Of course, ebooks aren't the only way to make use of a rich news archive. The New York Times recently launched their TimesMachine which lets you see browse back issues between 1851 and 1922 (“all the news which was fit to print”).

As software continues to eat the world, it will be interesting to see how formerly different kinds of publishers converge and diverge in their attempts to make their digital ways.

Thursday, January 24, 2013

I have been playing a lot with Amazon Web Services. For numerous reasons, I principally like to use these key bits of software in the work I do:

Python
lxml
s3cmd

As a consequence, I find myself repeatedly doing the following steps to bring the Amazon Linux up to scratch for what I need. I thought I would document them here, in case anyone else finds these steps useful. But also as an easy way for me to find it again...

To Just Generally Bring the Server Up To Date
sudo yum update

Amongst other things, this brings you to Python 2.6, which is sufficiently up-to-date for what I need. (By the time you or future me reads this, I suppose it might be more up-to-date than that).

Install lxml
lxml is the best library I've found for working with XML in Python. It is compatible with, but offers lots of nice enhancements beyond, the standard elementtree, including better support for XPath and built in support for XSLT processing.

Based on this very handy blog post, I do the following to install lxml

sudo yum install gcc
sudo yum install python26-devel
sudo yum install libxslt
sudo yum install libxslt-devel
sudo yum install libxml2-devel
sudo easy_install libxml


That last step to install libxml can take a few minutes. But the whole thing typically takes perhaps ten minutes.



Install s3cmd
Since I do a lot of work with Amazon s3, it is handy to have a command line interface to list, get and put files to s3 buckets. I tried following the instructions about how to install s3cmd on the site, but it just wouldn't work. So, now I do this and it works like a charm:

sudo yum --enablerepo epel install s3cmd


And you can run s3cmd --configure to set up and test out the s3 configuration, if you like.

Machine Readable Rights: A One Day Conference in Amsterdam


I'm helping to organize a free, one day conference on 12th March 2013 in Amsterdam, to discuss "Machine Readable Rights and the News Industry".
Sunset Over Amsterdam (Frontpage) by Werner Kunz
http://www.flickr.com/photos/werkunz/4565900446/
We're aiming to bring together the major players who are interested in the topic, across business, legal, editorial and technical groups. We have quite a few people who have signed up already, but it isn't too late to register if you're interested in attending - or speaking.
Coffee cup by Doug88888
http://www.flickr.com/photos/doug88888/2953428679/
The IPTC has organized similar one day conferences before. We've found that the presentations and panel discussions are always thought provoking. And the less formal introductions and discussions that happen over the coffee breaks, lunch time chats and after meeting drinks are at least as important.

If you're interested in machine readable rights, then don't hesitate to sign up!

Monday, October 15, 2012

Longest Common Substring in XQuery (Part Two)

In part one, I discussed implementing a recursive implementation of LCS, in order to identify near duplicates. However, the recursive implementation is very expensive in both time and memory, so in this post I discuss a better alternative.
String Art by fotobydave
http://www.flickr.com/photos/fotobydave/2218213257/
A More Efficient Solution
It isn't hard to find non-recursive implementations of LCS in Python. Here's a nice example via WikiBooks:

 def LongestCommonSubstring(S1, S2):
    M = [[0]*(1+len(S2)) for i in xrange(1+len(S1))]
    longest, x_longest = 0, 0
    for x in xrange(1,1+len(S1)):
        for y in xrange(1,1+len(S2)):
            if S1[x-1] == S2[y-1]:
                M[x][y] = M[x-1][y-1] + 1
                if M[x][y]>longest:
                    longest = M[x][y]
                    x_longest  = x
            else:
                M[x][y] = 0
    return S1[x_longest-longest: x_longest]


But it isn't straightforward to reimplement this in XQuery 1.0. That's because variables in XQuery are not, in fact, variable. XQuery is a functional language; one consequence of this is that variables are immutable, i.e once they have been assigned a value, that value cannot be changed. This makes implementing the more efficient version of LCS in XQuery tricky, since it relies on the technique of memoization. (Basically, "memoization" means avoiding having to repeat function calls by keeping track of the results of previous calculations and using those instead).
map book by danstina
http://www.flickr.com/photos/danstina/464164291/
Maps to the Rescue
Although "pure" XQuery 1.0 variables cannot be altered, each real world implementation of XQuery tends to have its own, proprietary way of getting around this problem. Since I'm using MarkLogic, I used maps to solve the problem.

The MarkLogic map:map is similar to Object in JavaScript or hashes in PERL. A map:map is an associative array.(For more discussion of what maps are and how they can offer performance advantages, I recommend Map Maker, Map Maker, Make Me a Map! and Maps and Profiling Performance).

Get Me Rewrite!
So, here's the above Python rewritten in XQuery, using the map:map API.


Dissecting the map:map Memoization
As in part one, the function lcs is a convenience function: it simply turns the strings into codepoints and then calls the lcs_codepoints function. This function is the driver: it sets up the map datastructures to capture the results and calls lcs_matchingcharacters to perform the pairwise comparison of the codepoints for the two strings.

It took me a while to grok the use of map:map. It helped me to think of it in terms of Python associative arrays. However, the syntax in Python is arguably more elegant. In particular, when I understood that what I needed was a two dimensional matrix, I realized that I need a map of maps! At this point, as I wrote in my notes, my head literally exploded.
Explode! by gainesp2003
http://www.flickr.com/photos/33403047@N00/4011759821/
However, I got it to work. And it now handles much larger string comparisons than the recursive method. It isn't fast, however.

Suggestions for improvement?

Tuesday, October 9, 2012

Longest Common Substring in XQuery (Part One)

I've been looking at ways to determine whether two documents are near duplicates. So far, I've looked at Shingling and Sketching, which are relatively inexpensive ways to calculate the probability that two news articles are near duplicates. My goal is to quickly and cheaply figure out which documents are very likely to be duplicates, so that I can then apply a relatively expensive algorithm for deciding whether they are actually duplicates: Longest Common Substring.
strings by missy and the universe
http://www.flickr.com/photos/missy-and-the-universe/117632001/
The Longest Common Substring
Essentially, this technique compares two strings and finds the longest run of characters that occurs in both of them. You can then declare the two documents as near duplicates if the ratio of the common substring length to the length of the documents exceeds some threshold.

That is probably a bit confusing. So, let's try to illustrate it with some examples. Take the following two strings:

Selling a beautiful house in California.
Buying a beautiful crip in California.

The longest common substring is " in California." (it is 15 characters long, whereas " a beautiful " comes in second at 13 characters long). The first string is 40 characters long. So, you could assess how similar the strings are by taking the ratio: 15/40 = 0.375.

We didn't say what the ratio threshold needs to be, but I think you'll agree that this seems like a low ratio and most people would agree that those two strings are not near duplicates. The shingling and sketching techniques could well have identified that these are candidate duplicates, since so many of the runs of characters are quite similar in the two. And, that's why it is key to combine the techniques: sketching and shingling are relatively inexpensive ways to identify candidate near duplicates, which helps us efficiently narrow down to pairs of documents that are worth checking. Calculating the longest common substring can be quite expensive (in both time and memory, as I will shortly illustrate) but it gives a much more accurate determination of whether two strings are near duplicates. (At least, I feel intuitively that it is more accurate. I plan to try out these techniques on large sets of documents and then have humans assess the results. It could turn out that I need a different technique or sets of techniques).
Recursion by alisdair
http://www.flickr.com/photos/alisdair/4522805/
Implementing Longest Common Subsequence in XQuery Using Recursion
First, I implemented a recursive algorithm for Longest Common Subsequence. (The Longest Common Subsequence is an allied problem to Longest Common Substring, but they are not quite the same). In part, that's because I found an Longest Common Subsequence implementation in Python which used recursion, but also because recursion in XQuery is often a natural algorithmic approach. XQuery is a functional language and many algorithms functionally decompose into recursive solutions.

Here's the Python recursive implementation of Longest Common Subsequence via Algorithmist

def lcs_len(x, y):
    """This function returns length of longest common sequence of x and y."""
 
    if len(x) == 0 or len(y) == 0:
        return 0
 
    xx = x[:-1]   # xx = sequence x without its last element    
    yy = y[:-1]
 
    if x[-1] == y[-1]:  # if last elements of x and y are equal
        return lcs_len(xx, yy) + 1
    else:
        return max(lcs_len(xx, y), lcs_len(x, yy))


And here is my re-implementation in XQuery.

declare function local:lcs_codepoints_len($x as xs:integer*, $y as xs:integer*)
as xs:unsignedLong
{
  let $len_x := count($x)
  let $len_y := count($y)
  let $xx := for $i in 1 to $len_x - 1 return $x[$i]
  let $yy := for $i in 1 to $len_y - 1 return $y[$i]
  return
    if ($len_x = 0 or $len_y = 0) then
      0
    else if ($x[$len_x] = $y[$len_y]) then
      local:lcs_codepoints_len($xx, $yy) + 1
    else
      fn:max((local:lcs_codepoints_len($xx, $y), local:lcs_codepoints_len($x, $yy)))
};

declare function local:lcs_len($x as xs:string, $y as xs:string)
as xs:unsignedLong
{
  let $xc := fn:string-to-codepoints($x)
  let $yc := fn:string-to-codepoints($y)
  return local:lcs_codepoints_len($xc, $yc)
};

let $a := "the quick brown"
let $b := "the quick brow"

return local:lcs_len($a, $b)

You will notice that the lcs_len function is really just a "helper" function that turns the two strings into a collection of codepoints. Code points are the integer values for the corresponding Unicode characters. This seems to be the easiest way in XQuery to deal with the individual characters within a string.

The function lcs_codepoints_len is the one that is really the recursive function that does all the work. Essentially, it  steps through the characters in the two strings and does pairwise comparisons. If the two characters match, the length is incremented. If not, then the function calls itself recursively to figure out the max subsequence length of the split between and takes the max of the result.

This works - for really small strings. But i found that it quickly exceeds the available memory (on my wimpy laptop) for even moderately long strings. So, I needed a more memory efficient solution. I worked one out, and that will be what I discuss in Part Two.

Friday, October 5, 2012

Named Entity Recognition

Let's say you would like to automatically identify people, places, organizations or brands within blocks of text. What you need is a system that will perform Named Entity Recognition.

NER is a challenge that has been extensively studied over the last several years. And there are many software packages available that you can use to identify various different types of entities. Here's a list that I recently compiled, mainly open source, mainly English-centric, but in no particular order.

GATE
An open source, Java framework for various different types of text processing, including NER.

ANNIE
Open source Java, from the University of Sheffield. Part of GATE.

Python NLTK
Very widely used open source Natural Language Tool Kit for Python.

Lingpipe
Commercially supported Java, straight out of Brooklyn. Also, Lingpipe has a nice list of competitors (some of which duplicates what I’ve listed in this post):

MinorThird
Java open source code from CMU.

Stanford Named Entity Recognizer
Java open source code from Stanford.

Illinois Named Entity Tagger
Java open source code from Illinois.

Factorie
Probabilistic programming with imperatively-defined factor graphs. I don’t know what this means, either. It is written in Scala, though, which is the latest hipster programming language.

Apache Mahout
Java, open source, classification. (I’m not actually certain it does NER, but it looks interesting).

A couple of academic papers I found, that evaluate NER systems:


Tuesday, August 28, 2012

Sketching

In this post, I'm not discussing drawing but instead a technique to efficiently identify candidate near duplicate news articles.

Shingling
In previous posts, I've discussed a technique called "w-shingling", which lets you determine that two news articles are near duplicates. Briefly, given two documents, you

1. calculate a set of w-shingles by identifying all sequences of three words that start with stop words (XQuery, Python)
2. determining the similarity of the w-shingles for the two documents by calculating their Jaccard Coefficient

This technique works well enough, but it requires you to compare every single pairwise combination of documents in your corpus. This can result in quite a few comparisons: even for a relatively small set of 100 documents, there are 4,950 permutations! For a large corpus of documents (such as a news archive with millions of documents) or in situations where you're trying to do real time comparisons of documents (such as when crawling websites) you need a way to efficiently narrow down the number of comparisons you need to perform.

Sketching
The technique of sketching builds upon the w-shingling technique to quickly identify sets of candidate duplicates, which can then be compared using more accurate methods.

To calculate a sketch for a document you:

1. calculate the w-shingles for the document
2. calculate the the hash of each w-shingle
2. group the hashes of w-shingles into n buckets by calculating the modulus (hash mod n)
3. take the smallest hash value as the sketch for this document

The resulting buckets are candidate near duplicates.

Efficient
The sketch technique described above is linear, i.e. you calculate one sketch per document, and requires simple functions built into every modern language (hashing and modulus arithmetic). It entails calculating the w-shingles per document, which we'll use again in the next step (calculating the Jaccard coefficient for candidate pairs). The sketches and the hashed w-shingles don't require nearly as much memory to store as the original text strings. And it results in *far* fewer pairwise comparisons.

If you sort our 100 example documents into 25 buckets, say, then you'd only need to do 150 pairwise comparisons (assuming that the 100 documents got sorted into 4 documents per bucket, which would mean 25 sets of 6 pairwise comparisons).

So, sketching and w-shingling are a very memory and time efficient way to find near duplicates.

More Precise Duplicate Detection
Can we expand this technique even further? If we regard sketches and w-shingling as being increasingly expensive and increasingly accurate ways to identify near duplicates, can we add a third step to provide even more precise results?

To some extent, it depends upon how you are defining duplicates. The w-shingling technique will generally identify very similar news articles that are "about" the same thing, since so many of the key phrases are identical. These might be news articles from different publications written about the same news events (which might be useful for clustering search results, for example).

However, I'm particularly interested in identifying articles that are variants of "the same" text. For example, two different versions of the same news article (perhaps published at different times on different websites, but from the same source, such as The Associated Press). That's why I would like to examine the technique of Longest Common Substring. But in another post...