Tuesday, April 4, 2017

EXTRA Progress - Building an Open Source News Classification Engine

Over the last year, I've been leading a project within the IPTC to build an open source rules-based classification engine for news. Dubbed "EXTRA" (shorthand for EXTraction Rules Apparatus), the software will be freely available under an MIT license. The work is being funded by a grant of €50,000 from Google's Digital News Initiative Innovation Fund.
“Extra” by Jeremy Brooks
We have drawn up the technical requirements, hired Infalia PC to partner with us on building the software and selected Elasticsearch's percolator as the fundamental technology. We've licensed two news corpora - English from Reuters and German from the Austrian Press Agency. Linguists are creating rules for classifying those corpora with IPTC’s Media Topics using the EXTRA engine.

I'm thrilled to say that the project is on track to deliver a working version of the engine, together with the sample rules, by the summer of 2017. Feel free to contact me for more information about the project.

Tuesday, February 21, 2017

Serverless LinkLog #1

This is the second in an occasional series of LinkLog posts. I list links to interesting articles, video and audio items on "serverless computing" and allied topics. Just like in my Machine Learning LinkLog #1 I group the items into three broad categories:

Introductory Non Technical - Aimed at the general reader or, perhaps, technical manager who wants to learn about Serverless, but is not aiming to be a practitioner
Introductory Technical - Aimed at someone who is comfortable with programming and technology, but wishes to learn how to work with serverless tools or techniques
In Depth Technical - Aimed at someone who is comfortable with the fundamentals of serverless technology, but wants to learn more about a particular aspect or wants to master "day two" problems.

I also intend to indicate whether the item is (mainly) a slidedeck, a video, a single article or a series of items. A secondary goal is to avoid the trope of using pictures of clouds to illustrate serverless topics - however, I apologize in advance if I fail.

If You Only Read One Thing About Serverless

Having said all of that, if you only want to read one thing about serverless, then I recommend "Serverless Architectures" by Mike Roberts. It will take you from 0 to 60 - not only understanding what serverless architectures are and how they relate to Backend-As-A-Service and Function-As-A-Service, but also their future potential and current shortcomings. This is for both technical and non technical people and will give you much greater insight than any of the introductory material listed below. However, it is quite long - I wound up reading it in chunks. So, you might read one or more of the "intro" items, before coming back to this one.

Introductory Non Technical

How Can Enterprises Leverage Serverless Computing Platforms?
Two page introduction from Forbes, talks a lot about AWS Lambda, but mentions competitors. Discusses how various large enterprises are blending serverless with their other compute strategies.

A Guide to Serverless Computing with AWS Lambda
Decent introduction for technical managers - has some source code, but it really isn't aimed at helping you write Lambda code. Very AWS specific, of course, but has a nod to competitor offerings.

AWS Lambda is Everywhere
Adrian Cockcroft gives an overview of AWS Lambda, with particular emphasis on the latest (November 2016) announcements from AWS re:Invent.

Evolution of Business Logic from Monoliths through Microservices, to Functions

Adrian Cockcroft again - portraying AWS Lambda as the logical next step in the business of computing.

Two part article from Charity Majors. Pours cold water on the notion that "serverless == noops" i.e. that just because you are not provisioning servers yourself that you therefore don't need to worry about operational problems anymore. In fact, she makes a compelling case that serverless requires all developers to think much more profoundly about operational, "day two" problems. She also points out how much more opaque serverless applications are, making them much harder to debug.

Introductory Technical

This is a bit confusing but ... there is an open source framework called "Serverless" aimed at making it easier to work with various technologies - including AWS Lambda, Azure Functions and Google CloudFunctions. There are other frameworks (see Zappa and Apex below) but this one seems popular (maybe because it is likely to come up when you google for "serverless"? Or, more seriously, maybe because it supports a variety of technologies and deployment scenarios - including local for development).

Open source framework for turning Python WSGI applications into serverless apps using AWS Lambda and Gateway. Supports advanced capabilities, including scheduling and keep_warm (for better performance).

Open source framework aimed at making AWS Lambda easier to work with. Adds support for non Lambda-native languages (such as golang) via a node.js shim.

Writing a cron job microservice with Serverless and AWS Lambda
A hands on article takes you through a practical example of how to write a cron-equivalent using Serverless (the framework) and AWS Lambda.

How to build powerful back-ends easily with Serverless
An image processing server in node.js with the Serverless framework, using a bundle of AWS services - Lambda, Rekognition and S3.

Bring static to life using serverless

Serverless Video Playlist
Nine-video playlist of hands-on basic coding with AWS serverless, with an emphasis on Lambda and node.js.

In Depth Technical

Serverless Workflows on AWS: My Journey From SWF to Step Functions
In depth technical discussion of AWS Simple Workflow and Lambda, with an emphasis on AWS Step Functions - sort of like Finite State Machines.

Fission: Serverless Functions as a Service for Kubernetes
Using Kubernetes to host Functions as a Service. Kubernetes is a way to host Linux containers within a cluster. So, this is a way to achieve a serverless architecture without using any of Lambda, Azure or CloudFunctions. In other words, this is a kind of "private serverless" (akin to "private cloud").

Airbnb's open source framework for analyzing massive log streams and alerting on them by defining rules in Python. Uses AWS Lambda and Kinesis, amongst other technologies.

Wednesday, January 25, 2017

Machine Learning LinkLog #1

This is the first in an occasional series of LinkLog posts. I list links to interesting articles, video and audio items on Machine Learning and allied topics. I group them into three broad categories:

  • Introductory Non Technical - Aimed at the general reader or, perhaps, technical manager who wants to learn about Machine Learning, but is not aiming to be a practitioner
  • Introductory Technical - Aimed at someone who is comfortable with programming and technology, but wishes to learn how to work with Machine Learning tools or techniques
  • In Depth Technical - Aimed at someone who is comfortable with the fundamentals of Machine Learning technology, but wants to learn more about a particular aspect or wants to master "day two" problems.

I also intend to indicate whether the item is (mainly) a slidedeck, a video, a single article or a series of items. My goal is to avoid the trope of using pictures of robots to illustrate machine learning topics. However, I apologize in advance if I fail (and you'll certainly find plenty of robot pictures in the items I link to).

Introductory Non Technical

What is Machine Learning?

Short, two-and-a-half minute explainer video from Oxford Sparks ("the amazing stories of science taking place at the University of Oxford"). Accurate and accessible though potentially misleading as it implies that people aren't needed at all. (Note that the video doesn't say this directly but you have to listen carefully to pick up that nuance.) Avoids the use of unnecessary jargon.

Hype vs. Reality: The AI Explainer

Twenty-eight slide explainer deck from Luminary Labs. Covers the different aspects of AI that are currently en vogue. Attempts to predict what is likely to actually succeed and what is probably just hype or misunderstanding.

Top 10 Hot Artificial Intelligence (AI) Technologies

An overview of the findings from a Forrester Research report about Artificial Intelligence in 2017.

Kristen Stewart co-wrote a paper on machine learning

The actress and director co-authored a paper on 'style transfers'. This is a neural-network technique to blend the content of one photo with the style of another. (It is popular with apps such as Prisma). Stewart and her team used the technique in her directorial debut "Come Swim" to create dream-like sequences in the film. The paper describes tricks she used to better control the style transfer effects. This article links in depth technical topics (neural networks and visual processing) with consumer products (feature films).

Introductory Technical

Machine Learning is Fun

A series of articles introducing various technical aspects of Machine Learning. Available in several languages.

A Flask API for serving scikit-learn models

Assumes a lot of scikit knowledge, but a good overview of how to create an API using Flask on the front end and scikit on the backend. The API is not truly RESTful (the endpoints are verbs not nouns, for example). However, it is still a useful introduction and contains the full Python code.

Hitchhiker's Guide to Data Science, Machine Learning, R, Python

A "best of" collection of links, relating to Machine Learning and Data Science, with a particular emphasis on Python and R.

In Depth Technical

Google's 43 Practical Rules of Machine Learning in Industry
Martin Zinkevich compiled 43 rules "intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google." Here are three(!) different presentations of those rules.

Predicting with confidence: the best machine learning idea you never heard of

An article advocating the use of "conformal prediction", a technique for calculating confidence intervals for your machine learning predictions, no matter what the forecast technique or dataset you're working with.

Wednesday, January 11, 2017

Developing the Digital Marketplace for Copyrighted Works

I recently spoke at "Developing the Digital Marketplace for Copyrighted Works", organized by the Commerce Department’s Internet Policy Taskforce. The goal of the public meeting was to "facilitate constructive, cross-industry dialogue among stakeholders about ways to promote a more robust and collaborative digital marketplace for copyrighted works". My impression of the roughly 80 attendees was of a mix of publishers and lawyers - who tended to be pretty conservatively dressed - and music people - who tended to be in more sparkly outfits. Most of the discussion revolved around music, photo and video, but it turned out that a lot of the problems and potential solutions were quite similar across industries and media types. I've linked to the video of the event at the end of this post.

I've been involved in rights work at the Associated Press, including adding rights and pricing metadata to AP's Image API. I lead the IPTC's Rights Working Group. And I'm working within the W3C's POE group to turn ODRL into an official standard.

I spoke on the first panel with the topic of "Unique Identifiers and Metadata". I was teased a bit about "fake news" (this was in Alexandria, VA on December 9th 2016, so close both in time and place to the U.S. Presidential Election). Amongst other things, I spoke about how apparently simple things - "let's agree on identifiers for photos" - turned out to be quite complicated - since a text item, a photo or a video is not really a single, simple atomic thing, but more like a molecule of information. (You can watch the entire panel - which turned out to be quite lively, despite the early hour - in the video linked below).

I also moderated a round table, with the topic "What are the practical steps to adopting standards for identifying and controlling copyrighted works?". As everyone at my table introduced themselves, they mostly said "oh, I'm just here to learn, I don't have much to contribute" but, in fact, we had a very vigorous discussion, which covered *lots* of topics! I summarized them during the "Plenary" session (again, I've linked to the video below). We talked about three areas. First was why we need standards - creators and rights holders should be compensated for their work, which could be financial compensation or it could be getting distribution and recognition. Second, we talked about the big barriers - technology, the culture of the Internet and human nature itself. Finally, we talked about concrete steps which the government and other organizations could take to get standards developed and adopted. (For the details, you'll need to watch the video. My segment runs from about 39:30 to about 45:55 but I recommend watching everyone's summary of their individual breakout sessions.)

If you're interested in rights, then you should consider coming to London for the week of May 15th. That's because the BBC is hosting a Rights Day on May 15th, the IPTC will be holding its Spring Meeting (including discussing RightsML) on May 16th and 17th and W3C will hold its face-to-face meeting on May 18th and 19th. If you're interested in any or all, contact me and I will put you in touch with the right rights people.

Opening Remarks and Panel Session 1: Unique Identifiers and Metadata
Panel Session 2: Registries and Rights Expression Languages

Panel Session 3: Digital Marketplaces

Plenary Session

Tuesday, November 22, 2016

The View From Berlin - IPTC AGM 2016

I Chair the Board of Directors of IPTC, a consortium of news agencies, publishers and system vendors, which develops and maintains technical standards for news, including NewsML-G2, rNews and News-in-JSON. I work with the Board to broaden adoption of IPTC standards, to maximize information sharing between members and to organize successful face-to-face meetings.

We hold face-to-face meetings in several locations throughout the year, although most of the detailed work of the IPTC is now conducted via teleconferences and email discussions. Our Annual General Meeting for 2016 was held in Berlin in October. As well as being the time for formal votes and elections, the AGM is a chance for the IPTC to look back over the last year and to look ahead to what is in store. What follows are my prepared remarks at the Berlin AGM.

The Only Constant

It is clear that the news industry is experiencing a great degree of change. The business side of news continues to be under pressure. And, in no small part, this is because the technology involved in the creation and distribution of news continues to rapidly evolve.

However, in many ways, this is a golden age of journalism. The demand for news and information has never been higher. The immediate and widespread distribution of news has never been easier.

The IPTC has been around for 51 years. I've been a delegate to the IPTC since 2000 and Chairman of the Board since June 2014. I'd like to give my perspective on the changes going on within the news industry and how IPTC has responded and will respond.

We're On a Mission

IPTC is rooted in - and foundational to - the news industry. Our open source standards for news technology enable the operations of hundreds of news and media organizations, large and small. IPTC standards are instrumental in the software used to create, edit, archive and distribute news and information around the world.

We are starting to evolve the scope of our work beyond standards - such as via the EXTRA project to build an open source rules-based classification engine. Much of what we do is relevant to not only news agencies and publishers, but also to photographers, videographers, academics and archivists. By bringing together these diverse groups, we can not only create powerful, efficient standards and technologies, but also learn from each other about what works and what does not.


We've introduced quite a bit of change within the IPTC since I've become Chairman and that has continued over the last year.

What's Going On?

We're working to improve our existing family of standards by
  • continuing to improve documentation - to make it easier to get going with a standard and simpler to grasp the nuances when you want to expand your implementation
  • making our standards more coherent and consistent - as many organizations need to use a combination
We're extending the reach of the IPTC, both by working with other organizations (including PRISM, IIIF, WAN-IFRA and W3C) and by engaging in new types of work such as EXTRA and the Video Metadata Hub, which are not traditional standards but are open source projects for the benefit of the community we serve.

Since I've become Chair, we've renewed our efforts to communicate the great work that we do. You can see a big uptick in our engagement via Twitter and LinkedIn, as well as a refreshed design for the IPTC website. Plus we're doing a lot more work "out in the open" on Github.

We're continuing to streamline the operations of the IPTC. We've simplified our processes to better reflect the ways we actually operate these days. For example we have dramatically reduced the number of formal votes we take. But we still have sufficient process in place to ensure that the interests of all members are protected. For 2017, we have decided to have two-plus-one face-to-face meetings, rather than our usual three-plus-one. We will hold two full face-to-face meetings (one in London, the other in Barcelona), plus our one day Photo Metadata conference in association with the CEPIC Conference in Berlin. This will allow us to intensify our work on the meetings, with more ambitious and compelling topics and speakers.

Do Better

As I said, we've been changing our processes, particularly for the face-to-face meetings. But what else could we do to simplify our processes whilst at the same time ensuring that there is a balance between the interests of all members? Are there ways for the IPTC to deliver more value to the membership? How do we continue to balance our policy of consensus-driven decision-making with the need to be more flexible and nimble?

IPTC is a membership-driven organization. Membership fees represent the vast majority of the revenue for our organization. As the news industry as a whole continues to feel pressure - including downsizing, mergers and, unfortunately, some members going out of business - the IPTC is experiencing downward pressure on its own revenue. So, we are working on ways to reach new members, whilst at the same time ensuring that existing members continue to derive value. We're also open to exploring new ways of generating revenue which fit with our mission - let us know your ideas!

What new areas should the IPTC focus on? Many journalists are experimenting with an array of technologies - Augmented Reality, Virtual Reality, 360 degree photos, drones and bots, to name but a few. And let's not forget about the "Cambrian Explosion" of technologies related to news and metadata on the Web, including AMP, AppleNews, Instant Articles, rNews, and OpenGraph. How can IPTC help - negotiating standards? Developing best practices? Navigating the ethics of these technologies?


If you're happy with the IPTC, then please tell others.

If you're not happy, then please tell me!

I Want to Thank You

Without you, the members of IPTC, literally none of this is possible. So, I'd like to take a moment to thank everyone involved in the organization, particularly everyone involved in all of the detailed work of the IPTC. And I'd like to acknowledge and thank Andreas Gebhard, who is stepping down from the Board, and Johan Lindgren who has been voted on.

Finally, I'd like to extend a special thanks to Michael Steidl, Managing Director of the IPTC, who is personally involved in almost every aspect of what we do.


No doubt, next year will bring us many new and, often, unexpected challenges. I look forward to tackling them with all of you, the IPTC.

Thursday, October 6, 2016

Developers Needed For IPTC's EXTRA Rules-based Classification Engine

Over the last several months, I've been working within the IPTC - along with a number of other news organizations - on "EXTRA" (shorthand for EXTraction Rules Apparatus), an open source rules-based classification engine for news content. I'm thrilled because this week we reached a significant milestone: we started the formal process of looking for developers to implement the EXTRA engine.
“Extra” by Jeremy Brooks
The IPTC was awarded a grant of €50,000 from Google's Digital News Initiative Innovation Fund to build and freely distribute the initial version of EXTRA. As part of the IPTC, we are working with several news providers to supply sets of news documents, and with linguists to write rules to classify the documents. We've been working on defining the technical requirements and now we’re looking for software developers to design, develop, document and test EXTRA.

Below is the formal announcement. If you know anyone who might be interested, let them know. And if you are interested, please let us know!

Developers Needed For IPTC's EXTRA Rules-based Classification Engine

IPTC is looking for software developers to design, develop, document and test EXTRA, an open source rules-based classification engine for news. First preference will be given to applications received by 21st October 2016, and review will continue until the positions are filled. Apply here.

"Classification" means assigning one or more categories to the text of a news document. Rules based classifiers use a set of Boolean rules, rather than machine-learning or statistical techniques, to determine which categories to apply.
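To make the idea of Boolean rules concrete, here is a minimal sketch in Python. It is not EXTRA's rule language or implementation; the category names, the rules and the `classify` helper are all hypothetical, and the word matching is deliberately naive.

```python
def classify(text, rules):
    """Return the sorted list of categories whose Boolean rule matches the text."""
    # Naive tokenization: lowercase the text and split on whitespace.
    words = set(text.lower().split())
    return sorted(category for category, rule in rules.items() if rule(words))


# Hypothetical rules: each one is a Boolean expression over the set of words.
RULES = {
    "sport": lambda w: "football" in w or ("match" in w and "goal" in w),
    "economy": lambda w: "inflation" in w and "football" not in w,
}
```

For example, `classify("Late goal wins the match", RULES)` evaluates each rule against the document's words and returns only the categories whose rule is true. No training data or statistical model is involved; the behaviour is determined entirely by the rules.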

EXTRA is the EXTraction Rules Apparatus, a multilingual open-source platform for rules-based classification of news content. IPTC was awarded a grant of €50,000 from the first round of Google’s Digital News Initiative Innovation Fund to build and freely distribute the initial version of EXTRA. DNI granted IPTC €50,000 for the entire project.

We are working with news providers to supply sets of news documents and with linguists to write rules to classify the documents. IPTC is looking for qualified developers to create the rules engine to accurately and efficiently categorize the documents using the rules. There are both mandatory and preferred requirements.

Please consult this page for more information and to let us know if you’re interested in being considered.

Monday, September 26, 2016

An ast "Hello World": Getting Started with Python's Abstract Syntax Trees

I've been working on a Python library which - for a number of reasons - needs to dynamically alter itself. Essentially, I want it to parse a document and to generate some code based on that parsed file.


It turns out that Python's ast module lets me do exactly what I need. I came across some quite useful supplementary documentation on ast. But, to get started, I needed something simpler than those advanced examples. I therefore wrote a "Hello world!" program using ast. Here it is, in case you were looking for that, too.

Hello world!

Since I've become test infected, I wanted to structure my "Hello world!" ast program using unit tests.

So, to start, I tracked down a suitable "Hello world!" unit test in Python.
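The original listing is not reproduced in this text, so here is a minimal sketch of what such a test might look like. The class name, test name and the `run_tests` helper are my own choices for illustration.

```python
import unittest


class HelloWorldTest(unittest.TestCase):
    """The simplest possible test: check a directly assigned string."""

    def test_hello_world(self):
        m = "Hello world!"
        self.assertEqual(m, "Hello world!")


def run_tests():
    """Run the test case programmatically and report whether it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(HelloWorldTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```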

Hello world! in ast

Then I rewrote the class to use ast. My version constructs an abstract syntax tree for an Assignment. Specifically, it assigns the string "Hello world!" to the variable "m". The code then fixes the locations, compiles the code and executes it dynamically.
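The listing itself is missing from this text, so what follows is a sketch of the steps just described rather than the original code. It assumes Python 3.8 or later, where string literals are `ast.Constant` nodes and `ast.Module` takes a `type_ignores` argument (older versions used `ast.Str`). The function and class names are hypothetical.

```python
import ast
import unittest


def build_hello_module():
    """Build the tree for the statement: m = 'Hello world!'"""
    assign = ast.Assign(
        targets=[ast.Name(id="m", ctx=ast.Store())],
        value=ast.Constant(value="Hello world!"),
    )
    module = ast.Module(body=[assign], type_ignores=[])
    # Fill in the line and column numbers that compile() requires.
    ast.fix_missing_locations(module)
    return module


class HelloAstTest(unittest.TestCase):
    def test_hello_world(self):
        namespace = {}
        # Compile the tree to a code object and execute it dynamically.
        code = compile(build_hello_module(), filename="<ast>", mode="exec")
        exec(code, namespace)
        self.assertEqual(namespace["m"], "Hello world!")
```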

Obviously, the above code is a lot more work than simply assigning the string value to the variable directly. But it meant I now had the world's simplest ast program.


Armed with this most basic of unit tests, I was then in a position to work out how to support various other types of code in ast.

For example, here's a snippet of code which uses ast to generate an abstract syntax tree that assigns an empty tuple to a variable named "nothing". In other words, equivalent to nothing = ()
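The snippet is not included in this text. A minimal sketch, assuming Python 3.8 or later and using a hypothetical function name, might look like this:

```python
import ast


def build_empty_tuple_assignment():
    """Build the tree for the statement: nothing = ()"""
    assign = ast.Assign(
        targets=[ast.Name(id="nothing", ctx=ast.Store())],
        # An empty tuple literal is a Tuple node with no elements.
        value=ast.Tuple(elts=[], ctx=ast.Load()),
    )
    module = ast.Module(body=[assign], type_ignores=[])
    return ast.fix_missing_locations(module)
```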

Invoking Methods

One of the hardest things for me to figure out was how to invoke a method of a class.

First, I worked out how to call a function - one not attached to an instance of a class. But to call a method of a class, I needed to understand a bit more about how Python itself is implemented.

Calling Functions

Here's some Python ast code to call a function foo() and assign the returned value to a variable called "result", i.e. equivalent to result = foo(). And here's a variant where you pass in a value, i.e. equivalent to result = bar("some value").
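Since the listings are missing from this text, here is one sketch that covers both cases, assuming Python 3.8 or later. The helpers `build_call_assignment` and `run`, and the bodies of `foo` and `bar`, are placeholders invented for illustration.

```python
import ast


def build_call_assignment(func_name, *arg_values):
    """Build the tree for: result = <func_name>(<constant arguments>)"""
    call = ast.Call(
        func=ast.Name(id=func_name, ctx=ast.Load()),
        args=[ast.Constant(value=v) for v in arg_values],
        keywords=[],
    )
    assign = ast.Assign(
        targets=[ast.Name(id="result", ctx=ast.Store())],
        value=call,
    )
    return ast.fix_missing_locations(ast.Module(body=[assign], type_ignores=[]))


def run(module, namespace):
    """Compile and execute the module, then return the value bound to 'result'."""
    exec(compile(module, filename="<ast>", mode="exec"), namespace)
    return namespace["result"]


def foo():
    return "foo was called"


def bar(value):
    return "bar got " + value
```

The generated code looks up the function by name when it runs, so the namespace passed to `run` has to contain it, for example `run(build_call_assignment("bar", "some value"), {"bar": bar})`.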

In Python, Methods are Attributes of Classes

Having figured out how to call functions and pass parameters to them, I reckoned that calling a method on a class would be similar.
And it sort of was - I still needed to use ast.Call to invoke the method. But it took me quite a while to figure out how to tell it which class method to call. For example, if I wanted to call

result = self._baz(theResult)

should I pass in a function name of "self._baz"? (I tried that - it didn't work). Eventually, I worked out that self._baz is an attribute of the instance object referred to as "self". In Python, instance objects have two kinds of valid attribute names, data attributes and methods. Which meant that the code to call one method of an instance from another method looks like this:
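The original listing is not present in this text. A sketch of the idea, assuming Python 3.8 or later: the function being called is an `ast.Attribute` node (attribute "_baz" of the name "self") instead of a plain `ast.Name`. The `Example` class and the behaviour of `_baz` are hypothetical.

```python
import ast


class Example:
    def _baz(self, value):
        return value.upper()

    def call_baz(self, theResult):
        # Build the tree for: result = self._baz(theResult)
        call = ast.Call(
            func=ast.Attribute(
                value=ast.Name(id="self", ctx=ast.Load()),
                attr="_baz",
                ctx=ast.Load(),
            ),
            args=[ast.Name(id="theResult", ctx=ast.Load())],
            keywords=[],
        )
        assign = ast.Assign(
            targets=[ast.Name(id="result", ctx=ast.Store())],
            value=call,
        )
        module = ast.fix_missing_locations(
            ast.Module(body=[assign], type_ignores=[])
        )
        # The generated code resolves "self" and "theResult" from this namespace.
        namespace = {"self": self, "theResult": theResult}
        exec(compile(module, filename="<ast>", mode="exec"), namespace)
        return namespace["result"]
```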
I had never thought that profoundly about how Python is really implemented behind the scenes, although many of the Python design decisions are actually quite well documented.

An ast Short Cut

In the process of working out how to invoke instance object methods, I came up with a general-purpose shortcut. It turns out that - since Python 2.6 - ast has a very handy helper method called "ast.parse()". This - in combination with ast.dump() - will let you very quickly figure out the correct ast pattern to use for a given bit of Python code. For example, here's how to figure out how to invoke an instance object method
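The example is missing from this text, so here is a minimal sketch of the technique. The `describe` helper is a name I made up; `ast.parse()` and `ast.dump()` are the standard library functions mentioned above.

```python
import ast


def describe(source):
    """Parse a piece of Python source and return a text dump of its tree."""
    return ast.dump(ast.parse(source))


# Shows the node types and fields needed to build this statement by hand.
print(describe("result = self._baz(theResult)"))
```

The exact text of the dump varies between Python versions, but it shows the nesting of the Assign, Call and Attribute nodes, which is the pattern to reproduce when constructing the tree yourself.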

005-Syntax by vicdunk
Hopefully, that will be enough to get you going on your own Python ast adventures!