Monday, November 20, 2017

The View from Barcelona - IPTC AGM 2017

I chair the Board of Directors of IPTC, a consortium of news agencies, publishers and system vendors, which develops and maintains technical standards for news, including NewsML-G2, rNews and News-in-JSON. I work with the Board to broaden adoption of IPTC standards, to maximize information sharing between members and to organize successful face-to-face meetings.

We hold face-to-face meetings in several locations throughout the year, although most of the detailed work of the IPTC is now conducted via teleconferences and email discussions. Our Annual General Meeting for 2017 was held in Barcelona in November. As well as being the time for formal votes and elections, the AGM is a chance for the IPTC to look back over the last year and to look ahead to what is in store. What follows is a slightly edited version of my remarks at the Barcelona AGM.
IPTC has had a good year - the 52nd year for the organization!
We've updated our veteran standards, Photo metadata - our most widely-used standard - and NewsML-G2 - our most comprehensive XML standard, marking its 10th year of development.
We're continuing to work in partnership with other organizations, to maximize the reach and benefits of our work for the news and media industry. In coordination with CEPIC we organized the 10th annual Photo Metadata Conference, looking to the future of auto tagging and search, examining advanced AI techniques - and considering both their benefits and their drawbacks for publishers. With the W3C we have crafted the ODRL rights standard and are launching plans to create RightsML as the official profile of the ODRL standard, endorsed by both the IPTC and W3C.
We've also tackled problems that matter to the media industry with technology solutions which are founded on standards, but go beyond them. The Video Metadata Hub is a comprehensive solution for video metadata management that allows exchange of metadata over multiple existing standards. The EXTRA engine is a Google DNI-sponsored project to create an open source, rules-based classification engine for news.
We've had some changes in the make-up of IPTC. Johan Lindgren of TT joined the Board. Bill Kasdorf has taken over as the PR Chair. And we were thrilled to add Adobe as a voting member of IPTC, after many years of working together on photo metadata standards. Of course, with more mixed emotions, we have also learnt that Michael Steidl, the IPTC Managing Director for 15 years, will retire next summer. As has been clear throughout this meeting and, indeed, every day between the meetings on numerous emails and phone calls, Michael is the backbone of the work of the IPTC. Once again, I ask you to join me in acknowledging the amazing contributions and dedication that Michael displays towards the IPTC.
Later today, we will discuss in detail our plans to recruit a successor for the crucial role of the Managing Director. And this is not the only challenge that the IPTC faces. We describe ourselves as "the global standards body of the news media" and say that "we provide the technical foundation for the news ecosystem". As such, just as the wider news industry is facing a challenging business and technical environment, so is the IPTC.
During this meeting, we've talked about some of the technical challenges - including the continuing evolution of file formats and supporting technologies, whilst many of us are still working to adopt the technologies from 5 or 10 years ago. We've also talked about the erosion of trust in media organizations and whether a combination of editorial and technical solutions can help.
But I thought I would focus on a particular shift in the business and technical environment for news that may well have a bigger impact than all of those. That shift can be traced back to 2014 which, by coincidence, is when I became Chairman of the IPTC. Last week, Andre Staltz published an interesting and detailed article called "The Web Began Dying in 2014, Here's How". If you haven't read it, I recommend it. The article makes a number of interesting points and backs them up with numerous charts and statistics. I will not attempt to summarize the whole thing, but a few key points are worth highlighting.
Staltz points out that, prior to 2014, Google and Facebook accounted for less than 50% of all of the traffic to news publisher websites. Now those two companies alone account for over 75% of referral traffic. Also, through various acquisitions, Google and Facebook properties now share the top ten websites with news publishers - in the USA 6 of the 10 most popular websites are media properties. In Brazil it is also 6 out of 10. In the UK it is 5 out of 10. The rest all belong to Facebook and Google.
Both Facebook and Google reorganized themselves in 2014, to better focus on their core strengths. In 2014, Facebook bought WhatsApp and terminated its search relationship with Bing, effectively relinquishing search to Google and doubling down on social. Also in 2014, Google bought DeepMind and shut down Orkut, its most successful social product. This, along with the reorganization into Alphabet, meant that Google relinquished social to Facebook, allowing it to focus on search and - even more - artificial intelligence. Thus, each company seems happy to dominate its own massive part of the web.
But ... does that matter to media companies? Well, Facebook said if you want optimal performance on our website, you must adopt Instant Articles. Meanwhile, Google requires publishers to use its Accelerated Mobile Pages or "AMP" format for better performance on mobile devices. And, worldwide, Internet traffic is shifting from the desktop to mobile devices.
Then, if you add in Amazon, Apple and Microsoft, it is clear that another huge shift is going on. All of the Frightful Five are turning away from the Web as a source of growth and instead turning to building brand loyalty via high end devices. Following the successful strategy of Apple, they are all becoming hardware manufacturers with walled gardens. Already we have Siri, Cortana, Alexa and Google Home. But also think about the investments going on by these companies in AR and VR as ways to dominate social interactions, e-commerce and machine learning over the Internet.
So, just as news companies must confront these shifts in the global business and technology environment, so must the IPTC. During this meeting, we've talked about our initial efforts to grapple with metadata for AR, VR and 360 degree imagery. We've also discussed techniques which are relevant to news taxonomy and classification, including machine learning and artificial intelligence. At the same time, Facebook, Google and others are not totally in control, as they - along with Twitter - have found themselves having to explain the spread of disinformation on their platforms and coming under increased government scrutiny, particularly in the EU. So, all of us, whether we describe ourselves as news publishers or not, are dealing with a rapidly changing and turbulent information, technical and business environment.
What does this mean for IPTC? IPTC is a news technology standards organization. But it is also unique in that we are composed of news companies from around the world. We know from the membership survey that both of these factors - influence over technical solutions and access to technology peers from competitors, partners, diverse organizations large and small - are very important to current members. In order to prosper as an organization, IPTC needs to preserve these unique benefits to members, but also scale them up. This means that we need to find ways to open up the organization in ways that preserve the value of the IPTC and fit with the mission, but also in ways that are more flexible. We need to continue to move beyond saying that the only thing we work on is standards and instead use standards as a component of the technical solutions we develop, as we are doing with EXTRA and the Video Metadata Hub. We need to work with diverse groups focused on solving specific business and journalistic problems - such as trust in the media - and in helping news companies learn the best ways to work with emerging technologies, whether it is voice assistants, artificial intelligence or virtual reality.
I'm confident that - working together - we can continue to reshape the IPTC to better meet the needs of the membership and to move us further forward in support of solving the business and editorial needs of the news and media industry. I look forward to working with all of you on addressing the challenges in 2018 and beyond.
Thank you.

Wednesday, August 30, 2017

Serverless Tip: Use the "artifact" Directive to Deploy Your Pre-Built Lambda Zip File

tl;dr: you can deploy pre-built zip files (e.g. for your Python Lambda) using the "artifact" directive in the serverless framework.

AWS Lambda is Great!

I've been doing a lot of work recently with AWS Lambda. And I'm a fan. The combination of API Gateway + Lambda + Python, together with other AWS services including DynamoDB and S3, not to mention the awesome array of Python open source libraries, means I'm churning out all sorts of microservices with glee.

The serverless paradigm is quite different than the traditional (serverfull?) paradigm. As well as adjusting the architectural style to take advantage of what Lambda offers, deploying the code and all of its dependencies is quite different. After looking at some alternatives, we concluded that the Serverless Framework best fit our requirements.

The Serverless Framework is Great!

Rather than crafting complex CloudFormation configurations to manage my microservices in AWS, I use the Serverless Framework. (The framework also works with Apache OpenWhisk, Microsoft Azure and Google Cloud). Essentially Serverless is a simpler CloudFormation, specific to Lambda-centric deployments. (To be clear, it doesn't just help with deployments of AWS Lambda - Serverless covers a wide and growing range of AWS services).

Mainly by studying (*cough* copy-n-pasting *cough*) the extensive range of examples, and sometimes resorting to actually reading the manual, I've been able to get even quite complex setups to work, with fairly simple YAML configuration files. So, I recommend Serverless. (Although AWS themselves are developing an eerily similar alternative, in SAM, which you may also want to check out).

AWS Lambda has Limits

Sometimes you need to do stuff outside the Serverless Framework, but you still want to use all the other cool stuff it does for you.

For example, AWS Lambda has certain limits. These include a 50 MB deployment limit per Lambda. Now, Serverless does let you control what goes into the Lambda package via the "include" and "exclude" directives, within the "package" directive. But, sometimes, you're sailing very close to the 50 MB limit and the only way to stay underneath is to create your zip package yourself. Or, in my case, you have a zip file which has precisely what you need, but you also need to manipulate it to add in a pickled bit of code. (Which you do via the Python zipfile library).

It took me a while to figure out, but you can use the "artifact" directive as the way to deploy a zip you've packaged already.
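
As a rough sketch of the zip manipulation described above (not the exact code I used), the snippet below appends a pickled object to an existing zip with the standard zipfile library. The file names "prebuilt.zip", "handler.py" and "model.pkl" and the pickled dictionary are made up for illustration. The path of the resulting zip is what you would then point the "artifact" directive at, under "package" in serverless.yml.

```python
import os
import pickle
import tempfile
import zipfile


def add_pickle_to_zip(zip_path, arcname, obj):
    """Append a pickled copy of `obj` to an existing zip under `arcname`."""
    payload = pickle.dumps(obj)
    # Mode "a" appends to the archive rather than rewriting it.
    with zipfile.ZipFile(zip_path, mode="a", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(arcname, payload)
    return len(payload)


def demo():
    """Build a stand-in for a pre-built Lambda zip, then add a pickle to it."""
    workdir = tempfile.mkdtemp()
    zip_path = os.path.join(workdir, "prebuilt.zip")  # hypothetical file name

    # Stand-in for the zip you already have (normally built elsewhere).
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("handler.py", "def handler(event, context):\n    return 'ok'\n")

    add_pickle_to_zip(zip_path, "model.pkl", {"threshold": 0.5})

    # Read the archive back to confirm the contents and check the size.
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
        restored = pickle.loads(zf.read("model.pkl"))
    size_mb = os.path.getsize(zip_path) / (1024 * 1024)
    return names, restored, size_mb


if __name__ == "__main__":
    names, restored, size_mb = demo()
    print(names, restored, "%.4f MB" % size_mb)
```

Checking the size of the finished zip before deploying is a cheap way to see how close you are to the limit.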

So, there you have it: Lambda is great, but you should use Serverless (or something like it) to simplify your deployments.  And you can deploy pre-built zip files using the "artifact" directive.

Tuesday, August 29, 2017

Emoji, Fake News and 99% Invisible

This morning, I was listening to 99% Invisible, the podcast all about architecture and design.
This episode "Person in Lotus Position" was about the process of adding a new emoji to the official set. At one point, they spoke to Jennifer 8. Lee who is on the Unicode Emoji Subcommittee. I know Jenny through Misinfocon. This is a new effort to fight the spread of disinformation on the web via a Knight-funded Credibility Schema Working Group. The goal is to create ways which signal whether a given piece of information on the web is credible.
Most of the podcast episode describes the workings of the Unicode committee, which is the official standards body for deciding which characters computers and phones will recognize and exchange. It gave a pretty good introduction to the importance and difficulty of this kind of standards work. (As well as being involved in the Credibility Schema Working Group, I'm also the Chairman of the IPTC, the news technology standards body. So, I like to think I have some insight into how these things work).
If you, like me, are interested in emoji and/or the workings of technical standards groups, then I recommend the episode. (Also, if you're interested in stopping the spread of fake news or in promoting technical standards within the global news industry, feel free to get in touch).

Monday, May 1, 2017

Machine Learning LinkLog #2

I knew that my linklog posts would be occasional. But I wasn't expecting to go quite so long between installments as this... But you know what they say: There is no "AI" in "FAILURE". Oh, wait...

Anyway, here is the third in an occasional series of LinkLog posts. I list links to interesting articles, video and audio items on Machine Learning and allied topics. Just like in my first Machine Learning LinkLog, and the one I did on Serverless, I group the items into three broad categories:

  • Introductory Non Technical - Aimed at the general reader or, perhaps, technical manager who wants to learn about Machine Learning, but is not aiming to be a practitioner
  • Introductory Technical - Aimed at someone who is comfortable with programming and technology, but wishes to learn how to work with Machine Learning tools or techniques
  • In Depth Technical - Aimed at someone who is comfortable with the fundamentals of Machine Learning technology, but wants to learn more about a particular aspect or wants to master "day two" problems.
I also indicate whether the item is (mainly) a slidedeck, a video, a single article or a series of items.

Introductory Non Technical

Mix and match analytics: data, metadata, and machine learning for the win

From ZDNet, explores the use case of YouTube video recommendations to illustrate a practical application of machine learning. Along the way, this article touches on video fingerprinting (via hashes) and the importance of descriptive metadata (a topic dear to my own heart).

Gary Marcus on Advancements in Machine Learning

Via MIT Technology Review, an accessible 18 minute video which gives an overview of the current state and challenges for "Deep Learning".

Machine-learning boffins 'summon demons' in AI to find exploitable bugs

A Register news item about a team of researchers using a semi-automated technique called “steered fuzzing” to comb through machine learning programs for bugs. Failures such as mispredictions, or false outcomes, lead to detectable crashes in the program. These types of failures can potentially be exploited as security holes.

6 areas of AI and machine learning to watch closely

Quick overview article of interesting areas in AI/ML - Reinforcement learning (RL), Generative models, Networks with memory, Learning from less data and building smaller models, Hardware for training and inference and Simulation environments.

Introductory Technical


Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.
RAKE (Rapid Automatic Keyword Extraction) is a domain-independent keyword extraction algorithm which determines key phrases by analyzing the frequency of word appearance and its co-occurrence with other words in the text.
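
To make the frequency and co-occurrence idea concrete, here is a minimal pure-Python sketch of RAKE-style scoring. It is not the linked implementation and does not use NLTK: the tiny stopword list is an assumption for illustration, and the scoring follows the commonly described scheme (word score = degree / frequency, phrase score = sum of its word scores).

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real implementations use a much larger one).
STOPWORDS = {"a", "an", "and", "the", "of", "for", "is", "are", "in", "on", "to", "with", "by"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences within the phrase, including the word itself.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = "machine learning for news classification and machine learning for photos"
    for phrase, score in rake_scores(sample):
        print(score, phrase)
```

Words that tend to appear inside longer phrases get a higher degree relative to their frequency, which is why multi-word phrases float to the top.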
An entire book covering machine learning. The online version is free, although you can donate if you like or buy a print copy. Source for the exercises is available via GitHub (in Python).

Deep Learning Papers Reading Roadmap

A nice, regularly-updated, set of Deep Learning papers, with a particular emphasis on speech and image recognition.

In Depth Technical

Generating Politically-Relevant Event Data

Using convolutional nets (deep learning) for event classification, rather than traditional dictionary-based approaches. Shows good results for both English and Arabic and claims that the technique would work well for ontologies in other domains.

Transparent predictions

Since algorithms are being used more and more to make predictions which guide areas such as public policy and policing, should the algorithms be "transparent"? Detailed discussion of what this might mean and why transparency is important. And under what circumstances it might not be desirable.

See also which uses data from LinkedIn to predict white collar crime based on people's faces. (Which looks to me to be a parody, however there are quite a few efforts to use Machine Learning techniques to identify potential criminals based on their faces, posture, etc.)

What are Dimensionality Reduction Techniques?

Dimensionality reduction is the process of reducing the number of random variables in a machine learning data set. It can be divided into feature selection and feature extraction. In many problems, the measured data vectors are high-dimensional but we can try to convert them into a smaller number of variables to deal with. Outlines several techniques which can be tried. (Links to source code in R).
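
For a feel of the simpler of the two branches, here is a small Python sketch of feature selection by variance: columns that barely change across the data set are dropped. The data and the threshold are invented for illustration, and this is not taken from the linked article (which uses R). Feature extraction techniques such as PCA, which build new variables from combinations of the old ones, are not shown.

```python
from statistics import pvariance


def select_by_variance(rows, threshold):
    """Keep only the columns whose population variance exceeds `threshold`.

    Returns the kept column indices and the reduced rows.
    """
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return keep, reduced


if __name__ == "__main__":
    # Three features: one informative, one constant, one nearly constant.
    data = [
        [1.0, 5.0, 0.1],
        [2.0, 5.0, 0.1],
        [3.0, 5.0, 0.2],
        [4.0, 5.0, 0.1],
    ]
    keep, reduced = select_by_variance(data, threshold=0.01)
    print(keep, reduced)
```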

Previous LinkLogs

Tuesday, April 4, 2017

EXTRA Progress - Building an Open Source News Classification Engine

Over the last year, I've been leading a project within the IPTC to build an open source rules-based classification engine for news. Dubbed "EXTRA" (shorthand for EXTraction Rules Apparatus), the software will be freely-available under an MIT license. The work is being funded by a grant of €50,000 from Google's Digital News Initiative Innovation Fund.
“Extra” by Jeremy Brooks
We have drawn up the technical requirements, hired Infalia PC to partner with us on building the software and selected Elasticsearch's percolator as the fundamental technology. We've licensed two news corpora - English from Reuters and German from the Austrian Press Agency. Linguists are creating rules for classifying those corpora with IPTC’s Media Topics using the EXTRA engine.
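
To illustrate what "rules-based classification" means here, the following is a toy Python sketch of the general idea: rules are stored up front and each incoming story is matched against all of them, which is the reverse of a normal search. It is not the EXTRA code and does not use the Elasticsearch API; the rule format and the topic labels are hypothetical, not real IPTC Media Topics.

```python
import re

# Hypothetical rules: each topic needs all of the "all" terms and none of the "none" terms.
RULES = {
    "sport/soccer": {"all": {"goal", "match"}, "none": {"election"}},
    "politics/election": {"all": {"election", "vote"}, "none": set()},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify(text, rules=RULES):
    """Return the sorted labels of every stored rule that the text satisfies."""
    tokens = tokenize(text)
    return sorted(
        label
        for label, rule in rules.items()
        if rule["all"] <= tokens and not (rule["none"] & tokens)
    )


if __name__ == "__main__":
    print(classify("The match ended with a late goal"))
    print(classify("Voters cast their vote in the election"))
```

Because the rules are plain data, linguists can write and revise them without touching the engine itself.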

I'm thrilled to say that the project is on track to deliver a working version of the engine, together with the sample rules, by the summer of 2017. Read more about the project at and feel free to contact me for more information.

Tuesday, February 21, 2017

Serverless LinkLog #1

This is the second in an occasional series of LinkLog posts. I list links to interesting articles, video and audio items on "serverless computing" and allied topics. Just like in my Machine Learning LinkLog #1 I group the items into three broad categories:

Introductory Non Technical - Aimed at the general reader or, perhaps, technical manager who wants to learn about Serverless, but is not aiming to be a practitioner
Introductory Technical - Aimed at someone who is comfortable with programming and technology, but wishes to learn how to work with serverless tools or techniques
In Depth Technical - Aimed at someone who is comfortable with the fundamentals of serverless technology, but wants to learn more about a particular aspect or wants to master "day two" problems.

I also intend to indicate whether the item is (mainly) a slidedeck, a video, a single article or a series of items. A secondary goal is to avoid the trope of using pictures of clouds to illustrate serverless topics - however, I apologize in advance if I fail.

If You Only Read One Thing About Serverless

Having said all of that, if you only want to read one thing about serverless, then I recommend "Serverless Architectures" by Mike Roberts. It will take you from 0 to 60 - not only understanding what serverless architectures are and how they relate to Backend-As-A-Service and Function-As-A-Service, but also their future potential and current shortcomings. This is for both technical and non technical people and will give you much greater insight than any of the introductory material listed below. However, it is quite long - I wound up reading it in chunks. So, you might read one or more of the "intro" items, before coming back to this one.

Introductory Non Technical

How Can Enterprises Leverage Serverless Computing Platforms?
Two page introduction from Forbes, talks a lot about AWS Lambda, but mentions competitors. Discusses how various large enterprises are blending serverless with their other compute strategies.

A Guide to Serverless Computing with AWS Lambda
Decent introduction for technical managers - has some source code, but it really isn't aimed at helping you write Lambda code. Very AWS specific, of course, but has a nod to competitor offerings.

AWS Lambda is Everywhere
Adrian Cockcroft gives an overview of AWS Lambda, with particular emphasis on the latest (November 2016) announcements from AWS re:Invent.

Evolution of Business Logic from Monoliths through Microservices, to Functions

Adrian Cockcroft again - portraying AWS Lambda as the logical next step in the business of computing.

Two part article from Charity Majors. Pours cold water on the notion that "serverless == noops" i.e. that just because you are not provisioning servers yourself that you therefore don't need to worry about operational problems anymore. In fact, she makes a compelling case that serverless requires all developers to think much more profoundly about operational, "day two" problems. She also points out how much more opaque serverless applications are, making them much harder to debug.

Introductory Technical

This is a bit confusing but ... there is an open source framework called "Serverless" aimed at making it easier to work with various technologies - including AWS Lambda, Azure Functions and Google CloudFunctions. There are other frameworks (see Zappa and Apex below) but this one seems popular (maybe because it is likely to come up when you google for "serverless"? Or, more seriously, maybe because it supports a variety of technologies and deployment scenarios - including local for development).

Open source framework for turning Python WSGI applications into serverless apps using AWS Lambda and API Gateway. Supports advanced capabilities, including scheduling and keep_warm (for better performance).

Open source framework aimed at making AWS Lambda easier to work with. Adds support for non Lambda-native languages (such as golang) via a node.js shim.

Writing a cron job microservice with Serverless and AWS Lambda
A hands-on article that takes you through a practical example of how to write a cron-equivalent using Serverless (the framework) and AWS Lambda.

How to build powerful back-ends easily with Serverless
An image processing server in node.js with the Serverless framework, using a bundle of AWS services - Lambda, Rekognition and S3.

Bring static to life using serverless

Serverless Video Playlist
Nine-video playlist of hands-on basic coding with AWS serverless with an emphasis on Lambda and node.js.

In Depth Technical

Serverless Workflows on AWS: My Journey From SWF to Step Functions
In depth technical discussion of AWS Simple Workflow and Lambda, with an emphasis on AWS Step Functions - sort of like Finite State Machines.

Fission: Serverless Functions as a Service for Kubernetes
Using Kubernetes to host Functions as a Service. Kubernetes is a way to host Linux containers within a cluster. So, this is a way to achieve a serverless architecture without using any of Lambda, Azure or CloudFunctions. In other words, this is a kind of "private serverless" (akin to "private cloud").

Airbnb's open source framework for analyzing massive log streams and alerting on them by defining rules in Python. Uses AWS Lambda and Kinesis, amongst other technologies.

Wednesday, January 25, 2017

Machine Learning LinkLog #1

This is the first in an occasional series of LinkLog posts. I list links to interesting articles, video and audio items on Machine Learning and allied topics. I group them into three broad categories:

  • Introductory Non Technical - Aimed at the general reader or, perhaps, technical manager who wants to learn about Machine Learning, but is not aiming to be a practitioner
  • Introductory Technical - Aimed at someone who is comfortable with programming and technology, but wishes to learn how to work with Machine Learning tools or techniques
  • In Depth Technical - Aimed at someone who is comfortable with the fundamentals of Machine Learning technology, but wants to learn more about a particular aspect or wants to master "day two" problems.

I also intend to indicate whether the item is (mainly) a slidedeck, a video, a single article or a series of items. My goal is to avoid the trope of using pictures of robots to illustrate machine learning topics. However, I apologize in advance if I fail (and you'll certainly find plenty of robot pictures in the items I link to).

Introductory Non Technical

What is Machine Learning?

Short, two-and-a-half minute explainer video from Oxford Sparks ("the amazing stories of science taking place at the University of Oxford"). Accurate and accessible, though potentially misleading as it implies that people aren't needed at all. (Note that the video doesn't say this directly, but you have to listen carefully to pick up that nuance). Avoids the use of unnecessary jargon.

Hype vs. Reality: The AI Explainer

Twenty-eight-slide explainer deck from Luminary Labs. Covers the different aspects of AI that are currently en vogue. Attempts to predict what is likely to actually succeed and what is probably just hype or misunderstanding.

Top 10 Hot Artificial Intelligence (AI) Technologies

An overview of the findings from a Forrester Research report about Artificial Intelligence in 2017.

Kristen Stewart co-wrote a paper on machine learning

The actress and director co-authored a paper on 'style transfers'. This is a neural-network technique to blend the content of one photo with the style of another. (It is popular with apps such as Prisma). Stewart and her team used the technique in her directorial debut "Come Swim" to create dream-like sequences in the film. The paper describes tricks she used to better control the style transfer effects. This article links in depth technical topics (neural networks and visual processing) with consumer products (feature films).

Introductory Technical

Machine Learning is Fun

A series of articles introducing various technical aspects of Machine Learning. Available in several languages.

A Flask API for serving scikit-learn models

Assumes a lot of scikit knowledge, but a good overview of how to create an API using Flask on the front end and scikit on the backend. The API is not truly RESTful (the endpoints are verbs not nouns, for example). However, it is still a useful introduction and contains the full Python code.

Hitchhiker's Guide to Data Science, Machine Learning, R, Python

A "best of" collection of links, relating to Machine Learning and Data Science, with a particular emphasis on Python and R.

In Depth Technical

Google's 43 Practical Rules of Machine Learning in Industry
Martin Zinkevich compiled 43 rules "intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google." Here are three(!) different presentations of those rules.

Predicting with confidence: the best machine learning idea you never heard of

An article advocating the use of "conformal prediction", a technique for calculating confidence intervals for your machine learning predictions, no matter what the forecast technique or dataset you're working with.
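
As a hedged illustration of the idea (not code from the linked article), here is a minimal Python sketch of one common variant, often called split conformal prediction. A held-out calibration set is used to measure how wrong the model tends to be, and a quantile of those errors becomes the half-width of the interval. The function names, the toy model and the calibration data are all invented for the example. Under the usual exchangeability assumption, the interval covers the true value with probability of at least 1 - alpha, whatever the underlying model.

```python
import math


def conformal_quantile(residuals, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration residual."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return math.inf
    return sorted(residuals)[rank - 1]


def predict_interval(predict, calibration, x, alpha=0.1):
    """Wrap any point predictor in an interval sized from held-out errors.

    `predict` is any function x -> prediction; `calibration` is a list of
    (x, y) pairs that were NOT used to fit the model.
    """
    residuals = [abs(y - predict(xi)) for xi, y in calibration]
    q = conformal_quantile(residuals, alpha)
    center = predict(x)
    return center - q, center + q


if __name__ == "__main__":
    def model(x):
        # Toy stand-in for a fitted model.
        return 2 * x

    errors = [1, -2, 3, -4, 5, -6, 7, -8, 9]
    calibration = [(i, 2 * i + e) for i, e in enumerate(errors)]
    print(predict_interval(model, calibration, 10, alpha=0.25))
```

Nothing in the wrapper depends on how the model was built, which is the sense in which the technique works "no matter what the forecast technique".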

Wednesday, January 11, 2017

Developing the Digital Marketplace for Copyrighted Works

I recently spoke at "Developing the Digital Marketplace for Copyrighted Works", organized by the Commerce Department’s Internet Policy Taskforce. The goal of the public meeting was to "facilitate constructive, cross-industry dialogue among stakeholders about ways to promote a more robust and collaborative digital marketplace for copyrighted works". My impression of the roughly 80 attendees was of a mix of publishers and lawyers - who tended to be pretty conservatively dressed - and music people - who tended to be in more sparkly outfits. Most of the discussion revolved around music, photo and video, but it turned out that a lot of the problems and potential solutions were quite similar across industries and media types. I've linked to the video of the event at the end of this post.

I've been involved in rights work at the Associated Press, including adding rights and pricing metadata to AP's Image API. I lead the IPTC's Rights Working Group. And I'm working within the W3C's POE group to turn ODRL into an official standard.

I spoke on the first panel with the topic of "Unique Identifiers and Metadata". I was teased a bit about "fake news" (this was in Alexandria, VA on December 9th 2016, so close both in time and place to the U.S. Presidential Election). Amongst other things, I spoke about how apparently simple things - "let's agree on identifiers for photos" - turned out to be quite complicated - since a text item, a photo or a video is not really a single, simple atomic thing, but more like a molecule of information. (You can watch the entire panel - which turned out to be quite lively, despite the early hour - in the video linked below).

I also moderated a round table, with the topic "What are the practical steps to adopting standards for identifying and controlling copyrighted works?". As everyone at my table introduced themselves, they mostly said "oh, I'm just here to learn, I don't have much to contribute" but, in fact, we had a very vigorous discussion, which covered *lots* of topics! I summarized them during the "Plenary" session (again, I've linked to the video below). We talked about three areas. First, was why we need standards - creators and rights holders should be compensated for their work, which could be financial compensation or it could be getting distribution and recognition. Second, we talked about the big barriers - technology, the culture of the Internet and human nature itself. Finally, we talked about concrete steps which the government and other organizations could take to get standards developed and adopted. (For the details, you'll need to watch the video. My segment runs from about 39:30 to about 45:55 but I recommend watching everyone's summary of their individual breakout sessions)

If you're interested in rights, then you should consider coming to London for the week of May 15th. That's because the BBC is hosting a Rights Day on May 15th, the IPTC will be holding its Spring Meeting (including discussing RightsML) on May 16th and 17th and W3C will hold its face-to-face meeting on May 18th and 19th. If you're interested in any or all, contact me and I will put you in touch with the right rights people.

Opening Remarks and Panel Session 1: Unique Identifiers and Metadata
Panel Session 2: Registries and Rights Expression Languages

Panel Session 3: Digital Marketplaces

Plenary Session