Monday, May 1, 2017

Machine Learning LinkLog #2

I knew that my linklog posts would be occasional. But I wasn't expecting to go quite so long between installments as this... But you know what they say: There is no "AI" in "FAILURE". Oh, wait...

Anyway, here is the third in an occasional series of LinkLog posts. I list links to interesting articles, video and audio items on Machine Learning and allied topics. Just like in my first Machine Learning LinkLog, and the one I did on Serverless, I group the items into three broad categories:

  • Introductory Non Technical - Aimed at the general reader or, perhaps, technical manager who wants to learn about Machine Learning, but is not aiming to be a practitioner
  • Introductory Technical - Aimed at someone who is comfortable with programming and technology, but wishes to learn how to work with Machine Learning tools or techniques
  • In Depth Technical - Aimed at someone who is comfortable with the fundamentals of Machine Learning technology, but wants to learn more about a particular aspect or wants to master "day two" problems.
I also indicate whether the item is (mainly) a slidedeck, a video, a single article or a series of items.

Introductory Non Technical

Mix and match analytics: data, metadata, and machine learning for the win

From ZDNet, explores the use case of YouTube video recommendations to illustrate a practical application of machine learning. Along the way, this article touches on video fingerprinting (via hashes) and the importance of descriptive metadata (a topic dear to my own heart).

Gary Marcus on Advancements in Machine Learning

Via MIT Technology Review, an accessible 18 minute video which gives an overview of the current state and challenges for "Deep Learning".

Machine-learning boffins 'summon demons' in AI to find exploitable bugs

A Register news item about a team of researchers using a semi-automated technique called “steered fuzzing” to comb through machine learning programs for bugs. Failures such as mispredictions, or false outcomes, lead to detectable crashes in the program. These types of failures can potentially be exploited as security holes.
https://www.theregister.co.uk/2017/01/24/summoning_demons_to_find_bugs/

6 areas of AI and machine learning to watch closely

Quick overview article of interesting areas in AI/ML - Reinforcement learning (RL), Generative models, Networks with memory, Learning from less data and building smaller models, Hardware for training and inference and Simulation environments.

Introductory Technical

rake-nltk

Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.
RAKE (Rapid Automatic Keyword Extraction) is a domain-independent keyword extraction algorithm which determines key phrases by analyzing the frequency of each word's appearance and its co-occurrence with other words in the text.
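
To make the idea concrete, here is a minimal sketch of RAKE-style scoring in plain Python. It does not use rake-nltk or NLTK at all: the tiny stopword list and the function names (candidate_phrases, rake) are my own assumptions for illustration. Each word is scored as its degree (how many words it co-occurs with inside candidate phrases, itself included) divided by its frequency, and each phrase is scored as the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real use needs a fuller list).
STOPWORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "it", "its",
             "of", "on", "other", "the", "to", "which", "with"}


def candidate_phrases(text):
    """Split text into runs of non-stopwords, also breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word appears
    degree = defaultdict(int)  # total length of the phrases each word appears in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    # Sort by descending score, then alphabetically so ties are stable.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Scoring this way tends to rank longer multi-word phrases first. For real work, the library itself (with a proper stopword list) is the place to start.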

An entire book covering machine learning. The online version is free, although you can donate if you like or buy a print copy. Source code for the exercises (in Python) is available via GitHub.

Deep Learning Papers Reading Roadmap

A nice, regularly-updated, set of Deep Learning papers, with a particular emphasis on speech and image recognition.

In Depth Technical

Generating Politically-Relevant Event Data

Using convolutional nets (deep learning) for event classification, rather than traditional dictionary-based approaches. Shows good results for both English and Arabic and claims that the technique would work well for ontologies in other domains.

Transparent predictions

Since algorithms are being used more and more to make predictions which guide areas such as public policy and policing, should the algorithms be "transparent"? Detailed discussion of what this might mean and why transparency is important, and under what circumstances it might not be desirable.

See also https://whitecollar.thenewinquiry.com/ which uses data from LinkedIn to predict white collar crime based on people's faces. (Which looks to me to be a parody, however there are quite a few efforts to use Machine Learning techniques to identify potential criminals based on their faces, posture, etc.)

What are Dimentionality Reduction Techniques?

Dimensionality reduction is the process of reducing the number of random variables in a machine learning data set. It can be divided into feature selection and feature extraction. In many problems, the measured data vectors are high-dimensional, but we can try to convert them into a smaller number of variables to deal with. Outlines several techniques which can be tried. (Links to source code in R).

https://analyticsdataexploration.com/what-are-dimentionality-reduction-techniques/
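
As a rough illustration of the two families mentioned above, here is a small sketch in Python (rather than the R used by the article). The function names and the approach are my own assumptions, not taken from the article: feature selection is shown by dropping low-variance columns, and feature extraction by projecting onto the first principal component, found with plain power iteration.

```python
from statistics import mean, pvariance


def select_features(rows, min_variance):
    """Feature selection: keep only the columns whose variance exceeds a threshold."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    return keep, [[row[i] for i in keep] for row in rows]


def first_principal_component(rows, iterations=100):
    """Feature extraction: replace each row with one number, its coordinate
    along the direction of greatest variance (found by power iteration)."""
    n_cols = len(rows[0])
    means = [mean(col) for col in zip(*rows)]
    centred = [[x - m for x, m in zip(row, means)] for row in rows]
    # Unnormalised covariance matrix; scaling does not change the direction.
    cov = [[sum(r[i] * r[j] for r in centred) for j in range(n_cols)]
           for i in range(n_cols)]
    vec = [1.0] * n_cols
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(n_cols)) for i in range(n_cols)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0:  # no variance at all: nothing to extract
            break
        vec = [x / norm for x in nxt]
    return [sum(x * v for x, v in zip(row, vec)) for row in centred]
```

Selection keeps some of the original variables untouched, whereas extraction builds new variables out of combinations of the old ones. In practice you would reach for a library rather than hand-rolled loops like these.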

Previous LinkLogs




Tuesday, April 4, 2017

EXTRA Progress - Building an Open Source News Classification Engine

Over the last year, I've been leading a project within the IPTC to build an open source rules-based classification engine for news. Dubbed "EXTRA" (shorthand for EXTraction Rules Apparatus), the software will be freely-available under an MIT license. The work is being funded by a grant of €50,000 from Google's Digital News Initiative Innovation Fund.
“Extra” by Jeremy Brooks https://flic.kr/p/4aKH3c
We have drawn up the technical requirements, hired Infalia PC to partner with us on building the software and selected Elasticsearch's percolator as the fundamental technology. We've licensed two news corpora - English from Reuters and German from the Austrian Press Agency. Linguists are creating rules for classifying those corpora with IPTC’s Media Topics using the EXTRA engine.

I'm thrilled to say that the project is on track to deliver a working version of the engine, together with the sample rules, by the summer of 2017. Read more about the project at https://iptc.org/news/extra-iptc-infalia-elasticsearch-open-source-rules-based-classification-engine/ and feel free to contact me for more information.

Tuesday, February 21, 2017

Serverless LinkLog #1

This is the second in an occasional series of LinkLog posts. I list links to interesting articles, video and audio items on "serverless computing" and allied topics. Just like in my Machine Learning LinkLog #1 I group the items into three broad categories:

  • Introductory Non Technical - Aimed at the general reader or, perhaps, technical manager who wants to learn about Serverless, but is not aiming to be a practitioner
  • Introductory Technical - Aimed at someone who is comfortable with programming and technology, but wishes to learn how to work with serverless tools or techniques
  • In Depth Technical - Aimed at someone who is comfortable with the fundamentals of serverless technology, but wants to learn more about a particular aspect or wants to master "day two" problems.

I also intend to indicate whether the item is (mainly) a slidedeck, a video, a single article or a series of items. A secondary goal is to avoid the trope of using pictures of clouds to illustrate serverless topics - however, I apologize in advance if I fail.

If You Only Read One Thing About Serverless

Having said all of that, if you only want to read one thing about serverless, then I recommend "Serverless Architectures" by Mike Roberts. It will take you from 0 to 60 - not only understanding what serverless architectures are and how they relate to Backend-As-A-Service and Function-As-A-Service, but also their future potential and current shortcomings. This is for both technical and non technical people and will give you much greater insight than any of the introductory material listed below. However, it is quite long - I wound up reading it in chunks. So, you might read one or more of the "intro" items, before coming back to this one.

Introductory Non Technical

How Can Enterprises Leverage Serverless Computing Platforms?
Two page introduction from Forbes, talks a lot about AWS Lambda, but mentions competitors. Discusses how various large enterprises are blending serverless with their other compute strategies.

A Guide to Serverless Computing with AWS Lambda
Decent introduction for technical managers - has some source code, but it really isn't aimed at helping you write Lambda code. Very AWS specific, of course, but has a nod to competitor offerings.

AWS Lambda is Everywhere
Adrian Cockcroft gives an overview of AWS Lambda, with particular emphasis on the latest (November 2016) announcements from AWS re:Invent.

Evolution of Business Logic from Monoliths through Microservices, to Functions

Adrian Cockcroft again - portraying AWS Lambda as the logical next step in the business of computing.

WTF IS OPERATIONS? #SERVERLESS
OPERATIONAL BEST PRACTICES #SERVERLESS
Two part article from Charity Majors. Pours cold water on the notion that "serverless == noops" i.e. that just because you are not provisioning servers yourself, you no longer need to worry about operational problems. In fact, she makes a compelling case that serverless requires all developers to think much more profoundly about operational, "day two" problems. She also points out how much more opaque serverless applications are, making them much harder to debug.


Introductory Technical

Serverless
This is a bit confusing but ... there is an open source framework called "Serverless" aimed at making it easier to work with various technologies - including AWS Lambda, Azure Functions and Google Cloud Functions. There are other frameworks (see Zappa and Apex below) but this one seems popular (maybe because it is likely to come up when you google for "serverless"? Or, more seriously, maybe because it supports a variety of technologies and deployment scenarios - including local for development).
http://www.serverless.com

Zappa
Open source framework for turning Python WSGI applications into serverless apps using AWS Lambda and API Gateway. Supports advanced capabilities, including scheduling and keep_warm (for better performance).
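
If "WSGI application" is unfamiliar: it is just a Python callable with a particular signature, which is what makes it possible for a framework to host it somewhere other than a traditional web server. Here is a minimal sketch using only the standard library. Nothing in it is Zappa-specific, and the names app and call_locally are my own, chosen for illustration.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A minimal WSGI application: a callable that takes the request
    environment and a start_response callback, and returns the body."""
    body = f"Hello from {environ.get('PATH_INFO', '/')}".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_locally(path):
    """Invoke the app without any server, roughly as a WSGI host would."""
    environ = {}
    setup_testing_defaults(environ)  # fill in a plausible fake request
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body.decode("utf-8")
```

Because the app never touches a socket itself, anything that can build the environ dictionary and collect the response can run it, whether that is a local development server or a function-as-a-service wrapper.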

Apex
Open source framework aimed at making AWS Lambda easier to work with. Adds support for non Lambda-native languages (such as golang) via a node.js shim.

Writing a cron job microservice with Serverless and AWS Lambda
A hands on article takes you through a practical example of how to write a cron-equivalent using Serverless (the framework) and AWS Lambda.

How to build powerful back-ends easily with Serverless
An image processing server in Node.js with the Serverless framework, using a bundle of AWS services - Lambda, Rekognition and S3.

Bring static to life using serverless

Serverless Video Playlist
Nine-video playlist of hands-on basic coding with AWS serverless, with an emphasis on Lambda and Node.js.

In Depth Technical

Serverless Workflows on AWS: My Journey From SWF to Step Functions
In depth technical discussion of AWS Simple Workflow and Lambda, with an emphasis on AWS Step Functions - sort of like Finite State Machines.
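
For anyone who has not met finite state machines, here is a toy sketch of the general idea in Python: named states, a handler per state, and a transition table that picks the next state from the handler's outcome. To be clear, this is not the Amazon States Language and not how Step Functions is configured. The workflow, state names and run function are all hypothetical.

```python
# Hypothetical order workflow: each state has a handler that returns an
# outcome, and a table mapping outcomes to the next state.
WORKFLOW = {
    "start_at": "validate",
    "states": {
        "validate": {
            "handler": lambda order: "ok" if order["quantity"] > 0 else "bad",
            "next": {"ok": "charge", "bad": "rejected"},
        },
        "charge": {
            "handler": lambda order: "ok",
            "next": {"ok": "done"},
        },
        "done": {"terminal": True},
        "rejected": {"terminal": True},
    },
}


def run(workflow, payload):
    """Walk the machine from its start state until a terminal state,
    returning the list of states visited."""
    name = workflow["start_at"]
    path = [name]
    while not workflow["states"][name].get("terminal"):
        state = workflow["states"][name]
        outcome = state["handler"](payload)
        name = state["next"][outcome]
        path.append(name)
    return path
```

The appeal of describing a workflow this way is that the control flow lives in data rather than being scattered through the functions themselves, so each function can stay small and unaware of what runs before or after it.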

Fission: Serverless Functions as a Service for Kubernetes
Using Kubernetes to host Functions as a Service. Kubernetes is a way to host Linux containers within a cluster. So, this is a way to achieve a serverless architecture without using any of Lambda, Azure or CloudFunctions. In other words, this is a kind of "private serverless" (akin to "private cloud").

StreamAlert
Airbnb's open source framework for analyzing massive log streams and alerting on them by defining rules in Python. Uses AWS Lambda and Kinesis, amongst other technologies.

Wednesday, January 25, 2017

Machine Learning LinkLog #1

This is the first in an occasional series of LinkLog posts. I list links to interesting articles, video and audio items on Machine Learning and allied topics. I group them into three broad categories:

  • Introductory Non Technical - Aimed at the general reader or, perhaps, technical manager who wants to learn about Machine Learning, but is not aiming to be a practitioner
  • Introductory Technical - Aimed at someone who is comfortable with programming and technology, but wishes to learn how to work with Machine Learning tools or techniques
  • In Depth Technical - Aimed at someone who is comfortable with the fundamentals of Machine Learning technology, but wants to learn more about a particular aspect or wants to master "day two" problems.

I also intend to indicate whether the item is (mainly) a slidedeck, a video, a single article or a series of items. My goal is to avoid the trope of using pictures of robots to illustrate machine learning topics. However, I apologize in advance if I fail (and you'll certainly find plenty of robot pictures in the items I link to).

Introductory Non Technical

What is Machine Learning?

Short, two-and-a-half minute explainer video from Oxford Sparks ("the amazing stories of science taking place at the University of Oxford"). Accurate and accessible, though potentially misleading as it implies that people aren't needed at all. (Note that the video doesn't say this directly, but you have to listen carefully to pick up that nuance). Avoids the use of unnecessary jargon.
https://www.youtube.com/watch?v=f_uwKZIAeM0

Hype vs. Reality: The AI Explainer

Twenty-eight-slide explainer deck from Luminary Labs. Covers the different aspects of AI that are currently en vogue. Attempts to predict what is likely to actually succeed and what is probably just hype or misunderstanding.

Top 10 Hot Artificial Intelligence (AI) Technologies

An overview of the findings from a Forrester Research report about Artificial Intelligence in 2017.

Kristen Stewart co-wrote a paper on machine learning

The actress and director co-authored a paper on 'style transfers'. This is a neural-network technique to blend the content of one photo with the style of another. (It is popular with apps such as Prisma). Stewart and her team used the technique in her directorial debut "Come Swim" to create dream-like sequences in the film. The paper describes tricks she used to better control the style transfer effects. This article links in depth technical topics (neural networks and visual processing) with consumer products (feature films).
https://www.engadget.com/2017/01/20/kristen-stewart-paper-style-transfers-come-swim/

Introductory Technical

Machine Learning is Fun

A series of articles introducing various technical aspects of Machine Learning. Available in several languages.
https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471#.nzyqhondf

A Flask API for serving scikit-learn models

Assumes a lot of scikit knowledge, but a good overview of how to create an API using Flask on the front end and scikit on the backend. The API is not truly RESTful (the endpoints are verbs not nouns, for example). However, it is still a useful introduction and contains the full Python code.
https://medium.com/@amirziai/a-flask-api-for-serving-scikit-learn-models-c8bcdaa41daa#.ti049hnb2

Hitchhiker's Guide to Data Science, Machine Learning, R, Python

A "best of" collection of links, relating to Machine Learning and Data Science, with a particular emphasis on Python and R.
http://www.datasciencecentral.com/profiles/blogs/hitchhiker-s-guide-to-data-science-machine-learning-r-python

In Depth Technical

Google's 43 Practical Rules of Machine Learning in Industry
Martin Zinkevich compiled 43 rules "intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google." Here are three(!) different presentations of those rules.



Predicting with confidence: the best machine learning idea you never heard of

An article advocating the use of "conformal prediction", a technique for calculating confidence intervals for your machine learning predictions, no matter what the forecast technique or dataset you're working with.
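
To give a flavor of why this works with any forecast technique, here is a minimal sketch of one common variant, split (or "inductive") conformal prediction for regression, in plain Python. The article may well describe a different variant, and the function names here are my own. The model is treated as a black-box predict function: you measure how wrong it is on held-out calibration data, then use a quantile of those errors as the half-width of the interval.

```python
import math


def conformal_radius(predict, calibration, alpha=0.1):
    """Return q such that [predict(x) - q, predict(x) + q] should cover
    roughly a (1 - alpha) share of new points, given held-out (x, y) pairs."""
    scores = sorted(abs(y - predict(x)) for x, y in calibration)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def predict_interval(predict, calibration, x, alpha=0.1):
    """Wrap any point prediction in an interval of half-width q."""
    q = conformal_radius(predict, calibration, alpha)
    centre = predict(x)
    return centre - q, centre + q
```

Note that the calibration data must not have been used to fit the model, and that the coverage guarantee relies on the new points being exchangeable with the calibration points. The model itself can be anything at all.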

Wednesday, January 11, 2017

Developing the Digital Marketplace for Copyrighted Works

I recently spoke at "Developing the Digital Marketplace for Copyrighted Works", organized by the Commerce Department’s Internet Policy Taskforce. The goal of the public meeting was to "facilitate constructive, cross-industry dialogue among stakeholders about ways to promote a more robust and collaborative digital marketplace for copyrighted works". My impression of the roughly 80 attendees was of a mix of publishers and lawyers - who tended to be pretty conservatively dressed - and music people - who tended to be in more sparkly outfits. Most of the discussion revolved around music, photo and video, but it turned out that a lot of the problems and potential solutions were quite similar across industries and media types. I've linked to the video of the event at the end of this post.

I've been involved in rights work at the Associated Press, including adding rights and pricing metadata to AP's Image API. I lead the IPTC's Rights Working Group. And I'm working within the W3C's POE group to turn ODRL into an official standard.


I spoke on the first panel with the topic of "Unique Identifiers and Metadata". I was teased a bit about "fake news" (this was in Alexandria, VA on December 9th 2016, so close both in time and place to the U.S. Presidential Election). Amongst other things, I spoke about how apparently simple things - "let's agree on identifiers for photos" - turned out to be quite complicated - since a text item, a photo or a video is not really a single, simple atomic thing, but more like a molecule of information. (You can watch the entire panel - which turned out to be quite lively, despite the early hour - in the video linked below).

I also moderated a round table, with the topic "What are the practical steps to adopting standards for identifying and controlling copyrighted works?". As everyone at my table introduced themselves, they mostly said "oh, I'm just here to learn, I don't have much to contribute" but, in fact, we had a very vigorous discussion, which covered *lots* of topics! I summarized them during the "Plenary" session (again, I've linked to the video below). We talked about three areas. First was why we need standards - creators and rights holders should be compensated for their work, which could be financial compensation or it could be getting distribution and recognition. Second, we talked about the big barriers - technology, the culture of the Internet and human nature itself. Finally, we talked about concrete steps which the government and other organizations could take to get standards developed and adopted. (For the details, you'll need to watch the video. My segment runs from about 39:30 to about 45:55 but I recommend watching everyone's summary of their individual breakout sessions.)

If you're interested in rights, then you should consider coming to London for the week of May 15th. That's because the BBC is hosting a Rights Day on May 15th, the IPTC will be holding its Spring Meeting (including discussing RightsML) on May 16th and 17th and W3C will hold its face-to-face meeting on May 18th and 19th. If you're interested in any or all, contact me and I will put you in touch with the right rights people.

Opening Remarks and Panel Session 1: Unique Identifiers and Metadata
Panel Session 2: Registries and Rights Expression Languages


Panel Session 3: Digital Marketplaces

Plenary Session



Tuesday, November 22, 2016

The View From Berlin - IPTC AGM 2016

I Chair the Board of Directors of IPTC, a consortium of news agencies, publishers and system vendors, which develops and maintains technical standards for news, including NewsML-G2, rNews and News-in-JSON. I work with the Board to broaden adoption of IPTC standards, to maximize information sharing between members and to organize successful face-to-face meetings.

We hold face-to-face meetings in several locations throughout the year, although most of the detailed work of the IPTC is now conducted via teleconferences and email discussions. Our Annual General Meeting for 2016 was held in Berlin in October. As well as being the time for formal votes and elections, the AGM is a chance for the IPTC to look back over the last year and to look ahead to what is in store. What follows are my prepared remarks at the Berlin AGM.

The Only Constant

It is clear that the news industry is experiencing a great degree of change. The business side of news continues to be under pressure. And, in no small part, this is because the technology involved in the creation and distribution of news continues to rapidly evolve.

However, in many ways, this is a golden age of journalism. The demand for news and information has never been higher. The immediate and widespread distribution of news has never been easier.

The IPTC has been around for 51 years. I've been a delegate to the IPTC since 2000 and Chairman of the Board since June 2014. I'd like to give my perspective on the changes going on within the news industry and how IPTC has responded and will respond.

We're On a Mission

IPTC is rooted in - and foundational to - the news industry. Our open source standards for news technology enable the operations of hundreds of news and media organizations, large and small. IPTC standards are instrumental in the software used to create, edit, archive and distribute news and information around the world.

We are starting to evolve the scope of our work beyond standards - such as via the EXTRA project to build an open source rules-based classification engine. Much of what we do is relevant to not only news agencies and publishers, but also to photographers, videographers, academics and archivists. By bringing together these diverse groups, we can not only create powerful, efficient standards and technologies, but also learn from each other about what works and what does not.

Ch-ch-changes

We've introduced quite a bit of change within the IPTC since I've become Chairman and that has continued over the last year.

What's Going On?

We're working to improve our existing family of standards by
  • continuing to improve documentation - to make it easier to get going with a standard and simpler to grasp the nuances when you want to expand your implementation
  • making our standards more coherent and consistent - as many organizations need to use a combination
We're extending the reach of the IPTC, both by working with other organizations (including PRISM, IIIF, WAN-IFRA and W3C) and by engaging in new types of work such as EXTRA and the Video Metadata Hub, which are not traditional standards but are open source projects for the benefit of the community we serve.

Since I've become Chair, we've renewed our efforts to communicate the great work that we do. You can see a big uptick in our engagement via Twitter and LinkedIn, and we have refreshed the design of the IPTC website. Plus we're doing a lot more work "out in the open" on GitHub.

We're continuing to streamline the operations of the IPTC. We've simplified our processes to better reflect the ways we actually operate these days. For example we have dramatically reduced the number of formal votes we take. But we still have sufficient process in place to ensure that the interests of all members are protected. For 2017, we have decided to have two-plus-one face-to-face meetings, rather than our usual three-plus-one. We will hold two full face-to-face meetings (one in London, the other in Barcelona), plus our one day Photo Metadata conference in association with the CEPIC Conference in Berlin. This will allow us to intensify our work on the meetings, with more ambitious and compelling topics and speakers.

Do Better

As I said, we've been changing our processes, particularly for the face-to-face meetings. But what else could we do to simplify our processes whilst at the same time ensuring that there is a balance between the interests of all members? Are there ways for the IPTC to deliver more value to the membership? How do we continue to balance our policy of consensus-driven decision-making with the need to be more flexible and nimble?

IPTC is a membership-driven organization. Membership fees represent the vast majority of the revenue for our organization. As the news industry as a whole continues to feel pressure - including downsizing, mergers and, unfortunately some members going out of business - the IPTC is experiencing downward pressure on its own revenue. So, we are working on ways to reach new members, whilst at the same time ensuring that existing members continue to derive value. We're also open to exploring new ways of generating revenue which fit with our mission - let us know your ideas!

What new areas should the IPTC focus on? Many journalists are experimenting with an array of technologies - Augmented Reality, Virtual Reality, 360 degree photos, drones and bots, to name but a few. And let's not forget about the "Cambrian Explosion" of technologies related to news and metadata on the Web, including AMP, AppleNews, Instant Articles, rNews, Schema.org and OpenGraph. How can IPTC help - negotiating standards? Developing best practices? Navigating the ethics of these technologies?

Happy

If you're happy with the IPTC, then please tell others.

If you're not happy, then please tell me!

I Want to Thank You

Without you, the members of IPTC, literally none of this is possible. So, I'd like to take a moment to thank everyone involved in the organization, particularly everyone involved in all of the detailed work of the IPTC. And I'd like to acknowledge and thank Andreas Gebhard, who is stepping down from the Board, and Johan Lindgren who has been voted on.

Finally, I'd like to extend a special thanks to Michael Steidl, Managing Director of the IPTC, who is personally involved in almost every aspect of what we do.

2017

No doubt, next year will bring us many new and, often, unexpected challenges. I look forward to tackling them with all of you, the IPTC.

Thursday, October 6, 2016

Developers Needed For IPTC's EXTRA Rules-based Classification Engine

Over the last several months, I've been working within the IPTC - along with a number of other news organizations - on "EXTRA" (shorthand for EXTraction Rules Apparatus), an open-source rules-based classification engine for news content. I'm thrilled because this week we reached a significant milestone: we started the formal process of looking for developers to implement the EXTRA engine.
“Extra” by Jeremy Brooks https://flic.kr/p/4aKH3c
The IPTC was awarded a grant of €50,000 from Google's Digital News Initiative Innovation Fund to build and freely distribute the initial version of EXTRA. As part of the IPTC, we are working with several news providers to supply sets of news documents, and with linguists to write rules to classify the documents. We've been working on defining the technical requirements and now we’re looking for software developers to design, develop, document and test EXTRA.

Below is the formal announcement. If you know anyone who might be interested, let them know. And if you are interested, please let us know!

Developers Needed For IPTC's EXTRA Rules-based Classification Engine

IPTC https://iptc.org/ is looking for software developers to design, develop, document and test EXTRA https://iptc.github.io/extra/, an open source rules-based classification engine for news. First preference will be given to applications received by 21st October 2016, and review will continue until the positions are filled. Apply here.

"Classification" means assigning one or more categories to the text of a news document. Rules based classifiers use a set of Boolean rules, rather than machine-learning or statistical techniques, to determine which categories to apply.
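
To show what "a set of Boolean rules" can look like in practice, here is a toy sketch in Python. It is emphatically not the EXTRA engine or its rule language: the rule format (terms that must all appear, terms of which at least one must appear, and terms that must not appear), the category labels and the function names are all made up for illustration.

```python
import re

# Hypothetical rules, one per category. A document matches when it contains
# every "all" term, at least one "any" term, and no "none" term.
RULES = {
    "sport/soccer": {
        "all": {"match"},
        "any": {"goal", "striker", "penalty"},
        "none": {"election"},
    },
    "politics/election": {
        "all": {"election"},
        "any": {"vote", "ballot", "candidate"},
        "none": set(),
    },
}


def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify(text, rules=RULES):
    """Return the sorted list of categories whose rule the text satisfies."""
    words = tokenize(text)
    matched = []
    for category, rule in rules.items():
        has_all = rule["all"] <= words
        has_any = not rule["any"] or bool(rule["any"] & words)
        has_none = not (rule["none"] & words)
        if has_all and has_any and has_none:
            matched.append(category)
    return sorted(matched)
```

Unlike a statistical classifier, every decision here can be traced back to a specific rule, which makes it straightforward for an editor or linguist to see why a document was categorized as it was and to adjust the rule.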

EXTRA is the EXTraction Rules Apparatus, a multilingual open-source platform for rules-based classification of news content. IPTC was awarded a grant of €50,000 from the first round of Google’s Digital News Initiative Innovation Fund https://www.digitalnewsinitiative.com/ to build and freely distribute the initial version of EXTRA.


We are working with news providers to supply sets of news documents and with linguists to write rules to classify the documents. IPTC is looking for qualified developers to create the rules engine to accurately and efficiently categorize the documents using the rules. See the mandatory and preferred requirements.

Please consult this page for more information and to let us know if you’re interested in being considered.