Friday, August 22, 2014

JSON Design Principles and Lessons Learnt: JSON Style (Part Three of Three)

Lessons Learnt from JSON Designs I've Worked On

Over the last couple of years, I've worked on a few JSON schemas. For example, IPTC's NINJS (for representing news) and the W3C ODRL Community Group's ODRL in JSON (for representing permissions and restrictions). I've also done some work on JSON internal to AP, for various APIs and search systems.

Along the way, I've learnt some lessons about better or worse ways to design the JSON - both about the way to do it and some JSON "style" tips. I've broken this into three posts: an approach to designing JSON, handy JSON tools and standards, and (this post) JSON style.

JSON's built-in datatypes - number, string, Boolean, array, object and null - are a natural fit for many programming languages (particularly scripting / high level languages such as JavaScript, Ruby and Python). A large part of the attraction of JSON over XML, therefore, is that it is often easier to deal with data expressed in JSON in those languages.

Crucially, however, how the JSON is structured makes a big difference. The simpler the JSON structures are, the easier it is to write correct code to deal with them. This is part of why it is so important to prototype some code against a structure you are considering: it gives you a much better sense of how easy that structure is to work with. This principle (Simpler JSON Means Simpler Code) underlies, or is counter-balanced by, most of the principles that follow.

Choose Wisely: Once You Commit to a Structure, You Can't Change It (Easily)

Once you declare a given property as having a particular structure, you can't change it (easily). For example, let's say that you want to represent a date in your JSON. You decide that you will call it "arrivalDate" and that you will make this a number (which means it is easy to do date arithmetic, for example):

"arrivalDate" : 20140515

Later, however, you realize that you need to indicate in which timezone this date occurs. Sadly, you can't just tack on a timezone offset indicator (such as "+0500" or "-03") since that would make arrivalDate a string and any existing JSON documents that conform to your first definition would now be invalid.
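
To make the breakage concrete, here is a minimal Python sketch. The function arrival_year is a hypothetical consumer written against the first definition (a number in YYYYMMDD form), and the string form "20140515+0500" is just one way the offset might be tacked on; neither comes from a real system.

```python
import json

def arrival_year(doc):
    # Hypothetical consumer: assumes arrivalDate is a number such as 20140515.
    return doc["arrivalDate"] // 10000

old_doc = json.loads('{"arrivalDate": 20140515}')
new_doc = json.loads('{"arrivalDate": "20140515+0500"}')

# Integer arithmetic works on the original number form.
year = arrival_year(old_doc)

# The same code fails once the property becomes a string.
try:
    arrival_year(new_doc)
    broke = False
except TypeError:
    broke = True

print(year, broke)
```

Every consumer that does arithmetic on the value would need changing, which is why the type of a property is effectively frozen once documents are in the wild.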

It is annoying to have an array if there's only ever one item

In XML, you might decide you want to represent a headline like this:

<headline>Dog Bites Man</headline>

Since a story can only ever have one headline, your XML schema confidently states that there must be exactly one headline. Later, you realize that there are multiple kinds of headline. Since XML was designed with eXtensibility in mind (it is the "X" in XML), you can alter your XML schema to allow for multiple headline elements:

<headline>Man Bites Dog</headline>
<headline>This is news!</headline>

The good news? Your old XML documents with a single headline are still valid, according to your new schema.

However, once you have a JSON document like this:

"headline" : "Dog Bites Man"

It is incompatible with a document like this, because a JSON property can't be both a string and an array of strings:

"headline" : ["Man Bites Dog", "This is news!"]

So, you might be tempted to construct your JSON defensively - to "future proof" it - by making your properties into arrays, so that you can easily have more than one. However, it becomes really tedious to have to access an array of things when there is only ever one item. (Remember: simpler JSON means simpler code).

Avoid making things arrays, "just in case".

Use the headline / headlines "cheat"


One technique I've used when having to switch from single instances to array properties is to name my properties carefully. When a property is an array, I make it a plural. This leaves open the possibility of having a singular property - such as "headline" - and then later adding an array property with the plural name - "headlines".
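
A minimal Python sketch of how a consumer might cope with both generations of document. The function name get_headlines and the choice to prefer the plural property are my assumptions for illustration, not part of any standard.

```python
import json

def get_headlines(doc):
    # Prefer the newer plural array property when it is present.
    if "headlines" in doc:
        return list(doc["headlines"])
    # Fall back to the original singular string property.
    if "headline" in doc:
        return [doc["headline"]]
    return []

old_doc = json.loads('{"headline": "Dog Bites Man"}')
new_doc = json.loads(
    '{"headline": "Man Bites Dog",'
    ' "headlines": ["Man Bites Dog", "This is news!"]}'
)

print(get_headlines(old_doc))
print(get_headlines(new_doc))
```

Old documents stay valid, old consumers keep reading "headline", and only code that cares about multiple headlines needs to know about "headlines".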

Flatter is better

In XML, it is natural to have multi-level structures. When XML is pretty-printed, the indentation of enclosing elements helps to make the document structure easier to grasp. It is tempting to do the same when designing your JSON representation. However, in the spirit of "Simpler JSON Means Simpler Code", it is much easier to deal with your JSON if there is an absolute minimum of "grouping" structures in it.

In the early drafts of IPTC's NINJS, we initially grouped different types of metadata together (into administrative metadata, descriptive metadata and so on). This distinction is a useful one and is still reflected in the "data model" diagram for NINJS:

Figure: NINJS Data Model (http://dev.iptc.org/ninjs)

However, once we started to create examples, we realized that it was much better to lose those groups in the actual JSON markup. In particular, it is difficult to query JSON using complex criteria. In part, this is due to a lack of standards for how to specify JSON queries. So, the less structure the better.
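
A small Python sketch of the difference, assuming two hypothetical layouts of the same item. The group names "administrative" and "descriptive" echo the data model groups mentioned above, but the exact property placement is invented for illustration.

```python
# Hypothetical grouped layout, in the style of the early drafts.
nested = {
    "administrative": {"urgency": 2},
    "descriptive": {"language": "en"},
}

# The same information with the grouping removed.
flat = {"urgency": 2, "language": "en"}

def matches_nested(item):
    # Each criterion has to name its group and guard against the group being absent.
    return (
        item.get("administrative", {}).get("urgency") == 2
        and item.get("descriptive", {}).get("language") == "en"
    )

def matches_flat(item):
    # With no groups, each criterion is a single lookup.
    return item.get("urgency") == 2 and item.get("language") == "en"

print(matches_nested(nested), matches_flat(flat))
```

The two functions answer the same question, but the nested version forces every query author to remember which group each property lives in.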


Use pattern properties to strike a balance between flexibility and interoperability

Most of the JSON formats I've worked on are designed to be used by multiple systems (whether internally to AP or as a standard to be used by many publishers). To help ensure that multiple implementations wind up using your JSON in compatible ways, you want to restrict the degrees of freedom to interpret the format in different ways. On the other hand, you need to allow for future requirements so that your JSON formats will be adopted and adapted, rather than discarded as being too inflexible.

It is pretty easy to add something unknown into a JSON format - just add another property. Implementations should implement the "must ignore" pattern so that they don't break when they encounter an unknown property. However, sometimes, you want to guide future changes to the format, so that certain predictable changes are all done in the same way.

I've found it useful in JSON Schema to use "pattern properties" to strike this balance. For example, in NINJS, we wanted to allow publishers to use whichever geographic geometry JSON they wanted (to represent lat/long shapes for centroids and the like). We therefore added a patternProperty of "geometry_*" to allow publishers to use a property with a name starting with "geometry_" and then a suffix to indicate which type of geometry they are using.
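
To show the effect without pulling in a validator, here is a minimal Python sketch that only mimics the name-matching part of pattern properties. The regular expression "^geometry_.+" is my assumed reading of "geometry_*", and the KNOWN set and sample property names are hypothetical.

```python
import re

KNOWN = {"uri", "headline"}             # hypothetical fixed properties
GEOMETRY = re.compile(r"^geometry_.+")  # assumed regex form of "geometry_*"

def classify(doc):
    # Sort property names into fixed, pattern-matched and unrecognised.
    result = {"known": [], "geometry": [], "unknown": []}
    for name in doc:
        if name in KNOWN:
            result["known"].append(name)
        elif GEOMETRY.match(name):
            result["geometry"].append(name)
        else:
            result["unknown"].append(name)
    return result

doc = {
    "uri": "http://example.com/item/1",
    "geometry_geojson": {"type": "Point", "coordinates": [-73.99, 40.73]},
    "geomtry_wkt": "POINT(-73.99 40.73)",  # deliberate typo in the prefix
}

print(classify(doc))
```

Publishers get room to pick their own suffix, while a misspelt prefix still stands out as unrecognised.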

Start property labels with a lower case letter

Whilst developing IPTC's NINJS, we experimented with various libraries, to make sure what we were producing would work everywhere. One gotcha we tripped across: the Java Jackson library chokes on JSON properties with an initial uppercase letter. It was the only library we discovered that had this problem, but why take the risk?

Use a very restricted set of characters when naming properties

Since it is very common to autogenerate code from JSON property names (such as Java or .Net classes), it makes sense to restrict the character set you use. In order to maximize compatibility across libraries and languages, we determined that this is the safest set of characters to use:

[a-zA-Z_0-9]

i.e. upper and lower case alphas, numbers and underscore.
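
A quick way to enforce both naming rules (lower case first letter, restricted character set) is a small check over your sample documents. This Python sketch is one possible way to do it; the function name unsafe_names and the sample property names are hypothetical.

```python
import re

# A lower case first letter, then only characters from [a-zA-Z_0-9].
SAFE_NAME = re.compile(r"[a-z][a-zA-Z_0-9]*")

def unsafe_names(doc):
    # Collect property names that break the rules, descending into nested objects.
    bad = []
    for name, value in doc.items():
        if not SAFE_NAME.fullmatch(name):
            bad.append(name)
        if isinstance(value, dict):
            bad.extend(unsafe_names(value))
    return bad

sample = {
    "headline": "Dog Bites Man",
    "Byline": "A. Reporter",   # initial uppercase letter
    "alt-id": 1,               # hyphen is outside the safe set
    "geometry_geojson": {"type": "Point"},
}

print(unsafe_names(sample))
```

Note that this sketch also rejects names starting with a digit or underscore, which is stricter than the character set alone requires.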

Inline text markup alternatives

One of the areas where both XML and HTML excel is the rich markup of text, particularly via inline markup. (For a couple of examples using IPTC standards, check out this example in NITF and this one using Schema.org-compatible rNews).

There are different possible ways to tackle rich text markup in JSON. Three alternatives we identified within NINJS are:

  1. Strip out all the inline markup and just leave the plain, unmarked text. Probably fine for short bits of text but tedious as soon as you have any structure - such as paragraphs, never mind hyperlinks. Could be useful for things like indexing in a full text search engine, though.
  2. Keep the marked up text (such as HTML) in a string, escaping as necessary. Particularly good for delivering bits of web-ready text that can be integrated into a larger page.
  3. Mechanically translate the original markup into JSON structures. JSONML is a nicely-documented example of this approach. However, given that I advise against this mechanical approach in the first place, I would be very careful before adopting something like JSONML for your text markup - for all the same reasons.

Which method you pick needs to be driven by your particular requirements. It isn't a bad idea to consider having the text represented more than one way, though, if you can afford it.
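
As a sketch of carrying the text more than one way, this Python example keeps the HTML in one string property (alternative 2) and derives a plain text version from it (alternative 1). The property names body_html and body_plain are placeholders I chose for illustration, not names taken from a published schema.

```python
import json
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    # Collects character data and discards the tags themselves.
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def both_representations(html_text):
    parser = TextOnly()
    parser.feed(html_text)
    parser.close()
    return {
        "body_html": html_text,                # marked up text kept in a string
        "body_plain": "".join(parser.parts),   # inline markup stripped out
    }

HTML_TEXT = '<p>Dog <a href="http://example.com">bites</a> man.</p>'
item = both_representations(HTML_TEXT)

# json.dumps takes care of escaping the quotes inside the HTML string.
print(json.dumps(item))
```

The cost is a larger document; the benefit is that a search indexer and a web page template can each pick the form they need.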

Support APIs with a full / partial representation indicator

JSON has gained traction as a format for use in APIs. Performance is a key factor in most APIs, so you may well want to deliver an API result in JSON with just a key subset of properties. In that case, you should consider adding a property that indicates whether this is a full or partial representation of the given resource - ideally along with a property that lets you retrieve the entire representation.

When we came up with this idea whilst designing NINJS, we toyed with having a way to describe more than just two possibilities (full/partial). However, we decided that - in the general case - those are the only two that matter.
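
A minimal Python sketch of the idea, assuming a hypothetical indicator property called "representation" with the values "full" and "partial", and a "uri" property that a client can follow to fetch the whole item. None of these names are taken from a published schema.

```python
FULL_ITEM = {
    "uri": "http://example.com/items/1",
    "headline": "Dog Bites Man",
    "byline": "A. Reporter",
    "body": "A dog bit a man on Friday.",
}

def partial_view(item, keep=("uri", "headline")):
    # Deliver only the key subset and say so explicitly.
    doc = {name: item[name] for name in keep if name in item}
    doc["representation"] = "partial"
    return doc

def full_view(item):
    # Deliver everything, flagged as complete.
    doc = dict(item)
    doc["representation"] = "full"
    return doc

summary = partial_view(FULL_ITEM)
print(summary)
```

A client that sees "partial" knows that a missing "body" means "not sent", not "does not exist", and can follow "uri" when it needs the rest.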

JSON Design: A Series


This is the third and final posting in my series on JSON Design. Part one discussed an approach to designing JSON schema. Part two discussed JSON tools and standards.

Friday, May 9, 2014

JSON Design Principles and Lessons Learnt: Handy JSON Tools and Standards (Part Two of Three)

A Couple of Handy JSON Tools

For basic syntax checking of your JSON documents, JSONLint is invaluable. Alternatives include JSON Formatter and Validator (online) and demjson (Python). Most of the XML tools (such as XML Spy or oXygen) also support JSON.
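
If you just want a quick syntax check from a script, the json module in Python's standard library will also do the job. This is a minimal sketch; the function name check_syntax and the message format are my own choices.

```python
import json

def check_syntax(text):
    # Returns None if the text parses as JSON, otherwise a short error description.
    try:
        json.loads(text)
        return None
    except json.JSONDecodeError as err:
        return "line %d, column %d: %s" % (err.lineno, err.colno, err.msg)

good = '{"headline": "Dog Bites Man"}'
bad = '{"headline": "Dog Bites Man",}'  # trailing comma is not valid JSON

print(check_syntax(good))
print(check_syntax(bad))
```

This only tells you whether the document is well-formed JSON; checking it against your intended structure is a job for a schema.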

Even though JSON is touted as being a lightweight alternative to XML, equivalents of many of the features of XML are gradually being added to JSON. One that I make extensive use of is JSON Schema. This is an IETF effort. Even though - at the time of writing - JSON Schema is still a draft, it already has decent software support - including online validation and support in many languages. Having a JSON schema for your format is a great way to document how you intend the format to be used and it can help you spot certain kinds of errors. (My blog post Ban Unknown Properties! discusses some of the finer points of JSON validation).

Selecting and Querying JSON

One of the fundamentals of XML (and related standards including XSLT and XQuery) is XPath. So, imagine my excitement when I discovered JSONPath, which has the tag line "XPath for JSON". It holds out the promise of a language-independent way to specify properties within a given JSON document. Very handy - and there are a couple of language bindings already. Unfortunately, it only seems to work for fairly simple expressions - it certainly doesn't have the full power of XPath. And it isn't backed by a standards body or a consortium of companies, so the future path (sic) of JSONPath isn't clear to me.

Perhaps more promising is JSONiq, a fully fledged query language for JSON, which claims to be "The SQL of NoSQL". In fact, JSONiq is based very much on XQuery. Again, this is not backed by an independent standards body. It has been implemented on top of some XQuery engines (28.io, zorba.io, IBM's WebSphere and Pascal). However, notably, the major JSON-native engines are not directly supporting it, which means you need to use their proprietary query languages.

And it seems that there is a bit of a Cambrian Explosion going on in this area. Tim Bray recently published his blog post Fat JSON. In part, he illustrates why you need a tool to pick out properties from within a JSON document (basically, some JSON objects contain far more properties than you need for a particular purpose). He discusses one approach - support Partial Responses in your API. That works if you're the author of the API but more likely you're the client of an API or are dealing with a complete JSON document from MongoDB or Elasticsearch or the like.

He points out several attempts to recreate XPath for JSON, which are similar to JSONPath (none of which I have tried yet, but which are all imaginatively called "[jJ][Pp]ath"):


Not to be outdone, Mr. Bray has knocked together JWalk - some Java source code to very simply pick out properties based on their names alone (i.e. not based on parent names or child property values as you would want from a more full-fat XPath style library). I suspect that this won't be the last attempt to solve this problem.
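
To illustrate what "based on their names alone" means, here is a Python sketch of the general technique. It is not a port of JWalk and makes no claim about how JWalk is implemented; the function name find_by_name and the sample document are hypothetical.

```python
def find_by_name(node, name):
    # Walk the whole document and collect every value stored under `name`,
    # regardless of where in the structure it appears.
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                found.append(value)
            found.extend(find_by_name(value, name))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_by_name(item, name))
    return found

doc = {
    "headline": "Dog Bites Man",
    "associations": [
        {"headline": "Related story", "uri": "http://example.com/2"},
    ],
}

print(find_by_name(doc, "headline"))
```

The limitation is visible straight away: there is no way to ask for only the top level headline, or only headlines of associated items, which is where an XPath style language earns its keep.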


JSON Standards

As is probably obvious by now, I'm a big fan of standards. Not just because I've helped to create a few (e.g. MDDL, NewsML-G2, hNews, rNews, RightsML, NINJS) but also because - whenever I'm faced with solving a problem - I think "surely someone has done this before me?". I've found that looking at how someone else has attempted to tackle some domain is very instructive. In the best case, you can simply adopt someone else's hard work, along with documentation, working code and a thriving community who will help to quickly bring you up to speed. Of course, not all prior work is great - the compromises required to create a consensus standard are notorious for producing unwieldy solutions. But, even then, it can be instructive to help you understand what you don't want to do.

Whilst developing IPTC's News in JSON (NINJS), for example, we looked at previous efforts - both public and proprietary - to render articles, blog posts, photos and video using JSON properties. We also researched particular areas that are not directly tied to news. For example, when we were figuring out how to represent place metadata, we found it really helpful to examine the different approaches taken by GeoJSON and Geonames, amongst others. (In the end, rather than pick a winner, we decided to add a "pattern property" into NINJS so that providers could select the JSON geometry representation that best fits their needs).

A somewhat different type of JSON-related standard is JSON-LD. JSON Linked Data is a way to serialize the RDF data model to and from the JSON format. This W3C Recommendation is an increasingly popular way to structure JSON and is equivalent to the XML and Turtle serializations of RDF. So, if you are fundamentally working with RDF, then you should consider it (however, there are at least some JSON-LD dissenters). If you are not working with the RDF data model, then I would consider whether the additional features / complexity of JSON-LD are going to be a barrier to adoption.

As I will discuss in the third and final post in this series, one goal I prize when designing a JSON schema is that simple examples make sense "intuitively". I want them to look sufficiently appealing to, say, a Ruby developer that she decides to use that schema rather than make one up herself.

JSON Design: A Series

Part one discussed an approach to designing JSON schema. Part three will discuss JSON style.

Monday, May 5, 2014

JSON Design Principles and Lessons Learnt: An Approach to Designing JSON (Part One of Three)

Automagic JSON?

One way to create a JSON schema is to automatically generate one from an XML Schema. For any given domain, there's probably a decent XML Schema available, so why not take advantage of that and use one of the many tools that are available to automatically generate the JSON for you?

In fact, there are quite a few different ways you can translate between XML and JSON, depending on what you're trying to achieve. Therefore, each tool can potentially generate quite different JSON for a given XML document. For a good overview of the different approaches and techniques involved, I recommend this survey of ways to map between XML and JSON. (That PDF is IBM's submission to the W3C Workshop on Data and Services Integration).
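
A small Python sketch of how two mapping conventions can produce different JSON from the same XML. Both conventions are invented for illustration (the "@" and "#text" markers are one common style, not the output of any particular tool), and the sample XML is hypothetical.

```python
import json
import xml.etree.ElementTree as ET

XML = '<story lang="en"><headline>Dog Bites Man</headline></story>'

def compact(elem):
    # Convention A: drop attributes and map leaf elements straight to strings.
    # (Repeated child elements would overwrite each other in this simple sketch.)
    if len(elem) == 0:
        return elem.text
    return {child.tag: compact(child) for child in elem}

def verbose(elem):
    # Convention B: keep attributes under "@name" and element text under "#text".
    node = {"@" + key: value for key, value in elem.attrib.items()}
    if elem.text and elem.text.strip():
        node["#text"] = elem.text
    for child in elem:
        node[child.tag] = verbose(child)
    return node

root = ET.fromstring(XML)
json_a = {root.tag: compact(root)}
json_b = {root.tag: verbose(root)}

print(json.dumps(json_a))
print(json.dumps(json_b))
```

Convention A is pleasant to read but loses the lang attribute; convention B is lossless but leaves XML-isms in every property name.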

If you have a large amount of XML you want to convert into JSON, you may well need to implement your own tool to do the conversion. Not only does this let you control the choices made, it can also give you the opportunity to fix the niggling issues that inevitably arise in your XML as you extend your design in unexpected ways.

However, I recommend that you hand craft the design of your JSON representation, to make it as natural as possible.

A JSON Design Process

I've found that a good way to design a JSON schema is to follow this simple process:

  • Identify a list of candidate properties - perhaps by reviewing relevant XML schema for inspiration
  • Think of one or two ways to represent each set of related properties in JSON - and research whether anyone else has designed something like it already
  • Construct sample JSON documents for each of the alternatives
  • Prototype some code to see how they work for your intended use
  • Select the best alternative and add it to your schema
  • Write down the examples and your rationale for picking that representation (otherwise you will forget)
  • Repeat

After a while, you'll see some repeating patterns and you'll need to write fewer prototypes to try things out. But I still recommend writing down your rationale...

Trying out the JSON in code is particularly important if you haven't done a lot of JSON work before. It really gives you a feel for the best, most natural way to work with JSON and can help get you out of your XML Mindset (if that's where you're starting from).

JSON Design: A Series

Part two will discuss JSON tools and standards.

Friday, February 7, 2014

Ban Unknown Properties!

In XML, it is common to define a schema, in part to help with validation. This means that you can take an instance document, which is meant to conform to that XML schema, and test whether it really does, using a validation engine. Such testing can be very useful to catch otherwise hard-to-spot errors - like misspelling an attribute name or getting the order of elements wrong.
JSON didn't originally have a schema language. However, IETF are developing one. When the IPTC created the standard for News in JSON, we decided to define a schema for NINJS, so that you can check whether your JSON objects conform to the standard.
Currently, in the NINJS schema, we have set "additionalProperties" to false. That's because we want to make it possible to validate a JSON document using that schema. If we didn't set additionalProperties=false, then any property name - whether it is a deliberate provider extension or an inadvertent typo - will pass validation.
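
To show what that setting buys you, here is a Python sketch that models only the "additionalProperties" keyword. It is a hand-rolled illustration, not a JSON Schema validator, and the two-property schema is a hypothetical stand-in for the real NINJS schema.

```python
SCHEMA = {
    "type": "object",
    "properties": {
        "uri": {"type": "string"},
        "headline": {"type": "string"},
    },
    "additionalProperties": False,
}

def unexpected_properties(doc, schema):
    # If additional properties are allowed (the default), nothing is flagged.
    if schema.get("additionalProperties", True) is not False:
        return []
    allowed = set(schema.get("properties", {}))
    return [name for name in doc if name not in allowed]

doc = {"uri": "http://example.com/items/1", "headine": "Dog Bites Man"}  # typo

print(unexpected_properties(doc, SCHEMA))
```

With the keyword left at its default, the misspelt "headine" would pass silently; with it set to false, the typo is reported, but so is any deliberate provider extension.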
On the other hand, this means that a provider who wants to add their own properties to the NINJS schema has to create their own local copy. We've explained how to do that on the dev website (http://dev.iptc.org/ninjs-How-To-create-provider-specific-extensions). But it isn't totally obvious that this is what you're meant to do. And it is certainly different from other schemas that the IPTC has developed using XML (such as NewsML-G2), which instead have extensibility built in from the start. However, with IETF JSON Schema v4 (the latest) that's the only choice we have (http://json-schema.org/latest/json-schema-validation.html#anchor64).
In the v5 version of JSON Schema, they are introducing "ban unknown properties mode" (https://github.com/json-schema/json-schema/wiki/ban-unknown-properties-mode-%28v5-proposal%29). Rather than being something that is built into the schema (like additionalProperties), "ban unknown properties" is a directive to the validation engine. The v5 JSON Schema is due to be finalized Real Soon Now. And at least one JSON Schema validator supports it already (https://github.com/geraintluff/tv4).
This seems to be a much neater solution to the validation problem. And, frankly, without it, the JSON schema isn't nearly as powerful. So, I'd like to see IPTC adopt this for the NINJS schema. And, if you're working on a JSON schema, I recommend you look at it, too.