July 23rd, 2009

NoSQL and the Relational Model: don’t throw the baby out with the bathwater

There’s been a lot of buzz of late about the trend away from SQL and towards distributed databases; it seems the current crop of Relational Databases are increasingly being rejected in certain use cases, sometimes quite fairly.

It would be easy to come away from this criticism with the impression that the Relational Model is equally antiquated, not the right model, and no longer worth developing a full understanding of when thinking about data.

This (pardon me) is wrong, but first let’s summarize why CouchDB and companions are taking over some of the turf of popular RDBMSes when it comes to scaling out.

Problems scaling with RDBMSes

Scaling often requires you to drop down from the abstracted level of a normalised SQL schema and heavily optimise the physical representation of data. Not just how it’s indexed, but which data lives where, what kind of derived data gets cached and how/where/when/by whom, how those cached copies are kept up to date and in sync, what the dataflow is like, how it’s prioritised etc.

Predictably messy consequences ensue when you try to do this lower-level optimisation of the physical storage and dataflow of your particular application at scale, on top of a higher-level, general-purpose tool (RDBMS with SQL).

When an RDBMS tries to anticipate and provide for all the kinds of storage and dataflow optimisations that one might need scale a particular application, you end up with a gigantic, monolithic, tricky-to-configure, tricky-to-understand monster (*cough* Oracle) that ultimately fails as a silver bullet; And so often people will prefer to work with tools which are lower-level and “closer to the metal”, but more focused and well-understood. Hence CouchDB, hence MapReduce, etc etc.

There’s also the fact that current RDBMSes tend to be built around stronger transactional models which aren’t compatible with some of the fault-tolerant distributed consensus cleverness that’s recently in vogue.

Why these are not problems with the Relational Model itself

So, RDBMSes try to abstract the logical model of data away from the physical storage model (a noble goal!). Sure, their implementation of these abstractions often breaks down at scale. Sure, lower-level database tools may often be more appropriate in certain scalability situations.

But, understanding and being able to think clearly about the logical model behind your data independently of the physical storage model will always be valuable. And the relational model was developed through a lot of research to be one of the best formal tools for doing this reasoning. It’s a reformulation of first-order logic (one of the simplest, most elegant and fundamental formal systems in mathematics) into the language of data. In no way is it tied to SQL-based database technology.

So my key point is, this kind of modelling is WORTH DOING, regardless of which database tool you end up using for physical storage. Whether your logical schema is a highly-normalised SQL schema in an RDBMS, or a normalised schema diagram on paper (in your head even) with a distributed key-value store being used to do the actual work with denormalised representations. Being able to reason clearly about that logical schema and the way it’s represented physically, is what will save your arse when it comes to data integrity in a complex system.

Now the technology behind some of these distributed key-value stores is still relatively young; what I hope is that, in time, people will develop good techniques and tools for applying relational reasoning to what’s being done on top of these simpler key-value storage models and distributed replication models. There’s no reason why not—the relational model is the language of data, it’s not tied to the transactional model used by current RDBMSes.

A footnote about ORMs and misconceptions created by their awkwardness

One misunderstanding about relational databases which I’ve read a couple of times, is that SQL and the relational model are somehow ‘lower level’ than a key-value store or OODB. I think this impression is formed because Object databases are a closer match for the semantics of OO languages, whereas dealing with an RDBMS via an ORM feels like having to work on top of some awkwardly-fitting, perhaps-lower-level API.

In fact, the truth is the other way round—the relational model is a higher-level language for talking about data. The problem is that there’s a disconnect between its semantics and those of OO languages, whose data model is inherently lower-level, closer to the physical storage model (full of issues like “which objects need to maintain references to which other objects in order to keep a track of this relationship so it can be traversed efficiently?”). In fact the relational model is a much closer match in its semantics for languages like Prolog; the famous “impendance mismatch” of Object-Relational Mapping is indicative of deeper issues in programming language semantics, which I plan to post about again at some point.

April 25th, 2009

dojo.Deferred is a Monad

If you’ve heard of Monads, you may well view them as a rather exotic artefact of embedding an imperative programming style into a pure functional language like Haskell with a complex type system. In fact, monad-like abstractions are often found “in the wild” in untyped, imperative, dynamic languages. They’re useful as an abstraction for chaining together chunks of code, when the language doesn’t provide native control flow constructs for the desired style of chaining.

If you haven’t heard of them, one of the many notorious Monad tutorials might be a good start. What follows is an attempt to provide additional insight into Deferred from a monadic perspective, and a look at how and why these abstract beasts crop up in languages like Javascript.

Monads in imperative-land

One simple example of this is the sequencing of operations each of which may result in a null, so that if a null is returned at any point along the chain of operations, evaluation stops and the null is returned as the overall result.

Most languages lack a built-in sequencing operator for this (foo && foo.bar && foo.bar.baz doesn’t count), but libraries like Object#andand for Ruby make it easier. This is (more-or-less) a Ruby implementation of the Maybe monad from Haskell, as The Maybe monad in Ruby explains.

The example I wanted to address is more interesting though:

Chaining asynchronous operations

If you’ve written much Javascript, you’ll have felt the pain here. Chaining together asynchronous operations (like AJAX calls) can get messy, with a spaghetti of little ad-hoc callback functions being passed around, and a concern which should be orthogonal to the application logic (whether a particular operation completes synchronously or asynchronously), instead infecting the interfaces you write (and the code that calls them, and the code that calls that code, and someone else’s code which deals with callbacks differently) with all sorts of special cases.

In a language with continuations (or co-routines, or generators, which are weaker but still adequate), you can use these to turn an asynchronous operation into a blocking operation, and chain these together using native control flow constructs—resulting in a simpler, linear flow of code1.

Javascript (and many other languages) lack these features though1, leaving you to chain async operations more explicitly using the other abstractions the language provides. And it really helps to have a consistent, nicely-abstracted approach when doing this, which keeps the chaining mechanism as orthogonal as possible from the application logic.

Most Javascript code fails on this front. The resulting mess sometimes resembles a not-very-rigorous attempt at continuation-passing style (which is actually a technique used by some compilers to target a language without native continuations, but not a particularly nice thing to have to attempt by hand). Sometimes you see variations on the Observer pattern being used to similar effect, sometimes the code copies (or hijacks) the DOM’s event callback interface.

Either way, the code appears to cry out for a simple abstraction which tidily encapsulates “this is a value which will become available asynchronously”, allowing such values to be passed around, composed and sequenced in a consistent fashion.

And that’s where dojo.Deferred steps in. (Credit where credit’s due, this particular formulation seems to originate with the excellent Twisted project and twisted.internet.defer.Deferred, which I’ve used on previous project to implement event-loop-driven chat servers. Similar concepts have existed for a long time in languages like Scheme however).

It takes a while to ‘get’ the use of Deferred, but it’s worth doing if you do Javascript. I’ll leave any further advocacy to the docs though, and instead focus on pointing out some interesting parallels between the Deferred abstraction, and (as promised) the notion of a Monad from functional programming. This results in a new perspective on an existing idea, and some resulting lessons on how the semantics of Deferred might be gently tweaked to be a bit cleaner.

dojo.Deferred as a Monad

So what do we need to do to define a Monad? Wikipedia says:

Formally, a monad is constructed by defining two operations bind and return and a type constructor M that must fulfill several properties

Let’s get the definitions out of the way. Now Javascript doesn’t have a (static) type system, and so we won’t be able to define a ‘type constructor’ per se. Luckily a Javascript constructor function, dojo.Deferred will do instead as our M. (Contrary to Wikipedia, a Monad doesn’t actually require a type system2)

Here’s return, whose job is to wrap a normal value inside the Monad. (More concretely, in our case, to create a Deferred which returns the given value right away).

function monadReturn(value) {
  var d = new dojo.Deferred;
  d.callback(value);
  return d;
}

And here’s bind, whose job is to sequence two operations (more concretely in our case, to pass a Deferred value to a function which returns another Deferred, and get back (you guessed it) a Deferred representing the result:

function bind(deferred, func) {
  return deferred.addCallback(func);
}

For those familiar with Deferred, this takes advantage of two things: the capability of a callback on a Deferred to return a value, changing what gets passed to subsequent callbacks; and the sequencing it does when this returned value is itself a Deferred.

The case where ‘addCallback’ is given a callback that returns a plain value, rather than a Deferred, corresponds to the map or fmap function from Monad lore. The job of map is to apply a function to a value inside the Monad wrapper (more concretely in our case, to take a Deferred value, and return another Deferred where the function has been applied to the result of the first).

It has the same implementation as bind, because of some dynamically-typed special-casing on the return value inside dojo.Deferred:

function map(deferred, func) {
  return deferred.addCallback(func);
}

Note that both these implementations have the unfortunate side-effect of modifying the original Deferred in-place, which isn’t really in keeping with the semantics of a monad from functional programming. So long as you throw away the original Deferred, though, you can safely ignore this.

Lessons for dojo.Deferred

Looking at this from a monadic, FP perspective, we’ve found an uglinesses in dojo.Deferred—the way in which you’re encouraged to use the return value of a callback to modify the return value of a Deferred in-place. Sometimes this is OK, but sometimes you have an API which exposes a Deferred value to various different observers. One observer can inadvertently change the value which the other observers see, and observers lack an easy mechanism to apply a function to the value within the deferred without having such a side-effect.

My solution is this: separate out the ability of observers to register callbacks on a Deferred (addCallback), from the chaining/mapping functionality (bind and map from above). These are separate concerns and mixing them up into the same function causes problems. Secondly, don’t allow callbacks to modify the results of a Deferred; bind and map are now sufficient to address the use cases for this. Finally, for conceptual clarity, consider having separate methods for bind and map, rather than combining them using a special dynamic check on the return type.

I suspect Deferred’s code could be a bit simpler and more concise if it was refactored from this point of view. That said you can work this way with the current implementation by doing two things: firstly, never using addCallback with a return value from the callback. Secondly, using these ‘map’ and ‘bind’ methods, or something like them, to create a new Deferred whenever you want to apply a function to a value inside an existing Deferred:

// A 'map' operation for Deferreds.
// Returns a new Deferred which yields the supplied function
// applied to the result of the original Deferred.
dojo.Deferred.prototype.map = function() {
  var func = dojo.hitch.apply(dojo, arguments);
  var deferred = new dojo.Deferred;
  function callback(result) {deferred.callback(result)};
  function errback(error) {deferred.errback(error)};
 
  this.addCallbacks(function(result) {
    try {result = func(result)}
    catch(error) {errback(error)}
    callback(result);
  }, errback);
 
  return deferred;
}
 
// A 'bind' operation for Deferreds.
// Returns a new Deferred, whose result comes from
// passing the result of the original deferred to the
// supplied function, and waiting for the result of the
// deferred which that function returns.
dojo.Deferred.prototype.bind = function() {
  var func = dojo.hitch.apply(dojo, arguments);
  var deferred = new dojo.Deferred;
  function callback(result) {deferred.callback(result)};
  function errback(error) {deferred.errback(error)};
 
  this.addCallbacks(function(firstResult) {
    try { firstResult = func(firstResult) }
    catch(error) {errback(error)}
    firstResult.addCallbacks(callback, errback);
  }, errback);
 
  return deferred;
}

(The handling of error conditions here reminds me to mention that dojo.Deferred also handles the chaining of error conditions in a nice way which is loosely analogous to the Error monad in Haskell. One might be able to construct a monad similar to Deferred in Haskell using monad transformers, perhaps some combination with ContT and ErrorT ontop of IO, but I’ll leave that to the experts :) )


1 Narrative JS adds a capability like this to Javascript, but requires a compilation step. The point of this article was more to explore the patterns involved in doing the sequencing without such native features.

2 Technically M just needs to be a Functor, not a type constructor necessarily. The term Functor comes from category theory; in the case of a statically typed language, it’s taken that ‘Functor’ means ‘Endofunctor on the category of static types of the language’. You can view a dynamically typed language like Javascript as having a static type system with just one ‘Everything’ type. The category in question is then a one-object category, or more simply a monoid whose elements are functions in that language and whose operation is function composition. A Functor is then an endomorphism on this monoid, a mapping from javascript functions to javascript functions, which respects function composition. In our case, one which allows a general javascript function to be turned into a function that applies specifically to Deferred values, such that the functions behave ‘as you would expect’ on Deferreds in the same way they apply in general. Our map function for Deferreds does essentially this, give or take currying of arguments and javascript method style. There are also conditions to be met (and explained) on the bind and return to qualify as a monad, which I’ll (ahem) leave to the reader to translate into the language of Deferreds and verify.