July 23rd, 2009

NoSQL and the Relational Model: don’t throw the baby out with the bathwater

There’s been a lot of buzz of late about the trend away from SQL and towards distributed databases; it seems the current crop of Relational Databases are increasingly being rejected in certain use cases, sometimes quite fairly.

It would be easy to come away from this criticism with the impression that the Relational Model is equally antiquated, not the right model, and no longer worth developing a full understanding of when thinking about data.

This (pardon me) is wrong, but first let’s summarize why CouchDB and companions are taking over some of the turf of popular RDBMSes when it comes to scaling out.

Problems scaling with RDBMSes

Scaling often requires you to drop down from the abstracted level of a normalised SQL schema and heavily optimise the physical representation of data. Not just how it’s indexed, but which data lives where, what kind of derived data gets cached and how/where/when/by whom, how those cached copies are kept up to date and in sync, what the dataflow is like, how it’s prioritised etc.

Predictably messy consequences ensue when you try to do this lower-level optimisation of the physical storage and dataflow of your particular application at scale, on top of a higher-level, general-purpose tool (RDBMS with SQL).

When an RDBMS tries to anticipate and provide for all the kinds of storage and dataflow optimisations that one might need scale a particular application, you end up with a gigantic, monolithic, tricky-to-configure, tricky-to-understand monster (*cough* Oracle) that ultimately fails as a silver bullet; And so often people will prefer to work with tools which are lower-level and “closer to the metal”, but more focused and well-understood. Hence CouchDB, hence MapReduce, etc etc.

There’s also the fact that current RDBMSes tend to be built around stronger transactional models which aren’t compatible with some of the fault-tolerant distributed consensus cleverness that’s recently in vogue.

Why these are not problems with the Relational Model itself

So, RDBMSes try to abstract the logical model of data away from the physical storage model (a noble goal!). Sure, their implementation of these abstractions often breaks down at scale. Sure, lower-level database tools may often be more appropriate in certain scalability situations.

But, understanding and being able to think clearly about the logical model behind your data independently of the physical storage model will always be valuable. And the relational model was developed through a lot of research to be one of the best formal tools for doing this reasoning. It’s a reformulation of first-order logic (one of the simplest, most elegant and fundamental formal systems in mathematics) into the language of data. In no way is it tied to SQL-based database technology.

So my key point is, this kind of modelling is WORTH DOING, regardless of which database tool you end up using for physical storage. Whether your logical schema is a highly-normalised SQL schema in an RDBMS, or a normalised schema diagram on paper (in your head even) with a distributed key-value store being used to do the actual work with denormalised representations. Being able to reason clearly about that logical schema and the way it’s represented physically, is what will save your arse when it comes to data integrity in a complex system.

Now the technology behind some of these distributed key-value stores is still relatively young; what I hope is that, in time, people will develop good techniques and tools for applying relational reasoning to what’s being done on top of these simpler key-value storage models and distributed replication models. There’s no reason why not—the relational model is the language of data, it’s not tied to the transactional model used by current RDBMSes.

A footnote about ORMs and misconceptions created by their awkwardness

One misunderstanding about relational databases which I’ve read a couple of times, is that SQL and the relational model are somehow ‘lower level’ than a key-value store or OODB. I think this impression is formed because Object databases are a closer match for the semantics of OO languages, whereas dealing with an RDBMS via an ORM feels like having to work on top of some awkwardly-fitting, perhaps-lower-level API.

In fact, the truth is the other way round—the relational model is a higher-level language for talking about data. The problem is that there’s a disconnect between its semantics and those of OO languages, whose data model is inherently lower-level, closer to the physical storage model (full of issues like “which objects need to maintain references to which other objects in order to keep a track of this relationship so it can be traversed efficiently?”). In fact the relational model is a much closer match in its semantics for languages like Prolog; the famous “impendance mismatch” of Object-Relational Mapping is indicative of deeper issues in programming language semantics, which I plan to post about again at some point.

One Response to “NoSQL and the Relational Model: don’t throw the baby out with the bathwater”

  1. SQL is Dead. Long Live SQL. : Dataspora Blog Says: