NoSQL and the Relational Model: don’t throw the baby out with the bathwater
There’s been a lot of buzz of late about the trend away from SQL and towards distributed databases; it seems the current crop of Relational Databases are increasingly being rejected in certain use cases, sometimes quite fairly.
It would be easy to come away from this criticism with the impression that the Relational Model is equally antiquated, not the right model, and no longer worth developing a full understanding of when thinking about data.
This (pardon me) is wrong, but first let’s summarize why CouchDB and companions are taking over some of the turf of popular RDBMSes when it comes to scaling out.
Problems scaling with RDBMSes
Scaling often requires you to drop down from the abstracted level of a normalised SQL schema and heavily optimise the physical representation of data. Not just how it’s indexed, but which data lives where, what kind of derived data gets cached and how/where/when/by whom, how those cached copies are kept up to date and in sync, what the dataflow is like, how it’s prioritised etc.
Predictably messy consequences ensue when you try to do this lower-level optimisation of the physical storage and dataflow of your particular application at scale, on top of a higher-level, general-purpose tool (RDBMS with SQL).
When an RDBMS tries to anticipate and provide for all the kinds of storage and dataflow optimisations that one might need scale a particular application, you end up with a gigantic, monolithic, tricky-to-configure, tricky-to-understand monster (*cough* Oracle) that ultimately fails as a silver bullet; And so often people will prefer to work with tools which are lower-level and “closer to the metal”, but more focused and well-understood. Hence CouchDB, hence MapReduce, etc etc.
There’s also the fact that current RDBMSes tend to be built around stronger transactional models which aren’t compatible with some of the fault-tolerant distributed consensus cleverness that’s recently in vogue.
Why these are not problems with the Relational Model itself
So, RDBMSes try to abstract the logical model of data away from the physical storage model (a noble goal!). Sure, their implementation of these abstractions often breaks down at scale. Sure, lower-level database tools may often be more appropriate in certain scalability situations.
But, understanding and being able to think clearly about the logical model behind your data independently of the physical storage model will always be valuable. And the relational model was developed through a lot of research to be one of the best formal tools for doing this reasoning. It’s a reformulation of first-order logic (one of the simplest, most elegant and fundamental formal systems in mathematics) into the language of data. In no way is it tied to SQL-based database technology.
So my key point is, this kind of modelling is WORTH DOING, regardless of which database tool you end up using for physical storage. Whether your logical schema is a highly-normalised SQL schema in an RDBMS, or a normalised schema diagram on paper (in your head even) with a distributed key-value store being used to do the actual work with denormalised representations. Being able to reason clearly about that logical schema and the way it’s represented physically, is what will save your arse when it comes to data integrity in a complex system.
Now the technology behind some of these distributed key-value stores is still relatively young; what I hope is that, in time, people will develop good techniques and tools for applying relational reasoning to what’s being done on top of these simpler key-value storage models and distributed replication models. There’s no reason why not—the relational model is the language of data, it’s not tied to the transactional model used by current RDBMSes.
A footnote about ORMs and misconceptions created by their awkwardness
One misunderstanding about relational databases which I’ve read a couple of times, is that SQL and the relational model are somehow ‘lower level’ than a key-value store or OODB. I think this impression is formed because Object databases are a closer match for the semantics of OO languages, whereas dealing with an RDBMS via an ORM feels like having to work on top of some awkwardly-fitting, perhaps-lower-level API.
In fact, the truth is the other way round—the relational model is a higher-level language for talking about data. The problem is that there’s a disconnect between its semantics and those of OO languages, whose data model is inherently lower-level, closer to the physical storage model (full of issues like “which objects need to maintain references to which other objects in order to keep a track of this relationship so it can be traversed efficiently?”). In fact the relational model is a much closer match in its semantics for languages like Prolog; the famous “impendance mismatch” of Object-Relational Mapping is indicative of deeper issues in programming language semantics, which I plan to post about again at some point.
November 25th, 2009 at 11:58 am
[...] others have noted, some balance is in [...]
April 6th, 2010 at 5:40 pm
You are right on. I’ve developed an as yet unreleased relational modeling API (called Monarch) that allows you to build up relations with algebraic operations (select, join, project, etc). There’s a server side component in Ruby and a client-side component in JavaScript. You can actually construct a relation on the client and send it to the server, where it is re-interpreted in terms of a set of user-specific views that sandbox the queries to a permitted subset of the database. The tree of algebraic operations can of course be compiled to SQL and run against a relational db (as has already been done by Arel and SQL Alchemy), but you can also do other things with it. For example, you can subscribe to a relation to receive updates on its contents from the browser. I’m basically recapitulating the concept of a materialized view at the application level. It’s a great model for subscribing to real time updates on arbitrary slices of the domain. Anyway, my thought is that by capturing the relational model at the application level, it may be possible to leverage it in a manner less monolithic than just sending SQL to a relational database. I have this idea that I could use the event propagation system to maintain my own materialized views that are stored in Cassandra. By making the application itself aware of the relational model, perhaps physical data layout and access patterns can be better controlled. Some relations can be maintained automatically. Others may need programmer intervention that explicates ad hoc optimizations. Etc. But bringing the theoretical underpinnings closer to the programming language means the language can be leveraged in making these choices. Just dipping my toe in the water of working with relations has unveiled a daunting array of theoretical challenges. For instance, if I do maintain a tree of interdependent materialized views, how do I detect and exploit any redundancy shared between the relations that define them? My research quickly leads me into heavy math and lots of papers behind academic pay walls. But the research exists and is ongoing… why limit the knowledge to the innards of proprietary databases? Anyway, drop me a line and I’d love to add you as a collaborator on the repository if you’re interested.
April 6th, 2010 at 6:03 pm
Nathan – that sounds like a good PhD thesis topic
might be worth trying to address some of the sub-problems with a reasonable degree of rigour before attempting to solve them all at once though?
Eg one big concern with that approach would be the scope for DoS vulnerabilities if remote users are allowed to construct arbitrary queries. You’d need a way to sandbox things in terms of computational resources, not just in terms of authorisation for the logical model.
And I guess the general problem is that of the ‘leaky abstraction’. It’s not just the physical data layout that needs to be optimised beneath the relational abstraction, it’s dynamic policies relating to dataflow, indexing, cache expiry policies, other trade-offs between being consistent and up-to-date and being scalable, especially in distributed scenarios. If you could find a clean formalism for expressing these kinds of policies and trade-offs and optimisations for a logical model that’s expressed relationally, that would be a big part of the puzzle. Probably it has been done already to some extent, but like you say, buried in old papers behind pay-walls and a Really Big Problem to take all that research and turn it into tools that engineers can understand and respect and use.