February 26, 2008

The End of the Database As We Know It?

Filed under: Software Blog — marcstober @ 10:44 pm

I found this paper interesting as it presents an academic view of something I’ve been noticing from a grassroots perspective: the database as we know it is losing relevance, and new technologies will emerge that will make business applications faster to run and faster to develop.

Because of the graduate program I attended, one of the few computer science topics I know a lot about is relational database management systems (RDBMS), which underly most business applications today. In every job I’ve had, I’ve used an RDBMS, usually Microsoft SQL Server. Most developers don’t know fundamental RDBMS concepts; for example that you can’t just delete the log file when you run out of disk space, or what ACID properties are. As the paper points out, RDBMS’s are based on relational algebra and algorithms for managing disk files that haven’t changed since the 1970’s. Their use case is really a systems analyst entering SQL queries at a command prompt, which no one does any more. The fact that so few SQL (say “see-quell,” no one I know says “RDBMS”) users know these things is evidence of their irrelevance. No one decides SQL is the best type of database management system for their application; they choose it because it’s standard, you can find people and tools to support it, and does the job well enough.

I like one example the paper gives for a way to improve databases, which is to support a “constrained tree application” (CTA): an application that operates on a chunk of records from various tables that all relate back to one record from one table. For example, a customer and his various orders, order items, invoices, contacts, locations, etc. Every database I’ve worked on for the past 10 years, and for that matter almost every database everywhere, follows this pattern, but the RDBMS doesn’t provide any good features to support this. I’ve seen developers try to optimize this through various implementations, and spend a lot of time debating the right way to do it. We shouldn’t have to reinvent the wheel; the platform should take care of it. (This would also help minimize the differences between multitenant and single-tenant architecture, but that’s another debate.) I could come up with dozens of patterns that every application needs to reinvent when they store their data in a SQL database.

Actually, the RDBMS doesn’t have to disappear to be irrelevant, it can just be used wrongly, and this has already happened. An RDBMS is supposed to have a schema designed according rigorous rules of normalization, including constraints on and between tables to maintain data integrity. If you don’t do this—relying instead on a layer of code outside the database to enforce business rules—you have a set of indexed, multi-user tables, not a true relational database. Since SQL can’t enforce every business requirement anyway, it may make sense to build applications this way, and not try for a normalized data
model at all (although, as a classically-trained SQL developer, I’m not convinced this is the solution).

Right now I’m doing a lot of work in Intersystems’ Caché. While they call it a “post-relational” database, it has a lot of warts since it’s based on an even older, pre-relational, hierarchical database system developed in the 1960’s (not that much older than the first RDBMS, though) that’s been extended and evolved. Still, other object-oriented databases have come and gone over the years (sometimes under the buzzword “persistence”), and Caché stands as proof that the RDBMS is not the only feasible choice for business applications. There are newer alternatives emerging, at least to solve specific problems; for example, Microsoft’s Linq, or Ruby on Rails, which I’m hearing about everywhere now. If these technologies can do a better job solving problems, it could be a big change, indeed.