Database developers all know the ACID acronym. It says that database transactions should be:
- Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
- Consistent: A transaction cannot leave the database in an inconsistent state.
- Isolated: Transactions cannot interfere with each other.
- Durable: Completed transactions persist, even when servers restart, etc.
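To make "atomic" concrete, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table, the balance rule, and the transfer amount are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'"
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # abort the whole transfer
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except ValueError:
    pass  # atomicity: the first UPDATE was rolled back, neither row changed

print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 0}
```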
These qualities seem indispensable, and yet they are incompatible with availability and performance in very large systems. For example, suppose you run an online book store and you proudly display how many of each book you have in your inventory. Every time someone is in the process of buying a book, you lock part of the database until they finish so that all visitors around the world will see accurate inventory numbers. That works well if you run The Shop Around the Corner but not if you run Amazon.com.
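Here is a sketch of that pessimistic-locking approach, assuming a PostgreSQL database and the psycopg2 driver; the inventory table and the ISBN are invented for illustration:

```python
import psycopg2

conn = psycopg2.connect("dbname=bookstore")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Lock this book's row until the transaction commits. Every other
    # buyer of this book waits behind the lock, so counts stay accurate.
    cur.execute(
        "SELECT quantity FROM inventory WHERE isbn = %s FOR UPDATE",
        ("978-0132350884",),
    )
    (quantity,) = cur.fetchone()
    if quantity > 0:
        cur.execute(
            "UPDATE inventory SET quantity = quantity - 1 WHERE isbn = %s",
            ("978-0132350884",),
        )
# Fine for The Shop Around the Corner; at Amazon's scale the lock
# becomes a global bottleneck.
```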
Amazon might instead use cached data. Users would see not the inventory count at this second, but what it was, say, an hour ago when the last snapshot was taken. Also, Amazon might violate the “I” in ACID by tolerating a small probability that simultaneous transactions could interfere with each other. For example, two customers might both believe that they just purchased the last copy of a certain book. The company might risk having to apologize to one of the two customers (and maybe compensate them with a gift card) rather than slowing down their site and irritating myriad other customers.
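A minimal sketch of that idea, serving counts from a periodic snapshot instead of the live tables; the names and the one-hour interval are assumptions for illustration:

```python
import time

SNAPSHOT_TTL = 3600  # refresh at most once an hour, as in the example above
_snapshot = {"data": None, "taken_at": 0.0}

def inventory_count(isbn, load_snapshot):
    """Return the count from the last snapshot, refreshing it if it is
    older than SNAPSHOT_TTL. load_snapshot() stands for whatever expensive
    query builds the full {isbn: count} mapping from the database."""
    now = time.time()
    if now - _snapshot["taken_at"] > SNAPSHOT_TTL:
        _snapshot["data"] = load_snapshot()
        _snapshot["taken_at"] = now
    return _snapshot["data"].get(isbn, 0)
```

Readers get a possibly stale but cheap answer, and the live database only sees one heavy query per hour instead of one per page view.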
There is a computer science theorem that quantifies the inevitable trade-offs. Eric Brewer’s CAP theorem says that if you want consistency, availability, and partition tolerance, you have to settle for two out of the three. (For a distributed system, partition tolerance means the system keeps working even when the network splits and some nodes cannot reach others; a few nodes can fail and the system keeps going.)
An alternative to ACID is BASE:
- Basic Availability
- Soft-state
- Eventual consistency
Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state. (Accounting systems do this all the time. It’s called “closing out the books.”) It’s OK to use stale data, and it’s OK to give approximate answers.
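A toy sketch of eventual consistency, with two replicas that accept writes independently and later reconcile; everything here is invented for illustration:

```python
class Replica:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def write(self, key, value):
        version, _ = self.store.get(key, (0, None))
        self.store[key] = (version + 1, value)

    def read(self, key):
        return self.store.get(key, (0, None))[1]  # may be stale

    def reconcile(self, other):
        # Anti-entropy pass: both replicas keep the newest version of
        # each key, so reads agree again after it runs.
        for key in set(self.store) | set(other.store):
            newest = max(
                self.store.get(key, (0, None)),
                other.store.get(key, (0, None)),
                key=lambda fact: fact[0],
            )
            self.store[key] = other.store[key] = newest

a, b = Replica(), Replica()
a.write("inventory:book-x", 3)      # accepted locally on replica a
print(b.read("inventory:book-x"))   # None: replica b hasn't heard yet
a.reconcile(b)                      # "closing out the books"
print(b.read("inventory:book-x"))   # 3: the replicas have converged
```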
It’s harder to develop software in the fault-tolerant BASE world compared to the fastidious ACID world, but Brewer’s CAP theorem says you have no choice if you want to scale up. However, as Brewer points out in this presentation, there is a continuum between ACID and BASE. You can decide how close you want to be to one end of the continuum or the other according to your priorities.
One clean way to deal with some of these problems is through Resource Versioning (ETags):
http://code.google.com/apis/gdata/docs/2.0/reference.html#ResourceVersioning
The trick is this: the data you get might be stale, but because it is tagged with a version number, it is not inconsistent with anybody else’s version. If each of us got some data about the number of remaining books and our numbers disagree, then whoever has the latest version number knows he has the right value.
In this universe, data is immutable. That is, you get values as…
“inventory of book x at revision number y was z”
This is a fact. All servers will agree on it, no matter what. Compare with the simpler
“inventory of book x is z”
This can be inconsistent from server to server.
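A minimal sketch of this versioned, compare-and-set style of update, in the spirit of HTTP ETags and If-Match; the store and the error type are invented for illustration:

```python
class StaleVersionError(Exception):
    pass

class VersionedStore:
    """Each key maps to an append-only list of (version, value) facts,
    so 'value of key x at version y was z' is immutable once written."""

    def __init__(self):
        self.history = {}  # key -> [(version, value), ...]

    def get(self, key):
        # Return the latest fact: the "ETag" (version) along with the value.
        return self.history.setdefault(key, [(0, None)])[-1]

    def put(self, key, value, expected_version):
        # Conditional write, like HTTP's If-Match: succeed only if the
        # caller saw the latest version; otherwise its data was stale.
        facts = self.history.setdefault(key, [(0, None)])
        current, _ = facts[-1]
        if expected_version != current:
            raise StaleVersionError(f"latest is {current}, you have {expected_version}")
        facts.append((current + 1, value))
        return current + 1

store = VersionedStore()
version, _ = store.get("inventory:book-x")
store.put("inventory:book-x", 5, expected_version=version)  # succeeds
# A second writer still holding the old version is rejected rather than
# silently overwriting: it must re-read and retry.
```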
I think that trying to build scalable systems without ETags is crazy.
Of course, one central authority must grant these version numbers. Someone, somewhere, must have the definitive say on the number of books remaining. But with a physical resource like books, that’s unavoidable, isn’t it?
Moreover, you can partition the authority. For example, one server is in charge of all of the books starting with the letter A. (Better yet, use consistent hashing: http://michaelnielsen.org/blog/?p=613)
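Here is a minimal sketch of the consistent-hashing idea from that link; the hash function choice, the replica count, and the server names are assumptions:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, replicas=100):
        # Place several virtual points per server on a ring to even out load.
        self.ring = sorted(
            (self._hash(f"{server}:{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def server_for(self, key):
        # The first server clockwise from the key's position owns it.
        i = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[i][1]

ring = HashRing(["server-a", "server-b", "server-c"])
print(ring.server_for("book:978-0132350884"))
```

Adding or removing one server only moves about 1/n of the keys to a new owner, whereas with the one-server-per-letter scheme any change reshuffles everything.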
BTW I don’t think it happens often that Amazon sells the same copy of a book twice. If these types of errors do happen due to database errors, I’d like to know about it.
It’s interesting how today there is a lot of debate about the comparative advantages of functional programming vs. mutable-state systems (especially in distributed or multi-threaded environments), and there is also the long-standing object vs. relational impedance-mismatch problem. Yet it seems like not a lot of ideas have permeated the mostly arbitrary boundary between “programming” and “databases” (I suppose this is a testament to the success of the relational database model up to this point). We don’t even have very good terminology to talk about non-relational databases, and there doesn’t seem to be much discussion about functional programming ideas (especially immutable data) applied to databases. I think we’re on the right track with versioned records and some of the distributed source control systems out there. CouchDB, Bigtable, Git, Mercurial, etc., are all good experiments that I think are steps in the right direction. But you can see how limited and immature these ideas are compared to the richness and maturity of the relational database ecosystem.
Ultimately I think the peak of the relational database era is on the horizon (in the next 5-10 years) and we’ll see “databases” founded on different models rise and become dominant (due to the fundamental scalability problems of relational data more than anything). Nevertheless, I don’t think anyone today has yet found the right model/feature set that will form the core of the SQL-killer data systems of the future.
I’d like to point out that this problem was solved 30+ years ago. Airlines need ACID and run really high-volume systems. The big difference is that they’re not using SQL databases. AFAIK, attempts to duplicate the SABRE system using a SQL database failed.
(Besides SQL databases and object databases, there are hierarchical databases. MUMPS, which is widely used in the healthcare industry, uses a hierarchical database in its default implementation, I believe.)
fwiw, ACID’s nothing to do with DBMSs. Come to that, relational’s nothing to do with ACID. ACID’s a behavioural model that has semantics that can be understood by programmers, and it arose, at least partly, from the observation that programmers try too hard to recover their code from errors and usually second-guess wrongly.
The relational model arose to remove the typing constraints of CODASYL and the poor a priori understanding of how data elements relate to each other.
As Mike hints, OLTP systems can now comfortably scale to millions of concurrent online users. At that scale, outages that bring down the whole system are a pain – they’re still mostly caused by human error, whether programmer or operator – and the partitioning benefits of BASE kick in.
What can you tell me about COBOL and databases of sorts? The world’s most dependable systems (such as banking systems) still rely on mainframes, basically because they do their work very well.
Is it fair to bring these into the comparison between traditional ACID DBMSs and NoSQL BASE databases?
@Wedge I suppose you have not heard about Datomic yet, because it exists in exactly the functional/immutable space you are talking about. You can e.g. look up Rich Hickey’s talk “The Database as a Value” where he explains some of the ideas behind it.