Friday, 25 October 2013

CAP Theorem 2 - The Basic Tradeoff

WARNING - on further reading I'm not at all sure the below is accurate. Take it with a large pinch of salt as part of my ;earning experience...


You can't sacrifice Availability, so you have to choose between being Consistent and being Partition Tolerant. But only in the event of a network partition! You can be Partition Tolerant and still be Consistent when no partition is occurring.

Following up from my previous post on CAP Theorem, I'm going to discuss what in practical terms the CAP trade-off means.

A is non-negotiable - a truly CP data store is a broken idea

Remember, "Available" doesn't mean "working", "Available" means "doesn't hang indefinitely". In the event of a network partition a truly CP data store will simply hang on a request until it has heard from all its replicas. A system that hangs indefinitely is a chocolate tea pot. Poorly written clients will also hang indefinitely, ending up with users sitting staring at some equivalent of the Microsoft sand timer. In the end someone (a well written client, or just the poor schmuck staring at his non-responsive computer) will decide to time the operation out and give up, leaving them in the same not-working state as a CA system but with the additional worry that they've no idea what happened to the request they sent.

Hang on, there are CP data stores out there aren't there?

No, not really - not as I understand CAP theorem, anyway. See below!

The choice is between CA and AP

In fact it can be reduced to a very, very simple trade-off - in the event of a network partition, do I want the data store to continue to work or do I want the data store to remain consistent?

CA means a single point of failure

CA is the simplest model. It's what we get when we run up a single node ACID data store - it's either there, working and consistent or it isn't. There are ways to add a measure of redundancy to it in the form of read-only slaves with a distributed lock, but fundamentally if a network partition occurs between them and the master then the master has to stop accepting writes if it is to remain consistent with the slave.

It's a model that means outages are essentially guaranteed. If that's acceptable then it's nice and easy for developers to work with; but it's rarely acceptable.

Which leaves AP

Nearly all data stores used in scenarios where there is a desire to avoid outages entirely in so far as is possible (human error notwithstanding). Which means having multiple copies of state on machines connected by the network, which means network partitions can and will happen. Which means needing to be available and tolerant of those partitions.

Oh noes! No consistency! Sounds dreadful...

The important point to remember here is that the loss of Consistency implied by Partition Tolerance (i.e. Continuing to Work) only has to be accepted in the event of a partition. This is what lots of so-called "CP" systems are trying to do - remain consistent whilst the network is healthy, and only become inconsistent in the event of a partition.

Wednesday, 23 October 2013

CAP Theorem

WARNING - on further reading I'm not at all sure the below is accurate. Take it with a large pinch of salt as part of my learning experience...

I've been a bit confused over the meaning of the C, A and P of CAP theorem. I think I've got it sussed now, so this post is my attempt to encapsulate that knowledge and get it out there for someone to correct if I'm still wrong!

C - Consistent

This is the easy one - I think i've always understood this, though I'm sure there are nuances to it. If you write some data then, on anyone anywhere trying to read it, then so long as they do not get an error or someone else has independently updated it in the meantime then they will see the same data you wrote.

A - Available

Took me a while to get this one; I was thinking of it in terms of whether a system is up or not. That's not what Available means in this context. All it means is "able to return a response in a timely manner". That response could be as simple as a refusal to allow a new TCP connection - that's a response, and a timely one. An HTTP system returning 500 errors is available. If you're not timing out trying to communicate with the system, it's available, no matter how unhelpful the responses you are getting back are.

In contrast, a system is unavailable when a client gets nothing back at all and are left waiting until you timeout (you've got a timeout set up, right? Right?). Stick a Thread.sleep(Long.MAX_VALUE) in your HTTP handling code and your system is unavailable. Put a firewall in the way that quietly drops all response packets and you're unavailable.

P - Partition Tolerant

There's two aspects to this one. The first is the obvious - a network partition occurs so that two nodes in a cluster are unable to communicate without them getting a chance to sign off from each other beforehand. What was less obvious to me at first is that a node that crashes is an example of a partition - not, as I naively thought, an example of being unavailable. The other nodes in the cluster cannot distinguish between "crashed" and "network issue somewhere between us". 

A system is Partition Tolerant if it a) has more than one node and b) it can handle transactions without returning an error in the event that those nodes cannot communicate. 


It should be obvious that network partitions can always happen wherever a cluster exists with multiple nodes that hold their own copy of state and that need to communicate over a network in order to maintain consistent state. CAP theorem says that when that partition happens, one of C, A and P has to be sacrificed. And now it should be fairly clear why. When a client attempts to write to a cluster which is partitioned, that write will arrive at one side or the other of the partition, and the system will have to do one of three things:

Wait for the Partition to Heal (CP)

The simple solution to maintain consistency is for the node getting the write to wait, and not return a response until it knows that write has been committed on all nodes. Obviously this sacrifices availability - the partition may never heal, or not for a prohibitively long time. However, data will be consistent and no errors are returned, so we have a rather useless Partition Tolerance.

Discard the Write and return an Error (CA)

Option two is to return an error. Consistency is maintained by the simple expedient of not changing state at all. The system is available - it's returning errors in a timely manner. However it's not partition tolerant - indeed it's questionable whether there's any benefit over a single node data store. By having more than one node and a network connection the chances of failure are simply increased. A single node data store is CA - it's either there or not.

Accept the Write (AP)

The system is available and partition tolerant - no hanging, no error returned. The cost is that it is not consistent - the state either side of the partition is different, and someone reading from the other side of the partition will not see it. A dynamo style store with a read/write quota lower than half the nodes has sacrificed C in return for A and P. 

It's Not That Simple

Of course it isn't - the C, A and P qualities are not binary, they are a continuum, and data stores can make trade offs between them. A dynamo style store can choose to sacrifice some tolerance to a partition in return for more consistency by setting quora at a level of n/2 +1. A system could tolerate mild unavailability in the hope of the partition healing quickly. A store can vote up masters so that consistency is only sacrificed between partitioned halves, not sacrificed between all nodes. You get the idea.