System design reading: Spanner

Ryan Huang
5 min read · Jan 1, 2021

Paper: http://nil.csail.mit.edu/6.824/2020/papers/spanner.pdf

Spanner is a globally (multi-region, multi-datacenter) distributed database. It offers high availability and strong consistency, which is a rare combination in modern database design.

A traditional database, e.g. MySQL, offers ACID transactions. ACID is a strongly consistent programming model. However, an ACID database struggles to scale out beyond tens of terabytes of data. Traditional databases often use primary/secondary replication, which is hard to scale to multiple datacenters, let alone multiple regions. If one machine or one datacenter goes down, the database is unavailable until the problem is repaired. In recent years, many NoSQL databases have been designed to solve the problems of scalability and availability, but most of them do not offer strong consistency. It’s commonly believed that availability and consistency are incompatible, a trade-off famously captured by the “CAP theorem”. I think Spanner is trying to change this convention and prove that a highly available database can be strongly consistent at the same time. Spanner guarantees the strongest form of consistency, namely external consistency (also known as strict serializability).

First of all, why do we need consistency? Why can’t developers happily live in an eventually consistent world? Most application developers come from a history of using ACID databases. ACID is the gold standard of databases: it guarantees our program will work and saves us the trouble of worrying about how the database executes transactions. But when we enter the eventually consistent world, we write X to a database and read Y back from it. Then we have to spend a lot of effort dealing with this inconsistent behavior. That’s why we miss the old ACID world.

Example system architecture

Let’s look at a concrete example of sending a payment between two accounts. Suppose we have two clients (C1, C2), and account balances stored in two database partitions, X and Y.

Initial account balances: {X=20; Y=10}. C1 creates transaction T1 at 10ms, in which it sends $10 from X to Y (W(X, 10) means write 10 to X). Then C1 reads the account balances in transaction T2 at 20ms (R(X) means read X). At 30ms, C2 creates transaction T3 to send $20 from Y to X. If the system is strongly consistent, C1 should see {X=10; Y=20} from T2, and {X=30; Y=0} as the result of T3.
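To make the example concrete, here is a minimal single-node sketch in Go (the Store type and transfer helper are hypothetical, not Spanner’s API). Executed serially, it produces exactly the strongly consistent schedule described above.

```go
package main

import "fmt"

// Store is a toy in-memory key-value store holding account balances.
type Store map[string]int

// transfer moves amount from src to dst: read both balances, then write both.
func transfer(s Store, src, dst string, amount int) {
	s[src] -= amount // W(src, old - amount)
	s[dst] += amount // W(dst, old + amount)
}

func main() {
	s := Store{"X": 20, "Y": 10} // initial balances

	transfer(s, "X", "Y", 10)   // T1 @10ms: send $10 from X to Y
	fmt.Println(s["X"], s["Y"]) // T2 @20ms: prints 10 20
	transfer(s, "Y", "X", 20)   // T3 @30ms: send $20 from Y to X
	fmt.Println(s["X"], s["Y"]) // prints 30 0
}
```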

Strongly Consistent (Serialized) Transaction Execution

If the system is not serializable (not consistent), T2 could interleave with T1 and T3; as a result, C1 reads uncommitted data {X=10, Y=0}.

Inconsistent Transactions

We can use standard multi-version concurrency control (MVCC) to achieve strong consistency. In an MVCC database, we store multiple versions of the data; for example, we can use the transaction start time as the version number.

Consistent Transaction with MVCC

As shown above, each transaction is associated with a version number (T1@10, T2@20, T3@30). Even though T2:R(Y) is executed after T3, T2:R(Y) can only read versions up to @20, which corresponds to the start time of T2; it cannot read version @30. As a result, C1 reads {X=10, Y=20} from T2.
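Here is a minimal sketch of that idea in Go, assuming a toy in-memory store (MVCCStore, Write, and Read are hypothetical names, not Spanner’s API): every write appends a new version tagged with a timestamp, and a read at snapshot timestamp ts returns the latest version no newer than ts.

```go
package main

import "fmt"

// version is one timestamped value of a key in a toy MVCC store.
type version struct {
	ts    int // transaction timestamp used as version number
	value int
}

// MVCCStore keeps every version of every key, in increasing timestamp order.
type MVCCStore map[string][]version

// Write appends a new version of key at timestamp ts.
func (s MVCCStore) Write(key string, ts, value int) {
	s[key] = append(s[key], version{ts, value})
}

// Read returns the latest version of key with timestamp <= ts
// (zero if no such version exists).
func (s MVCCStore) Read(key string, ts int) int {
	var v int
	for _, ver := range s[key] {
		if ver.ts <= ts {
			v = ver.value
		}
	}
	return v
}

func main() {
	s := MVCCStore{}
	s.Write("X", 0, 20) // initial balances @0
	s.Write("Y", 0, 10)

	s.Write("X", 10, 10) // T1@10: transfer $10 from X to Y
	s.Write("Y", 10, 20)

	s.Write("X", 30, 30) // T3@30: transfer $20 from Y to X
	s.Write("Y", 30, 0)

	// T2@20: even though T3 has already committed, the snapshot at @20 never sees it.
	fmt.Println(s.Read("X", 20), s.Read("Y", 20)) // prints 10 20
}
```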

Timestamp-based MVCC is correct only if we have perfectly synchronized clocks on all machines. That assumption is known to be impractical: clock drift and uncertainty on physical machines are a fundamental problem in distributed systems.

Clock drift: T2 reads stale data

C1 issues transaction T2 at wall-clock time 20ms, but the transaction coordinator of T2 could have a clock that has drifted back to 5ms. As a result, T2 is assigned version @5 and reads stale data: even though C1 transferred $10 from X to Y in T1, it cannot see the balance change in T2. Due to clock uncertainty, transactions become inconsistent.

How does Spanner guarantee strong consistency in such a scenario? The secret sauce is “TrueTime”. TrueTime is a clock API that gives a tight bound on clock uncertainty. As mentioned in the paper’s conclusion:

As the underlying system enforces tighter bounds on clock uncertainty, the overhead of the stronger semantics decreases. As a community, we should no longer depend on loosely synchronized clocks and weak time APIs in designing distributed algorithms.

TrueTime offers an API, TT.now(), which returns a time interval [earliest_time, latest_time]. When a transaction coordinator calls TT.now(), it does not know exactly what the time is, but it knows the true global wall-clock time lies within the interval and cannot be later than latest_time.
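Here is a toy model in Go of what such an API could look like. The Interval and TrueTime types and the fixed Epsilon are assumptions for illustration only; real TrueTime derives its uncertainty bound from GPS and atomic clock references in each datacenter.

```go
package main

import (
	"fmt"
	"time"
)

// Interval is the [Earliest, Latest] bound returned by a TrueTime-style clock.
// The true wall-clock time is guaranteed to lie inside the interval.
type Interval struct {
	Earliest time.Time
	Latest   time.Time
}

// TrueTime is a toy stand-in for TT.now(): it widens the local clock reading
// by an assumed worst-case uncertainty bound Epsilon.
type TrueTime struct {
	Epsilon time.Duration
}

// Now returns an interval that, by assumption, contains the true current time.
func (tt TrueTime) Now() Interval {
	t := time.Now()
	return Interval{Earliest: t.Add(-tt.Epsilon), Latest: t.Add(tt.Epsilon)}
}

func main() {
	tt := TrueTime{Epsilon: 4 * time.Millisecond} // hypothetical bound
	iv := tt.Now()
	fmt.Println("earliest:", iv.Earliest, "latest:", iv.Latest)
}
```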

In our example, the transaction coordinator (TC) of T1 uses TT.now().latest_time as the timestamp of T1; call it s1. Spanner does not commit the transaction until it is certain that s1 has passed. How can it make sure that s1 has passed? By periodically calling the TrueTime API and waiting until TT.now().earliest_time > s1 (the commit-wait rule, Section 4.1.2 of the Spanner paper). When C1 gets the response that T1 has committed, it is guaranteed that s1 has passed. C1 then issues T2. The TC of T2 uses TT.now().latest_time as the timestamp of T2; call it s2. Since s2 > s1 is guaranteed, T2 is guaranteed to read the latest update from T1.
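A minimal sketch of the commit-wait rule, using the same idea of an uncertainty interval (the epsilon constant and helper names are hypothetical, not Spanner code): pick s as TT.now().latest, then spin until TT.now().earliest has passed s before acknowledging the commit.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed fixed clock uncertainty bound; real TrueTime computes this dynamically.
const epsilon = 4 * time.Millisecond

// now returns a toy TrueTime interval [earliest, latest] around the local clock.
func now() (earliest, latest time.Time) {
	t := time.Now()
	return t.Add(-epsilon), t.Add(epsilon)
}

// commitWait picks the commit timestamp s = TT.now().latest, then blocks
// until TT.now().earliest has passed s, so any transaction started after the
// commit reply is guaranteed to get a larger timestamp.
func commitWait() time.Time {
	_, s := now() // commit timestamp s1
	for {
		earliest, _ := now()
		if earliest.After(s) {
			return s // safe to report "committed" to the client
		}
		time.Sleep(time.Millisecond)
	}
}

func main() {
	s1 := commitWait()
	fmt.Println("T1 committed at", s1)

	// A transaction issued after the commit reply, e.g. T2, picks
	// s2 = TT.now().latest, which must be greater than s1.
	_, s2 := now()
	fmt.Println("s2 > s1:", s2.After(s1)) // prints true
}
```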

Strong consistency: using TrueTime timestamps (s1, s2) as version numbers

Spanner uses MVCC + TrueTime to implement read-only transactions, which achieves high availability and strong consistency. For read/write transactions, it uses standard two-phase locking to guarantee strong consistency, since MVCC alone only guarantees snapshot isolation for R/W transactions. What’s the difference between strong consistency and snapshot isolation? See this blog for details.
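Below is a rough sketch of the locking side, assuming a conservative variant of two-phase locking that grabs every lock up front and releases them only after the transaction body finishes (the lockTable and runRW names are hypothetical; the paper describes locks held at each partition’s Paxos leader and wound-wait for deadlock avoidance, both omitted here).

```go
package main

import (
	"fmt"
	"sync"
)

// lockTable holds one mutex per key, a toy stand-in for per-key locks
// maintained by a partition leader.
type lockTable struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newLockTable() *lockTable {
	return &lockTable{locks: map[string]*sync.Mutex{}}
}

// lockOf lazily creates and returns the mutex guarding key.
func (lt *lockTable) lockOf(key string) *sync.Mutex {
	lt.mu.Lock()
	defer lt.mu.Unlock()
	if lt.locks[key] == nil {
		lt.locks[key] = &sync.Mutex{}
	}
	return lt.locks[key]
}

// runRW sketches two-phase locking: acquire all locks (growing phase),
// run the transaction body, then release them (shrinking phase).
// Real systems also need deadlock avoidance, e.g. ordered locking or wound-wait.
func runRW(lt *lockTable, keys []string, body func()) {
	for _, k := range keys {
		lt.lockOf(k).Lock()
	}
	body() // all reads and writes happen while every lock is held
	for _, k := range keys {
		lt.lockOf(k).Unlock()
	}
}

func main() {
	lt := newLockTable()
	balances := map[string]int{"X": 20, "Y": 10}
	runRW(lt, []string{"X", "Y"}, func() { // T1: transfer $10 from X to Y
		balances["X"] -= 10
		balances["Y"] += 10
	})
	fmt.Println(balances) // prints map[X:10 Y:20]
}
```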

Two-phase locking is known to suffer from performance and availability problems. Spanner uses a clever design that layers two-phase locking on top of Paxos groups. Paxos guarantees the availability of each partition, which reduces system downtime due to machine failures. However, two-phase locking is still slow when a transaction spans a large number of partitions. As shown in Table 4 of the Spanner paper, R/W transaction latency ranges from 17ms to 100ms.

In conclusion, Spanner offers strong consistency and high availability in a globally distributed system. This is very hard to achieve and offers significant benefits to application developers, who no longer need to worry about eventual consistency problems. However, R/W transaction latency in Spanner can reach 100ms, especially when data is distributed across multiple regions, so application developers should use it with care. Spanner offers many insights into implementing strongly consistent systems; it turns out that a bounded clock uncertainty is extremely useful for building them.
