Two-phase commit (http://en.wikipedia.org/wiki/Two-phase_commit_protocol) is the most basic protocol for distributed transactions. Three-phase commit (http://en.wikipedia.org/wiki/Three-phase_commit_protocol) mainly addresses the problem of the coordinator crashing in Two-phase commit.
An implementation of the Two-phase commit algorithm (from "Distributed Systems: Principles and Paradigms"):
Coordinator:
write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
    wait for any incoming vote;
    if timeout {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
        exit;
    }
    record vote;
}
if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
    write GLOBAL_COMMIT to local log;
    multicast GLOBAL_COMMIT to all participants;
} else {
    write GLOBAL_ABORT to local log;
    multicast GLOBAL_ABORT to all participants;
}
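The coordinator flow above can be sketched in Python. This is only a minimal in-memory illustration, not a real implementation: the function name run_coordinator, the votes dictionary (where a missing entry stands in for a timeout), and the list used as the "local log" are all assumptions made for this example. Network multicast is left out entirely.

```python
from typing import Dict, List, Optional, Tuple


def run_coordinator(participants: List[str],
                    votes: Dict[str, Optional[str]],
                    coordinator_vote: str = "COMMIT") -> Tuple[List[str], str]:
    """Simulate the coordinator side of Two-phase commit.

    votes maps a participant name to "VOTE_COMMIT" or "VOTE_ABORT".
    A missing entry (or None) models a timeout while waiting for that vote.
    Returns the simulated local log and the global decision.
    """
    log = ["START_2PC"]          # hypothetical stand-in for the durable log
    collected = {}
    for name in participants:
        vote = votes.get(name)
        if vote is None:
            # Timeout while collecting votes: abort immediately.
            log.append("GLOBAL_ABORT")
            return log, "GLOBAL_ABORT"
        collected[name] = vote

    everyone_commits = all(v == "VOTE_COMMIT" for v in collected.values())
    if everyone_commits and coordinator_vote == "COMMIT":
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    log.append(decision)         # the decision is logged before it is sent out
    return log, decision
```

For example, run_coordinator(["p1", "p2"], {"p1": "VOTE_COMMIT"}) models p2 timing out and yields a GLOBAL_ABORT decision.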
Participant:
write INIT to local log;
wait for VOTE_REQUEST from coordinator;
if timeout {
    write VOTE_ABORT to local log;
    exit;
}
if participant votes COMMIT {
    write VOTE_COMMIT to local log;
    send VOTE_COMMIT to coordinator;
    wait for DECISION from coordinator;
    if timeout {
        multicast DECISION_REQUEST to other participants;
        wait until DECISION is received; /* remain blocked */
        write DECISION to local log;
    }
    if DECISION == GLOBAL_COMMIT
        write GLOBAL_COMMIT to local log;
    else if DECISION == GLOBAL_ABORT
        write GLOBAL_ABORT to local log;
} else {
    write VOTE_ABORT to local log;
    send VOTE_ABORT to coordinator;
}
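The participant flow can be sketched the same way. Again this is a simplified simulation under stated assumptions: run_participant and its parameters are hypothetical names, message arrival is modelled by plain arguments (None means a timeout), and the decision is written to the log once rather than twice as in the pseudocode.

```python
from typing import List, Optional, Tuple


def run_participant(got_vote_request: bool,
                    votes_commit: bool,
                    coordinator_decision: Optional[str],
                    peer_decision: Optional[str] = None) -> Tuple[List[str], str]:
    """Simulate the participant side of Two-phase commit.

    coordinator_decision is "GLOBAL_COMMIT", "GLOBAL_ABORT", or None when the
    participant times out waiting for the coordinator. peer_decision is what
    the other participants answer to a DECISION_REQUEST (None if nobody knows).
    Returns the simulated local log and "COMMITTED", "ABORTED" or "BLOCKED".
    """
    log = ["INIT"]
    if not got_vote_request:
        # No VOTE_REQUEST arrived in time: abort unilaterally.
        log.append("VOTE_ABORT")
        return log, "ABORTED"
    if not votes_commit:
        log.append("VOTE_ABORT")
        return log, "ABORTED"

    log.append("VOTE_COMMIT")    # from here on the participant cannot abort alone
    decision = coordinator_decision
    if decision is None:
        # Timeout: fall back to asking the other participants.
        decision = peer_decision
        if decision is None:
            return log, "BLOCKED"

    log.append(decision)
    return log, "COMMITTED" if decision == "GLOBAL_COMMIT" else "ABORTED"
```

The "BLOCKED" result corresponds to the "remain blocked" comment in the pseudocode: the participant has voted to commit, so it may neither commit nor abort until it learns the decision.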
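In addition, each participant runs a dedicated thread that handles DECISION_REQUEST messages from the other participants. The handler thread works as follows: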
while true {
    wait until any incoming DECISION_REQUEST is received;
    read most recently recorded STATE from the local log;
    if STATE == GLOBAL_COMMIT
        send GLOBAL_COMMIT to requesting participant;
    else if STATE == INIT or STATE == GLOBAL_ABORT
        send GLOBAL_ABORT to requesting participant;
    else
        skip; /* participant remains blocked */
}
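A small Python sketch of this handler makes the blocking case easy to see. The names answer_decision_request and ask_peers are made up for this illustration, and peers are represented only by the most recent state in their logs. The rule follows the pseudocode literally, so any state other than GLOBAL_COMMIT, INIT or GLOBAL_ABORT produces no answer.

```python
from typing import Iterable, Optional


def answer_decision_request(state: str) -> Optional[str]:
    """What a peer replies to a DECISION_REQUEST, given its latest logged state."""
    if state == "GLOBAL_COMMIT":
        return "GLOBAL_COMMIT"
    if state in ("INIT", "GLOBAL_ABORT"):
        # A peer still in INIT has not voted, so the transaction cannot have committed.
        return "GLOBAL_ABORT"
    return None  # e.g. VOTE_COMMIT: this peer is waiting too and stays silent


def ask_peers(peer_states: Iterable[str]) -> Optional[str]:
    """Return the first usable answer from the peers, or None if all are blocked."""
    for state in peer_states:
        answer = answer_decision_request(state)
        if answer is not None:
            return answer
    return None
```

With ask_peers(["VOTE_COMMIT", "VOTE_COMMIT"]) every peer is itself waiting for the decision, so the result is None and the requester stays blocked.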
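As the coordinator and participant flows above show, if the coordinator crashes after all participants have sent VOTE_COMMIT, no participant can decide the final outcome of the global transaction (GLOBAL_COMMIT or GLOBAL_ABORT) on its own, nor can it learn the outcome from the other participants, so the whole transaction stays blocked until the coordinator recovers. If the coordinator suffers a permanent failure such as a bad disk, the transaction becomes an orphan that is abandoned forever. There are several ways to approach the problem: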
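1. Back up the coordinator's persistent data periodically. This guards against permanent coordinator failure at the lowest cost and is unlikely to introduce bugs, but a transaction may stay blocked for a very long time. It suits systems such as banks, where correctness comes before everything else.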
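2. Three-phase commit. This is a theoretical approach that is complex to implement and inefficient. The idea is as follows. Assume that more than half of the participant machines never crash at the same time. If the coordinator crashes, the global outcome of the transaction has to be derived from the surviving majority of participants. Because the state of a crashed participant cannot be known, a new participant state, PRECOMMIT, is introduced: to complete a transaction successfully, a participant moves through INIT, READY and PRECOMMIT before finally reaching COMMIT. If at least one participant is in PRECOMMIT or COMMIT, the transaction commits. If at least one participant is in INIT or ABORT, the transaction aborts. If all live participants are in READY (with more than half of the participants alive), the transaction aborts; even if a crashed participant later recovers in the PRECOMMIT state, it rolls back because other participants are in ABORT. The PRECOMMIT state gives a crashed participant the chance to roll back, so Three-phase commit does not block as long as more than half of the participants are alive. Still, Three-phase commit is only a theoretical exploration: it is inefficient and does not solve the network partition problem.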
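3. Use Paxos to remove the coordinator as a single point of failure. Jim Gray and Lamport co-authored a paper on this approach, which fits the very large clusters of Internet companies well; Google's Megastore transactions are implemented this way. The catch is that neither Paxos nor Two-phase commit is simple, so the design and coding need a small, dependable team that writes high-quality code. A later blog post will cover this approach in detail.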
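The recovery rule described for Three-phase commit in item 2 can be written down as a short Python function. This is a sketch of the rule exactly as stated in this post, not a full Three-phase commit implementation; the name recover_3pc and its parameters are hypothetical, and the function assumes the protocol's own guarantee that PRECOMMIT or COMMIT never coexists with INIT or ABORT.

```python
from typing import List, Optional


def recover_3pc(live_states: List[str], total_participants: int) -> Optional[str]:
    """Decide a transaction after a coordinator crash, from live participants only.

    live_states holds the state (INIT, READY, PRECOMMIT, COMMIT or ABORT) of each
    participant that is still reachable. Returns "COMMIT", "ABORT", or None when
    no majority is alive and the rule therefore cannot be applied.
    """
    # The rule relies on more than half of the participants being alive.
    if len(live_states) * 2 <= total_participants:
        return None

    if any(s in ("PRECOMMIT", "COMMIT") for s in live_states):
        # Someone reached PRECOMMIT, so every participant had voted to commit.
        return "COMMIT"
    if any(s in ("INIT", "ABORT") for s in live_states):
        return "ABORT"
    # All live participants are in READY: abort. A crashed participant that
    # later recovers in PRECOMMIT rolls back, as described in item 2.
    return "ABORT"
```

For three participants, recover_3pc(["READY", "READY"], 3) aborts, while recover_3pc(["READY"], 3) returns None because a single survivor is not a majority.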
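In short, distributed transactions can only be a utopian ideal for system developers. Bringing in Two-phase commit forces transactions that span multiple machines to run completely serially; there is no such thing as a distributed transaction without cost.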