Two-phase commit (http://en.wikipedia.org/wiki/Two-phase_commit_protocol) is the most basic protocol for distributed transactions. Three-phase commit (http://en.wikipedia.org/wiki/Three-phase_commit_protocol) mainly addresses the problem of coordinator failure in two-phase commit.
 The two-phase commit algorithm (from "Distributed Systems: Principles and Paradigms"):
 Coordinator:
     write START_2PC to local log;
     multicast VOTE_REQUEST to all participants;
     while not all votes have been collected {
         wait for any incoming vote;
         if timeout {
             write GLOBAL_ABORT to local log;
             multicast GLOBAL_ABORT to all participants;
             exit;
         }
         record vote;
     }
     if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
         write GLOBAL_COMMIT to local log;
         multicast GLOBAL_COMMIT to all participants;
     } else {
         write GLOBAL_ABORT to local log;
         multicast GLOBAL_ABORT to all participants;
     }
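
 The coordinator's decision rule above can be sketched in Python. This is only a minimal illustration, not the book's implementation: networking and timeouts are left out, the in-memory list stands in for the local log, and the names `Vote` and `coordinator_decide` are made up for this example. A vote that never arrived is treated the same way as the timeout branch in the pseudocode.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "VOTE_COMMIT"
    ABORT = "VOTE_ABORT"


def coordinator_decide(votes, expected, coordinator_vote=Vote.COMMIT):
    """Return (decision, log) for one transaction.

    votes:    dict of participant id -> Vote, holding only the votes that
              arrived before the timeout (hypothetical input format).
    expected: set of participant ids that were sent VOTE_REQUEST.
    """
    log = ["START_2PC"]  # stands in for the coordinator's local log

    # A missing vote corresponds to the timeout branch: abort at once.
    if not set(expected) <= set(votes):
        log.append("GLOBAL_ABORT")
        return "GLOBAL_ABORT", log

    all_commit = all(votes[p] is Vote.COMMIT for p in expected)
    if all_commit and coordinator_vote is Vote.COMMIT:
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"

    # The decision is logged before it would be multicast to participants.
    log.append(decision)
    return decision, log
```

 A single VOTE_ABORT, a single missing vote, or an abort vote from the coordinator itself is enough to abort the whole transaction.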
 Participant:
     write INIT to local log;
     wait for VOTE_REQUEST from coordinator;
     if timeout {
         write VOTE_ABORT to local log;
         exit;
     }
     if participant votes COMMIT {
         write VOTE_COMMIT to local log;
         send VOTE_COMMIT to coordinator;
         wait for DECISION from coordinator;
         if timeout {
             multicast DECISION_REQUEST to other participants;
             wait until DECISION is received;  /* remain blocked */
             write DECISION to local log;
         }
         if DECISION == GLOBAL_COMMIT
             write GLOBAL_COMMIT to local log;
         else if DECISION == GLOBAL_ABORT
             write GLOBAL_ABORT to local log;
     } else {
         write VOTE_ABORT to local log;
         send VOTE_ABORT to coordinator;
     }
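
 The participant flow above can be traced with a small Python sketch. It is an illustration under simplifying assumptions: messages are passed in as plain arguments instead of arriving over a network, `None` models a timeout, and the function name `participant_run` and the `"BLOCKED"` outcome are invented for this example. Unlike the pseudocode, the sketch writes the decision to the log only once.

```python
def participant_run(got_vote_request, votes_commit,
                    coordinator_decision=None, peer_decision=None):
    """Return (log, outcome) for one participant.

    got_vote_request:     False models a timeout waiting for VOTE_REQUEST.
    votes_commit:         the participant's own vote.
    coordinator_decision: "GLOBAL_COMMIT", "GLOBAL_ABORT", or None (timeout).
    peer_decision:        answer obtained via DECISION_REQUEST, or None.
    """
    log = ["INIT"]  # stands in for the participant's local log

    # Timeout before voting, or a local abort vote: abort unilaterally.
    if not got_vote_request or not votes_commit:
        log.append("VOTE_ABORT")
        return log, "GLOBAL_ABORT"

    # After logging VOTE_COMMIT the participant may no longer decide alone.
    log.append("VOTE_COMMIT")

    decision = coordinator_decision
    if decision is None:
        # Coordinator timed out: fall back to asking the other participants.
        decision = peer_decision
    if decision is None:
        # Nobody knows the outcome, so the participant stays blocked.
        return log, "BLOCKED"

    log.append(decision)
    return log, decision
```

 The only path that ends in `"BLOCKED"` is the one where the participant has already logged VOTE_COMMIT and neither the coordinator nor any peer can supply the decision.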
 In addition, each participant runs a dedicated thread that handles DECISION_REQUEST messages from other participants. The handler works as follows:
     while true {
         wait until any incoming DECISION_REQUEST is received;
         read most recently recorded STATE from the local log;
         if STATE == GLOBAL_COMMIT
             send GLOBAL_COMMIT to requesting participant;
         else if STATE == INIT or STATE == GLOBAL_ABORT
             send GLOBAL_ABORT to requesting participant;
         else
             skip;  /* participant remains blocked */
     }
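
 The handler loop above reduces to a small lookup on the most recent log record. The Python sketch below mirrors it and adds a helper that polls several peers. Both function names are hypothetical, and each peer's log is modeled as a plain list of state strings rather than a real on-disk log.

```python
def reply_to_decision_request(local_log):
    """Answer a DECISION_REQUEST from the most recent log record.

    Returns "GLOBAL_COMMIT", "GLOBAL_ABORT", or None when this participant
    cannot help (the "skip" branch of the pseudocode).
    """
    state = local_log[-1]
    if state == "GLOBAL_COMMIT":
        return "GLOBAL_COMMIT"
    if state in ("INIT", "GLOBAL_ABORT"):
        # A peer still in INIT has not voted yet, so the coordinator
        # cannot have decided to commit.
        return "GLOBAL_ABORT"
    return None  # e.g. VOTE_COMMIT: this peer is waiting as well


def ask_peers(peer_logs):
    """Return the first usable answer from the peers, or None if all skip."""
    for log in peer_logs:
        reply = reply_to_decision_request(log)
        if reply is not None:
            return reply
    return None
```

 For example, `ask_peers([["INIT", "VOTE_COMMIT"], ["INIT", "VOTE_COMMIT"]])` returns `None`: when every peer has voted to commit and is itself waiting, no one can supply the decision.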
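 As the coordinator and participant flows above show, if the coordinator crashes after every participant has sent VOTE_COMMIT, no participant can decide the final outcome of the global transaction (GLOBAL_COMMIT or GLOBAL_ABORT) on its own, nor can it learn the outcome from the other participants. The whole transaction stays blocked until the coordinator recovers. If the coordinator suffers a permanent failure, such as a broken disk, the transaction becomes a permanently abandoned orphan. There are several ways to address this: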
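 1. Back up the coordinator's persistent data periodically. This guards against permanent coordinator failure at the lowest cost and is unlikely to introduce bugs, but a transaction may stay blocked for a very long time. It suits systems such as banks, where correctness outweighs everything else.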
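 2. Three-phase commit. This is a theoretical approach that is complex to implement and inefficient. The idea is as follows: assume that a majority of the participants is always alive. If the coordinator crashes, the global outcome of the transaction has to be derived from that surviving majority. Because the state of a crashed participant cannot be known, a new participant state, PRECOMMIT, is introduced: to complete a transaction successfully, a participant passes through INIT, READY, and PRECOMMIT before finally reaching COMMIT. If at least one participant is in PRECOMMIT or COMMIT, the transaction commits. If at least one participant is in INIT or ABORT, the transaction aborts. If all live participants (a majority) are in READY, the transaction aborts; even if a crashed participant later recovers in the PRECOMMIT state, it rolls back because other participants are in ABORT. The PRECOMMIT state gives a crashed participant the chance to roll back, so three-phase commit does not block as long as a majority of the participants is alive. Still, three-phase commit is best seen as a theoretical exploration: it is inefficient and does not solve the network partition problem.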
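 3. Use Paxos to remove the coordinator as a single point of failure. Jim Gray and Lamport co-authored a paper on this approach. It suits the very large clusters run by Internet companies, and Google's Megastore transactions are implemented this way. The catch is that neither Paxos nor two-phase commit is simple, so the design and coding need a small, reliable team that writes high-quality code. A later blog post will cover this approach in detail.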
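
 The recovery rule described for three-phase commit in item 2 can be written down as a short Python function. This is a sketch of the rule exactly as stated in this post, not a full three-phase commit implementation. The function name and the list-of-strings input are assumptions made for illustration, and "majority" is taken here to mean strictly more than half.

```python
def three_pc_recover(alive_states, total_participants):
    """Derive the global outcome from the reachable participants.

    alive_states:       states of the participants that are still alive,
                        each one of INIT, READY, PRECOMMIT, COMMIT, ABORT.
    total_participants: number of participants in the transaction.
    Returns "COMMIT", "ABORT", or None when no majority is reachable.
    """
    # The rule assumes a majority is alive; without one, do not decide.
    if 2 * len(alive_states) <= total_participants:
        return None

    states = set(alive_states)

    # Abort is checked first: a participant that recovers in PRECOMMIT
    # still rolls back if others have already aborted.
    if states & {"INIT", "ABORT"}:
        return "ABORT"
    if states & {"PRECOMMIT", "COMMIT"}:
        return "COMMIT"

    # Everyone reachable is in READY: abort.
    return "ABORT"
```

 With three participants, `three_pc_recover(["READY", "READY"], 3)` yields `"ABORT"`, while `three_pc_recover(["READY", "PRECOMMIT"], 3)` yields `"COMMIT"`.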
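 In short, distributed transactions can only be a utopian ideal for system developers. Introducing two-phase commit forces transactions that span multiple machines to execute completely serially, and there is no such thing as a distributed transaction without cost.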