2004
OSDI ‘04
Best Paper:
Recovering Device Drivers
Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, and Henry M. Levy,
University of Washington
Best Paper:
Using Model Checking to Find Serious File System Errors
Junfeng Yang, Paul Twohey, and Dawson Engler,
Stanford University; Madanlal Musuvathi,
Microsoft Research
LISA ‘04
Best Paper:
Scalable Centralized Bayesian Spam Mitigation with Bogofilter
Jeremy Blosser and David Josephsen,
VHA, Inc.
Security ‘04
Best Paper:
Understanding Data Lifetime via Whole System Simulation
Jim Chow, Ben Pfaff, Tal Garfinkel, Kevin Christopher, and Mendel Rosenblum,
Stanford University
Best Student Paper:
Fairplay—A Secure Two-Party Computation System
Dahlia Malkhi and Noam Nisan,
Hebrew University; Benny Pinkas,
HP Labs; Yaron Sella,
Hebrew University
2004 USENIX Annual Technical Conference
Best Paper:
Handling Churn in a DHT
Sean Rhea and Dennis Geels,
University of California, Berkeley; Timothy Roscoe,
Intel Research, Berkeley; John Kubiatowicz,
University of California, Berkeley
Best Paper:
Energy Efficient Prefetching and Caching
Athanasios E. Papathanasiou and Michael L. Scott,
University of Rochester
FREENIX Track
Best Paper:
Wayback: A User-level Versioning File System for Linux
Brian Cornell, Peter A. Dinda, and Fabián E. Bustamante,
Northwestern University
Best Student Paper:
Design and Implementation of Netdude, a Framework for Packet Trace Manipulation
Christian Kreibich,
University of Cambridge, UK
VM ‘04
Best Paper:
Semantic Remote Attestation—A Virtual Machine Directed Approach to Trusted Computing
Vivek Haldar, Deepak Chandra, and Michael Franz,
University of California, Irvine
FAST ‘04
Best Paper:
Row-Diagonal Parity for Double Disk Failure Correction
Peter Corbett, Bob English, Atul Goel, Tomislav Grcanac, Steven Kleiman, James Leong, and Sunitha Sankar,
Network Appliance, Inc.
Best Student Paper:
Improving Storage System Availability with D-GRAID
Muthian Sivathanu, Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau,
University of Wisconsin, Madison
Best Student Paper:
A Framework for Building Unobtrusive Disk Maintenance Applications
Eno Thereska, Jiri Schindler, John Bucy, Brandon Salmon, Christopher R. Lumb, and Gregory R. Ganger,
Carnegie Mellon University
NSDI ‘04
Best Paper:
Trickle: A Self-Regulating Algorithm for Code Propagation and Maintenance in Wireless Sensor Networks
Philip Levis,
University of California, Berkeley, and Intel Research Berkeley; Neil Patel,
University of California, Berkeley; David Culler,
University of California, Berkeley, and Intel Research Berkeley; Scott Shenker,
University of California, Berkeley, and ICSI
Best Student Paper:
Listen and Whisper: Security Mechanisms for BGP
Lakshminarayanan Subramanian,
University of California, Berkeley; Volker Roth,
Fraunhofer Institute, Germany; Ion Stoica,
University of California, Berkeley; Scott Shenker,
University of California, Berkeley, and ICSI; Randy H. Katz,
University of California, Berkeley
2003
LISA ‘03
Award Paper:
STRIDER: A Black-box, State-based Approach to Change and Configuration Management and Support
Yi-Min Wang, Chad Verbowski, John Dunagan, Yu Chen, Helen J. Wang, Chun Yuan, and Zheng Zhang,
Microsoft Research
Award Paper:
Distributed Tarpitting: Impeding Spam Across Multiple Servers
Tim Hunter, Paul Terry, and Alan Judge,
eircom.net
BSDCon ‘03
Best Paper:
Cryptographic Device Support for FreeBSD
Samuel J. Leffler,
Errno Consulting
Best Student Paper:
Running BSD Kernels as User Processes by Partial Emulation and Rewriting of Machine Instructions
Hideki Eiraku and Yasushi Shinjo,
University of Tsukuba
12th USENIX Security Symposium
Best Paper:
Remote Timing Attacks Are Practical
David Brumley and Dan Boneh,
Stanford University
Best Student Paper:
Establishing the Genuinity of Remote Computer Systems
Rick Kennell and Leah H. Jamieson,
Purdue University
2003 USENIX Annual Technical Conference
Award Paper:
Undo for Operators: Building an Undoable E-mail Store
Aaron B. Brown and David A. Patterson,
University of California, Berkeley
Award Paper:
Operating System I/O Speculation: How Two Invocations Are Faster Than One
Keir Fraser,
University of Cambridge Computer Laboratory; Fay Chang,
Google Inc.
FREENIX Track
Best Paper:
StarFish: Highly Available Block Storage
Eran Gabber, Jeff Fellin, Michael Flaster, Fengrui Gu, Bruce Hillyer, Wee Teck Ng, Banu Özden, and Elizabeth Shriver,
Lucent Technologies, Bell Labs
Best Student Paper:
Flexibility in ROM: A Stackable Open Source BIOS
Adam Agnew and Adam Sulmicki,
University of Maryland at College Park; Ronald Minnich,
Los Alamos National Labs; William Arbaugh,
University of Maryland at College Park
First International Conference on Mobile Systems, Applications, and Services
Best Paper:
Energy Aware Lossless Data Compression
Kenneth Barr and Krste Asanovic,
Massachusetts Institute of Technology
2nd USENIX Conference on File and Storage Technologies
Best Paper:
Using MEMS-Based Storage in Disk Arrays
Mustafa Uysal and Arif Merchant,
Hewlett-Packard Labs; Guillermo A. Alvarez,
IBM Almaden Research Center
Best Student Paper:
Pond: The OceanStore Prototype
Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz,
University of California, Berkeley
4th USENIX Symposium on Internet Technologies and Systems
Best Paper:
SkipNet: A Scalable Overlay Network with Practical Locality Properties
Nicholas J. A. Harvey,
Microsoft Research and University of Washington; Michael B. Jones, Microsoft Research; Stefan Saroiu,
University of Washington; Marvin Theimer and Alec Wolman,
Microsoft Research
Best Student Paper:
Scriptroute: A Public Internet Measurement Facility
Neil Spring, David Wetherall, and Tom Anderson,
University of Washington
2002
5th Symposium on Operating Systems Design and Implementation
Best Paper:
Memory Resource Management in VMware ESX Server
Carl A. Waldspurger,
VMware, Inc.
Best Student Paper:
An Analysis of Internet Content Delivery Systems
Stefan Saroiu, Krishna P. Gummadi, Richard J. Dunn, Steven D. Gribble, and Henry M. Levy,
University of Washington
LISA ‘02: 16th Systems Administration Conference
Best Paper:
RTG: A Scalable SNMP Statistics Architecture for Service Providers
Robert Beverly,
MIT Laboratory for Computer Science
Best Paper:
Work-Augmented Laziness with the Los Task Request System
Thomas Stepleton,
Swarthmore College Computer Society
11th USENIX Security Symposium
Best Paper:
Security in Plan 9
Russ Cox,
MIT LCS; Eric Grosse and Rob Pike,
Bell Labs; Dave Presotto,
Avaya Labs and Bell Labs; Sean Quinlan,
Bell Labs
Best Student Paper:
Infranet: Circumventing Web Censorship and Surveillance
Nick Feamster, Magdalena Balazinska, Greg Harfst, Hari Balakrishnan, and David Karger,
MIT
2nd Java Virtual Machine Research and Technology Symposium
Best Paper:
An Empirical Study of Method In-lining for a Java Just-in-Time Compiler
Toshio Suganuma, Toshiaki Yasue, and Toshio Nakatani,
IBM Tokyo Research Laboratory
Best Student Paper:
Supporting Binary Compatibility with Static Compilation
Dachuan Yu, Zhong Shao, and Valery Trifonov,
Yale University
2002 USENIX Annual Technical Conference
Best Paper:
Structure and Performance of the Direct Access File System
Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer,
Harvard University; Jeffrey S. Chase, Andrew J. Gallatin, Richard Kisley, and Rajiv G. Wickremesinghe,
Duke University; and Eran Gabber,
Lucent Technologies
Best Student Paper:
EtE: Passive End-to-End Internet Service Performance Monitoring
Yun Fu and Amin Vahdat,
Duke University; Ludmila Cherkasova and Wenting Tang,
Hewlett-Packard Laboratories
FREENIX Track
Best FREENIX Paper:
CPCMS: A Configuration Management System Based on Cryptographic Names
Jonathan S. Shapiro and John Vanderburgh,
Johns Hopkins University
Best FREENIX Student Paper:
SWILL: A Simple Embedded Web Server Library
Sotiria Lampoudi and David M. Beazley,
University of Chicago
BSDCon ‘02
Best Paper:
Running “fsck” in the Background
Marshall Kirk McKusick,
Author and Consultant
Best Paper:
Design and Implementation of a Direct Access File System (DAFS) Kernel Server for FreeBSD
Kostas Magoutis,
Division of Engineering and Applied Sciences, Harvard University
Conference on File and Storage Technologies
Best Paper:
Venti: A New Approach to Archival Storage
Sean Quinlan and Sean Dorward,
Bell Labs, Lucent Technologies
Best Student Paper:
Track-aligned Extents: Matching Access Patterns to Disk Drive Characteristics
Jiri Schindler, John Linwood Griffin, Christopher R. Lumb, Gregory R. Ganger,
Carnegie Mellon University
Source: http://hi.baidu.com/knuthocean/blog/item/8218034f4a01523caec3ab1c.html
A few days ago a classmate asked me which conferences cover the distributed systems area. A quick search online shows that the top venues are OSDI (Operating Systems Design and Implementation) and SOSP (Symposium on Operating Systems Principles). Several others, such as NSDI, FAST, and VLDB, also regularly feature eye-opening papers. Fortunately, now that cloud computing is so hot, engineering-oriented papers like GFS, MapReduce, and Bigtable have all been published at these top venues (GFS at SOSP; MapReduce and Bigtable at OSDI, the strongest systems conference), and both Google's Bigtable and Microsoft's DryadLINQ won Best Paper awards. Listed below are each conference's best papers over the years; I hope they help us survey the field from a high vantage point.
USENIX ‘09
Best Paper:
Satori: Enlightened Page Sharing
Grzegorz Miłoś, Derek G. Murray, and Steven Hand,
University of Cambridge Computer Laboratory; Michael A. Fetterman,
NVIDIA Corporation
Best Paper:
Tolerating File-System Mistakes with EnvyFS
Lakshmi N. Bairavasundaram, NetApp, Inc.; Swaminathan Sundararaman, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison
NSDI ‘09
Best Paper:
TrInc: Small Trusted Hardware for Large Distributed Systems
Dave Levin, University of Maryland; John R. Douceur, Jacob R. Lorch, and Thomas Moscibroda, Microsoft Research
Best Paper:
Sora: High Performance Software Radio Using General Purpose Multi-core Processors
Kun Tan and Jiansong Zhang, Microsoft Research Asia; Ji Fang, Beijing Jiaotong University; He Liu, Yusheng Ye, and Shen Wang, Tsinghua University; Yongguang Zhang, Haitao Wu, and Wei Wang, Microsoft Research Asia; Geoffrey M. Voelker, University of California, San Diego
FAST ‘09
Best Paper:
CA-NFS: A Congestion-Aware Network File System
Alexandros Batsakis, NetApp and Johns Hopkins University; Randal Burns, Johns Hopkins University; Arkady Kanevsky, James Lentini, and Thomas Talpey, NetApp
Best Paper:
Generating Realistic Impressions for File-System Benchmarking
Nitin Agrawal, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau,
University of Wisconsin, Madison
2008
OSDI ‘08
Jay Lepreau Best Paper:
Difference Engine: Harnessing Memory Redundancy in Virtual Machines
Diwaker Gupta, University of California, San Diego; Sangmin Lee, University of Texas at Austin; Michael Vrable, Stefan Savage, Alex C. Snoeren, George Varghese, Geoffrey M. Voelker, and Amin Vahdat, University of California, San Diego
Jay Lepreau Best Paper:
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
Yuan Yu, Michael Isard, Dennis Fetterly, and Mihai Budiu, Microsoft Research Silicon Valley; Úlfar Erlingsson, Reykjavík University, Iceland, and Microsoft Research Silicon Valley; Pradeep Kumar Gunda and Jon Currey, Microsoft Research Silicon Valley
Jay Lepreau Best Paper:
KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs
Cristian Cadar, Daniel Dunbar, and Dawson Engler, Stanford University
LISA ‘08
Best Paper:
ENAVis: Enterprise Network Activities Visualization
Qi Liao, Andrew Blaich, Aaron Striegel, and Douglas Thain, University of Notre Dame
Best Student Paper:
Automatic Software Fault Diagnosis by Exploiting Application Signatures
Xiaoning Ding, The Ohio State University; Hai Huang, Yaoping Ruan, and Anees Shaikh, IBM T.J. Watson Research Center; Xiaodong Zhang, The Ohio State University
USENIX Security ‘08
Best Paper:
Highly Predictive Blacklisting
Jian Zhang and Phillip Porras, SRI International; Johannes Ullrich, SANS Institute
Best Student Paper:
Lest We Remember: Cold Boot Attacks on Encryption Keys
J. Alex Halderman, Princeton University; Seth D. Schoen, Electronic Frontier Foundation; Nadia Heninger and William Clarkson, Princeton University; William Paul, Wind River Systems; Joseph A. Calandrino and Ariel J. Feldman, Princeton University; Jacob Appelbaum; Edward W. Felten, Princeton University
USENIX ‘08
Best Paper:
Decoupling Dynamic Program Analysis from Execution in Virtual Environments
Jim Chow, Tal Garfinkel, and Peter M. Chen, VMware
Best Student Paper:
Vx32: Lightweight User-level Sandboxing on the x86
Bryan Ford and Russ Cox, Massachusetts Institute of Technology
NSDI ‘08
Best Paper:
Remus: High Availability via Asynchronous Virtual Machine Replication
Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, and Norm Hutchinson, University of British Columbia; Andrew Warfield, University of British Columbia and Citrix Systems, Inc.
Best Paper:
Consensus Routing: The Internet as a Distributed System
John P. John, Ethan Katz-Bassett, Arvind Krishnamurthy, and Thomas Anderson, University of Washington; Arun Venkataramani, University of Massachusetts Amherst
LEET ‘08
Best Paper:
Designing and Implementing Malicious Hardware
Samuel T. King, Joseph Tucek, Anthony Cozzie, Chris Grier, Weihang Jiang, and Yuanyuan Zhou, University of Illinois at Urbana-Champaign
FAST ‘08
Best Paper:
Portably Solving File TOCTTOU Races with Hardness Amplification
Dan Tsafrir, IBM T.J. Watson Research Center; Tomer Hertz, Microsoft Research; David Wagner, University of California, Berkeley; Dilma Da Silva, IBM T.J. Watson Research Center
Best Student Paper:
An Analysis of Data Corruption in the Storage Stack
Lakshmi N. Bairavasundaram,
University of Wisconsin, Madison; Garth Goodson,
Network Appliance Inc.; Bianca Schroeder,
University of Toronto; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau,
University of Wisconsin, Madison
2007
LISA ‘07
Best Paper:
Application Buffer-Cache Management for Performance: Running the World’s Largest MRTG
David Plonka, Archit Gupta, and Dale Carder, University of Wisconsin Madison
Best Paper:
PoDIM: A Language for High-Level Configuration Management
Thomas Delaet and Wouter Joosen, Katholieke Universiteit Leuven, Belgium
16th USENIX Security Symposium
Best Paper:
Towards Automatic Discovery of Deviations in Binary Implementations with Applications to Error Detection and Fingerprint Generation
David Brumley, Juan Caballero, Zhenkai Liang, James Newsome, and Dawn Song, Carnegie Mellon University
Best Student Paper:
Keep Your Enemies Close: Distance Bounding Against Smartcard Relay Attacks
Saar Drimer and Steven J. Murdoch, Computer Laboratory, University of Cambridge
USENIX ‘07
Best Paper:
Hyperion: High Volume Stream Archival for Retrospective Querying
Peter Desnoyers and Prashant Shenoy, University of Massachusetts Amherst
Best Paper:
SafeStore: A Durable and Practical Storage System
Ramakrishna Kotla, Lorenzo Alvisi, and Mike Dahlin, The University of Texas at Austin
NSDI ‘07
Best Paper:
Life, Death, and the Critical Transition: Finding Liveness Bugs in Systems Code
Charles Killian, James W. Anderson, Ranjit Jhala, and Amin Vahdat, University of California, San Diego
Best Student Paper:
Do Incentives Build Robustness in BitTorrent?
Michael Piatek, Tomas Isdal, Thomas Anderson, and Arvind Krishnamurthy, University of Washington; Arun Venkataramani, University of Massachusetts Amherst
FAST ‘07
Best Paper:
Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?
Bianca Schroeder and Garth A. Gibson, Carnegie Mellon University
Best Paper:
TFS: A Transparent File System for Contributory Storage
James Cipar, Mark D. Corner, and Emery D. Berger, University of Massachusetts Amherst
2006
LISA ‘06
Best Paper:
A Platform for RFID Security and Privacy Administration
Melanie R. Rieback, Vrije Universiteit Amsterdam; Georgi N. Gaydadjiev, Delft University of Technology; Bruno Crispo, Rutger F.H. Hofman, and Andrew S. Tanenbaum, Vrije Universiteit Amsterdam
Honorable Mention:
A Forensic Analysis of a Distributed Two-Stage Web-Based Spam Attack
Daniel V. Klein, LoneWolf Systems
OSDI ‘06
Best Paper:
Rethink the Sync
Edmund B. Nightingale, Kaushik Veeraraghavan, Peter M. Chen, and Jason Flinn, University of Michigan
Best Paper:
Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Google, Inc.
15th USENIX Security Symposium
Best Paper:
Evaluating SFI for a CISC Architecture
Stephen McCamant, Massachusetts Institute of Technology; Greg Morrisett, Harvard University
Best Student Paper:
Keyboards and Covert Channels
Gaurav Shah, Andres Molina, and Matt Blaze, University of Pennsylvania
2006 USENIX Annual Technical Conference
Best Paper:
Optimizing Network Virtualization in Xen
Aravind Menon, EPFL; Alan L. Cox, Rice University; Willy Zwaenepoel, EPFL
Best Paper:
Replay Debugging for Distributed Applications
Dennis Geels, Gautam Altekar, Scott Shenker, and Ion Stoica, University of California, Berkeley
NSDI ‘06
Best Paper:
Experience with an Object Reputation System for Peer-to-Peer Filesharing
Kevin Walsh and Emin Gün Sirer, Cornell University
Best Paper:
Availability of Multi-Object Operations
Haifeng Yu,
Intel Research Pittsburgh and Carnegie Mellon University; Phillip B. Gibbons,
Intel Research Pittsburgh; Suman Nath,
Microsoft Research
2005
FAST ‘05
Best Paper:
Ursa Minor: Versatile Cluster-based Storage
Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, and Jay J. Wylie, Carnegie Mellon University
Best Paper:
On Multidimensional Data and Modern Disks
Steven W. Schlosser, Intel Research Pittsburgh; Jiri Schindler, EMC Corporation; Stratos Papadomanolakis, Minglong Shao, Anastassia Ailamaki, Christos Faloutsos, and Gregory R. Ganger, Carnegie Mellon University
LISA ‘05
Best Paper:
Toward a Cost Model for System Administration
Alva L. Couch, Ning Wu, and Hengky Susanto, Tufts University
Best Student Paper:
Toward an Automated Vulnerability Comparison of Open Source IMAP Servers
Chaos Golubitsky, Carnegie Mellon University
Best Student Paper:
Reducing Downtime Due to System Maintenance and Upgrades
Shaya Potter and Jason Nieh, Columbia University
IMC 2005
Best Student Paper:
Measurement-based Characterization of a Collection of On-line Games
Chris Chambers and Wu-chang Feng, Portland State University; Sambit Sahu and Debanjan Saha, IBM Research
Security ‘05
Best Paper:
Mapping Internet Sensors with Probe Response Attacks
John Bethencourt, Jason Franklin, and Mary Vernon, University of Wisconsin, Madison
Best Student Paper:
Security Analysis of a Cryptographically-Enabled RFID Device
Steve Bono, Matthew Green, and Adam Stubblefield, Johns Hopkins University; Ari Juels, RSA Laboratories; Avi Rubin, Johns Hopkins University; Michael Szydlo, RSA Laboratories
MobiSys ‘05
Best Paper:
Reincarnating PCs with Portable SoulPads
Ramón Cáceres, Casey Carter, Chandra Narayanaswami, and Mandayam Raghunath, IBM T.J. Watson Research Center
NSDI ‘05
Best Paper:
Detecting BGP Configuration Faults with Static Analysis
Nick Feamster and Hari Balakrishnan, MIT Computer Science and Artificial Intelligence Laboratory
Best Student Paper:
Botz-4-Sale: Surviving Organized DDoS Attacks That Mimic Flash Crowds
Srikanth Kandula and Dina Katabi, Massachusetts Institute of Technology; Matthias Jacob, Princeton University; Arthur Berger, Massachusetts Institute of Technology/Akamai
2005 USENIX Annual Technical Conference
General Track
Best Paper:
Debugging Operating Systems with Time-Traveling Virtual Machines
Samuel T. King, George W. Dunlap, and Peter M. Chen, University of Michigan
Best Student Paper:
Itanium—A System Implementor’s Tale
Charles Gray, University of New South Wales; Matthew Chapman and Peter Chubb, University of New South Wales and National ICT Australia; David Mosberger-Tang, Hewlett-Packard Labs; Gernot Heiser, University of New South Wales and National ICT Australia
FREENIX Track
Best Paper:
USB/IP—A Peripheral Bus Extension for Device Sharing over IP Network
Takahiro Hirofuchi, Eiji Kawai, Kazutoshi Fujikawa, and Hideki Sunahara,
Nara Institute of Science and Technology
Source: http://hi.baidu.com/knuthocean/blog/item/7f32925830ed16d49d82040f.html
There are a few interesting phenomena in distributed system design and development:
1. The CAP theorem. CAP stands for Consistency, Availability, and Partition tolerance. Consistency here means strong consistency, as in ACID; Availability means every request returns a result within bounded time; Partition tolerance means the system keeps working when the network splits into several parts, that is, when arbitrary messages may be lost. The CAP theorem says you can have at most two of the three; there is no perfect outcome. We must therefore accept trade-offs when designing replication strategies, consistency models, and distributed transactions (see the quorum sketch after this list).
2. The impossibility of consensus. This result (FLP) states that in an asynchronous system where processes may fail, agreement among processes cannot be guaranteed. The classic instance is distributed election; a practical one is tablet assignment in Bigtable. Hence implementations such as Google Chubby and Hadoop ZooKeeper have to assume a bound on server clock error; when a worker's clock falls outside the bound, it can only take itself offline to avoid producing incorrect results.
3. Errors inevitably occur. Any theoretically flawed design or implementation will fail at runtime, no matter how low the probability. If it has not failed yet, either it has not run stably for long enough or the load has not been heavy enough.
4. Errors inevitably reproduce. Experience shows that an error found while testing a distributed system will reappear once the data scale grows. Some multi-machine, multi-threaded problems are extremely hard to track down, but that is fine: infer the cause from the symptoms, add debug logging, increase the data scale, and the error is certain to reproduce.
5. The doubling rule. Experience shows that whenever a distributed system's maximum data scale doubles, problems never seen before will surface. The rule is not exact, of course, but it can guide development planning. However stable our system seems, we should not celebrate too early; doubling the data volume will certainly bring many surprises. Try it if you don't believe it!
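As a concrete illustration of the trade-off in item 1, here is a minimal sketch of the quorum rule used by Dynamo-style replicated stores (the function name is mine, not taken from any system above): with N replicas, W write acknowledgments, and R read replicas, reads always see the latest acknowledged write exactly when R + W > N; choosing smaller W and R buys availability and latency at the cost of consistency.

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """With n replicas, a write quorum of w and a read quorum of r
    must overlap in at least one replica (r + w > n) for every read
    to observe the most recent acknowledged write."""
    return r + w > n

# n=3: writing to 2 replicas and reading from 2 guarantees overlap,
# and the system still tolerates one unreachable replica.
assert is_strongly_consistent(3, 2, 2)

# n=3 with w=1, r=1 stays fast and available under partition, but a
# read may miss the latest write (eventual consistency only).
assert not is_strongly_consistent(3, 1, 1)
```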
Source: http://hi.baidu.com/knuthocean/blog/item/d291ab64301ddbfaf73654bc.html
Hypertable and HBase share the same lineage and many design similarities; the biggest difference is, of course, the choice of programming language. HBase chose Java mainly because the Apache and Hadoop common libraries and earlier projects are almost all written in it, and because Java projects generally beat C++ projects on design patterns and documentation, which suits open source development well. C++'s advantage, naturally, lies in performance and memory usage. Yahoo once produced a fine TeraSort result (perspectives.mvdirona.com/2008/07/08/HadoopWinsTeraSort.aspx); the argument there is that for most MapReduce jobs, such as distributed sorting, the bottleneck is I/O and the network, so Java and C++ perform essentially the same. However, MapReduce in Java clearly uses more CPU and memory on each server, so if the servers doing the distributed sort must also host other CPU- or memory-intensive applications, Java's performance disadvantage will show. For a table system like Hypertable/HBase, choosing Java brings the following problems:
1. Hypertable/HBase are memory- and CPU-intensive. Both adopt the Log-Structured Merge Tree design, so the memory available to the system directly determines its performance. The in-memory memtables and the table system's internal caches are heavy memory consumers; shrinking the usable memory raises the merge-dump frequency, which directly adds pressure on the underlying HDFS (see the sketch after this list). Moreover, the large amount of merging done by reads and dumps can make the CPU a bottleneck, on top of compression and decompression of data; in particular BM-diff, the algorithm Bigtable uses most, completely saturates one CPU core while compressing or decompressing. It is hard to imagine a Java implementation of HBase competing on performance with a C++ implementation like Hypertable.
2. Java garbage collection. Current JVMs stop serving for a while during garbage collection, which is a severe test for the lease mechanism that Hypertable/HBase use heavily. Java GC can be improved, but trying to solve memory management completely in a generic way is unrealistic: there is no universal approach to memory management; the strategy has to be chosen according to the application's access patterns.
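A minimal sketch of the LSM write path behind item 1 (class and method names are hypothetical, not Hypertable's or HBase's actual code): writes accumulate in a sorted in-memory memtable and are dumped to an immutable sorted file once a memory budget is hit, so halving the budget roughly doubles the dump frequency and the resulting load on the file system underneath.

```python
class Memtable:
    """Sorted in-memory write buffer of an LSM tree (simplified)."""

    def __init__(self, budget_bytes: int):
        self.budget_bytes = budget_bytes  # memory the table server can spare
        self.size = 0
        self.rows = {}  # key -> value

    def put(self, key: str, value: str) -> bool:
        """Buffer a write; True means a merge-dump is due."""
        old = self.rows.get(key)
        self.rows[key] = value
        if old is None:
            self.size += len(key) + len(value)
        else:
            self.size += len(value) - len(old)
        return self.size >= self.budget_bytes

    def dump(self, path: str) -> None:
        """Write rows out as a sorted immutable run; reads must later
        merge these runs, which is where the CPU-heavy merging and the
        pressure on the underlying file system come from."""
        with open(path, "w") as f:
            for key in sorted(self.rows):
                f.write(f"{key}\t{self.rows[key]}\n")
        self.rows.clear()
        self.size = 0
```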
That said, because Hadoop chose Java, open source collaboration became much easier, and the auxiliary systems built on top of the three core systems, such as Hadoop's monitoring tools and Pig, have been quite successful. So my view stands: for the core systems of the troika, C++ is the more reasonable choice; for auxiliary modules, Java is a fine one.
Source: http://hi.baidu.com/knuthocean/blog/item/ef201038f5d866f8b311c746.html
For Web applications, RDBMSs have inherent shortcomings in performance and scalability, and key-value stores, which give up relational requirements such as transactions and normal forms in exchange for performance and scalability, have become a good substitute. Designing a key-value store generally centers on scalability, failure recovery, and reliability; the systems fall roughly into the following categories:
1. The "shanzhai" school: many systems built in China belong to this type. Such systems are usually hard to scale, and failure recovery and load balancing need manual intervention. Since labor is relatively cheap in China, these systems dodge the hardest problems of distributed system design by adding operations staff, giving them a strongly Chinese character. Their merit is design simplicity, suitable for Internet applications on a few to a few dozen servers. For example, many multi-machine MySQL deployments scale by manual resharding: whenever the system nears its serving limit, machines are added and the data repartitioned. Likewise, many systems make the update node a single point and add simple redundancy for reliability, and many cap the size of a single table, balancing load by manually assigning machines to tables. Under such designs, doubling the application scale raises server and operations costs by far more than double, so they cannot support a cloud computing service. Being simple and dependable, though, they fit small Internet companies and the smaller products of large ones.
2. The "P2P" school: the representative work is Amazon's Dynamo. Amazon, the most successful provider of cloud computing services, is exceptionally strong in both business model and technology. Its systems characteristically use P2P techniques, combining several popular technologies such as DHTs, vector clocks (see the sketch after this list), and Merkle trees, and they allow configuring the W and R values to strike a balance between reliability and consistency. Dynamo's load balancing needs a little manual machine configuration, and many of its individual techniques can be borrowed by other systems; the "shanzhai" systems above, for instance, could adopt parts of Dynamo's design to improve their scalability.
3. The Google school: the representative works are Google's troika, GFS + MapReduce + Bigtable. These are the aristocrats, with many imitators, notably Hadoop (led by Yahoo), Hypertable (which shares roots with Hadoop), and numerous Internet companies at home and abroad. Google's designs follow a self-contained set of guiding principles covering everything from datacenter construction and server procurement to system design, data storage (compression), and deployment. For example, HDFS in Hadoop was not designed to let multiple clients append concurrently, which made the later HBase and Hypertable implementations extremely difficult. Imitators are many but successes few: HBase and Hypertable both have a series of problems in response latency and recovery from crashes; we can hope future releases bring real breakthroughs. Small Internet companies can use Hadoop's HDFS and MapReduce; for a table system like HBase/Hypertable, I recommend building your own "shanzhai" version and optimizing it continuously.
4. The academic school: these systems are driven by researchers, usually with rather complex designs and demo-grade implementations. They represent possible futures, but the demos have assorted problems: some cannot run stably for long, others cannot handle heterogeneous machines. I know little about this category; Microsoft's Dryad, the well-known MapReduce-like system, seems to have this flavor.
[Note: "shanzhai", as in shanzhai phones or the shanzhai Kaixin network, here means suited to Chinese conditions; it is not pejorative.]
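A minimal sketch of the vector clocks mentioned in category 2 (helper names are mine; Dynamo's real implementation differs): each replica advances its own slot when it accepts a write, one version supersedes another only if its clock dominates, and two versions with incomparable clocks are concurrent, which is when a Dynamo-style store keeps both and asks the client to reconcile.

```python
def advance(clock: dict, replica: str) -> dict:
    """Return a copy of `clock` with `replica`'s own counter incremented."""
    updated = dict(clock)
    updated[replica] = updated.get(replica, 0) + 1
    return updated

def dominates(a: dict, b: dict) -> bool:
    """Version a has seen at least every update that version b has."""
    return all(a.get(rep, 0) >= n for rep, n in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither version descends from the other: a true conflict."""
    return not dominates(a, b) and not dominates(b, a)

v1 = advance({}, "x")        # write accepted at replica x
v2 = advance(v1, "y")        # later write routed to replica y
fork = advance(v1, "z")      # concurrent write accepted at replica z
assert dominates(v2, v1)     # v2 supersedes v1
assert concurrent(v2, fork)  # v2 and fork must both be kept and reconciled
```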
Source: http://hi.baidu.com/knuthocean/blog/item/ae38ebf8891acb05d9f9fdb9.html
Below is a piece dug out of the Google App Engine blog about Megastore/Bigtable cross-datacenter replication. It reveals a little about the implementation and should be useful for understanding the replication mechanisms of Bigtable and its derivatives. A few points worth noting:
1. Bigtable's cross-datacenter replication guarantees only eventual consistency, whereas Megastore uses Paxos to make a tablet servable by tablet servers across datacenters. Bigtable's weakness is that a power failure can lose data; Megastore can avoid data loss, but its implementation is extremely complex. Megastore's mechanism also costs some performance, because Google Chubby is not suited to very high request volumes, so the Bigtable and Megastore teams are working together to find a balance point.
2. Bigtable's internal replication runs in the background and copies data at the column level; Megastore applies Paxos control at the entity-group level. Why does Bigtable replicate at the column level? Might it be related to locality groups?
At Google, we've learned through experience to treat everything with healthy skepticism. We expect that servers, racks, shared GFS cells, and even entire datacenters will occasionally go down, sometimes with little or no warning. This has led us to try as hard as possible to design our products to run on multiple servers, multiple cells, and even multiple datacenters simultaneously, so that they keep running even if any one (or more) redundant underlying parts go down. We call this multihoming. It's a term that usually applies narrowly, to networking alone, but we use it much more broadly in our internal language.
Multihoming is straightforward for read-only products like web search, but it's more difficult for products that allow users to read and write data in real time, like GMail, Google Calendar, and App Engine. I've personally spent a while thinking about how multihoming applies to the App Engine datastore. I even gave a talk about it at this year's Google I/O.
While I've got you captive, I'll describe how multihoming currently works in App Engine, and how we're going to improve it with a release next week. I'll wrap things up with more detail about App Engine's maintenance schedule.
Bigtable replication and planned datacenter moves
When we launched App Engine, the datastore served each application's data out of one datacenter at a time. Data was replicated to other datacenters in the background, using Bigtable's built-in replication facility. For the most part, this was a big win. It gave us mature, robust, real time replication for all datastore data and metadata.
For example, if the datastore was serving data for some apps from datacenter A, and we needed to switch to serving their data from datacenter B, we simply flipped the datastore to read only mode, waited for Bigtable replication to flush any remaining writes from A to B, then flipped the switch back and started serving in read/write mode from B. This generally works well, but it depends on the Bigtable cells in both A and B to be healthy. Of course, we wouldn't want to move to B if it was unhealthy, but we definitely would if B was healthy but A wasn't.
Planning for trouble
Google continuously monitors the overall health of App Engine's underlying services, like GFS and Bigtable, in all of our datacenters. However, unexpected problems can crop up from time to time. When that happens, having backup options available is crucial.
You may remember the unplanned outage we had a few months ago. We published a detailed postmortem; in a nutshell, the shared GFS cell we use went down hard, which took us down as well, and it took a while to get the GFS cell back up. The GFS cell is just one example of the extent to which we use shared infrastructure at Google. It's one of our greatest strengths, in my opinion, but it has its drawbacks. One of the most noticeable drawbacks is loss of isolation. When a piece of shared infrastructure has problems or goes down, it affects everything that uses it.
In the example above, if the Bigtable cell in A is unhealthy, we're in trouble. Bigtable replication is fast, but it runs in the background, so it's usually at least a little behind, which is why we wait for that final flush before switching to B. If A is unhealthy, some of its data may be unavailable for extended periods of time. We can't get to it, so we can't flush it, we can't switch to B, and we're stuck in A until its Bigtable cell recovers enough to let us finish the flush. In extreme cases like this, we might not know how soon the data in A will become available. Rather than waiting indefinitely for A to recover, we'd like to have the option to cut our losses and serve out of B instead of A, even if it means a small, bounded amount of disruption to application data. Following our example, that extreme recovery scenario would go something like this:
We give up on flushing the most recent writes in A that haven't replicated to B, and switch to serving the data that is in B. Thankfully, there isn't much data in A that hasn't replicated to B, because replication is usually quite fast. It depends on the nature of the failure, but the window of unreplicated data usually only includes a small fraction of apps, and is often as small as a few thousand recent puts, deletes, and transaction commits, across all affected apps.
Naturally, when A comes back online, we can recover that unreplicated data, but if we've already started serving from B, we can't automatically copy it over from A, since there may have been conflicting writes in B to the same entities. If your app had unreplicated writes, we can at least provide you with a full dump of those writes from A, so that your data isn't lost forever. We can also provide you with tools to relatively easily apply those unreplicated writes to your current datastore serving out of B.
Unfortunately, Bigtable replication on its own isn't quite enough for us to implement the extreme recovery scenario above. We use Bigtable single-row transactions, which let us do read/modify/write operations on multiple columns in a row, to make our datastore writes transactional and consistent. Unfortunately, Bigtable replication operates at the column value level, not the row level. This means that after a Bigtable transaction in A that updates two columns, one of the new column values could be replicated to B but not the other.
If this happened, and we switched to B without flushing the other column value, the datastore would be internally inconsistent and difficult to recover to a consistent state without the data in A. In our July 2nd outage, it was partly this expectation of internal inconsistency that prevented us from switching to datacenter B when A became unhealthy.
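To make this failure mode concrete, here is a minimal sketch (hypothetical data structures, not App Engine's actual code) of column-value-level replication: a single-row transaction in datacenter A atomically updates two columns, but each cell replicates independently, so a crash mid-stream can leave B in a state that never existed in A.

```python
def replicate_column_level(cells, b, crash_after):
    """Ship individual (row, column, value) cells, Bigtable-style.
    Stopping mid-transaction leaves B with a half-applied row."""
    for i, (row, col, val) in enumerate(cells):
        if i >= crash_after:
            return                      # datacenter A became unreachable
        b.setdefault(row, {})[col] = val

# A single-row transaction in A updated two columns atomically:
txn_cells = [("user42", "balance", "90"), ("user42", "last_txn", "t17")]
b = {"user42": {"balance": "100", "last_txn": "t16"}}

replicate_column_level(txn_cells, b, crash_after=1)
# B saw the new balance but not the matching transaction id, a row
# state that never existed in A:
assert b["user42"] == {"balance": "90", "last_txn": "t16"}
```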
Megastore replication saves the day!
Thankfully, there's a solution to our consistency problem: Megastore replication. Megastore is an internal library on top of Bigtable that supports declarative schemas, multi-row transactions, secondary indices, and recently, consistent replication across datacenters. The App Engine datastore uses Megastore liberally. We don't need all of its features - declarative schemas, for example - but we've been following the consistent replication feature closely during its development.
Megastore replication is similar to Bigtable replication in that it replicates data across multiple datacenters, but it replicates at the level of entire entity group transactions, not individual Bigtable column values. Furthermore, transactions on a given entity group are always replicated in order. This means that if Bigtable in datacenter A becomes unhealthy, and we must take the extreme option to switch to B before all of the data in A has flushed, B will be consistent and usable. Some writes may be stuck in A and unavailable in B, but B will always be a consistent recent snapshot of the data in A. Some scattered entity groups may be stale, i.e., they may not reflect the most recent updates, but we'd at least be able to start serving from B immediately, as opposed to waiting for A to recover.
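By contrast, here is a sketch of the entity-group-level guarantee just described (again with hypothetical structures, not the real Megastore code): transactions are shipped whole and in commit order, so stopping early leaves B stale but never internally inconsistent.

```python
def replicate_entity_group_level(txn_log, b, crash_after):
    """Apply whole transactions in commit order; stopping early leaves
    B missing the newest transactions, but every applied one is complete."""
    for i, txn in enumerate(txn_log):
        if i >= crash_after:
            return                      # A went down; B is a clean snapshot
        for row, col, val in txn:       # all-or-nothing per transaction
            b.setdefault(row, {})[col] = val

txn_log = [
    [("user42", "balance", "90"), ("user42", "last_txn", "t17")],  # t17
    [("user42", "balance", "75"), ("user42", "last_txn", "t18")],  # t18
]
b = {"user42": {"balance": "100", "last_txn": "t16"}}

replicate_entity_group_level(txn_log, b, crash_after=1)
# B is stale (t18 is stuck in A) but consistent: balance and last_txn
# agree, exactly as they did in A after t17.
assert b["user42"] == {"balance": "90", "last_txn": "t17"}
```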
To Paxos or not to Paxos
Megastore replication was originally intended to replicate across multiple datacenters synchronously and atomically, using Paxos. Unfortunately, as I described in my Google I/O talk, the latency of Paxos across datacenters is simply too high for a low-level, developer facing storage system like the App Engine datastore.
Due to that, we've been working with the Megastore team on an alternative: asynchronous, background replication similar to Bigtable's. This system maintains the write latency our developers expect, since it doesn't replicate synchronously (with Paxos or otherwise), but it's still consistent and fast enough that we can switch datacenters at a moment's notice with a minimum of unreplicated data.
Onward and upward
We've had a fully functional version of asynchronous Megastore replication for a while. We've been testing it heavily, working out the kinks, and stressing it to make sure it's robust as possible. We've also been using it in our internal version of App Engine for a couple months. I'm excited to announce that we'll be migrating the public App Engine datastore to use it in a couple weeks, on September 22nd.
This migration does require some datastore downtime. First, we'll switch the datastore to read only mode for a short period, probably around 20-30 minutes, while we do our normal data replication flush, and roll forward any transactions that have been committed but not fully applied. Then, since Megastore replication uses a new transaction log format, we need to take the entire datastore down while we drop and recreate our transaction log columns in Bigtable. We expect this to only take a few minutes. After that, we'll be back up and running on Megastore replication!
As described, Megastore replication will make App Engine much more resilient to hiccoughs and outages in individual datacenters and significantly reduce the likelihood of extended outages. It also opens the door to two new options which will give developers more control over how their data is read and written. First, we're exploring allowing reads from the non-primary datastore if the primary datastore is taking too long to respond, which could decrease the likelihood of timeouts on read operations. Second, we're exploring full Paxos for write operations on an opt-in basis, guaranteeing data is always synchronously replicated across datacenters, which would increase availability at the cost of additional write latency.
Both of these features are speculative right now, but we're looking forward to allowing developers to make the decisions that fit their applications best!
Planning for scheduled maintenance
Finally, a word about our maintenance schedule. App Engine's scheduled maintenance periods usually correspond to shifts in primary application serving between datacenters. Our maintenance periods usually last for about an hour, during which application serving is continuous, but access to the Datastore and memcache may be read-only or completely unavailable.
We've recently developed better visibility into when we expect to shift datacenters. This information isn't perfect, but we've heard from many developers that they'd like more advance notice from App Engine about when these maintenance periods will occur. Therefore, we're happy to announce below the preliminary maintenance schedule for the rest of 2009.
- Tuesday, September 22nd, 5:00 PM Pacific Time (migration to Megastore)
- Tuesday, November 3rd, 5:00 PM Pacific Time
- Tuesday, December 1st, 5:00 PM Pacific Time
We don't expect this information to change, but if it does, we'll notify you (via the App Engine Downtime Notify Google Group) as soon as possible. The App Engine team members are personally dedicated to keeping your applications serving without interruption, and we realize that weekday maintenance periods aren't ideal for many. However, we've selected the day of the week and time of day for maintenance to balance disruption to App Engine developers with availability of the full engineering teams of the services App Engine relies upon, like GFS and Bigtable. In the coming months, we expect features like Megastore replication to help reduce the length of our maintenance periods.
Posted by Ryan Barrett, App Engine Team
Source: http://hi.baidu.com/knuthocean/blog/item/12bb9f3dea0e400abba1673c.html