April 6, 2012
http://incubator.apache.org/kafka/design.html
1. Why we built this
asd (activity stream data) is part of any website and reflects how the site is used, e.g. which content is searched for or displayed. Usually this data is logged to files and then periodically aggregated and analyzed. od (operation data) is data about machine performance, together with operational data aggregated from various other sources.
In recent years asd and od have become an important part of running a website, so more sophisticated infrastructure is required.
Characteristics of the data:
a. Immutable activity data with very high throughput is a challenge for real-time processing; the volume can easily grow 10x or 100x.
b. Traditional logging is a respectable and scalable way to support offline processing, but its latency is too high.
Kafka is intended to be a single queuing platform that can support both offline and online use cases.
2. Major Design Elements
There is a small number of major design decisions that make Kafka different from most other messaging systems:
- Kafka is designed for persistent messages as the common case; (messages are persisted by default)
- Throughput rather than features is the primary design constraint; (throughput comes first)
- State about what has been consumed is maintained as part of the consumer, not the server; (state is kept by the client)
- Kafka is explicitly distributed. It is assumed that producers, brokers, and consumers are all spread over multiple machines. (it must be distributed)
3. Basics
Messages are the fundamental unit of communication;
Messages are published to a topic by a producer, which means they are physically sent to a server acting as a broker;
If multiple consumers subscribe to a topic, every message on that topic is delivered to each of those consumers;
Kafka is distributed: producers, brokers, and consumers can each run as a cluster of machines that cooperate as a logical group;
Within a single consumer group, each message is consumed by exactly one of the group's consumer processes;
A more common case in our own usage is that we have multiple logical consumer groups, each consisting of a cluster of consuming machines that act as a logical whole.
Kafka stores only one copy of each message, no matter how many consumers a topic has.
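The delivery rules above can be sketched as follows. This is an illustrative model only (not Kafka's actual implementation): one stored copy of the log, fanned out once per consumer group, and within a group each message going to exactly one member (round-robin here for simplicity).

```javascript
// Sketch: one stored copy of each message; every group sees every message;
// within a group, exactly one member receives each message.
function deliver(log, groups) {
  // groups: { groupName: [memberName, ...] }
  var deliveries = {}; // groupName -> memberName -> [messages]
  Object.keys(groups).forEach(function (g) {
    deliveries[g] = {};
    groups[g].forEach(function (m) { deliveries[g][m] = []; });
    log.forEach(function (msg, i) {
      // round-robin inside the group: exactly one member per message
      var member = groups[g][i % groups[g].length];
      deliveries[g][member].push(msg);
    });
  });
  return deliveries;
}

var log = ["m0", "m1", "m2", "m3"];
var out = deliver(log, { search: ["c1", "c2"], newsfeed: ["c3"] });
// every group as a whole sees all 4 messages, split among its members
```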
4. Message Persistence and Caching
4.1 Don't fear the filesystem!
Kafka relies entirely on the filesystem to store and cache messages;
People's intuition about disks is that they are 'slow', which makes them doubt that a persistent structure can offer competitive performance. In fact, how slow or fast a disk is depends entirely on how it is used:
a properly designed disk structure can often be as fast as the network.
http://baike.baidu.com/view/969385.htm raid-5
http://www.china001.com/show_hdr.php?xname=PPDDMV0&dname=66IP341&xpos=172 disk types
Sequential disk reads and writes are very fast:
linear writes on a six-disk 7200rpm SATA RAID-5 array run at about 300MB/sec. These linear reads and writes are the most predictable of all usage patterns, and hence the ones detected and optimized best by the operating system, using read-ahead and write-behind techniques.
Modern operating systems use main memory as a disk cache. Any modern OS will happily divert all free memory to disk caching, with little performance penalty when the memory is reclaimed. All disk reads and writes will go through this unified cache.
JVM: a) the memory overhead of objects is very high, often double the size of the data stored; b) as heap data grows, GC becomes more and more expensive.
As a result of these factors, using the filesystem and relying on pagecache is superior to maintaining an in-memory cache or other structure.
With compression, doing so will result in a cache of up to 28-30GB on a 32GB machine without GC penalties.
This suggests a design which is very simple: maintain as much as possible in memory and flush to the filesystem only when necessary.
When data is written immediately to a persistent file without calling flush, it is in effect only written to the OS pagecache, to be flushed by the OS at some later time. On top of that we add a configuration-driven flush policy to allow the user of the system to control how often data is flushed to the physical disk (every N messages or every M seconds), to put a bound on the amount of data "at risk" in the event of a hard crash.
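Such a flush policy can be sketched as below. The shape (flush after N messages or after M milliseconds) follows the text; the names and the exact bookkeeping are illustrative, not Kafka's actual configuration keys.

```javascript
// Sketch of a configuration-driven flush policy: flush when either
// maxMessages unflushed messages accumulate or maxIntervalMs have passed
// since the last flush. Time is passed in explicitly to keep it testable.
function FlushPolicy(maxMessages, maxIntervalMs) {
  this.maxMessages = maxMessages;
  this.maxIntervalMs = maxIntervalMs;
  this.unflushed = 0;
  this.lastFlush = Date.now();
}
FlushPolicy.prototype.onAppend = function (now) {
  this.unflushed += 1; // message went to pagecache, not yet to disk
  if (this.unflushed >= this.maxMessages ||
      now - this.lastFlush >= this.maxIntervalMs) {
    this.unflushed = 0; // this is where a real log would call fsync
    this.lastFlush = now;
    return true; // caller should flush to the physical disk now
  }
  return false;
};

var p = new FlushPolicy(3, 1000);
var t0 = Date.now();
```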
4.2 Constant Time Suffices
The persistent data structure used in messaging systems metadata is often a BTree. BTrees are the most versatile data structure available, and make it possible to support a wide variety of transactional and non-transactional semantics in the messaging system.
Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time so parallelism is limited. Hence even a handful of disk seeks leads to very high overhead.
Furthermore BTrees require a very sophisticated page or row locking implementation to avoid locking the entire tree on each operation.
The implementation must pay a fairly high price for row-locking or else effectively serialize all reads.
In short: the metadata of a persistent messaging system is usually a BTree, but as an on-disk structure its cost is too high, because of seeks and the locking needed to avoid locking the whole tree.
Intuitively a persistent queue could be built on simple reads and appends to files as is commonly the case with logging solutions.
A persistent queue can be built on simple reads of, and appends to, files. It gives up some of the semantics a BTree supports, but in exchange every operation is O(1) and reads do not block writes or each other.
The performance is completely decoupled from the data size: one server can now take full advantage of a number of cheap, low-rotational-speed 1+TB SATA drives.
Though they have poor seek performance, these drives often have comparable performance for large reads and writes at 1/3 the price and 3x the capacity.
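The append-only idea can be sketched with an in-memory stand-in for the file (real Kafka appends to on-disk segment files; the array index here plays the role of the byte offset):

```javascript
// Minimal sketch of a persistent queue as an append-only log:
// O(1) append and O(1) read-by-offset, independent of total size.
function Log() {
  this.entries = []; // index into this array plays the role of the offset
}
Log.prototype.append = function (msg) { // O(1): no seek, no tree rebalance
  this.entries.push(msg);
  return this.entries.length - 1; // offset of the appended message
};
Log.prototype.read = function (offset) { // O(1) positional read
  return this.entries[offset];
};

var log = new Log();
var off = log.append("hello");
log.append("world");
```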
4.3 Maximizing Efficiency
Furthermore we assume each message published is read at least once (and often multiple times), hence we optimize for consumption rather than production.
There are two common causes of inefficiency:
- too many network requests: the APIs are built around a "message set" abstraction, which allows network requests to group messages together and amortize the overhead of the network roundtrip rather than sending a single message at a time. (Only batch APIs are exposed, so the per-request network cost is spread across a set of messages instead of paid per message.)
- excessive byte copying: the message log maintained by the broker is itself just a directory of message sets that have been written to disk. Maintaining this common format allows optimization of the most important operation: network transfer of persistent log chunks.
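The "message set" idea, i.e. amortizing the fixed per-request overhead, can be sketched as follows (illustrative only, not Kafka's API):

```javascript
// Group messages into sets; each set costs one network request.
function toMessageSets(messages, batchSize) {
  var sets = [];
  for (var i = 0; i < messages.length; i += batchSize) {
    sets.push(messages.slice(i, i + batchSize)); // one request per set
  }
  return sets;
}

// Requests needed for a given batch size: the per-request overhead is
// paid once per set instead of once per message.
function requestsNeeded(messageCount, batchSize) {
  return Math.ceil(messageCount / batchSize);
}

var sets = toMessageSets(["a", "b", "c", "d", "e"], 2);
// 1000 messages sent singly cost 1000 round trips; in sets of 200, only 5.
```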
To understand the impact of sendfile, it is important to understand the common data path for transfer of data from file to socket:
- The operating system reads data from the disk into pagecache in kernel space
- The application reads the data from kernel space into a user-space buffer
- The application writes the data back into kernel space into a socket buffer
- The operating system copies the data from the socket buffer to the NIC buffer where it is sent over the network
With the zero-copy support provided by the OS (sendfile), only the final copy to the NIC buffer is needed.
4.4 End-to-end Batch Compression
In many cases the bottleneck is actually not CPU but network. This is particularly true for a data pipeline that needs to send messages across data centers.
Efficient compression requires compressing multiple messages together rather than compressing each message individually.
Ideally this would be possible in an end-to-end fashion — that is, data would be compressed prior to sending by the producer and remain compressed on the server, only being decompressed by the eventual consumers.
A batch of messages can be clumped together compressed and sent to the server in this form. This batch of messages will be delivered all to the same consumer and will remain in compressed form until it arrives there.
In other words: the Kafka producer API supports batch compression, the broker does not touch the batch at all, and the batch is delivered in compressed form to the consumer.
4.5 Consumer state
Keeping track of what has been consumed is one of the key things a messaging system must provide.
State tracking requires updating a persistent entity and potentially causes random accesses.
Most messaging systems keep metadata about what messages have been consumed on the broker. That is, as a message is handed out to a consumer, the broker records that fact locally.
Problem: if the consumer fails while processing, the message is lost. Improvement: the consumer acks each message after consuming it; if the broker does not receive the ack within a timeout, it redelivers the message.
New problems: 1. if a message is consumed successfully but the ack is lost, it is consumed twice; 2. the broker must now keep multiple states about every single message; 3. when the broker runs on multiple machines, this state has to be kept in sync between them.
4.5.1 Message delivery semantics
So clearly there are multiple possible message delivery guarantees that could be provided: at most once, at least once, exactly once.
This problem is heavily studied, and is a variation of the "transaction commit" problem. Algorithms that provide exactly-once semantics exist, two- or three-phase commits and Paxos variants being examples, but they come with drawbacks: they typically require multiple round trips and may have poor guarantees of liveness (they can halt indefinitely).
Kafka does two unusual things with respect to metadata.
First the stream is partitioned on the brokers into a set of distinct partitions.
Within a partition messages are stored in the order in which they arrive at the broker, and will be given out to consumers in that same order. This means that rather than store metadata for each message (marking it as consumed, say), we just need to store the "high water mark" for each combination of consumer, topic, and partition.
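The "high water mark" bookkeeping can be sketched as a map with one integer per (consumer group, topic, partition) combination, instead of a flag per message (the key format below is just an illustration):

```javascript
// Per-message "consumed" flags are replaced by one integer per
// (group, topic, partition): everything below the mark is consumed.
function OffsetStore() {
  this.marks = {}; // "group/topic/partition" -> high water mark
}
OffsetStore.prototype.key = function (group, topic, partition) {
  return group + "/" + topic + "/" + partition;
};
OffsetStore.prototype.advance = function (group, topic, partition, offset) {
  this.marks[this.key(group, topic, partition)] = offset;
};
OffsetStore.prototype.get = function (group, topic, partition) {
  return this.marks[this.key(group, topic, partition)] || 0;
};

var store = new OffsetStore();
store.advance("search", "clicks", 0, 42); // consumed up through offset 42
```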
4.5.2 Consumer state
In Kafka, the consumers are responsible for maintaining state information (offset) on what has been consumed.
Typically, the Kafka consumer library writes its state data to zookeeper.
This solves a distributed consensus problem, by removing the distributed part!
There is a side benefit of this decision. A consumer can deliberately rewind back to an old offset and re-consume data.
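The rewind side benefit falls out directly when the consumer owns its offset. A sketch (in real Kafka the offset lives in zookeeper; here it is just a field on the consumer, and an array stands in for a partition's log):

```javascript
// Because ordered messages stay in the log and the consumer owns its
// position, it can deliberately rewind and re-consume old data.
function Consumer(log) {
  this.log = log;   // stand-in for one partition's message log
  this.offset = 0;  // state kept by the consumer, not the broker
}
Consumer.prototype.poll = function () {
  return this.offset < this.log.length ? this.log[this.offset++] : null;
};
Consumer.prototype.rewind = function (offset) {
  this.offset = offset; // go back; the broker keeps no per-message state
};

var c = new Consumer(["a", "b", "c"]);
c.poll(); c.poll();   // consumed "a", "b"
c.rewind(0);
var again = c.poll(); // "a" is delivered a second time
```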
4.5.3 Push vs. pull
A related question is whether consumers should pull data from brokers or brokers should push data to the subscriber.
There are pros and cons to both approaches.
However a push-based system has difficulty dealing with diverse consumers, since the broker controls the rate at which data is transferred. The goal of push is for the consumer to consume at the maximum possible rate; unfortunately, when consumption falls behind production, the consumer tends to be overwhelmed.
A pull-based system has the nicer property that the consumer simply falls behind and catches up when it can. The push problem can be mitigated with some kind of backoff protocol by which the consumer indicates it is overwhelmed, but getting the rate of transfer to fully utilize (and never over-utilize) the consumer is trickier than it seems. Previous attempts at building systems in this fashion led us to go with the more traditional pull model: it avoids the overwhelm problem and still makes full use of the consumer's capacity.
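A pull loop's natural flow control can be sketched like this. Delays are recorded rather than slept, and the broker is simulated as a list of poll results; the backoff numbers are illustrative:

```javascript
// Pull-model sketch: when a poll returns nothing, the consumer backs off
// (doubling its wait up to a cap); when data appears, it resets and
// catches up at its own pace. No broker-side rate decisions needed.
function pullSession(brokerBatches, initialBackoffMs, maxBackoffMs) {
  var backoff = initialBackoffMs;
  var waited = [];   // backoff used before each empty poll
  var consumed = [];
  brokerBatches.forEach(function (batch) {
    if (batch.length === 0) {
      waited.push(backoff);
      backoff = Math.min(backoff * 2, maxBackoffMs); // exponential backoff
    } else {
      backoff = initialBackoffMs; // data again: reset and catch up
      consumed = consumed.concat(batch);
    }
  });
  return { consumed: consumed, waited: waited };
}

var r = pullSession([["m1"], [], [], [], ["m2", "m3"]], 10, 1000);
// three empty polls back off 10, 20, 40 ms; all messages still arrive
```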
5. Distribution
Kafka is built to be run across a cluster of machines as the common case. There is no central "master" node. Brokers are peers to each other and can be added and removed at anytime without any manual configuration changes. Similarly, producers and consumers can be started dynamically at any time. Each broker registers some metadata (e.g., available topics) in Zookeeper. Producers and consumers can use Zookeeper to discover topics and to co-ordinate the production and consumption. The details of producers and consumers will be described below.
6. Producer
6.1 Automatic producer load balancing
Kafka supports client-side load balancing for message producers or use of a dedicated load balancer to balance TCP connections.
The advantage of using a level-4 load balancer is that each producer only needs a single TCP connection, and no connection to zookeeper is needed.
The disadvantage is that the balancing is done at the TCP connection level, and hence it may not be well balanced (if some producers produce many more messages than others, evenly dividing up the connections per broker may not result in evenly dividing up the messages per broker).
Client-side zookeeper-based load balancing solves some of these problems. It allows the producer to dynamically discover new brokers, and balance load on a per-request basis. It allows the producer to partition data according to some key instead of randomly.
The working of the zookeeper-based load balancing is described below. Zookeeper watchers are registered on the following events—
- a new broker comes up
- a broker goes down
- a new topic is registered
- a broker gets registered for an existing topic
Internally, the producer maintains an elastic pool of connections to the brokers, one per broker. This pool is kept updated to establish/maintain connections to all the live brokers, through the zookeeper watcher callbacks. When a producer request for a particular topic comes in, a broker partition is picked by the partitioner (see section on semantic partitioning). The available producer connection is used from the pool to send the data to the selected broker partition.
The producer manages its connections to the brokers through zookeeper. For each request it computes the partition according to the partition rule, picks the corresponding connection from the pool, and sends the data.
6.2 Asynchronous send
Asynchronous non-blocking operations are fundamental to scaling messaging systems.
This allows buffering of produce requests in a in-memory queue and batch sends that are triggered by a time interval or a pre-configured batch size.
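The buffering described above can be sketched as follows. The two trigger conditions (batch size and time interval) come from the text; the names (`batchSize`, `lingerMs`) and the explicit `now` parameter are illustrative choices to keep the sketch deterministic:

```javascript
// Async-send sketch: produce() only appends to an in-memory queue;
// a batch is handed to the send callback when it reaches batchSize
// or when lingerMs has elapsed since the first buffered message.
function AsyncProducer(batchSize, lingerMs, send) {
  this.batchSize = batchSize;
  this.lingerMs = lingerMs;
  this.send = send;          // callback performing the network send
  this.queue = [];
  this.firstEnqueued = null;
}
AsyncProducer.prototype.produce = function (msg, now) {
  if (this.queue.length === 0) this.firstEnqueued = now;
  this.queue.push(msg);      // non-blocking: just buffer
  this.maybeFlush(now);
};
AsyncProducer.prototype.maybeFlush = function (now) {
  if (this.queue.length >= this.batchSize ||
      (this.queue.length > 0 && now - this.firstEnqueued >= this.lingerMs)) {
    this.send(this.queue);
    this.queue = [];
  }
};

var sent = [];
var ap = new AsyncProducer(3, 100, function (b) { sent.push(b.slice()); });
ap.produce("a", 0);
ap.produce("b", 1);
ap.produce("c", 2);   // size trigger: ["a","b","c"] goes out
ap.produce("d", 3);
ap.maybeFlush(200);   // time trigger: ["d"] goes out
```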
6.3 Semantic partitioning
The producer has the capability to be able to semantically map messages to the available kafka nodes and partitions. This allows partitioning the stream of messages with some semantic partition function based on some key in the message to spread them over broker machines.
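A typical semantic partition function is a stable hash of a message key modulo the partition count, so every message with the same key lands on the same broker partition. A sketch (the Java-style string hash is an illustrative choice, not Kafka's mandated partitioner):

```javascript
// Key-based semantic partitioning: same key -> same partition, always.
function partitionFor(key, numPartitions) {
  var h = 0;
  for (var i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) | 0; // Java-style string hash
  }
  return Math.abs(h) % numPartitions;
}

var p1 = partitionFor("user-42", 4);
var p2 = partitionFor("user-42", 4); // deterministic for a given key
```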
1. JavaScript code, file login.js
// write the user's login info into a cookie
function SetCookie(form) // takes the login form; reads name and password from it
{
    var name = form.name.value;
    var password = form.password.value;
    var Days = 1; // this cookie will be kept for 1 day
    var exp = new Date(); // current date plus the retention period gives the expiry
    exp.setTime(exp.getTime() + Days*24*60*60*1000);
    document.cookie = "user=" + escape(name) + "/" + escape(password) + ";expires=" + exp.toGMTString();
}
// read-cookie function -- regular-expression version
function getCookie(name)
{
    var arr = document.cookie.match(new RegExp("(^| )" + name + "=([^;]*)(;|$)"));
    if (arr != null) return unescape(arr[2]);
    return null;
}
// read-cookie function -- plain implementation
function readCookie(form) {
    var cookieValue = "";
    var search = "user=";
    if (document.cookie.length > 0) {
        var offset = document.cookie.indexOf(search);
        if (offset != -1) {
            offset += search.length;
            var end = document.cookie.indexOf(";", offset);
            if (end == -1)
                end = document.cookie.length;
            // extract the value stored in the cookie
            cookieValue = unescape(document.cookie.substring(offset, end));
            if (cookieValue != "") {
                var str = cookieValue.split("/");
                form.name.value = str[0];
                form.password.value = str[1];
            }
        }
    }
}
// delete the cookie (in a servlet, max-age 0 deletes it and -1 makes it session-scoped; JavaScript differs slightly)
function delCookie()
{
    var name = "user"; // was "admin", which never matched the "user" cookie set above
    var exp = new Date();
    exp.setTime(exp.getTime() - 1);
    var cval = getCookie(name);
    if (cval != null) document.cookie = name + "=" + cval + ";expires=" + exp.toGMTString();
}
2. JSP code, file login.jsp
<%@ page contentType="text/html; charset=gb2312" language="java"
import="java.sql.*" errorPage=""%>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<title>javascript cookie control</title>
<link href="css/style.css" rel="stylesheet" type="text/css">
<script type="text/javascript" src="js/login.js"></script>
</head>
<script language="javascript">
function checkEmpty(form){
for(i=0;i<form.length;i++){
if(form.elements[i].value==""){
alert("Form fields must not be empty");
return false;
}
}
}
</script>
<body onload="readCookie(form)"> <!-- cookie handled by JavaScript -->
<div align="center">
<table width="324" height="225" border="0" cellpadding="0" cellspacing="0">
<tr height="50">
<td ></td>
</tr>
<tr align="center">
<td background="images/back.jpg">
<br>
<br>
Login
<form name="form" method="post" action="" onSubmit="return checkEmpty(form)">
<input type="hidden" name="id" value="-1">
<table width="268" border="1" cellpadding="0" cellspacing="0">
<tr align="center">
<td width="63" height="30">
Username:
</td>
<td width="199">
<input type="text" name="name" id="name">
</td>
</tr>
<tr align="center">
<td height="30">
Password:
</td>
<td>
<input type="password" name="password" id="password">
</td>
</tr>
</table>
<br>
<input type="submit" value="Submit">
<input type="checkbox" name="cookie" onclick="SetCookie(form)">Remember me
</form>
</td>
</tr>
</table>
</div>
</body>
</html>
Goal: when you open login.jsp again, the form is already filled in with your last login info!
Problems: 1. The cookie name is hard-coded in the JavaScript, which is not very flexible.
2. JavaScript cookies are stored as strings, so when reading them back you must parse them in the same format you stored them in.
3. I originally wanted automatic login, but every one of my pages checks the session; with one part on the client and one on the server, I couldn't make it work.
1. Variable types
- undefined
- null
- string
- the difference between == and ===
- number
- boolean
- string, number, and boolean each have a corresponding wrapper 'object class'
2. Functions
- defining functions
- the function keyword
- parameters (see the examples), arguments
- variable declarations inside functions, and what var changes
- scope
- scope chain (an inner function can see its outer function's variables)
- anonymous functions
- use cases (code that is not reused, e.g. jsonp callback functions)
- the behavior of this
Examples:
var add = function(x) {
    return x + 1; // was "return x++", which returns x unchanged to the caller
};
add(1, 2, 3); // extra arguments are allowed, similar to Java varargs (int... x)

var fn = function(name, pass) {
    alert(name);
    alert(pass);
};
fn("hello", "1234", 5); // arguments are bound in the order they are passed

var name = "windows";
var fn = function() {
    var name = "hello";
    alert(this.name); // "windows": called this way, this refers to the global window object
};
fn();

var name = "windows";
var fn = function() {
    name = "hello"; // no var, so this assigns to the global variable name
    alert(this.name); // "hello": this is the global object, whose name was just overwritten
};
fn();

function add(a) {
    return ++a;
}
var fn = function(x, add) {
    return add(x);
};
fn(1, add); // a function passed as an argument
3. Closures
1. A handle is just an identifier; once we hold an object's handle we can perform any operation on the object.
2. A handle is not necessarily a pointer. The OS uses the handle to locate a block of memory; the handle may be an identifier, a map key, or a pointer, depending on how the OS implements it. An fd is, to some extent, a substitute for a handle. Linux has equivalent mechanisms but no single unified handle type: each kind of system resource is identified by its own type and manipulated through its own interface.
3. http://tech.ddvip.com/2009-06/1244006580122204_11.html
At the OS level there is a concept for file operations similar to C's FILE: in Linux it is called a file descriptor, and in Windows a handle (below, 'handle' refers to both where unambiguous). The user opens a file via some function and obtains a handle, and from then on operates on the file through that handle.
The reason for designing handles this way is that they prevent users from arbitrarily reading or writing the kernel's file objects. In both Linux and Windows a file handle is always associated with a kernel file object, but how they are associated is invisible to the user. The kernel can compute the address of the kernel file object from the handle, but this ability is not exposed to the user.
A concrete example: in Linux, the fds 0, 1, and 2 stand for standard input, standard output, and standard error. fds obtained by opening files in a program start from 3. What exactly is an fd? In the kernel, every process has a private "open file table", an array of pointers, each element pointing to a kernel open-file object. The fd is simply an index into this table. When a user opens a file, the kernel creates an open-file object internally, finds a free slot in the table, points that slot at the object, and returns the slot's index as the fd. Because the table lives in the kernel and the user cannot access it, even a user who holds an fd cannot obtain the address of the open-file object, and can only operate on it through system-provided functions.
In C, files are manipulated through the FILE structure; unsurprisingly, each FILE structure maps one-to-one to an fd and records the fd it corresponds to.
Handle: http://zh.wikipedia.org/wiki/%E5%8F%A5%E6%9F%84
In programming, a handle is a special kind of smart pointer. An application uses a handle when it needs to reference a block of memory or an object managed by another system (such as a database or the operating system).
The difference between a handle and an ordinary pointer is that a pointer contains the memory address of the referenced object, while a handle is a reference identifier managed by the system, which the system can remap to a new memory address. This indirect way of accessing objects strengthens the system's control over the referenced objects (see encapsulation).
In the memory management of 1980s operating systems (such as Mac OS and Windows), handles were used widely. Unix file descriptors are essentially handles too. Like other desktop environments, the Windows API uses handles extensively to identify system objects and to establish communication channels between the OS and user space; for example, a window on the desktop is identified by a handle of type HWND.
Today, larger memories and virtual-memory algorithms have made plain pointers more attractive, and pointer-to-pointer style handles have fallen out of favor. Even so, many operating systems still call "handles" both pointers to private objects and the internal array indices that a process hands out to its clients.
What, How, Why: keep finding points of growth in the details.