
Using decision trees for classification has some advantages: not much domain knowledge is needed, and both learning and classification are fairly fast.

 

The decision tree algorithm takes three inputs: the tuples to be classified, the attribute list of those tuples, and an attribute selection method (attribute_selection_method). Studying decision trees is mostly about studying the attribute selection methods.

 

Data Mining Concepts and Techniques explains the decision tree construction process very clearly; the book can be found in an earlier post: http://www.blogjava.net/vcycyv/archive/2011/09/05/357967.html

create a node N;
if the tuples in D are all of the same class C then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;  // majority voting
apply Attribute_selection_method(D, attribute_list) to find the "best" splitting criterion;
label node N with the splitting criterion;
if the splitting attribute is discrete-valued and multiway splits are allowed then  // not restricted to binary trees
    attribute_list = attribute_list - splitting attribute;  // remove the splitting attribute
for each outcome j of the splitting criterion  // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j;  // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
endfor
return N;
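
As a reading aid, here is a minimal Python sketch of the same recursion. It assumes each tuple is a dict, a class attribute named target, and a pluggable select_attribute callable standing in for Attribute_selection_method; the names and data layout are mine, not the book's.

from collections import Counter

def majority_class(D, target):
    """Most common class label among the tuples in D."""
    return Counter(t[target] for t in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, target, select_attribute):
    """Recursive builder mirroring the pseudocode above."""
    classes = {t[target] for t in D}
    if len(classes) == 1:                              # all tuples in the same class
        return {"leaf": classes.pop()}
    if not attribute_list:                             # attribute list empty: majority voting
        return {"leaf": majority_class(D, target)}
    A = select_attribute(D, attribute_list, target)    # "best" splitting attribute
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]  # discrete-valued, multiway split
    for value in sorted({t[A] for t in D}):            # one branch per outcome of A
        Dj = [t for t in D if t[A] == value]           # the partition for this outcome
        node["branches"][value] = generate_decision_tree(Dj, remaining, target, select_attribute)
    return node

The pseudocode's empty-partition case is dropped here because the outcomes are read off D itself, so every Dj is non-empty.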

Whether the generated tree is binary depends mainly on the attribute selection method (attribute_selection_method): for example, a tree built with the Gini index is binary, while information gain imposes no such restriction.

About the algorithms:

Information gain:

The information needed to classify a tuple in a set of tuples D can be computed as:

Info(D) = -Σ_{i=1..m} p_i · log2(p_i)

where p_i is the probability that an arbitrary tuple belongs to class Ci.

If an attribute A is used to split D, the information still needed to classify a tuple after D has been partitioned by A is the following (don't worry if this is hard to read; there is an example below):

Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) · Info(Dj)

The information gain is then:

Gain(A) = Info(D) - Info_A(D)
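
A small sketch of these three quantities in Python, using the same dict-per-tuple layout as the builder above (function names are mine):

import math
from collections import Counter

def info(D, target):
    """Info(D): expected information (entropy) needed to classify a tuple in D."""
    total = len(D)
    counts = Counter(t[target] for t in D)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_A(D, A, target):
    """Info_A(D): expected information after partitioning D on attribute A."""
    total = len(D)
    return sum((n / total) * info([t for t in D if t[A] == v], target)
               for v, n in Counter(t[A] for t in D).items())

def gain(D, A, target):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(D, target) - info_A(D, A, target)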

Example:

[Table: the AllElectronics training set from the book, with attributes age, income, student and credit_rating and the class attribute buys_computer; 14 tuples in total.]

The class label attribute, buys computer, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (that is, m = 2). Let class C1 correspond to yes and class C2 correspond to no. There are nine tuples of class yes and five tuples of class no. A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must compute the information gain of each attribute. We first compute the expected information needed to classify a tuple in D:

Info(D) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940 bits

Next, we need to compute the expected information requirement for each attribute. Let’s start with the attribute age. We need to look at the distribution of yes and no tuples for each category of age. For the age category youth, there are two yes tuples and three no tuples. For the category middle aged, there are four yes tuples and zero no tuples. For the category senior, there are three yes tuples and two no tuples.

Info_age(D) = (5/14)·(-(2/5)·log2(2/5) - (3/5)·log2(3/5))
            + (4/14)·(-(4/4)·log2(4/4) - (0/4)·log2(0/4))
            + (5/14)·(-(3/5)·log2(3/5) - (2/5)·log2(2/5))
            = 0.694 bits

Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits

Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit rating) = 0.048 bits. Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
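
As a quick check of these numbers with the functions above, the dataset can be reduced to the age and class columns using the per-category counts quoted from the table:

D = ([{"age": "youth", "buys_computer": "yes"}] * 2 +
     [{"age": "youth", "buys_computer": "no"}] * 3 +
     [{"age": "middle_aged", "buys_computer": "yes"}] * 4 +
     [{"age": "senior", "buys_computer": "yes"}] * 3 +
     [{"age": "senior", "buys_computer": "no"}] * 2)

print(f"{info(D, 'buys_computer'):.3f}")           # 0.940
print(f"{info_A(D, 'age', 'buys_computer'):.3f}")  # 0.694
print(f"{gain(D, 'age', 'buys_computer'):.3f}")    # 0.247 (the book's 0.246 comes from rounding before subtracting)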

 

Gain ratio, briefly:

Information gain handles many-valued attributes poorly. For example, if one attribute is product_id, splitting on it puts every tuple into its own partition, so every partition is pure and Info_product_id(D) is 0; the gain from splitting on product_id is therefore the largest, yet such a split is obviously useless. Gain ratio is essentially an adjustment of information gain: it is computed as a ratio rather than a difference. The book walks through a concrete example, which is not repeated here.
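
For reference, the standard definition normalizes the gain by the split information of A; a sketch reusing gain() and the imports from the information-gain block above:

def split_info(D, A):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the partitions induced by A."""
    total = len(D)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(t[A] for t in D).values())

def gain_ratio(D, A, target):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain(D, A, target) / split_info(D, A)

A many-valued attribute such as product_id produces a very large SplitInfo, which is what pulls its ratio back down.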

Gini index:

The Gini index measures the impurity of D. As mentioned above, this method uses binary splits. An example of a binary split:

If income has three possible values, namely {low, medium, high}, then the possible subsets are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}. We exclude the full set, {low, medium, high}, and the empty set from consideration since, conceptually, they do not represent a split.
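
A quick way to enumerate the candidate subsets (only the proper, non-empty subsets are kept):

from itertools import combinations

values = ["low", "medium", "high"]
candidates = [c for r in range(1, len(values)) for c in combinations(values, r)]
print(candidates)
# [('low',), ('medium',), ('high',), ('low', 'medium'), ('low', 'high'), ('medium', 'high')]

Each subset defines one binary split: tuples with income in the subset go to D1 and the rest to D2, so these six subsets correspond to three distinct splits.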

The Gini index of D, the Gini index of a binary split of D on attribute A into D1 and D2, and the resulting reduction in impurity are:

Gini(D) = 1 - Σ_{i=1..m} p_i^2

Gini_A(D) = (|D1|/|D|)·Gini(D1) + (|D2|/|D|)·Gini(D2)

ΔGini(A) = Gini(D) - Gini_A(D)
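
In the same sketch style as above (reusing the Counter import and the dict-per-tuple layout; the subset argument picks which attribute values go to D1):

def gini(D, target):
    """Gini(D) = 1 - sum of squared class probabilities."""
    total = len(D)
    return 1 - sum((c / total) ** 2
                   for c in Counter(t[target] for t in D).values())

def gini_split(D, A, subset, target):
    """Gini_A(D) for the binary split "A in subset" vs. the rest."""
    D1 = [t for t in D if t[A] in subset]
    D2 = [t for t in D if t[A] not in subset]
    total = len(D)
    return (len(D1) / total) * gini(D1, target) + (len(D2) / total) * gini(D2, target)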

Example:

There are nine tuples belonging to the class buys computer = yes and the remaining five tuples belong to the class buys computer = no.

Gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

To find the splitting criterion for the tuples in D, we need to compute the Gini index for each attribute. Let's start with the attribute income and consider each of the possible splitting subsets. Consider the subset {low, medium}. This would result in 10 tuples in partition D1 satisfying the condition "income ∈ {low, medium}." The remaining four tuples of D would be assigned to partition D2. The Gini index value computed based on this partitioning is

Gini_{income ∈ {low, medium}}(D) = (10/14)·Gini(D1) + (4/14)·Gini(D2)
    = (10/14)·(1 - (7/10)^2 - (3/10)^2) + (4/14)·(1 - (2/4)^2 - (2/4)^2)
    = 0.443
    = Gini_{income ∈ {high}}(D)

The other subsets of income can be computed in the same way, and the same procedure then extends to attributes such as age.
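
A quick arithmetic check of the two values above (the 7 yes / 3 no and 2 yes / 2 no class counts inside the two partitions come from the book's table, not from the text of this post):

gini_D  = 1 - (9/14) ** 2 - (5/14) ** 2     # impurity of the whole set
gini_D1 = 1 - (7/10) ** 2 - (3/10) ** 2     # income in {low, medium}
gini_D2 = 1 - (2/4) ** 2 - (2/4) ** 2       # income = high
gini_income = (10/14) * gini_D1 + (4/14) * gini_D2
print(f"{gini_D:.3f}  {gini_income:.3f}")   # 0.459  0.443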

 

About pruning:

To remove noise and outliers, the tree needs to be pruned. There are two approaches: prepruning and postpruning.

Prepruning computes a statistic (information gain, the Gini index, and so on) while the tree is being split; as soon as the value fails to reach a threshold, splitting stops at that node and it becomes a leaf.
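
In the style of the helpers above, the prepruning test is just a threshold check before a split is made (min_gain is an illustrative value, not one from the book):

def should_preprune(D, A, target, min_gain=0.01):
    """Stop growing and make a leaf when the best split is not worth it."""
    return gain(D, A, target) < min_gain

Inside generate_decision_tree this would sit right after the attribute is selected, returning a majority-class leaf instead of splitting.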

Postpruning grows the tree fully first and then removes branches bottom-up. The criterion for removing a branch is cost complexity, which is based mainly on the number of leaves and the error rate: for a node, compare the cost complexity of pruning it with the cost complexity of keeping it, and prune it if the cost complexity after pruning is smaller. Note that cost complexity is computed on a separate data set, neither the training data set nor the validation data set.
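
A rough sketch of that comparison, assuming the dict-shaped tree from the first block (and its majority_class helper), a separate prune_set, and a simple cost of error rate plus a per-leaf penalty; alpha and the exact cost function are illustrative, since the book only says cost complexity depends on the number of leaves and the error rate:

def num_leaves(node):
    """Number of leaves in a subtree in the dict format used above."""
    if "leaf" in node:
        return 1
    return sum(num_leaves(child) for child in node["branches"].values())

def classify(node, t):
    """Send a tuple down the subtree and return the predicted class."""
    while "leaf" not in node:
        node = node["branches"][t[node["attribute"]]]
    return node["leaf"]

def cost_complexity(node, prune_set, target, alpha=0.01):
    """Illustrative cost: error rate on the prune set plus a penalty per leaf."""
    err = sum(classify(node, t) != t[target] for t in prune_set) / len(prune_set)
    return err + alpha * num_leaves(node)

def prune_or_keep(node, prune_set, target, alpha=0.01):
    """Replace the subtree with a majority-class leaf if that lowers the cost."""
    leaf = {"leaf": majority_class(prune_set, target)}
    if cost_complexity(leaf, prune_set, target, alpha) <= cost_complexity(node, prune_set, target, alpha):
        return leaf
    return node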

Postpruning is generally considered more reliable. In practice, prepruning and postpruning can be combined.

 

The book Data Mining Techniques for Marketing, Sales, and Customer Relationship Management raises two points worth noting:

1. The two leaf nodes produced by a split may belong to the same category; only their class probabilities differ, and both pass the threshold.

2. A binary tree can classify a target with multiple categories; conversely, a multiway tree can also classify a binary target.
