Change Dir

先知cd: loving life is the beginning of all art

Some tips on feature preprocessing in Weka

First, here are two links that contain the full original material:
http://weka.wikispaces.com/Text+categorization+with+Weka
http://weka.wikispaces.com/ARFF+files+from+Text+Collections

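Weka can read data in from a directory.
Here are a few things to watch out for when using Weka to process text features.
One caveat up front: reading data in from a directory is not available in the Weka GUI.
First, lay out the data to be processed in a directory structure like this: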

...
|
+- text_example
|
+- class1
|  |
|  + file1.txt
|  |
|  + file2.txt
|  |
|  ...
|
+- class2
|  |
|  + another_file1.txt
|  |
|  + another_file2.txt
|  |
|  ...

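Then, from the Weka source package directory (so that the Weka classes are on the classpath), run this on the command line:
java weka.core.converters.TextDirectoryLoader -dir text_example > text_example.arff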
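Here text_example is the directory that holds the data, and text_example.arff is the ARFF file that gets generated. One more thing worth adding: in the resulting ARFF, the text content is stored as a single string attribute. In other words, the generated ARFF has exactly one feature plus one class label, and the class label is the name of the classX sub-directory directly under text_example. To make this easier to work with, Weka provides an unsupervised attribute filter, StringToWordVector (it lives in weka.filters.unsupervised.attribute), that splits the text into words (for English this just means splitting into tokens, not real word segmentation). It can also apply TF/IDF weighting.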
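To make the idea of turning a string attribute into a word vector concrete, here is a minimal plain-Java sketch. It is not Weka's implementation: the class name SimpleWordVector and its methods are hypothetical, and it assumes tokens are lower-cased runs of letters and digits and that the vector holds raw word counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

/** Hypothetical helper (not part of Weka): a toy bag-of-words vectoriser. */
public class SimpleWordVector {

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Builds an alphabetically sorted vocabulary over all documents. */
    public static List<String> vocabulary(List<String> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String d : docs) {
            vocab.addAll(tokenize(d));
        }
        return new ArrayList<>(vocab);
    }

    /** Counts how often each vocabulary word occurs in one document. */
    public static int[] toCounts(String doc, List<String> vocab) {
        int[] counts = new int[vocab.size()];
        for (String t : tokenize(doc)) {
            int i = vocab.indexOf(t);
            if (i >= 0) {
                counts[i]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "The dog sat on the mat");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (String d : docs) {
            System.out.println(Arrays.toString(toCounts(d, vocab)));
        }
    }
}
```

Each document becomes one row of numbers, one column per vocabulary word, which is the shape a classifier can actually consume.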
The following simple code performs a classification:

import weka.core.*;
import weka.core.converters.*;
import weka.classifiers.trees.*;
import weka.filters.*;
import weka.filters.unsupervised.attribute.*;

import java.io.*;

/**
 * Example class that converts text files stored in a directory structure into
 * an ARFF file using the TextDirectoryLoader converter. It then applies the
 * StringToWordVector to the data and feeds a J48 classifier with it.
 *
 * @author FracPete (fracpete at waikato dot ac dot nz)
 */
public class TextCategorizationTest {

    /**
     * Expects the first parameter to point to the directory with the text files
     * (falls back to ./text_example if no parameter is given).
     * In that directory, each sub-directory represents a class and the text
     * files in these sub-directories will be labeled as such.
     *
     * @param args the commandline arguments
     * @throws Exception if something goes wrong
     */
    public static void main(String[] args) throws Exception {
        // use the directory given on the command line, or ./text_example by default
        String dir = args.length > 0 ? args[0] : "./text_example";

        // convert the directory into a dataset
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File(dir));
        Instances dataRaw = loader.getDataSet();
        System.out.println("\n\nNumber of classes in imported data:\n\n" + dataRaw.numClasses());

        // apply the StringToWordVector
        // (see the source code of setOptions(String[]) method of the filter
        // if you want to know which command-line option corresponds to which
        // bean property)
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(dataRaw);
        Instances dataFiltered = Filter.useFilter(dataRaw, filter);
        System.out.println("\n\nFiltered data:\n\n" + dataFiltered);

        // train J48 and output model
        J48 classifier = new J48();
        classifier.buildClassifier(dataFiltered);
        System.out.println("\n\nClassifier model:\n\n" + classifier);
    }
}

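Finally, I still recommend writing your own programs for data modeling and data generation. Data preparation can usually only be controlled precisely with your own code; at most, Weka helps us with selection and classification.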
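As an illustration of doing the data preparation yourself, here is a minimal plain-Java sketch that walks the class1/class2 directory layout shown earlier and produces ARFF text without Weka. The class name DirToArff and the attribute names text and class are my own assumptions (Weka's loader may name them differently), every file is assumed to be UTF-8, and the backslash escaping is a simple scheme intended to match what an ARFF reader expects.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of Weka): one sub-directory per class, one file per instance. */
public class DirToArff {

    /** Wraps a value in single quotes, escaping backslashes, quotes and line breaks. */
    public static String quote(String s) {
        return "'" + s.replace("\\", "\\\\").replace("'", "\\'")
                .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t") + "'";
    }

    /** Returns the sorted entries of dir, keeping only sub-directories or only non-directories. */
    static List<Path> list(Path dir, boolean directories) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isDirectory(p) == directories) {
                    result.add(p);
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    /** Builds the ARFF text: one string attribute plus a nominal class attribute. */
    public static String toArff(Path root) throws IOException {
        List<Path> classDirs = list(root, true);
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(quote(root.getFileName().toString())).append("\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute class {");
        for (int i = 0; i < classDirs.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(quote(classDirs.get(i).getFileName().toString()));
        }
        sb.append("}\n\n@data\n");
        for (Path classDir : classDirs) {
            String label = quote(classDir.getFileName().toString());
            for (Path file : list(classDir, false)) {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                sb.append(quote(text)).append(',').append(label).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : "./text_example");
        System.out.print(toArff(root));
    }
}
```

Owning this step means you decide the encoding, the cleaning and the labels yourself, and can insert your own tokenisation or filtering before the text ever reaches a classifier.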
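One more point. Many friends have asked how to do text classification. Well, if you can't be bothered to read the papers, here is the basic idea: whatever the classification task, the classifiers are basically interchangeable (note the word "basically"). What matters is how you build the model and how you generate the features. As for the features used in text classification, there is TF*IDF, and there are others such as mutual information, the chi-square statistic and expected cross entropy. The formulas are all out there, and computing them really is not hard. Of all the classification problems I have worked on, text classification is probably the one where the features are easiest to compute.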
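To back up the claim that these scores are easy to compute, here is a minimal plain-Java sketch of two of them. The class and method names are hypothetical. It assumes one common TF*IDF variant (raw term count times ln(N / df)) and the usual 2x2 contingency-table form of the chi-square statistic for a term and a class; other variants of both formulas exist.

```java
/** Hypothetical helper: two common feature scores for text classification. */
public class TextFeatureScores {

    /**
     * TF*IDF with raw term frequency and idf = ln(numDocs / docsWithTerm).
     * Returns 0 when the term does not occur at all.
     */
    public static double tfIdf(int termCountInDoc, int numDocs, int docsWithTerm) {
        if (termCountInDoc == 0 || docsWithTerm == 0) {
            return 0.0;
        }
        return termCountInDoc * Math.log((double) numDocs / docsWithTerm);
    }

    /**
     * Chi-square statistic for one term and one class, from document counts:
     * a = in class, contains term;     b = not in class, contains term;
     * c = in class, lacks term;        d = not in class, lacks term.
     * chi2 = N * (a*d - c*b)^2 / ((a+c) * (b+d) * (a+b) * (c+d))
     */
    public static double chiSquare(long a, long b, long c, long d) {
        long n = a + b + c + d;
        double denom = (double) (a + c) * (b + d) * (a + b) * (c + d);
        if (denom == 0.0) {
            return 0.0; // degenerate table: the term or the class never varies
        }
        double diff = (double) a * d - (double) c * b;
        return n * diff * diff / denom;
    }

    public static void main(String[] args) {
        // a term occurring twice in a document and present in 10 of 100 documents
        System.out.println(tfIdf(2, 100, 10));
        // a term that perfectly separates two classes of 10 documents each
        System.out.println(chiSquare(10, 0, 0, 10));
    }
}
```

A term spread evenly across both classes, for example chiSquare(5, 5, 5, 5), scores 0, which is why this statistic is useful for ranking terms before feature selection.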

posted on 2012-04-24 16:09 by changedi, category: Machine Learning
