There are plenty of introductions to Lucene tokenization online, but most come without comments and are hard to follow, so I read through some of the source code myself and added notes.
Any custom analyzer must extend Analyzer, whose source looks like this:
package org.apache.lucene.analysis;

import java.io.Reader;

public abstract class Analyzer {
    public abstract TokenStream tokenStream(String fieldName, Reader reader);

    public int getPositionIncrementGap(String fieldName) {
        return 0;
    }
}
The abstract tokenStream method is what we must implement. It returns a TokenStream, which is itself an abstract class; its source:
package org.apache.lucene.analysis;

import java.io.IOException;

public abstract class TokenStream {
    public abstract Token next() throws IOException;

    public void close() throws IOException {
        /* empty */
    }
}
So tokenStream should return an instance of a concrete subclass of this abstract class. In that subclass, the abstract next() method must be implemented; it returns a Token, which is itself a class whose source is:
package org.apache.lucene.analysis;

public final class Token {
    String termText;
    int startOffset;
    int endOffset;
    String type = "word";
    private int positionIncrement = 1;

    public Token(String text, int start, int end) {
        termText = text;
        startOffset = start;
        endOffset = end;
    }

    public Token(String text, int start, int end, String typ) {
        termText = text;
        startOffset = start;
        endOffset = end;
        type = typ;
    }

    // ... (other methods omitted)

    public final String toString() {
        StringBuffer sb = new StringBuffer();
        sb.append("(" + termText + "," + startOffset + "," + endOffset);
        if (!type.equals("word"))
            sb.append(",type=" + type);
        if (positionIncrement != 1)
            sb.append(",posIncr=" + positionIncrement);
        sb.append(")");
        return sb.toString();
    }
}
Its four basic fields determine how a Token prints. Token format: (word, startOffset, endOffset, type).
So our next() method has to produce Tokens of exactly this form.
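As a side note, the printed form can be reproduced with a tiny standalone sketch. TokenDemo and its format helper below are my own illustrative names, not Lucene code; the body just mirrors the toString() logic shown above:

```java
// Standalone sketch of the Token output format described above;
// the real org.apache.lucene.analysis.Token lives in the Lucene jar.
public class TokenDemo {
    static String format(String termText, int start, int end, String type) {
        StringBuffer sb = new StringBuffer();
        sb.append("(" + termText + "," + start + "," + end);
        if (!type.equals("word"))
            sb.append(",type=" + type);
        sb.append(")");
        return sb.toString();
    }

    public static void main(String[] args) {
        // A "double"-typed (CJK) token covering offsets 0..2
        System.out.println(format("中国", 0, 2, "double"));  // (中国,0,2,type=double)
        // A default "word" token: the ",type=" suffix is omitted
        System.out.println(format("hello", 3, 8, "word"));   // (hello,3,8)
    }
}
```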
Enough analysis; let's look at a real example.
First, a class that extends Analyzer:
public class ChineseAnalyzer extends Analyzer {
    public final static String[] STOP_WORDS = {"的", "和"};
    private Set stopTable;

    public ChineseAnalyzer() {
        stopTable = StopFilter.makeStopSet(STOP_WORDS);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(new ChineseTokenizer(reader), stopTable);
    }
}
StopFilter is declared as StopFilter extends TokenFilter, and TokenFilter extends TokenStream,
so a StopFilter is itself a TokenStream.
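The idea behind StopFilter can be sketched without Lucene: wrap a source of tokens and skip any whose text is in the stop set. The names below (StopFilterDemo, filter) are illustrative only, and the real StopFilter operates on a TokenStream rather than a List:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified illustration of what StopFilter does: pass tokens through,
// dropping any whose text appears in the stop set.
public class StopFilterDemo {
    static List<String> filter(List<String> tokens, Set<String> stopSet) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (!stopSet.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("的", "和"));
        List<String> tokens = Arrays.asList("中国", "的", "历史", "和", "文化");
        System.out.println(filter(tokens, stop)); // [中国, 历史, 文化]
    }
}
```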
The key piece is ChineseTokenizer(reader).
It too is a TokenStream: ChineseTokenizer extends Tokenizer, and Tokenizer extends TokenStream, so it must also override the next() method.
The tokenizer uses forward maximum matching, which relies on a dictionary.
The dictionary is loaded into a TreeMap.
TreeMap implements the Map interface using a tree: it stores key/value pairs in sorted order while still allowing fast retrieval, and unlike a hash map, a tree map guarantees that its entries are kept in ascending key order.
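A quick demonstration of that ordering guarantee (TreeMapDemo is just an illustrative name):

```java
import java.util.Arrays;
import java.util.TreeMap;

// TreeMap keeps its keys in ascending order, regardless of insertion order.
public class TreeMapDemo {
    static String[] sortedKeys() {
        TreeMap<String, String> map = new TreeMap<String, String>();
        map.put("banana", "2");
        map.put("apple", "1");
        map.put("cherry", "3");
        // keySet() iterates in sorted key order
        return map.keySet().toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortedKeys())); // [apple, banana, cherry]
    }
}
```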
The dictionary-loading code:
public void loadWords() {
    if (dictionary == null) { // avoid reloading: keep the loaded dictionary in a shared field
        dictionary = new TreeMap<String, String>();
        InputStream is = null;
        InputStreamReader isr = null;
        BufferedReader br = null;
        try {
            is = new FileInputStream("c:/dictionary.txt"); // dictionary file path
            isr = new InputStreamReader(is, "UTF-8");
            br = new BufferedReader(isr);
            String word = null;
            while ((word = br.readLine()) != null) {
                int wordLength = word.length();
                // lines containing "#" are comments in the dictionary file
                if ((word.indexOf("#") == -1)
                        && (wordLength <= WORD_MAX_LENGTH)) {
                    dictionary.put(word.intern(), "1"); // "1" marks a complete word
                    int i = wordLength - 1;
                    while (i >= 2) {
                        // "2" marks a prefix (length >= 2) of some word
                        String temp = word.substring(0, i).intern();
                        if (!dictionary.containsKey(temp)) {
                            dictionary.put(temp, "2");
                        }
                        i--;
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) {
                    br.close();
                }
                if (isr != null) {
                    isr.close();
                }
                if (is != null) {
                    is.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    System.out.println(dictionary.size());
}
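The inner while loop stores every prefix of length >= 2 of each word under the marker "2", while the complete word itself is stored under "1". A standalone sketch of that same loop, using an in-memory word list instead of c:/dictionary.txt (PrefixDictDemo is an illustrative name):

```java
import java.util.TreeMap;

// Demonstrates the prefix-marking scheme used in loadWords():
// full dictionary words map to "1", proper prefixes (length >= 2) to "2".
public class PrefixDictDemo {
    static TreeMap<String, String> build(String[] words) {
        TreeMap<String, String> dict = new TreeMap<String, String>();
        for (String word : words) {
            dict.put(word, "1"); // the complete word
            for (int i = word.length() - 1; i >= 2; i--) {
                String prefix = word.substring(0, i);
                if (!dict.containsKey(prefix)) {
                    dict.put(prefix, "2"); // a prefix only
                }
            }
        }
        return dict;
    }

    public static void main(String[] args) {
        TreeMap<String, String> dict = build(new String[] {"中华人民"});
        // "中华人民" -> "1"; "中华人" and "中华" -> "2"; "中" is too short to store
        System.out.println(dict);
    }
}
```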
And here is the next() method itself; it returns Tokens in the format (word, startOffset, endOffset, type):
public Token next() throws IOException {
    System.out.println("load dictionary");
    // load the dictionary (cached after the first call)
    loadWords();
    System.out.println("load dictionary over");
    StringBuffer word = new StringBuffer();
    while (true) {
        char c;                                      // current character
        char nextChar;                               // the following character
        Character.UnicodeBlock cUnicodeBlock;        // Unicode block of the current character
        Character.UnicodeBlock nextCharUnicodeBlock; // Unicode block of the following character
        offset++;                                    // overall offset
        if (bufferIndex >= dataLength) { // buffer exhausted: refill from input, reset the index
            dataLength = input.read(ioBuffer);
            bufferIndex = 0;
        }
        if (dataLength == -1) { // end of input: emit what we have, or null if nothing
            if (word.length() == 0) {
                return null;
            } else {
                break;
            }
        }
        c = ioBuffer[bufferIndex++]; // take the current character
        cUnicodeBlock = Character.UnicodeBlock.of(c); // its Unicode block
        nextChar = ioBuffer[bufferIndex]; // peek at the next character
        nextCharUnicodeBlock = Character.UnicodeBlock.of(nextChar);
        // do the two characters belong to the same Unicode block?
        boolean isSameUnicodeBlock = cUnicodeBlock.toString()
                .equalsIgnoreCase(nextCharUnicodeBlock.toString());
        // the current character is a CJK ideograph
        if (cUnicodeBlock == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
            tokenType = "double"; // double-byte token
            if (word.length() == 0) {
                word.append(c);
                // enhancement -- start: the next character is in a different
                // Unicode block, so end the token here (arguably redundant)
                if (word.length() != 0 && (!isSameUnicodeBlock)) {
                    break;
                }
                // enhancement -- end
            } else {
                // append c only if the extended string is still in the dictionary
                String temp = (word.toString() + c).intern();
                if (dictionary.containsKey(temp)) {
                    word.append(c);
                    // enhancement -- start
                    if (word.length() != 0 && (!isSameUnicodeBlock)) {
                        break;
                    }
                    // enhancement -- end
                } else {
                    // no match: push the character back and emit the token
                    bufferIndex--;
                    offset--;
                    break;
                }
            }
        } else if (cUnicodeBlock == Character.UnicodeBlock.BASIC_LATIN) {
            tokenType = "single"; // single-byte token
            if (Character.isWhitespace(c)) {
                if (word.length() != 0)
                    break;
            } else {
                word.append(c);
                // enhancement -- start
                if (word.length() != 0 && (!isSameUnicodeBlock)) {
                    break;
                }
                // enhancement -- end
            }
        }
        System.out.println("word=" + word);
    }
    // build the Token to return
    Token token = new Token(word.toString(), offset - word.length(),
            offset, tokenType);
    word.setLength(0); // clear word for the next call
    System.out.println(token);
    return token;
}
The while loop is the heart of the whole tokenizer. (END)
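For reference, the matching strategy inside that loop can be sketched independently of Lucene. The simplified segmenter below (FmmDemo is an illustrative name) greedily extends the current word one character at a time while the extended string is still a dictionary key; the "2" prefix entries are exactly what allow a match to grow into a longer complete word:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified forward-maximum-matching segmenter, mirroring the
// containsKey(word + c) check in the next() loop above.
public class FmmDemo {
    static List<String> segment(String text, Map<String, String> dict) {
        List<String> result = new ArrayList<String>();
        int pos = 0;
        while (pos < text.length()) {
            StringBuilder word = new StringBuilder();
            word.append(text.charAt(pos)); // always start with one character
            int i = pos + 1;
            // extend while the longer string is still a dictionary key
            while (i < text.length()
                    && dict.containsKey(word.toString() + text.charAt(i))) {
                word.append(text.charAt(i));
                i++;
            }
            result.add(word.toString());
            pos = i;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<String, String>();
        dict.put("中国人", "1"); // a complete word
        dict.put("中国", "2");   // prefix entry: lets the match grow to 中国人
        System.out.println(segment("中国人民", dict)); // [中国人, 民]
    }
}
```

Without the "中国" prefix entry, the match could never step from "中" to "中国人", which is why loadWords() stores all prefixes.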