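There are many introductions to Lucene tokenization online, but they come without comments and are hard to follow. I read some of the source code myself and added a few notes.

Every custom analyzer must extend Analyzer, whose source code looks like this: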

package org.apache.lucene.analysis;

import java.io.Reader;

public abstract class Analyzer
{
    public abstract TokenStream tokenStream(String string, Reader reader);

    public int getPositionIncrementGap(String fieldName) {
       return 0;
    }
}

The abstract method tokenStream() must be implemented. Its return type is TokenStream, which is itself an abstract class. Here is its source:

package org.apache.lucene.analysis;

import java.io.IOException;

public abstract class TokenStream
{
    public abstract Token next() throws IOException;

    public void close() throws IOException {
       /* empty */
    }
}

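So what tokenStream() returns should be an instance of a concrete subclass of this abstract class. The abstract class has one abstract method, next(), that must be implemented, and it returns a Token. Token is another class, with this source: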

package org.apache.lucene.analysis;

public final class Token
{
    String termText;
    int startOffset;
    int endOffset;
    String type = "word";
    private int positionIncrement = 1;

    public Token(String text, int start, int end) {
       termText = text;
       startOffset = start;
       endOffset = end;
    }

    public Token(String text, int start, int end, String typ) {
       termText = text;
       startOffset = start;
       endOffset = end;
       type = typ;
    }

    // ...

    public final String toString() {
       StringBuffer sb = new StringBuffer();
       sb.append("(" + termText + "," + startOffset + "," + endOffset);
       if (!type.equals("word"))
           sb.append(",type=" + type);
       if (positionIncrement != 1)
           sb.append(",posIncr=" + positionIncrement);
       sb.append(")");
       return sb.toString();
    }
}

Four basic parameters define a Token. Its format is: (word, start, end, type)

So the job of next() is to produce Tokens of this form.
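To make the next() contract concrete, here is a minimal sketch that does not depend on Lucene. SimpleToken and WhitespaceStream are hypothetical stand-ins for Token and TokenStream, written only for illustration: next() hands back one (word, start, end, type) token per call and returns null once the input is exhausted.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenStreamSketch {
    // Hypothetical stand-in for Lucene's Token: (word, start, end, type).
    static final class SimpleToken {
        final String text;
        final int start;
        final int end;
        final String type;

        SimpleToken(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return "(" + text + "," + start + "," + end + "," + type + ")";
        }
    }

    // Hypothetical stand-in for a TokenStream that splits on whitespace.
    static final class WhitespaceStream {
        private final String input;
        private int pos = 0;

        WhitespaceStream(String input) {
            this.input = input;
        }

        // Returns the next token, or null when the input is exhausted.
        SimpleToken next() {
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++; // skip separators
            }
            if (pos >= input.length()) {
                return null;
            }
            int start = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++; // consume one word
            }
            return new SimpleToken(input.substring(start, pos), start, pos, "word");
        }
    }

    // Pulls tokens until next() returns null, the way a consumer of a stream would.
    static List<String> drain(String input) {
        WhitespaceStream stream = new WhitespaceStream(input);
        List<String> out = new ArrayList<>();
        for (SimpleToken t = stream.next(); t != null; t = stream.next()) {
            out.add(t.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : drain("hello lucene")) {
            System.out.println(s);
        }
    }
}
```

The end offset is exclusive here, so the token text is always input.substring(start, end).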

 

That is enough analysis. Now for something concrete:

First, we need a class that extends Analyzer:

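import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public class ChineseAnalyzer extends Analyzer {
       // Stop words to filter out (empty placeholders here)
       public final static String[] STOP_WORDS = {"",""};

       private Set stopTable;

       public ChineseAnalyzer() {
              stopTable = StopFilter.makeStopSet(STOP_WORDS);
       }

       public TokenStream tokenStream(String fieldName, Reader reader) {
              // Tokenize with ChineseTokenizer, then drop the stop words
              return new StopFilter(new ChineseTokenizer(reader), stopTable);
       }
}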

 

 

 

StopFilter extends TokenFilter, and TokenFilter extends TokenStream,

so StopFilter is also a TokenStream.

 

The most important part is ChineseTokenizer(reader).

It is also a TokenStream: ChineseTokenizer extends Tokenizer, and Tokenizer extends TokenStream, so it too has to implement the next() method.

 

The tokenizer uses forward maximum matching, which requires a dictionary.
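As a rough illustration of forward maximum matching, the sketch below uses the textbook formulation: at each position it tries the longest candidate first and shrinks it until a dictionary word is found, falling back to a single character. This is not the tokenizer from this post, which instead extends a candidate one character at a time using a prefix dictionary. The class name, the sample words, and the maxLen parameter are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ForwardMaxMatchSketch {
    // Splits text greedily from left to right, preferring the longest dictionary word.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Start with the longest candidate that fits in the remaining text.
            int end = Math.min(text.length(), pos + maxLen);
            // Shrink until the candidate is a word or only one character is left.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            out.add(text.substring(pos, end));
            pos = end;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("sea", "search", "engine"));
        System.out.println(segment("searchengine", dict, 6));
        System.out.println(segment("seaxengine", dict, 6));
    }
}
```

Because the longest candidate wins, "search" is preferred over "sea" whenever both match at the same position.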

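The dictionary is loaded into a TreeMap.

TreeMap implements the Map interface with a tree. It is an efficient way to store key/value pairs in sorted order while still allowing fast lookup. Unlike a hash map, a tree map guarantees that its entries are kept in ascending key order.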
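A small sketch of the ordering guarantee, using only java.util.TreeMap. The keys and the "1" values are arbitrary sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TreeMapOrderSketch {
    // Inserts keys in arbitrary order and returns them as the map iterates them.
    static List<String> sortedKeys() {
        TreeMap<String, String> map = new TreeMap<>();
        map.put("pear", "1");
        map.put("apple", "1");
        map.put("mango", "1");
        // keySet() iterates in ascending key order, regardless of insertion order.
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        System.out.println(sortedKeys()); // prints [apple, mango, pear]
    }
}
```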

The dictionary loading code:

public void loadWords() {
    // Load only once: the loaded dictionary is kept in a shared field.
    if (dictionary == null) {
        dictionary = new TreeMap<String, String>();

        InputStream is = null;
        InputStreamReader isr = null;
        BufferedReader br = null;
        try {
            is = new FileInputStream("c:/dictionary.txt"); // path of the dictionary file
            isr = new InputStreamReader(is, "UTF-8");
            br = new BufferedReader(isr);
            String word = null;
            while ((word = br.readLine()) != null) {
                int wordLength = word.length();
                // Lines containing '#' are skipped, so a line can be
                // commented out by putting '#' in front of it.
                if ((word.indexOf("#") == -1)
                        && (wordLength <= WORD_MAX_LENGTH)) {
                    // "1" marks a complete dictionary word.
                    dictionary.put(word.intern(), "1");
                    // Register every prefix of length >= 2 as well, marked "2",
                    // unless it is already in the dictionary.
                    int i = wordLength - 1;
                    while (i >= 2) {
                        String temp = word.substring(0, i).intern();
                        if (!dictionary.containsKey(temp)) {
                            dictionary.put(temp, "2");
                        }
                        i--;
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) {
                    br.close();
                }
                if (isr != null) {
                    isr.close();
                }
                if (is != null) {
                    is.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    System.out.println(dictionary.size());
}
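To see what the loading loop leaves in the map, here is a self-contained sketch that applies the same marking scheme to an in-memory word list instead of a file. The class name, the build() helper, and the sample words are assumptions made for the example. Reading "1" as "complete word" and "2" as "prefix only" is my interpretation of the code above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class PrefixDictionarySketch {
    // Builds a dictionary where full words map to "1" and their prefixes
    // (length >= 2) map to "2" unless already present.
    static TreeMap<String, String> build(List<String> lines, int maxLen) {
        TreeMap<String, String> dict = new TreeMap<>();
        for (String word : lines) {
            // Skip comment lines (anything containing '#') and over-long entries.
            if (word.contains("#") || word.length() > maxLen) {
                continue;
            }
            dict.put(word, "1");
            for (int i = word.length() - 1; i >= 2; i--) {
                String prefix = word.substring(0, i);
                if (!dict.containsKey(prefix)) {
                    dict.put(prefix, "2");
                }
            }
        }
        return dict;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("# sample dictionary", "search", "sea");
        System.out.println(build(lines, 10));
    }
}
```

With these sample lines, "sea" is first registered as a prefix of "search" and then overwritten with "1" when it is read as a word in its own right, while "se", "sear" and "searc" stay marked "2".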

 

Next comes the next() method:

// next() returns a Token in the form (word, start, end, type)
public Token next() throws IOException {
    System.out.println("load dictionary");
    // Load the dictionary (a no-op once it has been loaded)
    loadWords();
    System.out.println("load dictionary over");
    StringBuffer word = new StringBuffer();

    while (true) {
        char c;        // current character
        char nextChar; // following character
        Character.UnicodeBlock cUnicodeBlock;        // Unicode block of the current character
        Character.UnicodeBlock nextCharUnicodeBlock; // Unicode block of the following character

        offset++; // position in the input

        if (bufferIndex >= dataLength) { // buffer used up (or first call): read from input and reset the index
            dataLength = input.read(ioBuffer);
            bufferIndex = 0;
        }

        if (dataLength == -1) { // end of input
            if (word.length() == 0) {
                return null;
            } else {
                break;
            }
        }

        c = ioBuffer[bufferIndex++];                  // current character
        cUnicodeBlock = Character.UnicodeBlock.of(c); // and its Unicode block

        // Peek at the following character. At the end of the buffered data
        // there is nothing to peek at, so reuse c instead of reading past it.
        nextChar = (bufferIndex < dataLength) ? ioBuffer[bufferIndex] : c;
        nextCharUnicodeBlock = Character.UnicodeBlock.of(nextChar);
        // Do the two characters belong to the same Unicode block?
        boolean isSameUnicodeBlock = (cUnicodeBlock == nextCharUnicodeBlock);

        // The current character is a CJK ideograph
        if (cUnicodeBlock == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
            tokenType = "double"; // double-byte
            if (word.length() == 0) {
                word.append(c);
                // Enhancement: stop when the following character is in a
                // different Unicode block (the length check is redundant)
                if (word.length() != 0 && (!isSameUnicodeBlock)) {
                    break;
                }
            } else {
                // If the word so far plus this character is in the dictionary,
                // extend the word with it
                String temp = (word.toString() + c).intern();
                if (dictionary.containsKey(temp)) {
                    word.append(c);
                    // Enhancement: stop at a Unicode block boundary
                    if (word.length() != 0 && (!isSameUnicodeBlock)) {
                        break;
                    }
                } else {
                    // Not in the dictionary: push the character back and
                    // emit the word collected so far
                    bufferIndex--;
                    offset--;
                    break;
                }
            }
        } else if (cUnicodeBlock == Character.UnicodeBlock.BASIC_LATIN) {
            tokenType = "single"; // single-byte
            if (Character.isWhitespace(c)) {
                // Whitespace ends the current word, if there is one
                if (word.length() != 0)
                    break;
            } else {
                word.append(c);
                // Enhancement: stop at a Unicode block boundary
                if (word.length() != 0 && (!isSameUnicodeBlock)) {
                    break;
                }
            }
        }
        System.out.println("word=" + word);
    }

    // Build the token to return
    Token token = new Token(word.toString(), offset - word.length(),
            offset, tokenType);
    // Clear the word buffer
    word.setLength(0);
    System.out.println(token);
    return token;
}

 

The while loop as a whole is the heart of the tokenizer.