Lucene In Action ch 4
Notes (I) -- Analysis
----- 2006-2-12
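This chapter discusses Lucene's analysis process and several Analyzers in detail.
During indexing, the text to be indexed is first analyzed: it is processed and split into tokens, and the index is built from the result. Different Analyzers apply different analysis rules, so choosing the right Analyzer matters when you use Lucene in a program.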
1.Using Analyzers
Before using an Analyzer, let's look at what text looks like after an Analyzer has processed it:
Listing 4.1 Visualizing analyzer effects
Analyzing "The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
Analyzing "XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
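The output above comes from an example program discussed later in these notes. It shows how the different Analyzers treat the same text. For "The quick brown fox jumped over the lazy dogs", WhitespaceAnalyzer and SimpleAnalyzer simply split the text into words and build terms from them (SimpleAnalyzer also lowercases), while the other two Analyzers additionally remove stop words. For "XY&Z Corporation - xyz@example.com", the Analyzers differ in how they treat & and -. With this intuitive picture of analysis in mind, let's look at analysis at the different processing stages.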
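To make the differences in Listing 4.1 concrete, here is a rough plain-Java imitation of three of the analyzers. It is only a sketch and does not use Lucene: the class name AnalyzerEffectsSketch, its methods, and the shortened stop list are assumptions made for illustration, and StandardAnalyzer's grammar is not imitated at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java imitation of three analyzers; this is NOT Lucene code.
final class AnalyzerEffectsSketch {
    // Hypothetical, shortened stop list used only for this illustration.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "is", "not", "the"));

    // Like WhitespaceAnalyzer: split on whitespace only, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Like SimpleAnalyzer: split at every non-letter and lowercase each token.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    // Like StopAnalyzer: the simple() tokens minus the stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "XY&Z Corporation - xyz@example.com";
        System.out.println("whitespace: " + whitespace(text));
        System.out.println("simple:     " + simple(text));
        System.out.println("stop:       " + stop("The quick brown fox"));
    }
}
```

Splitting at every non-letter is what breaks XY&Z into xy and z, which is the behavior seen for SimpleAnalyzer and StopAnalyzer in the listing.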
I. Indexing Analysis
Recall from ch 2 (Indexing) that an index is built with an IndexWriter, and that constructing an IndexWriter requires an Analyzer, as shown below:
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriter writer = new IndexWriter(directory, analyzer, true);
The writer can then be used to index documents, as follows:
    Document doc = new Document();
    doc.add(Field.Text("title", "This is the title"));
    doc.add(Field.UnStored("contents", "...document contents..."));
    writer.addDocument(doc);
This uses the Analyzer that was specified when the IndexWriter was constructed. To give a single document its own Analyzer, use this method instead:
    writer.addDocument(doc, analyzer);
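II. QueryParser Analysis
Analysis is the key to term searching. The terms produced by the Analyzer at search time must match the terms that were indexed; only then does a search return results. When QueryParser parses the search expression entered by the user, you can specify an Analyzer, as shown below:
    Query query = QueryParser.parse(expression, "contents", analyzer);
This goes through QueryParser's static method. If you use a QueryParser instance instead, you supply the Analyzer when constructing it, as follows:
    QueryParser parser = new QueryParser("contents", analyzer);
    query = parser.parse(expression);
QueryParser analyzes individual pieces of the expression, not the expression as a whole, which may include operators, parenthesis, and other special expression syntax to denote range, wildcard, and fuzzy searches.
QueryParser analyzes all text in the same way; it has no knowledge of how the text was indexed. This can cause problems when you search a field that was indexed as a Keyword.
Another question is how to handle text that contains other elements, such as HTML or XML documents: they carry element tags, and those tags are generally not indexed. There is also the question of per-field indexing, for example an HTML document with Header and Body fields that should be searchable separately. An Analyzer cannot solve this problem at present, because each invocation of an Analyzer processes a single field. We discuss this problem further later on.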
2. Analyzing the Analyzer
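To understand in detail how Lucene analyzes text, you need to know how an Analyzer works, so let's look at that. Analyzer is the base class of all the XXXAnalyzer classes. The class is surprisingly simple (much simpler than I expected): it has just one method, tokenStream(String fieldName, Reader reader). Some Analyzer implementations ignore the fieldName parameter, for example SimpleAnalyzer, whose code is: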
public final class SimpleAnalyzer extends Analyzer {
public TokenStream tokenStream(String fieldName, Reader reader) {
return new LowerCaseTokenizer(reader);
}
}
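As you can see, this class is also surprisingly simple; it uses nothing but LowerCaseTokenizer. But what does LowerCaseTokenizer do? The name more or less gives it away: it splits the text at non-letter characters, discarding them, and converts all of the text to lowercase. The TokenStream that is returned is an enumerator-like class from which you obtain successive Tokens; it returns null when the end is reached.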
I. What's in a token?
A stream of tokens is the fundamental output of the analysis process.
I looked for a Chinese explanation of what a Token is but could not find a good one, so I quote the book instead, which is clearer.
During indexing, fields designated for tokenization are processed with the specified analyzer, and each token is written to the index as a term. This distinction between tokens and terms may seem confusing at first. Let's see what forms a Token; we'll come back to how that translates into a term.
For example, let's analyze the text "the quick brown fox". Each token represents an individual word of that text. A token carries with it a text value (the word itself) as well as some meta-data: the start and end offsets in the original text, a token type, and a position increment. Figure 4.1 shows the details of the token stream analyzing this phrase with the SimpleAnalyzer.
The start offset is the character position in the original text where the token text begins, and the end offset is the position just after the last character of the token text. The token type is a String, defaulting to "word", that you can control and use in the token-filtering process if desired. As text is tokenized, the position relative to the previous token is recorded as the position increment value. All the built-in tokenizers leave the position increment at the default value of 1, indicating that all tokens are in successive positions, one after the other.
As for position increments, note that different Analyzers handle them differently when they encounter a stop word.
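The token metadata described above can be sketched in plain Java. This is only an illustration, not Lucene's Token class: the Tok type and the tokenize method are hypothetical names, and the tokenizer simply emits lowercased runs of letters with a fixed type of "word" and a position increment of 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java sketch of token metadata; Tok is a hypothetical stand-in, not Lucene's Token.
final class TokenOffsetsSketch {
    static final class Tok {
        final String text;
        final int start;                  // offset of the first character in the original text
        final int end;                    // offset just past the last character
        final String type = "word";       // default token type
        final int positionIncrement = 1;  // every token directly follows the previous one

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + text + ":" + start + "->" + end + ":" + type + "]";
        }
    }

    // Emits one Tok per run of letters, lowercased, recording its offsets.
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (!Character.isLetter(input.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < input.length() && Character.isLetter(input.charAt(i))) i++;
            out.add(new Tok(input.substring(start, i).toLowerCase(Locale.ROOT), start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        int position = 0;
        for (Tok t : tokenize("The quick brown fox")) {
            position += t.positionIncrement;  // running position, as in the book's helper
            System.out.println(position + ": " + t);
        }
    }
}
```

The end offset is exclusive, so the length of each token's text equals end minus start.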
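II. TokenStreams uncensored
There are two kinds of TokenStream: Tokenizer and TokenFilter. A Tokenizer tokenizes input text read from a Reader; if the input is a String, it is wrapped in a StringReader first. A TokenFilter takes another TokenStream as its input, which lets you chain TokenFilters together. The design resembles the Java IO library and the Filters in JSP; chaining different filters together gives you a lot of power. The chart below gives the TokenStream class hierarchy with brief descriptions.
Figure: The TokenStream class hierarchy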
Class name | Description
TokenStream | Base class with next() and close() methods.
Tokenizer | TokenStream whose input is a Reader.
CharTokenizer | Parent class of character-based tokenizers, with abstract isTokenChar() method. Emits tokens for contiguous blocks when isTokenChar == true. Also provides the capability to normalize (for example, lowercase) characters. Tokens are limited to a maximum size of 255 characters.
WhitespaceTokenizer | CharTokenizer with isTokenChar() true for all nonwhitespace characters.
LetterTokenizer | CharTokenizer with isTokenChar() true when Character.isLetter is true.
LowerCaseTokenizer | LetterTokenizer that normalizes all characters to lowercase.
StandardTokenizer | Sophisticated grammar-based tokenizer, emitting tokens for high-level types like e-mail addresses (see section 4.3.2 for more details). Each emitted token is tagged with a special type, some of which are handled specially by StandardFilter.
TokenFilter | TokenStream whose input is another TokenStream.
LowerCaseFilter | Lowercases token text.
StopFilter | Removes words that exist in a provided set of words.
PorterStemFilter | Stems each token using the Porter stemming algorithm. For example, country and countries both stem to countri.
StandardFilter | Designed to be fed by a StandardTokenizer. Removes dots from acronyms and 's (apostrophe followed by S) from words with apostrophes.
Table: Analyzer building blocks provided in Lucene's core API
StopAnalyzer, for example, uses a Filter. Its code is:
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(new LowerCaseTokenizer(reader), stopTable);
    }
We will see filters chained together many more times below.
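The chaining idea can be sketched in plain Java as a small decorator chain. None of the types below are Lucene classes: Stream, LetterSource, LowerCase, and Stop are hypothetical stand-ins for the Tokenizer and TokenFilter roles, and the three-word stop set is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java sketch of a tokenizer/filter chain; none of these types are Lucene classes.
final class FilterChainSketch {
    // Hypothetical minimal contract: next() returns null once the stream is exhausted.
    interface Stream {
        String next();
    }

    // Source stage (the Tokenizer role): emits runs of letters, case untouched.
    static final class LetterSource implements Stream {
        private final String text;
        private int pos = 0;

        LetterSource(String text) { this.text = text; }

        public String next() {
            while (pos < text.length() && !Character.isLetter(text.charAt(pos))) pos++;
            if (pos >= text.length()) return null;
            int start = pos;
            while (pos < text.length() && Character.isLetter(text.charAt(pos))) pos++;
            return text.substring(start, pos);
        }
    }

    // Filter stage: lowercases every token from the wrapped stream.
    static final class LowerCase implements Stream {
        private final Stream input;

        LowerCase(Stream input) { this.input = input; }

        public String next() {
            String token = input.next();
            return token == null ? null : token.toLowerCase(Locale.ROOT);
        }
    }

    // Filter stage: skips tokens found in the stop set.
    static final class Stop implements Stream {
        private final Stream input;
        private final Set<String> stopWords;

        Stop(Stream input, Set<String> stopWords) {
            this.input = input;
            this.stopWords = stopWords;
        }

        public String next() {
            for (String token = input.next(); token != null; token = input.next()) {
                if (!stopWords.contains(token)) return token;
            }
            return null;
        }
    }

    // Pulls tokens until next() returns null.
    static List<String> drain(Stream stream) {
        List<String> out = new ArrayList<>();
        for (String token = stream.next(); token != null; token = stream.next()) out.add(token);
        return out;
    }

    static List<String> analyze(String text) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "an")); // hypothetical list
        return drain(new Stop(new LowerCase(new LetterSource(text)), stopWords));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox jumped over the lazy dogs"));
    }
}
```

Each stage only knows about the stream it wraps, so stages can be added, removed, or reordered without touching the others.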
3. Visualizing analyzers
It is important to understand how each Analyzer treats your text. Here is the example that produced the analysis output shown at the beginning:
AnalyzerDemo.java
    package lia.analysis;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import java.io.IOException;

    /**
     * Adapted from code which first appeared in a java.net article
     * written by Erik
     */
    public class AnalyzerDemo {
        private static final String[] examples = {
            "The quick brown fox jumped over the lazy dogs",
            "XY&Z Corporation - xyz@example.com"
        };

        private static final Analyzer[] analyzers = new Analyzer[]{
            new WhitespaceAnalyzer(),
            new SimpleAnalyzer(),
            new StopAnalyzer(),
            new StandardAnalyzer()
        };

        public static void main(String[] args) throws IOException {
            // Use the embedded example strings, unless
            // command line arguments are specified, then use those.
            String[] strings = examples;
            if (args.length > 0) {
                strings = args;
            }

            for (int i = 0; i < strings.length; i++) {
                analyze(strings[i]);
            }
        }

        private static void analyze(String text) throws IOException {
            System.out.println("Analyzing \"" + text + "\"");
            for (int i = 0; i < analyzers.length; i++) {
                Analyzer analyzer = analyzers[i];
                String name = analyzer.getClass().getName();
                name = name.substring(name.lastIndexOf(".") + 1);
                System.out.println(" " + name + ":");
                System.out.print(" ");
                AnalyzerUtils.displayTokens(analyzer, text);
                System.out.println("\n");
            }
        }
    }
It uses AnalyzerUtils.java, shown below:
    package lia.analysis;

    import junit.framework.Assert;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;

    public class AnalyzerUtils {
        // Analyze the text and collect every token the analyzer emits.
        public static Token[] tokensFromAnalysis(Analyzer analyzer,
                                                 String text) throws IOException {
            TokenStream stream =
                analyzer.tokenStream("contents", new StringReader(text)); // start the analysis
            ArrayList tokenList = new ArrayList();
            while (true) {
                Token token = stream.next();
                if (token == null) break; // null marks the end of the stream

                tokenList.add(token);
            }

            return (Token[]) tokenList.toArray(new Token[0]);
        }

        public static void displayTokens(Analyzer analyzer,
                                         String text) throws IOException {
            Token[] tokens = tokensFromAnalysis(analyzer, text);

            for (int i = 0; i < tokens.length; i++) {
                Token token = tokens[i];

                System.out.print("[" + token.termText() + "] "); // print the tokens: Result (3)
            }
        }

        // Print the tokens together with their positions.
        public static void displayTokensWithPositions(Analyzer analyzer,
                                                      String text) throws IOException {
            Token[] tokens = tokensFromAnalysis(analyzer, text);

            int position = 0;

            for (int i = 0; i < tokens.length; i++) {
                Token token = tokens[i];

                int increment = token.getPositionIncrement();

                if (increment > 0) {
                    position = position + increment;
                    System.out.println();
                    System.out.print(position + ": ");
                }

                System.out.print("[" + token.termText() + "] ");
            }
            System.out.println();
        }

        // Print all the details of each token.
        public static void displayTokensWithFullDetails(
                Analyzer analyzer, String text) throws IOException {
            Token[] tokens = tokensFromAnalysis(analyzer, text);

            int position = 0;

            for (int i = 0; i < tokens.length; i++) {
                Token token = tokens[i];

                int increment = token.getPositionIncrement();

                if (increment > 0) {
                    position = position + increment;
                    System.out.println();
                    System.out.print(position + ": ");
                }

                System.out.print("[" + token.termText() + ":" +
                    token.startOffset() + "->" +
                    token.endOffset() + ":" +
                    token.type() + "] ");
            }
            System.out.println();
        }

        public static void assertTokensEqual(Token[] tokens,
                                             String[] strings) {
            Assert.assertEquals(strings.length, tokens.length);

            for (int i = 0; i < tokens.length; i++) {
                Assert.assertEquals("index " + i, strings[i], tokens[i].termText());
            }
        }

        public static void main(String[] args) throws IOException {
            System.out.println("SimpleAnalyzer");
            // Print full token details; the output is Result (1) below.
            displayTokensWithFullDetails(new SimpleAnalyzer(),
                "The quick brown fox....");

            System.out.println("\n----");
            System.out.println("StandardAnalyzer");
            // Print full token details; the output is Result (2) below.
            displayTokensWithFullDetails(new StandardAnalyzer(),
                "I'll e-mail you at xyz@example.com");
        }
    }
Result (1), using SimpleAnalyzer:
1: [the:0->3:word]
2: [quick:4->9:word]
3: [brown:10->15:word]
4: [fox:16->19:word]
Result (2), using StandardAnalyzer:
1: [i'll:0->4:<APOSTROPHE>] // StandardAnalyzer recognizes the contraction and keeps it intact
2: [e:5->6:<ALPHANUM>]
3: [mail:7->11:<ALPHANUM>]
4: [you:12->15:<ALPHANUM>]
5: [xyz@example.com:19->34:<EMAIL>]
Result (3): the output is shown at the beginning of these notes. From the results we can draw the following conclusions:
■ WhitespaceAnalyzer didn't lowercase, left in the dash, and did the bare minimum of tokenizing at whitespace boundaries.
■ SimpleAnalyzer left in what may be considered irrelevant (stop) words, but it did lowercase and tokenize at nonalphabetic character boundaries.
■ Both SimpleAnalyzer and StopAnalyzer mangled the corporation name by splitting XY&Z and removing the ampersand.
■ StopAnalyzer and StandardAnalyzer threw away occurrences of the word the.
■ StandardAnalyzer kept the corporation name intact and lowercased it, removed the dash, and kept the e-mail address together. No other built-in analyzer is this thorough.
You can also pass your own text on the command line and see what comes out.
The examples above give a good feel for how Tokens work and are worth studying closely.
Also, when you use filters, the order of the filters matters, and it can have a large effect on processing performance as well. The book's test code for this point is worth working through slowly.
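A small plain-Java sketch shows why filter order matters for correctness. It assumes a stop filter that compares tokens case-sensitively against a lowercase stop list; the class FilterOrderSketch, its methods, and the one-word stop list are hypothetical and do not use Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java sketch of filter ordering; not Lucene code.
final class FilterOrderSketch {
    // Hypothetical stop list, stored in lowercase only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the"));

    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    // Case-sensitive removal: "The" is not equal to "the".
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) out.add(t);
        }
        return out;
    }

    // Wrong order for this stop list: capitalized stop words slip through.
    static List<String> stopThenLowercase(List<String> tokens) {
        return lowercase(removeStopWords(tokens));
    }

    // Lowercasing first lets the stop filter see every variant.
    static List<String> lowercaseThenStop(List<String> tokens) {
        return removeStopWords(lowercase(tokens));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "quick", "the", "fox");
        System.out.println("stop, then lowercase: " + stopThenLowercase(tokens));
        System.out.println("lowercase, then stop: " + lowercaseThenStop(tokens));
    }
}
```

With the stop filter first, the capitalized "The" survives and is only lowercased afterwards, so a stop word ends up in the output.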
3. Using the built-in Analyzers
About the built-in Analyzers: after the introduction above there is nothing more to say about WhitespaceAnalyzer and SimpleAnalyzer.
StopAnalyzer handles tokenization and lowercasing, and in addition removes stop words. StopAnalyzer has a built-in list of English stop words, but another constructor lets you pass in a String[] to use your own stop word list.
Using StopAnalyzer raises a new problem: how should the gaps left behind by removed stop words be handled? For example, suppose you index "one is not enough". After StopAnalyzer only one and enough remain. If you then search with QueryParser, also using StopAnalyzer, the indexed one and enough match all of these queries: "one enough", "one is enough", "one but not enough", and the original "one is not enough". As the authors put it:
Remember, QueryParser also analyzes phrases, and each of these reduces to "one enough" and matches the terms indexed. There is a "hole" lot more to this topic, which we cover in section 4.7.3 (after we provide more details about token positions).
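The way all four phrases collapse to the same tokens can be checked with a few lines of plain Java. This is a sketch, not Lucene: StopHoleSketch and analyze are hypothetical names, the input is assumed to be plain ASCII text, and the stop set contains only the three stop words needed for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java sketch of stop-word removal on phrases; not Lucene code.
final class StopHoleSketch {
    // Only the stop words needed for this illustration.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("is", "not", "but"));

    // Lowercase, split at non-letters (ASCII only), then drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) tokens.add(word);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] phrases = {
            "one enough", "one is enough", "one but not enough", "one is not enough"
        };
        for (String phrase : phrases) {
            System.out.println(phrase + " -> " + analyze(phrase));
        }
    }
}
```

Because no trace of the removed words is kept, the four phrases are indistinguishable after analysis.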
Having the stop words removed presents an interesting semantic question. Do you lose some potential meaning? The answer to this question is, “It depends.” It depends on your use of Lucene and whether searching on these words is meaningful to your application. We briefly revisit this somewhat rhetorical question later, in section 4.7.3. To emphasize and reiterate an important point, only the tokens emitted
StandardAnalyzer is built on a JavaCC grammar, so it readily handles the following:
alphanumerics, acronyms, company names, e-mail addresses, computer host names, numbers, words with an interior apostrophe, serial numbers, IP addresses, and CJK (Chinese Japanese Korean) characters.
So in general StandardAnalyzer can handle most situations. It is used in the same way as the other Analyzers.
4. Dealing with keyword fields
Searching a Keyword field works well if you use a TermQuery, but QueryParser is not so convenient. Let's look at an example:
    package lia.analysis.keyword;

    import junit.framework.TestCase;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.queryParser.QueryParser;
    import lia.analysis.keyword.KeywordAnalyzer;

    public class KeywordAnalyzerTest extends TestCase {
        RAMDirectory directory;
        private IndexSearcher searcher;

        public void setUp() throws Exception {
            directory = new RAMDirectory();
            IndexWriter writer = new IndexWriter(directory,
                                                 new SimpleAnalyzer(),
                                                 true);

            Document doc = new Document();
            doc.add(Field.Keyword("partnum", "Q36")); // indexed as a keyword (not analyzed)
            doc.add(Field.Text("description", "Illidium Space Modulator"));
            writer.addDocument(doc);

            writer.close();

            searcher = new IndexSearcher(directory);
        }

        // Search with a TermQuery.
        public void testTermQuery() throws Exception {
            Query query = new TermQuery(new Term("partnum", "Q36"));
            Hits hits = searcher.search(query);
            assertEquals(1, hits.length());
        }

        // Search with QueryParser.
        public void testBasicQueryParser() throws Exception {
            Query query = QueryParser.parse("partnum:Q36 AND SPACE",
                                            "description",
                                            new SimpleAnalyzer());

            Hits hits = searcher.search(query);
            // Note: SimpleAnalyzer has analyzed Q36 into q.
            assertEquals("note Q36 -> q",
                         "+partnum:q +space", query.toString("description"));
            // No hits; the next test method shows the fix.
            assertEquals("doc not found :(", 0, hits.length());
        }

        // Use PerFieldAnalyzerWrapper and KeywordAnalyzer for one specific field.
        public void testPerFieldAnalyzer() throws Exception {
            PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
                new SimpleAnalyzer());
            analyzer.addAnalyzer("partnum", new KeywordAnalyzer()); // KeywordAnalyzer for the partnum field

            Query query = QueryParser.parse("partnum:Q36 AND SPACE",
                                            "description",
                                            analyzer);

            Hits hits = searcher.search(query);
            assertEquals("Q36 kept as-is",
                         "+partnum:Q36 +space", query.toString("description"));
            assertEquals("doc found!", 1, hits.length()); // the document is found
        }
    }
Here is the code for KeywordAnalyzer.java:
    package lia.analysis.keyword;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import java.io.IOException;
    import java.io.Reader;

    /**
     * "Tokenizes" the entire stream as a single token.
     */
    public class KeywordAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName,
                                       final Reader reader) {
            return new TokenStream() {
                private boolean done;
                private final char[] buffer = new char[1024];

                public Token next() throws IOException {
                    if (!done) {
                        done = true;
                        // This local shadows the char[] field, which is reached via this.buffer.
                        StringBuffer buffer = new StringBuffer();
                        int length = 0;
                        while (true) {
                            length = reader.read(this.buffer);
                            if (length == -1) break; // end of input

                            buffer.append(this.buffer, 0, length);
                        }
                        String text = buffer.toString();
                        return new Token(text, 0, text.length());
                    }
                    return null; // the single token has already been returned
                }
            };
        }
    }
See the TokenStream hierarchy table above. If you are sure your keywords are no longer than 255 characters, there is a simpler implementation: extend CharTokenizer and override the isTokenChar(char c) method, as follows:
public class SimpleKeywordAnalyzer extends Analyzer {
public TokenStream tokenStream(String fieldName, Reader reader) {
return new CharTokenizer(reader) {
protected boolean isTokenChar(char c) {
return true;
}
};
}
}
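The idea behind the per-field fix can be sketched in plain Java without Lucene: pick the analysis function by field name and fall back to a default. Everything here is hypothetical (the PerFieldSketch class, its nested PerField dispatcher, and the two analysis functions); it only imitates the behavior shown in the test above and is not how PerFieldAnalyzerWrapper is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Plain-Java sketch of per-field analysis; not Lucene code.
final class PerFieldSketch {
    // Imitates SimpleAnalyzer: lowercased runs of letters, everything else dropped.
    static List<String> letterLowercase(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString().toLowerCase(Locale.ROOT));
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString().toLowerCase(Locale.ROOT));
        return tokens;
    }

    // Imitates a keyword analyzer: the whole value is one untouched token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // Hypothetical dispatcher: chooses an analysis function by field name.
    static final class PerField {
        private final Function<String, List<String>> defaultAnalyzer;
        private final Map<String, Function<String, List<String>>> byField = new HashMap<>();

        PerField(Function<String, List<String>> defaultAnalyzer) {
            this.defaultAnalyzer = defaultAnalyzer;
        }

        void add(String field, Function<String, List<String>> analyzer) {
            byField.put(field, analyzer);
        }

        List<String> analyze(String field, String text) {
            return byField.getOrDefault(field, defaultAnalyzer).apply(text);
        }
    }

    public static void main(String[] args) {
        PerField analyzer = new PerField(PerFieldSketch::letterLowercase);
        analyzer.add("partnum", PerFieldSketch::keyword);
        System.out.println("default on Q36: " + letterLowercase("Q36"));
        System.out.println("partnum:        " + analyzer.analyze("partnum", "Q36"));
        System.out.println("description:    " + analyzer.analyze("description", "SPACE"));
    }
}
```

The letter-only analysis reduces Q36 to q because the digits are not letters, which is why the query term no longer matches the keyword that was indexed as Q36.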
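5. "Sounds like" searching
This looks like a fun one: searching by pronunciation. For example, the two sentences "The quick brown fox jumped over the lazy dogs" and "Tha quik brown phox jumpd ovvar tha lazi dogz" produce the same result after analysis by MetaphoneReplacementAnalyzer. If you are interested, have a look at the book's test code.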
Posted on 2007-01-05 10:14 by Lansing. Category: Search engines