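Lucene Usage Examples

A note up front: this article uses Lucene 2.0. The main differences between 2.0 and earlier versions show up in querying.
I actually wrote this code a few months ago, but I was too lazy to post it on my blog until today. I can't write anything profound, so I just record simple notes and odds and ends. The code here was written for my own amusement, so I hope the experts won't throw "refactor this" at me...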

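1. On Windows, create a folder named s on the C: drive and create three txt files in it. Any names will do; here they are called "1.txt", "2.txt" and "3.txt".
The content of 1.txt is as follows:

中华人民共和国
全国人民
2006 年

"2.txt" and "3.txt" can contain anything you like; to save effort, just copy the content of 1.txt into them.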

2. Download the Lucene package and put it on the classpath.
Build the index:

package lighter.javaeye.com;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * author lighter date 2006-8-7
 */
public class TextFileIndexer {

    public static void main(String[] args) throws Exception {
        /* The folder to index; here it is the s folder on the C: drive */
        File fileDir = new File("c:\\s");

        /* Where the index files are stored */
        File indexDir = new File("c:\\index");
        Analyzer luceneAnalyzer = new StandardAnalyzer();
        // true: create a new index, replacing any index already in indexDir
        IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true);
        File[] textFiles = fileDir.listFiles();
        long startTime = new Date().getTime();

        // Add one Document per .txt file to the index
        for (int i = 0; i < textFiles.length; i++) {
            if (textFiles[i].isFile()
                    && textFiles[i].getName().endsWith(".txt")) {
                System.out.println("File " + textFiles[i].getCanonicalPath()
                        + " is being indexed....");
                // The sample files are read as GBK text
                String temp = FileReaderAll(textFiles[i].getCanonicalPath(), "GBK");
                System.out.println(temp);
                Document document = new Document();
                // The path is stored so it can be shown in results, but it is not searchable
                Field fieldPath = new Field("path", textFiles[i].getPath(),
                        Field.Store.YES, Field.Index.NO);
                // The body is stored, tokenized by the analyzer, and keeps term vectors
                Field fieldBody = new Field("body", temp, Field.Store.YES,
                        Field.Index.TOKENIZED,
                        Field.TermVector.WITH_POSITIONS_OFFSETS);
                document.add(fieldPath);
                document.add(fieldBody);
                indexWriter.addDocument(document);
            }
        }

        // optimize() optimizes the index
        indexWriter.optimize();
        indexWriter.close();

        // Measure how long indexing took
        long endTime = new Date().getTime();
        System.out.println("It took " + (endTime - startTime)
                + " ms to add the documents to the index! "
                + fileDir.getPath());
    }

    /* Reads a whole file with the given charset; lines are joined without line breaks */
    public static String FileReaderAll(String FileName, String charset)
            throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(FileName), charset));
        StringBuilder temp = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                temp.append(line);
            }
        } finally {
            reader.close();
        }
        return temp.toString();
    }
}

Indexing output:
File C:\s\1.txt is being indexed....
中华人民共和国全国人民2006年
File C:\s\2.txt is being indexed....
中华人民共和国全国人民2006年
File C:\s\3.txt is being indexed....
中华人民共和国全国人民2006年
It took 297 ms to add the documents to the index! c:\s

3. With the index built, run a query:

package lighter.javaeye.com;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TestQuery {

    public static void main(String[] args) throws IOException, ParseException {
        String queryString = "中华";
        // Open the index built in step 2
        IndexSearcher searcher = new IndexSearcher("c:\\index");
        // Use the same analyzer that was used for indexing
        Analyzer analyzer = new StandardAnalyzer();

        // main() declares ParseException, so a malformed query fails loudly
        // instead of being swallowed and leaving the query null
        QueryParser qp = new QueryParser("body", analyzer);
        Query query = qp.parse(queryString);

        Hits hits = searcher.search(query);
        if (hits.length() > 0) {
            System.out.println("Found: " + hits.length() + " results!");
        }
        searcher.close();
    }
}

Output:
Found: 3 results!

Lucene is actually quite simple. It mainly does two things: building an index and searching it.
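
To make those two jobs concrete, here is a minimal sketch in plain Java that uses no Lucene classes at all. The `TinyIndex` class and its methods are hypothetical names invented for this illustration; it is only a toy inverted index (a map from each term to the documents containing it), far simpler than what Lucene really stores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: NOT Lucene code, only an illustration of "index, then search".
class TinyIndex {
    // term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Indexing: split the text into lowercase terms and record which document holds each one.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Searching: one map lookup, with no scan over the original documents.
    List<Integer> search(String term) {
        Set<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addDocument("lighter javaeye com");
        index.addDocument("lighter blog");
        System.out.println(index.search("lighter")); // [0, 1]
        System.out.println(index.search("javaeye")); // [0]
    }
}
```

The cost of splitting and recording terms is paid once at indexing time, which is why a search afterwards can be answered with a lookup rather than by rereading every file.
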
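Here are some of the terms used in Lucene. I don't intend to explain them in detail, only to touch on them; after all, there is a wonderful thing in this world called search.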

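IndexWriter: one of the most important classes in Lucene. It adds documents to the index and controls a number of parameters of the indexing process.

Analyzer: the analyzer, which breaks the text the search engine encounters into terms. Commonly used ones include StandardAnalyzer, StopAnalyzer and WhitespaceAnalyzer.

Directory: where the index is stored. Lucene offers two kinds of storage, on disk and in memory, through the FSDirectory and RAMDirectory classes respectively. Normally the index is kept on disk.

Document: a document. A Document is one unit to be indexed; any file you want indexed must first be converted into a Document object.

Field: a field. A Document is made up of one or more Fields.

IndexSearcher: the most basic search tool in Lucene; every search goes through an IndexSearcher.

Query: a query. Lucene supports fuzzy queries, phrase queries, combined queries and more, through classes such as TermQuery, BooleanQuery, RangeQuery and WildcardQuery.

QueryParser: a tool for parsing user input. It scans the string the user typed and produces a Query object.

Hits: once a search has finished, the results have to be returned and shown to the user; only then has the search served its purpose. In Lucene, the set of search results is represented by an instance of the Hits class.

That was a long list of definitions, so let's look at a few simple examples:

1. A simple StandardAnalyzer test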

package lighter.javaeye.com;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class StandardAnalyzerTest
{
    // Constructor
    public StandardAnalyzerTest()
    {
    }

    public static void main(String[] args)
    {
        // Create a StandardAnalyzer
        Analyzer aAnalyzer = new StandardAnalyzer();
        // The test string
        StringReader sr = new StringReader("lighter javaeye com is the are on");
        // Create the TokenStream; "name" is the field name
        TokenStream ts = aAnalyzer.tokenStream("name", sr);
        try {
            int i = 0;
            Token t = ts.next();
            while (t != null)
            {
                // Counter used to number the output lines
                i++;
                // Print the token text after analysis
                System.out.println("Line " + i + ": " + t.termText());
                // Fetch the next token
                t = ts.next();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:
Line 1: lighter
Line 2: javaeye
Line 3: com

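A few notes:
StandardAnalyzer is Lucene's built-in "standard analyzer". It does the following:
1. Splits the original sentence into words (in this example, at the spaces)
2. Converts all uppercase letters to lowercase
3. Removes words that carry little meaning, such as "is", "the" and "are", and also strips all punctuation
Compare the output with the input, new StringReader("lighter javaeye com is the are on"), and this becomes clear.
I won't explain the API here; see the official Lucene documentation for details. One thing to note: the code here uses the Lucene 2 API, which differs noticeably from version 1.4.3.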
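
The three steps listed above can be imitated in plain Java, which may help when reading the output. This is only a rough sketch and not how StandardAnalyzer is implemented: `ToyAnalyzer` and its short stop list are assumptions made up for this illustration, and the real analyzer uses a grammar-based tokenizer with its own, longer stop list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of the three steps; NOT the real StandardAnalyzer.
class ToyAnalyzer {
    // Hypothetical stop list for this sketch only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "the", "are", "on"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Step 1: split on anything that is not a letter or digit (this also drops punctuation)
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            // Step 2: convert to lowercase
            String term = raw.toLowerCase(Locale.ROOT);
            // Step 3: skip empty pieces and stop words
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                tokens.add(term);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same input as StandardAnalyzerTest
        System.out.println(analyze("lighter javaeye com is the are on")); // [lighter, javaeye, com]
    }
}
```
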

2. Another example: build a simple index, then search it

package lighter.javaeye.com;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;

public class FSDirectoryTest {

    // Path where the index is created
    public static final String path = "c:\\index2";

    public static void main(String[] args) throws Exception {
        Document doc1 = new Document();
        doc1.add(new Field("name", "lighter javaeye com", Field.Store.YES, Field.Index.TOKENIZED));

        Document doc2 = new Document();
        doc2.add(new Field("name", "lighter blog", Field.Store.YES, Field.Index.TOKENIZED));

        // Both "true" flags mean: create a fresh index at this path
        IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(path, true), new StandardAnalyzer(), true);
        // Index at most the first 3 terms of each field; this is a writer setting,
        // so one call covers every document added afterwards
        writer.setMaxFieldLength(3);
        writer.addDocument(doc1);
        writer.addDocument(doc2);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(path);
        QueryParser qp = new QueryParser("name", new StandardAnalyzer());

        Query query = qp.parse("lighter");
        Hits hits = searcher.search(query);
        System.out.println("Hits for \"lighter\": " + hits.length());

        query = qp.parse("javaeye");
        hits = searcher.search(query);
        System.out.println("Hits for \"javaeye\": " + hits.length());

        searcher.close();
    }
}

Output:

Hits for "lighter": 2
Hits for "javaeye": 1

Posted on 2008-01-14 10:32 by 大田斗. Category: Lucene


