        Out of personal interest, I built a search framework on top of Lucene in my spare time. Feedback and suggestions are very welcome.

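I. Introduction

A full-text search framework built on Lucene. It provides a quick, convenient way to create and query indexes, and offers extension points for customizing the framework.

    Project page: http://code.google.com/p/snoics-retrieval/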

II. User Guide

1. Requirements

Java 1.5+

Lucene 3.0.x+

2. Loading

Create an instance by loading the configuration parameters through RetrievalApplicationContext. Each RetrievalApplicationContext instance holds a complete, independent context.

In most cases an application only needs to create one RetrievalApplicationContext instance at startup and share it across the whole application.

A RetrievalApplicationContext instance can be created in the following ways:

The default way, which reads the retrieval.properties configuration file from the classpath:

    RetrievalApplicationContext retrievalApplicationContext =
            new RetrievalApplicationContext("c:\\index");

From a Properties instance that holds the configuration:

    Properties properties = ...
    ...
    RetrievalApplicationContext retrievalApplicationContext =
            new RetrievalApplicationContext(properties, "c:\\index");

From a named configuration file, which must be on the classpath:

    RetrievalApplicationContext retrievalApplicationContext =
            new RetrievalApplicationContext("app-retrieval.properties",
                    "c:\\index");

From a RetrievalProperties object that you build yourself:

    RetrievalProperties retrievalProperties = …
    …
    RetrievalApplicationContext retrievalApplicationContext =
            new RetrievalApplicationContext(retrievalProperties,
                    "c:\\index");
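For the Properties-based option, a minimal sketch of building the Properties instance in code might look like the following. The keys are parameter names documented in the configuration section of this guide; the values are example choices, and the class name RetrievalPropertiesSketch is made up for illustration. The final hand-off to RetrievalApplicationContext is left as a comment because it needs the snoics-retrieval jar.

```java
import java.util.Properties;

class RetrievalPropertiesSketch {

    // Builds an in-memory Properties object. The keys come from this guide;
    // the values are illustrative only.
    static Properties buildProperties() {
        Properties properties = new Properties();
        properties.setProperty("LUCENE_PARAM_VERSION", "LUCENE_30");
        properties.setProperty("INDEX_DEFAULT_CHARSET", "utf-8");
        properties.setProperty("QUERY_RESULT_TOP_DOCS_NUM", "3000");
        return properties;
    }

    public static void main(String[] args) {
        Properties properties = buildProperties();
        // With snoics-retrieval on the classpath, the instance would then be
        // passed on as in the snippet above:
        // new RetrievalApplicationContext(properties, "c:\\index");
        System.out.println(properties.getProperty("LUCENE_PARAM_VERSION"));
    }
}
```

Parameters that are left out of the Properties instance fall back to the defaults listed in the configuration section.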

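3. Configuration Parameters

The default configuration file is retrieval.properties on the classpath. The parameters are described below.

LUCENE_PARAM_VERSION

Lucene parameter; defaults to LUCENE_30 if not set.
Sets the Lucene version, which affects the index file format and the query results. Valid values: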

LUCENE_20
LUCENE_21
LUCENE_22
LUCENE_23
LUCENE_24
LUCENE_29
LUCENE_30

LUCENE_PARAM_MAX_FIELD_LENGTH

Lucene parameter; defaults to DEFAULT_MAX_FIELD_LENGTH=10000 if not set.

The maximum number of terms that will be indexed for a single field in a document.
This limits the amount of memory required for indexing, so that collections with
very large files will not crash the indexing process by running out of memory.
This setting refers to the number of running terms, not to the number of different terms.

Note: this silently truncates large documents, excluding from the index all terms
that occur further in the document. If you know your source documents are large,
be sure to set this value high enough to accommodate the expected size. If you set
it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate
an OutOfMemoryError.

By default, no more than DEFAULT_MAX_FIELD_LENGTH=10000 terms will be indexed for a field.

LUCENE_PARAM_RAM_BUFFER_SIZE_MB

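Lucene parameter; defaults to DEFAULT_RAM_BUFFER_SIZE_MB=16 if not set.
Caps the memory used to buffer documents during indexing. When the buffered documents reach this limit they are written to disk. Larger values generally give faster indexing.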

Determines the amount of RAM that may be used for buffering added documents and
deletions before they are flushed to the Directory. Generally for faster indexing
performance it's best to flush by RAM usage instead of document count and use as
large a RAM buffer as you can.

When this is set, the writer will flush whenever buffered documents and deletions
use this much RAM. Pass in DISABLE_AUTO_FLUSH to prevent triggering a flush due to
RAM usage. Note that if flushing by document count is also enabled, then the flush
will be triggered by whichever comes first.

NOTE: the account of RAM usage for pending deletions is only approximate. Specifically,
if you delete by Query, Lucene currently has no way to measure the RAM usage of
individual Queries so the accounting will under-estimate and you should compensate by
either calling commit() periodically yourself, or by using setMaxBufferedDeleteTerms
to flush by count instead of RAM usage (each buffered delete Query counts as one).

NOTE: because IndexWriter uses ints when managing its internal storage, the absolute
maximum value for this setting is somewhat less than 2048 MB. The precise limit depends on
various factors, such as how large your documents are, how many fields have norms,
etc., so it's best to set this value comfortably under 2048.

The default value is DEFAULT_RAM_BUFFER_SIZE_MB=16.

LUCENE_PARAM_MAX_BUFFERED_DOCS

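Lucene parameter; if not set, the Lucene default applies (disabled, so the writer flushes by RAM usage).
It can be combined with LUCENE_PARAM_RAM_BUFFER_SIZE_MB. When both are set, a flush to disk happens as soon as
either condition is met, producing a new index segment file.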

Determines the minimal number of documents required before the buffered in-memory documents
are flushed as a new Segment. Large values generally give faster indexing.

When this is set, the writer will flush every maxBufferedDocs added documents. Pass in
DISABLE_AUTO_FLUSH to prevent triggering a flush due to number of buffered documents.
Note that if flushing by RAM usage is also enabled, then the flush will be triggered by
whichever comes first.

Disabled by default (writer flushes by RAM usage).

LUCENE_PARAM_MERGE_FACTOR

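Lucene parameter; defaults to 10 if not set.
MergeFactor controls how many index segments may accumulate on disk before they are merged. Do not set it too high,
especially when MaxBufferedDocs is small (which produces more segments); otherwise you may hit "too many open files" errors, or even failures outside the JVM.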

Determines how often segment indices are merged by addDocument(). With smaller values,
less RAM is used while indexing, and searches on unoptimized indices are faster,
but indexing speed is slower. With larger values, more RAM is used during indexing,
and while searches on unoptimized indices are slower, indexing is faster. Thus larger
values (> 10) are best for batch index creation, and smaller values (< 10) for indices
that are interactively maintained.

Note that this method is a convenience method: it just calls mergePolicy.setMergeFactor
as long as mergePolicy is an instance of LogMergePolicy. Otherwise an IllegalArgumentException
is thrown.

This must never be less than 2. The default value is 10.

LUCENE_PARAM_MAX_MERGE_DOCS

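Lucene parameter; defaults to Integer.MAX_VALUE if not set.
Sets the largest segment, measured by document count, that may still be merged with other segments. Smaller values keep merge pauses short; larger values suit batch indexing.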

Determines the largest segment (measured by document count) that may be merged with other segments.
Small values (e.g., less than 10,000) are best for interactive indexing, as this limits the length
of pauses while indexing to a few seconds. Larger values are best for batched indexing and speedier searches.

The default value is Integer.MAX_VALUE.

Note that this method is a convenience method: it just calls mergePolicy.setMaxMergeDocs as long as
mergePolicy is an instance of LogMergePolicy. Otherwise an IllegalArgumentException is thrown.

The default merge policy (LogByteSizeMergePolicy) also allows you to set this limit by net
size (in MB) of the segment, using LogByteSizeMergePolicy.setMaxMergeMB.

INDEX_MAX_FILE_DOCUMENT_PAGE_SIZE

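Index creation parameter; defaults to 20 if not set.

The maximum number of file index documents per page when batch-indexing files.
A larger value set through the API at index creation time does not take effect.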

INDEX_MAX_INDEX_FILE_SIZE

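Index creation parameter; defaults to 3145728 (in bytes) if not set.

When indexing a file, if the file is larger than this limit, its content is ignored: the content is not parsed and not indexed.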

INDEX_MAX_DB_DOCUMENT_PAGE_SIZE

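Index creation parameter; defaults to 500 if not set.

The maximum number of records per page read from the database when batch-indexing database records.
A larger value set through the API at index creation time does not take effect.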

INDEX_DEFAULT_CHARSET

Index creation parameter; defaults to utf-8 if not set.

The default character set used when parsing the content of text files.

QUERY_RESULT_TOP_DOCS_NUM

Index query parameter; defaults to 3000 if not set.

The maximum number of results a query returns.

RETRIEVAL_EXTENDS_CLASS_FILE_CONTENT_PARSER_MANAGER

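Retrieval extension; defaults to com.snoics.retrieval.engine.index.create.impl.file.FileContentParserManager if not set.

The file content parser manager. When a file is indexed, this manager creates the parser that matches the file type and uses it to parse the file content.
Must implement the interface: com.snoics.retrieval.engine.index.create.impl.file.IFileContentParserManager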

RETRIEVAL_EXTENDS_CLASS_ANALYZER_BUILDER

Retrieval extension; defaults to com.snoics.retrieval.engine.analyzer.CJKAnalyzerBuilder if not set.

The index analyzer (tokenizer). The built-in analyzer builders are:
com.snoics.retrieval.engine.analyzer.CJKAnalyzerBuilder (default)
com.snoics.retrieval.engine.analyzer.IKCAnalyzerBuilder (strongly recommended for Chinese word segmentation)
com.snoics.retrieval.engine.analyzer.StandardAnalyzerBuilder
com.snoics.retrieval.engine.analyzer.ChineseAnalyzerBuilder

Must implement the interface: com.snoics.retrieval.engine.analyzer.IRAnalyzerBuilder

RETRIEVAL_EXTENDS_CLASS_HEIGHLIGHTER_MAKER

Retrieval extension; defaults to com.snoics.retrieval.engine.query.formatter.HighlighterMaker if not set.

Highlights matches in the content of query results.

Must implement the interface: com.snoics.retrieval.engine.query.formatter.IHighlighterMaker

RETRIEVAL_EXTENDS_CLASS_DATABASE_INDEX_ALL

Retrieval extension; defaults to com.snoics.retrieval.engine.index.all.impl.DefaultRDatabaseIndexAllImpl if not set.

Reads database records in batches and writes them to the index.

Must extend the abstract class: com.snoics.retrieval.engine.index.all.impl.AbstractDefaultRDatabaseIndexAll
or implement the interface directly: com.snoics.retrieval.engine.index.all.IRDatabaseIndexAll
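To make the "default if not set" and "larger API values do not take effect" rules concrete, here is a minimal sketch for INDEX_MAX_DB_DOCUMENT_PAGE_SIZE. It assumes the framework caps a requested page size at the configured maximum; that is one plausible reading of the description, not something confirmed from the framework source. The class and method names (PageSizeSketch, readInt, effectivePageSize) are hypothetical.

```java
import java.util.Properties;

class PageSizeSketch {

    // Documented default for INDEX_MAX_DB_DOCUMENT_PAGE_SIZE.
    static final int DEFAULT_MAX_DB_PAGE_SIZE = 500;

    // Reads an integer parameter, falling back to the default when unset.
    static int readInt(Properties properties, String key, int defaultValue) {
        String raw = properties.getProperty(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    // Assumption: a page size requested through the API is capped at the
    // configured maximum rather than rejected.
    static int effectivePageSize(int requested, int configuredMax) {
        return Math.min(requested, configuredMax);
    }

    public static void main(String[] args) {
        Properties properties = new Properties(); // nothing set, so defaults apply
        int max = readInt(properties, "INDEX_MAX_DB_DOCUMENT_PAGE_SIZE",
                DEFAULT_MAX_DB_PAGE_SIZE);
        System.out.println(effectivePageSize(2000, max)); // capped at the maximum
        System.out.println(effectivePageSize(100, max));  // below the maximum, kept
    }
}
```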

4. Indexing

4.1 Initializing the index

    retrievalApplicationContext
            .getFacade()
            .initIndex(new String[]{"DB","FILE"});

4.2 Five ways to create an index

Creating an index from a plain document

    RFacade facade = retrievalApplicationContext.getFacade();

    NormalIndexDocument normalIndexDocument =
            facade.createNormalIndexDocument(false);

    // Keyword item
    RDocItem docItem1 = new RDocItem();
    docItem1.setContent("搜索引擎");
    docItem1.setName("KEY_FIELD");
    normalIndexDocument.addKeyWord(docItem1);

    // Content items (the text is arbitrary sample data)
    RDocItem docItem2 = new RDocItem();
    docItem2.setContent("速度覅藕断丝连房价多少了咖啡卡拉圣诞节");
    docItem2.setName("TITLE_FIELD");
    normalIndexDocument.addContent(docItem2);

    RDocItem docItem3 = new RDocItem();
    docItem3.setContent("哦瓦尔卡及讨论热离开家");
    docItem3.setName("CONTENT_FIELD");
    normalIndexDocument.addContent(docItem3);

    IRDocOperatorFacade docOperatorFacade =
            facade.createDocOperatorFacade();

    docOperatorFacade.create(normalIndexDocument);

Indexing a single database record

    IRDocOperatorFacade docOperatorHelper =
            retrievalApplicationContext
                    .getFacade()
                    .createDocOperatorFacade();

    String tableName = "TABLE1";
    String recordId = "849032809432490324093";

    DatabaseIndexDocument databaseIndexDocument =
            retrievalApplicationContext
                    .getFacade()
                    .createDatabaseIndexDocument(false);

    databaseIndexDocument.setIndexPathType("DB");
    databaseIndexDocument.setIndexInfoType("TABLE1");

    databaseIndexDocument.setTableNameAndRecordId(tableName, recordId);

    // One RDocItem per column of the record (values are sample data)
    RDocItem docItem1 = new RDocItem();
    docItem1.setName("TITLE");
    docItem1.setContent("SJLKDFJDSLK F");

    RDocItem docItem2 = new RDocItem();
    docItem2.setName("CONTENT");
    docItem2.setContent("RUEWOJFDLSKJFLKSJGLKJSFLKDSJFLKDSF");

    RDocItem docItem3 = new RDocItem();
    docItem3.setName("field3");
    docItem3.setContent("adsjflkdsjflkdsf");

    RDocItem docItem4 = new RDocItem();
    docItem4.setName("field4");
    docItem4.setContent("45432534253");

    RDocItem docItem5 = new RDocItem();
    docItem5.setName("field5");
    docItem5.setContent("87987yyyyyyyy");

    RDocItem docItem6 = new RDocItem();
    docItem6.setName("field6");
    docItem6.setContent("87987yyyyyyyy");

    databaseIndexDocument.addContent(docItem1);
    databaseIndexDocument.addContent(docItem2);
    databaseIndexDocument.addContent(docItem3);
    databaseIndexDocument.addContent(docItem4);
    databaseIndexDocument.addContent(docItem5);
    databaseIndexDocument.addContent(docItem6);

    docOperatorHelper.create(databaseIndexDocument);

Indexing the content and metadata of a single file

    IRDocOperatorFacade docOperatorHelper =
            retrievalApplicationContext
                    .getFacade()
                    .createDocOperatorFacade();

    FileIndexDocument fileIndexDocument =
            retrievalApplicationContext
                    .getFacade()
                    .createFileIndexDocument(false, "utf-8");
    fileIndexDocument.setFileBasePath("c:\\doc");
    fileIndexDocument.setFileId("fileId_123");
    fileIndexDocument.setFile(new File("c:\\doc\\1.doc"));
    fileIndexDocument.setIndexPathType("FILE");
    fileIndexDocument.setIndexInfoType("SFILE");

    docOperatorHelper.create(fileIndexDocument, 3*1024*1024);
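The value 3*1024*1024 above equals the documented default of INDEX_MAX_INDEX_FILE_SIZE (3145728 bytes), so it is presumably the per-call size limit. A minimal sketch of the size check described for that parameter follows; the class and method names are hypothetical, and the real framework may apply the check differently.

```java
import java.io.File;

class FileSizeLimitSketch {

    // Documented default for INDEX_MAX_INDEX_FILE_SIZE: 3 MB in bytes.
    static final long DEFAULT_MAX_INDEX_FILE_SIZE = 3L * 1024 * 1024;

    // Returns true when the file content should be parsed and indexed.
    // Files larger than the limit have their content ignored.
    static boolean shouldParseContent(long fileSizeInBytes, long maxSizeInBytes) {
        return fileSizeInBytes <= maxSizeInBytes;
    }

    public static void main(String[] args) {
        // The path is only an example; a missing file reports length 0.
        File file = new File(args.length > 0 ? args[0] : "1.doc");
        boolean parse = shouldParseContent(file.length(), DEFAULT_MAX_INDEX_FILE_SIZE);
        System.out.println(file.getName() + " content indexed: " + parse);
    }
}
```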

       

Batch-indexing database records

    String tableName = "TABLE1";
    String keyField = "ID";
    String sql = "SELECT ID,"
            + "TITLE,"
            + "CONTENT,"
            + "FIELD3,"
            + "FIELD4,"
            + "FIELD5,"
            + "FIELD6 FROM TABLE1 ORDER BY ID ASC";

    RDatabaseIndexAllItem databaseIndexAllItem =
            retrievalApplicationContext
                    .getFacade()
                    .createDatabaseIndexAllItem(false);

    databaseIndexAllItem.setIndexPathType("DB");
    databaseIndexAllItem.setIndexInfoType("TABLE1");
    // To add a new index entry whether or not the record already exists,
    // use RetrievalType.RIndexOperatorType.INSERT.
    // To update the matching entry when the record is already in the index,
    // and add a new entry otherwise, use RetrievalType.RIndexOperatorType.UPDATE.
    databaseIndexAllItem
            .setIndexOperatorType(RetrievalType.RIndexOperatorType.INSERT);
    databaseIndexAllItem.setTableName(tableName);
    databaseIndexAllItem.setKeyField(keyField);
    databaseIndexAllItem.setDefaultTitleFieldName("TITLE");
    databaseIndexAllItem.setDefaultResumeFieldName("CONTENT");
    databaseIndexAllItem.setPageSize(500);
    databaseIndexAllItem.setSql(sql);
    databaseIndexAllItem.setParam(new Object[] {});
    databaseIndexAllItem
            .setDatabaseRecordInterceptor(new TestDatabaseRecordInterceptor());

    IRDocOperatorFacade docOperatorFacade =
            retrievalApplicationContext
                    .getFacade()
                    .createDocOperatorFacade();

    long indexCount = docOperatorFacade.createAll(databaseIndexAllItem);

    // Optimize the index
    retrievalApplicationContext
            .getFacade()
            .createIndexOperatorFacade("DB")
            .optimize();
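The difference between the INSERT and UPDATE operator types can be pictured with a toy in-memory "index". This is only a model of the behavior described in the comments above, not the framework's implementation; IndexOperatorTypeSketch and all of its members are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOperatorTypeSketch {

    enum OperatorType { INSERT, UPDATE }

    // A toy "index": each entry is {recordId, content}.
    private final List<String[]> entries = new ArrayList<String[]>();

    void write(OperatorType type, String recordId, String content) {
        if (type == OperatorType.UPDATE) {
            for (String[] entry : entries) {
                if (entry[0].equals(recordId)) {
                    entry[1] = content; // record already indexed: update it
                    return;
                }
            }
        }
        // INSERT, or UPDATE for a record that is not indexed yet: add an entry.
        entries.add(new String[] { recordId, content });
    }

    int size() {
        return entries.size();
    }

    // Returns the content of the first entry for the record, or null if none.
    String contentOf(String recordId) {
        for (String[] entry : entries) {
            if (entry[0].equals(recordId)) {
                return entry[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        IndexOperatorTypeSketch index = new IndexOperatorTypeSketch();
        index.write(OperatorType.INSERT, "1", "first");
        index.write(OperatorType.INSERT, "1", "second"); // adds a duplicate entry
        System.out.println("entries after two INSERTs: " + index.size());
    }
}
```

In this model, re-running a batch with INSERT duplicates entries, while UPDATE keeps one entry per record, which is why UPDATE is the safer choice for repeated runs over the same table.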

Batch-indexing a large number of files

    IRDocOperatorFacade docOperatorHelper =
            retrievalApplicationContext
                    .getFacade()
                    .createDocOperatorFacade();

    RFileIndexAllItem fileIndexAllItem =
            retrievalApplicationContext
                    .getFacade()
                    .createFileIndexAllItem(false, "utf-8");
    fileIndexAllItem.setIndexPathType("FILE");

    // To add a new index entry whether or not the record already exists,
    // use RetrievalType.RIndexOperatorType.INSERT.
    // To update the matching entry when the record is already in the index,
    // and add a new entry otherwise, use RetrievalType.RIndexOperatorType.UPDATE.
    fileIndexAllItem
            .setIndexOperatorType(RetrievalType.RIndexOperatorType.INSERT);
    fileIndexAllItem.setIndexInfoType("SFILE");

    fileIndexAllItem
            .setFileBasePath("D:\\workspace\\resources\\docs");
    fileIndexAllItem.setIncludeSubDir(true);
    fileIndexAllItem.setPageSize(100);
    fileIndexAllItem
            .setIndexAllFileInterceptor(new TestFileIndexAllInterceptor());

    // To index files of every type, skip the settings below. Once any type
    // is set, only files of the listed types are indexed.
    fileIndexAllItem.addFileType("doc");
    fileIndexAllItem.addFileType("docx");
    fileIndexAllItem.addFileType("sql");
    fileIndexAllItem.addFileType("html");
    fileIndexAllItem.addFileType("htm");
    fileIndexAllItem.addFileType("txt");
    fileIndexAllItem.addFileType("xls");

    long count = docOperatorHelper.createAll(fileIndexAllItem);

    // Optimize the index
    retrievalApplicationContext
            .getFacade()
            .createIndexOperatorFacade("FILE")
            .optimize();
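The file type rule (no types set means every file, otherwise only the listed types) can be sketched as a small filter. This is an illustration only: FileTypeFilterSketch is a hypothetical class, and matching extensions case-insensitively is my assumption, not documented behavior.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class FileTypeFilterSketch {

    private final Set<String> fileTypes = new HashSet<String>();

    // Registers an extension such as "doc" (without the dot).
    void addFileType(String fileType) {
        fileTypes.add(fileType.toLowerCase(Locale.ROOT));
    }

    // With no registered types every file is accepted; otherwise only files
    // whose extension was registered are accepted.
    boolean accepts(String fileName) {
        if (fileTypes.isEmpty()) {
            return true;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension to match against
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return fileTypes.contains(extension);
    }

    public static void main(String[] args) {
        FileTypeFilterSketch filter = new FileTypeFilterSketch();
        filter.addFileType("doc");
        filter.addFileType("txt");
        System.out.println(filter.accepts("report.doc"));
        System.out.println(filter.accepts("photo.png"));
    }
}
```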

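Indexes can be created from multiple threads without corrupting the index files

    Thread thread1 = new Thread(new Runnable() {
        public void run() {
            // create indexes here, one document at a time or in batches
        }
    });

    Thread thread2 = new Thread(new Runnable() {
        public void run() {
            // create indexes here, one document at a time or in batches
        }
    });

    Thread thread3 = new Thread(new Runnable() {
        public void run() {
            // create indexes here, one document at a time or in batches
        }
    });

    thread1.start();
    thread2.start();
    thread3.start();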
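If you would rather not manage raw threads, the same pattern can be written with a fixed thread pool from java.util.concurrent. The sketch below is an alternative to the snippet above, not part of the framework; the indexing call is replaced by a counter so that the example runs on its own, and ConcurrentIndexingSketch is a hypothetical name.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ConcurrentIndexingSketch {

    // Runs taskCount placeholder indexing tasks on a pool of `threads`
    // workers and returns how many tasks completed.
    static int runTasks(int threads, int taskCount) {
        final AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < taskCount; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    // Placeholder: a real task would call
                    // docOperatorFacade.create(...) or createAll(...) here.
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed tasks: " + runTasks(3, 9));
    }
}
```

Waiting for the pool to terminate before optimizing the index keeps the optimize step from overlapping with writes that are still in flight.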

5. Querying

Query through an RQuery instance by passing in a QueryItem that you have built, and sort the results with a QuerySort instance.

    public QueryItem createQueryItem(
            RetrievalType.RDocItemType docItemType,
            Object name,
            String value) {
        QueryItem queryItem =
                retrievalApplicationContext
                        .getFacade()
                        .createQueryItem(docItemType,
                                String.valueOf(name),
                                value);
        return queryItem;
    }

    // testQuery: an instance of the class that declares createQueryItem(...) above.
    // indexPathType: the index path type to search, e.g. "DB" or "FILE".
    IRQueryFacade queryFacade =
            retrievalApplicationContext
                    .getFacade()
                    .createQueryFacade();
    RQuery query = queryFacade.createRQuery(indexPathType);

    QueryItem queryItem0 =
            testQuery.createQueryItem(RetrievalType.RDocItemType.CONTENT,
                    "TITLE", "啊啊");
    QueryItem queryItem1 =
            testQuery.createQueryItem(RetrievalType.RDocItemType.CONTENT,
                    "TITLE", "");
    QueryItem queryItem2 =
            testQuery.createQueryItem(RetrievalType.RDocItemType.CONTENT,
                    "CONTENT", "工作");
    QueryItem queryItem3 =
            testQuery.createQueryItem(RetrievalType.RDocItemType.CONTENT,
                    "CONTENT", "地方");
    QueryItem queryItem4 =
            testQuery.createQueryItem(RetrievalType.RDocItemType.CONTENT,
                    "FIELD3", "过节");
    QueryItem queryItem5 =
            testQuery.createQueryItem(RetrievalType.RDocItemType.CONTENT,
                    "FIELD4", "高兴");

    // Combine the items into one boolean query
    QueryItem queryItem =
            queryItem0
                    .should(QueryItem.SHOULD, queryItem1)
                    .should(queryItem2)
                    .should(queryItem3.mustNot(QueryItem.SHOULD, queryItem4))
                    .should(queryItem5);

    // Sort by relevance score
    QuerySort querySort = new QuerySort(QueryUtil.createScoreSort());
    QueryResult[] queryResults =
            query.getQueryResults(queryItem, querySort);
    query.close();
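As a rough picture of what a score sort combined with the QUERY_RESULT_TOP_DOCS_NUM limit amounts to, the sketch below orders hits by descending score and keeps the first N. It is a plain-Java model with hypothetical names (ScoreSortSketch, Hit, topByScore); it assumes higher scores rank first and does not use QuerySort, QueryUtil, or QueryResult.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ScoreSortSketch {

    // A stand-in for one query hit: a document id plus a relevance score.
    static final class Hit {
        final String id;
        final float score;

        Hit(String id, float score) {
            this.id = id;
            this.score = score;
        }
    }

    // Returns at most topDocsNum hits, highest score first. The input list
    // is left unchanged.
    static List<Hit> topByScore(List<Hit> hits, int topDocsNum) {
        List<Hit> sorted = new ArrayList<Hit>(hits);
        Collections.sort(sorted, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score); // descending
            }
        });
        int limit = Math.min(Math.max(topDocsNum, 0), sorted.size());
        return new ArrayList<Hit>(sorted.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<Hit>();
        hits.add(new Hit("a", 0.2f));
        hits.add(new Hit("b", 0.9f));
        hits.add(new Hit("c", 0.5f));
        for (Hit hit : topByScore(hits, 2)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```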

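6. Extensions

There are two ways to extend the framework:

1) Specify the extension classes in the configuration file. They are read and applied automatically at load time.

2) Set the extension classes on a RetrievalProperties instance, and use that instance to create the RetrievalApplicationContext instance.

IFileContentParserManager

Implement this interface to replace the whole file content parser manager and change how file content is parsed.

Alternatively, replace or add individual file parsers in the existing file content parser manager: implement the IFileContentParser interface and register it as a new or replacement content parser for a file type, as follows:

    retrievalApplicationContext
            .getRetrievalFactory()
            .getFileContentParserManager()
            .regFileContentParser("docx", fileContentParser);

IRAnalyzerBuilder

Implement this interface to replace the analyzer builder.

IHighlighterMaker

Implement this interface to replace the content highlighter.

IRDatabaseIndexAll

Implement this interface to read database records in batches and write them to the index.

Alternatively, extend the AbstractRDatabaseIndexAll abstract class directly and implement its abstract method:

    /**
     * Returns the current page of database records. Each call to this
     * method returns one page of records.
     * @return the records of the current page
     */
    public abstract List<Map> getResultList();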
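The "one page per call" contract of getResultList() can be illustrated with a self-contained pager over an in-memory list. This sketch does not extend the framework's abstract class; PagedResultSketch and sampleRows are hypothetical, and returning an empty list once the data is exhausted is my assumption about how the end of the data would be signalled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PagedResultSketch {

    private final List<Map<String, Object>> rows;
    private final int pageSize;
    private int offset = 0;

    PagedResultSketch(List<Map<String, Object>> rows, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.rows = rows;
        this.pageSize = pageSize;
    }

    // Each call returns the next page of records; an empty list means that
    // there are no more records (assumed convention).
    List<Map<String, Object>> getResultList() {
        if (offset >= rows.size()) {
            return Collections.emptyList();
        }
        int end = Math.min(offset + pageSize, rows.size());
        List<Map<String, Object>> page =
                new ArrayList<Map<String, Object>>(rows.subList(offset, end));
        offset = end;
        return page;
    }

    // Builds n sample records with an ID column, standing in for a result set.
    static List<Map<String, Object>> sampleRows(int n) {
        List<Map<String, Object>> result = new ArrayList<Map<String, Object>>();
        for (int i = 1; i <= n; i++) {
            Map<String, Object> row = new HashMap<String, Object>();
            row.put("ID", Integer.valueOf(i));
            result.add(row);
        }
        return result;
    }

    public static void main(String[] args) {
        PagedResultSketch pager = new PagedResultSketch(sampleRows(5), 2);
        List<Map<String, Object>> page = pager.getResultList();
        while (!page.isEmpty()) {
            System.out.println("page with " + page.size() + " record(s)");
            page = pager.getResultList();
        }
    }
}
```

A real implementation would fetch each page from the database instead of slicing a list, for example with a paginated SQL query ordered by the key field.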

7. Miscellaneous

For more detailed examples, see the code under test.

The snoics-retrieval project uses snoics-base.jar. To get the source code of snoics-base.jar, download it from http://code.google.com/p/snoics-base/

III. About

Project page: http://code.google.com/p/snoics-retrieval/

Email : snoics@gmail.com

Blog : http://www.blogjava.net/snoics/

Posted on 2010-07-26 08:06 by snoics
