Lucene in Action, Ch. 6 (I) Notes: Custom Sorting

----- 2006-2-16

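When you use Lucene to search content, the order in which results are displayed obviously matters. Lucene's built-in sort options are often not suitable for what an application needs, and in that case the only way to fit your own scenario is to define a custom sort. This section looks at how to implement custom sorting in Lucene.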

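Custom sorting in Lucene works much like custom sorting for Java collections: in both cases you implement a comparison interface. In plain Java, implementing Comparable (or supplying a Comparator) is enough. In Lucene you implement two interfaces, SortComparatorSource and ScoreDocComparator. Before turning to the implementation, let's look at how these two interfaces are defined.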
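For comparison, the plain-Java side of that analogy might look like the following minimal sketch. The Place class and its fields are hypothetical, invented only for illustration; nothing here is Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ComparableSketch {
    // Hypothetical value type whose natural order (Comparable) is by name.
    static class Place implements Comparable<Place> {
        final String name;
        final int rating;

        Place(String name, int rating) {
            this.name = name;
            this.rating = rating;
        }

        @Override
        public int compareTo(Place other) {
            return name.compareTo(other.name);
        }
    }

    // Natural order: Collections.sort uses Place.compareTo.
    static List<String> namesInNaturalOrder(List<Place> places) {
        List<Place> copy = new ArrayList<>(places);
        Collections.sort(copy);
        return names(copy);
    }

    // Custom order: an external Comparator, highest rating first.
    static List<String> namesByRatingDescending(List<Place> places) {
        List<Place> copy = new ArrayList<>(places);
        Collections.sort(copy, new Comparator<Place>() {
            @Override
            public int compare(Place a, Place b) {
                return Integer.compare(b.rating, a.rating);
            }
        });
        return names(copy);
    }

    private static List<String> names(List<Place> places) {
        List<String> out = new ArrayList<>();
        for (Place p : places) {
            out.add(p.name);
        }
        return out;
    }
}
```

The Lucene interfaces split the same idea in two: one object creates the comparator for a given index and field, and the other does the pairwise comparison.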

The SortComparatorSource interface returns a comparator used to sort ScoreDocs (Expert: returns a comparator for sorting ScoreDocs). It defines a single method:

public ScoreDocComparator newComparator(IndexReader reader, String fieldname) throws IOException

Creates a comparator for the field in the given index.

Parameters:

reader - Index to create comparator for.

fieldname - Field to create comparator for.

Returns:

Comparator of ScoreDoc objects.

Throws:

IOException - If an error occurs reading the index.

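This method only creates a ScoreDocComparator instance that does the actual sorting, so we also have to implement the ScoreDocComparator interface. Its job is to compare two ScoreDoc objects for sorting (Compares two ScoreDoc objects for sorting). It defines two static instances that Lucene implements for you:

public static final ScoreDocComparator RELEVANCE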

Special comparator for sorting hits according to computed relevance (document score).

public static final ScoreDocComparator INDEXORDER

Special comparator for sorting hits according to index order (document number).

It declares three sort-related methods that we need to implement:

public int compare(ScoreDoc i,ScoreDoc j)

Compares two ScoreDoc objects and returns a result indicating their sort order.

Parameters:

i - First ScoreDoc

j - Second ScoreDoc

Returns:

-1 if i should come before j
1 if i should come after j
0 if they are equal

public Comparable sortValue(ScoreDoc i)

Returns the value used to sort the given document. The object returned must implement the java.io.Serializable interface. This is used by multisearchers to determine how to collate results from their searchers.

Parameters:

i - Document

Returns:

Serializable object

public int sortType()

Returns the type of sort. Should return SortField.SCORE, SortField.DOC, SortField.STRING, SortField.INTEGER, SortField.FLOAT or SortField.CUSTOM. It is not valid to return SortField.AUTO. This is used by multisearchers to determine how to collate results from their searchers.

Returns:

One of the constants in SortField.

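Let's look at an example.

The example is an implementation from Lucene in Action that finds the names of the restaurants closest to you. Each restaurant's coordinates are stored as a string of the form "x,y", as in the following figure: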

Figure 6.1 Which Mexican restaurant is closest to home (at 0,0) or work (at 10,10)?
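As a minimal sketch of the arithmetic this sort relies on, the following parses an "x,y" string and computes its straight-line distance from a reference point. The class and method names are hypothetical; the point 9,6 is one of the restaurant locations used in the test later in this post.

```java
public class DistanceSketch {
    // Parses a location stored as "x,y" and returns its Euclidean
    // distance from the reference point (originX, originY).
    static float distanceFrom(String location, int originX, int originY) {
        String[] xy = location.split(",");
        int deltaX = Integer.parseInt(xy[0].trim()) - originX;
        int deltaY = Integer.parseInt(xy[1].trim()) - originY;
        return (float) Math.sqrt(deltaX * deltaX + deltaY * deltaY);
    }

    public static void main(String[] args) {
        System.out.println(distanceFrom("9,6", 10, 10)); // sqrt(17), prints 4.1231055
        System.out.println(distanceFrom("3,4", 0, 0));   // prints 5.0
    }
}
```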

In this situation Lucene's built-in sorting cannot do the job, so let's see how to implement it ourselves.

package lia.extsearch.sorting;

import org.apache.lucene.search.SortComparatorSource;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

import java.io.IOException;

// DistanceComparatorSource implements the SortComparatorSource interface
public class DistanceComparatorSource implements SortComparatorSource {
  // x and y hold the reference coordinates
  private int x;
  private int y;

  public DistanceComparatorSource(int x, int y) {
    this.x = x;
    this.y = y;
  }

  // Returns the ScoreDocComparator that performs the sorting
  public ScoreDocComparator newComparator(IndexReader reader, String fieldname)
      throws IOException {
    return new DistanceScoreDocLookupComparator(reader, fieldname, x, y);
  }

  // DistanceScoreDocLookupComparator implements ScoreDocComparator to do the sorting
  private static class DistanceScoreDocLookupComparator implements
      ScoreDocComparator {
    private float[] distances; // distance from each restaurant to the given point

    // The constructor does almost all of the preparation work.
    public DistanceScoreDocLookupComparator(IndexReader reader,
        String fieldname, int x, int y) throws IOException {

      final TermEnum enumerator = reader.terms(new Term(fieldname, ""));
      distances = new float[reader.maxDoc()]; // initialize distances
      if (distances.length > 0) {
        TermDocs termDocs = reader.termDocs();
        try {
          if (enumerator.term() == null) {
            throw new RuntimeException("no terms in field "
                + fieldname);
          }
          int i = 0, j = 0; // counters for the debug output only
          do {
            System.out.println("in do-while :" + i++);

            Term term = enumerator.term(); // fetch the current Term
            // Stop once the enumeration has moved past the requested field.
            // equals() compares contents; the original used !=, which only
            // works when both field names are interned strings.
            if (!term.field().equals(fieldname))
              break;
            // Sets this to the data for the current term in a TermEnum.
            // This may be optimized in some implementations.
            termDocs.seek(enumerator); // see the TermDocs Javadoc
            while (termDocs.next()) {
              System.out.println("    in while :" + j++);
              System.out.println("    in while ,Term :" + term.toString());

              String[] xy = term.text().split(","); // extract x and y
              int deltax = Integer.parseInt(xy[0]) - x;
              int deltay = Integer.parseInt(xy[1]) - y;
              // compute the distance
              distances[termDocs.doc()] = (float) Math
                  .sqrt(deltax * deltax + deltay * deltay);
            }
          } while (enumerator.next());
        } finally {
          termDocs.close();
        }
      }
    }

    // With the constructor's preparation done, the comparison is simple
    public int compare(ScoreDoc i, ScoreDoc j) {
      if (distances[i.doc] < distances[j.doc])
        return -1;
      if (distances[i.doc] > distances[j.doc])
        return 1;
      return 0;
    }

    // Returns the distance
    public Comparable sortValue(ScoreDoc i) {
      return new Float(distances[i.doc]);
    }

    // Specifies the sort type
    public int sortType() {
      return SortField.FLOAT;
    }
  }

  public String toString() {
    return "Distance from (" + x + "," + y + ")";
  }

}

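These two classes implement the two interfaces described above, and the code is commented in detail. As you can see, custom sorting is not very difficult. To check that the implementation is correct, let's see whether the test code passes.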

package lia.extsearch.sorting;

import junit.framework.TestCase;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.store.RAMDirectory;

import java.io.IOException;

import lia.extsearch.sorting.DistanceComparatorSource;

// Tests the custom sort implementation
public class DistanceSortingTest extends TestCase {
  private RAMDirectory directory;

  private IndexSearcher searcher;

  private Query query;

  // Set up the test fixture
  protected void setUp() throws Exception {
    directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory,
        new WhitespaceAnalyzer(), true);
    addPoint(writer, "El Charro", "restaurant", 1, 2);
    addPoint(writer, "Cafe Poca Cosa", "restaurant", 5, 9);
    addPoint(writer, "Los Betos", "restaurant", 9, 6);
    addPoint(writer, "Nico's Taco Shop", "restaurant", 3, 8);

    writer.close();

    searcher = new IndexSearcher(directory);

    query = new TermQuery(new Term("type", "restaurant"));
  }

  private void addPoint(IndexWriter writer, String name, String type, int x,
      int y) throws IOException {
    Document doc = new Document();
    doc.add(Field.Keyword("name", name));
    doc.add(Field.Keyword("type", type));
    doc.add(Field.Keyword("location", x + "," + y));
    writer.addDocument(doc);
  }

  public void testNearestRestaurantToHome() throws Exception {
    // Build a SortField from a DistanceComparatorSource
    Sort sort = new Sort(new SortField("location",
        new DistanceComparatorSource(0, 0)));

    Hits hits = searcher.search(query, sort); // search

    // Verify the order
    assertEquals("closest", "El Charro", hits.doc(0).get("name"));
    assertEquals("furthest", "Los Betos", hits.doc(3).get("name"));
  }

  public void testNeareastRestaurantToWork() throws Exception {
    Sort sort = new Sort(new SortField("location",
        new DistanceComparatorSource(10, 10))); // work is at 10,10

    // The test above exercises the custom sort but cannot see the values
    // it sorted by; TopFieldDocs gives access to them
    TopFieldDocs docs = searcher.search(query, null, 3, sort);

    assertEquals(4, docs.totalHits);
    assertEquals(3, docs.scoreDocs.length);

    // A FieldDoc carries the values used for sorting; see the FieldDoc Javadoc
    FieldDoc fieldDoc = (FieldDoc) docs.scoreDocs[0];

    assertEquals("(10,10) -> (9,6) = sqrt(17)", new Float(Math.sqrt(17)),
        fieldDoc.fields[0]);

    Document document = searcher.doc(fieldDoc.doc);
    assertEquals("Los Betos", document.get("name"));

    dumpDocs(sort, docs); // print the sort details
  }

  // Prints information about the sort
  private void dumpDocs(Sort sort, TopFieldDocs docs) throws IOException {
    System.out.println("Sorted by: " + sort);
    ScoreDoc[] scoreDocs = docs.scoreDocs;
    for (int i = 0; i < scoreDocs.length; i++) {
      FieldDoc fieldDoc = (FieldDoc) scoreDocs[i];
      Float distance = (Float) fieldDoc.fields[0];
      Document doc = searcher.doc(fieldDoc.doc);
      System.out.println("   " + doc.get("name") + " @ ("
          + doc.get("location") + ") -> " + distance);
    }
  }
}

All the tests pass.

The output is shown below for anyone who wants to study the details:

in do-while :0
    in while :0
    in while ,Term :location:1,2
in do-while :1
    in while :1
    in while ,Term :location:3,8
in do-while :2
    in while :2
    in while ,Term :location:5,9
in do-while :3
    in while :3
    in while ,Term :location:9,6
in do-while :4
in do-while :0
    in while :0
    in while ,Term :location:1,2
in do-while :1
    in while :1
    in while ,Term :location:3,8
in do-while :2
    in while :2
    in while ,Term :location:5,9
in do-while :3
    in while :3
    in while ,Term :location:9,6
in do-while :4
Sorted by: <custom:"location": Distance from (10,10)>
Los Betos @ (9,6) -> 4.1231055
Cafe Poca Cosa @ (5,9) -> 5.0990195
Nico's Taco Shop @ (3,8) -> 7.28011

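For more detail on how the sort values are retrieved, see the implementation of the testNeareastRestaurantToWork method.

As the above shows, implementing a custom sort is not hard.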
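The core idea of the comparator, precomputing one distance per document id so that each comparison is just two array lookups, can be sketched without Lucene. In this sketch the locations array is a hypothetical stand-in for the index (position = document id), filled with the four locations from the test in insertion order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class LookupComparatorSketch {
    // Returns document ids ordered by ascending distance from (x, y).
    // locations[docId] holds that document's "x,y" string.
    static List<Integer> sortByDistance(String[] locations, int x, int y) {
        // Precompute one distance per document id, as the comparator's
        // constructor in the post does.
        final float[] distances = new float[locations.length];
        List<Integer> docIds = new ArrayList<>();
        for (int doc = 0; doc < locations.length; doc++) {
            String[] xy = locations[doc].split(",");
            int dx = Integer.parseInt(xy[0]) - x;
            int dy = Integer.parseInt(xy[1]) - y;
            distances[doc] = (float) Math.sqrt(dx * dx + dy * dy);
            docIds.add(doc);
        }
        // Comparing two documents is now two array lookups.
        Collections.sort(docIds, new Comparator<Integer>() {
            @Override
            public int compare(Integer i, Integer j) {
                return Float.compare(distances[i], distances[j]);
            }
        });
        return docIds;
    }

    public static void main(String[] args) {
        String[] locations = { "1,2", "5,9", "9,6", "3,8" };
        System.out.println(sortByDistance(locations, 10, 10)); // prints [2, 1, 3, 0]
        System.out.println(sortByDistance(locations, 0, 0));   // prints [0, 3, 1, 2]
    }
}
```

Document 2 is Los Betos at 9,6, which is why it comes first when sorting from (10,10) and last when sorting from (0,0), matching the assertions in the test.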

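Next, let's look at HitCollector.

Normally a search displays only the most relevant results, but sometimes the user wants every matching result without accessing the documents' contents. In that case a custom HitCollector is an efficient way to do it.

Let's look at a test example. Here we implement BookLinkCollector, a custom HitCollector that keeps a Map from the URL of each book matching the query to its title. HitCollector has one method to implement, collect, which is documented as follows: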

public abstract void collect(int doc, float score)

Called once for every non-zero scoring document, with the document number and its score.

If, for example, an application wished to collect all of the hits for a query in a BitSet, then it might:

Searcher searcher = new IndexSearcher(indexReader);
final BitSet bits = new BitSet(indexReader.maxDoc());
searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
      bits.set(doc);
    }
  });

Note: This is called in an inner search loop. For good search performance, implementations of this method should not call Searchable.doc(int) or IndexReader.document(int) on every document number encountered. Doing so can slow searches by an order of magnitude or more.

Note: The score passed to this method is a raw score. In other words, the score will not necessarily be a float whose value is between 0 and 1.
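The first note suggests a pattern: record only document ids inside collect, and load the documents after the search loop has finished. The following sketch illustrates that pattern in plain Java. The Collector interface and the search method are hypothetical stand-ins for HitCollector and the searcher, and the scores and titles arrays are made-up inputs; none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class DeferredCollectSketch {
    // Hypothetical stand-in for HitCollector's callback.
    interface Collector {
        void collect(int doc, float score);
    }

    // Hypothetical stand-in for the search loop: reports every document
    // whose raw score is non-zero.
    static void search(float[] scores, Collector collector) {
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] != 0f) {
                collector.collect(doc, scores[doc]);
            }
        }
    }

    // Collects matching ids cheaply first, then resolves titles afterwards.
    static List<String> matchingTitles(float[] scores, String[] titles) {
        final BitSet bits = new BitSet(scores.length);
        search(scores, new Collector() {
            @Override
            public void collect(int doc, float score) {
                bits.set(doc); // cheap: no document lookup inside the loop
            }
        });
        List<String> result = new ArrayList<>();
        // The potentially expensive lookup runs only after the search loop.
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            result.add(titles[doc]);
        }
        return result;
    }
}
```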

Here is the BookLinkCollector implementation:

package lia.extsearch.hitcollector;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;

import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// A custom HitCollector; the implementation is fairly simple
public class BookLinkCollector extends HitCollector {
  private IndexSearcher searcher;
  // Map from URL to title
  private HashMap documents = new HashMap();

  public BookLinkCollector(IndexSearcher searcher) {
    this.searcher = searcher;
  }

  // The one method HitCollector requires
  public void collect(int id, float score) {
    try {
      // Note: this loads the stored document for every hit, which the
      // Javadoc note on collect warns can slow searches considerably
      Document doc = searcher.doc(id);
      documents.put(doc.get("url"), doc.get("title"));
      System.out.println(doc.get("title") + ":" + score);
    } catch (IOException e) {
      // ignore
    }
  }

  public Map getLinks() {
    return Collections.unmodifiableMap(documents);
  }
}

Test code:

package lia.extsearch.hitcollector;

import lia.common.LiaTestCase;
import lia.extsearch.hitcollector.BookLinkCollector;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.Hits;

import java.util.Map;

public class HitCollectorTest extends LiaTestCase {

  public void testCollecting() throws Exception {
    TermQuery query = new TermQuery(new Term("contents", "junit"));
    IndexSearcher searcher = new IndexSearcher(directory);

    // BookLinkCollector takes the searcher as its argument
    BookLinkCollector collector = new BookLinkCollector(searcher);
    searcher.search(query, collector); // search

    Map linkMap = collector.getLinks();
    // Verify the collected link
    assertEquals("Java Development with Ant", linkMap
        .get("http://www.manning.com/antbook"));

    Hits hits = searcher.search(query);
    dumpHits(hits);

    searcher.close();
  }
}

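The implementation is fairly simple. For more on how to use it, see Lucene in Action or my blog.

III. Implementing a Custom Filter

With the sorting code above as a model, implementing a custom Filter is also simple: you only need to implement the single abstract method of the Filter class, shown here:

public abstract BitSet bits(IndexReader reader) throws IOException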

Returns a BitSet with true for documents which should be permitted in search results, and false for those that should not.
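To make the contract concrete, here is a minimal plain-Java sketch of a bit set with one bit per document id being used to gate query matches. The method names and the allowedDocs and matches inputs are hypothetical; the sketch only mirrors the idea and does not call Lucene.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class BitSetFilterSketch {
    // Builds a bit set with one bit per document id; a set bit means the
    // document may appear in search results.
    static BitSet bits(int maxDoc, int[] allowedDocs) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : allowedDocs) {
            bits.set(doc);
        }
        return bits;
    }

    // Keeps only those query matches whose bit is set in the filter.
    static List<Integer> applyFilter(int[] matches, BitSet filter) {
        List<Integer> permitted = new ArrayList<>();
        for (int doc : matches) {
            if (filter.get(doc)) {
                permitted.add(doc);
            }
        }
        return permitted;
    }
}
```

For example, with bits(6, new int[] {1, 4}) as the filter and matches {0, 1, 2, 4, 5}, only documents 1 and 4 are kept.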

Here is an example:

package lia.extsearch.filters;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

import java.io.IOException;
import java.util.BitSet;

import lia.extsearch.filters.SpecialsAccessor;

public class SpecialsFilter extends Filter {
  // Interface for fetching the ISBNs; decoupled for easier reuse
  private SpecialsAccessor accessor;

  public SpecialsFilter(SpecialsAccessor accessor) {
    this.accessor = accessor;
  }

  // Override this method to implement the custom Filter
  /**
   * Returns a BitSet with true for documents which should be permitted in
   * search results, and false for those that should not
   */
  public BitSet bits(IndexReader reader) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());

    String[] isbns = accessor.isbns();

    int[] docs = new int[1];
    int[] freqs = new int[1];

    for (int i = 0; i < isbns.length; i++) {
      String isbn = isbns[i];
      if (isbn != null) {
        TermDocs termDocs = reader.termDocs(new Term("isbn", isbn));
        try {
          // Each ISBN is expected to match at most one document
          int count = termDocs.read(docs, freqs);
          if (count == 1) {
            bits.set(docs[0]);
          }
        } finally {
          termDocs.close(); // release the enumeration (not closed in the original)
        }
      }
    }

    return bits;
  }

  public String toString() {
    return "SpecialsFilter";
  }
}

It uses the following interface:

package lia.extsearch.filters;

// Interface for retrieving the values to filter on
public interface SpecialsAccessor {
  String[] isbns();
}

and a mock object implementation:

package lia.extsearch.filters;

// A mock object implementation
public class MockSpecialsAccessor implements SpecialsAccessor {
  private String[] isbns;

  public MockSpecialsAccessor(String[] isbns) {
    this.isbns = isbns;
  }

  public String[] isbns() {
    return isbns;
  }
}

The test code is as follows:

package lia.extsearch.filters;

import lia.common.LiaTestCase;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.index.Term;

// Tests the custom Filter
public class SpecialsFilterTest extends LiaTestCase {
  private Query allBooks;

  private IndexSearcher searcher;

  // Set up the test fixture
  protected void setUp() throws Exception {
    super.setUp();

    allBooks = new RangeQuery(new Term("pubmonth", "190001"), new Term(
        "pubmonth", "200512"), true);
    searcher = new IndexSearcher(directory);
  }

  // Test the filter on its own
  public void testCustomFilter() throws Exception {
    String[] isbns = new String[] { "0060812451", "0465026567" };

    SpecialsAccessor accessor = new MockSpecialsAccessor(isbns);
    Filter filter = new SpecialsFilter(accessor);
    Hits hits = searcher.search(allBooks, filter);
    assertEquals("the specials", isbns.length, hits.length());
  }

  // Using the new FilteredQuery, though, you can apply a
  // Filter to a particular query clause of a BooleanQuery.
  // FilteredQuery was added in Lucene 1.4; for details see
  // Lucene in Action and the FilteredQuery Javadoc
  public void testFilteredQuery() throws Exception {
    String[] isbns = new String[] { "0854402624" }; // Steiner

    SpecialsAccessor accessor = new MockSpecialsAccessor(isbns);
    Filter filter = new SpecialsFilter(accessor);

    WildcardQuery educationBooks = new WildcardQuery(new Term("category",
        "*education*"));
    FilteredQuery edBooksOnSpecial = new FilteredQuery(educationBooks,
        filter);

    TermQuery logoBooks = new TermQuery(new Term("subject", "logo"));

    BooleanQuery logoOrEdBooks = new BooleanQuery();
    logoOrEdBooks.add(logoBooks, false, false);
    logoOrEdBooks.add(edBooksOnSpecial, false, false);

    Hits hits = searcher.search(logoOrEdBooks);
    System.out.println(logoOrEdBooks.toString());
    assertEquals("Papert and Steiner", 2, hits.length());
  }
}
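The second test applies the filter to only one clause of an OR query. In set terms the result is clauseA OR (clauseB AND filter), which the following plain-Java sketch expresses with java.util.BitSet. The method and parameter names are hypothetical, and the bit sets stand in for the document sets that each Lucene clause would match.

```java
import java.util.BitSet;

public class FilteredClauseSketch {
    // Documents matching: clauseA OR (clauseB AND filter).
    // The filter restricts only clauseB, not the whole query.
    static BitSet orWithFilteredClause(BitSet clauseA, BitSet clauseB, BitSet filter) {
        BitSet filteredB = (BitSet) clauseB.clone(); // leave the inputs untouched
        filteredB.and(filter);
        BitSet result = (BitSet) clauseA.clone();
        result.or(filteredB);
        return result;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        a.set(0);
        BitSet b = new BitSet();
        b.set(1);
        b.set(2);
        BitSet filter = new BitSet();
        filter.set(2);
        filter.set(3);
        System.out.println(orWithFilteredClause(a, b, filter)); // prints {0, 2}
    }
}
```

Document 0 survives even though the filter does not permit it, because the filter was never applied to clauseA.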

Tomorrow: extending QueryParser, and Lucene's performance, to see just how fast Lucene really is!

posted on 2007-01-05 10:27 by Lansing. Category: Search Engines
