Using Lucene to count results in categories

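I am trying to use Lucene Java 2.3.2 to implement search over a catalog of products. Apart from the regular fields for a product, there is a field called "Category". A product can fall into multiple categories. Currently I use FilteredQuery to search for the same search term within every category, to get the number of results per category.

This results in 20-30 internal search calls per query just to display the results, which slows the search down considerably. Is there a faster way of achieving the same result using Lucene?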

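Here is what I did, though it is a bit heavy on memory:

What you need is to create in advance a bunch of BitSets, one for each category, each containing the doc ids of all the documents in that category. Then, at search time, you use a HitCollector and check each doc id against the BitSets.

Here is the code to create the bit sets: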

    public BitSet[] getBitSets(IndexSearcher indexSearcher, Category[] categories) {
        BitSet[] bitSets = new BitSet[categories.length];
        for(int i=0; i ...


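That is just one way to do it. If your categories are simple enough you could use TermDocs instead of running a full search, but this should only run once, when you load the index, anyway.

Now, when it is time to count the categories of the search results, you do this: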
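    public int[] getCategoryCount(IndexSearcher indexSearcher, Query query, final BitSet[] bitSets) {
        final int[] count = new int[bitSets.length];
        indexSearcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                // Bump the counter of every category whose BitSet contains this hit.
                for (int i = 0; i < bitSets.length; i++) {
                    if (bitSets[i].get(doc)) count[i]++;
                }
            }
        });
        return count;
    }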

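What you end up with is an array containing the count of each category within the search results. If you also need the search results themselves, you should add a TopDocCollector to your hit collector (yo dawg...). Or, you could just run the search again. 2 searches are better than 30.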
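To make the bookkeeping concrete without depending on Lucene, here is a minimal sketch of the same idea using only java.util.BitSet. It is an illustration under assumptions, not the code from this answer: the class and method names are made up, the per-category doc id arrays stand in for whatever per-category search or TermDocs pass fills the bit sets, and the hitDocIds array stands in for the doc ids a HitCollector would receive.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Lucene-free sketch: one BitSet per category, a plain int[] for the hits. */
class CategoryBitSetSketch {

    /** Builds one BitSet per category from the doc ids that belong to it. */
    static BitSet[] buildBitSets(int[][] docIdsPerCategory) {
        BitSet[] bitSets = new BitSet[docIdsPerCategory.length];
        for (int i = 0; i < docIdsPerCategory.length; i++) {
            bitSets[i] = new BitSet();
            for (int docId : docIdsPerCategory[i]) {
                bitSets[i].set(docId);
            }
        }
        return bitSets;
    }

    /** Counts, for each category, how many hit doc ids are set in its BitSet. */
    static int[] countPerCategory(int[] hitDocIds, BitSet[] bitSets) {
        int[] counts = new int[bitSets.length];
        for (int docId : hitDocIds) { // stands in for one collect(doc, score) call
            for (int i = 0; i < bitSets.length; i++) {
                if (bitSets[i].get(docId)) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical data: category 0 = {0, 1, 3}, category 1 = {0, 2}, category 2 = {1}.
        BitSet[] bitSets = buildBitSets(new int[][] {{0, 1, 3}, {0, 2}, {1}});
        // Pretend the query matched docs 0, 2 and 3.
        int[] counts = countPerCategory(new int[] {0, 2, 3}, bitSets);
        System.out.println(Arrays.toString(counts)); // prints [2, 2, 0]
    }
}
```

The cost per hit is one bit lookup per category, so a single pass over the hits replaces the 20-30 separate searches; the price is one bit per document per category held in memory.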



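I don't have enough reputation to comment (!), but in Matt Quail's answer I'm pretty sure you could replace this:

    int numDocs = 0;
    td.seek(terms);
    while (td.next()) {
        numDocs++;
    }

with this:

    int numDocs = terms.docFreq();

and then get rid of the td variable altogether. This should make it even faster.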



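You may want to consider looking through all the documents that match each category using a TermDocs iterator.

This example code goes through each "Category" term, and then counts the number of documents that match that term.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    public static void countDocumentsInCategories(IndexReader reader) throws IOException {
        TermEnum terms = null;
        TermDocs td = null;
        try {
            // Position the enumeration on the first term of the "Category" field.
            terms = reader.terms(new Term("Category", ""));
            td = reader.termDocs();
            do {
                Term currentTerm = terms.term();
                // Stop when the terms run out or we leave the "Category" field.
                if (currentTerm == null || !currentTerm.field().equals("Category")) {
                    break;
                }
                // Count the documents that contain the current category term.
                int numDocs = 0;
                td.seek(terms);
                while (td.next()) {
                    numDocs++;
                }
                System.out.println(currentTerm.field() + " : " + currentTerm.text() + " --> " + numDocs);
            } while (terms.next());
        } finally {
            if (td != null) td.close();
            if (terms != null) terms.close();
        }
    }

This code should run reasonably fast even for large indexes.

Here is some code that tests that method:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;

    public static void main(String[] args) throws Exception {
        // Build a small in-memory index with four products.
        RAMDirectory store = new RAMDirectory();
        IndexWriter w = new IndexWriter(store, new StandardAnalyzer());
        addDocument(w, 1, "Apple", "fruit", "computer");
        addDocument(w, 2, "Orange", "fruit", "colour");
        addDocument(w, 3, "Dell", "computer");
        addDocument(w, 4, "Cumquat", "fruit");
        w.close();

        IndexReader r = IndexReader.open(store);
        countDocumentsInCategories(r);
        r.close();
    }

    private static void addDocument(IndexWriter w, int id, String name, String... categories) throws IOException {
        Document d = new Document();
        d.add(new Field("ID", String.valueOf(id), Field.Store.YES, Field.Index.UN_TOKENIZED));
        d.add(new Field("Name", name, Field.Store.NO, Field.Index.UN_TOKENIZED));
        // One untokenized "Category" field per category, so each category is a single term.
        for (String category : categories) {
            d.add(new Field("Category", category, Field.Store.NO, Field.Index.UN_TOKENIZED));
        }
        w.addDocument(d);
    }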



Sachin, I believe you want faceted search. Lucene does not do it out of the box. I suggest you try SOLR, which has faceting as a major and convenient feature.



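So let me see if I understand the question correctly: given a query from the user, you want to show how many matches there are for the query in each category. Correct?

Think of it like this: your query is actually originalQuery AND (category1 OR category2 OR ...), except that as well as an overall score you want to get a number for each of the categories. Unfortunately, the interface for collecting hits in Lucene is very narrow and only gives you an overall score for a query. But you could implement a custom Scorer/Collector.

Have a look at the source of org.apache.lucene.search.DisjunctionSumScorer. You could copy some of that to write a custom scorer that iterates through the category matches while your main search is going on. And you could keep a Map to keep track of the matches in each category.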
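The Map bookkeeping could look roughly like the sketch below. It deliberately leaves out the Lucene Scorer machinery and shows only the counting side, under assumptions: matchingDocIds stands in for the documents the main query matches, and docCategories is a hypothetical in-memory table of each document's categories, standing in for what a custom scorer would read from the index.

```java
import java.util.Map;
import java.util.TreeMap;

/** Lucene-free sketch of keeping a per-category match count in a Map. */
class CategoryMapCountSketch {

    /**
     * For every matching doc id, increments the counter of each category
     * listed for that document in docCategories (indexed by doc id).
     */
    static Map<String, Integer> countMatches(int[] matchingDocIds, String[][] docCategories) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (int docId : matchingDocIds) {
            for (String category : docCategories[docId]) {
                Integer old = counts.get(category);
                counts.put(category, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical categories per document, indexed by doc id.
        String[][] docCategories = {
            {"fruit", "computer"}, // doc 0
            {"fruit", "colour"},   // doc 1
            {"computer"},          // doc 2
            {"fruit"}              // doc 3
        };
        // Pretend the main query matched docs 0, 1 and 2.
        System.out.println(countMatches(new int[] {0, 1, 2}, docCategories));
        // prints {colour=1, computer=2, fruit=2}
    }
}
```

A Map keyed by category name avoids having to fix the list of categories up front, at the cost of a lookup per category per hit.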


