How do I get a list of unique terms from a specific field in Lucene?

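I have an index of a large corpus with several fields. Only one of these fields contains text. I need to extract the unique words from the whole index, based on that field. Does anyone know how I can do that with Lucene in Java?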

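You are looking for term vectors (the set of all the words in a field and the number of times each word is used, excluding stop words). For each document in the index, call IndexReader's getTermFreqVector(docid, field) and use the results to populate a HashSet. This only works if the field was indexed with term vectors enabled; otherwise getTermFreqVector returns null.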

Another approach is to use terms() and pick out only the terms of the field you are interested in:

  IndexReader reader = IndexReader.open(index);
  TermEnum terms = reader.terms();
  Set<String> uniqueTerms = new HashSet<String>();
  while (terms.next()) {
      final Term term = terms.term();
      if (term.field().equals("field_name")) {
          uniqueTerms.add(term.text());
      }
  }

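This is not the optimal solution, since you read and then discard the terms of every other field. Lucene 4 has a Fields class whose terms(field) method returns the terms of a single field only.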

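If you are using the Lucene 4.0 API, you need to get the Fields from the index reader. Fields then provides a way to get the terms for each field in the index. Here is an example of how to do that: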

  Fields fields = MultiFields.getFields(indexReader);
  Terms terms = fields.terms("field");
  TermsEnum iterator = terms.iterator(null);
  BytesRef byteRef = null;
  while ((byteRef = iterator.next()) != null) {
      String term = new String(byteRef.bytes, byteRef.offset, byteRef.length);
  }

Finally, with newer versions of Lucene you can get the string directly from the BytesRef by calling:

  byteRef.utf8ToString();

instead of

  new String(byteRef.bytes, byteRef.offset, byteRef.length);
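One reason to prefer utf8ToString() is that the three-argument String constructor decodes with the platform default charset, whereas utf8ToString() always decodes as UTF-8. Here is a minimal stdlib-only sketch of the explicit UTF-8 decoding, assuming the term bytes are UTF-8 encoded text. The byte array, offset and length stand in for the fields of a BytesRef; the class Utf8SliceDemo and the method decodeSlice are made up for illustration and are not part of Lucene.

```java
import java.nio.charset.StandardCharsets;

public class Utf8SliceDemo {
    // Decodes `length` bytes starting at `offset` as UTF-8,
    // independent of the platform default charset.
    static String decodeSlice(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two terms stored back to back in one buffer, similar to how a
        // BytesRef can point into a larger shared array.
        byte[] buffer = "caf\u00e9term".getBytes(StandardCharsets.UTF_8);

        // "caf\u00e9" takes 5 bytes in UTF-8 (the accented e takes 2),
        // so the second term starts at offset 5 and is 4 bytes long.
        System.out.println(decodeSlice(buffer, 0, 5));
        System.out.println(decodeSlice(buffer, 5, 4));
    }
}
```

Note that offset and length count bytes, not characters, which is why they have to come from the BytesRef itself rather than from the decoded string.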

If you also want the document frequency of a term, you can do the following:

  int docFreq = iterator.docFreq();

You get the same result, only a bit cleaner, by using LuceneDictionary from the lucene-suggest package. It handles a field that does not contain any terms by returning BytesRefIterator.EMPTY. That will save you an NPE 🙂

  LuceneDictionary ld = new LuceneDictionary(indexReader, "field");
  BytesRefIterator iterator = ld.getWordsIterator();
  BytesRef byteRef = null;
  while ((byteRef = iterator.next()) != null) {
      String term = byteRef.utf8ToString();
  }

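The answers that loop over a TermEnum with while (terms.next()) can contain a subtle off-by-one bug. When the TermEnum is already positioned on a term, as it is when you obtain it from reader.terms(Term), while (terms.next()) skips that first term. (According to the Lucene 3.x Javadoc, the enumeration returned by the no-argument reader.terms() is not positioned yet and needs a call to next() before term(), so the while loop is correct in that case.)

For an enumeration that is already positioned, use a for loop instead:

  TermEnum terms = reader.terms(new Term("field_name", ""));
  for (Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
      // do something with the term
  }

To modify the code from the accepted answer so that it starts at the field of interest and stops once the terms of that field are exhausted:

  IndexReader reader = IndexReader.open(index);
  // Positioned on the first term of "field_name", if there is one.
  TermEnum terms = reader.terms(new Term("field_name", ""));
  Set<String> uniqueTerms = new HashSet<String>();
  for (Term term = terms.term();
       term != null && term.field().equals("field_name");
       terms.next(), term = terms.term()) {
      uniqueTerms.add(term.text());
  }
  terms.close();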
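Which loop shape is right depends on whether the enumeration starts on the first term or before it. The sketch below uses only the standard library to show both cases. Cursor is a hypothetical stand-in that mimics a next()/term() contract; it is not a Lucene class, and the names whileLoop and forLoop are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CursorLoopDemo {
    /** Hypothetical stand-in for a term enumeration; not a Lucene class. */
    static final class Cursor {
        private final List<String> terms;
        private int pos; // -1 means "before the first term"

        Cursor(List<String> terms, boolean positioned) {
            this.terms = terms;
            this.pos = positioned ? 0 : -1;
        }

        /** Advances to the next term; returns false once the terms are exhausted. */
        boolean next() {
            if (pos < terms.size()) {
                pos++;
            }
            return pos < terms.size();
        }

        /** Returns the current term, or null before the first or after the last term. */
        String term() {
            return (pos >= 0 && pos < terms.size()) ? terms.get(pos) : null;
        }
    }

    /** while (next()) loop: complete for an unpositioned cursor, skips one term otherwise. */
    static List<String> whileLoop(Cursor c) {
        List<String> out = new ArrayList<String>();
        while (c.next()) {
            out.add(c.term());
        }
        return out;
    }

    /** for loop starting at term(): complete for a positioned cursor, empty otherwise. */
    static List<String> forLoop(Cursor c) {
        List<String> out = new ArrayList<String>();
        for (String t = c.term(); t != null; c.next(), t = c.term()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "banana", "cherry");
        System.out.println(whileLoop(new Cursor(terms, false))); // [apple, banana, cherry]
        System.out.println(whileLoop(new Cursor(terms, true)));  // [banana, cherry]
        System.out.println(forLoop(new Cursor(terms, true)));    // [apple, banana, cherry]
        System.out.println(forLoop(new Cursor(terms, false)));   // []
    }
}
```

Pairing the wrong loop with the wrong starting position either drops the first term or yields nothing at all, so it is worth checking the Javadoc of the method that hands you the enumeration.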
