Incremental indexing in Lucene

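I am building a Java application with Lucene 3.6 and want to make the indexing incremental. I have already created the index, and I read that what you have to do is open the existing index and, for each document, compare the date stored in the index with the file's modification date; if they differ, delete the document from the index and add it again. My problem is that I don't know how to do this in Java with Lucene.

Thanks

My code is: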

public static void main(String[] args)
        throws CorruptIndexException, LockObtainFailedException, IOException {
    File docDir = new File("D:\\PRUEBASLUCENE");
    File indexDir = new File("C:\\PRUEBA");
    Directory fsDir = FSDirectory.open(indexDir);
    Analyzer an = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriter indexWriter = new IndexWriter(fsDir, an, MaxFieldLength.UNLIMITED);
    long numChars = 0L;
    for (File f : docDir.listFiles()) {
        String fileName = f.getName();
        Document d = new Document();
        d.add(new Field("Name", fileName, Store.YES, Index.NOT_ANALYZED));
        d.add(new Field("Path", f.getPath(), Store.YES, Index.ANALYZED));
        long tamano = f.length();
        d.add(new Field("Size", "" + tamano, Store.YES, Index.ANALYZED));
        long fechalong = f.lastModified();
        d.add(new Field("Modification_Date", "" + fechalong, Store.YES, Index.ANALYZED));
        indexWriter.addDocument(d);
    }
    indexWriter.optimize();
    // Read the document count before closing the writer.
    int numDocs = indexWriter.numDocs();
    indexWriter.close();
    System.out.println("Index Directory=" + indexDir.getCanonicalPath());
    System.out.println("Doc Directory=" + docDir.getCanonicalPath());
    System.out.println("num docs=" + numDocs);
    System.out.println("num chars=" + numChars);
}

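Thanks Edmondo1984, you helped me a lot.

In the end I wrote the code shown below. It stores a hash of the file and then checks the modification date.

Indexing 9,300 files takes 15 seconds, and re-indexing (with nothing actually re-indexed, because no file has changed) also takes 15 seconds. Am I doing something wrong, or can I optimize the code so that it takes less time?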

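Thanks jtahlborn. By doing what you suggested with the IndexReader, I managed to bring the update time level with the creation time. But shouldn't updating an existing index be faster than recreating it? Is it possible to optimize the code further?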

if (IndexReader.indexExists(dir)) {
    // reader is an IndexReader and is passed as a parameter to the function
    // searcher is an IndexSearcher and is passed as a parameter to the function
    term = new Term("Hash", String.valueOf(file.hashCode()));
    Query termQuery = new TermQuery(term);
    TopDocs topDocs = searcher.search(termQuery, 1);
    if (topDocs.totalHits == 1) {
        Document doc;
        int docId, comparedate;
        docId = topDocs.scoreDocs[0].doc;
        doc = reader.document(docId);
        String dateIndString = doc.get("Modification_date");
        long dateIndLong = Long.parseLong(dateIndString);
        Date date_ind = new Date(dateIndLong);
        String dateFichString = DateTools.timeToString(file.lastModified(), DateTools.Resolution.MINUTE);
        long dateFichLong = Long.parseLong(dateFichString);
        Date date_fich = new Date(dateFichLong);
        // Compare the two dates
        comparedate = date_fich.compareTo(date_ind);
        if (comparedate >= 0) {
            if (comparedate == 0) {
                // If the comparison is 0, do nothing
                flag = 2;
            } else {
                // If the comparison is > 0, call updateDocument
                flag = 1;
            }
        }

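According to the Lucene data model, you store documents in the index. Each document has fields that you want indexed, the so-called "analyzed" fields, and fields that are not "analyzed", where you can store a timestamp and other information you may need later.

I have the feeling that you are somewhat confusing files and documents: in your first post you talk about documents, and now you are trying to call IndexFileNames.isDocStoreFile(file.getName()), which really only tells you whether a file is one that contains a Lucene index.

If you understand the Lucene object model, writing the code you need takes about three minutes: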

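Check whether the document already exists in the index simply by querying Lucene (for example, by storing a non-analyzed field that contains a unique identifier).
If the query returns 0 documents, add the new document to the index.
If the query returns 1 document, get its "timestamp" field and compare it with that of the new document you are trying to store. Then, if necessary, use the document's docId to delete it from the index and add the new document.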
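
The three steps above can be sketched without any Lucene dependency. This is only an illustration of the decision logic, under stated assumptions: a plain `Map` stands in for the index (key = the unique, non-analyzed id; value = the stored timestamp), and the names `IncrementalIndexSketch`, `Action`, `decide` and `sync` are hypothetical, not part of Lucene. In a real indexer the lookup would be a query on the id field and the write would go through the `IndexWriter`.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Stand-in for the check-then-add/update flow. The map plays the role of
 * the index: key = unique id field, value = stored timestamp (assumption).
 */
public class IncrementalIndexSketch {

    enum Action { ADD, UPDATE, SKIP }

    /** Decides what to do with a file, given what the "index" already holds. */
    static Action decide(Map<String, Long> index, String id, long lastModified) {
        Long indexed = index.get(id);      // stands in for the query by unique id
        if (indexed == null) {
            return Action.ADD;             // 0 hits: the document is new
        }
        if (lastModified > indexed) {
            return Action.UPDATE;          // file is newer than the indexed copy
        }
        return Action.SKIP;                // unchanged: leave the index alone
    }

    /** Applies the decision; ADD and UPDATE both store the new timestamp. */
    static Action sync(Map<String, Long> index, String id, long lastModified) {
        Action action = decide(index, id, lastModified);
        if (action != Action.SKIP) {
            index.put(id, lastModified);   // a real indexer would write the document here
        }
        return action;
    }

    public static void main(String[] args) {
        Map<String, Long> index = new HashMap<String, Long>();
        System.out.println(sync(index, "a.txt", 1000L)); // prints ADD
        System.out.println(sync(index, "a.txt", 1000L)); // prints SKIP
        System.out.println(sync(index, "a.txt", 2000L)); // prints UPDATE
    }
}
```

The point of the SKIP branch is that an unchanged file costs one lookup and no write, which is where an incremental run should save time compared with rebuilding the index.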

If, on the other hand, you are sure that you always want to overwrite the previous value, you can refer to this snippet from Lucene in Action:

public void testUpdate() throws IOException {
    assertEquals(1, getHitCount("city", "Amsterdam"));
    IndexWriter writer = getWriter();
    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("country", "Netherlands", Field.Store.YES, Field.Index.NO));
    doc.add(new Field("contents", "Den Haag has a lot of museums", Field.Store.NO, Field.Index.ANALYZED));
    doc.add(new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED));
    writer.updateDocument(new Term("id", "1"), doc);
    writer.close();
    assertEquals(0, getHitCount("city", "Amsterdam"));
    assertEquals(1, getHitCount("city", "Den Haag"));
}

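As you can see, the snippet uses a non-analyzed id, which is what I suggested: a simple attribute saved so that it can be queried. The updateDocument method first deletes the document and then re-adds it.

You may want to check the javadoc directly: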

http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,org.apache.lucene.document.Document)
