Lucene: exact matches are not shown first

I am using the demo IndexFiles and SearchFiles classes from the org.apache.lucene.demo package to index and search content.

My problem is that when I use a query containing multiple words, I do not get the exact-match results first. For example:

    Enter query: "natural language"
    Searching for: "natural language"
    298 total matching documents
    1. download\researchers.uq.edu.au\fields-of-research\natural-language-processing.txt
    2. download\researchers.uq.edu.au\research-project\16267.txt
    3. download\researchers.uq.edu.au\research-project\16279.txt
    4. download\researchers.uq.edu.au\research-project\18361.txt
    5. download\www.uq.edu.au\news\%3Farticle%3D2187.txt
    6. download\researchers.uq.edu.au\researcher\2115.txt
    7. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody%3Fpage%3D1.txt
    8. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody%3Fpage%3D2.txt
    9. download\ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody.txt
    10. download\www.ceit.uq.edu.au\content\2013-2014-summer-research-scholarship-projects-dr-alan-cody.txt
    Press (n)ext page, (q)uit or enter number to jump to a page.

does not give the same results as:

    Enter query: natural language
    Searching for: natural language
    54307 total matching documents
    1. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D190.txt
    2. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D576.txt
    3. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D46.txt
    4. download\espace.library.uq.edu.au\view\UQ%3A166163.txt
    5. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D108.txt
    6. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D70.txt
    7. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D708.txt
    8. download\researchers.uq.edu.au\fields-of-research\natural-language-processing.txt
    9. download\researchers.uq.edu.au\research-project\16267.txt
    10. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D117.txt
    Press (n)ext page, (q)uit or enter number to jump to a page.

For example, the first matching document does not even contain the keyword "language".

If I use the explain() method of the IndexSearcher class, this is what I get for the first result:

    1. download\cyberschool.library.uq.edu.au\display_resource.phtml%3Frid%3D190.txt
    0.70643383 = (MATCH) sum of:
      0.5590494 = (MATCH) weight(contents:natural in 62541) [DefaultSimilarity], result of:
        0.5590494 = score(doc=62541,freq=4.0 = termFreq=4.0 ), product of:
          0.8091749 = queryWeight, product of:
            4.4216847 = idf(docFreq=13111, maxDocs=401502)
            0.18300149 = queryNorm
          0.6908882 = fieldWeight in 62541, product of:
            2.0 = tf(freq=4.0), with freq of:
              4.0 = termFreq=4.0
            4.4216847 = idf(docFreq=13111, maxDocs=401502)
            0.078125 = fieldNorm(doc=62541)
      0.1473844 = (MATCH) weight(contents:language in 62541) [DefaultSimilarity], result of:
        0.1473844 = score(doc=62541,freq=1.0 = termFreq=1.0 ), product of:
          0.5875679 = queryWeight, product of:
            3.2107275 = idf(docFreq=44012, maxDocs=401502)
            0.18300149 = queryNorm
          0.25083807 = fieldWeight in 62541, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            3.2107275 = idf(docFreq=44012, maxDocs=401502)
            0.078125 = fieldNorm(doc=62541)

If I go to the next page, I find a result like this:

    19. download\www.uq.edu.au\news\%3Farticle%3D2187.txt
    0.47449595 = (MATCH) sum of:
      0.2795247 = (MATCH) weight(contents:natural in 35173) [DefaultSimilarity], result of:
        0.2795247 = score(doc=35173,freq=4.0 = termFreq=4.0 ), product of:
          0.8091749 = queryWeight, product of:
            4.4216847 = idf(docFreq=13111, maxDocs=401502)
            0.18300149 = queryNorm
          0.3454441 = fieldWeight in 35173, product of:
            2.0 = tf(freq=4.0), with freq of:
              4.0 = termFreq=4.0
            4.4216847 = idf(docFreq=13111, maxDocs=401502)
            0.0390625 = fieldNorm(doc=35173)
      0.19497125 = (MATCH) weight(contents:language in 35173) [DefaultSimilarity], result of:
        0.19497125 = score(doc=35173,freq=7.0 = termFreq=7.0 ), product of:
          0.5875679 = queryWeight, product of:
            3.2107275 = idf(docFreq=44012, maxDocs=401502)
            0.18300149 = queryNorm
          0.33182758 = fieldWeight in 35173, product of:
            2.6457512 = tf(freq=7.0), with freq of:
              7.0 = termFreq=7.0
            3.2107275 = idf(docFreq=44012, maxDocs=401502)
            0.0390625 = fieldNorm(doc=35173)

That page does contain the exact phrase "natural language". So my questions are:

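1) Why doesn't Lucene show the exact matches first?

2) Why does Lucene show results that do not even contain the keywords?

3) Where and how can I change this so that exact matches are shown first, followed by the other relevant results?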

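1 – It isn't meant to. See the documentation on the Lucene query syntax. The query natural language is a query made up of two terms. By itself, Lucene has no preference for those terms appearing next to each other. If you want exact matches, a phrase query is the right approach, as in "natural language".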
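The difference between the two query forms can be illustrated with a toy model. This is only a sketch of the matching rule, not Lucene's implementation: the class and method names are made up for illustration, and it assumes the demo's query parser is left at its default OR operator, which would also account for 54307 hits for the unquoted query against 298 for the quoted one.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only: hypothetical names, not part of Lucene.
class TermVsPhrase {

    // Lowercase and split on non-letters, roughly what a simple analyzer does.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z]+"));
    }

    // Unquoted query under the OR operator: one matching term is enough.
    static boolean matchesAnyTerm(List<String> doc, List<String> terms) {
        for (String term : terms) {
            if (doc.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // Quoted phrase query: the terms must appear consecutively and in order.
    static boolean matchesPhrase(List<String> doc, List<String> terms) {
        return Collections.indexOfSubList(doc, terms) >= 0;
    }

    public static void main(String[] args) {
        List<String> query = tokenize("natural language");
        String[] docs = {
            "Natural language processing research",
            "Natural selection and the language of bees",
            "Campus parking map"
        };
        for (String d : docs) {
            List<String> tokens = tokenize(d);
            System.out.println(d + " | any term: " + matchesAnyTerm(tokens, query)
                    + " | phrase: " + matchesPhrase(tokens, query));
        }
    }
}
```

The second sample document matches the unquoted query but not the phrase, which is the situation the question describes.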

2 – Both of the results whose explanations you included contain matches on both terms, see:

    0.2795247 = (MATCH) weight(contents:natural in 35173) [DefaultSimilarity], result of:
      0.2795247 = score(doc=35173,freq=4.0 = termFreq=4.0
    ...
    0.19497125 = (MATCH) weight(contents:language in 35173) [DefaultSimilarity], result of:
      0.19497125 = score(doc=35173,freq=7.0 = termFreq=7.0

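According to Lucene, it found the term "natural" 4 times and the term "language" 7 times in the contents field of that document (I assume that is your default field).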
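The numbers in the two explain() outputs from the question can be reproduced by hand, which also shows why doc 62541 ranks above doc 35173. The sketch below follows the structure printed in the explain tree (queryWeight times fieldWeight, with tf as the square root of the term frequency). The class and method names are hypothetical, and all inputs are copied from the question's output.

```java
// Recomputes the per-term scores shown in the explain() output.
// Hypothetical helper, not a Lucene class.
class ExplainMath {

    // score = queryWeight * fieldWeight, as laid out in the explain tree:
    //   queryWeight = idf * queryNorm
    //   fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)
    static double termScore(double freq, double idf, double queryNorm, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = Math.sqrt(freq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        double queryNorm = 0.18300149;
        double idfNatural = 4.4216847;
        double idfLanguage = 3.2107275;

        // doc 62541: natural x4, language x1, fieldNorm 0.078125
        double doc62541 = termScore(4, idfNatural, queryNorm, 0.078125)
                + termScore(1, idfLanguage, queryNorm, 0.078125);

        // doc 35173: natural x4, language x7, fieldNorm 0.0390625
        double doc35173 = termScore(4, idfNatural, queryNorm, 0.0390625)
                + termScore(7, idfLanguage, queryNorm, 0.0390625);

        System.out.printf("doc 62541: %.4f%n", doc62541);
        System.out.printf("doc 35173: %.4f%n", doc35173);
    }
}
```

The totals come out at roughly 0.7064 and 0.4745, matching the explain output. Doc 35173 contains "language" seven times, but its fieldNorm is half that of doc 62541, and with DefaultSimilarity a smaller fieldNorm generally indicates a longer field. Nothing in this formula rewards the two terms for being adjacent.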

3 – Look through the query parser syntax to see what makes the most sense for you. It sounds like you may find proximity searches useful.

If you simply want phrase matches to rank above the others, you could use something like:

    "natural language" natural language