How to add custom stop words using Lucene in Java

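I am using Lucene to remove English stop words, but my requirement is to remove both the English stop words and my own custom stop words. Below is the code I use to remove the English stop words with Lucene.

My sample code: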

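    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class Stopwords_remove {

        public String removeStopWords(String string) throws IOException {
            TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
            // Drop the default English stop words.
            tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StandardAnalyzer.STOP_WORDS_SET);
            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
            StringBuilder sb = new StringBuilder();
            tokenStream.reset(); // prepare the stream before consuming it
            while (tokenStream.incrementToken()) {
                if (sb.length() > 0) {
                    sb.append(" ");
                }
                sb.append(token.toString());
            }
            tokenStream.close();
            return sb.toString();
        }

        public static void main(String args[]) throws IOException {
            String text = "this is a java project written by james.";
            Stopwords_remove stopwords = new Stopwords_remove();
            System.out.println(stopwords.removeStopWords(text));
        }
    }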

Output: java project written james.

Required output: java project james.

How can I do this?

You can either add your extra stop words to a copy of the standard English stop word set, or just add another StopFilter. For example:

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
    // Work on a copy so the shared default set is left untouched.
    CharArraySet stopSet = CharArraySet.copy(Version.LUCENE_36, StandardAnalyzer.STOP_WORDS_SET);
    stopSet.add("add");
    stopSet.add("your");
    stopSet.add("stop");
    stopSet.add("words");
    tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, stopSet);
    // Or, if you just need the added stop words in a StandardAnalyzer,
    // you could pass this stop set to the StandardAnalyzer constructor:
    // analyzer = new StandardAnalyzer(Version.LUCENE_36, stopSet);

Or:

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
    // First filter: the default English stop words.
    tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StandardAnalyzer.STOP_WORDS_SET);
    List<String> stopWords = //your list of stop words.....
    // Second filter: your custom stop words.
    tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StopFilter.makeStopSet(Version.LUCENE_36, stopWords));
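To see why the two approaches should give the same result, here is a minimal sketch of the idea in plain Java collections. It deliberately does not use Lucene: the class `StopWordSketch`, its methods, and the small `DEFAULT_STOP_WORDS` set are all hypothetical stand-ins made up for illustration, and the tokenizer is a crude regex split rather than `StandardTokenizer`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; none of these names are Lucene APIs.
final class StopWordSketch {

    // Hypothetical stand-in for a built-in English stop set (small subset for illustration).
    static final Set<String> DEFAULT_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "by", "is", "the", "this"));

    // Approach 1: copy the defaults, then add the custom words to the copy.
    static Set<String> mergedStopWords(String... customWords) {
        Set<String> merged = new HashSet<>(DEFAULT_STOP_WORDS);
        merged.addAll(Arrays.asList(customWords));
        return merged;
    }

    // Drops every token found in stopWords; tokens are compared in lower case.
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Very rough tokenizer: splits on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("this is a java project written by james.");
        // Approach 1: one pass with the merged set.
        List<String> onePass = filter(tokens, mergedStopWords("written"));
        // Approach 2: two chained passes, defaults first, then the custom words.
        List<String> twoPass = filter(filter(tokens, DEFAULT_STOP_WORDS),
                new HashSet<>(Arrays.asList("written")));
        System.out.println(String.join(" ", onePass)); // java project james
        System.out.println(onePass.equals(twoPass));   // true
    }
}
```

With "written" as the only custom word, both variants keep the same three tokens. In Lucene itself, the merged-set version means one fewer filter in the chain, while the chained version lets you leave the default set untouched.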

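If you are trying to build your own Analyzer, you would probably be better served by following the pattern shown in the example in the Analyzer documentation.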