Removing strings from another string in Java

Let's say I have this list of words:

String[] stopWords = new String[]{"i","a","and","about","an","are","as","at","be","by","com","for","from","how","in","is","it","not","of","on","or","that","the","this","to","was","what","when","where","who","will","with","the","www"}; 

Then I have some text:

  String text = "I would like to do a nice novel about nature AND people";

Is there a method that matches the stopWords and removes them while ignoring case, something like this?

  String noStopWordsText = remove(text, stopWords); 

Result:

  " would like do nice novel nature people" 

I know a regex would work well, but I would really prefer something like a Commons solution that is more performance-oriented.

By the way, right now I'm using this Commons method, which lacks proper case-insensitive handling:

  private static final String[] stopWords = new String[]{
      "i", "a", "and", "about", "an", "are", "as", "at", "be", "by", "com", "for",
      "from", "how", "in", "is", "it", "not", "of", "on", "or", "that", "the", "this",
      "to", "was", "what", "when", "where", "who", "will", "with", "the", "www",
      "I", "A", "AND", "ABOUT", "AN", "ARE", "AS", "AT", "BE", "BY", "COM", "FOR",
      "FROM", "HOW", "IN", "IS", "IT", "NOT", "OF", "ON", "OR", "THAT", "THE", "THIS",
      "TO", "WAS", "WHAT", "WHEN", "WHERE", "WHO", "WILL", "WITH", "THE", "WWW"};
  private static final String[] blanksForStopWords = new String[]{
      "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
      "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
      "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
      "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""};

  noStopWordsText = StringUtils.replaceEach(text, stopWords, blanksForStopWords);

Here is a solution that doesn't use regular expressions. I think it's inferior to my other answer because it is longer and less clear, but if performance is really, really important, this one is O(n), where n is the length of the text.

  import java.util.HashSet;
  import java.util.Set;

  Set<String> stopWords = new HashSet<String>();
  stopWords.add("a");
  stopWords.add("and");
  // and so on ...

  String sampleText = "I would like to do a nice novel about nature AND people";
  StringBuilder clean = new StringBuilder();
  int index = 0;

  while (index < sampleText.length()) {
    // the only word delimiter supported is space; if you want other
    // delimiters you have to do a series of indexOf calls and see which
    // one gives the smallest index, or use regex
    int nextIndex = sampleText.indexOf(" ", index);
    if (nextIndex == -1) {
      // no more spaces: the last word runs to the end of the text
      nextIndex = sampleText.length();
    }
    String word = sampleText.substring(index, nextIndex);
    if (!stopWords.contains(word.toLowerCase())) {
      clean.append(word);
      if (nextIndex < sampleText.length()) {
        // this adds the word delimiter, e.g. the following space
        clean.append(sampleText.substring(nextIndex, nextIndex + 1));
      }
    }
    index = nextIndex + 1;
  }

  System.out.println("Stop words removed: " + clean.toString());

Create a regular expression from the stop words, make it case-insensitive, and then use the matcher's replaceAll method to replace every match with an empty string.

  import java.util.regex.*;

  Pattern stopWords = Pattern.compile("\\b(?:i|a|and|about|an|are|...)\\b\\s*", Pattern.CASE_INSENSITIVE);
  Matcher matcher = stopWords.matcher("I would like to do a nice novel about nature AND people");
  String clean = matcher.replaceAll("");

The ... in the pattern is just me being lazy; continue the list of stop words there.

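Another approach is to loop over all the stop words and use String's replaceAll method. The problem with that approach is that replaceAll compiles a new regular expression on every call, so it is not very efficient inside a loop. Also, String's replaceAll has no parameter for the flag that makes the regex case-insensitive; the only option there is an embedded (?i) in the pattern itself.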

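Edit: I added \b around the pattern so that it matches whole words only. I also added \s* so that it swallows any whitespace following the word, which might not be necessary.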
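Rather than typing the alternation by hand, the pattern can be assembled from the stop word array. The following is only a sketch of that idea: the class name StopWordRegex and its methods compile and remove are hypothetical names chosen for illustration, and the short stop word list in main is an abbreviated stand-in for the full list. Pattern.quote is used so that a stop word containing regex metacharacters is still matched literally.

```java
import java.util.regex.Pattern;

class StopWordRegex {
    // Hypothetical helper: joins the stop words into one alternation and
    // compiles it once, so the same Pattern can be reused for many texts.
    static Pattern compile(String[] stopWords) {
        StringBuilder alternation = new StringBuilder();
        for (String word : stopWords) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // quote each word so it is matched literally
            alternation.append(Pattern.quote(word));
        }
        // \b on both sides restricts matches to whole words;
        // \s* also removes the whitespace that follows a stop word
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s*", Pattern.CASE_INSENSITIVE);
    }

    // Hypothetical helper: removes every match of the compiled pattern.
    static String remove(String text, Pattern stopWords) {
        return stopWords.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // abbreviated list, for illustration only
        String[] stopWords = {"i", "a", "and", "about", "to"};
        Pattern pattern = compile(stopWords);
        System.out.println(remove("I would like to do a nice novel about nature AND people", pattern));
    }
}
```

Compiling once and reusing the Pattern avoids the per-call compilation cost that String's replaceAll incurs in a loop.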

You can create a regular expression that matches all the stop words [e.g. " a ", note the spaces] and end up with

 str.replaceAll(regexpression,""); 

Or

  String[] stopWords = new String[]{" i ", " a ", " and ", " about ", " an ", " are ", " as ", " at ", " be ", " by ", " com ", " for ", " from ", " how ", " in ", " is ", " it ", " not ", " of ", " on ", " or ", " that ", " the ", " this ", " to ", " was ", " what ", " when ", " where ", " who ", " will ", " with ", " the ", " www "};
  String text = " I would like to do a nice novel about nature AND people ";

  for (String stopword : stopWords) {
    text = text.replaceAll("(?i)" + stopword, " ");
  }

  System.out.println(text);

Output:

  would like do nice novel nature people 

There may be a better way.

Split the text on whitespace. Then loop over the array and append each word to a StringBuilder only if it is not one of the stop words.
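A minimal sketch of that split-and-append idea might look like the following. The class name StopWordSplitter and the method remove are hypothetical, the stop words are assumed to be stored in lower case in a HashSet, and the list in main is abbreviated for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class StopWordSplitter {
    // Hypothetical helper: keeps only the words that are not in the stop word set.
    // The set is assumed to contain lower-case entries.
    static String remove(String text, Set<String> stopWords) {
        StringBuilder clean = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // skip the empty token produced by blank input, and skip stop words
            if (word.isEmpty() || stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (clean.length() > 0) {
                clean.append(' ');
            }
            clean.append(word);
        }
        return clean.toString();
    }

    public static void main(String[] args) {
        // abbreviated list, for illustration only
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "a", "and", "about", "to"));
        System.out.println(remove("I would like to do a nice novel about nature AND people", stopWords));
    }
}
```

Note that this version collapses every run of whitespace into a single space, and punctuation attached to a word (for example "and,") prevents it from matching a stop word.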