Using a regular expression to remove consecutive duplicate words from a text and display the new text
Hi,
I have the following code:
import java.io.*;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.*;

public class RegexSimple4 {
    public static void main(String[] args) {
        try {
            Scanner myfis = new Scanner(new File("D:\\myfis32.txt"));
            ArrayList<String> foundaz = new ArrayList<String>();
            ArrayList<String> noduplicates = new ArrayList<String>();

            // Split every line on single spaces and collect the words.
            while (myfis.hasNext()) {
                String line = myfis.nextLine();
                String delim = " ";
                String[] words = line.split(delim);
                for (String s : words) {
                    if (s != null && !s.isEmpty()) {
                        Pattern pi = Pattern.compile("[aA-zZ]*");
                        Matcher ma = pi.matcher(s);
                        if (ma.find()) {
                            foundaz.add(s);
                        }
                    }
                }
            }

            if (foundaz.isEmpty()) {
                System.out.println("No words have been found");
            }

            if (!foundaz.isEmpty()) {
                int n = foundaz.size();
                String plus = foundaz.get(0);
                noduplicates.add(plus);
                // Keep a word only if it differs from the word right before it.
                for (int i = 1; i < n; i++) {
                    if (!foundaz.get(i - 1).equalsIgnoreCase(foundaz.get(i))) {
                        noduplicates.add(foundaz.get(i));
                    }
                }
                //System.out.print("Cuvantul/cuvintele \n"+i);
            }

            if (!foundaz.isEmpty()) {
                System.out.print("Original text \n");
                for (String s : foundaz) {
                    System.out.println(s);
                }
            }

            if (!noduplicates.isEmpty()) {
                System.out.print("Remove duplicates\n");
                for (String s : noduplicates) {
                    System.out.println(s);
                }
            }
        } catch (Exception ex) {
            System.out.println(ex);
        }
    }
}
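The goal is to remove consecutive duplicate words from a phrase. The code only works for a column of strings (one word per line), not for full-length sentences.
For example, my input would be:
Blah blah dog cat mouse. Cat mouse dog dog.
and the output:
Blah dog cat mouse. Cat mouse dog.
Sincerely,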
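First of all, the regex [aA-zZ]* doesn't do what you think it does. It means "match zero or more a's, or characters in the range between ASCII A and ASCII z (which also includes [, ], \ and others), or Z's". It therefore also matches the empty string.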
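The point about the character class can be checked directly. The following is a minimal sketch, not part of the original answer: the class name `CharClassDemo`, its helper methods, and the `[a-zA-Z]+` pattern (which is only a guess at what the question intended) are illustrative assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class CharClassDemo {
    // The class from the question: 'a', the range 'A'..'z' (ASCII 65-122), and 'Z'.
    static final Pattern LOOSE = Pattern.compile("[aA-zZ]*");
    // Assumed intent: one or more ASCII letters.
    static final Pattern LETTERS = Pattern.compile("[a-zA-Z]+");

    static boolean looseMatches(String s) {
        return LOOSE.matcher(s).matches();
    }

    static boolean lettersMatch(String s) {
        return LETTERS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The characters [ \ ] ^ _ ` sit between 'Z' and 'a' in ASCII.
        String punctuation = "[\\]^_`";
        System.out.println(looseMatches(""));          // true: '*' allows zero characters
        System.out.println(looseMatches(punctuation)); // true: all inside the range A-z
        System.out.println(lettersMatch(""));          // false
        System.out.println(lettersMatch(punctuation)); // false
        System.out.println(lettersMatch("Blah"));      // true
    }
}
```

Because `find()` succeeds on an empty match, the `if (ma.find())` check in the question accepts every token, including ones with no letters at all.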
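Assuming that you are only looking for repeated words that consist solely of ASCII letters, compared case-insensitively, and that you want to keep the first occurrence (which means that you would not match "it's it's" or "olé olé!"), you can do this in a single regex operation: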
String result = subject.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");
which will change
Hello hello Hello there there past pastures
into
Hello there past pastures
Explanation:
(?i)       # Mode: case-insensitive
\b         # Match the start of a word
([a-z]+)   # Match one ASCII "word", capture it in group 1
\b         # Match the end of a word
(?:        # Start of non-capturing group:
  \s+      #   Match at least one whitespace character
  \1       #   Match the same word as captured before (case-insensitively)
  \b       #   and make sure it ends there.
)+         # Repeat that as often as possible
See it live on regex101.com.
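To see how this behaves on a full sentence like the one in the question, here is a small self-contained sketch. The class name `DedupDemo` and the helper `removeRepeats` are made up for illustration, and the sample sentence is an English rendering of the question's example.

```java
// Hypothetical demo class; only the regex itself comes from the answer above.
class DedupDemo {
    // Collapse runs of the same ASCII word (ignoring case) into the first occurrence.
    static String removeRepeats(String text) {
        return text.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");
    }

    public static void main(String[] args) {
        // prints: Blah dog cat mouse. Cat mouse dog.
        System.out.println(removeRepeats("Blah blah dog cat mouse. Cat mouse dog dog."));
        // prints: Hello there past pastures
        System.out.println(removeRepeats("Hello hello Hello there there past pastures"));
        // prints: it's it's (unchanged, because of the apostrophe)
        System.out.println(removeRepeats("it's it's"));
    }
}
```

Since the replacement works on the whole string, punctuation and spacing outside the repeated run are left untouched, so there is no need to split the text into words first.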
Below is your code. I split the text by lines and used Tim's regex.
import java.util.Scanner;
import java.io.*;
import java.util.regex.*;
import java.util.ArrayList;

/**
 *
 * @author Marius
 */
public class RegexSimple41 {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        ArrayList<String> manyLines = new ArrayList<String>();
        ArrayList<String> noRepeat = new ArrayList<String>();
        try {
            Scanner myfis = new Scanner(new File("D:\\myfis41.txt"));

            // Read the file line by line, skipping empty lines.
            while (myfis.hasNext()) {
                String line = myfis.nextLine();
                String delim = System.getProperty("line.separator");
                String[] lines = line.split(delim);
                for (String s : lines) {
                    if (s != null && !s.isEmpty()) {
                        manyLines.add(s);
                    }
                }
            }

            if (!manyLines.isEmpty()) {
                System.out.print("Original text\n");
                for (String s : manyLines) {
                    System.out.println(s);
                }
            }

            // Collapse consecutive repeated words in each line.
            if (!manyLines.isEmpty()) {
                for (String s : manyLines) {
                    String result = s.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");
                    noRepeat.add(result);
                }
            }

            if (!noRepeat.isEmpty()) {
                System.out.print("Remove duplicates\n");
                for (String s : noRepeat) {
                    System.out.println(s);
                }
            }
        } catch (Exception ex) {
            System.out.println(ex);
        }
    }
}
Good luck,
The code below works fine:
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateRemoveEx {
    public static void main(String[] args) {
        String regex = "(?i)\\b(\\w+)(\\b\\W+\\1\\b)+";
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Scanner in = new Scanner(System.in);

        // The first input line holds the number of sentences that follow.
        int numSentences = Integer.parseInt(in.nextLine());
        while (numSentences-- > 0) {
            String input = in.nextLine();
            Matcher m = p.matcher(input);
            // If any repeated word is found, collapse all repeats in the sentence.
            while (m.find()) {
                input = input.replaceAll(regex, "$1");
            }
            System.out.println(input);
        }
        in.close();
    }
}
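Note that this regex is not equivalent to the one in the first answer: `\w+` also matches digits and underscores, and `\W+` lets any run of non-word characters (including punctuation) separate the repeats. The sketch below compares the two; the class name `RegexComparison`, its helpers, and the sample strings are made up for illustration.

```java
// Hypothetical comparison class; only the two regexes come from the answers.
class RegexComparison {
    // First answer: ASCII letters only, repeats separated by whitespace.
    static final String LETTERS_ONLY = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
    // Answer above: any \w word, repeats separated by any non-word characters.
    static final String ANY_WORD = "(?i)\\b(\\w+)(\\b\\W+\\1\\b)+";

    static String lettersOnly(String s) {
        return s.replaceAll(LETTERS_ONLY, "$1");
    }

    static String anyWord(String s) {
        return s.replaceAll(ANY_WORD, "$1");
    }

    public static void main(String[] args) {
        String[] samples = {
            "dog, dog barks",        // repeat separated by a comma
            "42 42 times",           // repeated number
            "I saw a dog. Dog ran."  // repeat across a sentence boundary
        };
        for (String s : samples) {
            System.out.println("input:        " + s);
            System.out.println("letters only: " + lettersOnly(s));
            System.out.println("any word:     " + anyWord(s));
        }
    }
}
```

With the `\W+` version, "I saw a dog. Dog ran." becomes "I saw a dog ran.", dropping the full stop along with the repeated word. Whether that is wanted depends on the input, so the whitespace-only variant may be the safer choice for ordinary prose.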