Remove consecutive duplicate words from a text using regex and display the new text

Hi,

I have the following code:

    import java.io.*;
    import java.util.ArrayList;
    import java.util.Scanner;
    import java.util.regex.*;

    public class RegexSimple4 {
        public static void main(String[] args) {
            try {
                Scanner myfis = new Scanner(new File("D:\\myfis32.txt"));
                ArrayList<String> foundaz = new ArrayList<String>();
                ArrayList<String> noduplicates = new ArrayList<String>();
                while (myfis.hasNext()) {
                    String line = myfis.nextLine();
                    String delim = " ";
                    String[] words = line.split(delim);
                    for (String s : words) {
                        if (!s.isEmpty() && s != null) {
                            Pattern pi = Pattern.compile("[aA-zZ]*");
                            Matcher ma = pi.matcher(s);
                            if (ma.find()) {
                                foundaz.add(s);
                            }
                        }
                    }
                }
                if (foundaz.isEmpty()) {
                    System.out.println("No words have been found");
                }
                if (!foundaz.isEmpty()) {
                    int n = foundaz.size();
                    String plus = foundaz.get(0);
                    noduplicates.add(plus);
                    for (int i = 1; i < n; i++) {
                        if (!noduplicates.get(i - 1).equalsIgnoreCase(foundaz.get(i))) {
                            noduplicates.add(foundaz.get(i));
                        }
                    }
                    //System.out.print("Cuvantul/cuvintele \n"+i);
                }
                if (!foundaz.isEmpty()) {
                    System.out.print("Original text \n");
                    for (String s : foundaz) {
                        System.out.println(s);
                    }
                }
                if (!noduplicates.isEmpty()) {
                    System.out.print("Remove duplicates\n");
                    for (String s : noduplicates) {
                        System.out.println(s);
                    }
                }
            } catch (Exception ex) {
                System.out.println(ex);
            }
        }
    }

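The goal is to remove consecutive duplicate words from a phrase. The code only works on a column of single words, not on full-length phrases.

For example, my input would be:

    Blah blah dog cat mice. Cat mice dog dog.

and the output:

    Blah dog cat mice. Cat mice dog.

Sincerely,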

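First of all, the regex `[aA-zZ]*` doesn't do what you think it does. It means "match zero or more `a`s, or characters in the range between ASCII `A` and ASCII `z` (which also includes `[`, `]`, `\` and others), or `Z`s". It therefore also matches the empty string.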
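To make the point above concrete, here is a minimal sketch that compares the pattern from the question with `[a-zA-Z]+`, which is probably what was intended. The class name `CharClassDemo` and its helper methods are made up for this illustration.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Pattern from the question: 'a', the range 'A'..'z', and 'Z', zero or more times.
    static final Pattern QUESTION = Pattern.compile("[aA-zZ]*");
    // Assumed intent: one or more ASCII letters.
    static final Pattern LETTERS = Pattern.compile("[a-zA-Z]+");

    // True if the whole string matches the pattern.
    static boolean fullMatch(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    // True if the pattern matches anywhere in the string, as find() does in the question.
    static boolean findsSomething(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        // The characters [ \ ] ^ _ ` sit between 'Z' and 'a' in ASCII.
        String punctuation = "[\\]^_`";
        System.out.println(fullMatch(QUESTION, ""));          // empty string is accepted
        System.out.println(fullMatch(QUESTION, punctuation)); // so is pure punctuation
        System.out.println(fullMatch(LETTERS, punctuation));
        // With '*', find() succeeds on any input because an empty match is allowed.
        System.out.println(findsSomething(QUESTION, "123"));
        System.out.println(findsSomething(LETTERS, "123"));
    }
}
```

Because `find()` with `[aA-zZ]*` succeeds on every token, the check in the question never filters anything out.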

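Assuming that you are only looking for repeated words that consist solely of ASCII letters, compared case-insensitively, and that you want to keep the first occurrence (which means you don't want to match `"it's it's"` or `"olé olé!"`), you can do this in a single regex operation: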

    String result = subject.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");

which will change

    Hello hello Hello there there past pastures

into

    Hello there past pastures

Explanation:

    (?i)     # Mode: case-insensitive
    \b       # Match the start of a word
    ([a-z]+) # Match one ASCII "word", capture it in group 1
    \b       # Match the end of a word
    (?:      # Start of non-capturing group:
     \s+     # Match at least one whitespace character
     \1      # Match the same word as captured before (case-insensitively)
     \b      # and make sure it ends there.
    )+       # Repeat that as often as possible

See it on regex101.com.
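As a quick check, the pattern can be wrapped in a small helper and run against the inputs mentioned in this thread. This is only a sketch: the class name `ConsecutiveDuplicates` and the method `collapse` are invented for the example.

```java
import java.util.regex.Pattern;

class ConsecutiveDuplicates {
    // Same pattern as above, compiled once so it can be reused for many strings.
    private static final Pattern REPEATED_WORD =
            Pattern.compile("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+");

    // Replaces each run of repeated words with the first word of the run.
    static String collapse(String text) {
        return REPEATED_WORD.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String[] samples = {
            "Hello hello Hello there there past pastures",
            "Blah blah dog cat mice. Cat mice dog dog.",
            "it's it's" // left untouched: the apostrophe is not in [a-z]
        };
        for (String sample : samples) {
            System.out.println(collapse(sample));
        }
    }
}
```

Compiling the pattern once avoids recompiling it on every `String.replaceAll` call, which matters when many lines are processed.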

Below is your code. I split the text into lines and used Tim's regex.

    import java.util.Scanner;
    import java.io.*;
    import java.util.regex.*;
    import java.util.ArrayList;

    /**
     *
     * @author Marius
     */
    public class RegexSimple41 {

        /**
         * @param args the command line arguments
         */
        public static void main(String[] args) {
            ArrayList<String> manyLines = new ArrayList<String>();
            ArrayList<String> noRepeat = new ArrayList<String>();
            try {
                Scanner myfis = new Scanner(new File("D:\\myfis41.txt"));
                // Collect every non-empty line of the file.
                while (myfis.hasNext()) {
                    String line = myfis.nextLine();
                    String delim = System.getProperty("line.separator");
                    String[] lines = line.split(delim);
                    for (String s : lines) {
                        if (s != null && !s.isEmpty()) {
                            manyLines.add(s);
                        }
                    }
                }
                if (!manyLines.isEmpty()) {
                    System.out.print("Original text\n");
                    for (String s : manyLines) {
                        System.out.println(s);
                    }
                }
                // Collapse consecutive repeated words in each line.
                if (!manyLines.isEmpty()) {
                    for (String s : manyLines) {
                        String result = s.replaceAll("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+", "$1");
                        noRepeat.add(result);
                    }
                }
                if (!noRepeat.isEmpty()) {
                    System.out.print("Remove duplicates\n");
                    for (String s : noRepeat) {
                        System.out.println(s);
                    }
                }
            } catch (Exception ex) {
                System.out.println(ex);
            }
        }
    }

Good luck,
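For comparison, the same line-by-line approach can be written more compactly with `java.nio.file.Files` and streams. This is a sketch that assumes Java 8 or later and a UTF-8 encoded file; the class name `LineDeduplicator` and the method `collapseAll` are hypothetical, and the default path is simply the one used in the code above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LineDeduplicator {
    // Tim's pattern, compiled once and reused for every line.
    private static final Pattern REPEATED_WORD =
            Pattern.compile("(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+");

    // Drops empty lines and collapses repeated words in the remaining ones.
    static List<String> collapseAll(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.isEmpty())
                .map(line -> REPEATED_WORD.matcher(line).replaceAll("$1"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file path comes from the first argument, else the path from the answer.
        Path input = Paths.get(args.length > 0 ? args[0] : "D:\\myfis41.txt");
        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
        System.out.println("Remove duplicates");
        for (String line : collapseAll(lines)) {
            System.out.println(line);
        }
    }
}
```

Keeping the transformation in `collapseAll` separates it from file handling, so it can be tried on an in-memory list without touching the disk.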

The code below works fine:

    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DuplicateRemoveEx {

        public static void main(String[] args) {
            String regex = "(?i)\\b(\\w+)(\\b\\W+\\1\\b)+";
            Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            Scanner in = new Scanner(System.in);
            // The first input line holds the number of sentences that follow.
            int numSentences = Integer.parseInt(in.nextLine());
            while (numSentences-- > 0) {
                String input = in.nextLine();
                Matcher m = p.matcher(input);
                // A single replaceAll call collapses every run of repeated words.
                input = m.replaceAll("$1");
                System.out.println(input);
            }
            in.close();
        }
    }
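This last pattern is not equivalent to the `[a-z]`/`\s+` one: `\w` also matches digits and underscores, and `\W+` accepts any non-word separator, including punctuation. A rough side-by-side sketch follows (the class name `SeparatorComparison` and its constants are invented for the illustration):

```java
class SeparatorComparison {
    // ASCII letters only, repeats separated by whitespace only.
    static final String LETTERS_AND_SPACES = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
    // Word characters (letters, digits, underscore), repeats separated by any non-word characters.
    static final String WORD_CHARS = "(?i)\\b(\\w+)(\\b\\W+\\1\\b)+";

    // Collapses each run matched by the given regex to its first word.
    static String collapse(String regex, String text) {
        return text.replaceAll(regex, "$1");
    }

    public static void main(String[] args) {
        String[] samples = {
            "2 2 cats",              // digits: only WORD_CHARS collapses them
            "dog, dog barks",        // comma between repeats: only WORD_CHARS collapses them
            "Cat mice dog. Dog runs" // WORD_CHARS also merges across the sentence boundary
        };
        for (String sample : samples) {
            System.out.println(collapse(LETTERS_AND_SPACES, sample)
                    + " | " + collapse(WORD_CHARS, sample));
        }
    }
}
```

Which behavior is preferable depends on the input: the `\W+` variant swallows the punctuation between repeats, which can join two sentences together.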