找到所有匹配的子串,而不仅仅是“最扩展的”子串

代码

String s = "yzaaabccz"; Pattern p = Pattern.compile("(a )+(b )+(c *)c"); Matcher m = p.matcher(s); while (m.find()) { System.out.println(m.group()); } 

版画

 aaabcc 

哪个是对的。

但逻辑上,子串

 aaabc aabcc aabc abcc abc 

也匹配正则表达式。

那么,我怎样才能使代码找到那些子串呢,即不仅是最扩展的子串,还有它的子代码

您可以使用不情愿的限定符,例如*?+? 。 这些匹配尽可能少,与贪婪的标准*+形成鲜明对比,即尽可能匹配。 尽管如此,这只允许您找到特定的“子匹配”,而不是全部。 使用前瞻控制非捕获组可以实现更多控制,也在文档中描述。 但是为了真正找到所有子匹配,你可能必须自己做一些事情,即构建正则表达式对应的自动机并使用自定义代码进行导航。

你需要一个懒惰的量词 。

请尝试以下方法:

 Pattern p = Pattern.compile("(a )+(b )+((c )*?)c"); 

请注意,我再次将“ c ”分组,因为我认为这就是你想要的。 否则你会发现任意多个空格,但不是“ c ”。

我能想到的唯一方法是生成原始字符串的所有可能子字符串的列表,并将正则表达式与每个字符串相匹配,保留匹配的项目。

我不知道任何可以回馈所有有效匹配的正则表达式引擎。

但是我们可以应用一些逻辑来生成所有候选字符串并将其呈现给正则表达式。

通过枚举给定输入的所有可能子串来构造候选者。

 var str = "yzaaabcczyzaaabccz"; var regex = new Regex("(a )+(b )+(c *)c"); var length = str.Length; for (int start = 1; start <= length;start++){ for (int groupLength = 1; start + groupLength - 1 <= length ;groupLength++){ var candidate = str.Substring(start-1,groupLength); //.Dump(); //("\"" + candidate + "\"").Dump(); var match = regex.Match(candidate); if (match.Value == candidate ) { candidate.Dump(); } } } 

这给了

 aaabccaabccabcc 

这似乎是正确的答案,但与你的结果相矛盾:

 aaabc => I state that this is not a match aabcc ok aabc => I state that this is not a match abcc ok abc => I state that this is not a match 

例如,你给的正则表达式

 (a )+(b )+(c *)c 

与结果中的第一个条目不匹配

 aaabc 

如果您认为起始位置不重要,则上述逻辑可以生成相同的匹配。 例如,如果您再次重复给定输入:

 "yzaaabcczyzaaabccz" 

它会给:

 aaabcc aabcc abcc aaabcc aabcc abcc 

如果您认为职位不重要,您应该对此结果做出明确的决定

如果认为是潜在匹配,则应添加输入为空字符串的简单情况。

仅供参考,这是正则表达式检查的所有候选人

 "y" "y " "yz" "yz " "yza" "yza " "yzaa" "yzaa " "yzaaa" "yzaaa " "yzaaab" "yzaaab " "yzaaabc" "yzaaabc " "yzaaabcc" "yzaaabcc " "yzaaabccz" " " " z" " z " " za" " za " " zaa" " zaa " " zaaa" " zaaa " " zaaab" " zaaab " " zaaabc" " zaaabc " " zaaabcc" " zaaabcc " " zaaabccz" "z" "z " "za" "za " "zaa" "zaa " "zaaa" "zaaa " "zaaab" "zaaab " "zaaabc" "zaaabc " "zaaabcc" "zaaabcc " "zaaabccz" " " " a" " a " " aa" " aa " " aaa" " aaa " " aaab" " aaab " " aaabc" " aaabc " " aaabcc" " aaabcc " " aaabccz" "a" "a " "aa" "aa " "aaa" "aaa " "aaab" "aaab " "aaabc" "aaabc " "aaabcc" "aaabcc " "aaabccz" " " " a" " a " " aa" " aa " " aab" " aab " " aabc" " aabc " " aabcc" " aabcc " " aabccz" "a" "a " "aa" "aa " "aab" "aab " "aabc" "aabc " "aabcc" "aabcc " "aabccz" " " " a" " a " " ab" " ab " " abc" " abc " " abcc" " abcc " " abccz" "a" "a " "ab" "ab " "abc" "abc " "abcc" "abcc " "abccz" " " " b" " b " " bc" " bc " " bcc" " bcc " " bccz" "b" "b " "bc" "bc " "bcc" "bcc " "bccz" " " " c" " c " " cc" " cc " " ccz" "c" "c " "cc" "cc " "ccz" " " " c" " c " " cz" "c" "c " "cz" " " " z" "z" 

此外,了解两种主要类型的正则表达式(NFA和DFA)如何完成其​​工作也是一件好事

来自http://msdn.microsoft.com/en-us/library/e347654k.aspx

.NET(我认为也是JAVA)是NFA正则表达式引擎(与DFA相对),并且当它处理特定语言元素时,引擎使用贪婪匹配; 也就是说,它尽可能多地匹配输入字符串。 但它在成功匹配子表达式后也会保存其状态。 如果匹配最终失败,则引擎可以返回已保存状态,以便可以尝试其他匹配。 放弃成功的子表达式匹配以使正则表达式中的后续语言元素也可以匹配的这一过程称为回溯。

鉴于这些非常具体的约束(即这不是一般情况解决方案),这将起作用:

 import java.util.Set; import java.util.TreeSet; import java.util.regex.Matcher; import java.util.regex.Pattern; public class test { public static void main(String[] args) { String s = "yzaaabccz"; Pattern p = Pattern.compile("(a )+(b )+(c ?)+"); Set set = recurse(s, p, 0); } public static Set recurse(String s, Pattern p, int depth) { int temp = depth; while(temp>0) { System.out.print(" "); temp--; } System.out.println("-> " +s); Matcher matcher = p.matcher(s); Set set = new TreeSet(); if(matcher.find()) { String found = matcher.group().trim(); set.add(found); set.addAll(recurse(found.substring(1), p, depth+1)); set.addAll(recurse(found.substring(0, found.length()-1), p, depth+1)); } while(depth>0) { System.out.print(" "); depth--; } System.out.println("<- " +s); return set; } } 

我有理由相信你可以调整它以适应其他情况,但是递归到匹配的字符串意味着重叠匹配(如@ahenderson所指出的那样)将不起作用。