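Design of an alternative (fluent?) interface for regular expressions

I've just seen a huge regex for Java, which made me think about the maintainability of regular expressions in general. I believe that most people (except some badass perl mongers) would agree that regular expressions are hard to maintain.

I was thinking about how one could fix this situation. So far, the most promising idea I have is using a fluent interface. To give an example, instead of: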

Pattern pattern = Pattern.compile("a*|b{2,5}"); 

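one could write something like this

 import static util.PatternBuilder.*;

 Pattern pattern = string("a").anyTimes().or().string("b").times(2,5).compile();

 Pattern alternative =
     or(
         string("a").anyTimes(),
         string("b").times(2,5)
     )
     .compile();

In this very short example, the common way of creating a regular expression is still quite readable to any moderately talented developer. However, think about those spooky expressions that fill two or more lines of 80 characters each. Sure, a (verbose) fluent interface would require several lines instead of only two, but I am sure it would be much more readable (and hence maintainable).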

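Now my questions:

  1. Do you know of any similar approach to regular expressions?

  2. Do you agree that this approach could be better than using simple strings?

  3. How would you design the API?

  4. Would you use such a neat utility in your projects?

  5. Do you think this would be fun to implement? ;)

EDIT: Imagine that there could be methods at a higher level than the simple constructs we all know from regex, for example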

 // matches aaaab@example.com - think of it as reusable expressions
 Pattern p = string("a").anyTimes().string("b@").domain().compile();

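EDIT - short summary of the comments:

  • A fluent regular-expression library for .NET

  • RegexBuddy: spending 30 EUR to make your code readable (wtf?! the mere existence of such a product proves my thesis is right: regular expressions as we know them today are a Bad Thing (tm))

  • Martin Fowler's approach (still far from perfect)

Interesting to read that most people think regular expressions are here to stay, even though it takes tools to read them and smart guys to think of ways to make them maintainable. While I am not sure that a fluent interface is the best way to go, I am sure that some smart engineers (us? ;)) should spend some time on making regular expressions a thing of the past. It's enough that they have been with us for 50 years now, don't you think?

OPEN BOUNTY

The bounty will be awarded to the best idea (no code required) for a new approach to regular expressions.

EDIT - a nice example:

Here's the kind of pattern I am talking about. Extra kudos to the first who is able to translate it; RegexBuddies allowed (it's from an Apache project, by the way). Extra kudos awarded to chii and mez: it's an RFC-compliant email address validation pattern, although it's RFC822 (see ex-parrot.com), not 5322. Not sure if there is a difference, though; if there is, I'll award the next extra kudos for a patch to fit 5322 ;)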

 private static final String pattern = "(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]" + ")+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:" + "\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(" + "?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ " + "\\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\0" + "31]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\" + "](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+" + "(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:" + "(?:\\r\\n)?[ \\t])*))*|(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z" + "|(?=[\\[\"()@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)" + "?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\" + "r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[" + " \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)" + "?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t]" + ")*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[" + " \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*" + ")(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]" + ")+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)" + "*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+" + "|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r" + "\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ 
\\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:" + "\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t" + "]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031" + "]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](" + "?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?" + ":(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?" + ":\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)|(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?" + ":(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?" + "[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*:(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()@,;:\\\".\\[\\] " + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|" + "\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()" + "@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\"" + "(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t]" + ")*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\\" + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?" 
+ ":[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[" + "\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()@,;:\\\".\\[\\] \\000-" + "\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(" + "?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()@,;" + ":\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([" + "^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\"" + ".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\" + "]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\" + "[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\" + "r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] " + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]" + "|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()@,;:\\\".\\[\\] \\0" + "00-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\" + ".|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()@," + ";:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\"(?" + ":[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*" + "(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\"." 
+ "\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[" + "^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]" + "]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)(?:,\\s*(" + "?:(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\\" + "\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(" + "?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[" + "\\[\"()@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t" + "])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t" + "])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?" + ":\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|" + "\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:" + "[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\" + "]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)" + "?[ \\t])*(?:@(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"" + "()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)" + "?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()" + "@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[" + " \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@," + ";:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t]" + ")*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\\" + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ 
\\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?" + "(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()@,;:\\\"." + "\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:" + "\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[" + "\"()@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])" + "*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])" + "+|\\Z|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\" + ".(?:(?:\\r\\n)?[ \\t])*(?:[^()@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z" + "|(?=[\\[\"()@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(" + "?:\\r\\n)?[ \\t])*))*)?;\\s*)"; 

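How would you design the API?

I'd borrow a page from the Hibernate criteria API. Instead of using:

 string("a").anyTimes().or().string("b").times(2,5).compile() 

use a pattern like:

 Pattern.or(Pattern.anyTimes("a"), Pattern.times("b", 2, 5)).compile() 

This notation is a bit more concise, and I feel that it makes the hierarchy/structure of the pattern easier to understand. Each method could accept either a string or a pattern fragment as its first argument.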
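A minimal sketch of how such criteria-style static factories might look. The class name `Fragment` and all of its methods are hypothetical (the name `Pattern` is avoided only because it would clash with `java.util.regex.Pattern`); each factory is overloaded to take either a plain string or another fragment, as suggested above.

```java
import java.util.regex.Pattern;

/** Hypothetical fragment type for illustration; not part of java.util.regex. */
final class Fragment {
    private final String regex;
    private final boolean atomic; // true when a quantifier may be appended directly

    private Fragment(String regex, boolean atomic) {
        this.regex = regex;
        this.atomic = atomic;
    }

    /** Escapes regex metacharacters; clusters multi-character literals. */
    static Fragment literal(String text) {
        String escaped = text.replaceAll("([\\\\.\\[\\]{}()*+?^$|])", "\\\\$1");
        return text.length() == 1
                ? new Fragment(escaped, true)
                : new Fragment("(?:" + escaped + ")", true);
    }

    static Fragment anyTimes(String text) { return anyTimes(literal(text)); }

    static Fragment anyTimes(Fragment f) { return new Fragment(f.atom() + "*", false); }

    static Fragment times(String text, int min, int max) { return times(literal(text), min, max); }

    static Fragment times(Fragment f, int min, int max) {
        return new Fragment(f.atom() + "{" + min + "," + max + "}", false);
    }

    /** Alternation binds loosest, so the alternatives need no extra grouping. */
    static Fragment or(Fragment... alternatives) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : alternatives) {
            if (sb.length() > 0) sb.append('|');
            sb.append(f.regex);
        }
        return new Fragment(sb.toString(), false);
    }

    /** Wraps non-atomic fragments in a non-capturing group before quantifying. */
    private String atom() { return atomic ? regex : "(?:" + regex + ")"; }

    String regex() { return regex; }

    Pattern compile() { return Pattern.compile(regex); }

    public static void main(String[] args) {
        Fragment f = or(anyTimes("a"), times("b", 2, 5));
        System.out.println(f.regex());
        System.out.println(f.compile().matcher("bbb").matches());
    }
}
```

Because every call returns a new fragment, nesting the calls mirrors the nesting of the regex itself, which is the structural benefit claimed for this notation.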

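Do you know of any similar approach to regular expressions?

Not offhand, no.

Do you agree that this approach could be better than using simple strings?

Yes, absolutely... if you are using regex for anything remotely complicated. For very short and simple cases, a string is more convenient.

Would you use such a neat utility in your projects?

Possibly, once it became proven/stable... rolling it into a larger utility project like Apache Commons might be a plus.

Do you think this would be fun to implement? ;)

+1

Martin Fowler suggests another strategy: taking the meaningful parts of the regex and replacing them with variables. He uses the following example: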

  "^score\s+(\d+)\s+for\s+(\d+)\s+nights?\s+at\s+(.*)" 

 String scoreKeyword = "^score\\s+";
 String numberOfPoints = "(\\d+)";
 String forKeyword = "\\s+for\\s+";
 String numberOfNights = "(\\d+)";
 String nightsAtKeyword = "\\s+nights?\\s+at\\s+";
 String hotelName = "(.*)";

 String pattern = scoreKeyword + numberOfPoints + forKeyword + numberOfNights + nightsAtKeyword + hotelName;

which is much more readable and maintainable.

The goal of the previous example is to parse numberOfPoints, numberOfNights and hotelName out of a list of strings like:

 score 400 for 2 nights at Minas Tirith Airport 
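As a rough illustration, the composed pattern can be applied to the sample line with plain `java.util.regex`. The class name `ComposedRegexDemo` and the `parse` helper are made up for this sketch; note that the backslashes are doubled because these are Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ComposedRegexDemo {
    // Named pieces of the regex, as in the composed version above.
    static final String SCORE_KEYWORD = "^score\\s+";
    static final String NUMBER_OF_POINTS = "(\\d+)";
    static final String FOR_KEYWORD = "\\s+for\\s+";
    static final String NUMBER_OF_NIGHTS = "(\\d+)";
    static final String NIGHTS_AT_KEYWORD = "\\s+nights?\\s+at\\s+";
    static final String HOTEL_NAME = "(.*)";

    static final Pattern PATTERN = Pattern.compile(
            SCORE_KEYWORD + NUMBER_OF_POINTS + FOR_KEYWORD
                    + NUMBER_OF_NIGHTS + NIGHTS_AT_KEYWORD + HOTEL_NAME);

    /** Returns {points, nights, hotel}, or null when the line does not match. */
    static String[] parse(String line) {
        Matcher m = PATTERN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("score 400 for 2 nights at Minas Tirith Airport");
        System.out.println(String.join(" | ", parts));
    }
}
```

The groups are still addressed by position, so reordering the named pieces silently changes which group number holds which value.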

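It might be slightly easier for someone without any regex experience, but after somebody has learned your system, he still won't be able to read a normal regex anywhere else.

Also, I think your version is harder to read for a regex expert.

I would recommend annotating the regex like this: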

 Pattern pattern = Pattern.compile(
     "a*     # Find 0 or more a \n" +
     "|      # ... or ... \n" +
     "b{2,5} # Find between 2 and 5 b \n",
     Pattern.COMMENTS);

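This is easy to read at any experience level, and for the inexperienced it teaches regex at the same time. Also, the comments can be tailored to the situation to explain the business rules behind the regex rather than just its structure.

Also, a tool like RegexBuddy can take your regex and translate it into:

Match either the regular expression below (attempting the next alternative only if this one fails) «a*»
   Match the character "a" literally «a*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Or match regular expression number 2 below (the entire match attempt fails if this one fails to match) «b{2,5}»
   Match the character "b" literally «b{2,5}»
       Between 2 and 5 times, as many times as possible, giving back as needed (greedy) «{2,5}»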

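This is an intriguing concept, but as it is presented there are a few flaws.

But first, answers to the key questions asked:

Now my questions:

1. Do you know of any similar approach to regular expressions?

None that haven't already been mentioned. And I found out about those by reading the question and the answers.

2. Do you agree that this approach could be better than using simple strings?

If it works as advertised, it would definitely make things much easier to debug.

3. How would you design the API?

See my notes in the next section. I take your examples and the linked .NET library as a starting point.

4. Would you use such a neat utility in your projects?

Undecided. I have no problem working with the current, cryptic form of regular expressions. And I would need a tool to convert existing regular expressions to the fluent language version.

5. Do you think this would be fun to implement? ;)

I enjoy working out the higher-level "how" more than writing the actual code, which explains the wall of text that is this answer.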


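Here are a couple of problems I noticed, and the way I would deal with them.

Unclear structure.

Your example seems to create a regex by concatenating onto strings. This is not very robust. I believe that these methods should be added to String and Pattern/Regex objects, because that would make both the implementation and the code cleaner. Besides, it is similar to the way regular expressions are classically defined.

Just because I can't see it working any other way, the rest of my annotations to the proposed scheme will assume that all methods act on and return Pattern objects.

Edit: I seem to have used the following conventions throughout, so I've clarified them and moved them up here.

  • Instance methods: pattern augmentation, e.g. capturing, repetition, look-around assertions.

  • Operators: ordering operations, i.e. alternation and concatenation.

  • Constants: character classes and boundaries (in place of \w, $, \b, etc.)

How will capturing/clustering be handled?

Capturing is a huge part of regular expressions.

I see each Pattern object being stored internally as a cluster, (?:pattern) in Perl terms. This allows pattern tokens to be easily mixed and mingled without interfering with other pieces.

I expect capturing to be done as an instance method on a Pattern, taking a variable in which to store the matching string[s].

pattern.capture(variable) would store the match in variable. If the capture is part of an expression that is matched multiple times, variable should contain an array of strings holding all matches for the pattern.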
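To see why implicit clustering matters, here is a small sketch using plain regex strings; the `cluster` helper is hypothetical and simply wraps a fragment in a non-capturing group before it is combined with anything else.

```java
import java.util.regex.Pattern;

class ClusterDemo {
    /** Wraps a fragment in a non-capturing group, (?:...). */
    static String cluster(String fragment) {
        return "(?:" + fragment + ")";
    }

    public static void main(String[] args) {
        String alternation = "a|b";

        // Naive concatenation changes the meaning: a|bc is "a" OR "bc".
        String naive = alternation + "c";

        // Clustering first keeps the intended meaning: ("a" or "b") then "c".
        String clustered = cluster(alternation) + "c";

        System.out.println(naive + " matches ac: " + Pattern.matches(naive, "ac"));
        System.out.println(clustered + " matches ac: " + Pattern.matches(clustered, "ac"));
    }
}
```

Without the cluster, appending a piece silently re-associates the alternation, which is exactly the kind of interference between pieces the paragraph above is trying to rule out.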

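Fluent languages can be very ambiguous.

Fluent languages are not well suited to the recursive nature of regular expressions, so the order of operations needs careful consideration. Just chaining methods together does not allow for very complex regular expressions, which is exactly the situation where such a tool would be useful.

Does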

 Pattern pattern = string("a").anyTimes().or().string("b").times(2,5).compile(); 

produce /a*|b{2,5}/ or /(a*|b){2,5}/ ?

How would such a scheme handle nested alternation? For example: /a*|b(c|d)|e/
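The two readings are not equivalent, which a quick check with `java.util.regex` shows. The class and constant names below are made up for this sketch.

```java
import java.util.regex.Pattern;

class AmbiguityDemo {
    // Reading 1: the repetition binds only to "b", alternation is applied last.
    static final Pattern LOOSE = Pattern.compile("a*|b{2,5}");

    // Reading 2: the repetition applies to everything chained so far.
    static final Pattern GROUPED = Pattern.compile("(a*|b){2,5}");

    public static void main(String[] args) {
        for (String input : new String[] { "bb", "bab" }) {
            System.out.println(input
                    + " -> reading 1: " + LOOSE.matcher(input).matches()
                    + ", reading 2: " + GROUPED.matcher(input).matches());
        }
    }
}
```

Both readings accept "bb", but only the grouped one accepts "bab", so a fluent API has to state which one a flat method chain means.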

I see three ways to handle alternation in regular expressions:

  1. As an operator: pattern1 or pattern2 => pattern # /pattern1|pattern2/
  2. As a class method: Pattern.or( pattern1, pattern2[, pattern3]*) => pattern # /pattern1|pattern2|pattern3|.../
  3. As an instance method: pattern1.or(pattern2) => pattern # /pattern1|pattern2/

I would handle concatenation the same way.

  1. As an operator: pattern1 + pattern2 => pattern # /pattern1pattern2/
  2. As a class method: Pattern.concatenate( pattern1, pattern2[, pattern3]*) => pattern # /pattern1pattern2pattern3.../
  3. As an instance method: pattern1.then(pattern2) => pattern # /pattern1pattern2/
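Java has no operator overloading, so only the class-method and instance-method forms can be sketched there. The `Rx` class below is a hypothetical illustration, not an existing library: every node renders itself as a non-capturing cluster, which is what lets the nested alternation /a*|b(c|d)|e/ be expressed without ambiguity.

```java
import java.util.regex.Pattern;

/** Hypothetical immutable pattern node; every node renders as a (?:...) cluster. */
final class Rx {
    private final String body;

    private Rx(String body) { this.body = body; }

    /** A literal string; Pattern.quote takes care of escaping. */
    static Rx literal(String text) { return new Rx(Pattern.quote(text)); }

    /** Class-method form of alternation: Rx.or(p1, p2, p3, ...). */
    static Rx or(Rx first, Rx second, Rx... more) {
        StringBuilder sb = new StringBuilder(first.cluster()).append('|').append(second.cluster());
        for (Rx r : more) sb.append('|').append(r.cluster());
        return new Rx(sb.toString());
    }

    /** Class-method form of concatenation. */
    static Rx concatenate(Rx first, Rx second, Rx... more) {
        StringBuilder sb = new StringBuilder(first.cluster()).append(second.cluster());
        for (Rx r : more) sb.append(r.cluster());
        return new Rx(sb.toString());
    }

    /** Instance-method forms delegate to the class methods. */
    Rx or(Rx other) { return or(this, other); }

    Rx then(Rx next) { return concatenate(this, next); }

    Rx anyTimes() { return new Rx(cluster() + "*"); }

    String cluster() { return "(?:" + body + ")"; }

    Pattern compile() { return Pattern.compile(cluster()); }

    public static void main(String[] args) {
        // Equivalent of /a*|b(c|d)|e/
        Rx nested = or(
                literal("a").anyTimes(),
                literal("b").then(literal("c").or(literal("d"))),
                literal("e"));
        System.out.println(nested.cluster());
        System.out.println(nested.compile().matcher("bd").matches());
    }
}
```

The generated regex is noisier than the hand-written one because of the extra clusters, but the grouping is always explicit in the call structure.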

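How to extend common patterns

The proposed scheme uses .domain(), which seems to be a commonly used regex. Treating user-defined patterns as methods doesn't make it easy to add new patterns. In a language like Java, users of the library would have to subclass the builder to add methods for commonly used patterns.

This is addressed by my suggestion to treat every piece as an object. A pattern object can be created for each commonly used regex, such as one matching a domain. Given my earlier thoughts on capturing, it shouldn't be too hard to make sure that capturing works for multiple copies of the same common pattern that contains a captured section.

There should also be constants for patterns matching the various character classes.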
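As a stand-in for such pattern objects, the sketch below uses plain string constants. All names are illustrative, and the domain grammar is deliberately simplified. The point is only that a reusable "domain" piece can live in user code as a value, without touching the library class.

```java
import java.util.regex.Pattern;

class CommonPatterns {
    // Character-class constants (illustrative names).
    static final String LETTER = "[A-Za-z]";
    static final String DIGIT = "[0-9]";

    // A user-defined, reusable fragment built from the constants above:
    // a label starts with a letter, followed by letters, digits or hyphens.
    static final String DOMAIN_LABEL =
            "(?:" + LETTER + "(?:" + LETTER + "|" + DIGIT + "|-)*)";

    // One label, then any number of ".label" repetitions.
    static final String DOMAIN =
            "(?:" + DOMAIN_LABEL + "(?:\\." + DOMAIN_LABEL + ")*)";

    // The "aaaab@example.com" example, composed with the shared fragment.
    static final Pattern EXAMPLE = Pattern.compile("a*b@" + DOMAIN);

    public static void main(String[] args) {
        System.out.println(EXAMPLE.matcher("aaaab@example.com").matches());
    }
}
```

Any number of such constants can be defined next to the code that needs them, which is harder to do when each common pattern has to be a method on the builder.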

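Zero-width look-around assertions

Expanding on my idea that all pieces should be implicitly clustered, look-around assertions shouldn't be too hard to do with an instance method either.

pattern.zeroWidthLookBehind() would produce (?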
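For reference, `java.util.regex` writes a zero-width positive look-behind as `(?<=X)`. Assuming that is what such a method would emit, a string-based sketch could look like this; the helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookBehindDemo {
    /** Hypothetical helper: wraps a fragment as a zero-width positive look-behind. */
    static String zeroWidthLookBehind(String fragment) {
        return "(?<=" + fragment + ")";
    }

    /** Returns the first substring matched by the regex, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Match the part after "@" without including the "@" itself.
        String regex = zeroWidthLookBehind("@") + "[a-z.]+";
        System.out.println(firstMatch(regex, "user@example.com"));
    }
}
```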


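Things that still need to be considered.

  • Back-references: hopefully not too hard with the named capturing discussed earlier
  • How to actually implement it. I haven't given the internals much thought; that is where the real magic will happen.
  • Translation: there really should be a tool for translating between classic regex (say, the Perl dialect) and the new scheme. Translating from the new scheme could be part of the package.

Putting it all together, here is my proposed version of a pattern that matches an email address: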

 Pattern domain_label = LETTER_CHARACTER + (LETTER_CHARACTER or "-" or DIGIT_CHARACTER).anyTimes()
 Pattern domain = domain_label + ("." + domain_label).anyTimes()
 Pattern pattern = (LETTER_CHARACTER + ALPHANUMERIC_CHARACTER + "@" + domain).compile

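In hindsight, my scheme borrows heavily from Martin Fowler's approach. Although I didn't intend for things to go this way, it definitely makes using such a system more maintainable. It also addresses a problem or two with Fowler's approach (capture ordering).

My own humble attempt can be found on GitHub. Although I don't think it is worth using for simple expressions, it does provide a couple of advantages besides the improved readability:

  • It takes care of bracket matching
  • It handles the escaping of all the "special" characters, which can quickly lead to backslash hell

A few simple examples: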

 // Matches a single digit
 RegExBuilder.build(anyDigit()); // "[0-9]"

 // Matches exactly 2 digits
 RegExBuilder.build(exactly(2).of(anyDigit())); // "[0-9]{2}"

 // Matches between 2 and 4 letters
 RegExBuilder.build(between(2,4).of(anyLetter())); // "[a-zA-Z]{2,4}"

A more complicated one (which more or less validates email addresses):

 final Token ALPHA_NUM = anyOneOf(range('A','Z'), range('a','z'), range('0','9'));
 final Token ALPHA_NUM_HYPEN_UNDERSCORE = anyOneOf(characters('_','-'), range('A','Z'), range('a','z'), range('0','9'));

 String regexText = RegExBuilder.build(
     // Before the '@' symbol we can have letters, numbers, underscores and hyphens anywhere
     oneOrMore().of(
         ALPHA_NUM_HYPEN_UNDERSCORE
     ),
     zeroOrMore().of(
         text("."), // Periods are also allowed in the name, but not as the initial character
         oneOrMore().of(
             ALPHA_NUM_HYPEN_UNDERSCORE
         )
     ),
     text("@"),
     // Everything else is the domain name - only letters, numbers and periods here
     oneOrMore().of(
         ALPHA_NUM
     ),
     zeroOrMore().of(
         text("."), // Periods must not be the first character in the domain
         oneOrMore().of(
             ALPHA_NUM
         )
     ),
     text("."), // At least one period is required
     atLeast(2).of( // Period must be followed by at least 2 letters (this is the TLD)
         anyLetter()
     )
 );

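There is a fluent regexps library for .NET.

Short answer: I've seen it approached from a linting and compiling angle, which I think is something to consider here.

Long answer: The company I work for makes hardware-based regular expression engines for enterprise content filtering applications. Think of running anti-virus or firewall applications at 20GB/sec right in the network routers rather than taking up valuable server or processor cycles. Most anti-virus, anti-spam or firewall apps are a bunch of regular expressions at their core.

Anyway, the way the regexes are written has a huge impact on scanning performance. You can write a regex in several different ways that do the same thing, and some will perform drastically better than others. We've written compilers and linters for our customers to help them maintain and tune their expressions.

Back to the OP's question: rather than defining a whole new syntax, I would write a linter (sorry, ours is proprietary) into which you cut and paste a regex; it would break down the legacy regex and output "fluent English" so that someone can understand it better. I'd also add relative performance checks and suggestions for common modifications.

The short answer, for me, is that once you get to regular expressions (or other pattern matching that does the same thing) that are long enough to cause a problem... you should probably be considering whether they are the right tool for the job in the first place.

Honestly, any fluent interface seems like it would be harder to read than a standard regular expression. For really short expressions, the fluent version is verbose, but not too long; it is readable. But so is the regular expression for something that short.

For a medium-sized regular expression, a fluent interface becomes bulky; long enough that it is hard, if not impossible, to read.

For a long regular expression (i.e. the email address one), where the regular expression itself is actually hard (if not impossible) to read, the fluent version became impossible to read 10 pages ago.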

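Do you know of any similar approach to regular expressions?

No, apart from the previous answers.

Do you agree that this approach could be better than using simple strings?

Sort of. I think that instead of single characters representing the constructs we could use more descriptive markup, but I doubt it would make complex logic any clearer.

How would you design the API?

Translate each construct into a method name and allow nested function calls, so that it is very easy to take a string and substitute method names into it.

I think most of the value would be in defining a robust library of utility functions, like matching "emails", "phone numbers", "lines that do not contain X" and so on, that can be customized.

Would you use such a neat utility in your projects?

Maybe, but probably only for the longer ones, where debugging function calls would be easier than debugging string editing, or where there is a nice utility function ready to use.

Do you think this would be fun to implement? ;)

Of course!

To answer the last part of the question (for the kudos):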

 private static final String pattern = "(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]" + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:" + "\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(" + "?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ " + "\\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\0" + "31]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\" + "](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+" + "(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:" + "(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z" + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)" + "?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\" + "r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[" + " \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)" + "?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t]" + ")*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[" + " \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*" + ")(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]" + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)" + "*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+" + "|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r" 
+ "\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:" + "\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t" + "]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031" + "]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](" + "?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?" + ":(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?" + ":\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?" + ":(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?" + "[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*:(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] " + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|" + "\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>" + "@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"" + "(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t]" + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\" + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?" 
+ ":[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[" + "\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-" + "\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(" + "?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;" + ":\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([" + "^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\"" + ".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\" + "]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\" + "[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\" + "r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] " + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]" + "|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\0" + "00-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\" + ".|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@," + ";:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?" + ":[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*" + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"." 
+ "\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[" + "^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]" + "]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)(?:,\\s*(" + "?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\" + "\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(" + "?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[" + "\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t" + "])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t" + "])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?" + ":\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|" + "\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:" + "[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\" + "]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)" + "?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"" + "()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)" + "?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>" + "@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[" + " \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@," + ";:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t]" + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\" + 
"\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?" + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"." + "\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:" + "\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[" + "\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])" + "*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])" + "+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\" + ".(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z" + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(" + "?:\\r\\n)?[ \\t])*))*)?;\\s*)"; 

matches RFC compliant email addresses 😀

A regular expression is a description of a finite state machine. The classic textual representation isn’t necessarily bad. It is compact, it is relatively unambiguous and it is fairly well adopted.

It MAY be that a better representation would be a state transition diagram, but that would probably be hard to use in source code.

One possibility would be to build it from a variety of container and combiner objects.

Something along the line of the following (turning this from pseudo-code to language of choice is left as an exercise for the eager):

domainlabel = oneormore(characterclass("a-zA-Z0-9-"))
separator = literal(".")
domain = sequence(oneormore(sequence(domainlabel, separator)), domainlabel)
localpart = oneormore(characterclassnot("@"))
emailaddress = sequence(localpart, literal("@"), domain)

Note that the above will incorrectly classify arbitrarily many email addresses as valid even though they do NOT conform to the grammar they’re required to follow, as that grammar requires more than a simple FSM for full parsing. I don’t believe it’d misclassify a valid address as invalid, though.

It should correspond to [^@]+@([a-zA-Z0-9-]+\.)+([a-zA-Z0-9-]+)
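One possible rendering of the pseudo-code in Java, as a sketch only: each combinator just returns a regex string, and everything is wrapped in non-capturing groups, so the generated text is more verbose than the hand-written regex but should accept the same strings. The class name and the `EMAIL_ADDRESS` constant are made up for the example.

```java
import java.util.regex.Pattern;

class Combinators {
    static String literal(String s) { return Pattern.quote(s); }

    static String characterclass(String members) { return "[" + members + "]"; }

    static String characterclassnot(String members) { return "[^" + members + "]"; }

    static String oneormore(String p) { return "(?:" + p + ")+"; }

    static String sequence(String... parts) { return "(?:" + String.join("", parts) + ")"; }

    static final String EMAIL_ADDRESS = build();

    private static String build() {
        String domainlabel = oneormore(characterclass("a-zA-Z0-9-"));
        String separator = literal(".");
        String domain = sequence(oneormore(sequence(domainlabel, separator)), domainlabel);
        String localpart = oneormore(characterclassnot("@"));
        return sequence(localpart, literal("@"), domain);
    }

    public static void main(String[] args) {
        System.out.println(EMAIL_ADDRESS);
        System.out.println(Pattern.matches(EMAIL_ADDRESS, "user@example.com"));
    }
}
```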

4. Would you use such a neat utility in your projects?

I would most likely not. I think this is a case of using the right tool for the job. There are some great answers here such as: 1579202 from Jeremy Stein. I have recently “gotten” regular expressions on a recent project and found them to be very useful just as they are, and when properly commented they are understandable if you know the syntax.

I think the “knowing the syntax” part is what turns people off Regular Expressions. To those who don’t understand them they look arcane and cryptic, but they are a powerful tool, and in applied Computer Science (e.g. writing software for a living) I feel that intelligent professionals should be able to learn them and use them appropriately.

As they say, “With great power comes great responsibility.” I have seen people use regular expressions everywhere for everything, but used judiciously by someone who has taken the time to learn the syntax thoroughly, they are incredibly helpful; to me, adding another layer would in a way defeat their purpose, or at a minimum take away their power.

This is just my own opinion, and I can understand where people are coming from who would desire a framework like this, or who would avoid regular expressions, but I have a hard time hearing “Regular Expressions are bad” from those who haven’t taken the time to learn them and make an informed decision.

To make everybody happy (regex masters and fluid interface proponents), make sure the fluid interface can output an appropriate raw regex pattern, and also take a regular regex using a Factory method and generate fluid code for it.

What you are looking for can be found here: . It is a regular expression builder which follows the Wizard Design Pattern.

I recently had this same idea.

I thought of implementing it myself, but then I found VerbalExpressions.

Let’s compare: I have worked often with (N)Hibernate ICriteria queries, which can be considered a Fluent mapping to SQL. I was (and still am) enthusiastic about them, but did they make the SQL queries more legible? No, more to the contrary, but another benefit rose: it became much easier to programmatically build statements, to subclass them and create your own abstractions etc.

What I’m getting at is that using a new interface for a given language, if done right, can prove worthwhile, but don’t think too highly of it. In many cases it won’t become easier to read (nested subtraction character classes, Captures in look-behind, if-branching to name a few advanced concepts that will be hard to combine fluently). But in just as many cases, the benefits of greater flexibility outweigh the added overhead of syntax complexity.

To add to your list of possible alternative approaches and to take this out of the context of only Java, consider the LINQ syntax. Here’s what it could look like (a bit contrived) (from, where and select are keywords in LINQ):

 // for ^str(aa|bb){3}
 from part in mystring
 where part startswith "str"
 and part hasgroups "aa" or "bb" as first  /* "aa" or "bb" in group 'first' */
 and part repeats first 3                  /* repeat group 'first' 3 times */
 select part + "extra"                     /* can contain complete statement block */

Just a rough idea, I know. The good thing about LINQ is that it is checked by the compiler, a kind of language within a language. By default, LINQ can also be expressed as fluent chained syntax, which makes it, if well designed, compatible with other OO languages.

I say go for it, I’m sure it’s fun to implement.

I suggest using a query model (similar to jQuery, django ORM), where each function returns a query object, so you can chain them together.

 any("a").some("b").one("@").some(chars).one(".").some(chars) //a*b+@\w+\.\w+ 

where chars is predefined to fit any character.

or can be achieved by using choices:

 any("a").choice("x", "z") // a(x|z) 

The argument to each function can be a string or another query. For example, the chars variable mentioned above can be defined as a query:

 //this one is ascii only
 chars = raw("a-zA-Z0-9")

And so, you can have a “raw” function that accepts a regex string as input if it feels too cumbersome to use the fluent query system.

Actually, these functions can just return raw regex, and chaining them is simply concatenating these raw regex strings.

 any("a")         ---> "a*"
 raw("b+")        ---> "b+"
 one(".")         ---> "\."
 choice("a", "b") ---> (a|b)
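A minimal Java sketch of this query model, under a few assumptions: the class name `Q` and the `query()` entry point are invented (Java cannot have a static and an instance method with the same signature, so the chain needs a starting call), `chars` is written as a bracketed class, and every quantified piece is wrapped in a non-capturing group, so the generated regex is wordier than the hand-written equivalents shown above.

```java
import java.util.regex.Pattern;

/** Hypothetical chainable query object; each call appends one raw regex piece. */
final class Q {
    private final String regex;

    private Q(String regex) { this.regex = regex; }

    /** Starts an empty query. */
    static Q query() { return new Q(""); }

    /** Escape hatch: embeds a regex fragment verbatim. */
    static Q raw(String regex) { return new Q(regex); }

    Q any(String text) { return append(group(Pattern.quote(text)) + "*"); }

    Q some(String text) { return append(group(Pattern.quote(text)) + "+"); }

    Q some(Q inner) { return append(group(inner.regex) + "+"); }

    Q one(String text) { return append(Pattern.quote(text)); }

    Q choice(String... options) {
        StringBuilder sb = new StringBuilder();
        for (String option : options) {
            if (sb.length() > 0) sb.append('|');
            sb.append(Pattern.quote(option));
        }
        return append(group(sb.toString()));
    }

    private Q append(String piece) { return new Q(regex + piece); }

    private static String group(String s) { return "(?:" + s + ")"; }

    Pattern compile() { return Pattern.compile(regex); }

    @Override
    public String toString() { return regex; }

    public static void main(String[] args) {
        Q chars = raw("[a-zA-Z0-9]"); // ASCII only
        Q address = query().any("a").some("b").one("@").some(chars).one(".").some(chars);
        System.out.println(address);
        System.out.println(address.compile().matcher("aab@example.com").matches());
    }
}
```

Since each method returns a new `Q`, partial queries can be stored and reused, much like query objects in an ORM.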

I am not sure that replacing regexp with a fluent API would bring much.

Note that I am not a regexp wizard (I have to re-read the doc nearly every time I need to create a regexp).

A fluent API would make any medium-complexity regexp (let’s say ~50 characters) even more complex than required and not easier to read in the end, although it may improve the creation of a regexp in an IDE, thanks to code completion. But code maintenance generally represents a higher cost than code development.

In fact, I am not even sure it would be possible to have an API smart enough to really provide enough guidance to the developer when creating a new regexp, not talking about ambiguous cases, as mentioned in a previous answer.

You mentioned a regexp example for an RFC. I am 99% sure (there is still 1% hope;-)) that any API would not make that example any simpler, but conversely that would only make it more complex to read! That’s a typical example where you don’t want to use regexp anyway!

Even regarding regexp creation, due to the problem of ambiguous patterns, it is probable that with a fluent API you would never come up with the right expression the first time, but would have to change it several times until you get what you really want.

Make no mistake, I do love fluent interfaces; I have developed some libraries that use them, and I use several 3rd-party libraries based on them (eg FEST for Java testing). But I don’t think they can be the golden hammer for any problem.

If we consider Java exclusively, I think the main problem with regexps is the necessary escaping of backslashes in Java string constants. That’s one point that makes it incredibly difficult to create and understand regexps in Java. Hence, the first step to enhance Java regexps would, for me, be a language change, a la Groovy, where string constants don’t need to escape backslashes.

So far that would be my only proposal to improve regexp in Java.
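To make the escaping problem concrete, here is a tiny illustration using only standard `java.util.regex` calls; the class and constant names are made up.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // The three-character regex \d+ needs a doubled backslash in Java source.
    static final String DIGITS = "\\d+";

    // A regex matching ONE literal backslash is \\ and so takes four in source.
    static final String ONE_BACKSLASH = "\\\\";

    public static void main(String[] args) {
        System.out.println(DIGITS.length());
        System.out.println(Pattern.matches(DIGITS, "2024"));
        System.out.println(Pattern.matches(ONE_BACKSLASH, "\\"));
    }
}
```

Every backslash in the regex costs two in the source, which is why the long email pattern quoted in the question is so much harder to read as a Java constant than as a bare regex.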