Regex for parsing formatted numbers

I am parsing documents that contain large amounts of formatted numbers, for example:

 Frc consts  --     1.4362                 1.4362                 5.4100
 IR Inten    --     0.0000                 0.0000                 0.0000
 Atom  AN        X      Y      Z        X      Y      Z        X      Y      Z
    1   6     0.00   0.00   0.00     0.00   0.00   0.00     0.00   0.00   0.00
    2   1     0.40  -0.20   0.23    -0.30  -0.18   0.36     0.06   0.42   0.26

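These are separate lines, all with significant leading whitespace, and they may or may not have significant trailing whitespace. They consist of 72, 72, 78, 78 and 78 characters. I can deduce the boundaries between the fields. The lines can be described with Fortran format notation (nx = n spaces, an = n alphanumeric characters, in = an integer in n columns, fm.n = a floating-point number of m characters with n places after the decimal point):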

  (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
  (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
  (1x,a4,a4,3(2x,3a7))
  (1x,2i4,3(2x,3f7.2))
  (1x,2i4,3(2x,3f7.2))

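I have potentially several thousand different formats (which I can generate automatically or farm out), and I am describing them with regexes that describe their components. So if regf10_4 represents a regex for any string satisfying the f10.4 constraint, I can create a regex of the form: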

 COMMENTS (\s .{14} \s regf10_4, \s{13} regf10_4, \s{13} regf10_4, )
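A minimal sketch of how that kind of composition might look in Java, assuming each component regex only fixes the width of its field and captures the raw characters. The helper names `x`, `a` and `f` and the class `ComposedFormat` are hypothetical, and validating the contents of a captured f10.4 field is left to a later step.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ComposedFormat {
    // Hypothetical reusable fragments, one per Fortran edit descriptor.
    // Each fragment only pins down the field width.
    static String x(int n) { return ".{" + n + "}"; }    // nx: skip n columns
    static String a(int n) { return "(.{" + n + "})"; }  // an: capture n characters
    static String f(int m) { return "(.{" + m + "})"; }  // fm.n: capture m characters

    // (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
    static final Pattern FRC_CONSTS = Pattern.compile(
            "^" + x(1) + a(14) + x(1) + f(10) + x(13) + f(10) + x(13) + f(10) + "$");

    /** Returns the three f10.4 values of a 72-character record, or null if the width is wrong. */
    static double[] parse(String line) {
        Matcher m = FRC_CONSTS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        double[] values = new double[3];
        for (int i = 0; i < 3; i++) {
            // Group 1 is the a14 label; groups 2..4 are the numeric fields.
            values[i] = Double.parseDouble(m.group(i + 2).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // Build a 72-character record in the same layout as the first sample line.
        String line = String.format(Locale.US, " %-14s %10.4f%13s%10.4f%13s%10.4f",
                "Frc consts  --", 1.4362, "", 1.4362, "", 5.4100);
        double[] v = parse(line);
        System.out.println(v[0] + " " + v[1] + " " + v[2]);
    }
}
```

In this sketch a blank or overflowed numeric field would make `Double.parseDouble` throw, so a real version would need to check each captured field before converting it.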

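I would like to know whether there are regexes that lend themselves to reuse in this way. Computers and humans create numbers compatible with, say, f10.4 in a wide variety of ways. I believe the following are all legal input and/or output for Fortran (I do not need f or d suffixes, as in 12.4f). [The formatting on SO should be read as having no leading space for the first, one for the second, and so on.]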

 -1234.5678
 1234.5678
 // missing number
 12345678.
 1.
 1.0000000
 1.0000
 1.
 0.
 0.
 .1234
 -.1234
 1E2
 1.E2
 1.E02
 -1.0E-02
 ********** // number over/underflow

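They must be robust to the content of adjacent fields (i.e. examine precisely 10 characters at a precise position and nothing else). Thus the following are all legal for (a1,f5.2,a1):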

 a-1.23b // -1.23
 - 1.23. // 1.23
 3 1.23- // 1.23

I am using Java, so I need regex constructs that are compatible with Java 1.6 (i.e. no Perl extensions).

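As I understand it, each line consists of one or more fixed-width fields, which may contain labels, spaces, or data of different kinds. If you know the widths and types of the fields, extracting their data is a simple matter of substring(), trim() and (optionally) Whatever.parseWhatever(). Regexes can't make that job any easier; in fact, all they can do is make it a lot harder.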
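The substring() / trim() / parse approach could be sketched roughly as follows for the (1x,2i4,3(2x,3f7.2)) records from the question. The class and field names are hypothetical, and the sketch assumes every field holds a valid number.

```java
import java.util.Locale;

class FixedWidthRow {
    final int atom;
    final int atomicNumber;
    final double[] displacements = new double[9];

    /** Slices one 78-character record laid out as (1x,2i4,3(2x,3f7.2)). */
    FixedWidthRow(String line) {
        int pos = 1;                                   // 1x: skip the first column
        atom = Integer.parseInt(line.substring(pos, pos + 4).trim());          // i4
        pos += 4;
        atomicNumber = Integer.parseInt(line.substring(pos, pos + 4).trim());  // i4
        pos += 4;
        for (int group = 0; group < 3; group++) {      // 3( ... )
            pos += 2;                                  // 2x
            for (int k = 0; k < 3; k++) {              // 3f7.2
                String field = line.substring(pos, pos + 7);
                displacements[group * 3 + k] = Double.parseDouble(field.trim());
                pos += 7;
            }
        }
    }

    public static void main(String[] args) {
        // Rebuild the last sample record with the same column layout.
        String line = String.format(Locale.US,
                " %4d%4d  %7.2f%7.2f%7.2f  %7.2f%7.2f%7.2f  %7.2f%7.2f%7.2f",
                2, 1, 0.40, -0.20, 0.23, -0.30, -0.18, 0.36, 0.06, 0.42, 0.26);
        FixedWidthRow row = new FixedWidthRow(line);
        System.out.println(row.atom + " " + row.atomicNumber + " " + row.displacements[3]);
    }
}
```

Because every field is cut out by position, whatever sits in the neighbouring columns never reaches the parser.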

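Scanner doesn't really help, either. True, it has predefined regexes for all kinds of value types and it does the conversions for you, but it still needs to be told which type to look for each time, and it needs the fields to be separated by a delimiter it can recognize. Fixed-width data, by definition, doesn't require delimiters. You might be able to fake the delimiters by doing a lookahead for however many characters should be left in the line, but that's just another way of making the job harder than it needs to be.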
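A small illustration of the delimiter problem, assuming the default whitespace delimiter and the US locale. `ScannerDemo` and `readDoubles` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

class ScannerDemo {
    /** Reads doubles with Scanner until the next token is not a double. */
    static List<Double> readDoubles(String text) {
        List<Double> out = new ArrayList<Double>();
        Scanner sc = new Scanner(text).useLocale(Locale.US);
        while (sc.hasNextDouble()) {
            out.add(sc.nextDouble());
        }
        sc.close();
        return out;
    }

    public static void main(String[] args) {
        // Whitespace happens to separate these fields, so Scanner copes.
        System.out.println(readDoubles("0.40 -0.20 0.23"));
        // (a1,f5.2,a1): the neighbours touch the number, so the whole
        // record is a single token and no double is found.
        System.out.println(readDoubles("a-1.23b"));
    }
}
```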

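It sounds like performance is going to be a major concern; even if you could make a regex solution work, it would probably be too slow. That is not because regexes are inherently slow, but because of the contortions you would have to go through to make them fit the problem. I suggest you forget about regexes for this job.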

You could start with this and go from there.

This regex matches all the numbers you have provided. Unfortunately, it also matches the 3 in "3 1.23-".

 // [-+]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+)?
 //
 // Match a single character present in the list “-+” «[-+]?»
 //    Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
 // Match the regular expression below «(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)»
 //    Match either the regular expression below (attempting the next alternative only if this one fails) «[0-9]+(?:\.[0-9]*)?»
 //       Match a single character in the range between “0” and “9” «[0-9]+»
 //          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
 //       Match the regular expression below «(?:\.[0-9]*)?»
 //          Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
 //          Match the character “.” literally «\.»
 //          Match a single character in the range between “0” and “9” «[0-9]*»
 //             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 //    Or match regular expression number 2 below (the entire group fails if this one fails to match) «\.[0-9]+»
 //       Match the character “.” literally «\.»
 //       Match a single character in the range between “0” and “9” «[0-9]+»
 //          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
 // Match the regular expression below «(?:[eE][-+]?[0-9]+)?»
 //    Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
 //    Match a single character present in the list “eE” «[eE]»
 //    Match a single character present in the list “-+” «[-+]?»
 //       Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
 //    Match a single character in the range between “0” and “9” «[0-9]+»
 //       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»

 // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
 // "document" is the String holding the text to search.
 Pattern regex = Pattern.compile("[-+]?(?:[0-9]+(?:\\.[0-9]*)?|\\.[0-9]+)(?:[eE][-+]?[0-9]+)?");
 Matcher matcher = regex.matcher(document);
 while (matcher.find()) {
     // matched text: matcher.group()
     // match start: matcher.start()
     // match end: matcher.end()
 }
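One possible way to avoid picking up the stray 3 is to stop searching the whole line with find() and instead test only the characters of one field with matches(). The sketch below assumes the column positions are known in advance; `FieldNumber` and `parseField` are hypothetical names, and treating blank or asterisk-filled fields as null is a choice made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldNumber {
    // The number regex from the answer above, unchanged.
    static final Pattern NUMBER = Pattern.compile(
            "[-+]?(?:[0-9]+(?:\\.[0-9]*)?|\\.[0-9]+)(?:[eE][-+]?[0-9]+)?");

    /**
     * Parses the field occupying columns [start, end) of line.
     * Returns null for a blank field, a field of asterisks, or anything
     * else that is not entirely a number once surrounding blanks are removed.
     */
    static Double parseField(String line, int start, int end) {
        String field = line.substring(start, end).trim();
        Matcher m = NUMBER.matcher(field);
        // matches() requires the whole trimmed field to be a number, so
        // characters in neighbouring columns can never be picked up.
        return m.matches() ? Double.valueOf(field) : null;
    }

    public static void main(String[] args) {
        // (a1,f5.2,a1): the f5.2 field sits in columns 1..5 (0-based, end exclusive 6).
        System.out.println(parseField("a-1.23b", 1, 6));
        System.out.println(parseField("- 1.23.", 1, 6));
        System.out.println(parseField("3 1.23-", 1, 6));
        // f10.4 overflow marker and exponent form.
        System.out.println(parseField("**********", 0, 10));
        System.out.println(parseField("  -1.0E-02", 0, 10));
    }
}
```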

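This is only a partial answer, but I have been alerted to the Scanner class in Java 1.5, which can scan text and interpret numbers, and whose documentation gives a BNF for the numbers it can scan and interpret. In principle, I imagine the BNF could be used to construct a regex.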
