UIMA Ruta out-of-memory issue in a Spark environment

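I am running a UIMA application on Apache Spark. There are millions of pages that UIMA Ruta has to process in batches. However, from time to time I run into an out-of-memory exception. Sometimes it throws the exception after successfully processing 2,000 pages, and sometimes it already fails at 500 pages.

Application log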

    Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.uima.internal.util.IntArrayUtils.expand_size(IntArrayUtils.java:57)
        at org.apache.uima.internal.util.IntArrayUtils.ensure_size(IntArrayUtils.java:39)
        at org.apache.uima.cas.impl.Heap.grow(Heap.java:187)
        at org.apache.uima.cas.impl.Heap.add(Heap.java:241)
        at org.apache.uima.cas.impl.CASImpl.ll_createFS(CASImpl.java:2844)
        at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:489)
        at org.apache.uima.cas.impl.CASImpl.createAnnotation(CASImpl.java:3837)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotations(RuleMatch.java:172)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotationsOf(RuleMatch.java:68)
        at org.apache.uima.ruta.rule.RuleMatch.getLastMatchedAnnotation(RuleMatch.java:73)
        at org.apache.uima.ruta.rule.ComposedRuleElement.mergeDisjunctiveRuleMatches(ComposedRuleElement.java:330)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:213)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)

UIMA RUTA SCRIPT

    WORDLIST EnglishStopWordList = 'stopWords.txt';
    WORDLIST FiltersList = 'AnchorFilters.txt';
    DECLARE Filters, EnglishStopWords;
    DECLARE Anchors, SpanStart, SpanClose;
    DocumentAnnotation{-> ADDRETAINTYPE(MARKUP)};
    DocumentAnnotation{-> MARKFAST(Filters, FiltersList)};
    STRING MixCharacterRegex = "[0-9]+[a-zA-Z]+";
    DocumentAnnotation{-> MARKFAST(EnglishStopWords, EnglishStopWordList,true)};
    (SW | CW | CAP ) { -> MARK(Anchors, 1, 2)};
    Anchors{CONTAINS(EnglishStopWords) -> UNMARK(Anchors)};
    (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
    (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
    (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? EnglishStopWords? { -> MARK(Anchors, 1, 4)};
    (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 3)};
    Anchors{CONTAINS(MARKUP) -> UNMARK(Anchors)};
    MixCharacterRegex -> Anchors;
    "" -> SpanStart;
    "" -> SpanClose;
    Anchors{-> CREATE(ExtractedData, "type" = "ANCHOR", "value" = Anchors)};
    SpanStart Filters? SPACE? ExtractedData SPACE? Filters? SpanClose{-> GATHER(Data, 2, 6, "ExtractedData" = 4)};

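In general, the cause of high memory usage in UIMA Ruta can be found either in RutaBasic (many annotations, coverage information) or in RuleMatch (inefficient rules, many rule element matches).

In your example, however, the problem seems to originate somewhere else. The stack trace indicates that the memory is used up by disjunctive rule elements, which require creating new annotations for storing the match information.

Your version of UIMA Ruta seems to be rather old, since the line numbers do not match the sources I am looking at at all.

There are seven (!!!) calls of continueOwnMatch in the stack trace. I have been looking for a rule that could cause something like this but did not find one. This could be an old defect that has been fixed in newer versions, or some preprocessing adds additional CW/SW/CAP annotations.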

As a first step, I would recommend two things:

Update to UIMA Ruta 2.6.0
Get rid of all disjunctive rule elements

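The disjunctive rule elements are not really needed in your script. In general, they should not be used unless they are really needed. I do not use them at all in productive rules.

Instead of (SW | CW | CAP ) you can simply write W.

Instead of (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) you can write ANY{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))}.

Using ANY as the matching condition can reduce runtime performance. In this example, it may be better to use two rules instead of rewriting the rule element, e.g., something like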

    SPECIAL{REGEXP("['\"-=()\\[\\]]")} W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};
    PM W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};

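(An optional rule element at the start of a rule has no anchor in the rule, so the first rule element here is not optional.)

By the way, there is a lot of room for optimization in your rules. If I had to guess, I would say that you could get rid of at least half of the rules and 90% of all created annotations, which would also considerably reduce memory usage.

Disclaimer: I am a developer of UIMA Ruta.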
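A side note on the character class passed to REGEXP in these rules, since it matters more once the condition is attached to ANY instead of SPECIAL. In Java regex syntax, `"-=` inside a character class is a range from `"` to `=`, not three separate characters, so the class also accepts characters such as `+`, `/` and the digits. The sketch below checks this with plain `java.util.regex`. The class and method names are made up for illustration. It also assumes that Ruta unescapes the string literal the way Java does and evaluates REGEXP as a full match on the covered text, neither of which is confirmed here.

```java
import java.util.regex.Pattern;

public class RutaCharClassDemo {
    // Same literal as in the rules; after unescaping, the regex is ['"-=()\[\]]
    static final Pattern CHAR_CLASS = Pattern.compile("['\"-=()\\[\\]]");

    // True if the whole text is exactly one character accepted by the class.
    // Full-match semantics are an assumption about how REGEXP is evaluated.
    static boolean matchesWholeText(String coveredText) {
        return CHAR_CLASS.matcher(coveredText).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"'", "\"", "-", "=", "(", "[", "+", "/", "5", "a", "?"};
        for (String s : samples) {
            System.out.println(s + " -> " + matchesWholeText(s));
        }
    }
}
```

If only the listed characters are intended, the hyphen would have to be escaped or moved to the end of the class.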
