Stanford NLP – OpenIE out of memory when processing a list of files

I'm trying to use the OpenIE tool from Stanford CoreNLP to extract information from multiple files. When several files are passed as input it throws an out-of-memory error, but not when only a single file is passed:

 All files have been queued; awaiting termination...
 java.lang.OutOfMemoryError: GC overhead limit exceeded
     at edu.stanford.nlp.graph.DirectedMultiGraph.outgoingEdgeIterator(DirectedMultiGraph.java:508)
     at edu.stanford.nlp.semgraph.SemanticGraph.outgoingEdgeIterator(SemanticGraph.java:165)
     at edu.stanford.nlp.semgraph.semgrex.GraphRelation$GOVERNER$1.advance(GraphRelation.java:267)
     at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.initialize(GraphRelation.java:1102)
     at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.<init>(GraphRelation.java:1083)
     at edu.stanford.nlp.semgraph.semgrex.GraphRelation$GOVERNER$1.<init>(GraphRelation.java:257)
     at edu.stanford.nlp.semgraph.semgrex.GraphRelation$GOVERNER.searchNodeIterator(GraphRelation.java:257)
     at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChildIter(NodePattern.java:320)
     at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.matches(CoordinationPattern.java:211)
     at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matchChild(NodePattern.java:514)
     at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matches(NodePattern.java:542)
     at edu.stanford.nlp.naturalli.RelationTripleSegmenter.segmentVerb(RelationTripleSegmenter.java:541)
     at edu.stanford.nlp.naturalli.RelationTripleSegmenter.segment(RelationTripleSegmenter.java:850)
     at edu.stanford.nlp.naturalli.OpenIE.relationInFragment(OpenIE.java:354)
     at edu.stanford.nlp.naturalli.OpenIE.lambda$relationsInFragments$2(OpenIE.java:366)
     at edu.stanford.nlp.naturalli.OpenIE$$Lambda$76/1438896944.apply(Unknown Source)
     at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
     at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1540)
     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
     at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
     at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
     at edu.stanford.nlp.naturalli.OpenIE.relationsInFragments(OpenIE.java:366)
     at edu.stanford.nlp.naturalli.OpenIE.annotateSentence(OpenIE.java:486)
     at edu.stanford.nlp.naturalli.OpenIE.lambda$annotate$3(OpenIE.java:554)
     at edu.stanford.nlp.naturalli.OpenIE$$Lambda$25/606198361.accept(Unknown Source)
     at java.util.ArrayList.forEach(ArrayList.java:1249)
     at edu.stanford.nlp.naturalli.OpenIE.annotate(OpenIE.java:554)
     at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:71)
     at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:499)
     at edu.stanford.nlp.naturalli.OpenIE.processDocument(OpenIE.java:630)
 DONE processing files. 1 exceptions encountered.

I pass the files as input with this invocation:

 java -mx3g -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar edu.stanford.nlp.naturalli.OpenIE file1 file2 file3 etc. 

I've tried increasing the memory with -mx3g and larger variants. The number of files processed does go up, but not by much (e.g., from 5 to 7). Each file processes fine on its own, so I've ruled out a single file with very long sentences or too many lines being the cause.

Is there an option I haven't considered, some OpenIE or Java flag, that I could use to force a flush to output, a cleanup, or a garbage collection between each processed file?

Thanks in advance.

From the comments above: I suspect this is a problem of too much parallelism and too little memory. OpenIE is a bit memory-hungry, especially on long sentences, so running many files in parallel can take up a fair amount of memory.

An easy fix is to force the program to run single-threaded by setting the -threads 1 flag. If possible, increasing the memory should help as well; see the sketch below.
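Concretely, the question's invocation with single-threading forced would look something like this (a sketch: -threads 1 is the flag described above, and -mx4g is just an example of giving the JVM more heap; use whatever your machine can spare):

 java -mx4g -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar edu.stanford.nlp.naturalli.OpenIE -threads 1 file1 file2 file3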

Run this command to get a separate annotation for each file (sample-file-list.txt should contain one file path per line):

 java -Xmx4g -cp "stanford-corenlp-full-2015-12-09/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie -filelist sample-file-list.txt -outputDirectory output_dir -outputFormat text
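If you need more control than the command line offers, you can also drive the same pipeline from the CoreNLP Java API and process the files one at a time, so only one document is ever held in memory. This is a minimal sketch, not code from the question or answer; the annotator chain matches the command above, and the triple-extraction calls follow the standard CoreNLP OpenIE usage (OpenIEPerFile is a hypothetical class name):

 import edu.stanford.nlp.ie.util.RelationTriple;
 import edu.stanford.nlp.ling.CoreAnnotations;
 import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
 import edu.stanford.nlp.pipeline.Annotation;
 import edu.stanford.nlp.pipeline.StanfordCoreNLP;
 import edu.stanford.nlp.util.CoreMap;

 import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Paths;
 import java.util.Properties;

 public class OpenIEPerFile {
   public static void main(String[] args) throws IOException {
     Properties props = new Properties();
     // Same annotator chain as the command-line invocation above.
     props.setProperty("annotators",
         "tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie");
     // Run single-threaded to keep peak memory down.
     props.setProperty("threads", "1");
     // Build the (expensive) pipeline once and reuse it for every file.
     StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

     for (String path : args) {
       // One document in memory at a time.
       Annotation doc = new Annotation(
           new String(Files.readAllBytes(Paths.get(path))));
       pipeline.annotate(doc);
       for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
         for (RelationTriple triple :
             sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class)) {
           System.out.println(path + "\t" + triple.confidence + "\t"
               + triple.subjectGloss() + "\t"
               + triple.relationGloss() + "\t"
               + triple.objectGloss());
         }
       }
       // The annotation goes out of scope here, so the garbage collector
       // can reclaim it before the next file is read.
     }
   }
 }

The point of the loop structure is that each file's annotation becomes unreachable before the next file is loaded, which avoids holding all documents' parses in memory at once, the situation that the parallel batch mode runs into.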