Stanford Core NLP – understanding coreference resolution

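I am having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools. As an example, below is a sentence and the corresponding CorefChainAnnotation: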

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.

{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}

I am not sure I understand what these numbers mean. Looking at the source code did not help either.

Thanks

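The first number is a cluster ID (it identifies the group of mentions that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The number pairs are the output of CorefChain#toString():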

 public String toString(){ return position.toString(); }

where position is a set of position pairs for the mentions of the entity (to get them, use CorefChain.getCorefMentions()). Here is a complete code example (in Groovy) that shows how to get from positions to tokens:

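    // Package names below are those of the dcoref-era releases;
    // later releases moved the coref classes to edu.stanford.nlp.coref.
    import edu.stanford.nlp.dcoref.CorefChain
    import edu.stanford.nlp.dcoref.CorefChain.CorefMention
    import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
    import edu.stanford.nlp.pipeline.Annotation
    import edu.stanford.nlp.pipeline.StanfordCoreNLP

    class Example {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            props.put("dcoref.score", true);
            def pipeline = new StanfordCoreNLP(props);

            String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
            Annotation document = new Annotation(aText);
            pipeline.annotate(document);

            Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
            println aText

            for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
                CorefChain c = entry.getValue();
                println "ClusterId: " + entry.getKey();

                // Caution: startIndex/endIndex are token indices within the sentence,
                // but they are used here as character offsets into aText.
                CorefMention cm = c.getRepresentativeMention();
                println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);

                List<CorefMention> cms = c.getCorefMentions();
                println "Mentions: ";
                cms.each { it -> print aText.subSequence(it.startIndex, it.endIndex) + "|"; }
            }
        }
    }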

Output (I do not understand where the 's' comes from):

    The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
    ClusterId: 1
    Representative Mention: he
    Mentions: he|atom |s|
    ClusterId: 6
    Representative Mention: basic unit
    Mentions: basic unit |
    ClusterId: 8
    Representative Mention: unit
    Mentions: unit |
    ClusterId: 10
    Representative Mention: it
    Mentions: it |
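A likely explanation for the stray 's' is that startIndex and endIndex count tokens (1-based, with an exclusive end) rather than characters, so feeding them to subSequence cuts the raw text at the wrong places. The following is a minimal sketch of that mismatch and not CoreNLP code: the class name, the helper methods, the hand-written token list, and the three spans (1,3), (4,9), (10,11) are all assumptions, chosen because they reproduce the cluster 1 output shown above.

```java
import java.util.Arrays;
import java.util.List;

public class TokenVsCharOffsets {

    // Interprets the span as character offsets, as the Groovy example does.
    static String asCharOffsets(String text, int start, int end) {
        return text.substring(start, end);
    }

    // Interprets the span as 1-based token indices with an exclusive end.
    static String asTokenIndices(List<String> tokens, int start, int end) {
        return String.join(" ", tokens.subList(start - 1, end - 1));
    }

    public static void main(String[] args) {
        String text = "The atom is a basic unit of matter, it consists of a dense central nucleus";
        // Hand-written tokenization (assumption): the comma is its own token.
        List<String> tokens = Arrays.asList(
                "The", "atom", "is", "a", "basic", "unit", "of", "matter", ",", "it", "consists");

        // Assumed (startIndex, endIndex) pairs for the three mentions of cluster 1.
        int[][] spans = { {1, 3}, {4, 9}, {10, 11} };

        for (int[] s : spans) {
            System.out.println("as characters: [" + asCharOffsets(text, s[0], s[1]) + "]"
                    + "  as tokens: [" + asTokenIndices(tokens, s[0], s[1]) + "]");
        }
        // as characters: [he]  as tokens: [The atom]
        // as characters: [atom ]  as tokens: [a basic unit of matter]
        // as characters: [s]  as tokens: [it]
    }
}
```

Read this way, the 's' is simply the eleventh character of the text (the "s" of "is"), picked up because the mention "it" is the tenth token.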

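I have been working with the coreference dependency graph, and I started out with the other answer to this question. After a while, though, I realized that the algorithm above is not entirely correct. The output it produces is not even close to what my modified version gives.

For anyone else who uses this post, here is the algorithm I ended up with. It also filters out self-references, because every representative mention also mentions itself, and many mentions only reference themselves.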

    Map<Integer, CorefChain> coref = document.get(CorefChainAnnotation.class);

    for (Map.Entry<Integer, CorefChain> entry : coref.entrySet()) {
        CorefChain c = entry.getValue();

        // this is because it prints out a lot of self references which aren't that useful
        if (c.getCorefMentions().size() <= 1)
            continue;

        CorefMention cm = c.getRepresentativeMention();
        String clust = "";
        List<CoreLabel> tks = document.get(SentencesAnnotation.class).get(cm.sentNum - 1).get(TokensAnnotation.class);
        for (int i = cm.startIndex - 1; i < cm.endIndex - 1; i++)
            clust += tks.get(i).get(TextAnnotation.class) + " ";
        clust = clust.trim();
        System.out.println("representative mention: \"" + clust + "\" is mentioned by:");

        for (CorefMention m : c.getCorefMentions()) {
            String clust2 = "";
            tks = document.get(SentencesAnnotation.class).get(m.sentNum - 1).get(TokensAnnotation.class);
            for (int i = m.startIndex - 1; i < m.endIndex - 1; i++)
                clust2 += tks.get(i).get(TextAnnotation.class) + " ";
            clust2 = clust2.trim();

            // don't need the self mention
            if (clust.equals(clust2))
                continue;

            System.out.println("\t" + clust2);
        }
    }

The final output for your example sentence is as follows:

    representative mention: "a basic unit of matter" is mentioned by:
        The atom
        it

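Usually "the atom" ends up being the representative mention, but surprisingly it does not in this case. Another example, with slightly more accurate output, is the following sentence: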

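The Revolutionary War occurred in the 18th century and it was the first war in the United States.

This produces the following output: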

    representative mention: "The Revolutionary War" is mentioned by:
        it
        the first war in the United States
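The filtering step in the answer above can be looked at in isolation. Here is a minimal sketch on plain strings, with no CoreNLP dependency: the class name, the method name, and the single-mention chain in the second call are hypothetical, and only the two rules (skip chains with at most one mention, skip mentions whose text equals the representative mention) come from the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SelfReferenceFilter {

    // Returns the mentions worth printing for one chain:
    // nothing for chains with at most one mention, otherwise every mention
    // whose text differs from the representative mention.
    static List<String> otherMentions(String representative, List<String> mentions) {
        List<String> result = new ArrayList<>();
        if (mentions.size() <= 1) {
            return result;
        }
        for (String mention : mentions) {
            if (!mention.equals(representative)) {
                result.add(mention);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Chain taken from the first example output above.
        System.out.println(otherMentions("a basic unit of matter",
                Arrays.asList("The atom", "a basic unit of matter", "it")));
        // prints [The atom, it]

        // Hypothetical chain whose only mention is itself.
        System.out.println(otherMentions("a dense central nucleus",
                Arrays.asList("a dense central nucleus")));
        // prints []
    }
}
```

Because the comparison is on text, a second mention with exactly the same wording as the representative one would be dropped as well; comparing positions instead would avoid that if it matters.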

These are recent results from the annotator.

    [1,1] 1 The atom
    [1,2] 1 a basic unit of matter
    [1,3] 1 it
    [1,6] 6 negatively charged electrons
    [1,5] 5 a cloud of negatively charged electrons

The notation is as follows:

 [Sentence number,'id'] Cluster_no Text_Associated

Text belonging to the same cluster refers to the same entity.
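To see which texts end up together, lines in the notation above can be grouped by their cluster number. This is only a sketch of reading that printed listing: the class name, the method name, and the regular expression are assumptions, and the listing is not a format that CoreNLP itself defines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MentionLines {

    // Matches "[sentence,id] cluster text", e.g. "[1,2] 1 a basic unit of matter".
    private static final Pattern LINE =
            Pattern.compile("\\[\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\]\\s+(\\d+)\\s+(.*)");

    // Groups the mention texts by cluster number; lines that do not match are skipped.
    static Map<Integer, List<String>> groupByCluster(List<String> lines) {
        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.matches()) {
                continue;
            }
            int cluster = Integer.parseInt(m.group(3));
            clusters.computeIfAbsent(cluster, k -> new ArrayList<>()).add(m.group(4));
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "[1,1] 1 The atom",
                "[1,2] 1 a basic unit of matter",
                "[1,3] 1 it",
                "[1,6] 6 negatively charged electrons",
                "[1,5] 5 a cloud of negatively charged electrons");

        for (Map.Entry<Integer, List<String>> e : groupByCluster(lines).entrySet()) {
            System.out.println("cluster " + e.getKey() + ": " + e.getValue());
        }
        // cluster 1: [The atom, a basic unit of matter, it]
        // cluster 5: [a cloud of negatively charged electrons]
        // cluster 6: [negatively charged electrons]
    }
}
```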
