Stanford CoreNLP: understanding coreference resolution

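I am having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools. As an example, below is a sentence and the corresponding CorefChainAnnotation: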

    The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
    {1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}

I am not sure I understand what these numbers mean. Looking at the source code did not help either.

Thank you

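The first number is a cluster ID (a cluster groups the tokens that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The pairs of numbers are the output of CorefChain#toString():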

    public String toString() {
        return position.toString();
    }

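where position is a set of position pairs for the mentions of the entity (to get the mentions themselves, use CorefChain.getCorefMentions()). Here is a complete example (in Groovy) that shows how to get from positions to tokens: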

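    import edu.stanford.nlp.dcoref.CorefChain
    import edu.stanford.nlp.dcoref.CorefChain.CorefMention
    import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
    import edu.stanford.nlp.pipeline.Annotation
    import edu.stanford.nlp.pipeline.StanfordCoreNLP

    class Example {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            props.put("dcoref.score", true);
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            // Keep the raw text in a variable so it can be printed and sliced below.
            String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
            Annotation document = new Annotation(aText);
            pipeline.annotate(document);
            Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);

            println aText

            for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
                CorefChain c = entry.getValue();
                println "ClusterId: " + entry.getKey();

                // Note: startIndex/endIndex are 1-based token indices, so slicing
                // aText by them returns character fragments, not the mention text.
                CorefMention cm = c.getRepresentativeMention();
                println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);

                List<CorefMention> cms = c.getCorefMentions();
                println "Mentions: ";
                cms.each { it ->
                    print aText.subSequence(it.startIndex, it.endIndex) + "|";
                }
            }
        }
    }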

The output (I do not understand where the 's' comes from):

    The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
    ClusterId: 1
    Representative Mention: he
    Mentions: he|atom |s|
    ClusterId: 6
    Representative Mention: basic unit
    Mentions: basic unit |
    ClusterId: 8
    Representative Mention: unit
    Mentions: unit |
    ClusterId: 10
    Representative Mention: it
    Mentions: it |
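One possible reading of the stray fragments, offered as a sketch rather than a confirmed diagnosis: a mention's startIndex and endIndex count tokens from 1 within the sentence, with an exclusive end, while subSequence counts characters. The self-contained Java below does not call CoreNLP. The hand-written token list, the class name SpanDemo, and the spans (1,3), (4,9) and (10,11) are assumptions chosen to match the example sentence.

```java
import java.util.List;

final class SpanDemo {
    // Leading part of the example sentence (enough to cover the three spans).
    static final String TEXT = "The atom is a basic unit of matter, it";

    // Hand-written tokens for that text; assumed to match the tokenizer's split.
    static final List<String> TOKENS = List.of(
        "The", "atom", "is", "a", "basic", "unit", "of", "matter", ",", "it");

    // Treats start/end as 0-based character offsets into the raw text.
    static String byCharacters(int start, int end) {
        return TEXT.subSequence(start, end).toString();
    }

    // Treats start/end as 1-based token indices with an exclusive end.
    static String byTokens(int start, int end) {
        return String.join(" ", TOKENS.subList(start - 1, end - 1));
    }

    public static void main(String[] args) {
        int[][] spans = { {1, 3}, {4, 9}, {10, 11} };
        for (int[] s : spans) {
            System.out.println("chars: [" + byCharacters(s[0], s[1]) + "]"
                + "  tokens: [" + byTokens(s[0], s[1]) + "]");
        }
    }
}
```

Read as character offsets, the three spans give "he", "atom " and "s", the same fragments as in the first cluster of the output above. Read as token indices, they give "The atom", "a basic unit of matter" and "it".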

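I have been working with the coreference dependency graph, and I started out from the other answer to this question. After a while, though, I realized that the algorithm above is not quite correct. The output it produces is not even close to that of my modified version.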

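For anyone else who lands on this post, here is the algorithm I ended up with. It also filters out self-references, because every representative mention also mentions itself and many mentions refer only to themselves.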

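    import java.util.List;
    import java.util.Map;
    import edu.stanford.nlp.dcoref.CorefChain;
    import edu.stanford.nlp.dcoref.CorefChain.CorefMention;
    import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;

    // 'document' is the Annotation that has already been run through the pipeline.
    Map<Integer, CorefChain> coref = document.get(CorefChainAnnotation.class);

    for (Map.Entry<Integer, CorefChain> entry : coref.entrySet()) {
        CorefChain c = entry.getValue();

        // this is because it prints out a lot of self references which aren't that useful
        if (c.getCorefMentions().size() <= 1)
            continue;

        // Rebuild the representative mention from its tokens.
        // sentNum and startIndex are 1-based; endIndex is exclusive.
        CorefMention cm = c.getRepresentativeMention();
        String clust = "";
        List<CoreLabel> tks = document.get(SentencesAnnotation.class).get(cm.sentNum - 1).get(TokensAnnotation.class);
        for (int i = cm.startIndex - 1; i < cm.endIndex - 1; i++)
            clust += tks.get(i).get(TextAnnotation.class) + " ";
        clust = clust.trim();
        System.out.println("representative mention: \"" + clust + "\" is mentioned by:");

        for (CorefMention m : c.getCorefMentions()) {
            String clust2 = "";
            tks = document.get(SentencesAnnotation.class).get(m.sentNum - 1).get(TokensAnnotation.class);
            for (int i = m.startIndex - 1; i < m.endIndex - 1; i++)
                clust2 += tks.get(i).get(TextAnnotation.class) + " ";
            clust2 = clust2.trim();

            // don't need the self mention
            if (clust.equals(clust2))
                continue;

            System.out.println("\t" + clust2);
        }
    }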

The final output for your example sentence is as follows:

    representative mention: "a basic unit of matter" is mentioned by:
        The atom
        it

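Usually "the atom" ends up being the representative mention, but in this case, surprisingly, it does not. Another example with slightly more accurate output is the following sentence: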

    The Revolutionary War occurred during the 1700s and it was the first war in the United States.

It produces the following output:

    representative mention: "The Revolutionary War" is mentioned by:
        it
        the first war in the United States

These are the most recent results from the annotator.

  1. [1,1] 1 The atom
  2. [1,2] 1 a basic unit of matter
  3. [1,3] 1 it
  4. [1,6] 6 negatively charged electrons
  5. [1,5] 5 a cloud of negatively charged electrons

The notation is as follows:

 [Sentence number,'id'] Cluster_no Text_Associated 

Text belonging to the same cluster refers to the same entity.
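To make the notation concrete, here is a minimal, self-contained Java sketch that groups the five rows above by cluster number. It does not call CoreNLP: the ClusterDemo class, the Mention holder and its field names are hypothetical, and the rows are copied by hand from the list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ClusterDemo {
    // Hypothetical holder for one row: [sentence, id] cluster text.
    static final class Mention {
        final int sentence;
        final int id;
        final int cluster;
        final String text;

        Mention(int sentence, int id, int cluster, String text) {
            this.sentence = sentence;
            this.id = id;
            this.cluster = cluster;
            this.text = text;
        }
    }

    // The five rows from the list, typed in by hand.
    static List<Mention> rows() {
        return List.of(
            new Mention(1, 1, 1, "The atom"),
            new Mention(1, 2, 1, "a basic unit of matter"),
            new Mention(1, 3, 1, "it"),
            new Mention(1, 6, 6, "negatively charged electrons"),
            new Mention(1, 5, 5, "a cloud of negatively charged electrons"));
    }

    // Collects mention texts per cluster number, ordered by cluster number.
    static Map<Integer, List<String>> byCluster(List<Mention> mentions) {
        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (Mention m : mentions) {
            clusters.computeIfAbsent(m.cluster, k -> new ArrayList<>()).add(m.text);
        }
        return clusters;
    }

    public static void main(String[] args) {
        byCluster(rows()).forEach((cluster, texts) ->
            System.out.println("cluster " + cluster + ": " + String.join(" | ", texts)));
    }
}
```

Only cluster 1 holds more than one mention here, so it is the only group that expresses an actual coreference link; clusters 5 and 6 each contain a single mention.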