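Stanford Core NLP: Entity type non-deterministic

I built a Java parser using Stanford Core NLP. I am having trouble getting consistent results from the CoreNLP object: I get different entity types for the same input text. This looks like a bug in CoreNLP to me. I am wondering whether any Stanford NLP users have run into this issue and found a workaround. Here is the service class that I instantiate and reuse.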

    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    class StanfordNLPService {
        //private static final Logger logger = LogConfiguration.getInstance().getLogger(StanfordNLPServer.class.getName());
        private StanfordCoreNLP nerPipeline;

        /* Initialize the nlp instances for ner and sentiments. */
        public void init() {
            Properties nerAnnotators = new Properties();
            nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
            nerPipeline = new StanfordCoreNLP(nerAnnotators);
        }

        /**
         * @param text Text from which entities are to be extracted.
         */
        public void printEntities(String text) {
            // boolean tracking = PerformanceMonitor.start("StanfordNLPServer.getEntities");
            try {
                // Properties nerAnnotators = new Properties();
                // nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
                // nerPipeline = new StanfordCoreNLP(nerAnnotators);
                Annotation document = nerPipeline.process(text);

                // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
                List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
                for (CoreMap sentence : sentences) {
                    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                        // Get the entity type and offset information needed.
                        String currEntityType = token.get(CoreAnnotations.NamedEntityTagAnnotation.class); // NER type
                        int currStart = token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class);  // token offset_start
                        int currEnd = token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class);      // token offset_end
                        String currPos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);         // POS type
                        System.out.println("(Type:value:offset)\t" + currEntityType + ":\t"
                                + text.substring(currStart, currEnd) + "\t" + currStart);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Discrepancy result: the type changed from MISC to O after the initial use.

    Iteration 1:
    (Type:value:offset) MISC: Appropriate 100
    (Type:value:offset) MISC: Time 112

    Iteration 2:
    (Type:value:offset) O: Appropriate 100
    (Type:value:offset) O: Time 112

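I've looked over some of the code, and here is a possible workaround:

What you can do is load each of the 3 serialized CRFs with useKnownLCWords set to false and serialize them again. Then supply the new serialized CRFs to your StanfordCoreNLP.

Here is a command that loads a serialized CRF with useKnownLCWords set to false and then dumps it again: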

    java -mx600m -cp "*:." edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -useKnownLCWords false -serializeTo classifiers/new.english.all.3class.distsim.crf.ser.gz

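Obviously, use whatever names you want! This command assumes you are in stanford-corenlp-full-2015-04-20/ and have a directory named classifiers containing the serialized CRFs. Change it as appropriate for your setup.

The command should load the serialized CRF, override its useKnownLCWords setting with false, and then re-dump the CRF to new.english.all.3class.distsim.crf.ser.gz

Then, in your original code: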

    nerAnnotators.put("ner.model", "comma-separated-list-of-paths-to-new-serialized-crfs");
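As a rough sketch of what that property could look like once filled in, the snippet below builds the comma-separated `ner.model` value from a list of paths using only `java.util`. The class name `NerModelProperties` and the `new.*` file names are hypothetical; the 4class file name comes from elsewhere in this thread, and the 7class file name is assumed to be the third default model, so check it against your own distribution.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: builds pipeline properties that point ner.model
// at the re-serialized CRFs instead of the default models.
class NerModelProperties {

    public static Properties build(List<String> modelPaths) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // ner.model takes a comma-separated list of classifier paths.
        props.setProperty("ner.model", String.join(",", modelPaths));
        return props;
    }

    public static void main(String[] args) {
        // Assumed output names of the re-serialization command; adjust to your files.
        List<String> paths = Arrays.asList(
                "classifiers/new.english.all.3class.distsim.crf.ser.gz",
                "classifiers/new.english.conll.4class.distsim.crf.ser.gz",
                "classifiers/new.english.muc.7class.distsim.crf.ser.gz");
        Properties props = build(paths);
        System.out.println(props.getProperty("ner.model"));
        // These properties would then be passed to new StanfordCoreNLP(props),
        // as in the init() method of the question.
    }
}
```

Keeping the paths in a list makes it harder to leave a stray space or trailing comma in the property value.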

Please let me know whether this works or not, and I can look into it more deeply!

Here is the answer from the NER FAQ:

http://nlp.stanford.edu/software/crf-faq.shtml

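Is the NER deterministic? Why do the results change for the same data?

Yes, the underlying CRF is deterministic. If you apply the NER to the same sentence more than once, though, it is possible to get different answers the second time. The reason is that the NER remembers whether it has previously seen a word in its lowercase form.

The exact way this is used as a feature is in the word shape feature, which treats a word such as "Brown" differently depending on whether it has or has not seen "brown" as a lowercase word before. If it has, the word shape will be "Initial upper, have seen all lowercase"; if it has not, the word shape will be "Initial upper, have not seen all lowercase".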
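The mechanism described above can be illustrated with a toy model. The sketch below is not CoreNLP's code: the class name `KnownLcWordShapeDemo` and the shape strings are made up for illustration. It only shows how remembering lowercase words makes the shape of "Brown" depend on what was processed earlier, and how a switch in the spirit of useKnownLCWords removes that dependence.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model of a "known lowercase words" word-shape feature.
// Hypothetical illustration only; not the CoreNLP implementation.
class KnownLcWordShapeDemo {
    // State that survives across calls, like the memory the FAQ describes.
    private final Set<String> knownLowerCaseWords = new HashSet<>();
    private final boolean useKnownLCWords;

    public KnownLcWordShapeDemo(boolean useKnownLCWords) {
        this.useKnownLCWords = useKnownLCWords;
    }

    public String shape(String word) {
        if (word.isEmpty()) {
            return "EMPTY";
        }
        boolean allLower = word.chars().allMatch(Character::isLowerCase);
        if (allLower) {
            if (useKnownLCWords) {
                knownLowerCaseWords.add(word); // remember the lowercase form
            }
            return "ALL-LOWER";
        }
        if (Character.isUpperCase(word.charAt(0))) {
            if (!useKnownLCWords) {
                return "INIT-UPPER"; // no memory consulted, so always the same
            }
            return knownLowerCaseWords.contains(word.toLowerCase(Locale.ROOT))
                    ? "INIT-UPPER-SEEN-LOWER"
                    : "INIT-UPPER-NOT-SEEN-LOWER";
        }
        return "OTHER";
    }

    public static void main(String[] args) {
        KnownLcWordShapeDemo stateful = new KnownLcWordShapeDemo(true);
        System.out.println(stateful.shape("Brown")); // before "brown" was seen
        stateful.shape("brown");                     // the memory is updated here
        System.out.println(stateful.shape("Brown")); // same input, different shape

        KnownLcWordShapeDemo stateless = new KnownLcWordShapeDemo(false);
        System.out.println(stateless.shape("Brown"));
        stateless.shape("brown");
        System.out.println(stateless.shape("Brown")); // unchanged
    }
}
```

Because the feature value feeds the classifier, a different shape for the same token can plausibly tip a borderline label such as MISC versus O, which matches the symptom in the question.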

In recent versions, this feature can be turned off with the flag -useKnownLCWords false

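After doing some research, I found that the problem is in the ClassifierCombiner.classify() method. One of the baseClassifiers loaded by default, edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz, returns a different type in some cases. I tried loading only the first model to work around this.

The problem is in the following area of the code:

CRFClassifier.classifyMaxEnt()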

    int[] bestSequence = tagInference.bestSequence(model); // line 1249

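When invoked multiple times, ExactBestSequenceFinder.bestSequence() returns a different sequence for the above model on the same input.

I am not sure whether this needs a code fix or some configuration change to the model. Any additional insight is appreciated.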
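One way to confirm this kind of drift, and to check whether a workaround removes it, is a small harness that runs the same text through a tagger several times and reports the first run whose output differs from the first. The sketch below is generic Java and does not call CoreNLP. The class name `DeterminismCheck` and the stub taggers are hypothetical; in practice the `Function` would wrap the pipeline call and return the list of entity types.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper for spotting run-to-run differences on identical input.
class DeterminismCheck {

    // Returns the zero-based index of the first run whose output differs
    // from run 0, or -1 if all runs agree.
    public static <T> int firstDivergingRun(Function<String, T> tagger, String text, int runs) {
        T baseline = tagger.apply(text);
        for (int i = 1; i < runs; i++) {
            if (!baseline.equals(tagger.apply(text))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stub that mimics the reported symptom: MISC on first use, O afterwards.
        AtomicInteger calls = new AtomicInteger();
        Function<String, String> drifting =
                text -> calls.incrementAndGet() == 1 ? "MISC" : "O";
        System.out.println(firstDivergingRun(drifting, "Appropriate Time", 3)); // prints 1

        // Stub with no hidden state.
        Function<String, String> stable = text -> "O";
        System.out.println(firstDivergingRun(stable, "Appropriate Time", 3)); // prints -1
    }
}
```

Running such a check against a freshly constructed pipeline each time, and again against a reused one, would help separate state carried inside the classifier from anything in the surrounding service code.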
