Tag: stanford nlp

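How to remove invalid Unicode characters from a string in Java: I am using the CoreNLP neural network dependency parser to parse some social media content. Unfortunately, the file contains characters that, according to fileformat.info, are either not valid Unicode characters or are the Unicode replacement character, for example U+D83D or U+FFFD. If these characters are in the file, CoreNLP responds with error messages like this: Nov 15, 2015 5:15:38 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) Based on this answer, I tried document.replaceAll("\\p{C}", ""); to simply remove those characters, where document is just the document as a string. But that did not help. How can I remove these characters from the string before passing it to CoreNLP? Update (Nov 16): For completeness, I should mention that I am only asking this in order to avoid the large number of error messages by preprocessing the file. CoreNLP simply ignores the characters it cannot handle, so that part is not a problem.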
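
One possible way to pre-filter the string, shown as a minimal sketch using only the Java standard library. It assumes the offending characters are unpaired surrogate code units (such as a lone U+D83D) and the replacement character U+FFFD; the class name UnicodeCleaner and the method clean are hypothetical names chosen for illustration. Two facts are worth keeping in mind when comparing it with the replaceAll attempt: String.replaceAll returns a new string and leaves the original untouched, so its result must be assigned, and U+FFFD has general category So, so \p{C} does not match it.

```java
public final class UnicodeCleaner {
    private UnicodeCleaner() {}

    /**
     * Returns a copy of the text without unpaired surrogates and without
     * the replacement character U+FFFD. Valid surrogate pairs are kept.
     * (Hypothetical helper, not part of CoreNLP.)
     */
    public static String clean(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            boolean validPair = Character.isHighSurrogate(c)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1));
            if (validPair) {
                // A well-formed pair encodes one supplementary code point: keep both units.
                out.append(c).append(text.charAt(i + 1));
                i += 2;
            } else if (Character.isSurrogate(c) || c == '\uFFFD') {
                // Lone surrogate or replacement character: drop it.
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "ok \uD83D bad \uFFFD end";
        System.out.println(clean(dirty));
    }
}
```

If the warning is in fact triggered by well-formed surrogate pairs (emoji, for instance), this filter leaves them in place by design; the validPair branch would then have to drop the pair instead of keeping it.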

Using Stanford CoreNLP: I am trying to use Stanford CoreNLP. I took some code from the web to understand what the coreference tool does. I tried running the project in Eclipse but keep hitting an out-of-memory exception. I tried increasing the heap size, but it made no difference. Any ideas why this happens? Is this a problem specific to the code? Any pointers on using CoreNLP would be great. Edit: code added. import edu.stanford.nlp.dcoref.CorefChain; import edu.stanford.nlp.dcoref.CorefCoreAnnotations; import edu.stanford.nlp.pipeline.Annotation; import edu.stanford.nlp.pipeline.StanfordCoreNLP; import java.util.Iterator; import java.util.Map; import java.util.Properties; public class testmain { public static void main(String[] args) { String text = "Viki is a smart boy. He knows a lot of things."; Annotation document = new Annotation(text); Properties […]

How to turn off Stanford CoreNLP Redwood logging?: How can I turn off the Stanford CoreNLP messages (see the end of this post)? I first tried setting log4j.category.edu.stanford=OFF in log4j.properties, but that did not help, and I then found out that it apparently uses a non-standard logging framework called "Redwood". According to http://nlp.stanford.edu/nlp/javadoc/javanlp/ there is documentation, but it is password protected. I tried RedwoodConfiguration.empty().apply(); but that did not help either. The log messages: Adding annotator tokenize Adding annotator ssplit Adding annotator pos Loading default properties from tagger edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger … done [1,2 sec]. PS: Redwood.hideAllChannels(); does not work either. However, the following suppresses my own logging statements (but not those from StanfordCoreNLP): RedwoodConfiguration.empty().apply(); Redwood.log("test redwood"); Solution: OK, StevenC was right. These are not logging statements after all; the default initialization messages are written to stderr, which I did not expect given that Stanford has its own logging framework and then does not use it :-) Anyway, his hint led me to this solution: // shut off the annoying initialization messages RedwoodConfiguration.empty().captureStderr().apply(); nlp = new StanfordCoreNLP(myproperties); // […]
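
Since the messages are reported to go to stderr, a library-independent alternative is to swap System.err for a discarding stream while the noisy code runs and restore it afterwards. The following is only a sketch of that general technique using the Java standard library; QuietStderr and call are hypothetical names, and it is not the Redwood-based solution quoted in the post. It assumes the noisy code writes through System.err at the time it runs; output from code that cached the original stream earlier would not be silenced.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.util.function.Supplier;

public final class QuietStderr {
    private QuietStderr() {}

    /**
     * Runs the action with System.err redirected to a sink that discards
     * everything, then restores the original stream even if the action throws.
     * (Hypothetical helper, not part of CoreNLP.)
     */
    public static <T> T call(Supplier<T> action) {
        PrintStream original = System.err;
        PrintStream sink = new PrintStream(new OutputStream() {
            @Override
            public void write(int b) {
                // Discard every byte.
            }
        });
        System.setErr(sink);
        try {
            return action.get();
        } finally {
            System.setErr(original);
        }
    }

    public static void main(String[] args) {
        // Stand-in for an expensive, chatty initialization step.
        String result = call(() -> {
            System.err.println("noisy initialization message");
            return "pipeline ready";
        });
        System.out.println(result);
    }
}
```

In the setting of the post, the supplier would wrap the pipeline construction, for example call(() -> new StanfordCoreNLP(myproperties)). Note that System.setErr affects the whole JVM, so other threads lose their stderr output for the duration of the call.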

Resolving coreference with Stanford CoreNLP – unable to load the parser model: I want to do a very simple job: given a string containing pronouns, I want to resolve them. For example, I want to turn the sentence "Mary has a little lamb. She is cute." into "Mary has a little lamb. Mary is cute." I have tried to use Stanford CoreNLP. However, I cannot seem to get the parser to start. I have imported all the included jars into my project using Eclipse, and I have allocated 3 GB to the JVM (-Xmx3g). The error is very awkward: Exception in thread "main" java.lang.NoSuchMethodError: edu.stanford.nlp.parser.lexparser.LexicalizedParser.loadModel(Ljava/lang/String;[Ljava/lang/String;)Ledu/stanford/nlp/parser/lexparser/LexicalizedParser; I do not understand where that L comes from, and I think it is the root of my problem. It is rather strange. I tried looking into the source files, but there is no wrong reference there. Code: import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation; import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation; import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefGraphAnnotation; import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation; import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation; import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation; import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation; import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation; import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation; import edu.stanford.nlp.ling.CoreLabel; import edu.stanford.nlp.dcoref.CorefChain; import edu.stanford.nlp.pipeline.*; […]

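How to create my own training corpus for the Stanford tagger?: I have to analyze informal English text containing a lot of shorthand and local slang. I am therefore thinking of creating a model for the Stanford tagger. How do I create my own tagged corpus for the Stanford tagger to train on? What is the syntax of the corpus, and how large should my corpus be to achieve the desired performance?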

Coreference resolution using Stanford CoreNLP: I am new to the Stanford CoreNLP toolkit and am trying to use it for a project that resolves coreferences in news texts. To use the Stanford CoreNLP coreference system, we usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition and parsing. For example: Properties props = new Properties(); props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref"); StanfordCoreNLP pipeline = new StanfordCoreNLP(props); // read some text in the text variable String text = "As competition heats up in Spain's crowded bank market, Banco Exterior de Espana is seeking to shed its image of a […]

Stanford CoreNLP gives a NullPointerException: I am trying to get to grips with the Stanford CoreNLP API. I want to tokenize a simple sentence using the following code: Properties props = new Properties(); props.put("annotators", "tokenize"); StanfordCoreNLP pipeline = new StanfordCoreNLP(props); // read some text in the text variable String text = "I wish this code would run."; // create an empty Annotation just with the given text Annotation document = new Annotation(text); // run all Annotators on this text pipeline.annotate(document); […]

How to parse languages other than English with the Stanford Parser? In Java, not on the command line: I have been trying to use the Stanford Parser in my Java program to parse some Chinese sentences. Since I am quite new to both Java and the Stanford Parser, I used 'ParserDemo.java' for practice. The code works for English sentences and outputs the correct result. However, when I changed the model to 'chinesePCFG.ser.gz' and tried to parse some segmented Chinese sentences, things went wrong. Here is my code in Java: class ParserDemo { public static void main(String[] args) { LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz"); if (args.length > 0) { demoDP(lp, args[0]); } else { demoAPI(lp); } } public static void demoDP(LexicalizedParser lp, String filename) { // This option shows loading and sentence-segment and tokenizing // a file using DocumentPreprocessor […]

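Converting words between noun/adjective/verb forms in Java: Is there a Java alternative to NLTK for what is described in this question? Converting words between verb/noun/adjective forms. For example, I want to convert "born" to "birth", because when using WordNet similarity, the algorithm does not show that "born" and "birth" are very similar. So I would like to convert "born" to "birth", or vice versa, in order to get more similar words. What do you suggest? I found some tools, but I am not sure whether they can do this: – NLTK (Python only, I guess) – OpenNlp – Stanford-Nlp – Simple NLG Thanks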

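Displaying Stanford NER confidence scores: I am using the Stanford NER CRFClassifier to extract named entities from news articles, and in order to implement active learning I would like to know the confidence score of each class for every tagged entity. Example of the desired output: LOCATION (0.20) PERSON (0.10) ORGANIZATION (0.60) MISC (0.10) Here is my code for extracting named entities from text: AbstractSequenceClassifier classifier = CRFClassifier.getClassifierNoExceptions(classifier_path); String annnotatedText = classifier.classifyWithInlineXML(text); Is there a workaround to get these values along with the annotations?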