How do I use OpenNLP in Java?

I want to POS-tag an English sentence and do some processing on it. I would like to use OpenNLP, and I have installed it.

When I run the command

I:\Workshop\Programming\nlp\opennlp-tools-1.5.0-bin\opennlp-tools-1.5.0>java -jar opennlp-tools-1.5.0.jar POSTagger models\en-pos-maxent.bin < Text.txt

it prints POS-tagged output for the input in Text.txt:

    Loading POS Tagger model ... done (4.009s)
    My_PRP$ name_NN is_VBZ Shabab_NNP i_FW am_VBP 22_CD years_NNS old._.

    Average: 66.7 sent/s
    Total: 1 sent
    Runtime: 0.015s

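I hope this means it is installed properly?

Now, how do I do this from inside a Java application? I have added the OpenNLP tools, jwnl, and maxent jars to the project, but how do I invoke the POS tagging?

Here is some (old) sample code I threw together, with modernized code to follow: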

    package opennlp;

    import opennlp.tools.cmdline.PerformanceMonitor;
    import opennlp.tools.cmdline.postag.POSModelLoader;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    import java.io.File;
    import java.io.IOException;
    import java.io.StringReader;

    public class OpenNlpTest {
        public static void main(String[] args) throws IOException {
            // Load the pre-trained English POS model from the working directory.
            POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
            PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
            POSTaggerME tagger = new POSTaggerME(model);

            String input = "Can anyone help me dig through OpenNLP's horrible documentation?";
            // The <String> type parameter is required so that read() returns a String.
            ObjectStream<String> lineStream =
                    new PlainTextByLineStream(new StringReader(input));

            perfMon.start();
            String line;
            while ((line = lineStream.read()) != null) {
                // Tokenize on whitespace, then tag each token.
                String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
                String[] tags = tagger.tag(whitespaceTokenizerLine);

                // POSSample pairs tokens with tags and prints them as token_TAG.
                POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
                System.out.println(sample.toString());

                perfMon.incrementCounter();
            }
            perfMon.stopAndPrintFinalResult();
        }
    }

The output is:

    Loading POS Tagger model ... done (2.045s)
    Can_MD anyone_NN help_VB me_PRP dig_VB through_IN OpenNLP's_NNP horrible_JJ documentation?_NN

    Average: 76.9 sent/s
    Total: 1 sent
    Runtime: 0.013s

This is basically how the POSTaggerTool class that ships with OpenNLP works. sample.getTags() is a String array holding the tag types themselves, one per token.
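Because the tags come back as an array parallel to the tokens, later processing can simply walk both arrays together. The following is a minimal sketch in plain Java, using hard-coded arrays copied from the output above in place of a live tagger; the class TagFilter and the method wordsWithTagPrefix are hypothetical helpers, not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: operates on the parallel
// token/tag arrays that tagger.tag(...) and sample.getTags() produce.
class TagFilter {

    // Collects every token whose tag starts with the given prefix,
    // e.g. "NN" matches NN, NNS and NNP.
    static List<String> wordsWithTagPrefix(String[] tokens, String[] tags, String prefix) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].startsWith(prefix)) {
                result.add(tokens[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hard-coded stand-ins copied from the tagger output shown above.
        String[] tokens = {"Can", "anyone", "help", "me", "dig", "through",
                "OpenNLP's", "horrible", "documentation?"};
        String[] tags = {"MD", "NN", "VB", "PRP", "VB", "IN", "NNP", "JJ", "NN"};

        System.out.println(wordsWithTagPrefix(tokens, tags, "NN")); // [anyone, OpenNLP's, documentation?]
        System.out.println(wordsWithTagPrefix(tokens, tags, "VB")); // [help, dig]
    }
}
```

In a real program the two arrays would come from WhitespaceTokenizer.INSTANCE.tokenize(line) and tagger.tag(...) as in the code above.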

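This requires direct file access to the training data (the model file), which is really quite lame.

An updated codebase for this is a little different (and probably more useful).

First, a Maven POM: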

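    <project>
        <modelVersion>4.0.0</modelVersion>

        <groupId>org.javachannel</groupId>
        <artifactId>opennlp-example</artifactId>
        <version>1.0-SNAPSHOT</version>

        <dependencies>
            <dependency>
                <groupId>org.apache.opennlp</groupId>
                <artifactId>opennlp-tools</artifactId>
                <version>1.6.0</version>
            </dependency>
            <dependency>
                <groupId>org.testng</groupId>
                <artifactId>testng</artifactId>
                <version>[6.8.21,)</version>
                <scope>test</scope>
            </dependency>
        </dependencies>

        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.1</version>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                    </configuration>
                </plugin>
            </plugins>
        </build>
    </project>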

Here is the code, written as a test and therefore located in ./src/test/java/org/javachannel/opennlp/example:

    package org.javachannel.opennlp.example;

    import opennlp.tools.cmdline.PerformanceMonitor;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import org.testng.annotations.DataProvider;
    import org.testng.annotations.Test;

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.net.URL;
    import java.nio.channels.Channels;
    import java.nio.channels.ReadableByteChannel;
    import java.util.stream.Stream;

    public class POSTest {
        // Copies the resource at url into destination; both streams are closed afterwards.
        private void download(String url, File destination) throws IOException {
            URL website = new URL(url);
            try (ReadableByteChannel rbc = Channels.newChannel(website.openStream());
                 FileOutputStream fos = new FileOutputStream(destination)) {
                fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
            }
        }

        @DataProvider
        Object[][] getCorpusData() {
            // One row whose single argument is an Object[] of input sentences.
            return new Object[][][]{{{
                    "Can anyone help me dig through OpenNLP's horrible documentation?"
            }}};
        }

        @Test(dataProvider = "getCorpusData")
        public void showPOS(Object[] input) throws IOException {
            File modelFile = new File("en-pos-maxent.bin");
            if (!modelFile.exists()) {
                System.out.println("Downloading model.");
                download("http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin", modelFile);
            }
            POSModel model = new POSModel(modelFile);
            PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
            POSTaggerME tagger = new POSTaggerME(model);

            perfMon.start();
            Stream.of(input).map(line -> {
                // Tokenize on whitespace, tag, and render as token_TAG pairs.
                String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line.toString());
                String[] tags = tagger.tag(whitespaceTokenizerLine);

                POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
                perfMon.incrementCounter();
                return sample.toString();
            }).forEach(System.out::println);
            perfMon.stopAndPrintFinalResult();
        }
    }

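This code doesn't actually test anything (it's a smoke test, if anything), but it should serve as a starting point. Another (potentially) nice thing is that it downloads the model for you if you don't already have it.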

The URL http://bulba.sdsu.edu/jeanette/thesis/PennTags.html no longer works. I found the following on slide 14 of http://www.slideshare.net/gagan1667/opennlp-demo:

[Image: slide 14 of the linked presentation]
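As a rough aid for reading the tagger output in this thread, here is a small lookup sketch in Java. It assumes the model emits Penn Treebank tags (which the PennTags.html link suggests) and covers only the tags that appear in the outputs above, not the full tag set; the class PennTagGlossary is a hypothetical helper, not part of OpenNLP.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical glossary, not part of OpenNLP. Covers only the Penn Treebank
// tags that appear in the sample outputs in this thread.
class PennTagGlossary {
    static final Map<String, String> TAGS = new LinkedHashMap<>();

    static {
        TAGS.put("CD", "cardinal number");
        TAGS.put("FW", "foreign word");
        TAGS.put("IN", "preposition or subordinating conjunction");
        TAGS.put("JJ", "adjective");
        TAGS.put("MD", "modal");
        TAGS.put("NN", "noun, singular or mass");
        TAGS.put("NNS", "noun, plural");
        TAGS.put("NNP", "proper noun, singular");
        TAGS.put("PRP", "personal pronoun");
        TAGS.put("PRP$", "possessive pronoun");
        TAGS.put("VB", "verb, base form");
        TAGS.put("VBP", "verb, non-3rd person singular present");
        TAGS.put("VBZ", "verb, 3rd person singular present");
        TAGS.put(".", "sentence-final punctuation");
    }

    // Returns a short description, or "unknown tag" for anything not listed here.
    static String describe(String tag) {
        return TAGS.getOrDefault(tag, "unknown tag");
    }

    public static void main(String[] args) {
        // Annotate a few token_TAG items copied from the command-line output.
        for (String item : "My_PRP$ name_NN is_VBZ Shabab_NNP".split(" ")) {
            String tag = item.substring(item.lastIndexOf('_') + 1);
            System.out.println(item + " : " + describe(tag));
        }
    }
}
```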

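The answers above do show how to use OpenNLP's existing models, but if you need to train your own model, the following may help.

Here is a detailed tutorial with complete code: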

https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php

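Depending on your domain, you can build the dataset automatically or manually. Building such a dataset by hand can be very painful; tools like a POS tagger can make the process easier.

Training data format

Training data is passed as a text file in which each line is one data item. Each word in a line should be labeled in the form "word_LABEL", with the word and the label name separated by an underscore "_".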

    anki_Brand overdrive_Brand
    just_ModelName dance_ModelName 2018_ModelName
    aoc_Brand 27"_ScreenSize monitor_Category
    horizon_ModelName zero_ModelName dawn_ModelName
    cm_Unknown 700_Unknown modem_Category
    computer_Category
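Before training, it can be worth checking that every item in the file really follows the word_LABEL shape. The following is a minimal sketch in plain Java; the class TrainingLineCheck and its parseLine method are hypothetical helpers, and splitting at the last underscore is an assumption made here so that a word containing "_" still yields its trailing label.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker for the "word_LABEL" training format; not part of OpenNLP.
class TrainingLineCheck {

    // Splits one training line into {word, label} pairs.
    // Assumption: the label starts after the LAST underscore of each item.
    static List<String[]> parseLine(String line) {
        List<String[]> pairs = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return pairs; // a blank line carries no items
        }
        for (String item : trimmed.split("\\s+")) {
            int sep = item.lastIndexOf('_');
            if (sep <= 0 || sep == item.length() - 1) {
                throw new IllegalArgumentException("Not in word_LABEL form: " + item);
            }
            pairs.add(new String[] {item.substring(0, sep), item.substring(sep + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // One line taken from the sample data above.
        for (String[] pair : parseLine("aoc_Brand 27\"_ScreenSize monitor_Category")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Running parseLine over each line of the training file before calling the trainer surfaces malformed items early, with the offending item named in the exception message.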

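Train the model

The important class here is POSModel, which holds the actual model. We use the class POSTaggerME to build the model. Below is the code that builds a model from a training data file: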

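    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerFactory;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.postag.WordTagSampleStream;
    import opennlp.tools.util.InputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    // Trains a POS model from the file at filepath; returns null if training fails.
    public POSModel train(String filepath) {
        TrainingParameters parameters = TrainingParameters.defaultParams();
        parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");

        try (InputStream dataIn = new FileInputStream(filepath)) {
            // Read the training file line by line as UTF-8.
            ObjectStream<String> lineStream = new PlainTextByLineStream(new InputStreamFactory() {
                @Override
                public InputStream createInputStream() throws IOException {
                    return dataIn;
                }
            }, StandardCharsets.UTF_8);
            // Turn each line of word_LABEL items into a POSSample.
            ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

            return POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }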

Use the model to do tagging

Finally, here is how the model can be used to tag unseen queries:

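    import java.util.Arrays;
    import java.util.List;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.util.Sequence;

    public void doTagging(POSModel model, String input) {
        // Split the query into tokens on single spaces.
        String[] tokens = input.trim().split(" ");
        POSTaggerME tagger = new POSTaggerME(model);

        // Each Sequence is one candidate tagging of the whole input.
        Sequence[] sequences = tagger.topKSequences(tokens);
        for (Sequence s : sequences) {
            List<String> tags = s.getOutcomes();
            System.out.println(Arrays.asList(tokens) + " =>" + tags);
        }
    }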
