Training a named entity model in OpenNLP

I want to train a name finder model on a corpus of Indian names:

    class NameTraining {
        public static void TrainNames() throws IOException {
            Charset charset = Charset.forName("UTF-8");
            FileReader fileReader = new FileReader("train.txt");
            ObjectStream fileStream = new PlainTextByLineStream(fileReader);
            ObjectStream sampleStream = new NameSampleDataStream(fileStream);
            TokenNameFinderModel model = NameFinderME.train("pt-br", "train", sampleStream,
                    Collections.emptyMap());
            NameFinderME nfm = new NameFinderME(model);
        }

        public static void main(String args[]) throws IOException {
            NameTraining det = new NameTraining();
            det.TrainNames();
        }
    }

I compile it with the following command:

 javac -cp $(echo lib/*.jar | tr ' ' ':') NameTraining.java -Xlint:unchecked 

But I get these messages:

    NameTraining.java:35: warning: [unchecked] unchecked conversion
    found   : opennlp.tools.util.ObjectStream
    required: opennlp.tools.util.ObjectStream<java.lang.String>
            ObjectStream sampleStream = new NameSampleDataStream(fileStream);
                                                                 ^
    NameTraining.java:36: warning: [unchecked] unchecked conversion
    found   : opennlp.tools.util.ObjectStream
    required: opennlp.tools.util.ObjectStream<opennlp.tools.namefind.NameSample>
            TokenNameFinderModel model = NameFinderME.train("pt-br", "train", sampleStream, Collections.emptyMap());
                                                                              ^
    2 warnings

I would like to know two things:

  1. Is the code above suitable for training, and if so, how do I check the result after training?
  2. What do these warnings mean?

Hi, I got training to work with a small data set:

    public static void TrainNames() throws IOException {
        // Train the model from a file in the OpenNLP name finder training format.
        Charset charset = Charset.forName("UTF-8");
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream("/home/yogi.singh/dev/java/nlp/data/en-ner-person.train"), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        //FileReader fileReader = new FileReader("train.txt");
        //ObjectStream fileStream = new PlainTextByLineStream(fileReader);
        //ObjectStream sampleStream = new NameSampleDataStream(fileStream);
        TokenNameFinderModel model = NameFinderME.train("en", "person", sampleStream,
                Collections.emptyMap());
        NameFinderME nfm = new NameFinderME(model);

        // Read the text to be tagged into a single string.
        String sentence = "";
        BufferedReader br = new BufferedReader(new FileReader("/home/yogi.singh/dev/java/nlp/train.txt"));
        try {
            StringBuilder sb = new StringBuilder();
            String line = br.readLine();
            while (line != null) {
                sb.append(line);
                sb.append('\n');
                line = br.readLine();
            }
            sentence = sb.toString();
        } finally {
            br.close();
        }

        // Tokenize the text with a pre-trained tokenizer model.
        InputStream is1 = new FileInputStream("/home/yogi.singh/dev/java/nlp/data/en-token.bin");
        TokenizerModel model1 = new TokenizerModel(is1);
        Tokenizer tokenizer = new TokenizerME(model1);
        String tokens[] = tokenizer.tokenize(sentence);
        for (String a : tokens)
            System.out.println(a);

        // Run the trained name finder and print each span with the tokens it covers.
        Span nameSpans[] = nfm.find(tokens);
        for (Span s : nameSpans) {
            System.out.print(s.toString());
            System.out.print(" ");
            for (int index = s.getStart(); index < s.getEnd(); index++) {
                System.out.print(tokens[index] + " ");
            }
            System.out.println(" ");
        }
    }
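When checking the output, it helps to keep in mind that each span is a half-open token range, [start, end), which is why the printing loop above stops before getEnd(). Here is a minimal sketch of that lookup in plain Java. TokenRange is a hypothetical stand-in for OpenNLP's Span so that the example runs without the library or a trained model, and the sentence and ranges are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class SpanText {
    /** Hypothetical stand-in for opennlp.tools.util.Span: a half-open token range [start, end). */
    static final class TokenRange {
        final int start;
        final int end;

        TokenRange(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Joins the tokens covered by one range with single spaces. */
    static String covered(String[] tokens, TokenRange range) {
        StringBuilder sb = new StringBuilder();
        for (int i = range.start; i < range.end; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    /** Returns the covered text of every range, in the order given. */
    static List<String> names(String[] tokens, TokenRange[] ranges) {
        List<String> result = new ArrayList<String>();
        for (TokenRange range : ranges) {
            result.add(covered(tokens, range));
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented example data, not output from a real model.
        String[] tokens = {"Yesterday", "Rahul", "Sharma", "met", "Priya", "."};
        TokenRange[] found = {new TokenRange(1, 3), new TokenRange(4, 5)};
        for (String name : names(tokens, found)) {
            System.out.println(name);
        }
    }
}
```

Comparing names extracted this way against a hand-labelled list for a few held-out sentences is a quick first check of whether the trained model is finding what you expect.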

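The warnings come from how the code uses Java generics, not from OpenNLP. ObjectStream is a generic type, and declaring it without a type argument (as a raw type) forces the compiler to make an unchecked conversion wherever a parameterized stream is required.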

Try this:

    ObjectStream<String> fileStream = new PlainTextByLineStream(fileReader);
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(fileStream);
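The same kind of warning can be reproduced without OpenNLP, which may make the cause easier to see. In the sketch below, Source<T> and ListSource<T> are hypothetical stand-ins for ObjectStream<T> and one of its implementations; they are not part of any library. Compiling with -Xlint:unchecked should report an unchecked conversion for the call that passes the raw variable and nothing for the parameterized one:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class RawTypeDemo {
    /** Hypothetical stand-in for ObjectStream<T>: returns items until it returns null. */
    interface Source<T> {
        T read();
    }

    /** Simple Source backed by a list. */
    static final class ListSource<T> implements Source<T> {
        private final Iterator<T> it;

        ListSource(List<T> items) {
            this.it = items.iterator();
        }

        public T read() {
            return it.hasNext() ? it.next() : null;
        }
    }

    /** Requires a Source<String>, much as NameSampleDataStream requires an ObjectStream<String>. */
    static int countLines(Source<String> lines) {
        int n = 0;
        while (lines.read() != null) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Raw type: the declaration compiles, but the element type is lost.
        Source raw = new ListSource<String>(Arrays.asList("a", "b"));
        // Passing the raw variable where Source<String> is required is the unchecked conversion.
        System.out.println(countLines(raw));

        // Parameterized type: the compiler can verify the element type, so there is no warning.
        Source<String> typed = new ListSource<String>(Arrays.asList("a", "b", "c"));
        System.out.println(countLines(typed));
    }
}
```

Both calls behave the same at run time; the warning only says that the compiler could not verify the element type of the raw variable.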