How to extract noun phrases from parsed text

I have parsed a text with a constituency parser and copied the result into a text file like the one below:

    (ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP we)) (VP (VBD went) (PP (TO to)....
    (ROOT (FRAG (SBAR (SBAR (IN While) (S (NP (PRP I)) (VP (VBD was) (NP (NP (EX...
    (ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP I)) (VP (VBD went) (PP (TO to.....
    (ROOT (FRAG (SBAR (SBAR (IN While) (S (NP (NNP Jim)) (VP (VBD was) (NP (NP (....
    (ROOT (S (S (NP (PRP I)) (VP (VBD started) (S (VP (VBG talking) (PP.....

I need to extract all noun phrases (NP) from this text file. The code I wrote extracts only the first NP from each line, but I need every noun phrase. My code is:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;

    public class nounPhrase {

        // Returns the index of the ')' that closes the '(' at openPos.
        public static int findClosingParen(char[] text, int openPos) {
            int closePos = openPos;
            int counter = 1;
            while (counter > 0) {
                char c = text[++closePos];
                if (c == '(') {
                    counter++;
                } else if (c == ')') {
                    counter--;
                }
            }
            return closePos;
        }

        public static void main(String[] args) throws IOException {
            ArrayList<String> npList = new ArrayList<>();
            String line;
            String line1;
            int np;
            String Input = "/local/Input/Temp/Temp.txt";
            String Output = "/local/Output/Temp/Temp-out.txt";
            FileInputStream fis = new FileInputStream(Input);
            BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
            while ((line = br.readLine()) != null) {
                char[] lineArray = line.toCharArray();
                // indexOf finds only the first "(NP" on the line
                np = findClosingParen(lineArray, line.indexOf("(NP"));
                line1 = line.substring(line.indexOf("(NP"), np + 1);
                System.out.print(line1 + "\n");
            }
        }
    }

The output is:

    (NP (NN Yesterday))...I need other NPs in this line also
    (NP (PRP I)).....I need other NPs in this line also
    (NP (NNP Jim)).....I need other NPs in this line also
    (NP (PRP I)).....I need other NPs in this line also

My code finds only the first NP on each line, by locating its closing parenthesis, but I need to extract all NPs from the text.

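Writing your own tree parser is a good exercise (!), but if you just want results, the simplest way is to use more of the functionality of the Stanford NLP tools, namely Tregex, which is designed for exactly this kind of thing. You can change your final while loop to the following: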

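    // Needs these imports from Stanford CoreNLP:
    //   edu.stanford.nlp.trees.Tree
    //   edu.stanford.nlp.trees.tregex.TregexPattern
    //   edu.stanford.nlp.trees.tregex.TregexMatcher
    TregexPattern tPattern = TregexPattern.compile("NP");
    while ((line = br.readLine()) != null) {
        Tree t = Tree.valueOf(line);
        TregexMatcher tMatcher = tPattern.matcher(t);
        // find() advances to each node labelled NP, nested ones included
        while (tMatcher.find()) {
            System.out.println(tMatcher.getMatch());
        }
    }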

Here you go. I changed it a little and it got messy, but I can clean it up if you really need the code to look pretty.

    import java.io.*;
    import java.util.*;

    public class nounPhrase {
        public static void main(String[] args) throws IOException {
            ArrayList npList = new ArrayList();
            String line = "";
            String line1 = "";
            String Input = "/local/Input/Temp/Temp.txt";
            String Output = "/local/Output/Temp/Temp-out.txt";
            FileInputStream fis = new FileInputStream(Input);
            BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
            while ((line = br.readLine()) != null) {
                char[] lineArray = line.toCharArray();
                int temp;
                for (int i=0; i+2

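And FYI, the main reason your code picks up only the first occurrence of NP is that you use the indexOf method to find its position. Called with a single argument, indexOf always returns the first occurrence of the string you search for, and only that one.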
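One way around this, as a minimal sketch rather than the answerer's own code: `String.indexOf(String, int)` takes a start offset, so each search can resume just past the previous match. The class name `NpScanner` and the method `findAllNPs` are hypothetical names chosen for illustration; `findClosingParen` follows the same logic as the helper in the question. The sketch assumes each line has balanced parentheses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original question.
class NpScanner {

    // Same idea as the question's helper: advance until the
    // parenthesis opened at openPos is balanced again.
    static int findClosingParen(String text, int openPos) {
        int depth = 1;
        int pos = openPos;
        while (depth > 0) {
            char c = text.charAt(++pos);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            }
        }
        return pos;
    }

    // Collects every "(NP ...)" span in one parse line.
    static List<String> findAllNPs(String parse) {
        List<String> result = new ArrayList<>();
        int start = parse.indexOf("(NP");
        while (start != -1) {
            int end = findClosingParen(parse, start);
            result.add(parse.substring(start, end + 1));
            // Resume one character after this match, not after its
            // closing parenthesis, so NPs nested inside it are found too.
            start = parse.indexOf("(NP", start + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        String parse = "(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP we)) (VP (VBD went))))";
        for (String np : findAllNPs(parse)) {
            System.out.println(np);
        }
    }
}
```

Resuming at `start + 1` reports nested NPs as separate entries; resuming at `end + 1` instead would report only the outermost ones.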

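After getting the first NP, you have to keep iterating over the parse tree and move the noun-phrase index forward. A simple way is to take a substring of your line variable, starting at index np + 1. Here are the changes you can make to your code: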

    while ((line = br.readLine()) != null) {
        char[] lineArray = line.toCharArray();
        int indexOfNP = line.indexOf("(NP");
        while (indexOfNP != -1) {
            np = findClosingParen(lineArray, indexOfNP);
            line1 = line.substring(indexOfNP, np + 1);
            System.out.print(line1 + "\n");
            npList.add(line1);
            // Drop everything up to the end of this NP and search again
            line = line.substring(np + 1);
            indexOfNP = line.indexOf("(NP");
            lineArray = line.toCharArray();
        }
    }

For a recursive solution:

    // findClosingParen is the helper from the question.
    public static void main(String[] args) throws IOException {
        ArrayList<String> npList = new ArrayList<>();
        String line;
        String Input = "/local/Input/Temp/Temp.txt";
        String Output = "/local/Output/Temp/Temp-out.txt";
        FileInputStream fis = new FileInputStream(Input);
        BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
        while ((line = br.readLine()) != null) {
            int indexOfNP = line.indexOf("(NP");
            if (indexOfNP >= 0)
                extractNPs(npList, line, indexOfNP);
        }
        for (String npString : npList) {
            System.out.println(npString);
        }
        br.close();
        fis.close();
    }

    public static ArrayList<String> extractNPs(ArrayList<String> arr, String parse, int indexOfNP) {
        if (indexOfNP == -1) {
            return arr;
        } else {
            int npIndex = findClosingParen(parse.toCharArray(), indexOfNP);
            String mainNP = new String(parse.substring(indexOfNP, npIndex + 1));
            arr.add(mainNP);
            // Uncomment the lines below if you also want MainNP along with all NPs
            // within MainNP to be extracted
            /*
            mainNP = new String(mainNP.substring(3));
            if (mainNP.indexOf("(NP") > 0) {
                return extractNPs(arr, mainNP, mainNP.indexOf("(NP"));
            }
            */
            // Continue with the text that follows this NP
            parse = new String(parse.substring(npIndex + 1));
            indexOfNP = parse.indexOf("(NP");
            return extractNPs(arr, parse, indexOfNP);
        }
    }

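You are building a parser (…for the code generated by your natural language parser), which is a topic with extensive academic literature. The simplest parser you can build is an LL parser. Take a look at this article from Wikipedia, which has some very good examples you can draw inspiration from: http://en.wikipedia.org/wiki/LL_parser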

The Wikipedia entry on parsing gives you a feel for the field in general: http://en.wikipedia.org/wiki/Parsing
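As a rough illustration of the LL idea applied to this bracketed format, here is a small hand-written recursive-descent parser, which is the usual hand-coded form of an LL(1) parser: one character of lookahead decides whether a node holds child nodes or a single word. The grammar, the class `BracketParser`, its `Node` type, and the `collect` helper are all assumptions made for this sketch; they are not taken from the Stanford tools or from the Wikipedia article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch. Assumed grammar: node := '(' label ( node+ | word ) ')'
class BracketParser {

    // One tree node: a label plus either child nodes or a single word.
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        String word; // set only for leaves such as (NN Yesterday)

        Node(String label) { this.label = label; }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder("(").append(label);
            if (word != null) sb.append(' ').append(word);
            for (Node child : children) sb.append(' ').append(child);
            return sb.append(')').toString();
        }
    }

    private final String text;
    private int pos = 0;

    BracketParser(String text) { this.text = text; }

    Node parseNode() {
        skipSpaces();
        expect('(');
        Node node = new Node(readToken());
        skipSpaces();
        if (peek() == '(') {
            // Lookahead is '(' so this node contains child nodes.
            while (peek() == '(') {
                node.children.add(parseNode());
                skipSpaces();
            }
        } else {
            // Otherwise it is a leaf holding one word.
            String w = readToken();
            node.word = w.isEmpty() ? null : w;
            skipSpaces();
        }
        expect(')');
        return node;
    }

    private char peek() {
        return pos < text.length() ? text.charAt(pos) : '\0';
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    private void skipSpaces() {
        while (Character.isWhitespace(peek())) pos++;
    }

    // Reads characters up to the next space or parenthesis.
    private String readToken() {
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(peek())
                && peek() != '(' && peek() != ')') {
            pos++;
        }
        return text.substring(start, pos);
    }

    // Depth-first walk that gathers every node carrying the given label.
    static void collect(Node node, String label, List<Node> out) {
        if (node.label.equals(label)) out.add(node);
        for (Node child : node.children) collect(child, label, out);
    }

    public static void main(String[] args) {
        String parse = "(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP we)) (VP (VBD went))))";
        Node root = new BracketParser(parse).parseNode();
        List<Node> nps = new ArrayList<>();
        collect(root, "NP", nps);
        for (Node np : nps) System.out.println(np);
    }
}
```

Once the text is a real tree, extracting NPs is just a tree walk, and an unbalanced line fails with a clear error instead of running off the end of the string.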