Java library to extract keywords from input text

I am looking for a Java library to extract keywords from a block of text.

The process should be as follows:

stop word cleaning -> stemming -> searching for keywords based on English linguistics statistical information, meaning: if a word appears more often in the text than its probability of occurring in the English language would suggest, then it is a keyword candidate.

Is there a library that performs this task?

Here is a possible solution using Apache Lucene. I did not use the latest version but the 3.6.2 one, since it is the one I know best. Besides the /lucene-core-x.x.x.jar, do not forget to add the /contrib/analyzers/common/lucene-analyzers-x.x.x.jar from the downloaded archive to your project: it contains the language-specific analyzers (especially the English one, for your case).

Note that this will only find the frequencies of the input text words, based on their respective stems. Comparing these frequencies against English language statistics should be done afterwards (this answer may help, by the way).


The data model

One keyword per stem. Different words may have the same stem, hence the terms set. The keyword frequency is incremented every time a new term is found (even if it has already been found, since a set automatically removes duplicates).

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class Keyword implements Comparable<Keyword> {

        private final String stem;
        private final Set<String> terms = new HashSet<String>();
        private int frequency = 0;

        public Keyword(String stem) {
            this.stem = stem;
        }

        public void add(String term) {
            terms.add(term);
            frequency++;
        }

        @Override
        public int compareTo(Keyword o) {
            // descending order
            return Integer.valueOf(o.frequency).compareTo(frequency);
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            } else if (!(obj instanceof Keyword)) {
                return false;
            } else {
                return stem.equals(((Keyword) obj).stem);
            }
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(new Object[] { stem });
        }

        public String getStem() {
            return stem;
        }

        public Set<String> getTerms() {
            return terms;
        }

        public int getFrequency() {
            return frequency;
        }

    }

Utilities

To stem a word:

    public static String stem(String term) throws IOException {

        TokenStream tokenStream = null;
        try {

            // tokenize
            tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
            // stem
            tokenStream = new PorterStemFilter(tokenStream);

            // add each token in a set, so that duplicates are removed
            Set<String> stems = new HashSet<String>();
            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                stems.add(token.toString());
            }

            // if no stem or 2+ stems have been found, return null
            if (stems.size() != 1) {
                return null;
            }
            String stem = stems.iterator().next();
            // if the stem has non-alphanumerical chars, return null
            if (!stem.matches("[a-zA-Z0-9-]+")) {
                return null;
            }

            return stem;

        } finally {
            if (tokenStream != null) {
                tokenStream.close();
            }
        }

    }
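For a quick sanity check of this helper, the calls below should return the stems listed in the results further down. Treat the exact outputs as assumptions, since they depend on the Porter stemmer's rules (IOException handling is omitted in this fragment):

    // Rough sanity check of stem(); expected outputs are assumptions based on the Porter stemmer
    System.out.println(stem("compilers"));   // expected: "compil"
    System.out.println(stem("originally"));  // expected: "origin"
    System.out.println(stem("source code")); // expected: null (more than one stem produced)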

To search a collection (it will be used by the list of potential keywords):

    public static <T> T find(Collection<T> collection, T example) {
        for (T element : collection) {
            if (element.equals(example)) {
                return element;
            }
        }
        collection.add(example);
        return example;
    }

Core

Here is the main entry method:

    public static List<Keyword> guessFromString(String input) throws IOException {

        TokenStream tokenStream = null;
        try {

            // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
            input = input.replaceAll("-+", "-0");
            // replace any punctuation char but apostrophes and dashes by a space
            input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
            // replace most common english contractions
            input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

            // tokenize input
            tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
            // to lowercase
            tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
            // remove dots from acronyms (and "'s" but already done manually above)
            tokenStream = new ClassicFilter(tokenStream);
            // convert any char to ASCII
            tokenStream = new ASCIIFoldingFilter(tokenStream);
            // remove english stop words
            tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());

            List<Keyword> keywords = new LinkedList<Keyword>();
            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                String term = token.toString();
                // stem each term
                String stem = stem(term);
                if (stem != null) {
                    // create the keyword or get the existing one if any
                    Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                    // add its corresponding initial token
                    keyword.add(term.replaceAll("-0", "-"));
                }
            }

            // reverse sort by frequency
            Collections.sort(keywords);

            return keywords;

        } finally {
            if (tokenStream != null) {
                tokenStream.close();
            }
        }

    }

Using the guessFromString method on the introduction part of the Java Wikipedia article, here are the top 10 most frequent keywords (i.e. stems) that were found:

    java         x12    [java]
    compil       x5     [compiled, compiler, compilers]
    sun          x5     [sun]
    develop      x4     [developed, developers]
    languag      x3     [languages, language]
    implement    x3     [implementation, implementations]
    applic       x3     [application, applications]
    run          x3     [run]
    origin       x3     [originally, original]
    gnu          x3     [gnu]

To know the original words that produced each stem, iterate over the output list and read each keyword's terms set (shown between brackets [...] in the example above).
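For instance, a minimal sketch of that iteration (the printed layout is just an assumption; input is assumed to hold the analyzed text and IOException handling is left to the caller):

    // Print each stem with its frequency and the original words that produced it
    for (Keyword keyword : guessFromString(input)) {
        System.out.println(keyword.getStem() + " x" + keyword.getFrequency() + " " + keyword.getTerms());
    }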


What's next

Compare the stem frequency / frequencies sum ratios against the English language statistics, and keep me in the loop if you manage to do it: I might be quite interested too :)
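For illustration only, here is a rough sketch of what that comparison step could look like. The corpusProbability map (expected stem probabilities in general English), the default probability for unseen stems and the ratio threshold are all hypothetical, not part of the code above:

    import java.util.LinkedList;
    import java.util.List;
    import java.util.Map;

    // Hypothetical ranking step: keep only the stems that are clearly over-represented
    // in the input text compared to their assumed probability in general English.
    public class CorpusFilter {

        public static List<Keyword> filterByCorpus(List<Keyword> keywords, int totalTokens,
                Map<String, Double> corpusProbability, double ratioThreshold) {
            List<Keyword> candidates = new LinkedList<Keyword>();
            for (Keyword keyword : keywords) {
                // observed relative frequency in the analyzed text
                double observed = (double) keyword.getFrequency() / totalTokens;
                // unseen stems get a tiny default probability (assumption)
                double expected = corpusProbability.getOrDefault(keyword.getStem(), 1e-6);
                if (observed / expected >= ratioThreshold) {
                    candidates.add(keyword);
                }
            }
            return candidates;
        }

    }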

An updated and ready-to-use version of the code proposed above.
This code is compatible with Apache Lucene 5.x … 6.x.

The CardKeyword class:

    import java.util.HashSet;
    import java.util.Set;

    /**
     * Keyword card with stem form, terms dictionary and frequency rank
     */
    class CardKeyword implements Comparable<CardKeyword> {

        /**
         * Stem form of the keyword
         */
        private final String stem;

        /**
         * Terms dictionary
         */
        private final Set<String> terms = new HashSet<>();

        /**
         * Frequency rank
         */
        private int frequency;

        /**
         * Build keyword card with stem form
         *
         * @param stem
         */
        public CardKeyword(String stem) {
            this.stem = stem;
        }

        /**
         * Add term to the dictionary and update its frequency rank
         *
         * @param term
         */
        public void add(String term) {
            this.terms.add(term);
            this.frequency++;
        }

        /**
         * Compare two keywords by frequency rank
         *
         * @param keyword
         * @return int, which contains comparison results
         */
        @Override
        public int compareTo(CardKeyword keyword) {
            return Integer.valueOf(keyword.frequency).compareTo(this.frequency);
        }

        /**
         * Get stem's hashcode
         *
         * @return int, which contains stem's hashcode
         */
        @Override
        public int hashCode() {
            return this.getStem().hashCode();
        }

        /**
         * Check if two stems are equal
         *
         * @param o
         * @return boolean, true if two stems are equal
         */
        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof CardKeyword)) return false;
            CardKeyword that = (CardKeyword) o;
            return this.getStem().equals(that.getStem());
        }

        /**
         * Get stem form of keyword
         *
         * @return String, which contains the stem form
         */
        public String getStem() {
            return this.stem;
        }

        /**
         * Get terms dictionary of the stem
         *
         * @return Set<String>, which contains the set of terms of the stem
         */
        public Set<String> getTerms() {
            return this.terms;
        }

        /**
         * Get stem frequency rank
         *
         * @return int, which contains the stem frequency
         */
        public int getFrequency() {
            return this.frequency;
        }

    }

The KeywordsExtractor class:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
    import org.apache.lucene.analysis.standard.ClassicFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.*;

    /**
     * Keywords extractor functionality handler
     */
    class KeywordsExtractor {

        /**
         * Get list of keywords with stem form, frequency rank, and terms dictionary
         *
         * @param fullText
         * @return List<CardKeyword>, which contains keywords cards
         * @throws IOException
         */
        static List<CardKeyword> getKeywordsList(String fullText) throws IOException {

            TokenStream tokenStream = null;

            try {
                // protect the dashed words, don't let them be separated during the processing
                fullText = fullText.replaceAll("-+", "-0");

                // replace any punctuation char but apostrophes and dashes with a space
                fullText = fullText.replaceAll("[\\p{Punct}&&[^'-]]+", " ");

                // replace most common English contractions
                fullText = fullText.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

                StandardTokenizer stdToken = new StandardTokenizer();
                stdToken.setReader(new StringReader(fullText));

                tokenStream = new StopFilter(
                        new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))),
                        EnglishAnalyzer.getDefaultStopSet());
                tokenStream.reset();

                List<CardKeyword> cardKeywords = new LinkedList<>();

                CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

                while (tokenStream.incrementToken()) {

                    String term = token.toString();
                    String stem = getStemForm(term);

                    if (stem != null) {
                        CardKeyword cardKeyword = find(cardKeywords, new CardKeyword(stem.replaceAll("-0", "-")));
                        // restore the dashed words, so they look pretty again
                        cardKeyword.add(term.replaceAll("-0", "-"));
                    }
                }

                // reverse sort by frequency
                Collections.sort(cardKeywords);

                return cardKeywords;
            } finally {
                if (tokenStream != null) {
                    try {
                        tokenStream.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }

        /**
         * Get stem form of the term
         *
         * @param term
         * @return String, which contains the stemmed form of the term
         * @throws IOException
         */
        private static String getStemForm(String term) throws IOException {

            TokenStream tokenStream = null;

            try {
                StandardTokenizer stdToken = new StandardTokenizer();
                stdToken.setReader(new StringReader(term));

                tokenStream = new PorterStemFilter(stdToken);
                tokenStream.reset();

                // eliminate duplicate tokens by adding them to a set
                Set<String> stems = new HashSet<>();

                CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

                while (tokenStream.incrementToken()) {
                    stems.add(token.toString());
                }

                // if no stem or more than one stem has been found, return null
                if (stems.size() != 1) {
                    return null;
                }

                String stem = stems.iterator().next();

                // if the stem form has non-alphanumerical chars, return null
                if (!stem.matches("[a-zA-Z0-9-]+")) {
                    return null;
                }

                return stem;
            } finally {
                if (tokenStream != null) {
                    try {
                        tokenStream.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }

        /**
         * Find sample in collection
         *
         * @param collection
         * @param sample
         * @param <T>
         * @return T, which contains the found object within collection if exists, otherwise the initially searched object
         */
        private static <T> T find(Collection<T> collection, T sample) {

            for (T element : collection) {
                if (element.equals(sample)) {
                    return element;
                }
            }

            collection.add(sample);

            return sample;
        }

    }

The function call:

    String text = "…";

    List<CardKeyword> keywordsList = KeywordsExtractor.getKeywordsList(text);
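Assuming you simply want to print what was found, the result can then be iterated just like in the 3.6.2 version (the output layout is only a suggestion):

    // Print each stem with its frequency rank and the terms that produced it
    for (CardKeyword keyword : keywordsList) {
        System.out.println(keyword.getStem() + " x" + keyword.getFrequency() + " " + keyword.getTerms());
    }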