Extracting words from a text file

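Suppose you have a text file like this one: http://www.gutenberg.org/files/17921/17921-8.txt

Does anyone have a good algorithm, or open-source code, for extracting words from a text file? How do I get all the words while skipping special characters, but still keeping things like "it's"?

I'm working in Java. Thanks.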

This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don't know how to start:

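    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
    // One or more word characters or apostrophes in a row
    Pattern p = Pattern.compile("[\\w']+");
    Matcher m = p.matcher(input);
    while (m.find()) {
        // Print each match on its own line
        System.out.println(input.substring(m.start(), m.end()));
    }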

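The pattern [\w']+ matches one or more consecutive word characters and apostrophes. The example string will be printed word by word. See the documentation of the Java Pattern class for more information.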
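If you want the words in a collection rather than printed, the same loop can fill a list. The following is only a minimal sketch under that assumption; the class name RegexWords and the method name extractWords are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexWords {
    // Same pattern as in the answer: runs of word characters and apostrophes.
    private static final Pattern WORD = Pattern.compile("[\\w']+");

    // Hypothetical helper: collects every match into a list instead of printing it.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Well, it's, rather, short]
        System.out.println(extractWords("Well, it's rather short."));
    }
}
```

Note that by default \w in Java covers only [a-zA-Z_0-9], so digits and underscores count as word characters, while accented letters do not unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS.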

The pseudocode looks like this:

    create words, a list of words, by splitting the input by whitespace
    for every word, strip out whitespace and punctuation on the left and the right

The Python code would be something like this:

    words = input.split()
    words = [word.strip(PUNCTUATION) for word in words]

where

    PUNCTUATION = ",. \n\t\\\"'][#*:"

or any other characters you want to remove.

I believe Java has an equivalent function in the String class: String.split().
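A rough Java version of the same split-then-strip idea might look like the sketch below. Java's String has no direct counterpart to Python's strip(chars), so the strip helper here is hand-written; the names SplitWords, strip and extractWords are hypothetical. Unlike the Python snippet, this sketch also drops tokens that become empty after stripping.

```java
import java.util.ArrayList;
import java.util.List;

class SplitWords {
    // Characters to remove from both ends of each token (same set as PUNCTUATION above).
    private static final String PUNCTUATION = ",. \n\t\\\"'][#*:";

    // Hypothetical helper: trims leading and trailing characters found in PUNCTUATION.
    static String strip(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && PUNCTUATION.indexOf(token.charAt(start)) >= 0) {
            start++;
        }
        while (end > start && PUNCTUATION.indexOf(token.charAt(end - 1)) >= 0) {
            end--;
        }
        return token.substring(start, end);
    }

    static List<String> extractWords(String input) {
        List<String> words = new ArrayList<>();
        // Split on runs of whitespace, then strip punctuation from each token.
        for (String token : input.trim().split("\\s+")) {
            String word = strip(token);
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Project, Gutenberg's, Manual, of, Surgery]
        System.out.println(extractWords("Project Gutenberg's \"Manual\", of Surgery."));
    }
}
```

Because only the ends of each token are stripped, apostrophes and hyphens inside a word (as in "Gutenberg's" or "re-use") are kept.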

This is the output of running the code on the text you linked to:

    >>> print words[:100]
    ['Project', "Gutenberg's", 'Manual', 'of', 'Surgery', 'by', 'Alexis', 'Thomson',
     'and', 'Alexander', 'Miles', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of',
     'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no',
     'restrictions', 'whatsoever', 'You', 'may', 'copy', 'it', 'give', 'it', 'away',
     'or', 're-use', 'it', 'under', ...

etc.

Here is a good way to solve the problem: this function takes text as input and returns an array of all the words in the given text.

    private ArrayList get_Words(String SInput){
        StringBuilder stringBuffer = new StringBuilder(SInput);
        ArrayList all_Words_List = new ArrayList();
        String SWord = "";
        for(int i=0; i
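The listing above breaks off in the loop header, so its actual logic is not known. Purely as an illustration of what a character-by-character scan could look like (this is not the answerer's code, and the rule for what counts as a word is an assumption), here is a sketch with a hypothetical ScanWords class:

```java
import java.util.ArrayList;
import java.util.List;

class ScanWords {
    // Assumption: a word is a run of letters; an apostrophe is kept only
    // when it sits between two letters, as in "it's".
    static List<String> getWords(String input) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean apostropheInside = c == '\''
                    && current.length() > 0
                    && i + 1 < input.length()
                    && Character.isLetter(input.charAt(i + 1));
            if (Character.isLetter(c) || apostropheInside) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other character ends the current word.
                words.add(current.toString());
                current.setLength(0);
            }
        }
        // Flush the last word if the input ends in the middle of one.
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Well, it's, quoted, text]
        System.out.println(getWords("Well, it's 'quoted' text."));
    }
}
```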

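Basically, you want to match

    ([A-Za-z])+('([A-Za-z])*)?

right?

You can try out a regular expression with a pattern you have written and count how many times the pattern is found.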
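As a small sketch of that suggestion, the snippet below counts matches of a letters-plus-optional-apostrophe pattern. It assumes the intended pattern is one or more ASCII letters, optionally followed by an apostrophe and more letters; the class name CountWords and the method name countWords are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountWords {
    // Assumed pattern: letters, optionally followed by an apostrophe and more letters.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+('[A-Za-z]*)?");

    // Counts how many times the pattern is found in the text.
    static int countWords(String text) {
        Matcher m = WORD.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // prints 4 (Well, it's, rather, short)
        System.out.println(countWords("Well, it's rather short."));
    }
}
```

With this pattern a hyphenated word such as "re-use" is counted as two words, and digits are ignored entirely, so it is worth checking the count against a short sample before running it on the whole file.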
