Counting the number of distinct words

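I am trying to count the number of distinct words in a text using Java.

A word can be a unigram, bigram, or trigram noun. I have already found all three using the Stanford POS tagger, but I cannot work out how to count the words whose frequency is greater than or equal to one, two, three, four, and five, together with their counts.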

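I may not have understood correctly, but if all you need is the number of distinct words in a given text, then depending on where and how you get the words to count, you can use java.util.Scanner and add each word to an ArrayList. If a word is already in the list, don't add it; the size of the list is then the number of distinct words, as in the following example: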

    import java.util.ArrayList;
    import java.util.Scanner;

    public ArrayList<String> makeWordList() {
        // yourTextFileOrOtherTypeOfInput is a placeholder for your input source
        Scanner scan = new Scanner(yourTextFileOrOtherTypeOfInput);
        ArrayList<String> listOfWords = new ArrayList<>();
        while (scan.hasNext()) {               // read every token, not just the first
            String word = scan.next();         // Scanner splits on whitespace by default
            if (!listOfWords.contains(word)) { // add the word if it isn't added already
                listOfWords.add(word);
            }
        }
        return listOfWords; // return the list you made of distinct words
    }

    public int getDistinctWordCount(ArrayList<String> list) {
        return list.size();
    }

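Now, if you really do have to check the number of characters in a word before adding it to the list, you only need to add a statement that checks the length of the word string before adding it. For example: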

    if (word.length() <= someNumber) {
        // do whatever you need to
    }

Sorry if I misunderstood the question and just gave a lame, irrelevant answer =P, but I hope it helps in some way!

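If you need to track how often you see the same word, even though you only want to count it once, you could create a variable that tracks that frequency and put it into a second list, so that each frequency sits at the same index as its word in the ArrayList and you know which word it belongs to. Better yet, use a HashMap, where the key is the distinct word and the value is its frequency (basically the same code as above, but with a HashMap instead of an ArrayList and a little extra logic to count the frequency):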

    import java.util.HashMap;
    import java.util.Scanner;

    public HashMap<String, Integer> makeWordList() {
        // yourTextFileOrOtherTypeOfInput is a placeholder for your input source
        Scanner scan = new Scanner(yourTextFileOrOtherTypeOfInput);
        HashMap<String, Integer> listOfWords = new HashMap<>();
        while (scan.hasNext()) {
            String word = scan.next(); // Scanner splits on whitespace by default
            if (!listOfWords.containsKey(word)) {
                listOfWords.put(word, 1); // first occurrence of this word
            } else {
                // put() replaces the old value for an existing key, so no remove() is needed
                listOfWords.put(word, listOfWords.get(word) + 1);
            }
        }
        return listOfWords; // return the HashMap you made of distinct words
    }

    public int getDistinctWordCount(HashMap<String, Integer> list) {
        return list.size();
    }

    // get the frequency of the given word (0 if it never occurred)
    public int getFrequencyForWord(String word, HashMap<String, Integer> list) {
        return list.getOrDefault(word, 0);
    }
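Once a frequency map like the one above exists, the thresholds from the question (frequency greater than or equal to one through five) come down to one pass over the map's values per threshold. The following is only a minimal sketch under that assumption; the class name `FrequencyThresholds` and both method names are made up for illustration, and the input is assumed to be plain whitespace-separated text rather than tagger output.

```java
import java.util.HashMap;
import java.util.Map;

class FrequencyThresholds {

    // Builds a word -> frequency map from whitespace-separated text.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                freq.merge(word, 1, Integer::sum); // insert 1, or add 1 to the existing count
            }
        }
        return freq;
    }

    // Counts how many distinct words occur at least minFrequency times.
    static int distinctWithFrequencyAtLeast(Map<String, Integer> freq, int minFrequency) {
        int distinct = 0;
        for (int count : freq.values()) {
            if (count >= minFrequency) {
                distinct++;
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = countWords("a b a c b a");
        for (int k = 1; k <= 5; k++) {
            System.out.println("frequency >= " + k + ": "
                    + distinctWithFrequencyAtLeast(freq, k));
        }
    }
}
```

To list the words themselves together with their counts, the same loop can iterate over `freq.entrySet()` and keep the entries that pass the check.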

You can use a Multiset (from Guava):

Split the string on spaces
Create a new multiset from the result

Something like:

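    import com.google.common.collect.HashMultiset;
    import com.google.common.collect.Multiset;
    import java.util.Arrays;

    String[] words = string.split(" ");
    // each distinct word is stored once, together with its number of occurrences
    Multiset<String> wordCounts = HashMultiset.create(Arrays.asList(words));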
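With the multiset built as above, `wordCounts.elementSet().size()` gives the number of distinct words and `wordCounts.count(word)` gives how often a word occurs. If adding Guava is not an option, roughly the same result can be sketched with the standard library alone. This is a swapped-in technique (`Collectors.groupingBy` with `Collectors.counting`), not the Multiset API, and the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCountsWithStreams {

    // Groups equal words together and counts each group, much like a multiset would.
    static Map<String, Long> wordCounts(String text) {
        return Arrays.stream(text.split(" "))
                .filter(w -> !w.isEmpty()) // split(" ") yields empty strings for repeated spaces
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCounts("to be or not to be");
        System.out.println(counts.size());    // number of distinct words
        System.out.println(counts.get("to")); // occurrences of "to"
    }
}
```

Note that the counts come back as `Long` here, because that is what `Collectors.counting()` produces.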

There can be many solutions to this problem, but the one that helped me is as simple as the following:

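    import java.util.HashSet;
    import java.util.Set;

    public static int countDistinctWords(String str) {
        Set<String> noOWoInString = new HashSet<>(); // a Set ignores duplicates
        String[] words = str.split(" ");
        // noOWoInString.addAll(Arrays.asList(words)); would do the same as this loop
        for (String wrd : words) {
            noOWoInString.add(wrd);
        }
        return noOWoInString.size();
    }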

Thanks, Sagar
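The answers above all split the text on spaces, which would break the bigram and trigram nouns from the question into separate words. One way around that, assuming the tagger step already yields each noun phrase as its own string, is to treat every phrase as a single unit and put it into a set. The sketch below is only an illustration: `DistinctNgrams`, `countDistinct`, and the sample phrases are hypothetical, and lowercasing plus whitespace normalization is an assumption about what should count as the same phrase.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class DistinctNgrams {

    // Counts distinct phrases, treating each list element as one unit.
    // Assumption: case and repeated inner whitespace do not distinguish phrases.
    static int countDistinct(List<String> phrases) {
        Set<String> seen = new HashSet<>();
        for (String phrase : phrases) {
            String normalized = phrase.trim()
                    .toLowerCase(Locale.ROOT)
                    .replaceAll("\\s+", " ");
            if (!normalized.isEmpty()) {
                seen.add(normalized);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Hypothetical tagger output: unigrams, bigrams, and trigrams mixed together.
        List<String> nouns = Arrays.asList(
                "language", "natural language", "natural language processing",
                "Natural  Language", "language");
        System.out.println(countDistinct(nouns));
    }
}
```

If case should matter for your data, drop the `toLowerCase` call; the rest of the method stays the same.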
