Using StringToWordVector in Weka with internal data structures

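I am trying to cluster documents using Weka. The process is part of a larger pipeline, and I really can't afford to write out arff files. I have all the documents and the bag of words for each document in a Map<String, Multiset> structure, where the keys are document names and the Multiset values are the bags of words in those documents. I have two questions, really: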

(1) The current approach ends up clustering terms, not documents:

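    // Assumes Weka 3.6 (FastVector, concrete Instance class) and Guava's Multiset.
    // TermToDocumentFrequencyMap and TermToIndexMap are fields defined elsewhere in the class.
    import java.io.IOException;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;
    import com.google.common.collect.Lists;
    import com.google.common.collect.Multiset;
    import weka.core.Attribute;
    import weka.core.FastVector;
    import weka.core.Instance;
    import weka.core.Instances;

    public final Instances buildDocumentInstances(TreeMap<String, Multiset<String>> docToTermsMap,
                                                  String encoding) throws IOException {
        int dimension = TermToDocumentFrequencyMap.navigableKeySet().size();

        // One numeric attribute per term.
        FastVector attributes = new FastVector(dimension);
        for (String s : TermToDocumentFrequencyMap.navigableKeySet())
            attributes.addElement(new Attribute(s));

        // One instance (row) per document.
        List<Instance> instances = Lists.newArrayList();
        for (Map.Entry<String, Multiset<String>> entry : docToTermsMap.entrySet()) {
            // Start from zeros: new Instance(dimension) would mark every value as missing.
            Instance instance = new Instance(1.0, new double[dimension]);
            for (Multiset.Entry<String> ms_entry : entry.getValue().entrySet()) {
                Integer index = TermToIndexMap.get(ms_entry.getElement());
                if (index != null)
                    switch (encoding) {
                    case "tf":
                        instance.setValue(index, ms_entry.getCount());
                        break;
                    case "binary":
                        instance.setValue(index, ms_entry.getCount() > 0 ? 1 : 0);
                        break;
                    case "tfidf":
                        double tf = ms_entry.getCount();
                        double df = TermToDocumentFrequencyMap.get(ms_entry.getElement());
                        // idf uses the number of documents, not the vocabulary size.
                        double idf = Math.log(docToTermsMap.size() / df);
                        instance.setValue(index, tf * idf);
                        break;
                    }
            }
            instances.add(instance);
        }

        Instances dataset = new Instances("My Dataset Name", attributes, instances.size());
        for (Instance instance : instances)
            dataset.add(instance);
        return dataset;
    }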

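I am trying to create individual Instance objects and then build a dataset by adding them to an Instances object. Each instance is a document vector (with 0/1, tf, or tf-idf encoding), and each word is a separate attribute. But when I run SimpleKMeans#buildClusterer, the output shows that it is clustering the words, not the documents. I am clearly doing something horribly wrong, but I can't figure out what that mistake is.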
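For reference, the intended layout (one row per document, one column per term) can be sketched without Weka. This is only a minimal illustration under stated assumptions: the class name `DocTermMatrix`, the use of `Map<String, Integer>` as a stand-in for `Multiset`, and the sample data are all hypothetical and not part of the original pipeline. Word counts are assumed to be positive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class DocTermMatrix {

    /** Sorted vocabulary across all documents; list position is the column index. */
    static List<String> vocabulary(Map<String, Map<String, Integer>> docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (Map<String, Integer> bag : docs.values()) {
            terms.addAll(bag.keySet());
        }
        return new ArrayList<>(terms);
    }

    /** Returns one row per document (in map iteration order) and one column per term. */
    static double[][] build(Map<String, Map<String, Integer>> docs, String encoding) {
        List<String> vocab = vocabulary(docs);
        int numDocs = docs.size();

        // Document frequency: the number of documents that contain each term.
        int[] df = new int[vocab.size()];
        for (Map<String, Integer> bag : docs.values()) {
            for (int j = 0; j < vocab.size(); j++) {
                if (bag.containsKey(vocab.get(j))) df[j]++;
            }
        }

        double[][] matrix = new double[numDocs][vocab.size()];
        int row = 0;
        for (Map<String, Integer> bag : docs.values()) {
            for (int j = 0; j < vocab.size(); j++) {
                int tf = bag.getOrDefault(vocab.get(j), 0);
                switch (encoding) {
                    case "tf":
                        matrix[row][j] = tf;
                        break;
                    case "binary":
                        matrix[row][j] = tf > 0 ? 1 : 0;
                        break;
                    case "tfidf":
                        // idf = log(number of documents / document frequency)
                        matrix[row][j] = tf * Math.log((double) numDocs / df[j]);
                        break;
                    default:
                        throw new IllegalArgumentException("Unknown encoding: " + encoding);
                }
            }
            row++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two documents, two distinct terms.
        Map<String, Map<String, Integer>> docs = new TreeMap<>();
        Map<String, Integer> a = new TreeMap<>();
        a.put("apple", 2);
        a.put("pear", 1);
        Map<String, Integer> b = new TreeMap<>();
        b.put("apple", 1);
        docs.put("a.txt", a);
        docs.put("b.txt", b);

        double[][] m = build(docs, "tf");
        System.out.println(m.length + " documents x " + m[0].length + " terms");
    }
}
```

If the number of rows in the dataset equals the number of documents and the number of attributes equals the vocabulary size, the clusterer is receiving documents as instances, and the per-term rows in the printed output are more likely centroid coordinates than clustered items.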

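(2) How do I use StringToWordVector in this scenario? Everywhere I have looked, people suggest using weka.filters.unsupervised.attribute.StringToWordVector for clustering documents. However, I cannot find any way to use it when what I already have is a document -> bag-of-words structure. [Note: in my case it is a Map<String, Multiset>, but that is not a strict requirement. I can transform it into some other data structure if StringToWordVector requires it.]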
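Since StringToWordVector converts string attributes into word attributes, one possible route (an assumption, not a confirmed solution) is to flatten each bag back into whitespace-separated text and store that text as the value of a string attribute before applying the filter. The sketch below covers only the flattening step and does not call Weka; the class name `BagToText` and the `Map<String, Integer>` stand-in for `Multiset` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BagToText {

    /** Repeats each word as many times as it occurs, separated by single spaces. */
    static String toText(Map<String, Integer> bag) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : bag.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(e.getKey());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical bag: "weka" occurs twice, "cluster" once.
        Map<String, Integer> bag = new LinkedHashMap<>();
        bag.put("weka", 2);
        bag.put("cluster", 1);
        System.out.println(toText(bag)); // prints "weka weka cluster"
    }
}
```

Word order is lost in the round trip, which should not matter for a bag-of-words model. Words that contain whitespace or other delimiter characters would presumably be split again by the filter's tokenizer.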
