Running the perceptron algorithm on a hash map feature vector: Java

I have the following code, which reads a number of files from a directory into a hash map; this is my feature vector. It is somewhat naive in a sense, but that's not my main concern right now. I want to know how to use this data structure as the input to the perceptron algorithm. I guess we'd call it a bag of words, wouldn't we?

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class BagOfWords {

    static Map<String, Integer> bag_of_words = new HashMap<>();

    public static void main(String[] args) throws IOException {
        String path = "/home/flavius/atheism";
        File file = new File(path);
        new BagOfWords().iterateDirectory(file);
        for (Map.Entry<String, Integer> entry : bag_of_words.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }

    private void iterateDirectory(File file) throws IOException {
        for (File f : file.listFiles()) {
            if (f.isDirectory()) {
                iterateDirectory(f); // recurse into the subdirectory, not the parent
            } else {
                String line;
                BufferedReader br = new BufferedReader(new FileReader(f));
                while ((line = br.readLine()) != null) {
                    String[] words = line.split(" "); // those are your words
                    for (String word : words) {
                        if (!bag_of_words.containsKey(word)) {
                            bag_of_words.put(word, 0);
                        }
                        bag_of_words.put(word, bag_of_words.get(word) + 1);
                    }
                }
                br.close();
            }
        }
    }
}

You can see that the path points into a directory named 'atheism'; there is also a directory named 'sports'. I want to try to linearly separate these two classes of documents, and then try to separate unseen test documents into one of those two categories.

How do I do that? How should I conceptualize it? I'd appreciate a solid reference, a thorough explanation, or some kind of pseudocode.

I haven't found many informative and clear references on the web.

Let's establish some vocabulary up front (I'm guessing you're using the 20 Newsgroups dataset):

  • The "class label" is what you want to predict; in your binary case this is "atheism" vs. everything else
  • The "feature vector" is what you feed into your classifier
  • A "document" is a single e-mail from the dataset
  • A "token" is a small piece of a document, usually a unigram / bigram / trigram
  • The "dictionary" gives you the set of "allowed" words for your vectors

So the bag-of-words vectorization algorithm usually follows these steps (a short code sketch appears after the example below):

  1. Go through all documents (across all class labels) and collect all tokens; this is your dictionary, and its size is the dimensionality of your feature vectors
  2. Go through all documents again, and for each one:
    1. Create a new feature vector with the dimensionality of the dictionary (e.g. 200, for 200 entries in that dictionary)
    2. Go through all tokens in that document and set the word count (within this document) at the corresponding dimension of the feature vector
  3. You now have a list of feature vectors that you can feed into your algorithm

Example:

 Document 1 = ["I", "am", "awesome"] Document 2 = ["I", "am", "great", "great"] 

The dictionary is:

 ["I", "am", "awesome", "great"] 

So the documents, as vectors, look like:

 Document 1 = [1, 1, 1, 0]
 Document 2 = [1, 1, 0, 2]
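Here is a minimal Java sketch of those three steps, producing exactly the two vectors above; the class and variable names are my own, not from the original post, and a LinkedHashSet is assumed so the dictionary keeps insertion order:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class BagOfWordsExample {
    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
                Arrays.asList("I", "am", "awesome"),
                Arrays.asList("I", "am", "great", "great"));

        // step 1: collect every token across all documents into the dictionary;
        // LinkedHashSet keeps insertion order, so dimensions match the example above
        Set<String> dictionary = new LinkedHashSet<>();
        for (List<String> document : documents) {
            dictionary.addAll(document);
        }
        List<String> dictionaryList = new ArrayList<>(dictionary);

        // step 2: one count vector per document, one dimension per dictionary entry
        for (List<String> document : documents) {
            int[] vector = new int[dictionaryList.size()];
            for (String token : document) {
                vector[dictionaryList.indexOf(token)]++;
            }
            System.out.println(Arrays.toString(vector)); // [1, 1, 1, 0] then [1, 1, 0, 2]
        }
    }
}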

With that, you can do all kinds of fancy math and feed it into your perceptron.
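The perceptron itself can stay very small. Below is a minimal sketch of the classic update rule, w = w + learningRate * label * x applied to every misclassified example, assuming labels of +1/-1; the class and field names are my own choices, not from any particular library:

public class Perceptron {

    double[] weights;          // one weight per dictionary entry
    double bias = 0.0;
    double learningRate = 1.0; // illustrative value; tune for your data

    Perceptron(int dimensions) {
        weights = new double[dimensions];
    }

    int predict(int[] x) {
        double activation = bias;
        for (int i = 0; i < x.length; i++) {
            activation += weights[i] * x[i];
        }
        return activation >= 0 ? 1 : -1;
    }

    void train(int[][] vectors, int[] labels, int epochs) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < vectors.length; i++) {
                if (predict(vectors[i]) != labels[i]) {
                    // misclassified: nudge the hyperplane towards this example
                    for (int j = 0; j < weights.length; j++) {
                        weights[j] += learningRate * labels[i] * vectors[i][j];
                    }
                    bias += learningRate * labels[i];
                }
            }
        }
    }
}

For the "atheism" vs. the rest setup, you would label every atheism vector +1 and everything else -1, train for a few epochs, and classify unseen documents with predict.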

This is a full and complete answer to my original question, posted here for the benefit of future visitors.


Given the following documents:

  • atheism/a_0.txt

     Gott ist tot. 
  • politics/p_0.txt

     L'Etat, c'est moi , et aussi moi . 
  • science/s_0.txt

     If I have seen further it is by standing on the shoulders of giants. 
  • sports/s_1.txt

     You miss 100% of the shots you don't take. 
  • Output data structure:

     /data/train/politics/p_0.txt, [0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
     /data/train/science/s_0.txt, [1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
     /data/train/atheism/a_0.txt, [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
     /data/train/sports/s_1.txt, [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1]

The code looks like this, or you can find it on my GitHub page.

 import java.io.BufferedReader;
 import java.io.File;
 import java.io.FileReader;
 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Map;
 import java.util.Set;

 public class FileDictCreateur {

     static String PATH = "/home/matthias/Workbench/SUTD/ISTD_50.570/assignments/practice_data/data/train";

     // the global list of all words across all articles
     static Set<String> GLOBO_DICT = new HashSet<>();

     // is the globo dict full?
     static boolean globo_dict_fixed = false;

     // hash map of all the words contained in individual files
     static Map<File, ArrayList<String>> fileDict = new HashMap<>();

     // input to perceptron. final struc.
     static Map<File, int[]> perceptron_input = new HashMap<>();

     public static void main(String[] args) throws IOException {
         // each of the different categories
         String[] categories = { "/atheism", "/politics", "/science", "/sports" };

         // cycle through all categories once to populate the global dict
         for (int cycle = 0; cycle <= 3; cycle++) {
             String general_data_partition = PATH + categories[cycle];
             File directory = new File(general_data_partition);
             iterateDirectory(directory, globo_dict_fixed);
             if (cycle == 3)
                 globo_dict_fixed = true;
         }

         // cycle through again to populate the file dicts
         for (int cycle = 0; cycle <= 3; cycle++) {
             String general_data_partition = PATH + categories[cycle];
             File directory = new File(general_data_partition);
             iterateDirectory(directory, globo_dict_fixed);
         }

         perceptron_data_struc_generateur(GLOBO_DICT, fileDict, perceptron_input);

         // print the output
         for (Map.Entry<File, int[]> entry : perceptron_input.entrySet()) {
             System.out.println(entry.getKey() + ", " + Arrays.toString(entry.getValue()));
         }
     }

     private static void iterateDirectory(File directory, boolean globo_dict_fixed) throws IOException {
         for (File file : directory.listFiles()) {
             if (file.isDirectory()) {
                 iterateDirectory(file, globo_dict_fixed); // recurse into the subdirectory, not the parent
             } else {
                 String line;
                 BufferedReader br = new BufferedReader(new FileReader(file));
                 while ((line = br.readLine()) != null) {
                     String[] words = line.split(" "); // those are your words
                     if (!globo_dict_fixed) {
                         populate_globo_dict(words);
                     } else {
                         create_file_dict(file, words);
                     }
                 }
                 br.close();
             }
         }
     }

     public static void create_file_dict(File file, String[] words) {
         // append this line's words to the document's running word list
         ArrayList<String> document_words = fileDict.computeIfAbsent(file, k -> new ArrayList<>());
         document_words.addAll(Arrays.asList(words));
     }

     public static void populate_globo_dict(String[] words) {
         for (String word : words) {
             GLOBO_DICT.add(word); // a Set ignores duplicates on its own
         }
     }

     public static void perceptron_data_struc_generateur(Set<String> GLOBO_DICT,
             Map<File, ArrayList<String>> fileDict,
             Map<File, int[]> perceptron_input) {
         // create a new entry in 'perceptron_input' with the file from fileDict as the key
         // and a new array the length of GLOBO_DICT as the value; for every word in the
         // globo dict that appears in the document, increment the index of the array that
         // corresponds to that word by the number of times the word appears in the document

         // so I can get the index later
         List<String> GLOBO_DICT_list = new ArrayList<>(GLOBO_DICT);

         for (Map.Entry<File, ArrayList<String>> entry : fileDict.entrySet()) {
             int[] cross_czech = new int[GLOBO_DICT_list.size()]; // initialized to zero by default
             for (String st : entry.getValue()) {
                 int index = GLOBO_DICT_list.indexOf(st);
                 if (index >= 0) {
                     cross_czech[index]++;
                 }
             }
             perceptron_input.put(entry.getKey(), cross_czech);
         }
     }
 }
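To actually train on perceptron_input, each file key still needs a class label. One simple way (my own suggestion, not part of the code above) is to derive a one-vs-rest label from the parent directory of each file, somewhere after perceptron_data_struc_generateur has run:

 for (Map.Entry<File, int[]> entry : perceptron_input.entrySet()) {
     String category = entry.getKey().getParentFile().getName();
     int label = category.equals("atheism") ? 1 : -1; // "atheism" against everything else
     // hand (entry.getValue(), label) to the perceptron update of your choice
 }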