马尔可夫链:SQL数据库和Java表示

现在这个问题有点模糊。 我有一个基于文本的马尔可夫链,我通过解析用户输入的文本生成。 它用于生成几乎连贯的乱码串,并根据序列中的当前单词存储给定单词作为文本序列中下一个单词的概率。 在javascript中,此对象看起来如下所示:

var text_markov_chain = { "apple" : { "cake" : 0.2, "sauce" : 0.8 }, "transformer" : { "movie" : 0.95, "cat" : 0.025, "dog" : 0.025 } "cat" : { "dog : 0.5, "nap" : 0.5 } // ... } 

因此,例如,如果当前单词是变换器,那么我们生成的下一个单词将有95%的机会成为电影,并且分别有2.5%的机会成为猫或狗。

我的问题有两个:

  • 在Java中表示此对象的最佳方法是什么? 我最关心的是50%的快速访问和50%的内存使用率
  • 如何将此对象存储在单个数据库表(例如MySQL)中?

更新:为了回应@ biziclop的回答,以及@ SanjayTSharma的评论,在我的课程下面我最终写了(这是一项正在进行的工作,麻省理工学院的许可证。它目前只生成一阶Markov链。

 import java.io.IOException; import java.io.InputStream; import java.io.ObjectInputStream; import java.util.Date; import java.util.HashMap; import java.util.HashSet; import java.util.Random; import java.util.Set; import java.util.StringTokenizer; import java.util.TreeMap; public class MarkovChain { HashMap<String, TreeMap> chain; Set known_words; Random rand; /** * Generates a first order Markov Chain from the given text * @param input_text The text to parse */ public MarkovChain(String input_text) { init(input_text, 1); } /** * Generates a nth order Markov Chain from the given text * @param input_text The text to parse * @param n The order of the Markov Chain */ public MarkovChain(String input_text, int n) { init(input_text, n); } /** * Reads a Markov Chain from the given input stream. The object is assumed * to be binary and serialized * @param in The input stream, eg from a network port or file */ public MarkovChain(InputStream in) { try { ObjectInputStream ob_in = new ObjectInputStream(in); chain = (HashMap<String, TreeMap>)ob_in.readObject(); known_words = chain.keySet(); ob_in.close(); in.close(); } catch (IOException e) { //e.printStackTrace(); chain = null; known_words = null; } catch (ClassNotFoundException e) { //e.printStackTrace(); chain = null; known_words = null; } } /** * Returns the next word, according to the Markov Chain probabilities * @param current_word The current generated word */ public String nextWord(String current_word) { if(current_word == null) return nextWord(); // Then head off down the yellow-markov-brick-road TreeMap wordmap = chain.get(current_word); if(wordmap == null) { /* This *shouldn't* happen, but if we get a word that isn't in the * Markov Chain, choose another random one */ return nextWord(); } // Choose the next word based on an RV (Random Variable) float rv = rand.nextFloat(); for(String word : wordmap.keySet()) { float prob = wordmap.get(word); rv -= prob; if(rv <= 0) { return word; } } /* We should never get here - if we do, then the probabilities have * been calculated incorrectly in the Markov Chain */ assert false : "Probabilities in Markov Chain must sum to one!"; return null; } /** * Returns the next word when the current word is unknown, irrelevant or * non existant (at the start of the sequence - randomly picks from known_words */ public String nextWord() { return (String) known_words.toArray()[rand.nextInt(known_words.size())]; } private void init(String input_text, int n) { if(input_text.length() <= 0) return; if(n <= 0) return; chain = new HashMap<String, TreeMap>(); known_words = new HashSet(); rand = new Random(new Date().getTime()); /** Generate the Markov Chain! **/ StringTokenizer st = new StringTokenizer(input_text); while (st.hasMoreTokens()) { String word = st.nextToken(); TreeMap wordmap = new TreeMap(); // First check if the current word has previously been parsed if(known_words.contains(word)) continue; known_words.add(word); // Build the Markov probability table for this word StringTokenizer st_this_word = new StringTokenizer(input_text); String previous = ""; while (st_this_word.hasMoreTokens()) { String next_word = st_this_word.nextToken(); if(previous.equals(word)) { if(wordmap.containsKey(next_word)) { // Increment the number of counts for this word by 1 float num = wordmap.get(next_word); wordmap.put(next_word, num + 1); } else { wordmap.put(next_word, 1.0f); } } previous = next_word; } // End while (st_this_word.hasMoreTokens()) /* The wordmap now contains a map of words and the number of occurrences they have. * We need to convert this to the probability of getting that word by dividing * by the total number of words there were */ int total_number_of_words = wordmap.values().size(); for(String k : wordmap.keySet()) { int num_occurances = wordmap.get(k).intValue(); wordmap.put(k, 1.0f*num_occurances/total_number_of_words); } // Finally, we are ready to add this word and wordmap to the Markov chain chain.put(word, wordmap); } // End while (st.hasMoreTokens()) // The (first order) Markov Chain has now been built! } } 

通过将其存储在Java中,我猜你想要以一种易于生成序列的方式存储它。

首先,您需要一个hashmap,其中的单词是键。 此hashmap的值将是一个树形图,其中键是累积概率,值是下一个字。

所以它将是这样的:

  HashMap> words = new HashMap>(); TreeMap appleMap = new TreeMap(); appleMap.put( 0.2d, "cake"); appleMap.put( 1.0d, "sauce"); words.put( "apple", appleMap ); TreeMap transformerMap = new TreeMap(); transformerMap.put( 0.95d, "movie"); transformerMap.put( 0.975d, "cat"); transformerMap.put( 1.0d, "dog"); words.put( "transformer", transformerMap ); 

从这个结构中生成下一个单词非常容易。

 private String generateNextWord( HashMap> words, String currentWord ) { TreeMap probMap = words.get( currentWord ); double d = Math.random(); return probMap.ceilingEntry( d ).getValue(); } 

在关系数据库中,您可以只使用一个包含三列的表:当前单词,下一个单词和权重。 所以你基本上存储马尔可夫链的状态转移图的边缘

你也可以将它标准化为两个表:一个顶点表来存储单词id的单词,一个边表存储当前单词id,下一个单词id和weight,但除非你想用你的单词存储额外的字段,否则我不会认为这是必要的。