A more efficient way to find the most common character in a string

I wrote a method to find the most common character in a string:

    public static char getMax(String s) {
        char maxappearchar = ' ';
        int counter = 0;
        int[] charcnt = new int[Character.MAX_VALUE + 1];

        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            // Increment this character's count and compare it to the current max.
            charcnt[ch]++;
            if (charcnt[ch] >= counter) {
                counter = charcnt[ch];
                maxappearchar = ch;
            }
        }
        System.out.println("the max char is " + maxappearchar + " and displayed " + counter + " times");
        return maxappearchar;
    }

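I am asking about different solutions for it:

Solution 1: the fastest code (is that the code I attached?)

Solution 2: the most memory-efficient code, using fewer arrays and variables

I also wrote my method using a HashMap. Is that better suited to solution 2? If so, why? What are the pros and cons?

Can the attached code be described in big-O terms (O(n^...), O(log n), ...)? If so, why?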

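The fastest way to do this is to count the occurrences of each character, then take the max value in the count array. If your string is long, you'll get a decent speedup from not tracking the current max while looping over the characters in the string.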
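The count-then-scan idea could look roughly like the sketch below. The class and method names are made up for illustration, and breaking ties in favor of the lowest char value is an assumption, not something the question specifies.

```java
final class TwoPassMostCommon {
    /** Returns the most frequent UTF-16 char in s; ties go to the lowest char value (an assumption). */
    static char mostCommonChar(String s) {
        if (s == null || s.isEmpty()) {
            throw new IllegalArgumentException("input must be non-empty");
        }
        int[] counts = new int[Character.MAX_VALUE + 1];

        // Pass 1: count only, with no max tracking inside the hot loop.
        for (int i = 0, n = s.length(); i < n; i++) {
            counts[s.charAt(i)]++;
        }

        // Pass 2: scan the count array once for the largest entry.
        int best = 0;
        for (int c = 1; c < counts.length; c++) {
            if (counts[c] > counts[best]) {
                best = c;
            }
        }
        return (char) best;
    }
}
```

The second pass always reads all 65536 counters, so this only pays off when the string is much longer than the array.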

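See "How to count frequency of characters in a string?" for many other ideas about how to count frequencies.

If your strings are mostly ASCII, a branch in the count loop that chooses between an array for the low 128 char values and a HashMap for the rest should be worth it. The branch will predict well if your strings have no non-ASCII characters. If there is a lot of alternating between ASCII and non-ASCII, the branch might hurt a bit compared to using a HashMap for everything.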

    // Requires: import java.util.HashMap;
    public static char getMax(String s) {
        char maxappearchar = ' ';
        int counter = 0;
        int[] ascii_count = new int[128];  // fast path for ASCII
        HashMap<Character, Integer> nonascii_count = new HashMap<Character, Integer>();

        for (int i = 0; i < s.length(); i++) {
            // This does appear to be the recommended way to iterate over a String.
            // Alternatively, iterate over 32-bit Unicode code points, not UTF-16 chars, if that matters.
            char ch = s.charAt(i);
            if (ch < 128) {
                ascii_count[ch]++;
            } else {
                // some code to set or increment nonascii_count[ch]
            }
        }

        // loop over ascii_count and find the highest element
        // loop over the keys in nonascii_count, and see if any of them are even higher
        return maxappearchar;
    }

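I didn't flesh out the code because I don't do much Java, so I don't know whether there is a container that can do the insert-1-or-increment operation more efficiently than a HashMap get and put pair. https://stackoverflow.com/a/6712620/224132 suggests Guava's MultiSet, which looks good.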
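One way the skeleton could be filled in, assuming Java 8 or later, is to use `Map.merge` for the insert-1-or-increment step. This is only a sketch: the class name is hypothetical, and Guava's MultiSet is deliberately not used so that the example has no dependencies.

```java
import java.util.HashMap;
import java.util.Map;

final class AsciiFastPathCounter {
    /** Returns the most frequent char, or ' ' for an empty string (mirrors the skeleton's default). */
    static char getMax(String s) {
        int[] asciiCount = new int[128];                       // fast path for ASCII
        Map<Character, Integer> nonAsciiCount = new HashMap<>();

        for (int i = 0, n = s.length(); i < n; i++) {
            char ch = s.charAt(i);
            if (ch < 128) {
                asciiCount[ch]++;
            } else {
                // Inserts 1 if the key is absent, otherwise adds 1 to the stored count.
                nonAsciiCount.merge(ch, 1, Integer::sum);
            }
        }

        char maxChar = ' ';
        int maxCount = 0;
        // Find the highest element among the ASCII counters.
        for (int c = 0; c < asciiCount.length; c++) {
            if (asciiCount[c] > maxCount) {
                maxCount = asciiCount[c];
                maxChar = (char) c;
            }
        }
        // Check whether any non-ASCII character is even more frequent.
        for (Map.Entry<Character, Integer> e : nonAsciiCount.entrySet()) {
            if (e.getValue() > maxCount) {
                maxCount = e.getValue();
                maxChar = e.getKey();
            }
        }
        return maxChar;
    }
}
```

`merge` still boxes each count into an `Integer`, so it is not necessarily faster than a get and put pair; it mainly shortens the code.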

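This may do better than your array of 2^16 ints. However, if you only ever touch the low 128 elements of that array, most of its memory may never be touched. Memory that is allocated but untouched doesn't really hurt, or use up RAM / swap.

However, looping over all 65536 entries at the end means at least reading the whole array, so the OS has to soft page-fault it in and wire it up. It will also pollute the caches. So updating the max on every character might actually be the better choice. Microbenchmarks might show that iterating over the String and then looping over charcnt[Character.MAX_VALUE] wins, but they wouldn't account for the cache / TLB pollution caused by touching that much memory that isn't really needed.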

It is a fast algorithm that uses a lot of space.

It does not cover full Unicode: there are code points (Unicode characters, ints) that need two chars.
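To make the two-chars point concrete, here is a small sketch using U+1F600 as an arbitrarily chosen example of a code point outside the Basic Multilingual Plane. The class and method names are illustrative only.

```java
final class SurrogatePairDemo {
    // U+1F600 is stored in UTF-16 as the surrogate pair D83D DE00.
    static final String GRINNING_FACE = "\uD83D\uDE00";

    /** Number of UTF-16 code units, which is what charAt-based counting iterates over. */
    static int utf16Length(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(utf16Length(GRINNING_FACE));      // 2
        System.out.println(codePointLength(GRINNING_FACE));  // 1
    }
}
```

A counter indexed by `char` would tally the two surrogate halves separately and could return half of a character as the result.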

Small optimizations are still possible:

Adding extra versions that use byte[] and short[], depending on s.length().

Keeping length() in a variable:

for (int i = 0, n = s.length(); i < n; i++)
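The two small optimizations above might be combined as in the following sketch. It assumes that no count can exceed the string length, so a byte or short counter is wide enough whenever the length fits in that type. All names are hypothetical.

```java
final class NarrowCounters {
    /** Picks the narrowest counter type that cannot overflow for this string length. */
    static char mostCommonChar(String s) {
        int n = s.length();
        if (n <= Byte.MAX_VALUE) {
            return withByteCounts(s);
        }
        if (n <= Short.MAX_VALUE) {
            return withShortCounts(s);
        }
        return withIntCounts(s);
    }

    private static char withByteCounts(String s) {
        byte[] counts = new byte[Character.MAX_VALUE + 1];
        char best = ' ';
        int max = 0;
        for (int i = 0, n = s.length(); i < n; i++) {  // length() read once
            char ch = s.charAt(i);
            if (++counts[ch] > max) {
                max = counts[ch];
                best = ch;
            }
        }
        return best;
    }

    private static char withShortCounts(String s) {
        short[] counts = new short[Character.MAX_VALUE + 1];
        char best = ' ';
        int max = 0;
        for (int i = 0, n = s.length(); i < n; i++) {
            char ch = s.charAt(i);
            if (++counts[ch] > max) {
                max = counts[ch];
                best = ch;
            }
        }
        return best;
    }

    private static char withIntCounts(String s) {
        int[] counts = new int[Character.MAX_VALUE + 1];
        char best = ' ';
        int max = 0;
        for (int i = 0, n = s.length(); i < n; i++) {
            char ch = s.charAt(i);
            if (++counts[ch] > max) {
                max = counts[ch];
                best = ch;
            }
        }
        return best;
    }
}
```

The byte[] version needs 64 KiB instead of 256 KiB for the int[] table, at the cost of three nearly identical loops.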

Yes, a HashMap is probably the most "sensible" solution.

Now with Java 8, you could turn to parallelism and use multiple cores. It is not worth the effort.

int mostFrequentCodePoint = s.codePoints() ...
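The line above leaves the rest of the stream pipeline elided. One possible completion, which is a guess at the intent rather than the answer's actual code, groups the code points and picks the entry with the highest count. The class and method names are illustrative.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class CodePointFrequency {
    /** Returns the most frequent Unicode code point in s; which one wins a tie is unspecified. */
    static int mostFrequentCodePoint(String s) {
        return s.codePoints()
                .boxed()
                // Build a map from code point to number of occurrences.
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet().stream()
                // Pick the entry with the largest count.
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalArgumentException("input must be non-empty"));
    }
}
```

Because it works on code points, this version handles characters that need two chars. Adding `.parallel()` after `codePoints()` would spread the counting across cores, with the caveat about effort noted above.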

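For frequency analysis of natural language, limiting the string's length to about 1000 may be enough.

A full implementation that uses the array-based solution above and returns a SimpleEntry: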

    // Requires: import java.util.AbstractMap; import java.util.Map;
    public static Map.Entry<Character, Integer> getMostCommonChar(String phrase) {
        if (phrase == null || phrase.isEmpty()) {
            throw new IllegalArgumentException("input phrase must have non-empty value.");
        }

        char maxchar = ' ';
        int counter = 0;
        // One slot per possible char value; the + 1 makes Character.MAX_VALUE itself a valid index.
        int[] ascii_count = new int[Character.MAX_VALUE + 1];

        for (int i = 0; i < phrase.length(); i++) {
            char ch = phrase.charAt(i);  // This does appear to be the recommended way to iterate over a String
            // Post-increment: compare the old count against the max, then record the new count.
            if (ascii_count[ch]++ >= counter) {
                counter = ascii_count[ch];
                maxchar = ch;
            }
        }

        Map.Entry<Character, Integer> e = new AbstractMap.SimpleEntry<>(maxchar, counter);
        System.out.println(e.getKey());
        System.out.println(e.getValue());
        return e;
    }
