Truncating Strings by Bytes

I wrote the following to truncate a string in Java to a new string with a given number of bytes.

    String truncatedValue = "";
    String currentValue = string;
    int pivotIndex = (int) Math.round(((double) string.length()) / 2);
    while (!truncatedValue.equals(currentValue)) {
        currentValue = string.substring(0, pivotIndex);
        byte[] bytes = null;
        bytes = currentValue.getBytes(encoding);
        if (bytes == null) {
            return string;
        }
        int byteLength = bytes.length;
        int newIndex = (int) Math.round(((double) pivotIndex) / 2);
        if (byteLength > maxBytesLength) {
            pivotIndex = newIndex;
        } else if (byteLength < maxBytesLength) {
            pivotIndex = pivotIndex + 1;
        } else {
            truncatedValue = currentValue;
        }
    }
    return truncatedValue;

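This was the first thing that came to mind, and I know I could improve on it. I saw another post asking a similar question, but the answers there truncated the string by working on the bytes instead of using String.substring. I think I would rather use String.substring in my case.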

EDIT: I just removed the UTF8 reference, because I would rather be able to do this for different storage types as well.

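Why not convert to bytes and walk forward, obeying UTF8 character boundaries as you go, until you have reached the maximum number, then convert those bytes back into a string?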
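For illustration, the walk-forward idea could look roughly like the sketch below. The class name `Utf8PrefixByBytes` and the method `truncate` are made up for this example, and it assumes the target encoding is UTF-8 and that the bytes come straight from `String.getBytes`, so they are well formed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from any answer in this thread.
final class Utf8PrefixByBytes {

    /** Returns the longest prefix of s whose UTF-8 encoding fits in maxBytes. */
    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= maxBytes) {
            return s;
        }
        int end = 0; // bytes kept so far, always on a character boundary
        while (end < utf8.length) {
            int lead = utf8[end] & 0xFF;
            int len;
            if (lead < 0x80) {
                len = 1;            // 0xxxxxxx: ASCII
            } else if (lead < 0xE0) {
                len = 2;            // 110xxxxx
            } else if (lead < 0xF0) {
                len = 3;            // 1110xxxx
            } else {
                len = 4;            // 11110xxx
            }
            if (end + len > maxBytes) {
                break;              // the next character would not fit
            }
            end += len;
        }
        // Only whole characters were kept, so decoding cannot fail.
        return new String(utf8, 0, end, StandardCharsets.UTF_8);
    }
}
```

With a limit of 10, `truncate("123456789á1", 10)` keeps only `"123456789"`, because `á` takes two bytes and would end at byte 11.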

Or you could just cut the original string, if you keep track of where the cut should occur:

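    import java.nio.charset.StandardCharsets;

    // Assuming that Java will always produce valid UTF8 from a string, so no error checking!
    // (Is this always true, I wonder?)
    public class UTF8Cutter {
        public static String cut(String s, int n) {
            // Encode explicitly as UTF-8; the bit tests below rely on it,
            // and a bare getBytes() would use the platform default charset.
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            if (utf8.length < n) n = utf8.length;
            int n16 = 0;     // number of UTF-16 chars to keep
            int advance = 1; // UTF-16 chars used by the current character
            int i = 0;       // byte index into utf8
            while (i < n) {
                advance = 1;
                if ((utf8[i] & 0x80) == 0) i += 1;
                else if ((utf8[i] & 0xE0) == 0xC0) i += 2;
                else if ((utf8[i] & 0xF0) == 0xE0) i += 3;
                else { i += 4; advance = 2; } // 4 bytes in UTF-8 = surrogate pair in UTF-16
                if (i <= n) n16 += advance;
            }
            return s.substring(0, n16);
        }
    }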

(Note: edited to fix bugs on 2014-08-25.)

A more sane solution is to use a decoder:

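    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    final Charset CHARSET = Charset.forName("UTF-8"); // or any other charset
    final byte[] bytes = inputString.getBytes(CHARSET);
    final CharsetDecoder decoder = CHARSET.newDecoder();
    // Drop a character that was cut in half at the limit instead of failing.
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.reset();
    // limit is the byte budget; it must not exceed bytes.length.
    final CharBuffer decoded = decoder.decode(ByteBuffer.wrap(bytes, 0, limit));
    final String outputString = decoded.toString();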

I think Rex Kerr's solution has 2 bugs.

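First, it will truncate to limit + 1 if a non-ASCII character sits right before the limit. Truncating "123456789á1" will result in "123456789á", which takes 11 bytes in UTF-8.
Second, I think he misinterpreted the UTF standard. https://en.wikipedia.org/wiki/UTF-8#Description shows that 110xxxxx at the beginning of a UTF sequence tells us that the representation is 2 bytes long (not 3). That is the reason his implementation usually does not use up all the available space (as Nissim Avitan noted).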
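To make the two points above concrete, here is a small sketch that maps a lead byte to its sequence length, following the bit patterns in the Wikipedia table linked above, and measures the encoded size of a string. `Utf8LeadByte`, `sequenceLength` and `utf8Length` are names invented for this illustration only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for checking the claims above; not part of any answer.
final class Utf8LeadByte {

    /** Length in bytes of the UTF-8 sequence introduced by the given lead byte. */
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        throw new IllegalArgumentException(
                "not a UTF-8 lead byte: 0x" + Integer.toHexString(b));
    }

    /** Number of bytes s occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

The character `á` is encoded as the bytes 0xC3 0xA1, so `sequenceLength((byte) 0xC3)` is 2, and `utf8Length("123456789á")` is 11, which is one byte over a limit of 10.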

Please find my corrected version below:

    public String cut(String s, int charLimit) throws UnsupportedEncodingException {
        byte[] utf8 = s.getBytes("UTF-8");
        if (utf8.length <= charLimit) {
            return s;
        }
        int n16 = 0;
        boolean extraLong = false;
        int i = 0;
        while (i < charLimit) {
            // Unicode characters above U+FFFF need 2 words in utf16
            extraLong = ((utf8[i] & 0xF0) == 0xF0);
            if ((utf8[i] & 0x80) == 0) {
                i += 1;
            } else {
                int b = utf8[i];
                while ((b & 0x80) > 0) {
                    ++i;
                    b = b << 1;
                }
            }
            if (i <= charLimit) {
                n16 += (extraLong) ? 2 : 1;
            }
        }
        return s.substring(0, n16);
    }

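I still thought this was far from efficient. So if you do not really need the String representation of the result and a byte array will do, you can use this: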

    // needs java.util.Arrays
    private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
        byte[] utf8 = s.getBytes("UTF-8");
        if (utf8.length <= charLimit) {
            return utf8;
        }
        if ((utf8[charLimit] & 0xC0) != 0x80) {
            // the byte at the limit is not a continuation byte (10xxxxxx),
            // so the limit doesn't cut an UTF-8 sequence
            return Arrays.copyOf(utf8, charLimit);
        }
        int i = 0;
        // walk back over the continuation bytes before the limit
        while ((utf8[charLimit-i-1] & 0x80) > 0 && (utf8[charLimit-i-1] & 0x40) == 0) {
            ++i;
        }
        if ((utf8[charLimit-i-1] & 0x80) > 0) {
            // we have to skip the starter UTF-8 byte
            return Arrays.copyOf(utf8, charLimit-i-1);
        } else {
            // we passed all UTF-8 bytes
            return Arrays.copyOf(utf8, charLimit-i);
        }
    }

Funnily enough, with a realistic limit of 20-500 bytes the two perform almost the same if you create a string from the byte array again.

Please note that both methods assume valid UTF-8 input, which is a valid assumption after using Java's getBytes() function.

Use the UTF-8 CharsetEncoder, and encode until the output ByteBuffer contains as many bytes as you are willing to take, by looking for CoderResult.OVERFLOW.
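A minimal sketch of that encoder idea follows. The class `EncoderTruncate` and its `truncate` method are hypothetical names, and the charset is passed in as a parameter so the same code can serve other encodings. It assumes a stateless charset such as UTF-8 or ISO-8859-1 (nothing is flushed) and a non-negative `maxBytes`.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper illustrating the CharsetEncoder approach.
final class EncoderTruncate {

    /** Returns the longest prefix of s that encodes to at most maxBytes bytes. */
    static String truncate(String s, Charset charset, int maxBytes) {
        CharsetEncoder encoder = charset.newEncoder()
                // Replace instead of reporting errors, so OVERFLOW is the
                // only way encoding can stop early.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes); // the byte budget
        CharBuffer in = CharBuffer.wrap(s);
        CoderResult result = encoder.encode(in, out, true);
        if (result.isOverflow()) {
            // The next character did not fit; in.position() is the number
            // of chars that were fully encoded before the buffer ran out.
            return s.substring(0, in.position());
        }
        return s; // everything fit
    }
}
```

For example, `truncate("123456789á1", StandardCharsets.UTF_8, 10)` gives `"123456789"`, while the same call with `StandardCharsets.ISO_8859_1` keeps `"123456789á"`, since `á` is a single byte in that encoding.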

The second approach described here works well: http://www.jroller.com/holy/entry/truncating_utf_string_to_the

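As noted, Peter Lawrey's solution has a major performance disadvantage (about 3,500 ms for 10,000 runs). Rex Kerr's was much better (about 500 ms for 10,000 runs), but the result was not accurate: it cut much more than needed (in some examples it left 3,500 bytes instead of 4,000). Attached here is my solution (about 250 ms for 10,000 runs), assuming that the maximum length of a UTF-8 char in bytes is 4 (thanks, WikiPedia):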

    public static String cutWord(String word, int dbLimit) throws UnsupportedEncodingException {
        double MAX_UTF8_CHAR_LENGTH = 4.0;
        if (word.length() > dbLimit) {
            word = word.substring(0, dbLimit);
        }
        if (word.length() > dbLimit / MAX_UTF8_CHAR_LENGTH) {
            int residual = word.getBytes("UTF-8").length - dbLimit;
            if (residual > 0) {
                int tempResidual = residual, start, end = word.length();
                while (tempResidual > 0) {
                    start = end - ((int) Math.ceil((double) tempResidual / MAX_UTF8_CHAR_LENGTH));
                    tempResidual = tempResidual - word.substring(start, end).getBytes("UTF-8").length;
                    end = start;
                }
                word = word.substring(0, end);
            }
        }
        return word;
    }

You could convert the String to bytes and convert just those bytes back to a string.

    public static String substring(String text, int maxBytes) {
        StringBuilder ret = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            // works out how many bytes a character takes,
            // and removes these from the total allowed.
            // (getBytes() with no argument uses the platform default charset.)
            if ((maxBytes -= text.substring(i, i + 1).getBytes().length) < 0)
                break;
            ret.append(text.charAt(i));
        }
        return ret.toString();
    }

    s = new String(s.getBytes("UTF-8"), 0, MAX_LENGTH - 2, "UTF-8");

By using the regular expression below you can also remove leading and trailing whitespace made of double-byte characters.

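    // Assumption: the second character in each class was originally the full-width
    // (ideographic) space U+3000, which \s does not match by default.
    stringtoConvert = stringtoConvert.replaceAll("^[\\s\\u3000]*", "").replaceAll("[\\s\\u3000]*$", "");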

Here is mine:

    private static final int FIELD_MAX = 2000;
    private static final Charset CHARSET = Charset.forName("UTF-8");

    public String trancStatus(String status) {
        if (status == null || status.getBytes(CHARSET).length <= FIELD_MAX) {
            return status;
        }
        // Binary search for the longest prefix that encodes to at most FIELD_MAX bytes.
        int left = 0, right = status.length();
        while (left < right) {
            int index = left + (right - left + 1) / 2;
            int bytes = status.substring(0, index).getBytes(CHARSET).length;
            if (bytes <= FIELD_MAX) {
                left = index;      // this prefix fits, try a longer one
            } else {
                right = index - 1; // too long, try a shorter one
            }
        }
        // Do not leave half of a surrogate pair at the end.
        if (left > 0 && Character.isHighSurrogate(status.charAt(left - 1))) {
            left--;
        }
        return status.substring(0, left);
    }

This one may not be the most efficient solution, but it works:

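    public static String substring(String s, int byteLimit) {
        if (s.getBytes().length <= byteLimit) {
            return s;
        }
        // Every char takes at least one byte, so start from byteLimit chars
        // (or one char less than the whole string) and shrink from there.
        int n = Math.min(byteLimit, s.length() - 1);
        do {
            s = s.substring(0, n--);
        } while (s.getBytes().length > byteLimit);
        return s;
    }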

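I have improved on Peter Lawrey's solution so that it handles surrogate pairs accurately. In addition, I optimized it based on the fact that the maximum number of bytes per char in UTF-8 encoding is 3.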

    public static String substring(String text, int maxBytes) {
        for (int i = 0, len = text.length(); (len - i) * 3 > maxBytes;) {
            int j = text.offsetByCodePoints(i, 1);
            if ((maxBytes -= text.substring(i, j).getBytes(StandardCharsets.UTF_8).length) < 0)
                return text.substring(0, i);
            i = j;
        }
        return text;
    }
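The factor of 3 may look at odds with the 4-byte maximum mentioned in an earlier answer. Both hold: a code point needs at most 4 bytes in UTF-8, but a 4-byte code point occupies two Java chars (a surrogate pair), so the cost per char never exceeds 3. The sketch below checks this; `Utf8Widths` and `utf8Bytes` are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes codePoint is a valid, non-surrogate code point.
final class Utf8Widths {

    /** Number of UTF-8 bytes needed to encode a single code point. */
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For instance, `utf8Bytes(0x20AC)` (the euro sign) is 3 for one char, while `utf8Bytes(0x1F600)` is 4 for a code point where `Character.charCount(0x1F600)` is 2, which is 2 bytes per char.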
