How do I truncate a Java string to fit in a given number of bytes once UTF-8 encoded?

How can I truncate a Java String so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?

Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when the limit is exceeded:

    public static String truncateWhenUTF8(String s, int maxBytes) {
        int b = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);

            // ranges from http://en.wikipedia.org/wiki/UTF-8
            int skip = 0;
            int more;
            if (c <= 0x007f) {
                more = 1;
            } else if (c <= 0x07FF) {
                more = 2;
            } else if (c <= 0xd7ff) {
                more = 3;
            } else if (c <= 0xDFFF) {
                // surrogate area, consume next char as well
                more = 4;
                skip = 1;
            } else {
                more = 3;
            }

            if (b + more > maxBytes) {
                return s.substring(0, i);
            }
            b += more;
            i += skip;
        }
        return s;
    }

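This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() will return the longest truncated string it can. If you ignore surrogate pairs in the implementation, the truncated strings may be shorter than they need to be.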
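The 4-byte claim is easy to check against the JDK itself. The following is only a small sketch; the class name SurrogateDemo and the helper utf8Length are made up for illustration and are not part of the answer's code.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // Number of bytes the JDK's UTF-8 encoder produces for s.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // MUSICAL SYMBOL G CLEF, code point 1D11E: one code point, two Java chars.
        String clef = "\uD834\uDD1E";
        System.out.println("chars:       " + clef.length());
        System.out.println("code points: " + clef.codePointCount(0, clef.length()));
        System.out.println("UTF-8 bytes: " + utf8Length(clef));
    }
}
```

A counter that treated each of the two chars as an independent 3-byte character would budget 6 bytes for this string instead of 4, which is why it would sometimes truncate earlier than necessary.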

I haven't tested this code much, but here are some preliminary tests:

    import java.nio.charset.Charset;

    private static void test(String s, int maxBytes, int expectedBytes) {
        String result = truncateWhenUTF8(s, maxBytes);
        byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
        if (utf8.length > maxBytes) {
            System.out.println("BAD: our truncation of " + s + " was too big");
        }
        if (utf8.length != expectedBytes) {
            System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
        }
        System.out.println(s + " truncated to " + result);
    }

    public static void main(String[] args) {
        test("abcd", 0, 0);
        test("abcd", 1, 1);
        test("abcd", 2, 2);
        test("abcd", 3, 3);
        test("abcd", 4, 4);
        test("abcd", 5, 4);

        test("a\u0080b", 0, 0);
        test("a\u0080b", 1, 1);
        test("a\u0080b", 2, 1);
        test("a\u0080b", 3, 3);
        test("a\u0080b", 4, 4);
        test("a\u0080b", 5, 4);

        test("a\u0800b", 0, 0);
        test("a\u0800b", 1, 1);
        test("a\u0800b", 2, 1);
        test("a\u0800b", 3, 1);
        test("a\u0800b", 4, 4);
        test("a\u0800b", 5, 5);
        test("a\u0800b", 6, 5);

        // surrogate pairs
        test("\uD834\uDD1E", 0, 0);
        test("\uD834\uDD1E", 1, 0);
        test("\uD834\uDD1E", 2, 0);
        test("\uD834\uDD1E", 3, 0);
        test("\uD834\uDD1E", 4, 4);
        test("\uD834\uDD1E", 5, 4);
    }

Updated with a revised code sample; it now handles surrogate pairs.

You should use CharsetEncoder. The simple approach of calling getBytes() and copying as many bytes as will fit can cut a UTF-8 character in half.

Something like this:

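    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;

    // Encodes as many whole characters of input as fit into output
    // and returns the number of bytes written.
    public static int truncateUtf8(String input, byte[] output) {
        ByteBuffer outBuf = ByteBuffer.wrap(output);
        CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

        Charset utf8 = Charset.forName("UTF-8");
        // The encoder stops at the last character that fits completely.
        utf8.newEncoder().encode(inBuf, outBuf, true);
        System.out.println("encoded " + inBuf.position() + " chars of " + input.length()
                + ", result: " + outBuf.position() + " bytes");
        return outBuf.position();
    }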
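The snippet above returns a byte count. If a truncated String is what you need, the input buffer's position after encoding gives the number of chars that were consumed. The following is a minimal sketch of that variant; the class and method names (EncoderTruncate, truncateWithEncoder) are hypothetical, and replacing malformed input such as an unpaired surrogate is an assumption made here, not something the original answer specifies.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncoderTruncate {
    // Returns the longest prefix of input whose UTF-8 encoding fits in maxBytes.
    static String truncateWithEncoder(String input, int maxBytes) {
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(input);
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                // Assumption: substitute a replacement byte for unpaired surrogates.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // Encoding stops with OVERFLOW before any character that does not fit whole.
        encoder.encode(in, out, true);
        // in.position() is the index of the first char that was not encoded.
        return input.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(truncateWithEncoder("a\u0800b", 3));
        System.out.println(truncateWithEncoder("a\u0800b", 4));
    }
}
```

Because the encoder only ever writes complete characters, no separate surrogate handling is needed in this version.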

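Here is what I came up with. It uses standard Java APIs, so it should be safe and compatible with all the Unicode oddities, surrogate pairs, and so on. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the, with a null check added and with decoding avoided when the string already fits within maxBytes.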

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    /**
     * Truncates a string to the number of characters that fit in X bytes avoiding multi byte characters being cut in
     * half at the cut off point. Also handles surrogate pairs where 2 characters in the string is actually one literal
     * character.
     *
     * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
     */
    public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
        if (s == null) {
            return null;
        }
        Charset charset = Charset.forName("UTF-8");
        CharsetDecoder decoder = charset.newDecoder();
        byte[] sba = s.getBytes(charset);
        if (sba.length <= maxBytes) {
            return s;
        }
        // Ensure truncation by having byte buffer = maxBytes
        ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
        CharBuffer cb = CharBuffer.allocate(maxBytes);
        // Ignore an incomplete character
        decoder.onMalformedInput(CodingErrorAction.IGNORE);
        decoder.decode(bb, cb, true);
        decoder.flush(cb);
        return new String(cb.array(), 0, cb.position());
    }

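UTF-8 encoding has a neat trait that lets you tell where you are within a sequence of bytes.

Check the byte in the stream at the character limit you want.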

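If its high bit is 0, it is a single-byte character; just replace it with 0 and you're done.
If its high bit is 1 and so is the next bit, you are at the start of a multi-byte character, so just set that byte to 0 and you're done.
If the high bit is 1 but the next bit is 0, you are in the middle of a character; walk back along the buffer until you hit a byte whose high bits contain two or more 1s, and replace that byte with 0.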

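Example: if your stream is 31 33 31 C1 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, because that would put the 0 right after C1, which is the start of a multi-byte char.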
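In Java there is no terminating 0 to write, but the same bit test can be used to find a safe cut position in an encoded byte array. The following is a rough sketch under the assumption that the bytes are well-formed UTF-8, as produced by String.getBytes; the names Utf8Boundary, safeCut, and truncate are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Boundary {
    // Largest cut position <= maxBytes that does not split a multi-byte sequence.
    static int safeCut(byte[] utf8, int maxBytes) {
        if (maxBytes >= utf8.length) {
            return utf8.length;
        }
        if (maxBytes <= 0) {
            return 0;
        }
        int cut = maxBytes;
        // utf8[cut] is the first byte that would be dropped. A continuation
        // byte has the bit pattern 10xxxxxx, so step back to its lead byte.
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, safeCut(utf8, maxBytes), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] stream = {0x31, 0x33, 0x31, (byte) 0xC1, (byte) 0xA3, 0x32, 0x33};
        // A requested length of 4 falls between C1 and A3, so the cut moves back.
        System.out.println(safeCut(stream, 4));
    }
}
```

This costs one full encoding pass plus a walk back of at most three bytes, since a UTF-8 sequence is never longer than four bytes.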

You can calculate the number of bytes without doing any conversion.

    foreach character in the Java string
        if 0 <= character <= 0x7f
            count += 1
        else if 0x80 <= character <= 0x7ff
            count += 2
        else if 0x800 <= character <= 0xd7ff // excluding the surrogate area
            count += 3
        else if 0xdc00 <= character <= 0xffff
            count += 3
        else { // surrogate, a bit more complicated
            count += 4
            skip one extra character in the input stream
        }

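You have to detect surrogate pairs (U+D800-U+DBFF and U+DC00-U+DFFF) and count 4 bytes for each valid pair. If you get a first value from the first range and a second value from the second range, all is well: skip them both and add 4. If not, it is an invalid surrogate pair. I'm not sure how Java handles that, but your algorithm has to count correctly in that (unlikely) case too.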
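A possible Java rendering of this counting loop, including the invalid-pair case, is sketched below. The names Utf8Count and utf8Length are hypothetical. Counting an unpaired surrogate as 1 byte is an assumption chosen to match String.getBytes, which is documented to substitute the charset's replacement bytes (a single '?' for UTF-8) for malformed input; a CharsetEncoder configured to report errors would throw instead.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Count {
    // Counts the UTF-8 bytes of s without encoding it.
    static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count += 1;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4; // valid pair: one 4-byte sequence
                i++;        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                count += 1; // assumption: unpaired surrogate becomes '?'
            } else {
                count += 3;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] samples = {"abcd", "a\u0080b", "a\u0800b", "\uD834\uDD1E", "a\uD834b"};
        for (String s : samples) {
            // Compare the loop against the JDK encoder.
            int expected = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(utf8Length(s) + " vs " + expected);
        }
    }
}
```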
