Matching characters outside the Basic Multilingual Plane with Java regular expressions

How can I match characters outside the Unicode Basic Multilingual Plane (BMP) in Java, with the intent of removing them?

To remove all non-BMP characters, the following should work:

String sanitizedString = inputString.replaceAll("[^\u0000-\uFFFF]", ""); 
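For anyone who wants to try this out, here is a minimal runnable sketch of the same idea. The class name `NonBmpRegexDemo` and the method `stripNonBmp` are hypothetical names chosen for illustration. Instead of the negated class, it names the supplementary range explicitly with `\x{...}` escapes (available in `java.util.regex` since Java 7), which should behave the same way on well-formed input.

```java
import java.util.regex.Pattern;

class NonBmpRegexDemo {
    // Matches any single code point in the supplementary planes (U+10000..U+10FFFF).
    private static final Pattern NON_BMP = Pattern.compile("[\\x{10000}-\\x{10FFFF}]");

    // Hypothetical helper: returns the input with every non-BMP code point removed.
    static String stripNonBmp(String input) {
        return NON_BMP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+1F600 has to be written as a surrogate pair in a Java string literal.
        String input = "smile \uD83D\uDE00 done";
        System.out.println(stripNonBmp(input));
    }
}
```

Compiling the pattern once into a constant avoids recompiling it on every call, which `String.replaceAll` would otherwise do.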

Are you looking for specific characters outside the BMP, or for all of them?

If the former, you can use a StringBuilder to construct a string that contains code points from the higher planes, and the regex will work as expected:

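    // Requires java.util.regex.Matcher and java.util.regex.Pattern.
    String test = new StringBuilder()
        .append("test")
        .appendCodePoint(0x10300)
        .append("test")
        .toString();
    // Build the pattern from the same supplementary code point.
    Pattern regex = Pattern.compile(new StringBuilder().appendCodePoint(0x10300).toString());
    Matcher matcher = regex.matcher(test);
    matcher.find();
    // Prints the char index at which the match begins.
    System.out.println(matcher.start());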
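One thing worth keeping in mind with this approach: `Matcher.start()` and `Matcher.end()` report positions in `char` units (UTF-16 code units), and a supplementary code point occupies two of them. The sketch below illustrates that; `SupplementaryMatchDemo` and `findCodePoint` are hypothetical names, and `Pattern.quote` is used only so the helper also stays safe for code points that happen to be regex metacharacters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SupplementaryMatchDemo {
    // Hypothetical helper: returns {start, end} (in char units) of the first
    // occurrence of the given code point in text, or null if it does not occur.
    static int[] findCodePoint(String text, int codePoint) {
        String literal = new String(Character.toChars(codePoint));
        Matcher m = Pattern.compile(Pattern.quote(literal)).matcher(text);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        String test = new StringBuilder()
                .append("test")
                .appendCodePoint(0x10300)
                .append("test")
                .toString();
        int[] span = findCodePoint(test, 0x10300);
        // The span is two chars wide because U+10300 is stored as a surrogate pair.
        System.out.println(span[0] + ".." + span[1]);
    }
}
```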

If you want to remove all non-BMP characters from a string, I would use a StringBuilder directly rather than a regex:

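    StringBuilder sb = new StringBuilder(test.length());
    for (int ii = 0; ii < test.length(); ) {
        int codePoint = test.codePointAt(ii);
        if (codePoint > 0xFFFF) {
            // Supplementary code point: skip both of its UTF-16 code units.
            ii += Character.charCount(codePoint);
        } else {
            // BMP code point: it is exactly one char, so copy it and advance by one.
            sb.appendCodePoint(codePoint);
            ii++;
        }
    }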
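On Java 8 or later, the same filtering can also be sketched with the code point stream from `String.codePoints()`. This is an alternative to the loop, not a replacement for it; `NonBmpFilterDemo` and `stripNonBmp` are hypothetical names, and `Character.isBmpCodePoint` (Java 7+) stands in for the `codePoint > 0xFFFF` check.

```java
class NonBmpFilterDemo {
    // Hypothetical helper: keeps only the code points that lie inside the BMP.
    static String stripNonBmp(String input) {
        return input.codePoints()
                .filter(Character::isBmpCodePoint)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String test = new StringBuilder()
                .append("test")
                .appendCodePoint(0x10300)
                .append("test")
                .toString();
        System.out.println(stripNonBmp(test));
    }
}
```

The explicit loop avoids the stream overhead and works on older Java versions, so which form to prefer is mostly a matter of taste and target platform.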