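Removing invalid XML characters from a string in Java

Hi, I would like to remove all invalid XML characters from a string. I would like to use a regular expression with the string.replace method.

Something like: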

line.replace(regExp,"");

What is the right regExp to use?

An invalid XML character is anything that is not in these ranges:

 [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] 

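Thanks.

Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars.

Here is the pattern for removing characters that are illegal in XML 1.0: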

    // XML 1.0
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    String xml10pattern = "[^"
                        + "\u0009\r\n"
                        + "\u0020-\uD7FF"
                        + "\uE000-\uFFFD"
                        + "\ud800\udc00-\udbff\udfff"
                        + "]";

Most people will want the XML 1.0 version.

Here is the pattern for removing characters that are illegal in XML 1.1:

    // XML 1.1
    // [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    String xml11pattern = "[^"
                        + "\u0001-\uD7FF"
                        + "\uE000-\uFFFD"
                        + "\ud800\udc00-\udbff\udfff"
                        + "]+";

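You will need to use String.replaceAll(...) instead of String.replace(...), because only replaceAll treats its first argument as a regular expression.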

    // pattern is one of the two pattern strings above (xml10pattern or xml11pattern)
    String illegal = "Hello, World!\0";
    String legal = illegal.replaceAll(pattern, "");
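For repeated use it may be worth compiling the pattern once instead of calling String.replaceAll each time. The following is a minimal sketch under that assumption; the class name Xml10RegexCleaner and the helper clean are made up for illustration, and the character class is the XML 1.0 one shown above.

```java
import java.util.regex.Pattern;

class Xml10RegexCleaner {
    // Same character class as the XML 1.0 pattern above, compiled once for reuse.
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]");

    // Hypothetical helper: returns the input with illegal characters removed.
    static String clean(String text) {
        return INVALID_XML10.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The trailing NUL character is removed before printing.
        System.out.println(clean("Hello, World!\0"));
    }
}
```

Characters outside the Basic Multilingual Plane, such as emoji, are kept because the regex engine reads the surrogate pairs in the pattern as whole code points.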

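Should we consider surrogate characters? Otherwise '(current >= 0x10000) && (current <= 0x10FFFF)' will never be true, because a Java char cannot hold a value above 0xFFFF.

In my tests, the regex approach also seemed slower than the following loop.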

    if (null == text || text.isEmpty()) {
        return text;
    }
    final int len = text.length();
    char current = 0;
    int codePoint = 0;
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < len; i++) {
        current = text.charAt(i);
        boolean surrogate = false;
        // A high surrogate followed by a low surrogate forms one code point.
        if (Character.isHighSurrogate(current)
                && i + 1 < len && Character.isLowSurrogate(text.charAt(i + 1))) {
            surrogate = true;
            codePoint = text.codePointAt(i++);
        } else {
            codePoint = current;
        }
        if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
                || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
                || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
                || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
            sb.append(current);
            if (surrogate) {
                sb.append(text.charAt(i)); // the low surrogate of the pair
            }
        }
    }
    return sb.toString();

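Jun's solution, simplified. By using StringBuilder#appendCodePoint(int), I need neither char current nor String#charAt(int). I can detect a surrogate pair by checking whether codePoint is greater than 0xFFFF.

(The i++ is not strictly necessary, since a low surrogate would not pass the filter anyway. But then someone would reuse the code with a different set of code points and it would fail. I prefer programming to hacking.)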

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
        int codePoint = text.codePointAt(i);
        if (codePoint > 0xFFFF) {
            i++; // skip the low surrogate of the pair
        }
        if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
                || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
                || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
                || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
            sb.appendCodePoint(codePoint);
        }
    }
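The same loop can also be written so that the index always advances by the size of the code point just read, which avoids the separate i++. This is only a sketch: the class name Xml10CodePointCleaner and the method names isValidXml10 and strip are invented here, and the ranges are the XML 1.0 ones used throughout this thread.

```java
class Xml10CodePointCleaner {
    // True if the code point is in one of the XML 1.0 ranges quoted above.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: copies only the valid code points of text.
    static String strip(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXml10(cp)) {
                sb.appendCodePoint(cp);
            }
            // charCount is 2 for a surrogate pair and 1 for everything else.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

An unpaired surrogate is read by codePointAt as a single code point in the D800-DFFF range, so it fails the check and is dropped like any other illegal character.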

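So far, all of these answers only replace the characters themselves. But sometimes an XML document contains invalid XML entity sequences that also cause errors. For example, if you have &#2; in your XML, a Java XML parser will throw Illegal character entity: expansion character (code 0x2 at ...

Here is a simple Java program that removes those invalid entity sequences.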

    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public final Pattern XML_ENTITY_PATTERN =
            Pattern.compile("\\&\\#(?:x([0-9a-fA-F]+)|([0-9]+))\\;");

    /**
     * Remove problematic xml entities from the xml string so that you can
     * parse it with java DOM / SAX libraries.
     */
    String getCleanedXml(String xmlString) {
        Matcher m = XML_ENTITY_PATTERN.matcher(xmlString);
        Set<String> replaceSet = new HashSet<>();
        while (m.find()) {
            String group = m.group(1);
            int val;
            if (group != null) {
                // Hexadecimal reference such as &#x2;
                val = Integer.parseInt(group, 16);
                if (isInvalidXmlChar(val)) {
                    replaceSet.add("&#x" + group + ";");
                }
            } else if ((group = m.group(2)) != null) {
                // Decimal reference such as &#2;
                val = Integer.parseInt(group);
                if (isInvalidXmlChar(val)) {
                    replaceSet.add("&#" + group + ";");
                }
            }
        }
        String cleanedXmlString = xmlString;
        for (String replacer : replaceSet) {
            // Literal replacement; no regex is needed here.
            cleanedXmlString = cleanedXmlString.replace(replacer, "");
        }
        return cleanedXmlString;
    }

    private boolean isInvalidXmlChar(int val) {
        if (val == 0x9 || val == 0xA || val == 0xD
                || val >= 0x20 && val <= 0xD7FF
                || val >= 0xE000 && val <= 0xFFFD
                || val >= 0x10000 && val <= 0x10FFFF) {
            return false;
        }
        return true;
    }
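The two-pass approach (collect the bad references, then replace each one) can also be done in a single pass with Matcher.appendReplacement. The sketch below assumes the same XML 1.0 ranges; the class name XmlCharRefCleaner and its method names are hypothetical. It additionally treats references whose number does not fit in an int as invalid, which is an assumption of this sketch rather than something stated in the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlCharRefCleaner {
    // Matches numeric character references: hex in group 1, decimal in group 2.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: drops references to characters XML 1.0 forbids.
    static String removeInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean valid;
            try {
                int cp = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                valid = isValidXml10(cp);
            } catch (NumberFormatException e) {
                // Assumption: a number too large for int cannot be a legal code point.
                valid = false;
            }
            // Keep valid references verbatim, replace invalid ones with nothing.
            m.appendReplacement(out, valid ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For instance, removeInvalidCharRefs("a&#2;b&#x41;c") drops &#2; and leaves &#x41; in place. Named entities such as &amp;amp; are not touched because the pattern only matches numeric references.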

From Mark McLaren's blog:

    /**
     * This method ensures that the output String has only
     * valid XML unicode characters as specified by the
     * XML 1.0 standard. For reference, please see the
     * standard. This method will return an empty
     * String if the input is null or empty.
     *
     * @param in The String whose non-valid characters we want to remove.
     * @return The in String, stripped of non-valid characters.
     */
    public static String stripNonValidXMLCharacters(String in) {
        StringBuffer out = new StringBuffer(); // Used to hold the output.
        char current; // Used to reference the current character.

        if (in == null || ("".equals(in))) return ""; // vacancy test.
        for (int i = 0; i < in.length(); i++) {
            current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
            // NOTE: current is a char (at most 0xFFFF), so the last range below never
            // matches; both halves of a surrogate pair fail every test and are dropped.
            if ((current == 0x9) || (current == 0xA) || (current == 0xD)
                    || ((current >= 0x20) && (current <= 0xD7FF))
                    || ((current >= 0xE000) && (current <= 0xFFFD))
                    || ((current >= 0x10000) && (current <= 0x10FFFF)))
                out.append(current);
        }
        return out.toString();
    }

From "Best way to encode text data for XML in Java?":

    String xmlEscapeText(String t) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '\"': sb.append("&quot;"); break;
                case '&':  sb.append("&amp;");  break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (c > 0x7e) {
                        // Non-ASCII chars become numeric references. This works per char,
                        // so a surrogate pair is written as two separate references.
                        sb.append("&#" + ((int) c) + ";");
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }
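Because the method above works one char at a time, a character outside the Basic Multilingual Plane ends up as two references to surrogate values, which XML does not allow. A possible variant that iterates by code point is sketched below; the class name XmlTextEscaper and the method name escape are made up for this example. Like the original, it only escapes and does not remove characters that are illegal in XML.

```java
class XmlTextEscaper {
    // Hypothetical helper: escapes markup characters and writes every
    // code point above 0x7e as a single decimal character reference.
    static String escape(String t) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < t.length(); ) {
            int cp = t.codePointAt(i);
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
            switch (cp) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '&':  sb.append("&amp;");  break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (cp > 0x7e) {
                        sb.append("&#").append(cp).append(';');
                    } else {
                        sb.appendCodePoint(cp);
                    }
            }
        }
        return sb.toString();
    }
}
```

With this variant, U+1F600 is written as the single reference &amp;#128512; instead of two surrogate references.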

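If you want to store text elements containing forbidden characters in an XML-like form, you can use XPL instead. The dev-kit provides concurrent XPL-to-XML conversion and XML processing, which means the translation from XPL to XML has no time cost. Or, if you do not need the full power of XML (namespaces), you can simply use XPL on its own.

Web page: HLL XPL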

    // xmlData is the input string; the result gets its own variable name,
    // because a local variable cannot be used in its own initializer.
    String cleanedXml = xmlData.codePoints()
            .filter(c -> isValidXMLChar(c))
            .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
            .toString();

    private boolean isValidXMLChar(int c) {
        return (c == 0x9) || (c == 0xA) || (c == 0xD)
                || ((c >= 0x20) && (c <= 0xD7FF))
                || ((c >= 0xE000) && (c <= 0xFFFD))
                || ((c >= 0x10000) && (c <= 0x10FFFF));
    }

I believe the following articles may help you.

http://commons.apache.org/lang/api-2.1/org/apache/commons/lang/StringEscapeUtils.html
http://www.javapractices.com/topic/TopicAction.do?Id=96

In short, try using StringEscapeUtils from the Jakarta project.