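Filtering illegal XML characters in Java

The XML specification defines the subset of Unicode characters that are allowed in XML documents: http://www.w3.org/TR/REC-xml/#charsets

How can I filter these characters out of a String in Java?

A simple test case: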

Assert.equals("", filterIllegalXML(""+Character.valueOf((char) 2))) 

Finding every character that is invalid in XML is not trivial. You need to call XMLChar.isInvalid() from Xerces, or reimplement it:

http://kickjava.com/src/org/apache/xerces/util/XMLChar.java.htm

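This page contains a Java method that removes invalid XML characters by testing whether each character falls within the ranges the spec allows. It does not check for the characters the spec calls highly discouraged.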
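The per-character test described above can be sketched with the standard library alone. This is a minimal sketch, assuming the XML 1.0 `Char` production as the definition of "legal"; the class name `XmlCharFilter` and the helper `isLegalXml10` are made up for illustration, and `filterIllegalXML` is simply the name used in the question's test case. Like the linked method, it does not remove the discouraged characters.

```java
public final class XmlCharFilter {

    private XmlCharFilter() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every illegal code point dropped. */
    public static String filterIllegalXML(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt joins valid surrogate pairs; an unpaired surrogate
            // comes back as itself and is rejected by the range test.
            int cp = in.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Iterating by code point rather than by `char` keeps supplementary characters intact while still dropping unpaired surrogates.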

By the way, escaping the characters is not a solution, because the XML 1.0 and 1.1 specs do not allow the invalid characters in escaped form either.
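To see why escaping does not help, here is a small sketch that feeds a numeric character reference to the JDK's built-in DOM parser. It assumes an XML 1.0 document (no XML declaration); the class name `EscapedControlCharDemo` and the method `parses` are invented for this example. Under XML 1.0 a character reference must itself resolve to a legal character, so a parser should reject `&#2;` as not well-formed while accepting `&#9;` (tab).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public final class EscapedControlCharDemo {

    private EscapedControlCharDemo() {}

    /** Returns true if the parser accepts the document as well-formed. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors surface as SAXException (SAXParseException).
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("tab reference accepted:    " + parses("<a>&#9;</a>"));
        System.out.println("U+0002 reference accepted: " + parses("<a>&#2;</a>"));
    }
}
```

The default error handler may also print a fatal-error line to stderr for the rejected document.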

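Here is a solution that handles raw characters as well as escaped characters in the stream, and that can be used with StAX or SAX. It needs to be extended to cover other invalid characters, but you get the idea: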

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Reader;
    import java.io.UnsupportedEncodingException;
    import java.io.Writer;

    import org.apache.commons.io.IOUtils;
    import org.apache.xerces.util.XMLChar;

    public class IgnoreIllegalCharactersXmlReader extends Reader {

        private final BufferedReader underlyingReader;
        private StringBuilder buffer = new StringBuilder(4096);
        private boolean eos = false;

        public IgnoreIllegalCharactersXmlReader(final InputStream is) throws UnsupportedEncodingException {
            underlyingReader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        }

        // Appends the next line of input to the buffer, or marks end of stream.
        private void fillBuffer() throws IOException {
            final String line = underlyingReader.readLine();
            if (line == null) {
                eos = true;
                return;
            }
            buffer.append(line);
            buffer.append('\n');
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            if (buffer.length() == 0 && eos) {
                return -1;
            }
            int satisfied = 0;
            int currentOffset = off;
            while (!eos && buffer.length() < len) {
                fillBuffer();
            }
            while (satisfied < len && buffer.length() > 0) {
                char ch = buffer.charAt(0);
                final char nextCh = buffer.length() > 1 ? buffer.charAt(1) : '\0';
                if (ch == '&' && nextCh == '#') {
                    final StringBuilder entity = new StringBuilder();
                    // Since we're reading lines it's safe to assume entity is all
                    // on one line so next char will/could be the hex char
                    int index = 0;
                    char entityCh = '\0';
                    // Read whole entity; stop at the end of the buffer if no ';' follows
                    while (entityCh != ';' && index < buffer.length()) {
                        entityCh = buffer.charAt(index++);
                        entity.append(entityCh);
                    }
                    // if it's bad get rid of it and clean it from the buffer and point to next valid char
                    // (the literal compared here is empty as given; a reference to an
                    // illegal character, such as the one for U+0002, is presumably meant)
                    if (entity.toString().equals("")) {
                        buffer.delete(0, entity.length());
                        continue;
                    }
                }
                if (XMLChar.isValid(ch)) {
                    satisfied++;
                    cbuf[currentOffset++] = ch;
                }
                buffer.deleteCharAt(0);
            }
            return satisfied;
        }

        @Override
        public void close() throws IOException {
            underlyingReader.close();
        }

        public static void main(final String[] args) {
            final File file = new File( ); // input path left blank in the original
            final File outFile = new File(file.getParentFile(),
                    file.getName().replace(".xml", ".cleaned.xml"));
            Reader r = null;
            Writer w = null;
            try {
                r = new IgnoreIllegalCharactersXmlReader(new FileInputStream(file));
                w = new OutputStreamWriter(new FileOutputStream(outFile), "UTF-8");
                IOUtils.copyLarge(r, w);
                w.flush();
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                IOUtils.closeQuietly(r);
                IOUtils.closeQuietly(w);
            }
        }
    }
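If pulling in Xerces and commons-io is not an option, the raw-character half of the reader above can be approximated with `java.io.FilterReader`. This is only a sketch under a few assumptions: the class name `IllegalXmlCharFilterReader` and its helpers are hypothetical, the ranges are those of XML 1.0, surrogates are passed through on the assumption that the input is well-formed UTF-16, and escaped character references are not handled at all.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class IllegalXmlCharFilterReader extends FilterReader {

    public IllegalXmlCharFilterReader(Reader in) {
        super(in);
    }

    /** XML 1.0 ranges, checked per UTF-16 unit; surrogates pass through. */
    static boolean isAllowed(char ch) {
        return ch == 0x9 || ch == 0xA || ch == 0xD
                || (ch >= 0x20 && ch <= 0xD7FF)
                || Character.isSurrogate(ch)
                || (ch >= 0xE000 && ch <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(cbuf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the allowed characters to the front of the region just read.
            int kept = 0;
            for (int i = 0; i < n; i++) {
                char ch = cbuf[off + i];
                if (isAllowed(ch)) {
                    cbuf[off + kept++] = ch;
                }
            }
            if (kept > 0) {
                return kept;
            }
            // Everything in this chunk was filtered out: read again instead of returning 0.
        }
    }

    /** Convenience helper for strings, mainly to show usage. */
    static String filter(String s) {
        try (Reader r = new IllegalXmlCharFilterReader(new StringReader(s))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because it is a plain `Reader`, an instance can be handed to a StAX or SAX parser in place of the original stream.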

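Loosely based on a comment in the link from Stephen C's answer, and on Wikipedia's summary of the XML 1.1 spec, here is a Java method that shows how to check for illegal characters with regular expressions; the same character classes can be passed to replaceAll to remove them: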

    // requires: import java.util.regex.Pattern;
    boolean isAllValidXmlChars(String s) {
        // xml 1.1 spec http://en.wikipedia.org/wiki/Valid_characters_in_XML
        // Every character of the string must lie in the valid ranges
        // (note the trailing '*': matches() tests the whole string).
        if (!s.matches("[\\u0001-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]*")) {
            // not in valid ranges
            return false;
        }
        // find() looks for a control character anywhere in the string
        if (Pattern.compile("[\\u0001-\\u0008\\u000b-\\u000c\\u000E-\\u001F\\u007F-\\u0084\\u0086-\\u009F]")
                .matcher(s).find()) {
            // a control character
            return false;
        }
        // "Characters allowed but discouraged"
        if (Pattern.compile("[\\uFDD0-\\uFDEF"
                + "\\x{1FFFE}-\\x{1FFFF}\\x{2FFFE}-\\x{2FFFF}\\x{3FFFE}-\\x{3FFFF}\\x{4FFFE}-\\x{4FFFF}"
                + "\\x{5FFFE}-\\x{5FFFF}\\x{6FFFE}-\\x{6FFFF}\\x{7FFFE}-\\x{7FFFF}\\x{8FFFE}-\\x{8FFFF}"
                + "\\x{9FFFE}-\\x{9FFFF}\\x{AFFFE}-\\x{AFFFF}\\x{BFFFE}-\\x{BFFFF}\\x{CFFFE}-\\x{CFFFF}"
                + "\\x{DFFFE}-\\x{DFFFF}\\x{EFFFE}-\\x{EFFFF}\\x{FFFFE}-\\x{FFFFF}\\x{10FFFE}-\\x{10FFFF}]")
                .matcher(s).find()) {
            return false;
        }
        return true;
    }

Using StringEscapeUtils.escapeXml(xml) from commons-lang will escape the characters, not filter them.

You can use a regular expression (regex) to do the job; see the example in the comments here.
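As a rough illustration of the regex approach, the sketch below removes everything outside the XML 1.0 `Char` ranges with a single negated character class. The class name `XmlRegexFilter` and the method `stripIllegal` are hypothetical, and the pattern is my own reading of the spec rather than the example the comment refers to. It relies on the `\x{h...h}` code-point syntax that `java.util.regex.Pattern` supports from Java 7 onward.

```java
import java.util.regex.Pattern;

public final class XmlRegexFilter {

    // Anything NOT in the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    private XmlRegexFilter() {}

    /** Returns the input with every illegal character removed. */
    public static String stripIllegal(String s) {
        return ILLEGAL_XML10.matcher(s).replaceAll("");
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the expression on every call, which `String.replaceAll` would do.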