Automatically detecting the URI encoding in Tomcat

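I have an instance of Apache Tomcat 6.x running, and I want it to interpret the character set of incoming URLs a little more intelligently than it does by default. In particular, I want to achieve the following mapping: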

    So%DFe       => Soße
    So%C3%9Fe    => Soße
    So%DF%C3%9F  => (error)

The behavior I want can be described as "try to decode the byte stream as UTF-8, and if that does not work, assume ISO-8859-1".
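The quoted rule can be made concrete with a small standalone sketch. It is not Tomcat code: the class `Utf8OrLatin1` and its methods are hypothetical names, and it works on a single percent-encoded string rather than on a servlet request. It decodes strictly as UTF-8 and falls back to ISO-8859-1 only when the bytes are not valid UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Tomcat): UTF-8 first, ISO-8859-1 as fallback.
class Utf8OrLatin1 {

    // Turns a percent-encoded ASCII string into the raw bytes it stands for.
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                if (i + 2 >= s.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2;
            } else if (c < 0x80) {
                out.write(c);
            } else {
                throw new IllegalArgumentException("Non-ASCII character at index " + i);
            }
        }
        return out.toByteArray();
    }

    // Strict UTF-8 decoding; any malformed sequence triggers the ISO-8859-1 fallback.
    static String decode(String percentEncoded) {
        byte[] bytes = percentDecode(percentEncoded);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // ISO-8859-1 maps every byte to a character, so this never fails.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("So%DFe"));
        System.out.println(decode("So%C3%9Fe"));
    }
}
```

Note that under this plain fallback rule the mixed input `So%DF%C3%9F` is not rejected: it is invalid UTF-8, so it is read as ISO-8859-1. Turning that case into an error would need an additional check on the fallback result.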

Just using the URIEncoding configuration does not work in this case. So how can I configure Tomcat to decode the request the way I want?

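I might have to write a Filter that takes the request (especially the query string) and re-decodes it into the parameters. Would that be the natural way?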

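The complicated way to achieve my goal is indeed to write my own javax.servlet.Filter and embed it into the filter chain. This solution is in line with the advice given in the Tomcat Wiki (Apache Tomcat - Character Encoding Issues).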

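Update (2010-07-31): The first version of this filter interpreted the query string itself, which was a bad idea. It did not handle POST requests correctly and caused problems when combined with other servlet filters, such as those for URL rewriting. This version instead wraps the originally provided parameters and recodes them. For this to work correctly, the URIEncoding (for example, in Tomcat) must be configured as ISO-8859-1, so that every byte of the request arrives as exactly one character.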
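To illustrate why ISO-8859-1 is required as the container's URIEncoding, here is a minimal standalone sketch of the recoding step. The class name `RecodeSketch` is made up for this example and it uses `StandardCharsets`, which is newer than the Java version Tomcat 6 typically ran on. Because ISO-8859-1 maps each byte to the character with the same number, the original bytes can be recovered from the string and decoded again strictly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the recoding idea, independent of the servlet API.
class RecodeSketch {

    // Takes a string that was decoded as ISO-8859-1 and re-decodes its bytes.
    static String recode(String latin1Decoded, Charset target)
            throws CharacterCodingException {
        byte[] bytes = new byte[latin1Decoded.length()];
        for (int i = 0; i < bytes.length; i++) {
            char c = latin1Decoded.charAt(i);
            if (c > 0xFF) {
                // Not the result of a 1:1 byte-to-char mapping.
                throw new IllegalArgumentException("Invalid character: #" + (int) c);
            }
            bytes[i] = (byte) c;
        }
        return target.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // "So%C3%9Fe" as the container delivers it with URIEncoding=ISO-8859-1
        String asDelivered = "So\u00C3\u009Fe";
        System.out.println(recode(asDelivered, StandardCharsets.UTF_8));
    }
}
```

If the container had already decoded the URI with some other charset, bytes that are invalid in that charset would have been replaced or dropped before the filter sees them, and the original bytes could no longer be recovered.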

    package de.roland_illig.webapps.webapp1;

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.IllegalCharsetNameException;
    import java.nio.charset.UnsupportedCharsetException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Enumeration;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Pattern;

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletRequestWrapper;
    import javax.servlet.http.HttpServletResponse;

    /**
     * Automatically determines the encoding of the request parameters. It assumes
     * that the parameters of the original request are encoded by a 1:1 mapping from
     * bytes to characters.
     * <p>
     * If the request parameters cannot be decoded by any of the given encodings,
     * the filter chain is not processed further, but a status code of 400 with a
     * helpful error message is returned instead.
     * <p>
     * The filter can be configured using the following parameters:
     * <ul>
     * <li>{@code encodings}: The comma-separated list of encodings (see
     * {@link Charset#forName(String)}) that are tried in order. The first one that
     * can decode the complete query string is taken.
     * <p>
     * Default value: {@code UTF-8}
     * <p>
     * Example: {@code UTF-8,EUC-KR,ISO-8859-15}.
     * <li>{@code inputEncodingParameterName}: When this parameter is defined and a
     * query parameter of that name is provided by the client, and that parameter's
     * value contains only non-escaped characters and the server knows an encoding
     * of that name, then it is used exclusively, overriding the {@code encodings}
     * parameter for this request.
     * <p>
     * Default value: {@code null}
     * <p>
     * Example: {@code ie} (as used by Google).
     * </ul>
     */
    public class EncodingFilter implements Filter {

        private static final Pattern PAT_COMMA = Pattern.compile(",\\s*");

        private String inputEncodingParameterName = null;
        private final List<Charset> encodings = new ArrayList<Charset>();

        @Override
        @SuppressWarnings("unchecked")
        public void init(FilterConfig config) throws ServletException {
            String encodingsStr = "UTF-8";

            Enumeration<String> en = config.getInitParameterNames();
            while (en.hasMoreElements()) {
                final String name = en.nextElement();
                final String value = config.getInitParameter(name);
                if (name.equals("encodings")) {
                    encodingsStr = value;
                } else if (name.equals("inputEncodingParameterName")) {
                    inputEncodingParameterName = value;
                } else {
                    throw new IllegalArgumentException("Unknown parameter: " + name);
                }
            }

            for (String encoding : PAT_COMMA.split(encodingsStr)) {
                Charset charset = Charset.forName(encoding);
                encodings.add(charset);
            }
        }

        @SuppressWarnings("unchecked")
        @Override
        public void doFilter(ServletRequest sreq, ServletResponse sres, FilterChain fc)
                throws IOException, ServletException {
            final HttpServletRequest req = (HttpServletRequest) sreq;
            final HttpServletResponse res = (HttpServletResponse) sres;

            final Map<String, String[]> params;
            try {
                params = Util.decodeParameters(req.getParameterMap(), encodings, inputEncodingParameterName);
            } catch (IOException e) {
                // None of the configured encodings could decode the parameters.
                res.sendError(400, e.getMessage());
                return;
            }

            HttpServletRequest wrapper = new ParametersWrapper(req, params);
            fc.doFilter(wrapper, res);
        }

        @Override
        public void destroy() {
            // nothing to do
        }

        static abstract class Util {

            /** Returns a decoder that reports errors instead of replacing bad input. */
            static CharsetDecoder strictDecoder(Charset cs) {
                CharsetDecoder dec = cs.newDecoder();
                dec.onMalformedInput(CodingErrorAction.REPORT);
                dec.onUnmappableCharacter(CodingErrorAction.REPORT);
                return dec;
            }

            static int[] toCodePoints(String str) {
                final int len = str.length();
                int[] codePoints = new int[len];
                int i = 0, j = 0;
                while (i < len) {
                    int cp = Character.codePointAt(str, i);
                    codePoints[j++] = cp;
                    i += Character.charCount(cp);
                }
                // Trim to the number of code points actually stored (j), not to len.
                return j == len ? codePoints : Arrays.copyOf(codePoints, j);
            }

            /** Maps each char back to one byte and decodes those bytes with the given decoder. */
            public static String recode(String encoded, CharsetDecoder decoder) throws IOException {
                byte[] bytes = new byte[encoded.length()];
                int bytescount = 0;

                for (int i = 0; i < encoded.length(); i++) {
                    char c = encoded.charAt(i);
                    if (!(c <= '\u00FF'))
                        throw new IOException("Invalid character: #" + (int) c);
                    bytes[bytescount++] = (byte) c;
                }

                CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bytes, 0, bytescount));
                String result = cbuf.toString();
                return result;
            }

            static String ensureDefinedUnicode(String s) throws IOException {
                for (int cp : toCodePoints(s)) {
                    if (!Character.isDefined(cp))
                        throw new IOException("Undefined unicode code point: " + cp);
                }
                return s;
            }

            static Map<String, String[]> decodeParameters(Map<String, String[]> originalParams,
                    List<Charset> charsets, String ieName) throws IOException {
                Map<String, String[]> params = new LinkedHashMap<String, String[]>();

                // An explicit input encoding from the client overrides the configured list.
                Charset ie = null;
                {
                    String[] values = originalParams.get(ieName);
                    if (values != null) {
                        for (String value : values) {
                            if (!value.isEmpty() && value.indexOf('%') == -1) {
                                try {
                                    if (ie != null)
                                        throw new IOException("Duplicate value for input encoding parameter: "
                                                + ie + " and " + value + ".");
                                    ie = Charset.forName(value);
                                } catch (IllegalCharsetNameException e) {
                                    throw new IOException("Illegal input encoding name: " + value);
                                } catch (UnsupportedCharsetException e) {
                                    throw new IOException("Unsupported input encoding: " + value);
                                }
                            }
                        }
                    }
                }

                Charset[] css = (ie != null)
                        ? new Charset[] { ie }
                        : charsets.toArray(new Charset[charsets.size()]);

                // Try each charset in order; the first that decodes every name and value wins.
                for (Charset charset : css) {
                    try {
                        params.clear();
                        CharsetDecoder decoder = strictDecoder(charset);
                        for (Map.Entry<String, String[]> entry : originalParams.entrySet()) {
                            final String encodedName = entry.getKey();
                            final String name = ensureDefinedUnicode(Util.recode(encodedName, decoder));
                            for (final String encodedValue : entry.getValue()) {
                                final String value = ensureDefinedUnicode(Util.recode(encodedValue, decoder));
                                String[] oldValues = params.get(name);
                                String[] newValues = (oldValues == null)
                                        ? new String[1]
                                        : Arrays.copyOf(oldValues, oldValues.length + 1);
                                newValues[newValues.length - 1] = value;
                                params.put(name, newValues);
                            }
                        }
                        return params;
                    } catch (IOException e) {
                        continue;
                    }
                }

                List<String> kvs = new ArrayList<String>();
                for (Map.Entry<String, String[]> entry : originalParams.entrySet()) {
                    final String key = entry.getKey();
                    for (final String value : entry.getValue()) {
                        kvs.add(key + "=" + value);
                    }
                }
                throw new IOException("Could not decode the parameters: " + kvs.toString());
            }
        }

        @SuppressWarnings("unchecked")
        static class ParametersWrapper extends HttpServletRequestWrapper {

            private final Map<String, String[]> params;

            public ParametersWrapper(HttpServletRequest request, Map<String, String[]> params) {
                super(request);
                this.params = params;
            }

            @Override
            public String getParameter(String name) {
                String[] values = params.get(name);
                return (values != null && values.length != 0) ? values[0] : null;
            }

            @Override
            public Map<String, String[]> getParameterMap() {
                return Collections.unmodifiableMap(params);
            }

            @Override
            public Enumeration<String> getParameterNames() {
                return Collections.enumeration(params.keySet());
            }

            @Override
            public String[] getParameterValues(String name) {
                return params.get(name);
            }
        }
    }

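Although the amount of code is fairly small, there are some implementation details that are easy to get wrong, so I had hoped that Tomcat would already provide a similar filter.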

To activate this filter, I added the following to my web.xml:

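    <filter>
      <filter-name>EncodingFilter</filter-name>
      <filter-class>de.roland_illig.webapps.webapp1.EncodingFilter</filter-class>
      <init-param>
        <param-name>encodings</param-name>
        <param-value>US-ASCII, UTF-8, EUC-KR, ISO-8859-15, ISO-8859-1</param-value>
      </init-param>
      <init-param>
        <param-name>inputEncodingParameterName</param-name>
        <param-value>ie</param-value>
      </init-param>
    </filter>

    <filter-mapping>
      <filter-name>EncodingFilter</filter-name>
      <url-pattern>/*</url-pattern>
    </filter-mapping>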
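The order of the `encodings` list matters, because the first charset that decodes the input without error wins. The following standalone sketch shows that selection on raw bytes. The class `EncodingOrderSketch` is a hypothetical name, it leaves out the filter's extra check for undefined code points, and it only uses charsets that every Java runtime is guaranteed to provide.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of "first charset that decodes cleanly wins".
class EncodingOrderSketch {

    // Splits a comma-separated init-param value the same way the filter does.
    static List<Charset> parse(String initParam) {
        List<Charset> result = new ArrayList<>();
        for (String name : Pattern.compile(",\\s*").split(initParam)) {
            result.add(Charset.forName(name));
        }
        return result;
    }

    // Returns the first candidate that strictly decodes the bytes, or null if none does.
    static Charset firstMatch(byte[] bytes, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes));
                return cs;
            } catch (CharacterCodingException e) {
                // Not valid in this charset; try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Charset> order = parse("US-ASCII, UTF-8, ISO-8859-1");
        byte[] utf8 = { 'S', 'o', (byte) 0xC3, (byte) 0x9F, 'e' };
        byte[] latin1 = { 'S', 'o', (byte) 0xDF, 'e' };
        System.out.println(firstMatch(utf8, order));
        System.out.println(firstMatch(latin1, order));
    }
}
```

ISO-8859-1 accepts every byte sequence, so it only makes sense as the last entry: anything listed after it would never be tried.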

We had already done something similar to Roland's solution on SGES 2.1.1 (which uses Catalina, as some older Tomcat versions do), but it had some problems:

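  1. It duplicates functionality of the application server.
  2. It also has to take care of internal JSP requests, such as included pages with parameters.
  3. It has to parse the query string.
  4. It has to do everything again every time setRequest is called, only later, because of point 2.
  5. It is too heavy a workaround.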

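Today, after reading many blogs and recommendations, I deleted the whole class and did just one simple thing instead: in the wrapper's constructor, parse the charset from the Content-Type header and set it on the wrapped instance.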

It works, and all 988 of our tests pass.

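    import java.io.UnsupportedEncodingException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletRequestWrapper;

    public class CisHttpRequestWrapper extends HttpServletRequestWrapper {

        private static final Pattern CHARSET_PATTERN =
                Pattern.compile("(?i)\\bcharset=\\s*\"?([^\\s;\"]*)");
        private static final String CHARSET_DEFAULT = "ISO-8859-2";

        public CisHttpRequestWrapper(final HttpServletRequest request) {
            super(request);
            // Respect an encoding that the container or an earlier filter has already set.
            if (request.getCharacterEncoding() != null) {
                return;
            }

            final String charset = parseCharset(request);
            try {
                setCharacterEncoding(charset);
            } catch (final UnsupportedEncodingException e) {
                throw new IllegalStateException("Unknown charset: " + charset, e);
            }
        }

        /** Extracts the charset from the Content-Type header, or returns the default. */
        private String parseCharset(final HttpServletRequest request) {
            final String contentType = request.getHeader("Content-Type");
            if (contentType == null || contentType.isEmpty()) {
                return CHARSET_DEFAULT;
            }

            final Matcher m = CHARSET_PATTERN.matcher(contentType);
            if (!m.find()) {
                return CHARSET_DEFAULT;
            }

            final String charsetName = m.group(1).trim().toUpperCase();
            return charsetName;
        }
    }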
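To see what the charset pattern extracts from typical header values, the parsing step can be tried in isolation, without the servlet API. In this sketch the class `ContentTypeCharsetSketch` and the `fallback` parameter are hypothetical; the regular expression is the one from the wrapper above.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone version of the header parsing, for experimentation only.
class ContentTypeCharsetSketch {

    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("(?i)\\bcharset=\\s*\"?([^\\s;\"]*)");

    // Returns the upper-cased charset named in the header, or the fallback if none is named.
    static String parseCharset(String contentType, String fallback) {
        if (contentType == null || contentType.isEmpty()) {
            return fallback;
        }
        Matcher m = CHARSET_PATTERN.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        return m.group(1).trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String fallback = "ISO-8859-2";
        System.out.println(parseCharset("text/html; charset=utf-8", fallback));
        System.out.println(parseCharset(
                "application/x-www-form-urlencoded; charset=\"windows-1250\"", fallback));
        System.out.println(parseCharset("text/plain", fallback));
    }
}
```

One edge case to be aware of: a header such as `text/plain; charset=` matches the pattern with an empty group, so the result is an empty string rather than the fallback.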