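Apache Tika and the character limit when parsing documents
Could anybody please help me sort this out?
With the Tika facade, the limit can be raised like this: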
    Tika tika = new Tika();
    tika.setMaxStringLength(10 * 1024 * 1024);
But if you do not use the Tika facade directly, as in this code:
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext ps = new ParseContext();
    for (InputStream is : getInputStreams()) {
        parser.parse(is, textHandler, metadata, ps);
        is.close();
        System.out.println("Title: " + metadata.get("title"));
        System.out.println("Author: " + metadata.get("Author"));
    }
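then there is no way to set it, because you never interact with WriteOutContentHandler yourself. By the way, its limit is set to -1 by default, which means no limit. Yet the limit I actually hit is 100000 characters.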
    /**
     * The maximum number of characters to write to the character stream.
     * Set to -1 for no limit.
     */
    private final int writeLimit;

    /**
     * Number of characters written so far.
     */
    private int writeCount = 0;

    private WriteOutContentHandler(Writer writer, int writeLimit) {
        this.writer = writer;
        this.writeLimit = writeLimit;
    }

    /**
     * Creates a content handler that writes character events to
     * the given writer.
     *
     * @param writer writer
     */
    public WriteOutContentHandler(Writer writer) {
        this(writer, -1);
    }
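To make the writeLimit / writeCount mechanics in that excerpt concrete, here is a minimal sketch built only on the JDK's SAX classes. It is not Tika's implementation: `WriteLimitSketch`, `LimitedTextHandler`, and `collect` are hypothetical names made up for illustration, and the truncate-then-throw behaviour is an assumption about how such a handler would typically react once the limit is reached.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Hypothetical stand-in for a write-limited handler; not Tika's actual class. */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private final int writeLimit; // -1 means no limit
        private int writeCount = 0;   // characters kept so far

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit == -1 || writeCount + length <= writeLimit) {
                buffer.append(ch, start, length);
                writeCount += length;
            } else {
                // Keep the part that still fits, then signal that the limit was hit.
                int remaining = writeLimit - writeCount;
                buffer.append(ch, start, remaining);
                writeCount = writeLimit;
                throw new SAXException("Write limit of " + writeLimit + " characters reached");
            }
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    /** Feeds text to a fresh handler and returns how many characters it kept. */
    public static int collect(String text, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        try {
            handler.characters(text.toCharArray(), 0, text.length());
        } catch (SAXException e) {
            // Limit reached: the handler still holds the truncated text.
        }
        return handler.toString().length();
    }

    public static void main(String[] args) {
        System.out.println(collect("hello world", 5));  // limited to 5 characters
        System.out.println(collect("hello world", -1)); // -1 disables the limit
    }
}
```

The point of the sketch is only that the limit lives in the handler, so whichever constructor creates the handler decides which limit applies.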
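You must have overlooked the BodyContentHandler constructor that takes a write limit. The no-argument constructor bounds its internal buffer at 100,000 characters, which is the limit you are hitting. Pass a larger value, or -1 to disable the limit: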
    int writeLimit = 10 * 1024 * 1024; // or -1 for no limit
    ContentHandler textHandler = new BodyContentHandler(writeLimit);