Apache Tika and the character limit when parsing documents

Could anybody please help me sort this out?

It can be done like this:

    Tika tika = new Tika();
    tika.setMaxStringLength(10 * 1024 * 1024);

But if you don't use the Tika facade directly, like this:

    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext ps = new ParseContext();
    for (InputStream is : getInputStreams()) {
        parser.parse(is, textHandler, metadata, ps);
        is.close();
        System.out.println("Title: " + metadata.get("title"));
        System.out.println("Author: " + metadata.get("Author"));
    }

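then you can't set it, because you never interact with WriteOutContentHandler. By the way, its limit is set to -1 by default, which means no limit. Yet the limit that actually applies ends up being 100,000 characters.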

    /**
     * The maximum number of characters to write to the character stream.
     * Set to -1 for no limit.
     */
    private final int writeLimit;

    /**
     * Number of characters written so far.
     */
    private int writeCount = 0;

    private WriteOutContentHandler(Writer writer, int writeLimit) {
        this.writer = writer;
        this.writeLimit = writeLimit;
    }

    /**
     * Creates a content handler that writes character events to
     * the given writer.
     *
     * @param writer writer
     */
    public WriteOutContentHandler(Writer writer) {
        this(writer, -1);
    }

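You must have overlooked the BodyContentHandler constructor that takes a write limit. The no-argument constructor is the one that caps the output at 100,000 characters; passing -1 disables the limit.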

    // Pass the write limit to the constructor; -1 disables the limit.
    int writeLimit = 10 * 1024 * 1024;
    ContentHandler textHandler = new BodyContentHandler(writeLimit);
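
To make the effect of the write limit concrete, here is a minimal sketch of the writeLimit / writeCount bookkeeping shown in the WriteOutContentHandler excerpt from the question. It is not Tika's code: the class LimitedWriterSketch and its methods characters and collect are hypothetical names made up for illustration, and it uses only the JDK. It also simplifies one thing: as far as I know, the real handler signals a reached limit by throwing a SAXException, whereas this sketch just returns false.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical stand-in, NOT Tika's class: models a writeLimit/writeCount
// pair like the one in the WriteOutContentHandler excerpt.
class LimitedWriterSketch {
    private final Writer writer;
    private final int writeLimit;   // -1 means no limit
    private int writeCount = 0;     // characters written so far

    LimitedWriterSketch(Writer writer, int writeLimit) {
        this.writer = writer;
        this.writeLimit = writeLimit;
    }

    /** Writes the text, truncating at the limit; returns false once the limit is hit. */
    boolean characters(String text) throws IOException {
        if (writeLimit == -1 || writeCount + text.length() <= writeLimit) {
            writer.write(text);
            writeCount += text.length();
            return true;
        }
        // Write only the part that still fits, then report that the limit was reached.
        int remaining = writeLimit - writeCount;
        writer.write(text, 0, remaining);
        writeCount = writeLimit;
        return false;
    }

    /** Feeds all chunks through a sketch with the given limit and returns what was kept. */
    static String collect(int writeLimit, String... chunks) {
        StringWriter out = new StringWriter();
        LimitedWriterSketch sink = new LimitedWriterSketch(out, writeLimit);
        try {
            for (String chunk : chunks) {
                if (!sink.characters(chunk)) {
                    break; // simplification: this sketch stops instead of throwing
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect(5, "abc", "defgh"));   // prints "abcde"
        System.out.println(collect(-1, "abc", "defgh"));  // prints "abcdefgh"
    }
}
```

With a limit of 5 the second chunk is cut off after two characters, while -1 lets everything through. That is the same difference you see between new BodyContentHandler() and new BodyContentHandler(-1) on a long document.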