Apache Tika: extracting text from scanned PDF files

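I'm having some trouble with Apache Tika (version 1.10). I have some PDF files that are just scanned sheets of paper, which means each page is nothing but an image. My goal is to extract the text from these PDF files.

My Tesseract installation is set up correctly, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (never mind the missing exception handling):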

    public String extractText(InputStream stream) {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        parser.parse(stream, handler, metadata, context);
        String text = handler.toString();
        return text;
    }

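I have searched a lot, but I couldn't find a solution that works for me. I already tried the setExtractInlineImages method of the PDFParserConfig class, but that changed nothing. Using a custom ParsingEmbeddedDocumentExtractor did extract the embedded resources of doc files, but it did not work for my PDF files.

If any of you could offer some help, that would be awesome :)

Tim Allison came up with the solution: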

    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

    TesseractOCRConfig config = new TesseractOCRConfig();
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);

    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    parseContext.set(PDFParserConfig.class, pdfConfig);
    parseContext.set(Parser.class, parser); //need to add this to make sure recursive parsing happens!

    parser.parse(stream, handler, new Metadata(), parseContext);

This worked for me :)

Edit: here is the complete solution:

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    import java.io.FileInputStream;
    import java.io.IOException;

    /**
     * @since 8/26/16
     */
    public class Sample {

        public static void main(String[] args) throws IOException, TikaException, SAXException {
            Parser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

            TesseractOCRConfig config = new TesseractOCRConfig();
            PDFParserConfig pdfConfig = new PDFParserConfig();
            pdfConfig.setExtractInlineImages(true);

            ParseContext parseContext = new ParseContext();
            parseContext.set(TesseractOCRConfig.class, config);
            parseContext.set(PDFParserConfig.class, pdfConfig);
            //need to add this to make sure recursive parsing happens!
            parseContext.set(Parser.class, parser);

            FileInputStream stream = new FileInputStream("samplepdf.pdf");
            Metadata metadata = new Metadata();
            parser.parse(stream, handler, metadata, parseContext);
            System.out.println(metadata);
            String content = handler.toString();
            System.out.println("===============");
            System.out.println(content);
            System.out.println("Done");
        }
    }

Maven dependencies:

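    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>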
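
If the extracted text still comes back empty with this setup, it may be worth confirming that the Tesseract binary is visible to the Java process, since the OCR step depends on it. The following is only a rough sketch using the plain JDK (Java 9+), not anything from Tika itself: the class name TesseractCheck and the method isRunnable are made up for illustration, and it assumes the binary is called `tesseract`, is on the PATH, and accepts `--version`.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: checks whether a command can be started.
final class TesseractCheck {

    /** Returns true if "<command> --version" starts and exits with code 0 within 10 seconds. */
    static boolean isRunnable(String command) {
        try {
            Process process = new ProcessBuilder(command, "--version")
                    .redirectErrorStream(true)                        // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // the output itself is not needed
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();                            // hung process: treat as unusable
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            return false;                                             // command not found or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();                       // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the binary is named "tesseract" and is on the PATH.
        System.out.println("tesseract runnable: " + isRunnable("tesseract"));
    }
}
```

A result of false here would point to an installation or PATH problem rather than to the Tika configuration above.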
