Extracting non-selectable content from a PDF

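I'm using Apache PDFBox to extract pages from PDF files, and I can't find a way to extract content (text or images) that is not selectable. With content that can be selected in the PDF file there is no problem.

Note that the PDFs in question have no restrictions on copying content, at least as far as I can see in the "Document Restrictions Summary" of the document: both "Content Copying" and "Content Copying for Accessibility" are allowed! Within the same PDF file, some content is selectable and other parts are not. What happens is that the extracted pages come out with "holes", i.e. they contain only the selectable parts of the PDF. In MS Word, however, if I add the PDF as an object, the full content of the PDF page is shown! So I was hoping to do the same with the PDFBox lib or any other Java lib!

This is the code I use to convert the PDF pages to images: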

    private void convertPdfToImage(File pdfFile, int pdfId) throws IOException {
        PDDocument document = PDDocument.loadNonSeq(pdfFile, null);
        List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
        for (PDPage pdPage : pdPages) {
            // render the page as an RGB image at 300 dpi
            BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
            ImageIOUtil.writeImage(bim, TEMP_FILEPATH + pdfId + ".png", 300);
        }
        document.close();
    }

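Is there a way to extract the non-selectable content of a PDF with the Apache PDFBox library (or any similar library)? Or is this simply not possible? And if it really isn't, why not?

Any help is much appreciated!

EDIT: I'm using Adobe Reader as the PDF viewer and PDFBox v1.8. Here is a sample PDF: https://dl.dropboxusercontent.com/u/2815529/test.pdf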

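The two images in question, the fischer logo in the upper right and the small sketch a bit further down, are each drawn by filling a region on the page with a tiling pattern which in turn draws the respective image in its content stream.

Adobe Reader does not allow selecting the contents of patterns, and automatic image extractors often do not walk the Pattern resource tree either.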
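To make the point about the resource tree concrete, here is a toy model in plain Java with no PDFBox involved. The `Node` class, the `Kind` values and the resource names are hypothetical stand-ins for PDF resources, not library types. It contrasts an extractor that skips patterns with one that descends into them:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a PDF resource tree; every name here is hypothetical, not PDFBox API. */
final class ResourceTreeSketch {
    enum Kind { PAGE, PATTERN, FORM, IMAGE }

    static final class Node {
        final Kind kind;
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(Kind kind, String name, Node... kids) {
            this.kind = kind;
            this.name = name;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Collects image names but never looks below a pattern. */
    static List<String> imagesIgnoringPatterns(Node root) {
        List<String> found = new ArrayList<>();
        collect(root, false, found);
        return found;
    }

    /** Collects image names anywhere in the tree, including below patterns. */
    static List<String> imagesIncludingPatterns(Node root) {
        List<String> found = new ArrayList<>();
        collect(root, true, found);
        return found;
    }

    private static void collect(Node node, boolean followPatterns, List<String> found) {
        if (node.kind == Kind.IMAGE) {
            found.add(node.name);
            return;
        }
        if (node.kind == Kind.PATTERN && !followPatterns) {
            return; // this is where a pattern-unaware extractor stops
        }
        for (Node child : node.children) {
            collect(child, followPatterns, found);
        }
    }

    /** A page on which every image sits inside a pattern (made-up names). */
    static Node samplePage() {
        return new Node(Kind.PAGE, "page0",
                new Node(Kind.PATTERN, "P1", new Node(Kind.IMAGE, "Im1")),
                new Node(Kind.PATTERN, "P2",
                        new Node(Kind.FORM, "Fm1", new Node(Kind.IMAGE, "Im2"))));
    }

    public static void main(String[] args) {
        Node page = samplePage();
        System.out.println("without patterns: " + imagesIgnoringPatterns(page));
        System.out.println("with patterns:    " + imagesIncludingPatterns(page));
    }
}
```

On such a page the pattern-unaware walk finds nothing, while the full walk reaches both images, which is the situation described above.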

PDFBox 1.8.10

You can build a pattern image extractor with PDFBox fairly easily, e.g. for PDFBox 1.8.10:

    public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException {
        List<PDPage> pages = document.getDocumentCatalog().getAllPages();
        if (pages == null)
            return;

        for (int i = 0; i < pages.size(); i++) {
            String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
            extractPatternImages(pages.get(i), pageFormat);
        }
    }

    // page level: visit every pattern in the page resources
    public void extractPatternImages(PDPage page, String pageFormat) throws IOException {
        PDResources resources = page.getResources();
        if (resources == null)
            return;
        Map<String, PDPatternResources> patterns = resources.getPatterns();

        for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet()) {
            String patternFormat = String.format(pageFormat, "-" + patternEntry.getKey() + "%s", "%s");
            extractPatternImages(patternEntry.getValue(), patternFormat);
        }
    }

    // pattern level: visit the XObjects in the pattern's own resources
    public void extractPatternImages(PDPatternResources pattern, String patternFormat) throws IOException {
        COSDictionary resourcesDict = (COSDictionary) pattern.getCOSDictionary().getDictionaryObject(COSName.RESOURCES);
        if (resourcesDict == null)
            return;
        PDResources resources = new PDResources(resourcesDict);
        Map<String, PDXObject> xObjects = resources.getXObjects();
        if (xObjects == null)
            return;

        for (Map.Entry<String, PDXObject> entry : xObjects.entrySet()) {
            PDXObject xObject = entry.getValue();
            String xObjectFormat = String.format(patternFormat, "-" + entry.getKey() + "%s", "%s");
            if (xObject instanceof PDXObjectForm)
                extractPatternImages((PDXObjectForm) xObject, xObjectFormat);
            else if (xObject instanceof PDXObjectImage)
                extractPatternImages((PDXObjectImage) xObject, xObjectFormat);
        }
    }

    // form level: visit nested XObjects and nested patterns
    public void extractPatternImages(PDXObjectForm form, String imageFormat) throws IOException {
        PDResources resources = form.getResources();
        if (resources == null)
            return;

        Map<String, PDXObject> xObjects = resources.getXObjects();
        if (xObjects == null)
            return;

        for (Map.Entry<String, PDXObject> entry : xObjects.entrySet()) {
            PDXObject xObject = entry.getValue();
            String xObjectFormat = String.format(imageFormat, "-" + entry.getKey() + "%s", "%s");
            if (xObject instanceof PDXObjectForm)
                extractPatternImages((PDXObjectForm) xObject, xObjectFormat);
            else if (xObject instanceof PDXObjectImage)
                extractPatternImages((PDXObjectImage) xObject, xObjectFormat);
        }

        Map<String, PDPatternResources> patterns = resources.getPatterns();

        for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet()) {
            String patternFormat = String.format(imageFormat, "-" + patternEntry.getKey() + "%s", "%s");
            extractPatternImages(patternEntry.getValue(), patternFormat);
        }
    }

    // image level: write the image to a file named after its path in the resource tree
    public void extractPatternImages(PDXObjectImage image, String imageFormat) throws IOException {
        try (OutputStream out = new FileOutputStream(String.format(imageFormat, "", image.getSuffix()))) {
            image.write2OutputStream(out);
        }
    }

(ExtractPatternImages.java)
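The file names are built by re-filling the same two-placeholder format string at every level of the recursion. The following stand-alone sketch (plain Java; the helper names `descend` and `finish` are made up for illustration) isolates that trick:

```java
/** Stand-alone illustration of the nested String.format trick; no PDFBox needed. */
final class FileNameFormatSketch {
    /** Appends one name segment and keeps both placeholders alive for the next level. */
    static String descend(String format, Object segment) {
        return String.format(format, "-" + segment + "%s", "%s");
    }

    /** Closes the chain: no further segment, and the actual file suffix. */
    static String finish(String format, String suffix) {
        return String.format(format, "", suffix);
    }

    public static void main(String[] args) {
        String pageFormat = descend("testDrJorge%s.%s", 0);   // page index
        String patternFormat = descend(pageFormat, "R15");    // pattern name
        String imageFormat = descend(patternFormat, "R14");   // XObject name
        System.out.println(finish(imageFormat, "png"));
    }
}
```

One caveat of this approach: a resource name that itself contains a `%` character would be interpreted as a format specifier at the next level, so a more defensive variant would need to escape it.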

I applied it to your sample PDF

    public void testtestDrJorge() throws IOException {
        try (InputStream resource = getClass().getResourceAsStream("testDrJorge.pdf")) {
            PDDocument document = PDDocument.load(resource);
            extractPatternImages(document, "testDrJorge%s.%s");
        }
    }

(ExtractPatternImages.java)

and got two images:

testDrJorge-0-R15-R14.png
testDrJorge-0-R38-R37.png

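The images have lost their red parts. This most likely is because PDFBox versions 1.x.x do not properly support extraction of CMYK images, cf. PDFBOX-2128 (CMYK images are not supported correctly), and your images are CMYK.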
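For readers unfamiliar with CMYK: an extractor has to turn four ink components into three RGB components, and getting that step wrong shifts or drops colors. The sketch below uses the textbook uncalibrated formula. It is an assumption for illustration only, not what PDFBox does internally, and a real converter would normally use an ICC color profile:

```java
/** Naive, uncalibrated CMYK to RGB conversion; illustrative only, not PDFBox code. */
final class CmykSketch {
    /** Components are in [0, 1]; the result is packed as 0xRRGGBB. */
    static int toRgb(double c, double m, double y, double k) {
        int r = (int) Math.round(255 * (1 - c) * (1 - k));
        int g = (int) Math.round(255 * (1 - m) * (1 - k));
        int b = (int) Math.round(255 * (1 - y) * (1 - k));
        return (r << 16) | (g << 8) | b;
    }

    public static void main(String[] args) {
        // full magenta plus full yellow ink gives pure red under this formula
        System.out.println(String.format("%06X", toRgb(0, 1, 1, 0)));
    }
}
```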

PDFBox 2.0.0 release candidate

I updated the code to PDFBox 2.0.0 (currently available only as a release candidate):

    public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException {
        PDPageTree pages = document.getDocumentCatalog().getPages();
        if (pages == null)
            return;

        for (int i = 0; i < pages.getCount(); i++) {
            String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
            extractPatternImages(pages.get(i), pageFormat);
        }
    }

    // page level: visit every pattern in the page resources
    public void extractPatternImages(PDPage page, String pageFormat) throws IOException {
        PDResources resources = page.getResources();
        if (resources == null)
            return;
        Iterable<COSName> patternNames = resources.getPatternNames();

        for (COSName patternName : patternNames) {
            String patternFormat = String.format(pageFormat, "-" + patternName + "%s", "%s");
            extractPatternImages(resources.getPattern(patternName), patternFormat);
        }
    }

    // pattern level: visit the XObjects in the pattern's own resources
    public void extractPatternImages(PDAbstractPattern pattern, String patternFormat) throws IOException {
        COSDictionary resourcesDict = (COSDictionary) pattern.getCOSObject().getDictionaryObject(COSName.RESOURCES);
        if (resourcesDict == null)
            return;
        PDResources resources = new PDResources(resourcesDict);
        Iterable<COSName> xObjectNames = resources.getXObjectNames();
        if (xObjectNames == null)
            return;

        for (COSName xObjectName : xObjectNames) {
            PDXObject xObject = resources.getXObject(xObjectName);
            String xObjectFormat = String.format(patternFormat, "-" + xObjectName + "%s", "%s");
            if (xObject instanceof PDFormXObject)
                extractPatternImages((PDFormXObject) xObject, xObjectFormat);
            else if (xObject instanceof PDImageXObject)
                extractPatternImages((PDImageXObject) xObject, xObjectFormat);
        }
    }

    // form level: visit nested XObjects and nested patterns
    public void extractPatternImages(PDFormXObject form, String imageFormat) throws IOException {
        PDResources resources = form.getResources();
        if (resources == null)
            return;

        Iterable<COSName> xObjectNames = resources.getXObjectNames();
        if (xObjectNames == null)
            return;

        for (COSName xObjectName : xObjectNames) {
            PDXObject xObject = resources.getXObject(xObjectName);
            String xObjectFormat = String.format(imageFormat, "-" + xObjectName + "%s", "%s");
            if (xObject instanceof PDFormXObject)
                extractPatternImages((PDFormXObject) xObject, xObjectFormat);
            else if (xObject instanceof PDImageXObject)
                extractPatternImages((PDImageXObject) xObject, xObjectFormat);
        }

        Iterable<COSName> patternNames = resources.getPatternNames();

        for (COSName patternName : patternNames) {
            String patternFormat = String.format(imageFormat, "-" + patternName + "%s", "%s");
            extractPatternImages(resources.getPattern(patternName), patternFormat);
        }
    }

    // image level: write the image as PNG, named after its path in the resource tree
    public void extractPatternImages(PDImageXObject image, String imageFormat) throws IOException {
        String filename = String.format(imageFormat, "", image.getSuffix());
        try (FileOutputStream out = new FileOutputStream(filename)) {
            ImageIOUtil.writeImage(image.getOpaqueImage(), "png", out);
        }
    }

and got

testDrJorge-0-COSName{R15}-COSName{R14}.png
testDrJorge-0-COSName{R38}-COSName{R37}.png

Looks like an improvement... ;)
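A side note on the PDFBox 2.0.0 file names: the `COSName{...}` wrappers appear because the `COSName` objects are concatenated directly and so contribute their `toString()` form. Concatenating `patternName.getName()` and `xObjectName.getName()` instead should give the bare names. For files that have already been written, a hypothetical clean-up helper (plain Java, the class and method names are made up) could look like this:

```java
/** Strips the COSName{...} wrapper from an already generated file name; illustrative only. */
final class CosNameCleanupSketch {
    static String unwrap(String fileName) {
        // keep only the text between "COSName{" and the matching "}"
        return fileName.replaceAll("COSName\\{([^}]*)\\}", "$1");
    }

    public static void main(String[] args) {
        System.out.println(unwrap("testDrJorge-0-COSName{R15}-COSName{R14}.png"));
    }
}
```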
