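How to search for specific strings or words in a PDF document in Java and get their coordinates

I am using PDFBox to search for a word (or string) in a PDF file, and I also want to know the coordinates of that word. For example, the PDF file contains a string like "${abc}", and I want to know the coordinates of this string. I tried a couple of examples but did not get the result I need: the output shows the coordinates of individual characters instead.

Here is the code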

    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        // Prints one line per character, which is why only character coordinates show up.
        for (TextPosition text : textPositions) {
            System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                    + " fs=" + text.getFontSize()
                    + " xscale=" + text.getXScale()
                    + " height=" + text.getHeightDir()
                    + " space=" + text.getWidthOfSpace()
                    + " width=" + text.getWidthDirAdj() + "]" + text.getUnicode());
        }
    }

I am using PDFBox 2.0.

The last method of PDFBox's PDFTextStripper class that still has the text with positions (before it is reduced to plain text) is the method

    /**
     * Write a Java string to the output stream. The default implementation will ignore the textPositions
     * and just calls {@link #writeString(String)}.
     *
     * @param text The text to write to the stream.
     * @param textPositions The TextPositions belonging to the text.
     * @throws IOException If there is an error when writing the text.
     */
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException

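One should intercept here because this method receives pre-processed, in particular sorted, TextPosition objects (if sorting was requested to begin with).

(Actually I would have preferred to intercept in the calling method writeLine which, according to the names of its parameters and local variables, has all the TextPosition instances of a line and calls writeString once per word; unfortunately, the PDFBox developers have declared this method private... well, maybe this will change before the final 2.0.0 release... nudge, nudge. Update: unfortunately it did not change in the release... sigh)

Furthermore, it helps to use a helper class that wraps a sequence of TextPosition instances in a String-like class, which makes the code clearer.

With this in mind, one can search for the variables like this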

    // Imports needed: java.io.IOException, java.util.ArrayList, java.util.List,
    // org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper,
    // org.apache.pdfbox.text.TextPosition

    List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException {
        final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
        PDFTextStripper stripper = new PDFTextStripper() {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
                TextPositionSequence word = new TextPositionSequence(textPositions);
                String string = word.toString();

                // Collect every occurrence of the search term in this chunk of text.
                int fromIndex = 0;
                int index;
                while ((index = string.indexOf(searchTerm, fromIndex)) > -1) {
                    hits.add(word.subSequence(index, index + searchTerm.length()));
                    fromIndex = index + 1;
                }
                super.writeString(text, textPositions);
            }
        };

        stripper.setSortByPosition(true);
        stripper.setStartPage(page);
        stripper.setEndPage(page);
        stripper.getText(document);
        return hits;
    }
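The index loop inside writeString can be tried out on plain strings, without PDFBox. The following is a minimal sketch of that loop only; the class name OccurrenceFinder and the method findAll are made up for illustration, and the guard against an empty search term is an addition that the code above does not have (without it, an empty term would make the loop run forever).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: it isolates the index loop used in findSubwords.
class OccurrenceFinder {

    /** Returns the start index of every occurrence of term in text, overlapping ones included. */
    static List<Integer> findAll(String text, String term) {
        List<Integer> starts = new ArrayList<>();
        if (term.isEmpty()) {
            // Added guard: an empty term matches at every index and would never terminate below.
            return starts;
        }
        int fromIndex = 0;
        int index;
        while ((index = text.indexOf(term, fromIndex)) > -1) {
            starts.add(index);
            // Advance by one character, not by term.length(), so overlapping hits are kept.
            fromIndex = index + 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findAll("a ${var1} b ${var1}", "${var1}"));
        System.out.println(findAll("aaaa", "aa"));
    }
}
```

Each returned index corresponds to one call of word.subSequence(index, index + searchTerm.length()) in the stripper above. If overlapping hits are not wanted, advancing fromIndex by the term length instead would skip them.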

with this helper class

    // Imports needed: java.util.List, org.apache.pdfbox.text.TextPosition

    public class TextPositionSequence implements CharSequence {
        public TextPositionSequence(List<TextPosition> textPositions) {
            this(textPositions, 0, textPositions.size());
        }

        public TextPositionSequence(List<TextPosition> textPositions, int start, int end) {
            this.textPositions = textPositions;
            this.start = start;
            this.end = end;
        }

        @Override
        public int length() {
            return end - start;
        }

        @Override
        public char charAt(int index) {
            TextPosition textPosition = textPositionAt(index);
            String text = textPosition.getUnicode();
            return text.charAt(0);
        }

        @Override
        public TextPositionSequence subSequence(int start, int end) {
            return new TextPositionSequence(textPositions, this.start + start, this.start + end);
        }

        @Override
        public String toString() {
            StringBuilder builder = new StringBuilder(length());
            for (int i = 0; i < length(); i++) {
                builder.append(charAt(i));
            }
            return builder.toString();
        }

        public TextPosition textPositionAt(int index) {
            return textPositions.get(start + index);
        }

        public float getX() {
            return textPositions.get(start).getXDirAdj();
        }

        public float getY() {
            return textPositions.get(start).getYDirAdj();
        }

        public float getWidth() {
            TextPosition first = textPositions.get(start);
            // end is exclusive, so the last position of the sequence is at end - 1;
            // get(end) would read one position too far or throw at the end of the list.
            TextPosition last = textPositions.get(end - 1);
            return last.getWidthDirAdj() + last.getXDirAdj() - first.getXDirAdj();
        }

        final List<TextPosition> textPositions;
        final int start, end;
    }
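The index arithmetic of such a wrapper is easy to get wrong, so here is a small PDFBox-free sketch of the same idea. Glyph is a made-up stand-in for TextPosition (only a character, an x position and a width), and GlyphRun is a simplified, hypothetical counterpart of the helper class; both names and the evenly spaced layout are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for PDFBox's TextPosition (an assumption for illustration): one glyph with x and width.
class Glyph {
    final char ch;
    final float x;
    final float width;

    Glyph(char ch, float x, float width) {
        this.ch = ch;
        this.x = x;
        this.width = width;
    }
}

// Simplified view over glyphs[start, end); end is exclusive, as in a CharSequence.
class GlyphRun {
    private final List<Glyph> glyphs;
    private final int start;
    private final int end;

    GlyphRun(List<Glyph> glyphs, int start, int end) {
        this.glyphs = glyphs;
        this.start = start;
        this.end = end;
    }

    /** Builds a run of evenly spaced glyphs, which is enough for a demonstration. */
    static GlyphRun of(String text, float startX, float glyphWidth) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), startX + i * glyphWidth, glyphWidth));
        }
        return new GlyphRun(glyphs, 0, glyphs.size());
    }

    int length() {
        return end - start;
    }

    /** Indexes are relative to this run, so they are shifted by this.start. */
    GlyphRun subRun(int from, int to) {
        return new GlyphRun(glyphs, start + from, start + to);
    }

    String text() {
        StringBuilder builder = new StringBuilder(length());
        for (int i = start; i < end; i++) {
            builder.append(glyphs.get(i).ch);
        }
        return builder.toString();
    }

    float getX() {
        return glyphs.get(start).x;
    }

    /** Width from the left edge of the first glyph to the right edge of the last; assumes a non-empty run. */
    float getWidth() {
        Glyph first = glyphs.get(start);
        Glyph last = glyphs.get(end - 1); // end is exclusive
        return last.x + last.width - first.x;
    }
}
```

For example, GlyphRun.of("ab${x}", 0f, 10f).subRun(2, 6) covers the text "${x}", starts at x = 20 and is 40 units wide, because the last glyph starts at 50 and is 10 wide.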

To output their positions, widths, final letters, and final letter positions, you can use this

    void printSubwords(PDDocument document, String searchTerm) throws IOException {
        System.out.printf("* Looking for '%s'\n", searchTerm);
        for (int page = 1; page <= document.getNumberOfPages(); page++) {
            List<TextPositionSequence> hits = findSubwords(document, page, searchTerm);
            for (TextPositionSequence hit : hits) {
                TextPosition lastPosition = hit.textPositionAt(hit.length() - 1);
                System.out.printf(" Page %s at %s, %s with width %s and last letter '%s' at %s, %s\n",
                        page, hit.getX(), hit.getY(), hit.getWidth(),
                        lastPosition.getUnicode(), lastPosition.getXDirAdj(), lastPosition.getYDirAdj());
            }
        }
    }

For testing I created a small test file with MS Word:

(Image: sample file with variables)

The output of this test

    @Test
    public void testVariables() throws IOException {
        try (InputStream resource = getClass().getResourceAsStream("Variables.pdf");
             PDDocument document = PDDocument.load(resource)) {
            System.out.println("\nVariables.pdf\n-------------\n");
            printSubwords(document, "${var1}");
            printSubwords(document, "${var 2}");
        }
    }

is

    Variables.pdf
    -------------

    * Looking for '${var1}'
     Page 1 at 164.39648, 158.06 with width 34.67856 and last letter '}' at 193.22, 158.06
     Page 1 at 188.75699, 174.13995 with width 34.58806 and last letter '}' at 217.49, 174.13995
     Page 1 at 167.49583, 190.21997 with width 38.000168 and last letter '}' at 196.22, 190.21997
     Page 1 at 176.67009, 206.18 with width 35.667114 and last letter '}' at 205.49, 206.18
    * Looking for '${var 2}'
     Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
     Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
     Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
     Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81

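I was a bit surprised that ${var 2} was found when it sits on a single line; after all, the PDFBox code made me assume that the writeString method I overrode only receives words. It looks as if it receives longer parts of a line rather than mere words...

If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.

As mentioned, this is not an answer to your question, but below is a skeleton example of how to do this in iText. That is not to say the same is impossible in PDFBox.

Basically you create a RenderListener which accepts "parse events". You pass this listener to PdfReaderContentParser.processContent. In the listener's renderText method you get all the information you need to reconstruct the layout, including the x/y coordinates and the text, images, and so on that make up the content.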

    RenderListener listener = new RenderListener() {
        @Override
        public void renderText(TextRenderInfo arg0) {
            LineSegment segment = arg0.getBaseline();
            int x = (int) segment.getStartPoint().get(Vector.I1);
            // smaller Y means closer to the BOTTOM of the page.
            // So we negate the Y to get proper top-to-bottom ordering
            int y = -(int) segment.getStartPoint().get(Vector.I2);
            int endx = (int) segment.getEndPoint().get(Vector.I1);
            log.debug("renderText " + x + ".." + endx + "/" + y + ": " + arg0.getText());
            ...
        }
        ... // other overrides
    };

    PdfReaderContentParser p = new PdfReaderContentParser(reader);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        log.info("handling page " + i);
        p.processContent(i, listener);
    }
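What the listener does with the collected values is left open above. One possible follow-up, sketched here without any iText types, is to store each piece of text with its coordinates and sort by the negated y first and x second, which yields top-to-bottom, left-to-right order. TextChunk and ReadingOrder are hypothetical names, and the sketch assumes that chunks on the same line end up with exactly the same integer y; real documents may need a tolerance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical value type: what a listener might record for each renderText call.
class TextChunk {
    final int x;
    final int y; // negated PDF y, so smaller values are closer to the top of the page
    final String text;

    TextChunk(float pdfX, float pdfY, String text) {
        this.x = (int) pdfX;
        this.y = -(int) pdfY; // same negation as in the listener above
        this.text = text;
    }
}

class ReadingOrder {

    /** Returns a copy of the chunks sorted top-to-bottom, then left-to-right. */
    static List<TextChunk> sort(List<TextChunk> chunks) {
        List<TextChunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((TextChunk c) -> c.y)
                .thenComparingInt(c -> c.x));
        return sorted;
    }

    /** Joins the chunk texts with single spaces, in the given order. */
    static String join(List<TextChunk> chunks) {
        StringBuilder builder = new StringBuilder();
        for (TextChunk chunk : chunks) {
            if (builder.length() > 0) {
                builder.append(' ');
            }
            builder.append(chunk.text);
        }
        return builder.toString();
    }
}
```

With chunks recorded at (100, 700), (50, 700) and (50, 40), sorting puts the chunk at x = 50 on the upper line first, then its neighbour at x = 100, then the chunk near the bottom of the page.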

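I was looking to highlight different words in a PDF file. To do that I need to know the word coordinates exactly, so what I do is take the (x, y) coordinates of the top-left corner of the first letter and the (x, y) coordinates of the top-right corner of the last letter.

Later, I save the points in an array. Keep in mind that to get the y coordinate right you need the position relative to the page height, because of the way the coordinates are given. The getYDirAdj() method returns an absolute value, and often it does not match the position on the page.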

    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        boolean isFound = false;
        float posXInit = 0, posXEnd = 0, posYInit = 0, posYEnd = 0, width = 0, height = 0, fontHeight = 0;
        String[] criteria = {"Word1", "Word2", "Word3", ....};

        for (int i = 0; i < criteria.length; i++) {
            if (string.contains(criteria[i])) {
                isFound = true;
            }
        }

        if (isFound) {
            TextPosition first = textPositions.get(0);
            TextPosition last = textPositions.get(textPositions.size() - 1);

            posXInit = first.getXDirAdj();
            posXEnd = last.getXDirAdj() + last.getWidth();
            // Flip y: subtract from the page height to get a position measured from the bottom.
            posYInit = first.getPageHeight() - first.getYDirAdj();
            posYEnd = first.getPageHeight() - last.getYDirAdj();
            width = first.getWidthDirAdj();
            height = first.getHeightDir();

            // Note: fontHeight is never assigned in this snippet, so 0.0 is printed for it.
            System.out.println(string + "X-Init = " + posXInit + "; Y-Init = " + posYInit
                    + "; X-End = " + posXEnd + "; Y-End = " + posYEnd + "; Font-Height = " + fontHeight);

            float quadPoints[] = {posXInit, posYEnd + height + 2, posXEnd, posYEnd + height + 2,
                    posXInit, posYInit - 2, posXEnd, posYEnd - 2};

            List<PDAnnotation> annotations = document.getPage(this.getCurrentPageNo() - 1).getAnnotations();
            PDAnnotationTextMarkup highlight =
                    new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);

            PDRectangle position = new PDRectangle();
            position.setLowerLeftX(posXInit);
            position.setLowerLeftY(posYEnd);
            position.setUpperRightX(posXEnd);
            position.setUpperRightY(posYEnd + height);
            highlight.setRectangle(position);

            // quadPoints is array of x,y coordinates in Z-like order
            // (top-left, top-right, bottom-left, bottom-right) of the area to be highlighted
            highlight.setQuadPoints(quadPoints);

            PDColor yellow = new PDColor(new float[]{1, 1, 1 / 255F}, PDDeviceRGB.INSTANCE);
            highlight.setColor(yellow);
            annotations.add(highlight);
        }
    }
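The coordinate arithmetic in this snippet can be checked in isolation with plain floats. The sketch below is not PDFBox code: HighlightGeometry and its methods are hypothetical names, and it assumes that the incoming y value is measured downward from the top of the page (which is what the subtraction from getPageHeight() above suggests), that the page is not rotated, and that its lower-left corner is at (0, 0). It leaves out the 2-unit padding used above.

```java
// Hypothetical helper using plain floats; no PDFBox types are involved.
class HighlightGeometry {

    /** Converts a y measured downward from the page top into a y measured upward from the page bottom. */
    static float toPdfY(float pageHeight, float yFromTop) {
        return pageHeight - yFromTop;
    }

    /** Quad points in Z-like order: top-left, top-right, bottom-left, bottom-right. */
    static float[] quadPoints(float left, float right, float bottom, float top) {
        return new float[] {left, top, right, top, left, bottom, right, bottom};
    }

    /**
     * Builds the quad for one word on a single line.
     * baselineFromTop is the word's baseline measured from the top of the page,
     * height is the glyph height to cover above the baseline.
     */
    static float[] quadForWord(float pageHeight, float xStart, float xEnd,
                               float baselineFromTop, float height) {
        float bottom = toPdfY(pageHeight, baselineFromTop);
        float top = bottom + height;
        return quadPoints(xStart, xEnd, bottom, top);
    }
}
```

For a page 792 units high and a word spanning x = 100 to 150 with its baseline 92 units below the top and a height of 10, the quad runs from y = 700 at the bottom to y = 710 at the top. Descenders fall below the baseline, which is probably why the snippet above pads the box by a couple of units.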
