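Removing invisible text from a PDF using PDFBox
Link to PDF
When I try to extract text from the PDF above, I get a mix of text that is invisible in the evince viewer and text that is visible. In addition, some of the desired text is missing characters that are not missing in the viewer, such as the "S" in "FALCONS" and a number of "½" characters. I believe this is caused by interference from the invisible text, because when the PDF is highlighted in the viewer, the invisible text can be seen overlapping the visible text.
Is there a way to remove the invisible text? Or is there another solution?
Code: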
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class App {
    public static String getPdfText(String pdfPath) throws IOException {
        File file = new File(pdfPath);

        PDDocument document = null;
        PDFTextStripper textStripper = null;
        String text = null;

        try {
            document = PDDocument.load(file);
            textStripper = new PDFTextStripper();
            textStripper.setEndPage(1);
            text = textStripper.getText(document);
        } catch (IOException e) {
            throw new IOException("Could not load file and strip text.", e);
        } finally {
            try {
                if (document != null)
                    document.close();
            } catch (IOException e) {
                System.out.println("Could not close document");
            }
        }
        return text;
    }

    public static void main(String[] args) {
        String filename = "RevTeaser09072016.pdf";
        String text = null;
        try {
            text = getPdfText(filename);
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }
        System.out.println(text);
    }
}
Output (the bold text is the desired text):
145 143 159 144 160 141 157155 156154150 153149 152148 151147 142 158 500 146 选 团队数量 金额赌注 反向的问题 标记框如图所示 表示主队 PRO FOOTBALL - 2012年11月15日星期四 1比尔★NFL PM8:25 2 DOLPHINS7-½6-½ PRO FOOTBALL - 星期日,2012年11月18日 3 REDSKINS★PM1:00 4 EAGLES10-½3½ 5包装PM1:00 6 LIONS★10-½3½ 7 FALCONS★PM1:00 8 CARDINALS17-½3+½ 9 BUCCANEERS PM1:00 10 PANTHERS★7-½6-½ 11 COWBOYS★PM1:00 12 BROWNS14-½+½ 13 RAMS★PM1:00 14 JETS10-½3½ 15爱国者★PM4:25 16 COLTS17-½3+½ 17 TEXANS★PM1:00 18 JAGUARS23-½9+½ 19 BENGALS PM1:00 20 CHIEFS★10-½3-½ 21 SAINTS PM4:05 22 RAIDERS★12-½1-½ 23 BRONCOS★PM4:25 24 CHARGERS14-½+½ 25 RAVENS NBC PM8:30 26 STEELERS★7-½6-½ PRO FOOTBALL - 星期一,2012年11月19日 27 49ERS★ESPN PM8:40 28 BEARS10-½3½ 1000 145 143 159 144 160 141 157155 156154150 153149 152148 151147 142 158 500 146 选 团队数量 金额赌注 反向的问题 标记框如同 表示主队 PRO FOOTBALL - 2012年11月15日星期四 1比尔★NFL PM8:25 2 DOLPHINS7-½6-½ PRO FOOTBALL - 星期日,2012年11月18日 3 REDSKINS★PM1:00 4 EAGLES10-½3½ 5包装PM1:00 6 LIONS★10-½3½ 7 FALCONS★PM1:00 8 CARDINALS17-½3+½ 9 BUCCANEERS PM1:00 10 PANTHERS★7-½6-½ 11 COWBOYS★PM1:00 12 BROWNS14-½+½ 13 RAMS★PM1:00 14 JETS10-½3½ 15爱国者★PM4:25 16 COLTS17-½3+½ 17 TEXANS★PM1:00 18 JAGUARS23-½9+½ 19 BENGALS PM1:00 20 CHIEFS★10-½3-½ 21 SAINTS PM4:05 22 RAIDERS★12-½1-½ 23 BRONCOS★PM4:25 24 CHARGERS14-½+½ 25 RAVENS NBC PM8:30 26钢RS★7-½6-½ PRO FOOTBALL - 星期一,2012年11月19日 27 49ERS★ESPN PM8:40 28 BEARS10-½3½ 1000 145 143 159 14 160 41 15715 156154150 153149 152148 51147 142 158 50 146 调整 团队数量 金额赌注 方舟子 表示主队 PRO F OTBALL - 2012年11月15日星期四 1比尔★NFL PM8:25 2 DOLPHINS7-½6-½ PRO F OTBALL - 星期日,2012年11月18日 3 REDSKINS★PM1:0 4 EAGLES10-½3½ 5包装PM1:0 6 LIONS★10-½3½ 7 FALCONS★PM1:0 8CARDINALS17-½3+½ 9 BU CANEERS PM1:0 10 PANTHERS★7-½6-½ 11 COWBOYS★PM1:0 12 BROWNS14-½+½ 13 RAMS★PM1:0 14 JETS10-½3½ 15爱国者★PM4:25 16 COLTS17-½3+½ 17 TEXANS★PM1:0 18 JAGUARS23-½9+½ 19 BENGALS PM1:0 20 CHIEFS★10-½3½ 21 SAINTS PM4:05 22 RAIDERS★12-½1-½ 23 BRONCOS★PM4:25 24 CHARGERS14-½+½ 25 RAVENS NBC PM8:30 26 STEELERS★7-½6-½ PRO F OTBALL - 星期一,2012年11月19日 27 49ERS★ESPN PM8:40 28 
BEARS10-½3½ 1,0 MARK BOX如图所示 禁止家庭团队 PRO FOOTBALL - 星期四,2016年9月8日 1 PANTHERS nbc - 10½8:30p 2 BRONCOS - 3½ PRO FOOTBALL - 星期日,2016年9月11日 FALCON - 9 1:00p 4 BUCCANEERS - 4½ 5 VIKINGS - 9½1:00p 6TITANS - 4½ 7EAGLES - 10½1:00p 8 BROWNS - 3½ 9 BENGALS - 9½1:00p 10 JETS - 4½ 11SAINTS - 7½1:00p 12 RAIDERS - 6½ 13CHIEFS - 14½1:00p 14充电器+½ 15RAVENS - 10½1:00p 16 BILLS - 3½ 17TEXANS - 14 1:00p 18 BEARS +½ 19 PACKERS - 12 1:00p 20 JAGUARS - 1½ 21 SEAHAWKS - 17½4:05p 22 DOLPHINS +3½ 23COWBOYS - 7½4:25p 24 GIANTS - 6½ 25种颜色 - 10½4:25p 26 LIONS - 3½ 27 CARDINALSnbc - 14½8:30p 28 PATRIOTS +½ 职业足球 - 星期一,2016年9月12日 29钢铁espn - 10½7:10p 30 REDSKINS - 3½ 31 RAMS espn - 9 10:20p 32 49ERS - 4½
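The invisible text in the OP's sample PDF is generally made invisible in two ways: by defining clip paths (so the text lies outside the clipped area) and by filling paths (which covers the text underneath). Thus, we have to take path-related instructions into account during text extraction in order to ignore that invisible text.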
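The two hiding mechanisms can be illustrated without PDFBox, using only java.awt.geom. This is a minimal sketch, not the answer's implementation: the class InvisibleTextSketch and the helpers insideClip and removeCovered are hypothetical names chosen for illustration, and each glyph is reduced to a single point.

```java
import java.awt.geom.Area;
import java.awt.geom.Path2D;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of PDFBox or of the answer's code.
final class InvisibleTextSketch {

    /** Mechanism 1: a point drawn outside the current clip area is never shown. */
    static boolean insideClip(Area clip, Point2D p) {
        // A null clip is treated as "no clipping at all" (an assumption of this sketch).
        return clip == null || clip.contains(p);
    }

    /** Mechanism 2: a path filled later covers whatever was drawn beneath it. */
    static List<Point2D> removeCovered(List<Point2D> drawn, Path2D fill) {
        List<Point2D> kept = new ArrayList<>();
        for (Point2D p : drawn) {
            if (!fill.contains(p)) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 100));
        Point2D inside = new Point2D.Double(50, 50);
        Point2D outside = new Point2D.Double(150, 50);
        System.out.println("inside clip:  " + insideClip(clip, inside));
        System.out.println("outside clip: " + insideClip(clip, outside));

        // Two points were "drawn"; a filled rectangle then covers one of them.
        List<Point2D> drawn = new ArrayList<>();
        drawn.add(new Point2D.Double(10, 10));
        drawn.add(inside);
        Path2D fill = new Path2D.Double(new Rectangle2D.Double(40, 40, 20, 20));
        System.out.println("still visible after fill: " + removeCovered(drawn, fill).size());
    }
}
```

A text extractor that wants only visible text needs both checks: the clip test when a glyph is drawn, and the cover test each time a later path is filled.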
Unfortunately, the callbacks designed for these instructions are not declared in PDFTextStripper or its parent classes LegacyPDFStreamEngine and PDFStreamEngine.
But they are declared in the other major PDFStreamEngine subclass, PDFGraphicsStreamEngine, and they are sensibly implemented in PageDrawer.
Thus, for the task at hand we can copy, paste, and adapt the PageDrawer implementations into a subclass of PDFTextStripper, e.g. like this:
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class PDFVisibleTextStripper extends PDFTextStripper {
    public PDFVisibleTextStripper() throws IOException {
        // register the inner operator classes defined below (path construction, clipping, painting)
        addOperator(new AppendRectangleToPath());
        addOperator(new ClipEvenOddRule());
        addOperator(new ClipNonZeroRule());
        addOperator(new ClosePath());
        addOperator(new CurveTo());
        addOperator(new CurveToReplicateFinalPoint());
        addOperator(new CurveToReplicateInitialPoint());
        addOperator(new EndPath());
        addOperator(new FillEvenOddAndStrokePath());
        addOperator(new FillEvenOddRule());
        addOperator(new FillNonZeroAndStrokePath());
        addOperator(new FillNonZeroRule());
        addOperator(new LineTo());
        addOperator(new MoveTo());
        addOperator(new StrokePath());
    }

    // accept a character only if both ends of its base line are inside the current clip path
    @Override
    protected void processTextPosition(TextPosition text) {
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));
        Vector end = new Vector(start.getX() + text.getWidth(), start.getY());

        PDGraphicsState gs = getGraphicsState();
        Area area = gs.getCurrentClippingPath();
        if (area == null || (area.contains(start.getX(), start.getY()) && area.contains(end.getX(), end.getY())))
            super.processTextPosition(text);
    }

    // the path currently under construction
    private GeneralPath linePath = new GeneralPath();

    // remove already collected characters that are covered by the path about to be filled
    void deleteCharsInPath() {
        for (List<TextPosition> list : charactersByArticle) {
            List<TextPosition> toRemove = new ArrayList<>();
            for (TextPosition text : list) {
                Matrix textMatrix = text.getTextMatrix();
                Vector start = textMatrix.transform(new Vector(0, 0));
                Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
                if (linePath.contains(start.getX(), start.getY()) || linePath.contains(end.getX(), end.getY())) {
                    toRemove.add(text);
                }
            }
            if (toRemove.size() != 0) {
                System.out.println(toRemove.size());
                list.removeAll(toRemove);
            }
        }
    }

    public final class AppendRectangleToPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x = (COSNumber) operands.get(0);
            COSNumber y = (COSNumber) operands.get(1);
            COSNumber w = (COSNumber) operands.get(2);
            COSNumber h = (COSNumber) operands.get(3);

            float x1 = x.floatValue();
            float y1 = y.floatValue();

            // create a pair of coordinates for the transformation
            float x2 = w.floatValue() + x1;
            float y2 = h.floatValue() + y1;

            Point2D p0 = context.transformedPoint(x1, y1);
            Point2D p1 = context.transformedPoint(x2, y1);
            Point2D p2 = context.transformedPoint(x2, y2);
            Point2D p3 = context.transformedPoint(x1, y2);

            // to ensure that the path is created in the right direction, we have to create
            // it by combining single lines instead of creating a simple rectangle
            linePath.moveTo((float) p0.getX(), (float) p0.getY());
            linePath.lineTo((float) p1.getX(), (float) p1.getY());
            linePath.lineTo((float) p2.getX(), (float) p2.getY());
            linePath.lineTo((float) p3.getX(), (float) p3.getY());

            // close the subpath instead of adding the last line so that a possible set line
            // cap style isn't taken into account at the "beginning" of the rectangle
            linePath.closePath();
        }

        @Override
        public String getName() {
            return "re";
        }
    }

    public final class StrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.reset();
        }

        @Override
        public String getName() {
            return "S";
        }
    }

    public final class FillEvenOddRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "f*";
        }
    }

    public class FillNonZeroRule extends OperatorProcessor {
        @Override
        public final void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "f";
        }
    }

    public final class FillEvenOddAndStrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "B*";
        }
    }

    public class FillNonZeroAndStrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "B";
        }
    }

    public final class ClipEvenOddRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            getGraphicsState().intersectClippingPath(linePath);
        }

        @Override
        public String getName() {
            return "W*";
        }
    }

    public class ClipNonZeroRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            getGraphicsState().intersectClippingPath(linePath);
        }

        @Override
        public String getName() {
            return "W";
        }
    }

    public final class MoveTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 2) {
                throw new MissingOperandException(operator, operands);
            }
            COSBase base0 = operands.get(0);
            if (!(base0 instanceof COSNumber)) {
                return;
            }
            COSBase base1 = operands.get(1);
            if (!(base1 instanceof COSNumber)) {
                return;
            }
            COSNumber x = (COSNumber) base0;
            COSNumber y = (COSNumber) base1;
            Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
            linePath.moveTo(pos.x, pos.y);
        }

        @Override
        public String getName() {
            return "m";
        }
    }

    public class LineTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 2) {
                throw new MissingOperandException(operator, operands);
            }
            COSBase base0 = operands.get(0);
            if (!(base0 instanceof COSNumber)) {
                return;
            }
            COSBase base1 = operands.get(1);
            if (!(base1 instanceof COSNumber)) {
                return;
            }
            // append straight line segment from the current point to the point
            COSNumber x = (COSNumber) base0;
            COSNumber y = (COSNumber) base1;
            Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
            linePath.lineTo(pos.x, pos.y);
        }

        @Override
        public String getName() {
            return "l";
        }
    }

    public class CurveTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 6) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x1 = (COSNumber) operands.get(0);
            COSNumber y1 = (COSNumber) operands.get(1);
            COSNumber x2 = (COSNumber) operands.get(2);
            COSNumber y2 = (COSNumber) operands.get(3);
            COSNumber x3 = (COSNumber) operands.get(4);
            COSNumber y3 = (COSNumber) operands.get(5);

            Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
            Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo(point1.x, point1.y, point2.x, point2.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "c";
        }
    }

    public final class CurveToReplicateFinalPoint extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x1 = (COSNumber) operands.get(0);
            COSNumber y1 = (COSNumber) operands.get(1);
            COSNumber x3 = (COSNumber) operands.get(2);
            COSNumber y3 = (COSNumber) operands.get(3);

            Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo(point1.x, point1.y, point3.x, point3.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "y";
        }
    }

    public class CurveToReplicateInitialPoint extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x2 = (COSNumber) operands.get(0);
            COSNumber y2 = (COSNumber) operands.get(1);
            COSNumber x3 = (COSNumber) operands.get(2);
            COSNumber y3 = (COSNumber) operands.get(3);

            Point2D currentPoint = linePath.getCurrentPoint();

            Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo((float) currentPoint.getX(), (float) currentPoint.getY(), point2.x, point2.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "v";
        }
    }

    public final class ClosePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.closePath();
        }

        @Override
        public String getName() {
            return "h";
        }
    }

    public final class EndPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.reset();
        }

        @Override
        public String getName() {
            return "n";
        }
    }
}
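(PDFVisibleTextStripper)
Please make sure that the PDFVisibleTextStripper constructor uses the inner operator classes, not the identically named classes used by PageDrawer. To be sure, simply follow the link below the code.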
This reduces the output to:
REVERSE tEaSER caRd 500 elections er of Teams t Bet 1,000 MARK BOX AS SHOWN DENOTES HOME TEAM PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016 1 PANTHERS nbc - 10½ 8:30p 2 BRONCOS - 3½ PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016 3 FALCONS - 9½ 1:00p 4 BUCCANEERS - 4½ 5 VIKINGS - 9½ 1:00p 6 TITANS - 4½ 7 EAGLES - 10½ 1:00p 8 BROWNS - 3½ 9 BENGALS - 9½ 1:00p 10 JETS - 4½ 11 SAINTS - 7½ 1:00p 12 RAIDERS - 6½ 13 CHIEFS - 14½ 1:00p 14 CHARGERS + ½ 15 RAVENS - 10½ 1:00p 16 BILLS - 3½ 17 TEXANS - 14½ 1:00p 18 BEARS + ½ 19 PACKERS - 12½ 1:00p 20 JAGUARS - 1½ 21 SEAHAWKS - 17½ 4:05p 22 DOLPHINS + 3½ 23 COWBOYS - 7½ 4:25p 24 GIANTS - 6½ 25 COLTS - 10½ 4:25p 26 LIONS - 3½ 27 CARDINALS nbc - 14½ 8:30p 28 PATRIOTS + ½ PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016 29 STEELERS espn - 10½ 7:10p 30 REDSKINS - 3½ 31 RAMS espn - 9½ 10:20p 32 49ERS - 4½
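This drops most of the unwanted data.
In the context of this question it became clear that the way processTextPosition and deleteCharsInPath calculate the end of a character's base line implicitly assumes horizontal text without page rotation. If one relaxes the "visibility" criterion, though, and considers a character visible whenever the start of its base line is visible, the calculated Vector end is no longer needed and the code also works for rotated pages.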
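The difference between the two criteria can be checked with plain java.awt.geom. The following is a minimal sketch under stated assumptions: BaselineVisibilitySketch, strictVisible, and relaxedVisible are hypothetical names, and the numbers are made up to mimic a glyph on a rotated page whose base line actually runs upward rather than to the right.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Hypothetical illustration class; it only mirrors the geometry of the two checks.
final class BaselineVisibilitySketch {

    /** Strict check: start and end must be inside, with end assumed at (start.x + width, start.y). */
    static boolean strictVisible(Area clip, double startX, double startY, double width) {
        return clip.contains(startX, startY) && clip.contains(startX + width, startY);
    }

    /** Relaxed check: only the start of the base line has to be inside the clip area. */
    static boolean relaxedVisible(Area clip, double startX, double startY) {
        return clip.contains(startX, startY);
    }

    public static void main(String[] args) {
        // A tall, narrow clip area, as a column of rotated text might have.
        Area clip = new Area(new Rectangle2D.Double(0, 0, 20, 200));

        // The glyph starts at (10, 10) and is 50 units wide; on the rotated page it
        // extends upward, so it stays inside the clip area. The strict check still
        // places its end at (60, 10), which is outside.
        System.out.println("strict:  " + strictVisible(clip, 10, 10, 50));
        System.out.println("relaxed: " + relaxedVisible(clip, 10, 10));
    }
}
```

The relaxed check trades accuracy at clip borders (a glyph that starts inside but ends outside is kept) for independence from text direction and page rotation.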