PDFBox 2.0.2 > Calling of PageDrawer.processPage method caught exceptions

pdfbox 2.0.2 Calling of PageDrawer.processPage method caught exceptions

Extending PageDrawer didn't really work, so I extended PDFGraphicsStreamEngine and here's the result. I do some of the stuff that is done in PageDrawer. To collect lines, either evaluate the shape in strokePath(), or collect points and lines in the other methods where I have included a println.

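import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

public class LineCatcher extends PDFGraphicsStreamEngine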
{
    private final GeneralPath linePath = new GeneralPath();
    private int clipWindingRule = -1;

    public LineCatcher(PDPage page)
    {
        super(page);
    }

    public static void main(String[] args) throws IOException
    {
        try (PDDocument document = PDDocument.load(new File("Test.pdf")))
        {
            PDPage page = document.getPage(0);
            LineCatcher test = new LineCatcher(page);
            test.processPage(page);
        }
    }

    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
    {
        System.out.println("appendRectangle");
        // to ensure that the path is created in the right direction, we have to create
        // it by combining single lines instead of creating a simple rectangle
        linePath.moveTo((float) p0.getX(), (float) p0.getY());
        linePath.lineTo((float) p1.getX(), (float) p1.getY());
        linePath.lineTo((float) p2.getX(), (float) p2.getY());
        linePath.lineTo((float) p3.getX(), (float) p3.getY());

        // close the subpath instead of adding the last line so that a possible set line
        // cap style isn't taken into account at the "beginning" of the rectangle
        linePath.closePath();
    }

    @Override
    public void drawImage(PDImage pdi) throws IOException
    {
    }

    @Override
    public void clip(int windingRule) throws IOException
    {
        // the clipping path will not be updated until the succeeding painting operator is called
        clipWindingRule = windingRule;

    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {
        linePath.moveTo(x, y);
        System.out.println("moveTo");
    }

    @Override
    public void lineTo(float x, float y) throws IOException
    {
        linePath.lineTo(x, y);
        System.out.println("lineTo");
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
    {
        linePath.curveTo(x1, y1, x2, y2, x3, y3);
        System.out.println("curveTo");
    }

    @Override
    public Point2D getCurrentPoint() throws IOException
    {
        return linePath.getCurrentPoint();
    }

    @Override
    public void closePath() throws IOException
    {
        linePath.closePath();
    }

    @Override
    public void endPath() throws IOException
    {
        if (clipWindingRule != -1)
        {
            linePath.setWindingRule(clipWindingRule);
            getGraphicsState().intersectClippingPath(linePath);
            clipWindingRule = -1;
        }
        linePath.reset();

    }

    @Override
    public void strokePath() throws IOException
    {
        // evaluate the stroked shape here; this sample only prints its bounding box
        System.out.println(linePath.getBounds2D());

        linePath.reset();
    }

    @Override
    public void fillPath(int windingRule) throws IOException
    {
        linePath.reset();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException
    {
        linePath.reset();
    }

    @Override
    public void shadingFill(COSName cosn) throws IOException
    {
    }
}
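
For the "evaluate the shape in strokePath()" route, here is a minimal sketch of how the straight segments could be pulled out of the path before it is reset. It only uses java.awt.geom, so it can be tried without PDFBox. The class name PathSegments and the method straightSegments are made up for this illustration; they are not part of PDFBox or of the answer above.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.Line2D;
import java.awt.geom.PathIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class PathSegments
{
    /** Returns the straight segments of the path; curve segments are skipped. */
    static List<Line2D> straightSegments(GeneralPath path)
    {
        List<Line2D> result = new ArrayList<>();
        double[] c = new double[6];
        double startX = 0, startY = 0; // start point of the current subpath
        double lastX = 0, lastY = 0;   // end point of the previous segment
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next())
        {
            switch (it.currentSegment(c))
            {
            case PathIterator.SEG_MOVETO:
                startX = lastX = c[0];
                startY = lastY = c[1];
                break;
            case PathIterator.SEG_LINETO:
                result.add(new Line2D.Double(lastX, lastY, c[0], c[1]));
                lastX = c[0];
                lastY = c[1];
                break;
            case PathIterator.SEG_QUADTO:
                // not a line: only remember where the curve ends
                lastX = c[2];
                lastY = c[3];
                break;
            case PathIterator.SEG_CUBICTO:
                lastX = c[4];
                lastY = c[5];
                break;
            case PathIterator.SEG_CLOSE:
                // closing a subpath draws a line back to its start point
                if (lastX != startX || lastY != startY)
                {
                    result.add(new Line2D.Double(lastX, lastY, startX, startY));
                }
                lastX = startX;
                lastY = startY;
                break;
            default:
                break;
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        // same construction as appendRectangle(): four corners, then closePath()
        GeneralPath rect = new GeneralPath();
        rect.moveTo(0f, 0f);
        rect.lineTo(10f, 0f);
        rect.lineTo(10f, 5f);
        rect.lineTo(0f, 5f);
        rect.closePath();
        for (Line2D line : straightSegments(rect))
        {
            System.out.println(line.getP1() + " -> " + line.getP2());
        }
    }
}
```

In LineCatcher this would be called as PathSegments.straightSegments(linePath) inside strokePath(), before linePath.reset(). The coordinates are still in PDF user space, with the origin at the bottom left.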

Update 19.3.2019: See also follow-up answer by mkl here.

Extending PageDrawer in pdfbox 2.0.x

Subclassing PageDrawer on its own is not enough. Instead, create a custom PDFRenderer subclass as well and override its createPageDrawer method so that it returns the custom PageDrawer.

PDFBox: Detecting the highlighted text in a given page

There is no need to detect them: the original text is still there. This is a classic case of redaction failure, and it does not matter whether the highlight is opaque black or see-through yellow. Just copy and paste, or export the pages as plain text.

Sample Image

Here we can see that there is no direct relationship between the black rectangles (the "paths") and the text below them. They are independent objects on the page. Only good downstream processing could marry them together.

Sample Image

The zone of interest is a region made up of multiple rectangles with ragged edges. The task is to match any text that lies within or overlaps that zone, where the text can be clipped between inside and outside in various ways, which is why redaction so commonly fails. That sounds like one big challenge that requires lots and lots of honing.
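
As a rough illustration of the matching step only, assuming the highlight rectangles and the glyph bounding boxes have already been collected in the same coordinate system, something like the following could flag glyphs that are mostly covered by a highlight. HighlightMatcher, coverage, coveredIndexes and the threshold are hypothetical names and values chosen for this sketch; picking a good threshold is exactly the "honing" part.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: pure geometry, no PDFBox involved.
final class HighlightMatcher
{
    /** Fraction of the box's area that lies inside the zone (0 if there is no overlap). */
    static double coverage(Rectangle2D box, Rectangle2D zone)
    {
        Rectangle2D overlap = box.createIntersection(zone);
        if (box.isEmpty() || overlap.isEmpty())
        {
            return 0;
        }
        return (overlap.getWidth() * overlap.getHeight()) / (box.getWidth() * box.getHeight());
    }

    /**
     * Indexes of the glyph boxes for which at least one single zone covers
     * minCoverage of the box. Zones are not merged, so a glyph split evenly
     * across two adjacent zones can be missed with a high threshold.
     */
    static List<Integer> coveredIndexes(List<Rectangle2D> glyphBoxes, List<Rectangle2D> zones, double minCoverage)
    {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < glyphBoxes.size(); i++)
        {
            double best = 0;
            for (Rectangle2D zone : zones)
            {
                best = Math.max(best, coverage(glyphBoxes.get(i), zone));
            }
            if (best >= minCoverage)
            {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<Rectangle2D> glyphs = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(0, 0, 10, 10),
                new Rectangle2D.Double(100, 100, 10, 10));
        List<Rectangle2D> zones = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(-1, -1, 12, 12));
        System.out.println(coveredIndexes(glyphs, zones, 0.5));
    }
}
```

Taking the best single zone instead of summing over zones avoids double counting where highlight rectangles overlap each other.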

[Later Edit]

The pdfbox team can give advice, and @TilmanHausherr suggested starting by looking at pdfbox 2.0.2 > Calling of PageDrawer.processPage method caught exceptions

Get bounding box dimensions of Vector Graphics (bitmap graphics) PDFBOX Java

Following the same approach that is posted here:

pdfbox 2.0.2 > Calling of PageDrawer.processPage method caught exceptions

They mentioned that the logic should be placed in the strokePath() method, but in my case, as suggested by @TilmanHausherr, I put my logic in fillPath() instead.

Be aware that the class you define should extend PDFGraphicsStreamEngine.

PDFBox - Line / Rectangle extraction

As far as I understand the requirements here, the OP works in a coordinate system with the origin in the upper left corner of the visible page (taking the page rotation into account), x coordinates increasing to the right, y coordinates increasing downwards, and the units being the PDF default user space units (usually ¹/₇₂ inch).

In this coordinate system he needs to extract (horizontal or vertical) lines in the form of

coordinates of the left / top end point and
the width / height.

Transforming `LineCatcher` results

The helper class LineCatcher he got from Tilman, on the other hand, does not take page rotation into account. Furthermore, it returns the bottom end point for vertical lines, not the top end point. Thus, a coordinate transformation has to be applied to the LineCatcher results.

For this simply replace

for(Rectangle2D rect:rectList) {
    String pageNum = Integer.toString(n + 1);
    String x = Double.toString(rect.getX());
    String y = Double.toString(page_height - rect.getY()) ;
    String w = Double.toString(rect.getWidth());
    String h = Double.toString(rect.getHeight());
    writeToFile(pageNum, x, y, w, h, osw);
}

by

int pageRotation = page.getRotation();
PDRectangle pageCropBox = page.getCropBox();

for(Rectangle2D rect:rectList) {
    String pageNum = Integer.toString(n + 1);
    String x, y, w, h;
    switch(pageRotation) {
    case 0:
        x = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
        y = Double.toString(pageCropBox.getUpperRightY() - rect.getY() - rect.getHeight());
        w = Double.toString(rect.getWidth());
        h = Double.toString(rect.getHeight());
        break;
    case 90:
        x = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
        y = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
        w = Double.toString(rect.getHeight());
        h = Double.toString(rect.getWidth());
        break;
    case 180:
        x = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
        y = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
        w = Double.toString(rect.getWidth());
        h = Double.toString(rect.getHeight());
        break;
    case 270:
        x = Double.toString(pageCropBox.getUpperRightY() - rect.getY() - rect.getHeight());
        y = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
        w = Double.toString(rect.getHeight());
        h = Double.toString(rect.getWidth());
        break;
    default:
        throw new IOException(String.format("Unsupported page rotation %d on page %d.", pageRotation, n + 1));
    }
    writeToFile(pageNum, x, y, w, h, osw);
}

(ExtractLinesWithDir test testExtractLineRotationTestWithDir)
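
To make the mapping easier to check by hand, here is the same idea as a standalone function that needs only java.awt.geom. It is a sketch under two assumptions: the rectangles are ordinary bounds as returned by getBounds2D() (non-negative width and height, getY() being the lower edge in PDF coordinates), and the crop box is passed in as four plain numbers. The top edge is therefore taken as rect.getMaxY(), i.e. getY() + getHeight(). The names RotationAwareTransform and toTopLeft are invented for this example.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;

// Hypothetical helper illustrating the rotation-dependent mapping.
final class RotationAwareTransform
{
    /**
     * Maps a rectangle in PDF user space (origin bottom left, y up) to
     * { x, y, width, height } in the visible page (origin top left, y down),
     * where (x, y) is the left / top corner after applying the page rotation.
     * llx, lly, urx, ury describe the crop box.
     */
    static double[] toTopLeft(Rectangle2D rect, double llx, double lly, double urx, double ury, int rotation)
    {
        switch (rotation)
        {
        case 0:
            return new double[] { rect.getMinX() - llx, ury - rect.getMaxY(), rect.getWidth(), rect.getHeight() };
        case 90:
            // the PDF x axis points down and the PDF y axis points right on the visible page
            return new double[] { rect.getMinY() - lly, rect.getMinX() - llx, rect.getHeight(), rect.getWidth() };
        case 180:
            return new double[] { urx - rect.getMaxX(), rect.getMinY() - lly, rect.getWidth(), rect.getHeight() };
        case 270:
            // the PDF x axis points up and the PDF y axis points left on the visible page
            return new double[] { ury - rect.getMaxY(), urx - rect.getMaxX(), rect.getHeight(), rect.getWidth() };
        default:
            throw new IllegalArgumentException("Unsupported page rotation " + rotation);
        }
    }

    public static void main(String[] args)
    {
        // a 100 x 2 horizontal line on a US Letter page (612 x 792) with crop box origin 0 / 0
        Rectangle2D line = new Rectangle2D.Double(10, 20, 100, 2);
        for (int rotation : new int[] { 0, 90, 180, 270 })
        {
            System.out.println(rotation + ": " + Arrays.toString(toTopLeft(line, 0, 0, 612, 792, rotation)));
        }
    }
}
```

For rotations of 90 and 270 degrees the width and height swap, so a line that is horizontal in the content stream comes out as a vertical line on the visible page.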

Relation to `TextPosition.get?DirAdj()` coordinates

The OP describes the coordinates by referring to the TextPosition class methods getXDirAdj() and getYDirAdj(). Indeed, these methods return coordinates in a coordinate system with the origin in the upper left page corner and y coordinates increasing downwards after rotating the page so that the text is drawn upright.

In the example document, all the text is drawn so that it is upright after the page rotation is applied. My understanding of the requirement, as written at the top, is derived from this.

The problem with using the TextPosition.get?DirAdj() values as coordinates globally, though, is that in documents with pages with text drawn in different directions, the collected text coordinates suddenly are relative to different coordinate systems. Thus, a general solution should not collect coordinates wildly like that. Instead it should determine a page orientation at first (e.g. the orientation given by the page rotation or the orientation shared by most of the text) and use coordinates in the fixed coordinate system given by that orientation plus an indication of the writing direction of the text piece in question.
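
One way the "orientation shared by most of the text" could be determined is a simple majority vote over the per-glyph text directions. The sketch below assumes the directions have already been collected as whole degrees (for example by rounding the values of TextPosition.getDir(), which as far as I know reports 0, 90, 180 or 270). DominantDirection and its method are hypothetical names, not PDFBox API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: majority vote over text directions given in degrees.
final class DominantDirection
{
    /**
     * Returns the direction that occurs most often. Angles are normalized to
     * 0..359 first; on a tie the smaller angle wins. If the list is empty,
     * the fallback (e.g. the page rotation) is returned.
     */
    static int dominantDirection(List<Integer> directions, int fallback)
    {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int d : directions)
        {
            counts.merge(((d % 360) + 360) % 360, 1, Integer::sum);
        }
        int best = fallback;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet())
        {
            int angle = e.getKey();
            int count = e.getValue();
            if (count > bestCount || (count == bestCount && angle < best))
            {
                best = angle;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args)
    {
        // mostly upright text with a few rotated labels
        System.out.println(dominantDirection(Arrays.asList(0, 0, 90, 0, 270), 0));
    }
}
```

Text pieces whose direction differs from the result would then be reported in the page-wide coordinate system together with their own direction, instead of in their get?DirAdj() coordinates.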
