Using Itextpdf to Trim a Page's Whitespace

Using iTextPDF to trim a page's whitespace

As no actual solution has been posted, here some pointers from the accompanying itext-questions mailing list thread:

  1. As you want to merely trim pages, this is not a case of PdfWriter + getImportedPage usage but instead of PdfStamper usage. Your main code using a PdfStamper might look like this:

    PdfReader reader = new PdfReader(resourceStream); 
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("target/test-outputs/test-trimmed-stamper.pdf"));

    // Go through all pages
    int n = reader.getNumberOfPages();
    for (int i = 1; i <= n; i++)
    {
    Rectangle pageSize = reader.getPageSize(i);
    Rectangle rect = getOutputPageSize(pageSize, reader, i);

    PdfDictionary page = reader.getPageN(i);
    page.put(PdfName.CROPBOX, new PdfArray(new float[]{rect.getLeft(), rect.getBottom(), rect.getRight(), rect.getTop()}));
    stamper.markUsed(page);
    }
    stamper.close();

    As you see I also added another argument to your getOutputPageSize method to-be. It is the page number. The amount of white space to trim might differ on different pages after all.

  2. If the source document did not contain vector graphics, you could simply use the iText parser package classes. There even already is a TextMarginFinder based on them. In this case the getOutputPageSize method (with the additional page parameter) could look like this:

    private Rectangle getOutputPageSize(Rectangle pageSize, PdfReader reader, int page) throws IOException 
    {
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    TextMarginFinder finder = parser.processContent(page, new TextMarginFinder());
    Rectangle result = new Rectangle(finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());
    System.out.printf("Text/bitmap boundary: %f,%f to %f, %f\n", finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());
    return result;
    }

    Using this method with your file test.pdf results in:

    Sample Image

    As you see the code trims according to text (and bitmap image) content on the page.

  3. To find the bounding box respecting vector graphics, too, you essentially have to do the same but you have to extend the parser framework used here to inform its listeners (the TextMarginFinder essentially is a listener to drawing events sent from the parser framework) about vector graphics operations, too. This is non-trivial, especially if you don't know PDF syntax by heart yet.

  4. If your PDFs to trim are not too generic but can be forced to include some text or bitmap graphics in relevant positions, though, you could use the sample code above (probably with minor changes) anyways.

    E.g. if your PDFs always start with text on top and end with text at the bottom, you could change getOutputPageSize to create the result rectangle like this:

    Rectangle result = new Rectangle(pageSize.getLeft(), finder.getLly(), pageSize.getRight(), finder.getUry());

    This only trims top and bottom empty space:

    Sample Image

    Depending on your input data pool and requirements this might suffice.

    Or you can use some other heuristics depending on your knowledge on the input data. If you know something about the positioning of text (e.g. the heading to always be centered and some other text to always start at the left), you can easily extend the TextMarginFinder to take advantage of this knowledge.


Recent (April 2015, iText 5.5.6-SNAPSHOT) improvements

The current development version, 5.5.6-SNAPSHOT, extends the parser package to also include vector graphics parsing. This allows for an extension of iText's original TextMarginFinder class implementing the new ExtRenderListener methods like this:

@Override
public void modifyPath(PathConstructionRenderInfo renderInfo)
{
List<Vector> points = new ArrayList<Vector>();
if (renderInfo.getOperation() == PathConstructionRenderInfo.RECT)
{
float x = renderInfo.getSegmentData().get(0);
float y = renderInfo.getSegmentData().get(1);
float w = renderInfo.getSegmentData().get(2);
float h = renderInfo.getSegmentData().get(3);
points.add(new Vector(x, y, 1));
points.add(new Vector(x+w, y, 1));
points.add(new Vector(x, y+h, 1));
points.add(new Vector(x+w, y+h, 1));
}
else if (renderInfo.getSegmentData() != null)
{
for (int i = 0; i < renderInfo.getSegmentData().size()-1; i+=2)
{
points.add(new Vector(renderInfo.getSegmentData().get(i), renderInfo.getSegmentData().get(i+1), 1));
}
}

for (Vector point: points)
{
point = point.cross(renderInfo.getCtm());
Rectangle2D.Float pointRectangle = new Rectangle2D.Float(point.get(Vector.I1), point.get(Vector.I2), 0, 0);
if (currentPathRectangle == null)
currentPathRectangle = pointRectangle;
else
currentPathRectangle.add(pointRectangle);
}
}

@Override
public Path renderPath(PathPaintingRenderInfo renderInfo)
{
if (renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP)
{
if (textRectangle == null)
textRectangle = currentPathRectangle;
else
textRectangle.add(currentPathRectangle);
}
currentPathRectangle = null;

return null;
}

@Override
public void clipPath(int rule)
{
}

(Full source: MarginFinder.java)

Using this class to trim the white space results in

Sample Image

which is pretty much what one would hope for.

Beware: The implementation above is far from optimal. It is not even correct as it includes all curve control points which is too much. Furthermore it ignores stuff like line width or wedge types. It actually merely is a proof-of-concept.

All test code is in TestTrimPdfPage.java.

Eliminate all white space in PDF using iText

Okay, you're setting the page margins to 1 point in your Document constructor. 1 point is 1/72 of an inch. You should probably use 0f instead, but that doesn't explain a 1 inch margin. Tiny white sliver? Sure... but not what you're describing.

The problem almost certainly stems from you wrapping the Image in a Paragraph which is in turn wrapped in a PdfPTable.

I suggest you scale the image to match the page size, then add the image directly to the document rather than wrapping it in a table:

Image img1 = Image.getInstance(path);
img1.scaleAbsoluteHeight(PageSize.LEGAL.getHeight());
img1.scaleAbsoluteWidth(PageSize.LEGAL.getWidth());

// you might need this, you might not.
img1.setAbsolutePosition(0, 0);

// and add it directly.
document.add(img1);

how to keep up white spaces between words in iText5

If your white space is in left (" hello") , the generated pdf will take that whitespace automatically.
for right side ("hello ") replace white space with "\u00a0" .

String string="hello ";
string=string.replace(" ","\u00a0");

Is there any way to keep whitespaces before text in iText7?

iText will trim spaces.

But it will not remove non-breaking spaces.

File outputFile = new File(System.getProperty("user.home"), "output.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfWriter(outputFile));
Document layoutDocument = new Document(pdfDocument);

layoutDocument.add(new Paragraph("\u00A0\u00A0\u00A0Lorem Ipsum"));
layoutDocument.add(new Paragraph("Lorem Ipsum"));
layoutDocument.close();

Remove white space from top in pdf

If you are using the automatic layout features of iText (which you are, because you are not specifically stating where you want to add the image), then iText will always add a margin to the Document.

What you could do is add the image at an absolute position, that way you can avoid the automated margin.

You'll find this example from their website quite helpful:

https://itextpdf.com/en/resources/faq/technical-support/itext-7/how-precisely-position-image-top-table

Alternatively, you could also set the margins of the page to 0.

https://itextpdf.com/en/resources/faq/getting-started/itext-7/how-use-full-size-page

Keep in mind, I've listed examples for iText 7 Java. You may need to look around a bit to find the equivalent android API.

iText: Importing styled Text and informations from an existing PDF

You can trim away empty space of the XSLT generated PDF and then import the trimmed pages as in your code.

Example code

The following code borrows from the code in my answer to Using iTextPDF to trim a page's whitespace. In contrast to the code there, though, we have to manipulate the media box, not the crop box, because this is the only box respected by PdfWriter.getImportedPage.

Before importing a page from a given PdfReader, crop it using this method:

static void cropPdf(PdfReader reader) throws IOException
{
int n = reader.getNumberOfPages();
for (int i = 1; i <= n; i++)
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
MarginFinder finder = parser.processContent(i, new MarginFinder());
Rectangle rect = new Rectangle(finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());

PdfDictionary page = reader.getPageN(i);
page.put(PdfName.MEDIABOX, new PdfArray(new float[]{rect.getLeft(), rect.getBottom(), rect.getRight(), rect.getTop()}));
}
}

(excerpt from ImportPageWithoutFreeSpace.java)

The extended render listener MarginFinder is taken as is from the question linked to above. You can find a copy here: MarginFinder.java.

Example run

Using this code

PdfReader readerText = new PdfReader(docText);
cropPdf(readerText);
PdfReader readerGraphics = new PdfReader(docGraphics);
cropPdf(readerGraphics);
try ( FileOutputStream fos = new FileOutputStream(new File(RESULT_FOLDER, "importPages.pdf")))
{
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, fos);
document.open();
document.add(new Paragraph("Let's import 'textOnly.pdf'", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));
document.add(Image.getInstance(writer.getImportedPage(readerText, 1)));
document.add(new Paragraph("and now 'graphicsOnly.pdf'", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));
document.add(Image.getInstance(writer.getImportedPage(readerGraphics, 1)));
document.add(new Paragraph("That's all, folks!", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));

document.close();
}
finally
{
readerText.close();
readerGraphics.close();
}

(excerpt from unit test method testImportPages in ImportPageWithoutFreeSpace.java)

I imported both the page from the docText document

<code>docText</code> document

and the page from the docGraphics document

<code>docGraphics</code> document

into a new document with some text before, between, and after. The result:

result of the import

As you can see, source styles are preserved but free space around is discarded.



Related Topics



Leave a reply



Submit