How to Extract Images from a PDF File

How to extract images from a PDF in their original format

You can't (reliably) know the source image file format by looking at an image in a PDF. For example, TIFF images can be compressed with (off the top of my head) nothing, RLE, CCITT (a couple of variations), LZW, Flate, or JPEG. If an image in a PDF is compressed with DCT (JPEG), how do you decide whether the source was a TIFF or a JPEG? If it is compressed with Flate, how do you distinguish between TIFF and PNG? Further, it is the software generating the PDF that decides the compression. I can take a Flate-compressed TIFF image and encode it into a PDF using JPEG 2000; or take a CCITT-compressed image and compress it with JBIG2; or take a JPEG image, reduce it to an 8-bit paletted image, and compress it with Flate.

TL;DR you can't know.
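What you can know is the filter the PDF itself uses for the image stream, and that is enough to pick a sensible output container. Here is a minimal sketch of that idea. The filter names are the standard PDF ones; the class name, method name and the chosen extensions are my own assumptions for illustration, not part of any library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Suggests an output file extension for an image stream, based on its PDF filter. */
class PdfFilterGuide {

    private static final Map<String, String> EXTENSION_BY_FILTER = new LinkedHashMap<>();
    static {
        EXTENSION_BY_FILTER.put("DCTDecode", "jpg");       // stream data is already JPEG
        EXTENSION_BY_FILTER.put("JPXDecode", "jp2");       // stream data is already JPEG 2000
        EXTENSION_BY_FILTER.put("JBIG2Decode", "jb2");     // embedded JBIG2 data (extension is a guess)
        EXTENSION_BY_FILTER.put("CCITTFaxDecode", "tif");  // raw fax data, needs a TIFF wrapper
        EXTENSION_BY_FILTER.put("FlateDecode", "png");     // raw samples once inflated, re-encode
        EXTENSION_BY_FILTER.put("LZWDecode", "png");       // raw samples once decoded, re-encode
        EXTENSION_BY_FILTER.put("RunLengthDecode", "png"); // raw samples once decoded, re-encode
    }

    /** A null filter means the samples are stored uncompressed, so they must be re-encoded. */
    static String suggestedExtension(String filter) {
        if (filter == null) {
            return "png";
        }
        return EXTENSION_BY_FILTER.getOrDefault(filter, "bin");
    }
}
```

Note that this only says how the data sits in the PDF. A "jpg" result tells you the stream is DCT encoded, not that somebody started from a JPEG file.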

How can I extract images from a PDF file?

pdfimages does just that. It is part of the poppler-utils and xpdf-utils packages.

From the manpage:


Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files.


Pdfimages reads the PDF file, PDF-file, scans one or more pages, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg).


NB: pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.
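If you want to drive pdfimages from a program rather than a shell, something like the following should work. It is only a sketch: the class and method names are made up for illustration, and it assumes pdfimages is on the PATH. The -j option (write JPEG images as .jpg rather than converting them to PPM) and the -f / -l page range options come from the pdfimages manpage.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Builds and runs a pdfimages command line (assumes pdfimages is installed). */
class PdfImagesRunner {

    /** A firstPage or lastPage value of 0 or less means "do not pass that option". */
    static List<String> buildCommand(String pdfFile, String imageRoot, int firstPage, int lastPage) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pdfimages");
        cmd.add("-j"); // keep DCT (JPEG) images as .jpg files
        if (firstPage > 0) {
            cmd.add("-f");
            cmd.add(Integer.toString(firstPage));
        }
        if (lastPage > 0) {
            cmd.add("-l");
            cmd.add(Integer.toString(lastPage));
        }
        cmd.add(pdfFile);
        cmd.add(imageRoot);
        return cmd;
    }

    /** Runs the command, forwarding its output to this process; returns the exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }
}
```

Calling PdfImagesRunner.run(PdfImagesRunner.buildCommand("input.pdf", "img", 1, 3)) would scan pages 1 to 3 and write files following the image-root-nnn.xxx pattern described above, with "img" as the root.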

How to Extract Images from a PDF Form with iText

The visual appearance of a button in PDF can be fully customized, with text, graphics and images. So, the image data could be stored in a slightly different way in different PDF documents. But generally speaking, the form field's widget annotation will have an appearance stream, which will have the image data as an XObject in its Resources dictionary.

Creating a PDF with a button with image for testing:

String fieldname = "Image1_af_image";
PdfAcroForm form = PdfAcroForm.getAcroForm(pdfDoc, true);
PdfButtonFormField imagefield = PdfFormField.createButton(pdfDoc, new Rectangle(100, 100, 50, 50),
        PdfButtonFormField.FF_PUSH_BUTTON);
imagefield.setImage("button.png").setFieldName(fieldname);
form.addField(imagefield);

Getting the image data from a button:

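import com.itextpdf.forms.PdfAcroForm;
import com.itextpdf.forms.fields.PdfFormField;
import com.itextpdf.kernel.pdf.PdfDictionary;
import com.itextpdf.kernel.pdf.PdfName;
import com.itextpdf.kernel.pdf.PdfStream;
import com.itextpdf.kernel.pdf.xobject.PdfImageXObject;
import java.io.FileOutputStream;

PdfAcroForm acroForm = PdfAcroForm.getAcroForm(pdfDoc, false);
PdfFormField imagefield = acroForm.getField(fieldname);
// get the appearance dictionary
PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
// get the xobject resources
PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
for (PdfName key : xObjDic.keySet()) {
    System.out.println(key);
    PdfStream s = xObjDic.getAsStream(key);
    // only process images (getAsStream returns null if the entry is not a stream)
    if (s != null && PdfName.Image.equals(s.getAsName(PdfName.Subtype))) {
        PdfImageXObject pixo = new PdfImageXObject(s);
        byte[] imgbytes = pixo.getImageBytes();
        String ext = pixo.identifyImageFileExtension();

        // write the image to file; the key prints as "/Name", so drop the leading slash
        try (FileOutputStream fos = new FileOutputStream(key.toString().substring(1) + "." + ext)) {
            fos.write(imgbytes);
        }
    }
}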
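If you want to double-check what actually ended up in the byte array before trusting the file extension, you can look at the first few bytes. This is a small sketch that is not part of iText; the class and method names are hypothetical, and it only knows a handful of well-known signatures (JPEG, PNG, TIFF, JPEG 2000).

```java
/** Guesses an image container from its leading "magic" bytes. */
class ImageSniffer {

    /** Returns a file extension guessed from the signature, or "bin" if it is not recognized. */
    static String sniff(byte[] data) {
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) {
            return "jpg";
        }
        if (startsWith(data, 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A)) {
            return "png";
        }
        // TIFF starts with a byte-order mark: "II" (little endian) or "MM" (big endian), then 42
        if (startsWith(data, 'I', 'I', 0x2A, 0x00) || startsWith(data, 'M', 'M', 0x00, 0x2A)) {
            return "tif";
        }
        // JPEG 2000 file signature box
        if (startsWith(data, 0x00, 0x00, 0x00, 0x0C, 'j', 'P', ' ', ' ')) {
            return "jp2";
        }
        return "bin";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // mask to compare bytes as unsigned values
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In the loop above you could compare ImageSniffer.sniff(imgbytes) with the extension you are about to use and log a warning when they disagree.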

You can use a PDF object viewer, such as iText RUPS or Adobe Acrobat's built-in "Browse Internal PDF Structure", to inspect the exact structure of your PDF document and find out where the image data is stored.


EDIT:

A more generic way of extracting the image data, in case it's in nested Form XObjects:

PdfAcroForm acroForm = PdfAcroForm.getAcroForm(pdfDoc, false);
PdfFormField imagefield = acroForm.getField(fieldname);
// get the appearance dictionary
PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
// get the xobject resources
PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
extractImagesFromXObj(xObjDic);

public void extractImagesFromXObj(PdfDictionary xObjDic) throws IOException {
    if (xObjDic == null) {
        return; // no XObject resources at this level
    }
    for (PdfName key : xObjDic.keySet()) {
        System.out.println(key);
        PdfStream s = xObjDic.getAsStream(key);
        if (s == null) {
            continue; // entry is not a stream
        }
        PdfName subType = s.getAsName(PdfName.Subtype);
        // only process images
        if (PdfName.Image.equals(subType)) {
            PdfImageXObject pixo = new PdfImageXObject(s);
            byte[] imgbytes = pixo.getImageBytes();
            String ext = pixo.identifyImageFileExtension();

            // write the image to file
            try (FileOutputStream fos = new FileOutputStream(key.toString().substring(1) + "." + ext)) {
                fos.write(imgbytes);
            }
        }
        // process nested XObject dictionaries recursively
        else if (PdfName.Form.equals(subType)) {
            PdfDictionary nestedResources = s.getAsDictionary(PdfName.Resources);
            if (nestedResources != null) {
                extractImagesFromXObj(nestedResources.getAsDictionary(PdfName.XObject));
            }
        }
    }
}
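One thing to keep in mind with the recursive version: a malformed PDF could contain a Form XObject that (directly or indirectly) refers back to itself, which would make the recursion run forever. The sketch below shows the usual guard, a set of already-visited objects, on a toy model. Node is a hypothetical stand-in for an XObject stream and has nothing to do with the iText classes; the same idea could be applied to the method above by passing a visited set along.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of walking nested XObject dictionaries with a cycle guard. */
class XObjectWalk {

    /** Stand-in for an XObject stream: a subtype plus, for forms, its nested XObjects. */
    static final class Node {
        final String subtype; // "Image" or "Form"
        final Map<String, Node> xObjects = new LinkedHashMap<>();

        Node(String subtype) {
            this.subtype = subtype;
        }
    }

    /** Returns the resource names of all images reachable from the given dictionary. */
    static List<String> collectImageNames(Map<String, Node> xObjects) {
        List<String> found = new ArrayList<>();
        // identity-based set: two distinct objects are never confused with each other
        Set<Node> visited = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        walk(xObjects, visited, found);
        return found;
    }

    private static void walk(Map<String, Node> xObjects, Set<Node> visited, List<String> found) {
        if (xObjects == null) {
            return;
        }
        for (Map.Entry<String, Node> entry : xObjects.entrySet()) {
            Node node = entry.getValue();
            // skip missing entries and anything already processed (breaks cycles)
            if (node == null || !visited.add(node)) {
                continue;
            }
            if ("Image".equals(node.subtype)) {
                found.add(entry.getKey());
            } else if ("Form".equals(node.subtype)) {
                walk(node.xObjects, visited, found);
            }
        }
    }
}
```

A side effect of the visited set is that an image shared by several forms is only reported once, which is usually what you want when writing files.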

