How to extract images from a PDF in their original format
You can't (reliably) know the source image file format by looking at an image in PDF. For example, TIFF images can be compressed with (off the top of me head) none, RLE, CCITT (couple variations), LZW, Flate, Jpeg. If an image in a PDF is compressed with DCT (jpeg), how do you decide whether or not the source was TIFF or Jpeg? If it is compressed with Flate, how do you distinguish between TIFF and PNG? Further, it is the software generating the PDF which decides the compression, so I can take a Flate compressed TIFF image and encode it into a PDF using JPEG2000 or a CCITT compressed image and compress it with Jbig2 or a jpeg image, reduce it to an 8-bit paletted image and compress it with Flate.
TL;DR you can't know.
How can I extract images from a PDF file?
pdfimages does just that. It's is part of the poppler-utils and xpdf-utils packages.
From the manpage:
Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files.
Pdfimages reads the PDF file, scans one or more pages, PDF-file, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg).
NB: pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.
How to Extract Images from a PDF Form with iText
The visual appearance of a button in PDF can be fully customized, with text, graphics and images. So, the image data could be stored in a slightly different way in different PDF documents. But generally speaking, the form field's widget annotation will have an appearance stream, which will have the image data as an XObject in its Resources dictionary.
Creating a PDF with a button with image for testing:
String fieldname = "Image1_af_image";
PdfAcroForm form = PdfAcroForm.getAcroForm(pdfDoc, true);
PdfButtonFormField imagefield = PdfFormField.createButton(pdfDoc, new Rectangle(100, 100, 50, 50),
PdfButtonFormField.FF_PUSH_BUTTON);
imagefield.setImage("button.png").setFieldName(fieldname);
form.addField(imagefield);
Getting the image data from a button:
PdfAcroForm acroForm = PdfAcroForm.getAcroForm(pdfDoc, false);
PdfFormField imagefield = acroForm.getField(fieldname);
// get the appearance dictionary
PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
// get the xobject resources
PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
for (PdfName key : xObjDic.keySet()) {
System.out.println(key);
PdfStream s = xObjDic.getAsStream(key);
// only process images
if (PdfName.Image.equals(s.getAsName(PdfName.Subtype))) {
PdfImageXObject pixo = new PdfImageXObject(s);
byte[] imgbytes = pixo.getImageBytes();
String ext = pixo.identifyImageFileExtension();
// write the image to file
FileOutputStream fos = new FileOutputStream(key.toString().substring(1) + "." + ext);
fos.write(imgbytes);
fos.close();
}
}
You can use a PDF object viewer, such as iText RUPS or Adobe Acrobat's built-in "Browse Internal PDF Structure", to inspect the exact structure of your PDF document and find out where the image data is stored.
EDIT:
A more generic way of extracting the image data, in case it's in nested Form XObjects:
PdfAcroForm acroForm = PdfAcroForm.getAcroForm(pdfDoc, false);
PdfFormField imagefield = acroForm.getField(fieldname);
// get the appearance dictionary
PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
// get the xobject resources
PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
extractImagesFromXObj(xObjDic);
public void extractImagesFromXObj(PdfDictionary xObjDic) throws IOException {
for (PdfName key : xObjDic.keySet()) {
System.out.println(key);
PdfStream s = xObjDic.getAsStream(key);
PdfName subType = s.getAsName(PdfName.Subtype);
// only process images
if (PdfName.Image.equals(subType)) {
PdfImageXObject pixo = new PdfImageXObject(s);
byte[] imgbytes = pixo.getImageBytes();
String ext = pixo.identifyImageFileExtension();
// write the image to file
FileOutputStream fos = new FileOutputStream(key.toString().substring(1) + "." + ext);
fos.write(imgbytes);
fos.close();
}
// process nested XObject dictionaries recursively
else if (PdfName.Form.equals(subType)) {
PdfDictionary nestedXObjDic = s.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
extractImagesFromXObj(nestedXObjDic);
}
}
}
Related Topics
Using Array_Search for Multi Dimensional Array
Php: Get HTML Source Code with Curl
Deep Recursive Array of Directory Structure in PHP
Upload Multiple Files in Codeigniter
Joomla 3.2.1 Password Encryption
What's the Disadvantage of Mt_Rand
File_Get_Contents() Give Me 403 Forbidden
Understanding Nested PHP Ternary Operator
Pdo MySQL: Insert Multiple Rows in One Query
Phpunit Mock Objects and Static Methods
Converting <Br /> into a New Line for Use in a Text Area
How to Include the Split Delimiter in Results for Preg_Split()
Php: How to Explode a String by Commas, But Not Wheres the Commas Are Within Quotes
Manipulate an Archive in Memory with PHP (Without Creating a Temporary File on Disk)
PHP Get Highest Value from Array
PHP Exec() Will Not Execute Shell Command When Executed via Browser
PHP - Get Base64 Img String Decode and Save as Jpg (Resulting Empty Image )