How to Convert PDF Files to Images

How to convert PDF files to images

The thread "converting PDF file to a JPEG image" is suitable for your request.

One solution is to use a third-party library. ImageMagick is a very popular and is freely available too. You can get a .NET wrapper for it here. The original ImageMagick download page is here.

  • Convert PDF pages to image files using the Solid Framework Convert PDF pages to image files using the Solid Framework (dead link, the deleted document is available on Internet Archive).
  • Convert PDF to JPG Universal Document Converter
  • 6 Ways to Convert a PDF to a JPG Image

And you also can take a look at the thread
"How to open a page from a pdf file in pictureBox in C#".

If you use this process to convert a PDF to tiff, you can use this class to retrieve the bitmap from TIFF.

public class TiffImage
{
private string myPath;
private Guid myGuid;
private FrameDimension myDimension;
public ArrayList myImages = new ArrayList();
private int myPageCount;
private Bitmap myBMP;

public TiffImage(string path)
{
MemoryStream ms;
Image myImage;

myPath = path;
FileStream fs = new FileStream(myPath, FileMode.Open);
myImage = Image.FromStream(fs);
myGuid = myImage.FrameDimensionsList[0];
myDimension = new FrameDimension(myGuid);
myPageCount = myImage.GetFrameCount(myDimension);
for (int i = 0; i < myPageCount; i++)
{
ms = new MemoryStream();
myImage.SelectActiveFrame(myDimension, i);
myImage.Save(ms, ImageFormat.Bmp);
myBMP = new Bitmap(ms);
myImages.Add(myBMP);
ms.Close();
}
fs.Close();
}
}

Use it like so:

private void button1_Click(object sender, EventArgs e)
{
TiffImage myTiff = new TiffImage("D:\\Some.tif");
//imageBox is a PictureBox control, and the [] operators pass back
//the Bitmap stored at that position in the myImages ArrayList in the TiffImage
this.pictureBox1.Image = (Bitmap)myTiff.myImages[0];
this.pictureBox2.Image = (Bitmap)myTiff.myImages[1];
this.pictureBox3.Image = (Bitmap)myTiff.myImages[2];
}

Extract a page from a pdf as a jpeg

The pdf2image library can be used.

You can install it simply using,

pip install pdf2image

Once installed you can use following code to get images.

from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)

Saving pages in jpeg format

for page in pages:
page.save('out.jpg', 'JPEG')

Edit: the Github repo pdf2image also mentions that it uses pdftoppm and that it requires other installations:

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler.
Windows users will have to install poppler for Windows.
Mac users will have to install poppler for Mac.
Linux users will have pdftoppm pre-installed with the distro (Tested on Ubuntu and Archlinux) if it's not, run sudo apt install poppler-utils.

You can install the latest version under Windows using anaconda by doing:

conda install -c conda-forge poppler

note: Windows versions upto 0.67 are available at http://blog.alivate.com.au/poppler-windows/ but note that 0.68 was released in Aug 2018 so you'll not be getting the latest features or bug fixes.

Time efficient way to convert PDF to image

I found an answer to that problem using another module called fitz which is a python binding to MuPDF.

First of all install PyMuPDF:

The documentation can be found here but for windows users it's rather simple:

pip install PyMuPDF

Then import the fitz module:

import fitz
print(fitz.__doc__)

>>>PyMuPDF 1.18.13: Python bindings for the MuPDF 1.18.0 library.
>>>Version date: 2021-05-05 06:32:22.
>>>Built for Python 3.7 on win32 (64-bit).

Open your file and save every page as images:

The get_pixmap() method accepts different parameters that allows you to control the image (variation,resolution,color...) so I suggest that you red the documentation here.

def convert_pdf_to_image(fic):
#open your file
doc = fitz.open(fic)
#iterate through the pages of the document and create a RGB image of the page
for page in doc:
pix = page.get_pixmap()
pix.save("page-%i.png" % page.number)

Hope this helps anyone else.

Convert PDFs to Images using pdf2image in python

You could try something like this (this script finds pdf files in the same directory with your python program):

 import os

from pdf2image import convert_from_path, convert_from_bytes

from pdf2image.exceptions import (
PDFInfoNotInstalledError,
PDFPageCountError,
PDFSyntaxError
)

# get all pdf files from directory
pdf_files = [filename for filename in os.listdir(
'.') if filename.endswith('.pdf')]

for pdf_file in pdf_files:
images = convert_from_path(pdf_file)

print(pdf_file)

for i, image in enumerate(images):
fname = pdf_file+'_image'+str(i)+'.jpg'
image.save(fname, "JPEG")

Converting PDF to image (with proper formatting)

it turns out that jpedal(lgpl) does the converting perfectly(just like a snapshot).

here is what I have used :

PdfDecoder decode_pdf = new PdfDecoder(true);

FontMappings.setFontReplacements();

decode_pdf.openPdfFile("Hello World.pdf");

decode_pdf.setExtractionMode(0,800,3);

try {

for(int i=0;i<40;i++)
{
BufferedImage img=decode_pdf.getPageAsImage(2+i);

ImageIO.write(img, "png",new File(String.valueOf(i)+"out.png"));
}
} catch (IOException ex) {
Logger.getLogger(NewJFrame.class.getName()).log(Level.SEVERE, null, ex);
}

decode_pdf.closePdfFile();

} catch (PdfException e) {
e.printStackTrace();
}

it works fine.

Convert pdf to images using Node package or open source tool

Be aware generally https://npmjs.com/package/pdf-img-convert is frequently updated thus the better of the two, but has 3 pending pull requests so review if they impact your useage. (Note https://npmjs.com/package/pdf-image has a significantly much heavier set of dependencies to break and also has a much bigger list of pending pull requests thus your correct assumption the older it is ....)

However current pdf-img-convert 1.0.3 has a breaking dependency that needs a manual correction due to a change in Mozilla naming earlier this year from es5 to legacy.

see https://github.com/olliet88/pdf-img-convert.js/issues/10

For a cross platform Open Source CLI tool I would suggest Artifex MuTool (AGPL is not free for commercial use, but your getting quality support) has continuous daily commits, it can be programmed via Mutool Run ecma.js

Out of the box a simple convert in.pdf out%4d.png will attempt fixing broken PDF but may reject some that need a more forgiving secondary approach such as above.

Convert a PDF file to image

You can easily convert 04-Request-Headers.pdf file pages into image format.

Convert all pdf pages into image format in Java using PDF Box.

Solution for Apache PDFBox 1.8.* version:

Jar required pdfbox-1.8.3.jar

or the maven dependency

<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>1.8.3</version>
</dependency>

Here is the solution:

package com.pdf.pdfbox.examples;

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

@SuppressWarnings("unchecked")
public class ConvertPDFPagesToImages {
public static void main(String[] args) {
try {
String sourceDir = "C:/Documents/04-Request-Headers.pdf"; // Pdf files are read from this folder
String destinationDir = "C:/Documents/Converted_PdfFiles_to_Image/"; // converted images from pdf document are saved here

File sourceFile = new File(sourceDir);
File destinationFile = new File(destinationDir);
if (!destinationFile.exists()) {
destinationFile.mkdir();
System.out.println("Folder Created -> "+ destinationFile.getAbsolutePath());
}
if (sourceFile.exists()) {
System.out.println("Images copied to Folder: "+ destinationFile.getName());
PDDocument document = PDDocument.load(sourceDir);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
System.out.println("Total files to be converted -> "+ list.size());

String fileName = sourceFile.getName().replace(".pdf", "");
int pageNumber = 1;
for (PDPage page : list) {
BufferedImage image = page.convertToImage();
File outputfile = new File(destinationDir + fileName +"_"+ pageNumber +".png");
System.out.println("Image Created -> "+ outputfile.getName());
ImageIO.write(image, "png", outputfile);
pageNumber++;
}
document.close();
System.out.println("Converted Images are saved at -> "+ destinationFile.getAbsolutePath());
} else {
System.err.println(sourceFile.getName() +" File not exists");
}

} catch (Exception e) {
e.printStackTrace();
}
}
}

Possible conversions of image into jpg, jpeg, png, bmp, gif format.

Note: I mentioned the mainly used image formats.

ImageIO.write(image , "jpg", new File( destinationDir +fileName+"_"+pageNumber+".jpg" ));
ImageIO.write(image , "jpeg", new File( destinationDir +fileName+"_"+pageNumber+".jpeg" ));
ImageIO.write(image , "png", new File( destinationDir +fileName+"_"+pageNumber+".png" ));
ImageIO.write(image , "bmp", new File( destinationDir +fileName+"_"+pageNumber+".bmp" ));
ImageIO.write(image , "gif", new File( destinationDir +fileName+"_"+pageNumber+".gif" ));

Console Output:

Images copied to Folder: Converted_PdfFiles_to_Image
Total files to be converted -> 13
Aug 06, 2014 1:35:49 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_1.png
Aug 06, 2014 1:35:50 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_2.png
Aug 06, 2014 1:35:51 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_3.png
Aug 06, 2014 1:35:51 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_4.png
Aug 06, 2014 1:35:52 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_5.png
Aug 06, 2014 1:35:52 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_6.png
Aug 06, 2014 1:35:53 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_7.png
Aug 06, 2014 1:35:53 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_8.png
Aug 06, 2014 1:35:54 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_9.png
Aug 06, 2014 1:35:54 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_10.png
Aug 06, 2014 1:35:54 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_11.png
Aug 06, 2014 1:35:55 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_12.png
Aug 06, 2014 1:35:55 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Image Created -> 04-Request-Headers_13.png
Converted Images are saved at -> C:\Documents\Converted_PdfFiles_to_Image

Solution for Apache PDFBox 2.0.* version:

Required Jars pdfbox-2.0.16.jar, fontbox-2.0.16.jar, commons-logging-1.2.jar

or from the pom.xml dependencies

<!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.16</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.pdfbox/fontbox -->
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>fontbox</artifactId>
<version>2.0.16</version>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-logging/commons-logging -->
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.2</version>
</dependency>

Solution for 2.0.16 version:

package com.pdf.pdfbox.examples;

import java.awt.image.BufferedImage;
import java.io.File;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

/**
*
* @author venkataudaykiranp
*
* @version 2.0.16(Apache PDFBox version support)
*
*/
public class ConvertPDFPagesToImages {
public static void main(String[] args) {
try {
String sourceDir = "C:\\Users\\venkataudaykiranp\\Downloads\\04-Request-Headers.pdf"; // Pdf files are read from this folder
String destinationDir = "C:\\Users\\venkataudaykiranp\\Downloads\\Converted_PdfFiles_to_Image/"; // converted images from pdf document are saved here

File sourceFile = new File(sourceDir);
File destinationFile = new File(destinationDir);
if (!destinationFile.exists()) {
destinationFile.mkdir();
System.out.println("Folder Created -> "+ destinationFile.getAbsolutePath());
}
if (sourceFile.exists()) {
System.out.println("Images copied to Folder Location: "+ destinationFile.getAbsolutePath());
PDDocument document = PDDocument.load(sourceFile);
PDFRenderer pdfRenderer = new PDFRenderer(document);

int numberOfPages = document.getNumberOfPages();
System.out.println("Total files to be converting -> "+ numberOfPages);

String fileName = sourceFile.getName().replace(".pdf", "");
String fileExtension= "png";
/*
* 600 dpi give good image clarity but size of each image is 2x times of 300 dpi.
* Ex: 1. For 300dpi 04-Request-Headers_2.png expected size is 797 KB
* 2. For 600dpi 04-Request-Headers_2.png expected size is 2.42 MB
*/
int dpi = 300;// use less dpi for to save more space in harddisk. For professional usage you can use more than 300dpi

for (int i = 0; i < numberOfPages; ++i) {
File outPutFile = new File(destinationDir + fileName +"_"+ (i+1) +"."+ fileExtension);
BufferedImage bImage = pdfRenderer.renderImageWithDPI(i, dpi, ImageType.RGB);
ImageIO.write(bImage, fileExtension, outPutFile);
}

document.close();
System.out.println("Converted Images are saved at -> "+ destinationFile.getAbsolutePath());
} else {
System.err.println(sourceFile.getName() +" File not exists");
}
} catch (Exception e) {
e.printStackTrace();
}
}
}


Related Topics



Leave a reply



Submit