Extract a Page from a PDF as a Jpeg

The pdf2image library can do this.

Install it with pip:

pip install pdf2image

Once it is installed, the following code renders the PDF into a list of images, one per page.

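from pdf2image import convert_from_path

# 'pdf_file' is the path to the PDF; 500 is the rendering resolution in DPI.
pages = convert_from_path('pdf_file', 500)

Save the pages in JPEG format, giving each page its own filename so that later pages do not overwrite earlier ones:

for i, page in enumerate(pages, start=1):
    page.save(f'out_{i}.jpg', 'JPEG')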
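
If only one page is needed, rendering the whole document is wasteful. The following is a minimal sketch, assuming pdf2image's first_page and last_page keyword arguments (1-based page numbers) and a working poppler installation. The helper names page_filename and save_single_page, and the output naming scheme, are made up for illustration.

```python
from pathlib import Path


def page_filename(stem, page_number, ext="jpg"):
    """Build an output name such as 'report-3.jpg' (hypothetical naming scheme)."""
    return f"{stem}-{page_number}.{ext}"


def save_single_page(pdf_path, page_number, out_dir=".", dpi=200):
    """Render one page (1-based) of pdf_path and save it as a JPEG."""
    # Imported here so the naming helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    # Setting first_page and last_page to the same number renders just that page.
    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    if not images:
        raise ValueError(f"page {page_number} not found in {pdf_path}")
    out_path = Path(out_dir) / page_filename(Path(pdf_path).stem, page_number)
    # Convert to RGB first: JPEG cannot store an alpha channel or a palette.
    images[0].convert("RGB").save(out_path, "JPEG")
    return out_path
```

Calling save_single_page('pdf_file', 3) would then be expected to write pdf_file-3.jpg into the current directory.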

Edit: the pdf2image GitHub repository also notes that it relies on pdftoppm, which requires additional installations:

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler.
Windows users will have to install poppler for Windows.
Mac users will have to install poppler for Mac.
Linux users will usually have pdftoppm pre-installed with the distro (tested on Ubuntu and Arch Linux). If it is not, run sudo apt install poppler-utils.

On Windows, you can install the latest version through Anaconda:

conda install -c conda-forge poppler

Note: Windows builds up to version 0.67 are available at http://blog.alivate.com.au/poppler-windows/, but 0.68 was released in August 2018, so those builds will not include the latest features or bug fixes.
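
Before calling pdf2image it can help to confirm that pdftoppm is actually reachable. This is a small sketch using only the standard library; the function names find_pdftoppm and poppler_status are hypothetical, and it assumes pdf2image locates pdftoppm through PATH unless told otherwise via its poppler_path argument.

```python
import shutil


def find_pdftoppm():
    """Return the full path of the pdftoppm executable, or None if it is not on PATH."""
    return shutil.which("pdftoppm")


def poppler_status():
    """Return a short human-readable message about poppler availability."""
    path = find_pdftoppm()
    if path is None:
        return "pdftoppm not found: install poppler or pass poppler_path to pdf2image"
    return f"pdftoppm found at {path}"
```

Printing poppler_status() before the conversion gives a clearer hint than the error raised when the binary is missing.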

Output filenames when extracting a range of pages from pdf into jpeg using Imagemagick

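Finally managed to do this. Leaving an answer in case somebody else is looking for the same thing. The solution works with ImageMagick 6.5.1.

Say we want to extract pages i to j (for example, 10 to 20) from a.pdf into individual JPEGs named a-10.jpeg to a-20.jpeg.

convert a.pdf[i-j] -set filename:page "%[fx:t+i]" a-%[filename:page].jpeg

Here i and j are placeholders for the actual numbers: for pages 10 to 20 the range is a.pdf[10-20] and the expression is %[fx:t+10]. This uses the fx operator. Inside an fx expression, t is the index of the current image in the sequence, starting at 0, so adding the offset of the first page gives the number used in the filename.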
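
To avoid substituting the numbers by hand, the command can be assembled programmatically. The sketch below only builds the argument list for the command given in this answer; build_convert_command and expected_filenames are hypothetical helper names, and the stem "a" mirrors the example.

```python
def build_convert_command(pdf_path, first, last, stem="a"):
    """Return the argument list for the ImageMagick command from this answer.

    first and last are the indices placed inside the brackets. ImageMagick
    counts them from 0, and fx's t restarts at 0 for the extracted range,
    which is why the offset `first` is added back.
    """
    if first > last:
        raise ValueError("first must not exceed last")
    return [
        "convert",
        f"{pdf_path}[{first}-{last}]",
        "-set", "filename:page", f"%[fx:t+{first}]",
        f"{stem}-%[filename:page].jpeg",
    ]


def expected_filenames(first, last, stem="a"):
    """List the output names the command should produce, for checking afterwards."""
    return [f"{stem}-{n}.jpeg" for n in range(first, last + 1)]
```

The list can be passed to subprocess.run(cmd, check=True) on a machine where ImageMagick is installed. Passing a list rather than a shell string avoids having to quote the brackets and percent signs.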

How can I convert a multi-page PDF file to many JPEG images with Vips in C++?

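I found a way to solve this using crop. The PDF is loaded as one tall image with the pages stacked vertically, and each page is then cropped out and saved.

// voptions is assumed to request every page, e.g. VImage::option()->set("n", -1).
VImage in = VImage::pdfload("/Users/MyUser/Desktop/PDF_Reader/files/TEST_DOC_READER.pdf", voptions);
int pages = in.get_int("n-pages");
int h = in.height() / pages;  // assumes all pages have the same height

// outdir and format are std::string values defined elsewhere (output prefix and extension).
for (int i = 0; i < pages; i++) {
    in.crop(0, i * h, in.width(), h).jpegsave((outdir + to_string(i) + format).c_str());
}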
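
For comparison, the same crop-based idea can be sketched with pyvips, the Python binding for libvips. This is an illustration under assumptions, not the answer's code: it assumes all pages have the same height, and the names page_boxes, split_pdf_to_jpegs and out_prefix are invented here.

```python
def page_boxes(width, total_height, pages):
    """Return (left, top, width, height) crop boxes for equally tall stacked pages."""
    if pages <= 0 or total_height % pages != 0:
        raise ValueError("pages must divide the total height evenly")
    page_height = total_height // pages
    return [(0, i * page_height, width, page_height) for i in range(pages)]


def split_pdf_to_jpegs(pdf_path, out_prefix):
    """Load every page of pdf_path with pyvips and save each one as a JPEG."""
    import pyvips  # imported lazily; page_boxes is usable without it

    # n=-1 asks the loader for all pages, stacked vertically into one tall image.
    image = pyvips.Image.pdfload(pdf_path, n=-1)
    pages = image.get("n-pages")
    boxes = page_boxes(image.width, image.height, pages)
    for i, (left, top, w, h) in enumerate(boxes):
        image.crop(left, top, w, h).jpegsave(f"{out_prefix}{i}.jpg")
```

Keeping the box arithmetic in its own function makes the page-splitting logic easy to check without a PDF at hand.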

How to convert PDF files to images

The thread "converting PDF file to a JPEG image" is suitable for your request.

One solution is to use a third-party library. ImageMagick is very popular and freely available. A .NET wrapper exists for it, and the original can be obtained from the ImageMagick download page.

  • Convert PDF pages to image files using the Solid Framework (dead link; the deleted document is available on the Internet Archive).
  • Convert PDF to JPG Universal Document Converter
  • 6 Ways to Convert a PDF to a JPG Image

You can also take a look at the thread
"How to open a page from a pdf file in pictureBox in C#".

If you use this process to convert a PDF to TIFF, you can use this class to retrieve the bitmaps from the TIFF.

using System;
using System.Collections;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

public class TiffImage
{
    private string myPath;
    private Guid myGuid;
    private FrameDimension myDimension;
    public ArrayList myImages = new ArrayList();
    private int myPageCount;
    private Bitmap myBMP;

    public TiffImage(string path)
    {
        MemoryStream ms;
        Image myImage;

        myPath = path;
        FileStream fs = new FileStream(myPath, FileMode.Open);
        myImage = Image.FromStream(fs);
        myGuid = myImage.FrameDimensionsList[0];
        myDimension = new FrameDimension(myGuid);
        myPageCount = myImage.GetFrameCount(myDimension);

        // Copy each TIFF frame (page) into its own Bitmap.
        for (int i = 0; i < myPageCount; i++)
        {
            ms = new MemoryStream();
            myImage.SelectActiveFrame(myDimension, i);
            myImage.Save(ms, ImageFormat.Bmp);
            myBMP = new Bitmap(ms);
            myImages.Add(myBMP);
            ms.Close();
        }
        fs.Close();
    }
}

Use it like so:

private void button1_Click(object sender, EventArgs e)
{
    TiffImage myTiff = new TiffImage("D:\\Some.tif");
    // pictureBox1 to pictureBox3 are PictureBox controls. Each element of
    // myImages is the Bitmap for one page, so this assumes at least three pages.
    this.pictureBox1.Image = (Bitmap)myTiff.myImages[0];
    this.pictureBox2.Image = (Bitmap)myTiff.myImages[1];
    this.pictureBox3.Image = (Bitmap)myTiff.myImages[2];
}

converting image-based pdf to image file (png/jpg) in python

The pdf2image library converts PDFs to images. Since your PDFs contain nothing but images, you can convert each page to an image.

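Install it with pip:

pip install pdf2image

Once it is installed, the following code renders the PDF into a list of images, one per page.

from pdf2image import convert_from_path

# 'pdf_file' is the path to the PDF; 500 is the rendering resolution in DPI.
pages = convert_from_path('pdf_file', 500)

# Save the pages in JPEG format, one file per page so that nothing is overwritten.
for i, page in enumerate(pages, start=1):
    page.save(f'out_{i}.jpg', 'JPEG')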

Get a page from pdf and save it to an image file with itext

Apparently (according to 1T3XT BVBA), you can only save an iText Image from a PDF page, not a raster image.
You can store it anywhere if you intend to place it in another PDF page later. Otherwise, you'll have to use a tool like JPedal:

http://www.idrsolutions.com/convert-pdf-to-images/

===================================

EDIT: PDFBox may be able to do it for you too:

http://pdfbox.apache.org/commandlineutilities/PDFToImage.html

http://gal-levinsky.blogspot.it/2011/11/convert-pdf-to-image-via-pdfbox.html

Converting PDF to images automatically

If the PDFs are truly scanned images, then you shouldn't convert the PDF to an image; you should extract the image from the PDF. Most likely, all of the data in the PDF is essentially one giant image, wrapped in just enough PDF structure to make it readable in Acrobat.

Try the simple expedient of finding the image in the PDF and copying the bytes out: Extracting JPGs from PDFs. The code there is dead simple, and there are probably dozens of reasons it won't work on your PDF files. But if it does, you'll have a quick and painless way to get the image data out of the PDF files.
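
To make the "copy the bytes out" idea concrete, here is a rough sketch in that spirit. It is not the code from the linked article: it assumes each image is stored as an unencrypted JPEG stream, and it simply looks for the JPEG start marker right after a stream keyword. The function name extract_jpegs is made up for this example.

```python
JPEG_START = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_END = b"\xff\xd9"    # JPEG end-of-image marker


def extract_jpegs(pdf_bytes):
    """Return the byte strings of JPEG streams found by naive marker scanning."""
    images = []
    pos = 0
    while True:
        stream = pdf_bytes.find(b"stream", pos)
        if stream < 0:
            break
        # Only accept a JPEG start marker that sits right after the stream keyword.
        start = pdf_bytes.find(JPEG_START, stream, stream + 20)
        if start < 0:
            pos = stream + len(b"stream")
            continue
        end_stream = pdf_bytes.find(b"endstream", start)
        if end_stream < 0:
            break
        # Use the last end marker before endstream, since the marker bytes
        # can also occur earlier inside the image (for example in a thumbnail).
        end = pdf_bytes.rfind(JPEG_END, start, end_stream)
        if end >= 0:
            images.append(pdf_bytes[start:end + len(JPEG_END)])
        pos = end_stream + len(b"endstream")
    return images
```

Each returned byte string can be written straight to a .jpg file opened in binary mode. Images stored with other filters or compression layers will not be found by this approach.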

Why is my code only creating a jpeg from the last page of the PDF and therefore only writing the last page to a text file?

The problem is in how the filename variable is assigned.

Look at the first loop:

for page in pages:
    filename = "page_" + str(image_counter) + ".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1

When this loop finishes, filename still holds the name of the last image that was saved. The second loop then opens that same filename on every iteration, so it reads the last image filelimit times.

One solution is to assign filename again inside the second loop.

for i in range(1, filelimit + 1):
    filename = "page_" + str(i) + ".jpg"
    text = pytesseract.image_to_string(Image.open(filename))
    # Rejoin words that were hyphenated across line breaks.
    text = text.replace('-\n', '')
    f.write(text)

f.close()

That should solve the problem, because each file is now read separately.
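
Another way to rule out this kind of bug is to save and read each page in a single loop, so there is no stale variable to carry over. The sketch below assumes pdf2image, Pillow and pytesseract are installed; the helper names page_filenames, join_hyphenated and ocr_pdf are invented for this example.

```python
def page_filenames(count, prefix="page_", ext=".jpg"):
    """Return the names page_1.jpg ... page_<count>.jpg."""
    return [f"{prefix}{i}{ext}" for i in range(1, count + 1)]


def join_hyphenated(text):
    """Rejoin words that were split with a hyphen at a line break."""
    return text.replace("-\n", "")


def ocr_pdf(pdf_path, text_path, dpi=500):
    """Save every page as a JPEG and append its OCR text to text_path."""
    # Third-party imports are kept local so the helpers above run without them.
    from pdf2image import convert_from_path
    from PIL import Image
    import pytesseract

    pages = convert_from_path(pdf_path, dpi)
    names = page_filenames(len(pages))
    with open(text_path, "a", encoding="utf-8") as out:
        for page, name in zip(pages, names):
            page.save(name, "JPEG")
            # name changes on every iteration, so each image is read exactly once.
            text = pytesseract.image_to_string(Image.open(name))
            out.write(join_hyphenated(text))
```

Using a with block also closes the text file automatically, which replaces the explicit f.close() call.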


