Extract Images from PDF Without Resampling, in Python

How to extract images with pdfminer

I have never used pdfminer, but I found this code and this document from Denis Papathanasiou explaining it, which might help you figure this out, since pdfminer's own documentation is not very exhaustive. The document describes an outdated version, but the code was recently updated.

If you are not required to use pdfminer, there are alternatives that might be easier, such as PyMuPDF, shown in this answer, which extracts all images in the PDF as PNG files.
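As a rough sketch of the PyMuPDF route, the function below copies each embedded image out of the PDF in the format it is stored in, so nothing is re-rendered or resampled. It assumes a recent PyMuPDF release with snake_case method names (get_images, extract_image); older releases used camelCase names. The helper names image_filename and extract_embedded_images, and the output naming scheme, are made up for illustration and are not taken from the linked answer.

```python
from pathlib import Path


def image_filename(page_number, xref, ext):
    """Build an output name such as 'page003_xref17.png' (hypothetical naming scheme)."""
    return f"page{page_number:03d}_xref{xref}.{ext}"


def extract_embedded_images(pdf_path, out_dir):
    """Write every embedded image to out_dir using its stored bytes, without re-rendering."""
    import fitz  # PyMuPDF, imported lazily so the naming helper works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    seen = set()  # an image reused on several pages shares one xref
    with fitz.open(pdf_path) as doc:
        for page_index in range(len(doc)):
            for img in doc[page_index].get_images(full=True):
                xref = img[0]  # first item of each entry is the image's xref number
                if xref in seen:
                    continue
                seen.add(xref)
                info = doc.extract_image(xref)  # dict with the raw bytes and a file extension
                target = out / image_filename(page_index + 1, xref, info["ext"])
                target.write_bytes(info["image"])
                written.append(target)
    return written
```

Because the bytes are written as stored, the files may be JPEG, PNG, or another format rather than always PNG.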

Extract a page from a PDF as a JPEG

The pdf2image library can be used.

You can install it with pip:

pip install pdf2image

Once installed, you can use the following code to render the pages as images.

from pdf2image import convert_from_path
# 'pdf_file' is the path to the PDF; 500 is the rendering resolution in DPI
pages = convert_from_path('pdf_file', 500)

Saving pages in jpeg format

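for number, page in enumerate(pages, start=1):
    # Give each page its own file; a fixed name would be overwritten on every pass
    page.save(f'out_{number}.jpg', 'JPEG')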
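If only one page is needed, rendering the whole document is wasteful. The sketch below limits the conversion to a single page through the first_page and last_page arguments of convert_from_path. The helper names jpeg_name and save_page_as_jpeg, the default of 200 DPI, and the file naming are assumptions chosen for illustration.

```python
from pathlib import Path


def jpeg_name(pdf_path, page_number):
    """Derive an output name such as 'report_page003.jpg' (hypothetical naming scheme)."""
    return f"{Path(pdf_path).stem}_page{page_number:03d}.jpg"


def save_page_as_jpeg(pdf_path, page_number, dpi=200, out_dir="."):
    """Render one 1-based page of the PDF and save it as a JPEG file."""
    from pdf2image import convert_from_path  # needs poppler's pdftoppm installed

    # Restrict rendering to the requested page instead of converting everything
    pages = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    if not pages:
        raise ValueError(f"page {page_number} was not found in {pdf_path}")
    target = Path(out_dir) / jpeg_name(pdf_path, page_number)
    pages[0].save(target, "JPEG")
    return target
```

A call such as save_page_as_jpeg('pdf_file', 1, dpi=500) would match the resolution used above.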

Edit: the pdf2image GitHub repository also mentions that it uses pdftoppm and therefore requires additional installations:

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler.
Windows users will have to install poppler for Windows.
Mac users will have to install poppler for Mac.
Linux users will have pdftoppm pre-installed with the distro (tested on Ubuntu and Arch Linux). If it is not, run sudo apt install poppler-utils.

On Windows, you can install the latest version with Anaconda:

conda install -c conda-forge poppler

Note: Windows builds up to 0.67 are available at http://blog.alivate.com.au/poppler-windows/, but 0.68 was released in Aug 2018, so you will not be getting the latest features or bug fixes from there.
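Since pdf2image depends on the pdftoppm executable, it can help to check that the tool is reachable before converting anything. This is a minimal sketch using only the standard library; the function name poppler_available is made up for illustration.

```python
import shutil


def poppler_available(tool="pdftoppm"):
    """Return True if the given poppler executable can be found on the PATH."""
    return shutil.which(tool) is not None


if not poppler_available():
    print("pdftoppm was not found on PATH; install poppler before using pdf2image")
```

If poppler is installed but not on the PATH, convert_from_path also accepts a poppler_path argument pointing at the folder that contains the executables.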

Extract images from PDF in high resolution with Python

As stated in this PyMuPDF issue on GitHub, you have to use a matrix.

The example given is:

zoom = 2    # zoom factor
mat = fitz.Matrix(zoom, zoom)
pix = page.getPixmap(matrix = mat, <...>)    # named page.get_pixmap in recent PyMuPDF releases

The issue also indicates that the default resolution is 72 dpi when no matrix is used, which likely explains why you are getting low-resolution output. A zoom factor of 2 therefore renders at 144 dpi.
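To target a specific resolution rather than guess a zoom factor, the zoom can be derived from the 72 dpi baseline. The sketch below assumes a recent PyMuPDF release where the method is called get_pixmap and the pixmap has a save method; the function names zoom_for_dpi and render_page, the 300 dpi default, and the output file name are assumptions for illustration.

```python
def zoom_for_dpi(dpi, base_dpi=72):
    """Convert a target resolution into a zoom factor relative to the 72 dpi default."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / base_dpi


def render_page(pdf_path, page_index, dpi=300):
    """Render one 0-based page at roughly the requested dpi and save it as a PNG."""
    import fitz  # PyMuPDF, imported lazily so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        # The same factor on both axes keeps the page's aspect ratio
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        out_name = f"page{page_index + 1}_{dpi}dpi.png"
        pix.save(out_name)
    return out_name
```

With these definitions, zoom_for_dpi(144) gives 2.0, which matches the zoom factor in the example from the issue.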


