How to extract images with pdfminer
I have never used pdfminer. However, I found this code and this document from Denis Papathanasiou explaining it, which might help you figure this out, as pdfminer's documentation is not very exhaustive. The document describes an outdated version, but the code was recently updated.
If you are not required to use pdfminer, there are alternatives that might be easier, such as PyMuPDF. The approach in this answer uses it to extract all images in the PDF as PNG files.
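As a rough illustration of the PyMuPDF route, here is a minimal sketch that walks every page and saves each embedded image as a PNG. It assumes a recent PyMuPDF release (snake_case method names such as `get_images` and `Pixmap.save`); the function names `image_filename` and `extract_images` and the `images` output folder are hypothetical and not taken from the linked answer.

```python
from pathlib import Path


def image_filename(page_number: int, xref: int) -> str:
    """Build a unique file name from the 1-based page number and the image xref."""
    return f"page{page_number:03d}_img{xref}.png"


def extract_images(pdf_path: str, out_dir: str = "images") -> list:
    """Save every embedded image in the PDF as a PNG and return the saved paths."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with fitz.open(pdf_path) as doc:
        for page_index in range(len(doc)):
            for img in doc[page_index].get_images(full=True):
                xref = img[0]  # first item of each entry is the image's xref
                pix = fitz.Pixmap(doc, xref)
                if pix.n - pix.alpha > 3:
                    # CMYK (or similar) cannot be written as PNG: convert to RGB
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                target = out / image_filename(page_index + 1, xref)
                pix.save(str(target))
                saved.append(target)
    return saved
```

Calling `extract_images('pdf_file')` should then leave one PNG per embedded image in the `images` folder. An image that is referenced on several pages is written once per page in this sketch.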
Extract a page from a PDF as a JPEG
The pdf2image library can be used.
Install it with:
pip install pdf2image
Once installed, you can use the following code to get the pages as images:
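from pdf2image import convert_from_path

# Render every page at 500 DPI; each item in the list is a PIL image
pages = convert_from_path('pdf_file', 500)

# Save each page in JPEG format under its own file name,
# otherwise every page overwrites the previous one
for i, page in enumerate(pages, start=1):
    page.save(f'out_{i}.jpg', 'JPEG')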
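If only one page is needed, as the question title suggests, `convert_from_path` also accepts `first_page` and `last_page` (both 1-based), so the rest of the document does not have to be rendered. The following is a minimal sketch under that assumption; the helper names `page_output_name` and `save_page_as_jpeg` and the default DPI of 200 are illustrative choices, not part of the original answer.

```python
def page_output_name(stem: str, page_number: int) -> str:
    """Return a JPEG file name that encodes the page number."""
    return f"{stem}_page{page_number}.jpg"


def save_page_as_jpeg(pdf_path: str, page_number: int,
                      dpi: int = 200, stem: str = "out") -> str:
    """Render a single 1-based page of the PDF and save it as a JPEG."""
    from pdf2image import convert_from_path  # requires poppler to be installed

    # Restrict rendering to the one requested page
    pages = convert_from_path(
        pdf_path,
        dpi=dpi,
        first_page=page_number,
        last_page=page_number,
    )
    target = page_output_name(stem, page_number)
    pages[0].save(target, "JPEG")
    return target
```

For example, `save_page_as_jpeg('pdf_file', 3)` would write the third page to `out_page3.jpg`.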
Edit: the pdf2image GitHub repository also mentions that it uses pdftoppm and that it requires other installations:
pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler.
Windows users will have to install poppler for Windows.
Mac users will have to install poppler for Mac.
Linux users will have pdftoppm pre-installed with the distro (tested on Ubuntu and Arch Linux). If it's not, run sudo apt install poppler-utils.
You can install the latest version under Windows using anaconda by doing:
conda install -c conda-forge poppler
Note: Windows versions up to 0.67 are available at http://blog.alivate.com.au/poppler-windows/, but 0.68 was released in Aug 2018, so you will not be getting the latest features or bug fixes from that site.
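Because pdf2image depends on the pdftoppm binary, it can help to check that poppler is reachable before converting anything. The sketch below uses only the standard library to look for pdftoppm on the PATH and, if you installed poppler somewhere else, builds the `poppler_path` keyword argument that `convert_from_path` accepts. The helper names `find_pdftoppm` and `poppler_kwargs` are hypothetical.

```python
import shutil


def find_pdftoppm():
    """Return the full path of the pdftoppm executable, or None if it is not on PATH."""
    return shutil.which("pdftoppm")


def poppler_kwargs(poppler_dir=None) -> dict:
    """Build extra keyword arguments for pdf2image's convert_from_path."""
    if poppler_dir:
        # Explicit location of the poppler binaries, e.g. an unpacked Windows build
        return {"poppler_path": poppler_dir}
    if find_pdftoppm() is None:
        raise RuntimeError(
            "pdftoppm was not found on PATH; install poppler "
            "or pass the folder that contains its binaries"
        )
    return {}
```

The result can be passed straight through, for example `convert_from_path('pdf_file', 500, **poppler_kwargs())`.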
Extract images from PDF in high resolution with Python
As stated in this issue for PyMuPDF on GitHub, you have to use a matrix.
The example given is:
zoom = 2 # zoom factor
mat = fitz.Matrix(zoom, zoom)
pix = page.getPixmap(matrix = mat, <...>)
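The issue also indicates that the default resolution is 72 dpi if you don't use a matrix, which likely explains why you are getting low resolution. With a zoom factor of 2, the page is rendered at 144 dpi. In newer PyMuPDF releases the method is spelled get_pixmap.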
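Since the base resolution is 72 dpi, the zoom factor for a target resolution is simply the target divided by 72. Here is a minimal sketch of that calculation together with a render call; it assumes a recent PyMuPDF release where the method is named `get_pixmap`, and the helper names `zoom_for_dpi` and `render_page` are made up for illustration.

```python
def zoom_for_dpi(dpi: float, base_dpi: float = 72.0) -> float:
    """Return the zoom factor that scales the 72 dpi default up to the target dpi."""
    return dpi / base_dpi


def render_page(pdf_path: str, page_index: int,
                dpi: float = 300, out_path: str = "page.png") -> str:
    """Render one 0-based page at the requested resolution and save it."""
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        # Same zoom on both axes keeps the aspect ratio unchanged
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(out_path)
    return out_path
```

For instance, `render_page('pdf_file', 0, dpi=300)` would use a zoom factor of about 4.17 for the first page.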