How do I use pdfminer as a library
Here is a cleaned-up version that finally worked for me. It returns the text of a PDF as a string, given the file's name. I hope this saves someone time.
# Legacy Python 2 code (cStringIO and file() do not exist in Python 3).
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

def convert_pdf(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    # Named "text" rather than "str" to avoid shadowing the built-in.
    text = retstr.getvalue()
    retstr.close()
    return text
This solution was valid until API changes in November 2013.
Extracting text from a PDF file using PDFMiner in python?
Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text
PDFMiner's structure changed recently, so this is the approach that should work for extracting text from PDF files.
Edit: Still working as of June 7, 2018. Verified with Python 3.x.
Edit: The solution works with Python 3.7 as of October 3, 2019. I used the Python library pdfminer.six, released in November 2018.
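With a newer pdfminer.six release, much of the setup above can probably be skipped. The following is a minimal sketch, assuming an installed pdfminer.six version that provides `pdfminer.high_level.extract_text`. The helper names `pdf_text` and `split_pages` are made up for this example, and the form-feed page separator is an assumption about how the text converter ends each page, so check it against your own output.

```python
def pdf_text(path):
    """Return all text in the PDF at `path` as one string."""
    # Imported lazily so split_pages() below is usable without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def split_pages(text):
    """Split extracted text into one chunk per page."""
    # Assumption: each page's text is terminated by a form feed character.
    pages = text.split("\f")
    # Drop the empty chunk left behind by the final form feed.
    if pages and pages[-1] == "":
        pages.pop()
    return pages
```

Usage would look like `split_pages(pdf_text("yourfile.pdf"))`, where `yourfile.pdf` is a placeholder path.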
How to use pdfminer.six's pdf2txt.py in python script and outside command line?
The good news is that you can use the PDFMiner library to reproduce any of the options you might pass to pdf2txt on the command line. Below is a basic example I use:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

def pdf_to_text(path):
    manager = PDFResourceManager()
    retstr = BytesIO()
    layout = LAParams(all_texts=True)
    device = TextConverter(manager, retstr, laparams=layout)
    filepath = open(path, 'rb')
    interpreter = PDFPageInterpreter(manager, device)

    for page in PDFPage.get_pages(filepath, check_extractable=True):
        interpreter.process_page(page)

    # retstr is a BytesIO, so this is a bytes object (UTF-8 by default);
    # call text.decode('utf-8') if you need a str.
    text = retstr.getvalue()

    filepath.close()
    device.close()
    retstr.close()
    return text

if __name__ == "__main__":
    text = pdf_to_text("yourfile.pdf")
    print(text)
If you need to restrict extraction to certain page numbers or supply a password, those are optional parameters of PDFPage.get_pages. Likewise, if you need to change layout settings such as all_texts or the margin sizes, there are optional arguments to the LAParams initializer.
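As a rough illustration of those optional parameters, here is a sketch that passes page numbers, a password, and a few layout settings. It assumes a pdfminer.six version whose TextConverter accepts a text-mode StringIO. The function names, the 1-based page convention of the `pages` argument, and the specific margin values are choices made for this example, not part of the original answer.

```python
from io import StringIO


def to_zero_indexed(pages):
    """Convert 1-based page numbers to the zero-indexed set get_pages uses."""
    return {p - 1 for p in pages}


def pdf_to_text(path, pages=None, password=""):
    # Imported lazily so to_zero_indexed() is usable without pdfminer.six.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    manager = PDFResourceManager()
    out = StringIO()
    # Example layout settings; tune the margins for your documents.
    layout = LAParams(all_texts=True, char_margin=2.0, line_margin=0.5)
    device = TextConverter(manager, out, laparams=layout)
    interpreter = PDFPageInterpreter(manager, device)

    # None means "all pages"; otherwise only the listed pages are processed.
    pagenos = to_zero_indexed(pages) if pages else None
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp, pagenos=pagenos,
                                      password=password,
                                      check_extractable=True):
            interpreter.process_page(page)

    device.close()
    return out.getvalue()
```

For instance, `pdf_to_text("yourfile.pdf", pages=[1, 3])` would be expected to return the text of the first and third pages only.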
Extract first page of pdf file using pdfminer library of python3
extract_pages has an optional argument which can do that:
def extract_pages(pdf_file, password='', page_numbers=None, maxpages=0,
                  caching=True, laparams=None):
    """Extract and yield LTPage objects
    :param pdf_file: Either a file path or a file-like object for the PDF file
        to be worked on.
    :param password: For encrypted PDFs, the password to decrypt.
    :param page_numbers: List of zero-indexed page numbers to extract.
    :param maxpages: The maximum number of pages to parse
Source: https://github.com/pdfminer/pdfminer.six/blob/22f90521b823ac5a22785d1439a64c7bdf2c2c6d/pdfminer/high_level.py#L126
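So next(extract_pages(path, page_numbers=[0], maxpages=1)) should return only the first page's data, if I understand correctly. Because extract_pages yields its results as a generator, the first page has to be taken with next() rather than indexed with [0].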
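To turn that first page into plain text, something along these lines may work. This is a sketch, assuming a pdfminer.six version that provides `pdfminer.high_level.extract_pages`. The helper names are invented for this example, and the `hasattr` check is a duck-typed stand-in for testing against pdfminer's text layout classes.

```python
def page_text(page):
    """Join the text of every layout element on the page that has get_text()."""
    return "".join(
        element.get_text() for element in page if hasattr(element, "get_text")
    )


def first_page_text(path):
    # Imported lazily so page_text() is usable without pdfminer.six.
    from pdfminer.high_level import extract_pages

    # extract_pages is a generator; next() pulls the first LTPage from it.
    # This raises StopIteration if the PDF yields no pages.
    first_page = next(extract_pages(path, page_numbers=[0], maxpages=1))
    return page_text(first_page)
```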
How to use pdfminer.six
The official documentation assumes that .py scripts can be run directly. That is not the case on every operating system, and even where it is possible, your local system may not be set up for it.
To start PDFMiner manually from the command line, use the regular way of starting a Python script:
python pdf2txt.py sample.pdf
and it will run the script and exit back to the command line when done. If you get an error somewhere or want to stay in Python for some reason, you can use
python -i pdf2txt.py sample.pdf
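The same invocation can also be driven from another Python program. The sketch below assumes pdf2txt.py is reachable at the path you pass in and that it writes the extracted text to standard output when no output file is given. The function names are hypothetical.

```python
import subprocess
import sys


def build_command(script, pdf_path):
    """Build the argument list equivalent to `python pdf2txt.py sample.pdf`."""
    # sys.executable is the interpreter running this program, which avoids
    # relying on whichever "python" happens to be first on PATH.
    return [sys.executable, script, pdf_path]


def run_pdf2txt(script, pdf_path):
    """Run the script and return whatever it printed to standard output."""
    result = subprocess.run(
        build_command(script, pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```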