How to Use PDFminer as a Library

How do I use pdfminer as a library

Here is a cleaned up version I finally produced that worked for me. The following just simply returns the string in a PDF, given its filename. I hope this saves someone time.

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

def convert_pdf(path):

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    str = retstr.getvalue()
    retstr.close()
    return str

This solution was valid until API changes in November 2013.

Extracting text from a PDF file using PDFMiner in python?

Here is a working example of extracting text from a PDF file using the current version of PDFMiner(September 2016)

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

PDFMiner's structure changed recently, so this should work for extracting text from the PDF files.

Edit : Still working as of the June 7th of 2018. Verified in Python Version 3.x

Edit: The solution works with Python 3.7 at October 3, 2019. I used the Python library pdfminer.six, released on November 2018.

How to use pdfminer.six's pdf2txt.py in python script and outside command line?

The good news is that you can use the PDFMiner library to recreate any attributes/commands you might run with pdf2text on the command line. See below for a basic example I use:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

def pdf_to_text(path):
    manager = PDFResourceManager()
    retstr = BytesIO()
    layout = LAParams(all_texts=True)
    device = TextConverter(manager, retstr, laparams=layout)
    filepath = open(path, 'rb')
    interpreter = PDFPageInterpreter(manager, device)

    for page in PDFPage.get_pages(filepath, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    filepath.close()
    device.close()
    retstr.close()
    return text

if __name__ == "__main__":
    text = pdf_to_text("yourfile.pdf")
    print(text)

If you need to apply page numbers or passwords, those are optional parameters in PDFPage.get_pages. Likewise if you need to make layout changes such as all-texts or margin-size, there are optional attributes for the LAParams initializer

Extract first page of pdf file using pdfminer library of python3

extract_pages has an optional argument which can do that:

def extract_pages(pdf_file, password='', page_numbers=None, maxpages=0,
                  caching=True, laparams=None):
    """Extract and yield LTPage objects
    :param pdf_file: Either a file path or a file-like object for the PDF file
        to be worked on.
    :param password: For encrypted PDFs, the password to decrypt.
    :param page_numbers: List of zero-indexed page numbers to extract.
    :param maxpages: The maximum number of pages to parse

Source: https://github.com/pdfminer/pdfminer.six/blob/22f90521b823ac5a22785d1439a64c7bdf2c2c6d/pdfminer/high_level.py#L126

So extract_pages(path, page_numbers=[0], maxpages=1)[0] should return only the first page data if I understand correctly.

How to use pdfminer.six

The official documentation assumes that .py scripts can automatically run. But that is not the case for all operating systems (if it is possible, your local system doesn't need to be set up to make it work).

To start PDFminer manually from the command line, use the regular way of starting a Python script:

python pdf2txt.py sample.pdf

and it will run the script and exit back to the command line when done. If you get an error somewhere or want to stay in Python for some reason, you can use

python -i pdf2txt.py sample.pdf