Python Module for Converting PDF to Text

Python module for converting PDF to text

Try PDFMiner. It can extract text from PDF files as HTML, SGML or "Tagged PDF" format.

The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text.
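
As a rough illustration of the tag-stripping step, here is a minimal sketch that assumes the tagged output is well-formed XML. The sample string and the helper name strip_xml_tags are made up for this example; real PDFMiner output will look different.

```python
import xml.etree.ElementTree as ET

def strip_xml_tags(xml_text):
    """Return the text content of a well-formed XML string with the tags removed."""
    root = ET.fromstring(xml_text)
    # itertext() yields every text fragment in document order
    return "".join(root.itertext())

# Hypothetical tagged output, used only to demonstrate the helper
sample = "<pages><page id='1'><text>Hello</text> <text>world</text></page></pages>"
print(strip_xml_tags(sample))  # prints: Hello world
```

If the output is not well-formed XML, a parser-based approach like this will raise an error, so treat it as a starting point only.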

A Python 3 version is available under:

  • https://github.com/pdfminer/pdfminer.six

How to convert whole pdf to text in python

If all you want is the text of the full document, you may want to use textract, as this answer recommends.

If you want to use PyPDF2, you can first get the number of pages and then iterate over each page:

from PyPDF2 import PdfFileReader
import os

def text_extractor(pdf_path):
    # Open the PDF in binary mode and concatenate the text of every page
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        text = ""
        for page_num in range(pdf.getNumPages()):
            page = pdf.getPage(page_num)
            text += page.extractText()
    print(text)

if __name__ == '__main__':
    path = "C:\\Users\\AAAA\\Desktop\\BB"
    for file in os.listdir(path):
        # Skip anything that is not a PDF
        if not file.endswith(".pdf"):
            continue
        text_extractor(os.path.join(path, file))

You may also want to remember which page the text came from, in which case you could use a list:

page_text = []
for page_num in range(pdf.getNumPages()):  # For each page
    page = pdf.getPage(page_num)  # Get that page's reference
    page_text.append(page.extractText())  # Add that page's text to our list
for page in page_text:
    print(page)  # Print each page
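
Note that PdfFileReader, getNumPages, getPage and extractText are the older PyPDF2 names. PyPDF2 3.x and its successor pypdf use PdfReader, reader.pages and extract_text() instead. Below is a minimal sketch of the same per-page idea with the newer names, assuming pypdf is installed. The helper names extract_pages and format_pages are hypothetical and only serve this example.

```python
def extract_pages(pdf_path):
    """Return a list with one text string per page of the PDF."""
    # Third-party import kept inside the function so the formatting helper
    # below can be used without pypdf installed
    from pypdf import PdfReader
    reader = PdfReader(pdf_path)
    # extract_text() may return an empty string for image-only pages
    return [page.extract_text() or "" for page in reader.pages]

def format_pages(page_text):
    """Join page texts, prefixing each with its 1-based page number."""
    return "\n".join(
        f"--- page {number} ---\n{text}"
        for number, text in enumerate(page_text, start=1)
    )
```

A call such as print(format_pages(extract_pages("my_file.pdf"))) would then print the whole document with a marker before each page.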

Convert pdf to text without creating a file

AFAIK, you will have to at least create a temp file so that you can
perform your process.

You can use the following code, which reads a PDF file and converts it to text.
It uses PDFMiner with Python 3.7.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import io

def convert(case, fname, pages=None):
    # 'case' is kept because the calling script passes it, but it is not used here
    # An empty set of page numbers means "all pages"
    pagenums = set(pages) if pages else set()
    manager = PDFResourceManager()
    output = io.StringIO()  # collect the text in memory
    converter = TextConverter(manager, output, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums, caching=True, check_extractable=True):
            interpreter.process_page(page)

    convertedPDF = output.getvalue()
    print(convertedPDF)

    converter.close()
    output.close()
    return convertedPDF

Main script that calls the function above (assuming the code above is saved as converter.py):

import os
import converter

class ConvertMultiple:
    @staticmethod
    def convert_multiple(pdf_dir, txt_dir):
        if pdf_dir == "":
            pdf_dir = os.getcwd()  # if no pdf_dir is passed in, use the current directory
        for pdf in os.listdir(pdf_dir):  # iterate through the files in the PDF directory
            print("File name is %s" % pdf)
            file_extension = pdf.split(".")[-1]
            print("File extension is %s" % file_extension)
            if file_extension == "pdf":
                path = os.path.join(pdf_dir, pdf)
                print(path)
                text = converter.convert('text', path)  # get the text content of the PDF
                text_file_name = os.path.join(txt_dir, pdf + ".txt")
                # The with block closes the text file after writing
                with open(text_file_name, "w", encoding="utf-8") as text_file:
                    text_file.write(text)

pdf_dir = "E:/pdf"
txt_dir = "E:/text"
ConvertMultiple.convert_multiple(pdf_dir, txt_dir)

Of course, you can tune it further and there may be room for improvement, but it works.

Just make sure to pass a temporary PDF file directly instead of a PDF folder.
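
One way the temp-file step could look, as a sketch under assumptions: the PDF content is already available as bytes, and convert is any function that takes a file path (for example a small wrapper around the convert function above). The helper name convert_bytes_via_temp_file is hypothetical.

```python
import os
import tempfile

def convert_bytes_via_temp_file(pdf_bytes, convert):
    """Write pdf_bytes to a temporary file, run convert(path), then clean up."""
    # The directory and everything in it is deleted when the with block exits
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = os.path.join(tmp_dir, "input.pdf")
        with open(tmp_path, "wb") as f:
            f.write(pdf_bytes)
        return convert(tmp_path)
```

A temporary directory is used here rather than NamedTemporaryFile because an open NamedTemporaryFile cannot always be reopened by path on Windows.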

Hope this helps. Happy coding!

Best tool for text extraction from PDF in Python 3.4

You can install the PyPDF2 module to work with PDFs in Python 3.4. PyPDF2 cannot extract images, charts or other media, but it can extract text and return it as a Python string. To install it, run pip install PyPDF2 from the command line. The module name is case-sensitive when you import it, so type the 'y' in lowercase and all other letters in uppercase.

>>> import PyPDF2
>>> pdfFileObj = open('my_file.pdf','rb') #'rb' for read binary mode
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pdfReader.numPages
56
>>> pageObj = pdfReader.getPage(9) # index 9 is zero-based, i.e. the tenth page
>>> pageObj.extractText()

The last statement returns all the text available on the page at index 9 (the tenth page, since indices start at 0) of the 'my_file.pdf' document.

How to convert Web PDF to Text

There are different ways to do this. The simplest is to download the PDF locally and then use one of the following Python modules to extract the text (with OCR, if the pages are scanned):

  • pdfplumber
  • tesseract
  • pdftotext
  • ...

Here is a simple code example for that (using pdfplumber):

from urllib.request import urlopen
import pdfplumber

url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'

# Download the PDF and save it locally
response = urlopen(url)
with open("img.pdf", 'wb') as file:
    file.write(response.read())

try:
    pdf = pdfplumber.open('img.pdf')
except Exception:
    # Some files are not PDFs (e.g. annexes we don't want), or the PDF is damaged
    print('Error. Are you sure this is a PDF?')
else:
    # pdfplumber text extraction from the first page
    page = pdf.pages[0]
    text = page.extract_text()
    pdf.close()
EDIT: My bad, I just realised you asked "without saving it to my PC".
That said, I also scrape a lot of PDFs (thousands as well), but I save them all as "img.pdf", so each download replaces the previous one and I end up with only one PDF file on disk. I don't have a solution for PDF OCR without saving the file. Sorry about that :'(
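
For PDFs that already contain a text layer (no OCR needed), one possible way to avoid the file on disk is to keep the download in memory. This is a sketch, relying on pdfplumber.open accepting a file-like object as well as a path. The helper names fetch_pdf_stream and first_page_text are hypothetical.

```python
import io
from urllib.request import urlopen

def fetch_pdf_stream(url):
    """Download url and return its bytes wrapped in an in-memory stream."""
    with urlopen(url) as response:
        return io.BytesIO(response.read())

def first_page_text(pdf_stream):
    """Extract the text of the first page from a file-like object."""
    # Third-party import kept here so the download helper works without pdfplumber
    import pdfplumber
    with pdfplumber.open(pdf_stream) as pdf:
        return pdf.pages[0].extract_text()
```

With these two helpers, text = first_page_text(fetch_pdf_stream(url)) would extract the first page without writing "img.pdf" at all.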

Converting a PDF file to a Text file in Python

You could use pdftotext.exe that you can download from http://www.foolabs.com/xpdf/download.html and then execute it on your pdf files via Python:

import os
import glob
import subprocess

# Remember to put pdftotext.exe in the folder with your PDF files
for filename in glob.glob(os.getcwd() + '\\*.pdf'):
    # Write the text next to the PDF, replacing the ".pdf" extension with ".txt"
    subprocess.call([os.getcwd() + '\\pdftotext', filename, filename[0:-4] + ".txt"])
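
A slightly more portable variant of the same idea, as a sketch: it builds the paths with pathlib and uses subprocess.run with check=True so that a failing conversion raises an error instead of passing silently. It assumes pdftotext is on the PATH (or that its location is passed as exe). The helper names are hypothetical.

```python
import subprocess
from pathlib import Path

def build_pdftotext_command(pdf_path, exe="pdftotext"):
    """Return the argument list for converting pdf_path to a .txt file beside it."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    return [exe, str(pdf_path), str(txt_path)]

def convert_folder(folder, exe="pdftotext"):
    """Run pdftotext on every .pdf file in folder."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        # check=True raises CalledProcessError if pdftotext exits with an error
        subprocess.run(build_pdftotext_command(pdf_path, exe), check=True)
```

Calling convert_folder(".") would convert every PDF in the current directory.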

At least it worked for one of my projects.

How to extract text from pdf in Python 3.7

Using tika worked for me!

from tika import parser

rawText = parser.from_file('January2019.pdf')

rawList = rawText['content'].splitlines()

This made it really easy to split the bank statement into a list of lines.
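
The resulting list usually contains many blank lines. Here is a small sketch for tidying it up. It assumes the 'content' value can be missing (None) when no text was found, and the helper name clean_lines is hypothetical.

```python
def clean_lines(content):
    """Split extracted text into stripped, non-empty lines."""
    if not content:
        # Nothing was extracted (None or an empty string)
        return []
    return [line.strip() for line in content.splitlines() if line.strip()]
```

With the snippet above, rawList = clean_lines(rawText['content']) would replace the plain splitlines() call.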

Convert scanned pdf to text python

Take a look at this library: https://pypi.python.org/pypi/pypdfocr
Keep in mind, though, that a PDF file can also contain images. You may be able to analyse the page content streams, but some scanners break up a single scanned page into several images, so you won't get the text with Ghostscript.
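
A simple heuristic for deciding which pages need OCR at all is to check whether ordinary text extraction returned anything useful. This sketch assumes you already have one extracted string per page from any of the libraries above. The helper name pages_needing_ocr and the 20-character threshold are arbitrary choices for illustration.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return the 1-based numbers of pages whose extracted text is too short."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        # Treat None, empty strings and whitespace-only text as "no text layer"
        if len((text or "").strip()) < min_chars
    ]
```

Pages reported by this check are likely scanned images and are candidates for an OCR tool.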


