How to Extract Text from a PDF File

How to extract text from a PDF in Python 3.7

Using tika worked for me!

from tika import parser

rawText = parser.from_file('January2019.pdf')

rawList = rawText['content'].splitlines()

This made it really easy to separate each line of the bank statement into a list.
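A minimal sketch of a slightly more defensive version of the split above. It assumes the result is the dict returned by parser.from_file() and that its 'content' value may be None (for example, when the PDF has no text layer). The helper name content_lines is hypothetical, not part of Tika.

```python
def content_lines(parsed):
    """Return the stripped, non-blank lines of a Tika-style result dict."""
    # 'content' may be missing or None, so fall back to an empty string.
    content = parsed.get("content") or ""
    return [line.strip() for line in content.splitlines() if line.strip()]


# Usage with Tika (needs the tika package and a Java runtime):
#   from tika import parser
#   lines = content_lines(parser.from_file('January2019.pdf'))

# Stand-in dict so the sketch runs without Tika installed.
print(content_lines({"content": "\n\nDate  Amount\n01/01  10.00\n\n"}))
```

Dropping blank lines up front keeps later per-line parsing of the statement simpler.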

How to extract text and text coordinates from a PDF file?

Full disclosure: I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for Python 3.

Nowadays, pdfminer.six has multiple APIs to extract text and information from a PDF. For programmatically extracting information, I would advise using extract_pages(). This allows you to inspect all of the elements on a page, ordered in a meaningful hierarchy created by the layout algorithm.

The following example is a Pythonic way of showing all the elements in the hierarchy. It uses the simple1.pdf from the samples directory of pdfminer.six.

from pathlib import Path
from typing import Iterable, Any

from pdfminer.high_level import extract_pages


def show_ltitem_hierarchy(o: Any, depth=0):
    """Show location and text of LTItem and all its descendants"""
    if depth == 0:
        print('element                        x1  y1  x2  y2   text')
        print('------------------------------ --- --- --- ---- -----')

    print(
        f'{get_indented_name(o, depth):<30.30s} '
        f'{get_optional_bbox(o)} '
        f'{get_optional_text(o)}'
    )

    if isinstance(o, Iterable):
        for i in o:
            show_ltitem_hierarchy(i, depth=depth + 1)


def get_indented_name(o: Any, depth: int) -> str:
    """Indented name of LTItem"""
    return '  ' * depth + o.__class__.__name__


def get_optional_bbox(o: Any) -> str:
    """Bounding box of LTItem if available, otherwise empty string"""
    if hasattr(o, 'bbox'):
        return ''.join(f'{i:<4.0f}' for i in o.bbox)
    return ''


def get_optional_text(o: Any) -> str:
    """Text of LTItem if available, otherwise empty string"""
    if hasattr(o, 'get_text'):
        return o.get_text().strip()
    return ''


path = Path('~/Downloads/simple1.pdf').expanduser()

pages = extract_pages(path)
show_ltitem_hierarchy(pages)

The output shows the different elements in the hierarchy, the bounding box of each, and the text that each element contains.

element                        x1  y1  x2  y2   text
------------------------------ --- --- --- ---- -----
generator
  LTPage                       0   0   612 792
    LTTextBoxHorizontal        100 695 161 719  Hello
      LTTextLineHorizontal     100 695 161 719  Hello
        LTChar                 100 695 117 719  H
        LTChar                 117 695 131 719  e
        LTChar                 131 695 136 719  l
        LTChar                 136 695 141 719  l
        LTChar                 141 695 155 719  o
        LTChar                 155 695 161 719
        LTAnno
    LTTextBoxHorizontal        261 695 324 719  World
      LTTextLineHorizontal     261 695 324 719  World
        LTChar                 261 695 284 719  W
        LTChar                 284 695 297 719  o
        LTChar                 297 695 305 719  r
        LTChar                 305 695 311 719  l
        LTChar                 311 695 324 719  d
        LTAnno
    LTTextBoxHorizontal        100 595 161 619  Hello
      LTTextLineHorizontal     100 595 161 619  Hello
        LTChar                 100 595 117 619  H
        LTChar                 117 595 131 619  e
        LTChar                 131 595 136 619  l
        LTChar                 136 595 141 619  l
        LTChar                 141 595 155 619  o
        LTChar                 155 595 161 619
        LTAnno
    LTTextBoxHorizontal        261 595 324 619  World
      LTTextLineHorizontal     261 595 324 619  World
        LTChar                 261 595 284 619  W
        LTChar                 284 595 297 619  o
        LTChar                 297 595 305 619  r
        LTChar                 305 595 311 619  l
        LTChar                 311 595 324 619  d
        LTAnno
    LTTextBoxHorizontal        100 495 211 519  H e l l o
      LTTextLineHorizontal     100 495 211 519  H e l l o
        LTChar                 100 495 117 519  H
        LTAnno
        LTChar                 127 495 141 519  e
        LTAnno
        LTChar                 151 495 156 519  l
        LTAnno
        LTChar                 166 495 171 519  l
        LTAnno
        LTChar                 181 495 195 519  o
        LTAnno
        LTChar                 205 495 211 519
        LTAnno
    LTTextBoxHorizontal        321 495 424 519  W o r l d
      LTTextLineHorizontal     321 495 424 519  W o r l d
        LTChar                 321 495 344 519  W
        LTAnno
        LTChar                 354 495 367 519  o
        LTAnno
        LTChar                 377 495 385 519  r
        LTAnno
        LTChar                 395 495 401 519  l
        LTAnno
        LTChar                 411 495 424 519  d
        LTAnno
    LTTextBoxHorizontal        100 395 211 419  H e l l o
      LTTextLineHorizontal     100 395 211 419  H e l l o
        LTChar                 100 395 117 419  H
        LTAnno
        LTChar                 127 395 141 419  e
        LTAnno
        LTChar                 151 395 156 419  l
        LTAnno
        LTChar                 166 395 171 419  l
        LTAnno
        LTChar                 181 395 195 419  o
        LTAnno
        LTChar                 205 395 211 419
        LTAnno
    LTTextBoxHorizontal        321 395 424 419  W o r l d
      LTTextLineHorizontal     321 395 424 419  W o r l d
        LTChar                 321 395 344 419  W
        LTAnno
        LTChar                 354 395 367 419  o
        LTAnno
        LTChar                 377 395 385 419  r
        LTAnno
        LTChar                 395 395 401 419  l
        LTAnno
        LTChar                 410 395 424 419  d
        LTAnno
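
If you only need the text lines and their coordinates rather than the full dump, the same recursive walk can collect them. The following is a rough sketch, not part of pdfminer.six: the function name iter_text_with_bbox is hypothetical, and it matches elements by class name only so that the snippet can be imported without pdfminer.six installed. In real code, an isinstance() check against pdfminer.layout.LTTextLineHorizontal would be the more robust choice.

```python
from collections.abc import Iterable
from typing import Any, Iterator, Tuple


def iter_text_with_bbox(
    item: Any, class_name: str = "LTTextLineHorizontal"
) -> Iterator[Tuple[tuple, str]]:
    """Yield (bbox, text) for every descendant whose class name matches."""
    is_match = (
        item.__class__.__name__ == class_name
        and hasattr(item, "bbox")
        and hasattr(item, "get_text")
    )
    if is_match:
        yield tuple(item.bbox), item.get_text().strip()
        return  # do not descend into the characters of a matched line
    # Strings are iterable too, so exclude them to avoid endless recursion.
    if isinstance(item, Iterable) and not isinstance(item, str):
        for child in item:
            yield from iter_text_with_bbox(child, class_name)


# Usage with pdfminer.six (assumes the same sample file as above):
#   from pdfminer.high_level import extract_pages
#   for bbox, text in iter_text_with_bbox(extract_pages('simple1.pdf')):
#       print(bbox, text)
```

Each bbox is the (x1, y1, x2, y2) tuple shown in the columns of the output above.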


How to extract text from multiple pdf in a location with specific line and store in Excel?

Tika is one of the Python packages that you can use to extract the data from your PDF files.

In the example below I'm using Tika and regular expressions to extract these five data elements:

  • bid no
  • end date
  • item category
  • organisation name
  • total quantity

import re as regex
from tika import parser

parse_entire_pdf = parser.from_file('2022251527199.pdf', xmlContent=True)
for key, values in parse_entire_pdf.items():
    if key == 'content':
        bid_number = regex.search(r'(Bid Number:)\W(GEM\W\d{4}\W[A-Z]\W\d+)', values)
        print(bid_number.group(2))
        # GEM/2022/B/1916455

        bid_end_date = regex.search(r'(Bid End Date\WTime)\W(\d{2}-\d{2}-\d{4}\W\d{2}:\d{2}:\d{2})', values)
        print(bid_end_date.group(2))
        # 21-02-2022 15:00:00

        org_name = regex.search(r'(Organisation Name)\W(.*)', values)
        print(org_name.group(2))
        # State Election Commission (sec), Gujarat

        item_category = regex.search(r'(Item Category)\W(.*)', values)
        print(item_category.group(2))
        # Desktop Computers (Q2) , Computer Printers (Q2)

        total_quantity = regex.search(r'(Total Quantity)\W(\d+)', values)
        print(total_quantity.group(2))
        # 18

Here is one way to write out the extracted data to a CSV file:

import csv
import re as regex
from tika import parser

document_elements = []

# processing 2 documents
documents = ['202225114747453.pdf', '2022251527199.pdf']
for doc in documents:
    parse_entire_pdf = parser.from_file(doc, xmlContent=True)
    for key, values in parse_entire_pdf.items():
        if key == 'content':
            bid_number = regex.search(r'(Bid Number:)\W(GEM\W\d{4}\W[A-Z]\W\d+)', values)

            bid_end_date = regex.search(r'(Bid End Date\WTime)\W(\d{2}-\d{2}-\d{4}\W\d{2}:\d{2}:\d{2})', values)

            org_name = regex.search(r'(Organisation Name)\W(.*)', values)

            item_category = regex.search(r'(Item Category)\W(.*)', values)

            total_quantity = regex.search(r'(Total Quantity)\W(\d+)', values)

            document_elements.append([bid_number.group(2),
                                      bid_end_date.group(2),
                                      org_name.group(2),
                                      item_category.group(2),
                                      total_quantity.group(2)])

with open("out.csv", "w", newline="") as f:
    headerList = ['bid_number', 'bid_end_date', 'org_name', 'item_category', 'total_quantity']
    writer = csv.writer(f)
    writer.writerow(headerList)
    writer.writerows(document_elements)


Here is the additional code that you asked for in the comments.

import csv  # needed for csv.writer below
import os
import re as regex
from tika import parser

document_elements = []

# walk the directory tree and process every PDF file found
image_directory = "pdf_files"
image_directory_abspath = os.path.abspath(image_directory)
for dirpath, dirnames, filenames in os.walk(image_directory_abspath):
    for filename in [f for f in filenames if f.endswith(".pdf")]:
        parse_entire_pdf = parser.from_file(os.path.join(dirpath, filename), xmlContent=True)
        for key, values in parse_entire_pdf.items():
            if key == 'content':
                bid_number = regex.search(r'(Bid Number:)\W(GEM\W\d{4}\W[A-Z]\W\d+)', values)

                bid_end_date = regex.search(r'(Bid End Date\WTime)\W(\d{2}-\d{2}-\d{4}\W\d{2}:\d{2}:\d{2})', values)

                org_name = regex.search(r'(Organisation Name)\W(.*)', values)

                item_category = regex.search(r'(Item Category)\W(.*)', values)

                total_quantity = regex.search(r'(Total Quantity)\W(\d+)', values)

                document_elements.append([bid_number.group(2),
                                          bid_end_date.group(2),
                                          org_name.group(2),
                                          item_category.group(2),
                                          total_quantity.group(2)])

with open("out.csv", "w", newline="") as f:
    headerList = ['bid_number', 'bid_end_date', 'org_name', 'item_category', 'total_quantity']
    writer = csv.writer(f)
    writer.writerow(headerList)
    writer.writerows(document_elements)

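SPECIAL NOTE: I noticed that some PDFs don't have an org_name. When a field is missing, the search finds no match and returns None, so calling .group(2) on it raises an AttributeError. You will have to handle these cases with N/A, None, or a similar placeholder.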
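
One possible way to handle missing fields is to keep the patterns in a dict and substitute a default whenever a search finds nothing. This is only a sketch: the patterns are copied from the code above, while the PATTERNS dict, the extract_fields name, and the "N/A" default are assumptions made for illustration.

```python
import re

# Patterns copied from the answer above; the dict layout is an assumption.
PATTERNS = {
    "bid_number": r"(Bid Number:)\W(GEM\W\d{4}\W[A-Z]\W\d+)",
    "bid_end_date": r"(Bid End Date\WTime)\W(\d{2}-\d{2}-\d{4}\W\d{2}:\d{2}:\d{2})",
    "org_name": r"(Organisation Name)\W(.*)",
    "item_category": r"(Item Category)\W(.*)",
    "total_quantity": r"(Total Quantity)\W(\d+)",
}


def extract_fields(text, default="N/A"):
    """Return one dict per document; fields that do not match get `default`."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        # re.search returns None when nothing matches, so guard before .group()
        row[name] = match.group(2).strip() if match else default
    return row


# Made-up sample text with only two of the five fields present.
sample = "Bid Number: GEM/2022/B/1916455\nTotal Quantity 18\n"
print(extract_fields(sample))
```

The resulting dicts can be written out with csv.DictWriter, using the keys of PATTERNS as the field names.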

How to extract text from this compressed PDF/A?

If you want to decompress the streams in a PDF file, I can recommend using qpdf, but on this file

 qpdf --decrypt --stream-data=uncompress document.pdf out.pdf

doesn't help either.

I am not sure, though, why your efforts with xpdf and tesseract did not work out. Using ImageMagick's convert to create PNG files in a temporary directory, and then tesseract to OCR them, you can do:

import os
from pathlib import Path
from tempfile import TemporaryDirectory
import subprocess

DPI = 600


def call(*args):
    cmd = [str(x) for x in args]
    # print(cmd)  # uncomment to get some sense of progress
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode('utf-8')


def ocr(docpath, lang):
    result = []
    abs_path = Path(docpath).expanduser().resolve()
    old_dir = os.getcwd()
    out = Path('out.txt')
    with TemporaryDirectory() as tmpdir:
        os.chdir(tmpdir)
        # render every page of the PDF to out-0.png, out-1.png, ...
        call('convert', '-density', DPI, abs_path, 'out.png')
        index = -1
        while True:
            # names have no leading zeros on the digits, would be difficult to sort glob() output
            # so just count them
            index += 1
            png = Path(f'out-{index}.png')
            if not png.exists():
                break
            call('tesseract', '--dpi', DPI, png, out.stem, '-l', lang)
            result.append(out.read_text())
        # leave the temporary directory before it is removed
        os.chdir(old_dir)
    return result


pages = ocr('~/Downloads/document.pdf', 'por')
print('\n'.join(pages[1].splitlines()[21:24]))

which gives:

DA NÃO REALIZAÇÃO DE AUDIÊNCIA DE AUTOCOMPOSIÇÃO NO CASO EM CONCRETO

Com vista a obter maior celeridade processual, assim como da impossibilidade de conciliação entre

If you are on Windows, make sure your PDF file is not open in a different process (like a PDF viewer), as Windows doesn't seem to like that.

The final print is limited as the full output is quite large.

This converting and OCR-ing takes a while so you might want to uncomment the print in call() to get some sense of progress.
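
As an alternative to counting the page files, the glob() output can be sorted numerically by the page index in the file name. This is a sketch only, assuming the out-N.png naming produced by the convert call above. The helper names page_index and sort_pages are hypothetical.

```python
import re
from pathlib import Path


def page_index(path):
    """Return the integer N from a file name of the form out-N.png."""
    match = re.search(r"-(\d+)\.png$", Path(path).name)
    if match is None:
        raise ValueError(f"not a numbered page image: {path}")
    return int(match.group(1))


def sort_pages(paths):
    """Sort page images by numeric index, so out-2.png comes before out-10.png."""
    return sorted(paths, key=page_index)


# Usage inside the temporary directory:
#   for png in sort_pages(Path('.').glob('out-*.png')):
#       ...

print(sort_pages(["out-10.png", "out-2.png", "out-0.png"]))
```

A plain string sort would place out-10.png before out-2.png, which is why the key converts the index to an integer.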


