How to Read PDF Files One by One from a Folder in Python

How to read PDF files one by one from a folder in Python

First, list all the files in that directory:

from os import listdir
from os.path import isfile, join

# mypath is the folder to scan; define it before this line
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

Then run your code for each file in that list:

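import PyPDF2
from os import listdir
from os.path import isfile, join

# mypath is the folder that holds the PDFs; define it before this point
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in onlyfiles:
    # listdir() returns bare names, so join each one with the folder path
    with open(join(mypath, file), 'rb') as fh:
        # PyPDF2 3.x names; older releases used PdfFileReader, getPage and extractText
        fileReader = PyPDF2.PdfReader(fh)

        count = 0

        # read at most the first 3 pages without running past the last page
        while count < min(3, len(fileReader.pages)):
            pageObj = fileReader.pages[count]
            count += 1
            text = pageObj.extract_text()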

os.listdir() returns everything in a directory, both files and subdirectories. The isfile() check drops the subdirectories but not the non-PDF files, so either keep only PDF files in that folder or filter the list by file extension.
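
The simple filtering mentioned above could look something like the sketch below. The helper name list_pdf_files and the demo file names are made up for illustration, and the demo runs against a temporary directory so it does not depend on any real folder.

```python
import os
import tempfile


def list_pdf_files(folder):
    """Return the sorted names of the regular files in `folder` that end in .pdf."""
    return sorted(
        name
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))  # skip subdirectories
        and name.lower().endswith(".pdf")              # keep PDFs, any letter case
    )


# Demo on a throwaway directory (file names are illustrative only).
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("report.pdf", "notes.txt"):
        open(os.path.join(demo_dir, name), "wb").close()
    os.mkdir(os.path.join(demo_dir, "archive.pdf"))  # a directory, so it is skipped
    print(list_pdf_files(demo_dir))
```

Checking the end of the name with endswith() is a little stricter than testing whether '.pdf' appears anywhere in it, which would also match a name such as 'notes.pdf.bak'.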

Edit 1

You can also use the glob module, which does pattern matching:

>>> import glob
>>> print(glob.glob('/home/rszamszur/*.sh'))
['/home/rszamszur/work-monitors.sh', '/home/rszamszur/default-monitor.sh', '/home/rszamszur/home-monitors.sh']

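The key difference is that os.listdir() returns the bare name of every entry in the directory, whereas glob.glob() returns only the paths that match a Unix shell-style wildcard pattern, with the directory part included. Both modules are in the standard library and work on Windows as well as on Unix-like systems.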
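
Applied to PDFs, the glob approach might be sketched as follows. The function name glob_pdf_files and the demo file names are assumptions for illustration; glob.escape() is used so that special characters in the folder name are not treated as wildcards.

```python
import glob
import os
import tempfile


def glob_pdf_files(folder):
    """Return the sorted paths in `folder` that match the pattern *.pdf."""
    pattern = os.path.join(glob.escape(folder), "*.pdf")
    return sorted(glob.glob(pattern))


# Demo on a throwaway directory (file names are illustrative only).
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("one.pdf", "two.pdf", "script.sh"):
        open(os.path.join(demo_dir, name), "wb").close()
    for path in glob_pdf_files(demo_dir):
        print(os.path.basename(path))
```

Because glob returns paths that already include the folder, each result can be passed straight to open() without a separate join() step.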

Read and extract multiple PDFs from multiple folders using Python

Maybe you could try something like this:

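# your code (it sets up PDFPage, page_interpreter and fake_file_handle)

import os

base = 'F:/technophile/Proj/SOURCE/'
folder = ['A','B','C','D','E','F','G','H']
allyourpdf = []


for fold in folder:
    allyourfiles = os.listdir(os.path.join(base, fold))
    firstpdf = ""
    # pick the first PDF found in this folder
    for i in allyourfiles:
        if i.lower().endswith('.pdf'):
            firstpdf = i
            break

    if not firstpdf:
        continue  # no PDF in this folder

    # join base, folder and file name so the path separators are not lost
    with open(os.path.join(base, fold, firstpdf), 'rb') as fh:

        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            page_interpreter.process_page(page)

        # note: if fake_file_handle is shared across folders, getvalue()
        # also returns the text of the PDFs processed earlier
        text = fake_file_handle.getvalue()
        allyourpdf.append(text)

# your code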

I think it should work
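
If the folder names are not known in advance, or every PDF in each folder is wanted rather than only the first one, an alternative is to walk the directory tree with os.walk(). This is only a sketch under assumptions: the function name find_pdfs and the demo layout are hypothetical, and the extraction step itself is left to whichever PDF library you already use.

```python
import os
import tempfile


def find_pdfs(root):
    """Yield the path of every .pdf file under `root`, one folder at a time."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subfolders in a stable, alphabetical order
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


# Demo on a throwaway tree (folder and file names are illustrative only).
with tempfile.TemporaryDirectory() as demo_root:
    for sub, name in (("A", "x.pdf"), ("A", "y.txt"), ("B", "z.pdf")):
        os.makedirs(os.path.join(demo_root, sub), exist_ok=True)
        open(os.path.join(demo_root, sub, name), "wb").close()
    for path in find_pdfs(demo_root):
        print(os.path.relpath(path, demo_root))
```

Each yielded path can then be opened in 'rb' mode and handed to the page-processing loop, one file at a time.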


