Find String Inside PDF with Shell

As Simon nicely pointed out, you can simply convert the PDF to plain text with pdftotext and then search the result for whatever you are looking for.

After conversion, you can use grep, a bash regex, or any variation you like:

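# -r keeps backslashes literal; IFS= preserves leading and trailing whitespace
while IFS= read -r line; do

    if [[ ${line} =~ [0-9]{4}(-[0-9]{2}){2} ]]; then
        echo ">>> Found date: ${BASH_REMATCH[0]}"
    fi

done < <(pdftotext infile.pdf -)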
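
If you want to see the matches rather than only learn that one exists, the same idea can be sketched with grep -o. This is a minimal sketch, not part of the original answer: the function name find_dates_in_pdf is hypothetical, and it assumes pdftotext is on the PATH and that your grep supports -o (GNU and BSD grep both do).

```shell
# Hypothetical helper: print every ISO-style date (YYYY-MM-DD) found in a PDF,
# prefixed with the line number of the extracted text it appeared on.
find_dates_in_pdf() {
    # "-" sends the extracted text to stdout instead of a file.
    # -n adds line numbers, -o prints only the matching part,
    # -E enables extended regular expressions.
    pdftotext "$1" - | grep -noE '[0-9]{4}(-[0-9]{2}){2}'
}
```

Calling find_dates_in_pdf infile.pdf prints one "line:date" pair per match, and the exit status is non-zero when no date is found, so it can also be used in an if condition.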

How to search contents of multiple pdf files?

Your distribution should provide a utility called pdftotext:

find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;

The "-" is necessary to have pdftotext output to stdout, not to files.
The --with-filename and --label= options will put the file name in the output of grep.
The optional --color flag is nice and tells grep to output using colors on the terminal.

(On Ubuntu, pdftotext is provided by the package xpdf-utils or poppler-utils.)

This method, using pdftotext and grep, has an advantage over pdfgrep if you want to use features of GNU grep that pdfgrep doesn't support. Note: pdfgrep 1.3.x supports the -C option for printing lines of context.
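
The find command above can be wrapped in a small function so that the pattern and directory become arguments. This is only a sketch under assumptions: the name search_pdfs is hypothetical, and --with-filename and --label are GNU grep options, so a GNU-compatible grep is assumed.

```shell
# Hypothetical wrapper: search_pdfs PATTERN [DIR]
# Recursively searches the text of every PDF under DIR (default: current directory).
search_pdfs() {
    pattern=$1
    dir=${2:-.}
    # The pattern and the file names are passed as positional parameters,
    # so quotes or spaces in either cannot break the inner command.
    find "$dir" -type f -name '*.pdf' -exec sh -c '
        pattern=$1
        shift
        for pdf in "$@"; do
            # -n adds the line number of the extracted text to each hit.
            pdftotext "$pdf" - |
                grep -n --with-filename --label="$pdf" -e "$pattern"
        done
    ' sh "$pattern" {} +
}
```

For example, search_pdfs 'your pattern' /path prints each hit as file:line:text. Because grep exits non-zero when the last file has no match, the exit status of the function is not a reliable "found" indicator; check the output instead.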

Search for a word inside a PDF in the Linux terminal without any app

If you have the pdftotext utility installed, you can use the following command to search through the text of a PDF file:

pdftotext myfile.pdf - | grep 'pattern'

grep cannot make sense of the raw PDF file, so you have to convert it to text before feeding it into grep. pdftotext is one option, but any utility that does the conversion should work.

On Ubuntu and Debian, pdftotext is part of the poppler-utils package.

Shell Script to check content of PDF Files

Updated Answer

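OK, I think you are looking for the number NNN that follows "My specified string" in any PDF, so you need a Perl-compatible regular expression (PCRE) with pdfgrep -Po (-P enables PCRE, -o prints only the matched part), like this: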

pdfgrep -Po '(?<=My specified string )\d+' *.pdf

Original Answer

I think you mean you want to search for either of two things in a PDF:

pdfgrep -e "hallo|name" YourFile.pdf

Or maybe you want to search for both of two things:

pdfgrep -q "hallo" YourFile.pdf && pdfgrep -q "name" YourFile.pdf && echo "Both present"

Or, you can get a list of all files that contain "string1" with:

pdfgrep -l "string1" *.pdf

Or, get a list of files that contain "string1" and then look only in those files for "string2":

pdfgrep -lZ "string1" *.pdf | xargs -0 pdfgrep -l "string2"
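
If pdfgrep is not installed, the "both strings" check can be approximated with pdftotext and grep. The following is a minimal sketch rather than the answerer's method: the function name pdf_has_both is hypothetical, it matches fixed strings instead of regular expressions, and it assumes pdftotext is on the PATH.

```shell
# Hypothetical helper: pdf_has_both FILE STRING1 STRING2
# Succeeds (exit 0) only if the text of FILE contains both fixed strings.
pdf_has_both() {
    # Extract the text once, then test both strings against the same text.
    # Exit status 2 signals that the extraction itself failed.
    text=$(pdftotext "$1" -) || return 2
    # -F treats the argument as a fixed string, -q suppresses output.
    printf '%s\n' "$text" | grep -qF -e "$2" &&
        printf '%s\n' "$text" | grep -qF -e "$3"
}
```

A possible use is: for f in *.pdf; do pdf_has_both "$f" "string1" "string2" && echo "$f"; done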

How to write shell script for finding number of pages in PDF?

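Without any extra package (a heuristic that reads the /Count entries of the page tree, so it can fail on PDFs whose objects are stored compressed):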

strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
    | sort -rn | head -n 1

Using pdfinfo:

pdfinfo file.pdf | awk '/^Pages:/ {print $2}'

Using pdftk:

pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'

You can also recursively sum the total number of pages in all PDFs via pdfinfo as follows:

find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
    awk '/^Pages:/ {n += $2} END {print n}'
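
When you need the count per file rather than a grand total, the pdfinfo approach can be sketched as a loop. The function name pdf_page_counts is hypothetical, and the sketch assumes pdfinfo (from poppler-utils) is on the PATH.

```shell
# Hypothetical helper: pdf_page_counts FILE...
# Prints "<pages><TAB><file>" for each PDF given on the command line.
pdf_page_counts() {
    for pdf in "$@"; do
        pages=$(pdfinfo "$pdf" 2>/dev/null | awk '/^Pages:/ {print $2}')
        # Files that pdfinfo cannot read yield an empty count; show "?" instead.
        printf '%s\t%s\n' "${pages:-?}" "$pdf"
    done
}
```

For example, pdf_page_counts *.pdf | sort -rn lists the longest documents first.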
