How to Search The Content of a PDF File in Linux Shell Script

Shell Script to check content of PDF Files

Updated Answer

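OK, I think you are looking for "My specified string NNN" in any PDF and want only the number. For that you need a Perl-compatible regular expression (PCRE) with a lookbehind, which pdfgrep supports through -P; adding -o prints only the matched part: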

pdfgrep -Po '(?<=My specified string )\d+' *.pdf 
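If pdfgrep is not installed, roughly the same extraction can be sketched with pdftotext and sed. The helper name extract_after_label is made up for illustration. Unlike -o, this sed expression prints at most one number per line, and because the leading .* is greedy it picks the last occurrence on that line.

```shell
# Hypothetical helper: read text on stdin and print the digits that
# follow the label "My specified string ".
extract_after_label() {
    sed -n 's/.*My specified string \([0-9][0-9]*\).*/\1/p'
}

# Assumed usage (needs pdftotext, e.g. from poppler-utils):
#   pdftotext YourFile.pdf - | extract_after_label
```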

Original Answer

I think you mean you want to search for either of two things in a PDF:

pdfgrep -e "hallo|name" YourFile.pdf

Or maybe you want to search for both of two things:

pdfgrep "hallo" YourFile.pdf && pdfgrep "name" YourFile.pdf && echo "Both present"

Or, you can get a list of all files that contain "string1" with:

pdfgrep -l "string1" *pdf

Or, get a list of files that contain "string1" and then look only in those files for "string2":

pdfgrep -lZ "string1" *pdf | xargs -0 pdfgrep -l "string2"
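The "both present" check above can be generalized to any number of patterns. The following is only a sketch: contains_all is a hypothetical helper that works on already-extracted text, so it assumes pdftotext (or a similar converter) is feeding it.

```shell
# Hypothetical helper: succeed only if every pattern given as an argument
# matches somewhere in the text read from stdin.
contains_all() {
    local text pattern
    text=$(cat)                      # read stdin once so it can be searched repeatedly
    for pattern in "$@"; do
        grep -qE -- "$pattern" <<<"$text" || return 1
    done
    return 0
}

# Assumed usage:
#   pdftotext YourFile.pdf - | contains_all hallo name && echo "All present"
```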

How can I search the content of a pdf file in linux shell script?

I do not know whether this works for your journal, but it works on some PDF files:

strings "myjournal.pdf" | grep -E "/Author|/Title" | tr '/' '\n' | grep -E "Author|Title"
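The strings approach can miss metadata that is not stored as plain text in the file. If pdfinfo (from the same poppler-utils package as pdftotext) is available, its output contains Title: and Author: lines that are easier to filter. A small sketch, with meta_fields as a made-up helper name:

```shell
# Hypothetical helper: keep only the Title and Author lines of
# pdfinfo-style "Key:   value" output read from stdin.
meta_fields() {
    grep -E '^(Title|Author):'
}

# Assumed usage:
#   pdfinfo myjournal.pdf | meta_fields
```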

How to search contents of multiple pdf files?

Your distribution should provide a utility called pdftotext:

find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;

The "-" is necessary to have pdftotext output to stdout, not to files.
The --with-filename and --label= options will put the file name in the output of grep.
The optional --color flag is nice and tells grep to output using colors on the terminal.

(In Ubuntu, pdftotext is provided by the package xpdf-utils or poppler-utils.)

This method, using pdftotext and grep, has an advantage over pdfgrep if you want to use features of GNU grep that pdfgrep doesn't support. Note: pdfgrep 1.3.x supports the -C option for printing lines of context.
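The same search can be sketched as a bash function that passes file names NUL-delimited, so names with spaces or quotes are handled. The name search_pdfs and its two parameters are assumptions for illustration; --label is a GNU grep option, as in the command above.

```shell
# Hypothetical helper: search_pdfs ROOT PATTERN
# Prints "file:matching line" for every PDF below ROOT.
search_pdfs() {
    local root=$1 pattern=$2 f
    find "$root" -type f -name '*.pdf' -print0 |
    while IFS= read -r -d '' f; do
        # "-" sends the extracted text to stdout; --label names the stdin input.
        # "|| :" because a file without matches is not an error here.
        pdftotext "$f" - | grep --with-filename --label="$f" -- "$pattern" || :
    done
}

# Assumed usage:
#   search_pdfs /path "your pattern"
```

Further grep options (for example -i or -C 2) can be added to the grep call as needed.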

Find string inside pdf with shell

As Simon nicely pointed out, you can simply convert the PDF to plain text using pdftotext and then search for what you're looking for.

After conversion, you may use grep, bash regex, or any variation you want:

# IFS= and -r keep leading whitespace and backslashes in each line intact.
while IFS= read -r line; do
    if [[ ${line} =~ [0-9]{4}(-[0-9]{2}){2} ]]; then
        echo ">>> Found date;"
    fi
done < <(pdftotext infile.pdf -)
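To print the matched date itself rather than a fixed message, bash stores the text matched by =~ in BASH_REMATCH[0]. A minimal sketch, with find_dates as a made-up function name; it reports only the first date on each line:

```shell
# Hypothetical helper: print the first YYYY-MM-DD-shaped string found
# on each line of the text read from stdin.
find_dates() {
    local line
    local re='[0-9]{4}(-[0-9]{2}){2}'
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            printf '%s\n' "${BASH_REMATCH[0]}"
        fi
    done
}

# Assumed usage:
#   pdftotext infile.pdf - | find_dates
```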

search a word inside a pdf in terminal linux without any app

If you have the pdftotext utility installed, you can use the following command to search through the text of a PDF file:

pdftotext myfile.pdf - | grep 'pattern'

You have to use some utility (such as pdftotext) to convert the PDF file to text before feeding it into grep (otherwise grep would have a hard time making sense out of the raw PDF file), but any utility that does this should work.

On Ubuntu and Debian, pdftotext is part of the poppler-utils package.
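Plain grep loses track of which page a match came from. By default pdftotext separates pages with a form feed character, so awk can treat each page as one record. This is a sketch under that assumption; match_pages is a hypothetical helper, and the pattern is interpreted as an awk regular expression.

```shell
# Hypothetical helper: match_pages PATTERN
# Reads pdftotext output on stdin and prints the number of every page
# whose text matches PATTERN.
match_pages() {
    # RS = "\f" makes each form-feed-separated page a single record,
    # so NR is the page number.
    awk -v pat="$1" 'BEGIN { RS = "\f" } $0 ~ pat { print NR }'
}

# Assumed usage:
#   pdftotext myfile.pdf - | match_pages 'pattern'
```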

Bash, searching in all pdf files

Quite a few issues here.

  • Your script will spew errors if filenames include a percent sign, since printf "$file" will interpret its first argument as a format. Use printf '%s' "$file" instead.
  • You haven't quoted the filename argument when you run pdftotext, which is likely why it prints its help message: pdftotext foo bar.pdf - passes the filename as two separate arguments instead of one. Use pdftotext "$file" - instead. (As a rule, always quote your variables in bash.)
  • If you want to show output only for matching files, you need to evaluate a condition before you print the filename.

I don't know how pdftotext behaves exactly, but assuming it doesn't produce a bunch of stderr, the following might work:

#!/usr/bin/env bash

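# Build a 32-character rule of dashes (%032d pads the number 0 with zeros).
line=$(printf '%032d' 0); line=${line//0/-}

for file in */*.pdf; do
    # Collect case-insensitive matches for the search term given as $1.
    output="$(pdftotext "$file" - | grep -i -- "$1")"
    if [ -n "$output" ]; then
        printf "%s\n$line\n%s\n$line\n\n" "$file" "$output"
    fi
done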

Note: I haven't tested this. You might want to expand the printf with the $line references for readability if this format appears complex or obtuse.
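One way to expand that printf, as the note suggests, is to pass the rule as an argument instead of embedding it in the format string. A sketch with a made-up function name, print_match_block:

```shell
# Hypothetical helper: print_match_block FILE OUTPUT
# Prints the file name, a rule, the matching lines, a rule, and a blank line.
print_match_block() {
    local rule
    printf -v rule '%32s' ''     # 32 spaces...
    rule=${rule// /-}            # ...turned into 32 dashes
    # The format '%s\n' is reused for each argument, so nothing in the
    # file name or the matches is ever interpreted as a format string.
    printf '%s\n' "$1" "$rule" "$2" "$rule" ''
}

# Assumed usage inside the loop:
#   print_match_block "$file" "$output"
```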

How to write shell script for finding number of pages in PDF?

Without any extra package:

strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
| sort -rn | head -n 1
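To see what that pipeline does, it helps to run the sed and sort part on some sample text. Every node of a PDF page tree carries a /Count entry for the pages below it, so the largest value is normally the root's total; outline entries can also have a /Count, possibly negative, which is why the pattern allows an optional minus sign. The sample lines and the name max_count are illustrative only, and the trick may not work when the page tree sits inside compressed object streams.

```shell
# Hypothetical helper: print the largest /Count value found in the
# text read from stdin (same sed/sort/head pipeline as above).
max_count() {
    sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1
}

# Made-up sample of what strings might print for a small PDF:
printf '%s\n' \
    '<< /Type /Pages /Count 12 /Kids [ 3 0 R ] >>' \
    '<< /Type /Pages /Count 5 >>' \
    '<< /Type /Outlines /Count -2 >>' | max_count
```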

Using pdfinfo:

pdfinfo file.pdf | awk '/^Pages:/ {print $2}'

Using pdftk:

pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'

You can also recursively sum the total number of pages in all PDFs via pdfinfo as follows:

find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
awk '/^Pages:/ {n += $2} END {print n}'
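For a per-file breakdown instead of a single total, the same pdfinfo call can be run in a loop. A sketch, assuming bash and pdfinfo; page_counts is a hypothetical name, and a "?" is printed when pdfinfo reports no page count for a file.

```shell
# Hypothetical helper: page_counts [DIR]
# Prints "pages<TAB>path" for every PDF below DIR (default: current directory).
page_counts() {
    local f pages
    find "${1:-.}" -xdev -type f -name '*.pdf' -print0 |
    while IFS= read -r -d '' f; do
        pages=$(pdfinfo "$f" 2>/dev/null | awk '/^Pages:/ {print $2}')
        printf '%s\t%s\n' "${pages:-?}" "$f"
    done
}

# Assumed usage, sorted by page count:
#   page_counts . | sort -n
```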

