Shell Script to check content of PDF Files
Updated Answer
OK, I think you are looking for "My specified string NNN" in any PDF and want only the number. For that you need a Perl-compatible regular expression (-P) with a lookbehind, plus -o to print only the matched part:
pdfgrep -Po '(?<=My specified string )\d+' *.pdf
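If your pdfgrep build does not support -P, roughly the same extraction can be sketched with pdftotext and sed. This is only an approximation under stated assumptions: extract_number is a hypothetical helper name, and unlike -o it prints at most one number per line of text.

```shell
# extract_number (hypothetical helper): read text on stdin and print the
# digits that directly follow "My specified string " on each line.
# If the phrase occurs several times on one line, only the last
# occurrence is printed, because the leading .* is greedy.
extract_number() {
    sed -n 's/.*My specified string \([0-9][0-9]*\).*/\1/p'
}

# Usage with a real PDF (assumes pdftotext is installed):
#   pdftotext YourFile.pdf - | extract_number
```

Because the helper only reads stdin, it can be tried on plain text first, for example with printf 'My specified string 42\n' | extract_number.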
Original Answer
I think you mean you want to search for either of two things in a PDF:
pdfgrep -e "hallo|name" YourFile.pdf
Or maybe you want to search for both of two things:
pdfgrep "hallo" YourFile.pdf && pdfgrep "name" YourFile.pdf && echo "Both present"
Or, you can get a list of all files that contain "string1" with:
pdfgrep -l "string1" *.pdf
Or, get a list of files that contain "string1" and then look only in those files for "string2":
pdfgrep -lZ "string1" *.pdf | xargs -0 pdfgrep -l "string2"
How can I search the content of a pdf file in linux shell script?
I do not know whether this works for your journal, but it works on some PDF files:
strings "myjournal.pdf" | grep -E "/Author|/Title" | tr '/' '\n' | grep -E "Author|Title"
How to search contents of multiple pdf files?
Your distribution should provide a utility called pdftotext:
find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;
The "-" is necessary to have pdftotext output to stdout, not to files. The --with-filename and --label= options will put the file name in the output of grep. The optional --color flag is nice and tells grep to output using colors on the terminal.
(In Ubuntu, pdftotext is provided by the package xpdf-utils or poppler-utils.)
This method, using pdftotext and grep, has an advantage over pdfgrep if you want to use features of GNU grep that pdfgrep doesn't support. Note: pdfgrep-1.3.x supports the -C option for printing lines of context.
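Where grep lacks the GNU-only --label option, the same "file name in front of each match" output can be sketched with a plain loop. This is a minimal sketch, not a drop-in replacement for the find command: search_pdfs is a hypothetical function name, it assumes pdftotext is installed, and it takes the files as arguments rather than walking a directory tree.

```shell
# search_pdfs PATTERN FILE...
# (hypothetical helper) Print "file:matching line" for every line of
# extracted text that matches PATTERN. Assumes pdftotext is on PATH.
search_pdfs() {
    pattern=$1
    shift
    for f in "$@"; do
        # "-" sends the extracted text to stdout instead of a .txt file.
        pdftotext "$f" - 2>/dev/null |
            grep -e "$pattern" |
            while IFS= read -r match; do
                printf '%s:%s\n' "$f" "$match"
            done
    done
}

# Usage:
#   search_pdfs 'your pattern' /path/*.pdf
```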
Find string inside pdf with shell
As Simon nicely pointed out, you can simply convert the PDF to plain text using pdftotext and then search for what you're looking for. After conversion, you may use grep, bash regex, or any variation you want:
# IFS= and -r keep leading whitespace and backslashes in each line intact.
while IFS= read -r line; do
    if [[ ${line} =~ [0-9]{4}(-[0-9]{2}){2} ]]; then
        echo ">>> Found date;"
    fi
done < <(pdftotext infile.pdf -)
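If you want the dates themselves rather than a notice that one was found, the grep variation can be sketched as below. It uses the same YYYY-MM-DD pattern as the loop; find_dates is a hypothetical helper name, and -o is an extension (available in GNU and BSD grep) rather than part of POSIX.

```shell
# find_dates (hypothetical helper): read text on stdin and print every
# YYYY-MM-DD style match on its own line.
# -E enables extended regular expressions, -o prints only the match.
find_dates() {
    grep -Eo '[0-9]{4}(-[0-9]{2}){2}'
}

# Usage with a real PDF (assumes pdftotext is installed):
#   pdftotext infile.pdf - | find_dates
```

Note that the pattern only checks the shape of the date, so a string like 2021-99-99 also matches.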
search a word inside a pdf in terminal linux without any app
If you have the pdftotext utility installed, you can use the following command to search through the text of a PDF file:
pdftotext myfile.pdf - | grep 'pattern'
You have to use some utility (such as pdftotext) to convert the PDF file to text before feeding it into grep (otherwise grep would have a hard time making sense out of the raw PDF file), but any utility that does this should work.
On Ubuntu and Debian, pdftotext is part of the poppler-utils package.
Bash, searching in all pdf files
Quite a few issues here.
- Your script will spew errors if filenames include a percent sign, since printf "$file" will interpret its first argument as a format. Use printf '%s' "$file" instead.
- You haven't quoted the filename argument when you run pdftotext, which is likely why it throws its help message: pdftotext foo bar.pdf - passes the file name as two arguments, not one. Use pdftotext "$file" - instead. (As a rule, always quote your variables in bash.)
- If you want to show output only for matching files, you need to evaluate a condition before you print the filename.
I don't know exactly how pdftotext behaves, but assuming it doesn't produce a lot of output on stderr, the following might work:
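#!/usr/bin/env bash
# Build a separator of 32 dashes: pad an empty string to 32 spaces,
# then replace every space with a dash.
printf -v line '%32s' ''; line=${line// /-}
for file in */*.pdf; do
    # Capture matches first so the file name is printed only on a hit.
    output="$(pdftotext "$file" - | grep -i "$1")"
    if [ -n "$output" ]; then
        printf "%s\n$line\n%s\n$line\n\n" "$file" "$output"
    fi
done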
Note: I haven't tested this. You might want to expand the printf with the $line references for readability if this format appears complex or obtuse.
How to write shell script for finding number of pages in PDF?
Without any extra package:
strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
| sort -rn | head -n 1
Using pdfinfo:
pdfinfo file.pdf | awk '/^Pages:/ {print $2}'
Using pdftk:
pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'
You can also recursively sum the total number of pages in all PDFs via pdfinfo as follows:
find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
awk '/^Pages:/ {n += $2} END {print n}'
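If you want the count per file rather than a single total, the pdfinfo approach can be sketched as a small loop. This is a sketch under assumptions: pages_per_file is a hypothetical function name, pdfinfo must be installed, and a "?" is printed when no Pages: line could be read.

```shell
# pages_per_file FILE...
# (hypothetical helper) Print "<pages><TAB><file>" for each PDF given.
# Assumes pdfinfo is on PATH; "?" marks files it could not read.
pages_per_file() {
    for f in "$@"; do
        n=$(pdfinfo "$f" 2>/dev/null | awk '/^Pages:/ {print $2}')
        printf '%s\t%s\n' "${n:-?}" "$f"
    done
}

# Usage:
#   pages_per_file *.pdf | sort -rn
```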