Find string inside pdf with shell
As Simon nicely pointed out, you can simply convert the PDF to plain text using pdftotext and then search for what you're looking for.
After conversion, you can use grep, a bash regex, or any variation you want:
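# Read the converted text line by line and flag every line that contains a YYYY-MM-DD date.
# IFS= and -r keep leading whitespace and backslashes in each line intact.
while IFS= read -r line; do
    if [[ ${line} =~ [0-9]{4}(-[0-9]{2}){2} ]]; then
        echo ">>> Found date;"
    fi
done < <(pdftotext infile.pdf -)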
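The same date check can be done with grep alone. The following is a minimal sketch, not part of the original answer: extract_dates is a hypothetical helper name, and the -o flag is used so that the dates themselves are printed rather than just a notice.

```shell
# extract_dates: print every YYYY-MM-DD date found on standard input, one per line.
# (Hypothetical helper name, used only for illustration.)
extract_dates() {
    grep -Eo '[0-9]{4}(-[0-9]{2}){2}'
}

# Demo on plain text; with a real PDF you would run:
#   pdftotext infile.pdf - | extract_dates
printf 'Invoice issued 2021-03-15\nno date here\nDue 2021-04-01\n' | extract_dates
```

Because the helper only reads standard input, it works with any converter that writes text to stdout.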
How to search contents of multiple pdf files?
Your distribution should provide a utility called pdftotext:
find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;
The "-" is necessary to make pdftotext write to stdout instead of to a file.
The --with-filename and --label= options put the file name in the output of grep.
The optional --color flag tells grep to colorize its output on the terminal.
(In Ubuntu, pdftotext is provided by the package xpdf-utils or poppler-utils.)
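To see what --with-filename and --label= do in isolation, here is a small sketch that feeds grep from standard input. label_matches is a hypothetical helper name, report.pdf is a placeholder, and --label is assumed to be available (it is in GNU grep).

```shell
# label_matches NAME PATTERN: search standard input for PATTERN and prefix each
# matching line with NAME and its line number, as if NAME were the file searched.
# (Hypothetical helper; --label is assumed to be supported by your grep.)
label_matches() {
    grep --with-filename --label="$1" -n -- "$2"
}

# Demo on plain text; with a real PDF you would run:
#   pdftotext report.pdf - | label_matches report.pdf 'your pattern'
printf 'alpha\nbeta\n' | label_matches report.pdf beta
```

With GNU grep the demo prints report.pdf:2:beta, which shows why the label matters: without it, grep would report matches from a pipe as coming from "(standard input)".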
This method, using pdftotext and grep, has an advantage over pdfgrep if you want to use features of GNU grep that pdfgrep doesn't support. Note: pdfgrep-1.3.x supports the -C option for printing lines of context.
search a word inside a pdf in terminal linux without any app
If you have the pdftotext utility installed, you can use the following command to search through the text of a PDF file:
pdftotext myfile.pdf - | grep 'pattern'
You have to use some utility (such as pdftotext) to convert the PDF file to text before feeding it into grep (otherwise grep would have a hard time making sense of the raw PDF file), but any utility that does this should work.
On Ubuntu and Debian, pdftotext is part of the poppler-utils package.
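If you also want to know which page a match is on, one option is to rely on the form-feed character that pdftotext writes at each page break by default. The sketch below is an illustration under that assumption; grep_pages is a hypothetical helper name, and the pattern is treated as an awk regular expression.

```shell
# grep_pages PATTERN: read pdftotext output on standard input and print
# "page N: text" for every piece of text that matches PATTERN.
# Assumes pages are separated by form feeds (\f), pdftotext's default behaviour.
grep_pages() {
    awk -v pat="$1" '
        BEGIN { page = 1 }
        {
            # A line can contain form feeds; each one starts a new page.
            n = split($0, parts, "\f")
            for (i = 1; i <= n; i++) {
                if (i > 1) page++
                if (parts[i] ~ pat) print "page " page ": " parts[i]
            }
        }'
}

# Demo on plain text; with a real PDF you would run:
#   pdftotext myfile.pdf - | grep_pages 'pattern'
printf 'intro text\n\fsecond page pattern here\n' | grep_pages 'pattern'
```

This would not work if pdftotext is run with an option that suppresses page breaks, since the form feeds are the only page markers in the text.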
Shell Script to check content of PDF Files
Updated Answer
OK, I think you are looking for "My specified string NNN" in any PDF, so you need a Perl-compatible regex (PCRE), which pdfgrep supports via -Po, like this:
pdfgrep -Po '(?<=My specified string )\d+' *.pdf
Original Answer
I think you mean you want to search for either of two things in a PDF:
pdfgrep -e "hallo|name" YourFile.pdf
Or maybe you want to search for both of two things:
pdfgrep "hallo" YourFile.pdf && pdfgrep "name" YourFile.pdf && echo "Both present"
Or, you can get a list of all files that contain "string1" with:
pdfgrep -l "string1" *pdf
Or, get a list of files that contain "string1" and then look only in those files for "string2":
pdfgrep -lZ "string1" *pdf | xargs -0 pdfgrep -l "string2"
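If pdfgrep is not available, the "both present" check can be approximated with pdftotext and grep. This is a rough sketch rather than the answer's own method: contains_all is a hypothetical helper name, it works on an already converted text file, and it accepts any number of patterns.

```shell
# contains_all FILE PATTERN...: succeed only if the text file FILE contains
# every PATTERN. (Hypothetical helper; FILE is plain text, not a PDF.)
contains_all() {
    file=$1
    shift
    for pattern in "$@"; do
        # -q suppresses output; the exit status alone says whether it matched.
        grep -q -- "$pattern" "$file" || return 1
    done
}

# Usage with a real PDF (YourFile.pdf is a placeholder):
#   pdftotext YourFile.pdf YourFile.txt
#   contains_all YourFile.txt "hallo" "name" && echo "Both present"
```

Converting once and searching the text file repeatedly avoids re-parsing the PDF for every pattern.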
How to write shell script for finding number of pages in PDF?
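Without any extra package (a heuristic: it reads the /Count entries from the raw file, so it can fail on PDFs that keep those entries inside compressed object streams):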
strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
| sort -rn | head -n 1
Using pdfinfo:
pdfinfo file.pdf | awk '/^Pages:/ {print $2}'
Using pdftk:
pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'
You can also recursively sum the total number of pages in all PDFs via pdfinfo as follows:
find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
awk '/^Pages:/ {n += $2} END {print n}'
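To list the page count of each file instead of only the total, the pdfinfo approach can be wrapped in a loop. A minimal sketch, assuming pdfinfo is installed; page_counts is a hypothetical helper name, and "?" is an arbitrary placeholder for files whose page count could not be read.

```shell
# page_counts FILE...: print "<pages><TAB><file>" for each PDF given.
# Prints "?" as the count when pdfinfo reports no Pages: line for a file.
page_counts() {
    for f in "$@"; do
        n=$(pdfinfo "$f" 2>/dev/null | awk '/^Pages:/ {print $2}')
        printf '%s\t%s\n' "${n:-?}" "$f"
    done
}

# Usage (paths are placeholders):
#   page_counts *.pdf
#   page_counts *.pdf | sort -rn | head -n 5    # the five longest documents
```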