How to Search Contents of Multiple PDF Files

How to search contents of multiple pdf files?

Your distribution should provide a utility called pdftotext:

find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;

The "-" is necessary to have pdftotext output to stdout, not to files.
The --with-filename and --label= options will put the file name in the output of grep.
The optional --color flag is nice and tells grep to output using colors on the terminal.

(In Ubuntu, pdftotext is provided by the package xpdf-utils or poppler-utils.)

This method, using pdftotext and grep, has an advantage over pdfgrep if you want to use features of GNU grep that pdfgrep doesn't support. Note: pdfgrep 1.3.x supports the -C option for printing lines of context.
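To illustrate that advantage, the pipeline above can be wrapped in a small function that forwards arbitrary grep options. This is only a sketch: the name pdfsearch and its argument order are made up for this example, and it assumes that file names contain no newlines.

```shell
# Hypothetical helper: pdfsearch DIR PATTERN [extra grep options...]
# Converts each PDF under DIR to text and greps it, forwarding any extra
# options (for example -i, -C 2 or -E) to grep unchanged.
pdfsearch() {
    dir=$1
    pattern=$2
    shift 2
    # Assumption: file names contain no newline characters.
    find "$dir" -type f -name '*.pdf' | while IFS= read -r pdf; do
        pdftotext "$pdf" - 2>/dev/null |
            grep --with-filename --label="$pdf" "$@" -e "$pattern"
    done
}
```

For example, pdfsearch /path "your pattern" -i -C 2 would search case-insensitively and print two lines of context around each match.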

How to search for a word in multiple PDF files using pdftotext in Linux

The following should list the files matching the pattern:

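find . -type f -name "*.pdf" | while IFS= read -r i; do
pdftotext "$i" - | grep -q "search-word" && echo "$i"
done

The -q option makes grep print nothing and report a match only through its exit status, so echo prints the file name when the word is found. Reading the names line by line with while read, instead of looping over backticks, keeps file names that contain spaces intact.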
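If it is also useful to know how often the word occurs, a variation on the loop above can count matching lines per file using grep -c. This is a rough sketch; the function name pdfcount is invented for this example, and file names are assumed to contain no newlines.

```shell
# Hypothetical helper: pdfcount DIR WORD
# Prints "COUNT<TAB>FILE" for every PDF with at least one matching line,
# highest count first. Assumption: file names contain no newlines.
pdfcount() {
    find "$1" -type f -name '*.pdf' | while IFS= read -r pdf; do
        # grep -c prints the number of matching lines (0 if none).
        n=$(pdftotext "$pdf" - 2>/dev/null | grep -c -e "$2" || true)
        if [ "$n" -gt 0 ]; then
            printf '%s\t%s\n' "$n" "$pdf"
        fi
    done | sort -rn
}
```

Note that grep -c counts matching lines, not individual occurrences, so a line containing the word twice is counted once.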

How can I do a full-text search of PDF files from Perl?

A PerlMonks thread discusses this problem.

It seems that for your situation, it might be simplest to get pdftotext (the command line tool), then you can do something like:

my @search_results = `pdftotext myfile.pdf - | grep -i -l \"$string\"`;

How do you search for a certain text across pdf files

It may be helpful to specify your operating system and the types of searches that you intend to perform (words, exact phrases, PDF metadata?). Some built-in search systems like OS X's Spotlight will automatically search for multiple words across all PDF files in your account.

On Linux, I would convert each PDF file to plain text with the pdftotext utility and then search the result with grep:

find /start/path -name '*.pdf' -print \
-exec pdftotext {} /tmp/tmp.txt \; \
-exec grep -i "search words" /tmp/tmp.txt \;
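The command above prints every file name, whether or not it matches, and writes to a fixed path in /tmp. A possible refinement, sketched below, uses mktemp for a private temporary file and prints a file name only when that file matches. The function name pdfsearch_tmp is hypothetical, and file names are assumed to contain no newlines.

```shell
# Hypothetical helper: pdfsearch_tmp DIR PATTERN
# Converts each PDF to a private temporary text file, then prints the
# file name followed by its matching lines (case-insensitive).
pdfsearch_tmp() {
    tmpfile=$(mktemp) || return 1
    find "$1" -type f -name '*.pdf' | while IFS= read -r pdf; do
        # Skip files that pdftotext cannot convert.
        pdftotext "$pdf" "$tmpfile" 2>/dev/null || continue
        if grep -q -i -e "$2" "$tmpfile"; then
            printf '%s\n' "$pdf"
            grep -i -e "$2" "$tmpfile"
        fi
    done
    rm -f "$tmpfile"
}
```

A call such as pdfsearch_tmp /start/path "search words" would then list only the PDF files that contain the phrase, each followed by its matching lines.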

Programmatically search multiple PDF files for keyword and note page number

I suggest evaluating Apache Solr, which can index PDF files efficiently.

http://lucene.apache.org/solr/
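
For a small collection where a full index is more than you need, the page-number part of the question can also be approximated with the pdftotext pipeline used in the other answers. The sketch below relies on pdftotext separating pages with form-feed characters, which is its default behaviour unless -nopgbrk is given. The function name pdfpages is made up for this example, and file names are assumed to contain no newlines.

```shell
# Hypothetical helper: pdfpages DIR PATTERN
# Prints "FILE: page N" for each page whose text matches PATTERN,
# which awk treats as an extended regular expression.
pdfpages() {
    find "$1" -type f -name '*.pdf' | while IFS= read -r pdf; do
        # RS='\f' makes awk read one page per record, so NR is the page
        # number. The pattern and file name are passed through the
        # environment so that awk does not reinterpret backslashes.
        pdftotext "$pdf" - 2>/dev/null |
            PDF_PAT=$2 PDF_FILE=$pdf awk -v RS='\f' \
                '$0 ~ ENVIRON["PDF_PAT"] { print ENVIRON["PDF_FILE"] ": page " NR }'
    done
}
```

Running pdfpages . keyword would print one line per matching page. Unlike an index, this rescans every file on each search, so it suits occasional queries rather than repeated ones.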


