How to Use Grep with a Large Number (Millions) of Files to Search for a String and Get Results in a Few Minutes

How to use grep with a large number (millions) of files to search for a string and get results in a few minutes

You should remove the -0 argument to xargs and raise the -n parameter instead:

... | xargs -n16 ...
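
To make that concrete, a minimal sketch of the full pipeline, assuming the filenames contain no whitespace (otherwise keep find -print0 and xargs -0); the *.ext glob and the search string are placeholders:

# pass 16 filenames to each grep invocation instead of one at a time
find . -name "*.ext" | xargs -n16 grep -H "string" > Strings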

How to use grep efficiently?

In case anyone is interested: if you have xargs installed and a multi-core processor, you can benefit from the following.

Environment:

Processor: Dual Quad-core 2.4GHz
Memory: 32 GB
Number of files: 584450
Total Size: ~ 35 GB

Tests:

1. Find the necessary files, pipe them to xargs, and tell it to run 8 grep instances in parallel (-P8), one file per instance (-n1).

time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P8 grep -H "string" >> Strings_find8

real 3m24.358s
user 1m27.654s
sys 9m40.316s

2. Find the necessary files, pipe them to xargs, and tell it to run 4 grep instances in parallel.

time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P4 grep -H "string" >> Strings

real 16m3.051s
user 0m56.012s
sys 8m42.540s

3. Suggested by @Stephen: find the necessary files and use -exec ... + instead of xargs.

time find ./ -name "*.ext" -exec grep -H "string" {} \+ >> Strings

real 53m45.438s
user 0m5.829s
sys 0m40.778s

4. Regular recursive grep.

grep -R "string" >> Strings

real 235m12.823s
user 38m57.763s
sys 38m8.301s

For my purposes, the first command worked just fine.
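
For completeness, GNU grep can also restrict a recursive search to the same file set without the find pipeline, though it remains single-threaded; a minimal sketch:

# recursive grep limited to *.ext files (GNU grep)
grep -RH --include="*.ext" "string" . >> Strings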

Diagnosing a slow grep or ack search through a complex directory (code, files, php scripts, etc) for faster repeated use

Ended up using the -L switch, which lists the files that do not contain a match, for quick visual diagnosis of problems. Used just the command structure below:

ack-grep -L "Oops! Required fields were not all completed."
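
Plain GNU grep offers the same pair of switches, in case ack is not installed; a minimal sketch against the current directory:

# -l lists files that contain the pattern, -L lists files that do not
grep -rl "Oops! Required fields were not all completed." .
grep -rL "Oops! Required fields were not all completed." .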

Fastest possible grep

Try GNU parallel, whose documentation includes an example of how to use it with grep:

grep -r greps recursively through directories. On multicore CPUs GNU
parallel can often speed this up.

find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

This will run 1.5 jobs per core and give 1000 arguments to each grep invocation.

For big files, it can split the input into several chunks with the --pipe and --block arguments:

 parallel --pipe --block 2M grep foo < bigfile

You can also run it on several different machines through SSH (an ssh-agent is needed to avoid password prompts):

parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile

Grepping a huge file (80 GB): any way to speed it up?

Here are a few options:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep because you're searching for a fixed string, not a regular expression.

3) Remove the -i option, if you don't need it.

So your command becomes:

LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql

It will also be faster if you copy the file to a RAM disk first.
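
A minimal sketch of that idea, assuming a tmpfs mount at /dev/shm with enough free memory to hold the file (an 80 GB file obviously needs a correspondingly large RAM disk):

# copy once, then run repeated searches against memory-backed storage
cp eightygigsfile.sql /dev/shm/
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' /dev/shm/eightygigsfile.sql
rm /dev/shm/eightygigsfile.sql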

grep -f maximum number of patterns?

I ran into the same problem with approx. 4 million patterns to search for in a file with 9 million lines. It seems to be a RAM problem, so I used this neat little workaround, which may be slower than splitting and joining the pattern file (sketched further below) but needs only this one line:

 # read one pattern per line; read -r and quoting avoid mangling the pattern
 while read -r line; do grep "$line" fileToSearchIn; done < patternFile

I needed to use the workaround since the -F flag is no solution for files that large...
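
For comparison, a sketch of the split-and-join approach mentioned above; the chunk size of 100000 patterns per file is an arbitrary assumption to keep each grep -f invocation within memory:

# split the pattern file into chunks, run grep -F -f on each, collect all matches
split -l 100000 patternFile pattern_chunk_
for chunk in pattern_chunk_*; do
    grep -F -f "$chunk" fileToSearchIn
done > allMatches
rm pattern_chunk_*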

EDIT: The while/read loop turned out to be really slow for large files. After some more research I found 'faSomeRecords' and other awesome tools from the Kent NGS-editing-Tools.

I tried it myself by extracting 2 million FASTA records from a file of 5.5 million records. It took approx. 30 seconds.

cheers

EDIT: direct download link


