How to use grep efficiently?
If you have xargs installed and a multi-core processor, the following may be of interest.
Environment:
Processor: Dual Quad-core 2.4GHz
Memory: 32 GB
Number of files: 584450
Total Size: ~ 35 GB
Tests:
1. Find the necessary files, pipe them to xargs and tell it to execute 8 instances.
time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P8 grep -H "string" >> Strings_find8
real 3m24.358s
user 1m27.654s
sys 9m40.316s
2. Find the necessary files, pipe them to xargs and tell it to execute 4 instances.
time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P4 grep -H "string" >> Strings
real 16m3.051s
user 0m56.012s
sys 8m42.540s
3. Suggested by @Stephen: Find the necessary files and use + instead of xargs
time find ./ -name "*.ext" -exec grep -H "string" {} \+ >> Strings
real 53m45.438s
user 0m5.829s
sys 0m40.778s
4. Regular recursive grep.
grep -R "string" >> Strings
real 235m12.823s
user 38m57.763s
sys 38m8.301s
For my purposes, the first command worked just fine.
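A minimal, self-contained sketch of the winning pattern from test 1; the temporary directory, file names, and search string are invented for the demo:

```shell
# Create a throwaway directory with a couple of sample files.
dir=$(mktemp -d)
printf 'needle in a haystack\n' > "$dir/a.ext"
printf 'nothing to see here\n'  > "$dir/b.ext"

# -print0 / -0 keep odd filenames safe; -n1 gives each grep one
# file; -P8 runs up to eight grep processes at once; -H prints
# the filename even when grep receives a single file argument.
find "$dir" -name '*.ext' -print0 \
    | xargs -0 -n1 -P8 grep -H 'needle' > "$dir/matches"

cat "$dir/matches"
rm -rf "$dir"
```

Tune -P to your core count; with -n1 each file is a separate grep invocation, which is what lets xargs keep all cores busy.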
Grepping a huge file (80GB) any way to speed it up?
Here are a few options:
1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.
2) Use fgrep, because you're searching for a fixed string, not a regular expression.
3) Remove the -i option if you don't need it.
So your command becomes:
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
It will also be faster if you copy your file to RAM disk.
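A toy illustration of the three tips combined; the file name and contents are stand-ins, and grep -F is the modern spelling of fgrep:

```shell
# Build a tiny stand-in for the 80 GB file.
printf 'SELECT 1;\n-- db_pd.Clients lives here\nSELECT 2;\n' > sample.sql

# C locale + fixed-string matching, no -i:
LC_ALL=C grep -F -A 5 -B 5 'db_pd.Clients' sample.sql

rm sample.sql
```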
How to use grep with large (millions) number of files to search for string and get result in few minutes
You should remove the -0 argument to xargs and increase the -n parameter instead:
... | xargs -n16 ...
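A quick way to see what -n does, with echo standing in for grep: with -n2, xargs packs at most two arguments into each invocation.

```shell
# Six arguments, two per echo invocation -> three lines of output:
#   a b
#   c d
#   e f
printf '%s\n' a b c d e f | xargs -n2 echo
```

Larger -n values mean fewer grep processes are spawned, which matters when you have millions of files.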
Is it more efficient to grep twice or use a regular expression once?
grep -E '(foo|bar)'
will find lines containing 'foo' OR 'bar'.
You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:
sed '/foo/!d;/bar/!d' file.log
awk '/foo/ && /bar/' file.log
Both commands -- in theory -- should be much more efficient than your cat | grep | grep construct because:
- Both sed and awk perform their own file reading; no need for pipe overhead.
- The 'programs' I gave to sed and awk above use Boolean short-circuiting to quickly skip lines not containing 'foo', thus testing only lines containing 'foo' against the /bar/ regex.
However, I haven't tested them. YMMV :)
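To sanity-check the two one-liners, here is a runnable version on an invented file.log:

```shell
printf 'foo only\nfoo and bar\nbar only\n' > file.log

# Keep only lines matching both patterns, in either order.
awk '/foo/ && /bar/' file.log      # prints: foo and bar
sed '/foo/!d;/bar/!d' file.log     # prints: foo and bar

rm file.log
```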
efficiently grep strings between 2 patterns in LARGE log files
You can try:
awk '/pattern1/,/pattern2/'
In my experience mawk can be significantly faster than sed with this kind of operation, and is usually the fastest. Alternatively, gawk4 can be much faster than gawk3, so you could try that too.
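The range pattern prints everything from a line matching the first pattern through the next line matching the second, inclusive; a minimal demo with invented markers:

```shell
printf 'before\nSTART\npayload\nEND\nafter\n' > demo.log

# Prints START, payload, END -- the markers themselves are included.
awk '/START/,/END/' demo.log

rm demo.log
```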
--edit--
FWIW, just did a small test on a file with 4 million lines
On MacOS 10.13:
sed : 1.62 real 1.61 user 0.00 sys
gsed : 1.31 real 1.30 user 0.00 sys
awk : 2.14 real 2.12 user 0.00 sys
gawk3: 5.05 real 3.90 user 1.13 sys
gawk4: 0.61 real 0.60 user 0.00 sys
mawk : 0.42 real 0.40 user 0.00 sys
On Centos 7.4:
gsed : 1.56 real 1.54 user 0.01 sys
gawk4: 1.31 real 1.29 user 0.01 sys
mawk : 0.56 real 0.54 user 0.01 sys
Fastest possible grep
Try GNU parallel, which includes an example of how to use it with grep:

grep -r greps recursively through directories. On multicore CPUs GNU parallel can often speed this up:

find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

This will run 1.5 jobs per core, and give 1000 arguments to grep.
For big files, it can split the input into several chunks with the --pipe and --block arguments:
parallel --pipe --block 2M grep foo < bigfile
You could also run it on several different machines through SSH (ssh-agent needed to avoid passwords):
parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile
Bash How to efficiently manipulate a grep -Poz multiline output?
The problem is with your read command. By default, read reads until a newline, but you are trying to process null-separated strings.
You should be able to use
while IFS= read -r -d '' LINE ; do
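A self-contained sketch (bash, since read -d '' is a bash feature); the record contents are invented:

```shell
# grep -z / -Poz emit NUL-terminated records; read -d '' consumes
# them one at a time without splitting on newlines inside a record.
printf 'alpha\0beta\0' | while IFS= read -r -d '' rec; do
    printf '[%s]\n' "$rec"
done
# prints: [alpha] then [beta]
```

IFS= prevents leading/trailing whitespace from being stripped, and -r stops backslashes being interpreted as escapes.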
grep a large list against a large file
Try
grep -f the_ids.txt huge.csv
Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
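A tiny reproduction of the approach, with invented contents for the_ids.txt and huge.csv:

```shell
printf 'id42\nid99\n' > the_ids.txt
printf 'id42,alice\nid07,bob\nid99,carol\n' > huge.csv

# -f reads one pattern per line from the_ids.txt;
# -F treats each pattern as a literal string, not a regex.
grep -Ff the_ids.txt huge.csv
# prints the id42 and id99 rows

rm the_ids.txt huge.csv
```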
Efficient search of several strings in a text file
There's the -m option, which limits the number of matches:
-m NUM, --max-count=NUM
Stop reading a file after NUM matching lines.
You can't use it directly with your complex pattern, though, because then you'll only get 1 line for all subpatterns. What you can do is loop over your subpatterns, calling fgrep -m 1:
for pat in $patterns; do
    fgrep -m 1 "$pat" my_file
done
P.S. Another option is to use the complex pattern as you do and specify the number of matches equal to the number of subpatterns, but that'll result in slower comparison for each file line.
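A runnable version of the loop above, with invented patterns and file contents (grep -F is the modern spelling of fgrep):

```shell
printf 'noise\nalpha one\nalpha two\nbeta one\n' > my_file

# -m 1 makes grep stop reading after the first match per pattern.
for pat in alpha beta; do
    grep -F -m 1 "$pat" my_file
done
# prints: alpha one, then beta one

rm my_file
```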
To grep efficiently by upward recursion
If you want to grep recursively, use -R/-r and a path:
grep -R "TODO" .
So either you're missing the path (.) or I misunderstand your question.