How to Use Grep Efficiently

How to use grep efficiently?

If you have xargs installed and are on a multi-core processor, you can benefit from the following, in case anyone is interested.

Environment:

Processor: Dual Quad-core 2.4GHz
Memory: 32 GB
Number of files: 584450
Total Size: ~ 35 GB

Tests:

1. Find the necessary files, pipe them to xargs and tell it to run 8 grep instances in parallel.

time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P8 grep -H "string" >> Strings_find8

real 3m24.358s
user 1m27.654s
sys 9m40.316s

2. Find the necessary files, pipe them to xargs and tell it to run 4 grep instances in parallel.

time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P4 grep -H "string" >> Strings

real 16m3.051s
user 0m56.012s
sys 8m42.540s

3. Suggested by @Stephen: Find the necessary files and use + instead of xargs

time find ./ -name "*.ext" -exec grep -H "string" {} \+ >> Strings

real 53m45.438s
user 0m5.829s
sys 0m40.778s

4. Regular recursive grep.

grep -R "string" >> Strings

real 235m12.823s
user 38m57.763s
sys 38m8.301s

For my purposes, the first command worked just fine.
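
If you don't know the core count in advance, a variant of the first command can derive it at run time; this is a sketch assuming GNU coreutils' nproc is available, and the output file name is just a placeholder:

find ./ -name "*.ext" -print0 | xargs -0 -n1 -P"$(nproc)" grep -H "string" >> Strings_find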

Grepping a huge file (80 GB): any way to speed it up?

Here are a few options:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep because you're searching for a fixed string, not a regular expression.

3) Remove the -i option, if you don't need it.

So your command becomes:

LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql

It will also be faster if you copy your file to a RAM disk.
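
For instance, on most Linux systems /dev/shm is a tmpfs mount; assuming it has enough free memory to hold the file, something like this keeps all reads in RAM:

cp eightygigsfile.sql /dev/shm/
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' /dev/shm/eightygigsfile.sql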

How to use grep with a large number (millions) of files to search for a string and get the result in a few minutes

You should remove the -0 argument to xargs and increase the -n parameter instead:

... | xargs -n16 ...
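
Put together with find, that might look like the sketch below; the name pattern and output file are placeholders, and dropping -print0/-0 assumes the file names contain no spaces or newlines:

find . -name "*.ext" | xargs -n16 grep -H "string" > results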

Is it more efficient to grep twice or use a regular expression once?

grep -E '(foo|bar)' will find lines containing 'foo' OR 'bar'.

You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:

sed '/foo/!d;/bar/!d' file.log

awk '/foo/ && /bar/' file.log

Both commands should, in theory, be much more efficient than your cat | grep | grep construct because:

  • Both sed and awk perform their own file reading; no need for pipe overhead
  • The 'programs' I gave to sed and awk above use Boolean short-circuiting to quickly skip lines not containing 'foo', so only lines that do contain 'foo' are tested against the /bar/ regex

However, I haven't tested them. YMMV :)
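
For comparison, the chained-grep approach can at least drop the cat, since grep reads files itself:

grep 'foo' file.log | grep 'bar'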

efficiently grep strings between 2 patterns in LARGE log files

You can try:

awk '/pattern1/,/pattern2/'

In my experience mawk can be significantly faster than sed for this kind of operation, and is usually the fastest overall. gawk4 can also be much faster than gawk3, so you could try that too.
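
For reference, the equivalent invocations for the tools compared below would look roughly like this (pattern1, pattern2 and the file name are placeholders):

mawk '/pattern1/,/pattern2/' big.log
gawk '/pattern1/,/pattern2/' big.log
sed -n '/pattern1/,/pattern2/p' big.log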

--edit--

FWIW, I just did a small test on a file with 4 million lines.

On macOS 10.13:

sed  : 1.62 real  1.61 user  0.00 sys
gsed : 1.31 real  1.30 user  0.00 sys
awk  : 2.14 real  2.12 user  0.00 sys
gawk3: 5.05 real  3.90 user  1.13 sys
gawk4: 0.61 real  0.60 user  0.00 sys
mawk : 0.42 real  0.40 user  0.00 sys

On CentOS 7.4:

gsed : 1.56 real  1.54 user  0.01 sys
gawk4: 1.31 real  1.29 user  0.01 sys
mawk : 0.56 real  0.54 user  0.01 sys

Fastest possible grep

Try GNU parallel; its documentation includes an example of how to use it with grep:

grep -r greps recursively through directories. On multicore CPUs GNU parallel can often speed this up.

find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

This will run 1.5 jobs per core and give 1000 arguments to grep.

For big files, it can split the input into several chunks with the --pipe and --block arguments:

 parallel --pipe --block 2M grep foo < bigfile

You could also run it on several different machines through SSH (an ssh-agent is needed to avoid password prompts):

parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile

Bash: How to efficiently manipulate multiline grep -Poz output?

The problem is with your read command. By default, read will read until a newline, but you are trying to process null-separated strings.

You should be able to use

while IFS= read -r -d '' LINE ; do
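
A fuller sketch of that loop, feeding it null-separated matches from grep -Poz through process substitution (the pattern and file name are placeholders):

# -d '' makes read stop at NUL bytes instead of newlines
while IFS= read -r -d '' LINE; do
    printf 'match: %s\n' "$LINE"
done < <(grep -Poz 'some_pattern' input.txt)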

grep a large list against a large file

Try

grep -f the_ids.txt huge.csv

Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.

-F, --fixed-strings
       Interpret PATTERN as a list of fixed strings, separated by
       newlines, any of which is to be matched. (-F is specified by
       POSIX.)
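
Combined, the call would look like this:

grep -F -f the_ids.txt huge.csv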

Efficient search of several strings in a text file

There's the -m option, which limits the number of matches:

-m NUM, --max-count=NUM
       Stop reading a file after NUM matching lines.

You can't use it directly with your complex pattern, though, because then you'll only get one line for all subpatterns combined. What you can do is loop over your subpatterns, calling fgrep -m 1 for each:

for pat in $patterns; do
    fgrep -m 1 "$pat" my_file
done

P.S. Another option is to use the complex pattern as you do and set the number of matches equal to the number of subpatterns, but that will result in a slower comparison for each line of the file.
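
A rough sketch of that alternative, with three hypothetical subpatterns:

grep -m 3 -E 'pat1|pat2|pat3' my_file

Note that -m counts matching lines, so a single subpattern matching three times would already stop the search.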

To grep efficiently by upward recursion

If you want to grep recursively, use -R/-r and a path:

grep -R "TODO" .

So either you're missing the path (.) or I misunderstand your question.


