How to use grep efficiently?
If you have xargs installed and a multi-core processor, the following may be of interest.
Environment:
Processor: Dual Quad-core 2.4GHz
Memory: 32 GB
Number of files: 584450
Total Size: ~ 35 GB
Tests:
1. Find the necessary files, pipe them to xargs and tell it to execute 8 instances.
time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P8 grep -H "string" >> Strings_find8
real 3m24.358s
user 1m27.654s
sys 9m40.316s
2. Find the necessary files, pipe them to xargs and tell it to execute 4 instances.
time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P4 grep -H "string" >> Strings
real 16m3.051s
user 0m56.012s
sys 8m42.540s
3. Suggested by @Stephen: Find the necessary files and use + instead of xargs
time find ./ -name "*.ext" -exec grep -H "string" {} \+ >> Strings
real 53m45.438s
user 0m5.829s
sys 0m40.778s
4. Regular recursive grep.
grep -R "string" >> Strings
real 235m12.823s
user 38m57.763s
sys 38m8.301s
For my purposes, the first command worked just fine.
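A minimal, self-contained sketch of the winning pattern from test 1; the temporary directory, file names, and search string are invented for the demo:

```shell
# Create a throwaway directory with a couple of sample files.
dir=$(mktemp -d)
printf 'needle in a haystack\n' > "$dir/a.ext"
printf 'nothing to see here\n'  > "$dir/b.ext"

# -print0 / -0 keep odd filenames safe; -n1 gives each grep one
# file; -P8 runs up to eight grep processes at once; -H prints
# the filename even when grep receives a single file argument.
find "$dir" -name '*.ext' -print0 \
    | xargs -0 -n1 -P8 grep -H 'needle' > "$dir/matches"

cat "$dir/matches"
rm -rf "$dir"
```

Tune -P to your core count; with -n1 each file is a separate grep invocation, which is what lets xargs keep all cores busy.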
Grepping a huge file (80GB) any way to speed it up?
Here are a few options:
1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.
2) Use fgrep, because you're searching for a fixed string, not a regular expression.
3) Remove the -i option if you don't need it.
So your command becomes:
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
It will also be faster if you copy your file to RAM disk.
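A toy illustration of the three tips combined; the file name and contents are stand-ins, and grep -F is the modern spelling of fgrep:

```shell
# Build a tiny stand-in for the 80 GB file.
printf 'SELECT 1;\n-- db_pd.Clients lives here\nSELECT 2;\n' > sample.sql

# C locale + fixed-string matching, no -i:
LC_ALL=C grep -F -A 5 -B 5 'db_pd.Clients' sample.sql

rm sample.sql
```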
How to use grep with large (millions) number of files to search for string and get result in few minutes
You should remove the -0 argument to xargs and increase the -n parameter instead:
... | xargs -n16 ...
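A quick way to see what -n does, with echo standing in for grep: with -n2, xargs packs at most two arguments into each invocation.

```shell
# Six arguments, two per echo invocation -> three lines of output:
#   a b
#   c d
#   e f
printf '%s\n' a b c d e f | xargs -n2 echo
```

Larger -n values mean fewer grep processes are spawned, which matters when you have millions of files.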
Is it more efficient to grep twice or use a regular expression once?
grep -E '(foo|bar)'
will find lines containing 'foo' OR 'bar'.
You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:
sed '/foo/!d;/bar/!d' file.log
awk '/foo/ && /bar/' file.log
Both commands -- in theory -- should be much more efficient than your cat | grep | grep construct because:
- Both sed and awk perform their own file reading; no need for pipe overhead.
- The 'programs' I gave to sed and awk above use Boolean short-circuiting to quickly skip lines not containing 'foo', thus testing only lines containing 'foo' against the /bar/ regex.
However, I haven't tested them. YMMV :)
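To sanity-check the two one-liners, here is a runnable version on an invented file.log:

```shell
printf 'foo only\nfoo and bar\nbar only\n' > file.log

# Keep only lines matching both patterns, in either order.
awk '/foo/ && /bar/' file.log      # prints: foo and bar
sed '/foo/!d;/bar/!d' file.log     # prints: foo and bar

rm file.log
```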
efficiently grep strings between 2 patterns in LARGE log files
You can try:
awk '/pattern1/,/pattern2/'
In my experience mawk can be significantly faster than sed with this kind of operation, and is usually the fastest. Alternatively, gawk4 can be much faster than gawk3, so you could try that too.
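The range pattern prints everything from a line matching the first pattern through the next line matching the second, inclusive; a minimal demo with invented markers:

```shell
printf 'before\nSTART\npayload\nEND\nafter\n' > demo.log

# Prints START, payload, END -- the markers themselves are included.
awk '/START/,/END/' demo.log

rm demo.log
```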
--edit--
FWIW, just did a small test on a file with 4 million lines
On MacOS 10.13:
sed : 1.62 real 1.61 user 0.00 sys
gsed : 1.31 real 1.30 user 0.00 sys
awk : 2.14 real 2.12 user 0.00 sys
gawk3: 5.05 real 3.90 user 1.13 sys
gawk4: 0.61 real 0.60 user 0.00 sys
mawk : 0.42 real 0.40 user 0.00 sys
On Centos 7.4:
gsed : 1.56 real 1.54 user 0.01 sys
gawk4: 1.31 real 1.29 user 0.01 sys
mawk : 0.56 real 0.54 user 0.01 sys
Fastest possible grep
Try GNU parallel, which includes an example of how to use it with grep:

grep -r greps recursively through directories. On multicore CPUs GNU parallel can often speed this up:

find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

This will run 1.5 jobs per core, and give 1000 arguments to grep.
For big files, it can split the input into several chunks with the --pipe and --block arguments:
parallel --pipe --block 2M grep foo < bigfile
You could also run it on several different machines through SSH (ssh-agent needed to avoid passwords):
parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile
Bash How to efficiently manipulate a grep -Poz multiline output?
The problem is with your read command. By default, read reads until a newline, but you are trying to process null-separated strings.
You should be able to use
while IFS= read -r -d '' LINE ; do
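A self-contained sketch (bash, since read -d '' is a bash feature); the record contents are invented:

```shell
# grep -z / -Poz emit NUL-terminated records; read -d '' consumes
# them one at a time without splitting on newlines inside a record.
printf 'alpha\0beta\0' | while IFS= read -r -d '' rec; do
    printf '[%s]\n' "$rec"
done
# prints: [alpha] then [beta]
```

IFS= prevents leading/trailing whitespace from being stripped, and -r stops backslashes being interpreted as escapes.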
grep a large list against a large file
Try
grep -f the_ids.txt huge.csv
Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
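A tiny reproduction of the approach, with invented contents for the_ids.txt and huge.csv:

```shell
printf 'id42\nid99\n' > the_ids.txt
printf 'id42,alice\nid07,bob\nid99,carol\n' > huge.csv

# -f reads one pattern per line from the_ids.txt;
# -F treats each pattern as a literal string, not a regex.
grep -Ff the_ids.txt huge.csv
# prints the id42 and id99 rows

rm the_ids.txt huge.csv
```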
Efficient search of several strings in a text file
There's the -m option, which limits the number of matches:
-m NUM, --max-count=NUM
Stop reading a file after NUM matching lines.
You can't use it directly with your complex pattern, though, because then you'll only get 1 line for all subpatterns. What you can do is loop over your subpatterns, calling fgrep -m 1:
for pat in $patterns; do
    fgrep -m 1 "$pat" my_file
done
P.S. Another option is to use the complex pattern as you do and specify the number of matches equal to the number of subpatterns, but that'll result in slower comparison for each file line.
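A runnable version of the loop above, with invented patterns and file contents (grep -F is the modern spelling of fgrep):

```shell
printf 'noise\nalpha one\nalpha two\nbeta one\n' > my_file

# -m 1 makes grep stop reading after the first match per pattern.
for pat in alpha beta; do
    grep -F -m 1 "$pat" my_file
done
# prints: alpha one, then beta one

rm my_file
```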
To grep efficiently by upward recursion
If you want to grep recursively, use -R/-r and a path:
grep -R "TODO" .
So either you're missing the path (.) or I misunderstand your question.