grep a large list against a large file
Try:

grep -f the_ids.txt huge.csv

Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.
-F, --fixed-strings
       Interpret PATTERN as a list of fixed strings, separated by
       newlines, any of which is to be matched.  (-F is specified by
       POSIX.)
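A minimal sketch of the combination (the file names come from the question; the contents below are hypothetical, just to show the behavior):

```shell
# Hypothetical patterns file: one fixed string per line
printf 'id42\nid99\n' > the_ids.txt
# Hypothetical data file
printf 'row1,id42,foo\nrow2,id07,bar\nrow3,id99,baz\n' > huge.csv

# -f reads patterns from a file; -F treats each one as a fixed string,
# so none of them is compiled as a regular expression
grep -Ff the_ids.txt huge.csv
```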
Grepping a huge file (80GB) any way to speed it up?
Here are a few options:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.
2) Use fgrep, because you're searching for a fixed string, not a regular expression.
3) Remove the -i option, if you don't need it.
So your command becomes:
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
It will also be faster if you copy your file to a RAM disk.
grep -vf too slow with large files
Based on Inian's solution in the related post, this awk command should solve your issue:
awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > op.txt
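A quick sketch of how that one-liner behaves, using hypothetical file contents: FNR==NR is true only while awk is reading the first file, so every line of filter.txt becomes a key in the hash; after that, a line of data.txt is printed only if it is absent from the hash.

```shell
# Hypothetical inputs
printf 'apple\nbanana\n' > filter.txt
printf 'apple\ncherry\nbanana\ndate\n' > data.txt

# Keep only the lines of data.txt that do not appear in filter.txt
awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > op.txt
cat op.txt
```

Unlike grep -vf, this does a single hash lookup per line instead of trying every pattern, which is why it scales to large filter files.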
How to use grep with large (millions) number of files to search for string and get result in few minutes
You should remove the -0 argument to xargs and raise the -n parameter instead:
... | xargs -n16 ...
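For instance, a sketch of what the full pipeline might look like (the find invocation, directory name, and search string are hypothetical stand-ins for the question's setup):

```shell
# Hypothetical directory with one matching and one non-matching file
mkdir -p demo
printf 'needle\n' > demo/a.txt
printf 'hay\n'    > demo/b.txt

# -n16 hands grep up to 16 file names per invocation;
# -l prints only the names of files that contain a match
find demo -type f -name '*.txt' | xargs -n16 grep -l 'needle'
```

Note that without -0, file names containing whitespace or newlines will be split incorrectly, so this variant is only safe when the names are well-behaved.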
Comparing large files with grep or python
This method creates a set from the first file (listA). The only memory requirement is enough space to hold this set. It then iterates through each URL in the listB.txt file (very memory efficient). If the URL is not in this set, it writes it to a new file (also very memory efficient).
filename_1 = 'listA.txt'
filename_2 = 'listB.txt'
filename_3 = 'listC.txt'

with open(filename_1, 'r') as f1, open(filename_2, 'r') as f2, open(filename_3, 'w') as fout:
    s = set(line.strip() for line in f1)
    for row in f2:
        row = row.strip()
        if row not in s:
            fout.write(row + '\n')
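As a side note, since this whole section is about grep, the same set difference can also be sketched in the shell with grep itself: -v inverts the match, -x requires whole-line matches, and -F treats each line of listA.txt as a fixed string. The file names follow the snippet above; the contents here are hypothetical.

```shell
# Hypothetical URL lists
printf 'http://a.com\nhttp://b.com\n' > listA.txt
printf 'http://a.com\nhttp://c.com\n' > listB.txt

# Lines of listB.txt that do not appear (as whole lines) in listA.txt
grep -vxFf listA.txt listB.txt > listC.txt
cat listC.txt
```

For very large listA.txt files the Python set (or the awk hash shown earlier) will usually be faster, since some grep implementations handle huge -f pattern files poorly.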
grep in a directory for the set of search patterns saved in a file
With an awk that supports nextfile, e.g. GNU awk, and assuming searchpattern is just a file of newline-separated file names and B1 doesn't contain more files than can fit in a shell command's argument list:
awk '
NR==FNR { names[$0]; next }
FILENAME in names { print FILENAME }
{ nextfile }
' searchpattern B1/*
exclude regular expression and process very large files
Using awk with " as the delimiter, so basically every even-numbered field is a word (blabla"word"blabla"another_word"...):
$ awk -F\" 'NR==FNR{a[$1];next}!($4 in a)' exclude original
Output:
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />
Edit: Just noticed the words should be compared only in "block-list:name". Field placement matters in that command, so I changed the !($2 in a)&&!($4 in a) to !($4 in a). If the placement of block-list:name varies, use:
$ awk '
NR==FNR {                             # process the exclude file
    a[$1]                             # hash word
    next
}
{                                     # process the original file
    for(i=1;i<=NF;i++)                # loop over every space-separated string
        if($i~/^block-list:name=/) {  # when we meet the desired string
            t=$i                      # copy string to temp var
            gsub(/^[^"]+"|".*/,"",t)  # extract the word
            if(!(t in a))             # if the word is not to be excluded
                print                 # output record
            next                      # move to the next record anyway
        }
}' exclude original