Grep a Large List Against a Large File

Try

grep -f the_ids.txt huge.csv

Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.

-F, --fixed-strings
       Interpret PATTERN as a list of fixed strings, separated by
       newlines, any of which is to be matched.  (-F is specified by
       POSIX.)
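
Combined with -f, a minimal invocation might look like this (matches.csv is just a hypothetical output file):

# the_ids.txt holds one fixed string per line; huge.csv is the file to search
grep -F -f the_ids.txt huge.csv > matches.csv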

Grepping a huge file (80GB), any way to speed it up?

Here are a few options:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep because you're searching for a fixed string, not a regular expression.

3) Remove the -i option, if you don't need it.

So your command becomes:

LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql

It will also be faster if you copy your file to RAM disk.
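
For that last point, a rough sketch on Linux is to stage the file in tmpfs, assuming the machine has enough free RAM to hold the whole file:

# /dev/shm is a RAM-backed tmpfs on most Linux systems; adjust the path if yours differs
cp eightygigsfile.sql /dev/shm/
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' /dev/shm/eightygigsfile.sql
rm /dev/shm/eightygigsfile.sql   # free the RAM afterwards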

grep -vf too slow with large files

Based on Inian's solution in the related post, this awk command should solve your issue:

awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > op.txt
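
For illustration, with hypothetical contents for filter.txt and data.txt, the command keeps only the lines of data.txt that do not appear verbatim in filter.txt:

$ cat filter.txt
alpha
gamma
$ cat data.txt
alpha
beta
gamma
delta
$ awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt
beta
delta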

How to use grep with a large number (millions) of files to search for a string and get results in a few minutes

You should remove the -0 argument to xargs and increase the -n parameter instead:

... | xargs -n16 ...
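
A fuller, hypothetical version of that pipeline (the directory and search string are placeholders, and the file names are assumed to contain no whitespace since -0 is gone):

# run grep on 16 files per invocation; -l prints only the names of matching files
find /path/to/files -type f | xargs -n16 grep -l 'search string'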

Comparing large files with grep or python

This method builds a set from the first file (listA.txt); the only memory requirement is enough space to hold that set. It then iterates through listB.txt one URL at a time (very memory efficient), and any URL that is not in the set is written to a new file (also very memory efficient).

filename_1 = 'listA.txt'
filename_2 = 'listB.txt'
filename_3 = 'listC.txt'

with open(filename_1, 'r') as f1, open(filename_2, 'r') as f2, open(filename_3, 'w') as fout:
    # build a set of the URLs in listA.txt, stripped of surrounding whitespace
    s = set(val.strip() for val in f1.readlines())
    # stream listB.txt line by line, writing out only URLs not present in the set
    for row in f2:
        row = row.strip()
        if row not in s:
            fout.write(row + '\n')
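
For comparison, much the same result can be had from a single grep call (-x forces whole-line matches, -F treats the patterns as fixed strings), though grep likewise has to hold every line of listA.txt in memory:

grep -v -x -F -f listA.txt listB.txt > listC.txt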

grep in a directory for the set of search patterns saved in a file

With an awk that supports nextfile, e.g. GNU awk, and assuming searchpattern is just a file of newline-separated file names and B1 doesn't contain more files than will fit in a shell command's argument list:

awk '
NR==FNR { names[$0]; next }
FILENAME in names { print FILENAME }
{ nextfile }
' searchpattern B1/*
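
A small hypothetical run, assuming the listed files are non-empty; note that the entries in searchpattern must match FILENAME exactly as awk sees it, i.e. including the B1/ prefix:

$ cat searchpattern
B1/a.txt
B1/c.txt
$ ls B1
a.txt  b.txt  c.txt
$ awk 'NR==FNR { names[$0]; next } FILENAME in names { print FILENAME } { nextfile }' searchpattern B1/*
B1/a.txt
B1/c.txt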

exclude regular expression and process very large files

Using awk with " as the delimiter, so basically every even-numbered field is a word (blabla"word"blabla"another_word"...):

$ awk -F\" 'NR==FNR{a[$1];next}!($4 in a)' exclude original

Output:

<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />

Edit: Just noticed that only the words in "block-list:name" should be compared and that the placement matters in these records, so I changed the !($2 in a)&&!($4 in a) to !($4 in a). If the placement of block-list:name varies, use:

$ awk '
NR==FNR {                            # process the exclude file
    a[$1]                            # hash word
    next
}
{                                    # process the original file
    for(i=1;i<=NF;i++)               # loop over every space-separated string
        if($i~/^block-list:name=/) { # when we meet the desired string
            t=$i                     # copy string to temp var
            gsub(/^[^"]+"|".*/,"",t) # extract the word
            if(!(t in a))            # if the word is not to be excluded
                print                # output record
            next                     # move to the next record anyway
        }
}' exclude original
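
As a quick check, here is a hypothetical pair of small input files and the result of running either version on them; lines whose block-list:name word appears in exclude are dropped:

$ cat exclude
table
total
$ cat original
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
$ awk -F\" 'NR==FNR{a[$1];next}!($4 in a)' exclude original
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />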

