What's the Best Way to Search for a String in a File

What's the best way to search for a string in a file?


File.readlines(filename).grep(/string/)

This loads the whole file into memory (slurps the file). You should avoid slurping when dealing with large files; read one line at a time instead of the whole file:

File.foreach(filename).grep(/string/)

It's good practice to clean up after yourself rather than letting the garbage collector handle it at some point. This is more important if your program is long-lived and not just some quick script. Using a code block ensures that the File object is closed when the block terminates.

File.open(filename) do |file|
  file.grep(/string/)
end

How to find all files containing specific text (string) on Linux?

Do the following:

grep -rnw '/path/to/somewhere/' -e 'pattern'
  • -r or -R is recursive,
  • -n is line number, and
  • -w stands for match the whole word.
  • -l (lower-case L) can be added to just give the file name of matching files.
  • -e is the pattern used during the search

Along with these, --exclude, --include, --exclude-dir flags could be used for efficient searching:

  • This will only search through those files which have .c or .h extensions:
grep --include=\*.{c,h} -rnw '/path/to/somewhere/' -e "pattern"
  • This will exclude searching all the files ending with .o extension:
grep --exclude=\*.o -rnw '/path/to/somewhere/' -e "pattern"
  • For directories it's possible to exclude one or more directories using the --exclude-dir parameter. For example, this will exclude the dirs dir1/, dir2/ and all of them matching *.dst/:
grep --exclude-dir={dir1,dir2,*.dst} -rnw '/path/to/somewhere/' -e "pattern"

This works very well for me, for almost the same purpose as yours.

For more options, see man grep.
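
For cases where grep is not available, the same idea can be sketched in Python. This is only a rough approximation of grep -rnw with --include and --exclude-dir, not a replacement for it: the function name grep_recursive and its parameters are made up for this example, the search term is treated as a literal string rather than a regular expression, and \b word boundaries only approximate grep's -w behaviour.

```python
import fnmatch
import os
import re


def grep_recursive(root, word, include=("*",), exclude_dirs=()):
    """Yield (path, line_number, line) for lines containing `word` as a whole word."""
    # Assumption: `word` is a literal string; \b approximates grep's -w flag.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames
                       if not any(fnmatch.fnmatch(d, pat) for pat in exclude_dirs)]
        for name in filenames:
            # Keep only file names matching one of the include globs.
            if not any(fnmatch.fnmatch(name, pat) for pat in include):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it
```

Something like grep_recursive('/path/to/somewhere/', 'pattern', include=('*.c', '*.h'), exclude_dirs=('dir1', '*.dst')) would then mirror the --include and --exclude-dir examples above.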

How to search for a string in text files?

The reason why you always got True has already been given, so I'll just offer another suggestion:

If your file is not too large, you can read it into a string, and just use that (easier and often faster than reading and checking line per line):

with open('example.txt') as f:
    if 'blabla' in f.read():
        print("true")

Another trick: you can alleviate the possible memory problems by using mmap.mmap() to create a "string-like" object that uses the underlying file (instead of reading the whole file in memory):

import mmap

with open('example.txt') as f:
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    if s.find('blabla') != -1:
        print('true')

NOTE: in Python 3, mmaps behave like bytearray objects rather than strings, so the subsequence you look for with find() has to be a bytes object rather than a string as well, e.g. s.find(b'blabla'):

#!/usr/bin/env python3
import mmap

with open('example.txt', 'rb', 0) as file, \
     mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
    if s.find(b'blabla') != -1:
        print('true')

You could also use regular expressions on mmap e.g., case-insensitive search: if re.search(br'(?i)blabla', s):
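
To make the regular-expression variant concrete, here is a small sketch for Python 3. The helper name file_contains and its ignore_case parameter are invented for illustration; the needle is escaped so it is matched literally, and an empty file is treated as "not found" because mmap refuses to map zero bytes.

```python
import mmap
import re


def file_contains(path, needle, ignore_case=True):
    """Return True if `needle` (a bytes object) occurs in the file at `path`."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps the needle literal; the pattern must be bytes to search an mmap.
    pattern = re.compile(re.escape(needle), flags)
    with open(path, "rb") as handle:
        try:
            mapped = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            return False  # mmap cannot map an empty file
        with mapped:
            return pattern.search(mapped) is not None
```

Calling file_contains('example.txt', b'blabla') would then answer the original yes/no question without reading the whole file into a Python string.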

Fastest way to search string in large text file

The standard way to do this is to implement the Aho-Corasick algorithm. It reads the file one time and finds all occurrences of all the strings you give it. See https://www.informit.com/guides/content.aspx?g=dotnet&seqNum=869 for an article that provides an implementation and some examples.

Update after more info

Assuming that the list of numbers in your file A is small enough to fit in memory, here's what you'd do, using the implementation in the above-linked article:

// Construct the automaton
AhoCorasickStringSearcher matcher = new AhoCorasickStringSearcher();
foreach (var searchWord in File.ReadLines(File_a))
{
    matcher.AddItem(searchWord);
}
matcher.CreateFailureFunction();

// And then do the search on each file
foreach (var fileName in listOfFiles)
{
    foreach (var line in File.ReadLines(fileName))
    {
        var matches = matcher.Search(line);
        foreach (var m in matches)
        {
            // output match
        }
    }
}

Note that it only makes one pass through each file, and it never has to load more than one line of the file into memory at any time. The limiting factor here is the memory it takes to build the automaton.

I've used this to search files that totaled over 100 gigabytes, for about 15 million different strings. It takes a few minutes to build the automaton, but then it searches very quickly. One really nice property of the algorithm is that its complexity is O(n + m), where n is the size of the input files, and m is the number of matched items. The number of strings it's searching for doesn't matter. It can search for a million different strings just as quickly as it can search for one or two.

100 gigabytes will take you ... something on the order of about 40 minutes to read. I'd be really surprised if it took an hour for this to find all occurrences of 15 million different strings in 100 gigabytes of data.
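
For readers who want to see the shape of the algorithm itself, below is a minimal Aho-Corasick sketch in Python. It is not the implementation from the linked article: the class name AhoCorasick and its search method are names chosen for this example, and it copies match lists along failure links for simplicity, which a version meant for millions of patterns would likely replace with output links to save memory.

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton for exact multi-pattern search."""

    def __init__(self, words):
        self.goto = [{}]      # goto[state][char] -> next state (state 0 is the root)
        self.fail = [0]       # failure link for each state
        self.output = [[]]    # words recognized on reaching each state
        for word in words:
            if word:          # ignore empty strings
                self._add(word)
        self._build_failure_links()

    def _add(self, word):
        # Insert the word into the trie, creating states as needed.
        state = 0
        for char in word:
            if char not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.output.append([])
                self.goto[state][char] = len(self.goto) - 1
            state = self.goto[state][char]
        if word not in self.output[state]:
            self.output[state].append(word)

    def _build_failure_links(self):
        # Breadth-first traversal, so shallower states are completed first.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for char, target in self.goto[state].items():
                queue.append(target)
                fallback = self.fail[state]
                while fallback and char not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[target] = self.goto[fallback].get(char, 0)
                # Inherit words that end at the failure state (they are suffixes).
                self.output[target] = self.output[target] + self.output[self.fail[target]]

    def search(self, text):
        """Yield (start_index, word) for every occurrence of every word in text."""
        state = 0
        for index, char in enumerate(text):
            while state and char not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(char, 0)
            for word in self.output[state]:
                yield index - len(word) + 1, word
```

With the words he, she, his and hers, list(AhoCorasick(["he", "she", "his", "hers"]).search("ushers")) gives [(1, 'she'), (2, 'he'), (2, 'hers')]. To search a file, build the automaton once and call search on each line as it is read.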

Matching whole words

Another option, if you're searching for whole words, is to ditch the Aho-Corasick algorithm. Instead, load all of the numbers you're looking for into a HashSet<string>. Then read each line, use a regular expression to find all of the numbers in the line, and check whether each one exists in the hash set. For example:

Regex re = new Regex(@"\w+");
foreach (var line in File.ReadLines(filename))
{
    var matches = re.Matches(line);
    foreach (Match m in matches)
    {
        if (hashSetOfValues.Contains(m.Value))
        {
            // output match
        }
    }
}

This will likely be somewhat slower than the Aho-Corasick algorithm, but it still makes only one pass through the data. This assumes, of course, that you have enough memory to hold all of those numbers in a hash set.
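
The hash-set approach translates almost line for line into Python. The following is a sketch under the same assumptions as above (the wanted values fit in memory, and \w+ is a good enough definition of a "word"); the function name find_whole_words is hypothetical.

```python
import re

WORD = re.compile(r"\w+")


def find_whole_words(lines, wanted):
    """Yield (line_number, word) for each whole word in `lines` that is in `wanted`."""
    wanted = set(wanted)  # average O(1) membership checks
    for number, line in enumerate(lines, start=1):
        # Split the line into \w+ tokens and test each one against the set.
        for match in WORD.finditer(line):
            if match.group() in wanted:
                yield number, match.group()
```

Because lines can be any iterable of strings, an open file object can be passed directly, which keeps memory use at one line at a time plus the set itself.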

There are other options for whole words, as I mention in the comments.

Another option, if you know that the words you're looking for are always separated by spaces, is to add spaces to the start and end of the words that you add to the automaton. Or, with some modification to the implementation itself, you could force the matcher's Search method to only return matches that occur in whole words. That could more easily handle matches at the start and end of lines, and additional non-word characters.

Cheap way to search a large text file for a string

If it is a "pretty large" file, then access the lines sequentially and don't read the whole file into memory:

with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            pass  # do something with the matching line

Best way to search for a string in a list of strings in C

Use a bool flag:

bool found = false;
while ((token = strsep(&str, ","))) {
    printf("\nLOOKING FOR: %s\n", token);
    while (fgets(line, sizeof line, fd)!=NULL) {
        ....
        found = true;
        break;
    }
    if (found) { ...

But a better way is to use a function:

//...
while ((token = strsep(&str, ","))) {
    printf("\nLOOKING FOR: %s\n", token);
    if (find_device(fd, token)) {
        printf("FOUND THE DEVICE!!");
    } else {
        printf("DID NOT FIND THE DEVICE!!");
    }
}

bool find_device(FILE *fd, char *token) {
    char line[13]; /* or some other suitable maximum line size */
    rewind(fd); /* start from the top of the file for every token */
    while (fgets(line, sizeof line, fd) != NULL) {
        line[strcspn(line, "\n")] = '\0'; /* fgets keeps the newline; strip it before comparing */
        if (strcasecmp(line, token) == 0)
            return true;
    }
    return false;
}

To read the file once, you can rearrange the original code:

char line[13];
while (fgets(line, sizeof line, fd) != NULL) {
    char *token, *str, *tofree;
    line[strcspn(line, "\n")] = '\0'; /* strip the newline kept by fgets */
    tofree = str = strdup(device_list_str);
    while ((token = strsep(&str, ","))) {
        printf("\nLOOKING FOR: %s\n", token);
        if (strcasecmp(line, token) == 0) {
            printf("FOUND THE DEVICE!!");
            break;
        }
        printf("DID NOT FIND THE DEVICE!!");
    }
    free(tofree);
}

but in this case you should change the function too:

bool find_device(char* line, char* device_list_str) {
    // ..., leaving this as an exercise
    return false;
}
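
As a language-neutral illustration of the single-pass idea (read the file once, compare each line against the whole device list), here is a short sketch in Python rather than C. The function name find_devices is hypothetical, and it assumes each device name occupies a whole line and that comparison should be case-insensitive, as strcasecmp suggests.

```python
def find_devices(lines, device_list_str):
    """Return the set of listed devices (lower-cased) that appear as a whole line."""
    # casefold() gives a case-insensitive comparison, similar in spirit to strcasecmp.
    wanted = {name.strip().casefold()
              for name in device_list_str.split(",") if name.strip()}
    found = set()
    for line in lines:
        candidate = line.strip().casefold()  # also drops the trailing newline
        if candidate in wanted:
            found.add(candidate)
            if found == wanted:
                break  # every device has been seen; stop reading early
    return found
```

Putting the device names in a set means each line costs one lookup, rather than one comparison per device.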

