Cheap Way to Search a Large Text File for a String

Cheap way to search a large text file for a string

If it is "pretty large" file, then access the lines sequentially and don't read the whole file into memory:

with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            pass  # do_something with the matching line

Fastest way to search string in large text file

The standard way to do this is to implement the Aho-Corasick algorithm. It reads the file one time and finds all occurrences of all the strings you give it. See https://www.informit.com/guides/content.aspx?g=dotnet&seqNum=869 for an article that provides an implementation and some examples.

Update after more info

Assuming that the list of numbers in your file A is small enough to fit in memory, here's what you'd do, using the implementation in the above-linked article:

// Construct the automaton
AhoCorasickStringSearcher matcher = new AhoCorasickStringSearcher();
foreach (var searchWord in File.ReadLines(File_a))
{
    matcher.AddItem(searchWord);
}
matcher.CreateFailureFunction();

// And then do the search on each file
foreach (var fileName in listOfFiles)
{
    foreach (var line in File.ReadLines(fileName))
    {
        var matches = matcher.Search(line);
        foreach (var m in matches)
        {
            // output match
        }
    }
}

Note that it only makes one pass through each file, and it never has to load more than one line of the file into memory at any time. The limiting factor here is the memory it takes to build the automaton.

I've used this to search files that totaled over 100 gigabytes, for about 15 million different strings. It takes a few minutes to build the automaton, but then it searches very quickly. One really nice property of the algorithm is that its complexity is O(n + m), where n is the size of the input files, and m is the number of matched items. The number of strings it's searching for doesn't matter. It can search for a million different strings just as quickly as it can search for one or two.

Reading 100 gigabytes sequentially at a typical hard-drive rate of roughly 40-50 MB/s works out to something on the order of 40 minutes. I'd be really surprised if it took an hour for this to find all occurrences of 15 million different strings in 100 gigabytes of data.

Matching whole words

Another option, if you're searching for whole words, is to ditch the Aho-Corasick algorithm. Instead, load all of the numbers you're looking for into a HashSet<string>. Then read each line, use a regular expression to find all of the numbers in the line, and check whether each one exists in the hash set. For example:

Regex re = new Regex(@"\w+");
foreach (var line in File.ReadLines(filename))
{
    var matches = re.Matches(line);
    foreach (Match m in matches)
    {
        if (hashSetOfValues.Contains(m.Value))
        {
            // output match
        }
    }
}

This will likely be somewhat slower than the Aho-Corasick algorithm, but it still makes only one pass through the data. This assumes, of course, that you have enough memory to hold all of those numbers in a hash set.

There are other options for whole words, as I mention in the comments.

Another option, if you know that the words you're looking for are always separated by spaces, is to add spaces to the start and end of the words you add to the automaton. Or, with some modification of the implementation itself, you could force the matcher's Search method to return only matches that occur in whole words; that would more easily handle matches at the start and end of lines, as well as other non-word separators.
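
A minimal sketch of the space-padding variant, reusing the AhoCorasickStringSearcher API from the article linked above (padding each line as well is an extra assumption, needed so words at the very start or end of a line still match):

// Pad each search term with spaces so the automaton can only match it as a
// whole, space-delimited word.
AhoCorasickStringSearcher matcher = new AhoCorasickStringSearcher();
foreach (var searchWord in File.ReadLines(File_a))
{
    matcher.AddItem(" " + searchWord + " ");
}
matcher.CreateFailureFunction();

foreach (var line in File.ReadLines(fileName))
{
    // Pad the line as well, so words at the start or end of a line still match.
    var matches = matcher.Search(" " + line + " ");
    foreach (var m in matches)
    {
        // output match
    }
}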

How to quickly search a large file for a String in Java?

First, figure out how long it takes to actually read the entire file's contents vs. how long it takes to scan them for your pattern.

If your results are dominated by the read time (and assuming you read the file properly, i.e. with channels or at the very least buffered readers), there's not much you can do.

If it's the scanning time that dominates, you could read the lines and then ship small batches of them to a work queue, where multiple threads pick up the batches and search them.
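
As a rough sketch of that batching idea (written in C# like the other examples on this page, although the question is about Java; fileName, pattern, and batchSize are placeholders to tune):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

string fileName = "largeFile.txt";   // placeholder
string pattern = "myString";         // placeholder
int batchSize = 10_000;              // tune for your workload

// Bounded queue so the reader can't run too far ahead of the searchers.
var queue = new BlockingCollection<List<string>>(boundedCapacity: 8);

// One searcher per core, each consuming line batches from the queue.
var workers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Run(() =>
    {
        foreach (var batch in queue.GetConsumingEnumerable())
            foreach (var line in batch)
                if (line.Contains(pattern))
                    Console.WriteLine(line);
    }))
    .ToArray();

// Reader: fill batches and hand them off as you go.
var current = new List<string>(batchSize);
foreach (var line in File.ReadLines(fileName))
{
    current.Add(line);
    if (current.Count == batchSize)
    {
        queue.Add(current);
        current = new List<string>(batchSize);
    }
}
if (current.Count > 0) queue.Add(current);
queue.CompleteAdding();

Task.WaitAll(workers);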

ballpark figures

  • assuming 50 MB/sec as the hard-drive read speed (and that's slow by modern standards), you should be able to read the entire file into memory in under 10 seconds.
  • MD5 hashing benchmarks show that hashing can run at least as fast as (and often faster than) the disk can be read. Also, plain string searching is faster, simpler, and parallelizes better than hashing.

Given those two estimates, I think a proper implementation can easily land you a run time on the order of 10 seconds (if you start kicking off search jobs as you read line batches), largely dominated by your disk read time.

Searching numerous very large (50GB+) txt files for text matching

Assuming the file has line breaks, it's simple enough to use a stream with a good buffer size. FileStream and the like have an internal buffer, and the underlying mechanism reads from disk as it needs to, letting you work through an entire file without running into the fundamental .NET array size limit or loading the whole file into memory.

Note that anything over 85 KB will end up on the Large Object Heap anyway, so you might want to be mindful of the buffer size one way or another.

using var sr = new StreamReader(
    new FileStream("SomeFileName",
        FileMode.Open,
        FileAccess.Read,
        FileShare.None,
        1024 * 1024, // some nasty buffer size that you have benchmarked for your system
        FileOptions.SequentialScan));

while (!sr.EndOfStream)
{
    if (sr.ReadLine().Contains("bob"))
        return true;
}

Notes: The buffer size will be key to performance here; SSDs can take a larger size than old spinning-platter HDDs. Determining the right size will require benchmarking.
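
If you want to find that size empirically, a rough sketch (the candidate sizes and file name are placeholders; Stopwatch lives in System.Diagnostics) is to time a full sequential read at each buffer size:

foreach (var bufferSize in new[] { 4 * 1024, 64 * 1024, 1024 * 1024, 4 * 1024 * 1024 })
{
    var sw = Stopwatch.StartNew();
    using (var sr = new StreamReader(
        new FileStream("SomeFileName", FileMode.Open, FileAccess.Read,
            FileShare.None, bufferSize, FileOptions.SequentialScan)))
    {
        // Read every line; we only care about the elapsed time, not the contents.
        while (!sr.EndOfStream)
            sr.ReadLine();
    }
    sw.Stop();
    Console.WriteLine($"{bufferSize,10} bytes: {sw.ElapsedMilliseconds} ms");
}

Beware that after the first pass the OS file cache will make later passes look faster; use a file bigger than RAM, or clear the cache between runs, to get honest numbers.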

Fastest Text search method in a large text file

  1. Load the whole text in RAM at once. Don't read line by line.
  2. Search for the pattern in the blob. If you find it, use text.count('\n',0,pos) to get the line number.
  3. If you don't need the line number, look for the previous and next EOL to cut the line out of the text.

The loop in Python is slow. String searching is very fast. If you need to look for several strings, use regular expressions.

If that's not fast enough, use an external program like grep.
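
For comparison, here is a minimal sketch of the same load-everything-and-search-the-blob idea in C# (to match the other examples on this page; File.ReadAllText and IndexOf play the roles of the Python blob search, and the newline count mirrors text.count('\n', 0, pos)):

// Read the whole file into one string and search it as a single blob.
string text = File.ReadAllText("largeFile.txt");

int pos = text.IndexOf("myString", StringComparison.Ordinal);
if (pos >= 0)
{
    // Count the newlines before the match to recover the line number.
    int lineNumber = 1;
    for (int i = 0; i < pos; i++)
        if (text[i] == '\n')
            lineNumber++;

    Console.WriteLine($"Found on line {lineNumber}");
}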

Search Large Text File for Thousands of strings

Algorithmically, I think the best way to approach this problem is to use a tree (a trie) that stores the strings you want to search for, one character at a time. For example, if you have the following patterns you would like to look for:

hand, has, have, foot, file

The resulting tree would look something like this:
[Image: tree generated by the list of search terms]

Building the tree is worst-case O(n) in the total length of the search terms, and it generally has a sub-linear memory footprint, since common prefixes are stored only once.

Using this structure, you can process your huge file by reading it one character at a time and walking the tree.

  • If you get to a leaf node (the ones shown in red), you have found a match, and can store it.
  • If there is no child node corresponding to the letter you have read, you can discard the current line and begin checking the next line, starting from the root of the tree.

This technique results in linear time, O(n), to check for matches, and scans the huge 20 GB file only once.

Edit

The algorithm described above is certainly sound (it doesn't give false positives) but not complete (it can miss some results). However, with a few minor adjustments it can be made complete, assuming that we don't have search terms with common roots like go and gone. The following is pseudocode for the complete version of the algorithm:

tree = construct_tree(['hand', 'has', 'have', 'foot', 'file'])
# Tracks every node in the tree that we are currently part-way through matching
nodes = []
for character in huge_file:
    next_nodes = []
    for node in nodes:
        if node.has_child(character):
            child = node.get_child(character)
            if child.is_leaf():
                pass  # You found a match!!
            else:
                next_nodes.append(child)
        # if there is no matching child, this partial match is simply dropped
    if tree.has_child(character):
        next_nodes.append(tree.get_child(character))
    nodes = next_nodes

Note that the list of nodes to check at each step is at most as long as the longest search term, so it should not add much complexity.

Matching a string in a Large text file?

Are you going to have to match against this text file several times? If so, I'd create a HashSet<string>. Otherwise, just read it line by line (I'm assuming there's one string per line) and see whether it matches.

152 MB of ASCII will end up as over 300 MB of Unicode data in memory - but modern machines have plenty of memory, so keeping the whole lot in a HashSet<string> will make repeated lookups very fast indeed.

The absolute simplest way to do this is probably to use File.ReadAllLines, although that will create an array which will then be discarded - not great for memory usage, but probably not too bad:

HashSet<string> strings = new HashSet<string>(File.ReadAllLines("data.txt"));
...

if (strings.Contains(stringToCheck))
{
    ...
}
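
If the throwaway array bothers you, a small variant builds the set from File.ReadLines, which streams the file lazily instead of materialising a string[] first:

// Same idea, but the lines are enumerated lazily rather than read into an array first.
HashSet<string> strings = new HashSet<string>(File.ReadLines("data.txt"));

if (strings.Contains(stringToCheck))
{
    // match found
}

And for the one-off case where you only need a single lookup, File.ReadLines("data.txt").Contains(stringToCheck) scans the file once without building the set at all.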

How to search string in large text file?

You can use this to search the file:

var output = File.ReadLines(FileName)
    .Where(line => line.Split(',')[1] == word)
    .FirstOrDefault();

But it won't solve this:

if the word I am looking for is in the last line of the text file, it will take a lot of time to get to it, and if the search is for more than one word and has to extract the line that contains each one, I think it will take a lot of time.

There's not a practical way to avoid this for a basic file.

The only ways around actually reading through the file are maintaining an index, which requires absolute control over everything that might write to the file, or guaranteeing that the file is already sorted by the columns that matter, in which case you can do something like a binary search.

But neither is likely for an arbitrary CSV file. This is one of the reasons people use databases.

However, we also need to stop and check whether this is really a problem for you. I'd expect the code above to handle files up to a couple hundred MB in around 1 to 2 seconds on modern hardware, even if you need to look through the whole file.
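
If you do need to look up several words in a single pass, one sketch of the idea (the wordsToFind set and the column index are assumptions for illustration) is to put the targets in a HashSet and collect matches as you stream the file once:

// Hypothetical set of search terms; load these from wherever they actually live.
var wordsToFind = new HashSet<string> { "firstWord", "secondWord" };
var matchingLines = new Dictionary<string, string>(); // word -> first line containing it

foreach (var line in File.ReadLines(FileName))
{
    var columns = line.Split(',');
    if (columns.Length > 1 && wordsToFind.Contains(columns[1])
        && !matchingLines.ContainsKey(columns[1]))
    {
        matchingLines[columns[1]] = line; // keep only the first hit per word
    }
}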


