Fast String Search in a Very Large File

Fastest way to search for a string in a large text file

The standard way to do this is to implement the Aho-Corasick algorithm. It reads the file one time and finds all occurrences of all the strings you give it. See https://www.informit.com/guides/content.aspx?g=dotnet&seqNum=869 for an article that provides an implementation and some examples.

Update after more info

Assuming that the list of numbers in your file A is small enough to fit in memory, here's what you'd do, using the implementation in the above-linked article:

// Construct the automaton
AhoCorasickStringSearcher matcher = new AhoCorasickStringSearcher();
foreach (var searchWord in File.ReadLines(File_a))
{
    matcher.AddItem(searchWord);
}
matcher.CreateFailureFunction();

// And then do the search on each file
foreach (var fileName in listOfFiles)
{
    foreach (var line in File.ReadLines(fileName))
    {
        var matches = matcher.Search(line);
        foreach (var m in matches)
        {
            // output match
        }
    }
}

Note that it only makes one pass through each file, and it never has to load more than one line of the file into memory at any time. The limiting factor here is the memory it takes to build the automaton.

I've used this to search files that totaled over 100 gigabytes, for about 15 million different strings. It takes a few minutes to build the automaton, but then it searches very quickly. One really nice property of the algorithm is that its complexity is O(n + m), where n is the size of the input files, and m is the number of matched items. The number of strings it's searching for doesn't matter. It can search for a million different strings just as quickly as it can search for one or two.

Reading 100 gigabytes will take you something on the order of 40 minutes. I'd be really surprised if it took this approach an hour to find all occurrences of 15 million different strings in 100 gigabytes of data.
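As a quick back-of-the-envelope check in Python (the ~40 MB/s sequential read throughput below is an assumption, roughly what that estimate implies, not a measurement):

# Rough estimate only; the throughput figure is assumed, not measured.
size_mb = 100 * 1024            # 100 GB expressed in MB
throughput_mb_per_s = 40        # assumed sequential read speed of a spinning disk
print(size_mb / throughput_mb_per_s / 60)   # about 43 minutes just to read the data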

Matching whole words

Another option, if you're searching for whole words, is to ditch the Aho-Corasick algorithm. Instead, load all of the numbers you're looking for into a HashSet<string>. Then read each line and use a regular expression to find all of the numbers in the line and check to see if they exist in the hash set. For example:

Regex re = new Regex(@"\w+");
foreach (var line in File.ReadLines(fileName))
{
    var matches = re.Matches(line);
    foreach (Match m in matches)
    {
        if (hashSetOfValues.Contains(m.Value))
        {
            // output match
        }
    }
}

This will likely be somewhat slower than the Aho-Corasick algorithm, but it still makes only one pass through the data. This assumes, of course, that you have enough memory to hold all of those numbers in a hash set.

There are other options for whole words, as I mention in the comments.

Another option, if you know that the words you're looking for are always separated by spaces, is to add spaces to the start and end of the words that you add to the automaton. Or, with some modification to the implementation itself, you could force the matcher's Search method to only return matches that occur in whole words. That could more easily handle matches at the start and end of lines, and additional non-word characters.
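As a rough illustration of the space-padding idea, here is a sketch using the third-party pyahocorasick Python package (not the C# implementation from the linked article); the words and the line below are made up:

import ahocorasick

words = ["12345", "67890"]          # hypothetical numbers from file A

A = ahocorasick.Automaton()
for w in words:
    # Pad each word with spaces so only whole, space-separated words can match.
    A.add_word(" " + w + " ", w)
A.make_automaton()

line = "12345 is here but 123456 is not"
# Pad the line too, so words at the start or end of the line can still match.
for end_index, found in A.iter(" " + line + " "):
    print(found)                     # prints 12345 only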

How to speed up a search on a large collection of text files (1 TB)

There are a lot of answers already; I just wanted to add my two cents:

  1. Having this much data (1 TB) with just 8 GB of memory will not be good enough for any approach, whether you use Lucene, Elasticsearch (which uses Lucene internally), or a grep command. The reason is simple: all of these systems keep data in the fastest memory in order to serve results quickly, and out of 8 GB (25% of which you should reserve for the OS, and at least another 25-50% for other applications) you are left with very few GB of RAM.
  2. Upgrading to an SSD or adding RAM will help, but it is quite cumbersome, and if you hit performance issues again, vertically scaling your system will be difficult.

Suggestion

  1. I know you already mentioned that you want to do this on your own system, but as I said it won't give you any real benefit, and you may end up wasting a lot of time on both infrastructure and code (there are many approaches, as the various answers show). I would therefore suggest the top-down approach described in my other answer for determining the right capacity; it will help you quickly identify the correct capacity for whichever approach you choose.
  2. Implementation-wise, I would suggest doing it with Elasticsearch (ES), as it is very easy to set up and scale. You can even use AWS Elasticsearch, which is available in the free tier, and scale it up quickly later. Although I am not a big fan of AWS ES, it saves a lot of setup time and lets you get started quickly if you are already familiar with ES.

  3. To make search faster, you can split each document into multiple fields (title, body, tags, author, etc.) and index only the important fields, which reduces the size of the inverted index. If you are looking only for exact string matches (no partial or full-text search), then you can simply use the keyword field type, which is even faster to index and search (a rough sketch follows after this list).

  4. I could go on about why Elasticsearch is good and how to optimize it, but that is not the crux. The bottom line is that any search needs a significant amount of memory, CPU, and disk, and any one of these becoming a bottleneck will hamper both your local searches and your other applications. Hence I would advise you to seriously consider doing this on an external system, and Elasticsearch really stands out here because it is built for distributed systems and is the most popular open-source search system today.
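As a rough sketch of the keyword-field idea from point 3 (using the official Elasticsearch Python client with its newer 8.x-style keyword arguments; the index name, field names, and the value searched for are made up for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumes a local single-node cluster

# Index only the fields you need; "keyword" means exact match only, no full-text analysis.
es.indices.create(
    index="documents",
    mappings={"properties": {
        "title": {"type": "keyword"},
        "body": {"type": "text"},
    }},
)

es.index(index="documents", document={"title": "error-code-1234", "body": "..."})

# Exact string match against the keyword field.
resp = es.search(index="documents", query={"term": {"title": "error-code-1234"}})
print(resp["hits"]["total"]["value"])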

Cheap way to search a large text file for a string

If it is a "pretty large" file, then access the lines sequentially and don't read the whole file into memory:

with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            pass  # do_something with the matching line

Is there any faster way to search in a huge file without loading it into memory?

I propose using a modified version of the Knuth-Morris-Pratt (KMP) algorithm.
algorithm kmp_search:
    input:
        a stream of characters, S (the text to be searched)
        an array of characters, W (the word sought)
    output:
        an integer (the zero-based position in S at which W is found)

    define variables:
        an integer, m ← 0 (the beginning of the current match in S)
        an integer, i ← 0 (the position of the current character in W)
        an array of integers, T (the table, computed elsewhere)

    while m + i < length(S) do
        if W[i] = S[m + i] then
            if i = length(W) - 1 then
                return m
            let i ← i + 1
        else
            if T[i] > -1 then
                let m ← m + i - T[i], i ← T[i]
            else
                let m ← m + 1, i ← 0

    (if we reach here, we have searched all of S unsuccessfully)
    return the length of S

The text string can be streamed in because the KMP algorithm does not backtrack in the text. (This is another improvement over the naive algorithm, which doesn't naturally support streaming.) If streaming, the amortized time to process an incoming character is Θ(1), but the worst-case time is Θ(min(m, n′)), where n′ is the number of text characters seen so far.

A reference (Java) implementation can be found in Twitter's elephant-bird library:

package com.twitter.elephantbird.util;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on how the algorithm works, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher {

  protected byte[] pattern_;
  protected int[] borders_;

  // An upper bound on pattern length for searching. Results are undefined for longer patterns.
  public static final int MAX_PATTERN_LENGTH = 1024;

  public StreamSearcher(byte[] pattern) {
    setPattern(pattern);
  }

  /**
   * Sets a new pattern for this StreamSearcher to use.
   * @param pattern
   *          the pattern the StreamSearcher will look for in future calls to search(...)
   */
  public void setPattern(byte[] pattern) {
    pattern_ = Arrays.copyOf(pattern, pattern.length);
    borders_ = new int[pattern_.length + 1];
    preProcess();
  }

  /**
   * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
   * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
   * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
   * another reasonable default, i.e. leave the stream unchanged.
   *
   * @return bytes consumed if found, -1 otherwise.
   * @throws IOException
   */
  public long search(InputStream stream) throws IOException {
    long bytesRead = 0;

    int b;
    int j = 0;

    while ((b = stream.read()) != -1) {
      bytesRead++;

      while (j >= 0 && (byte) b != pattern_[j]) {
        j = borders_[j];
      }
      // Move to the next character in the pattern.
      ++j;

      // If we've matched up to the full pattern length, we found it. Return,
      // which will automatically save our position in the InputStream at the point immediately
      // following the pattern match.
      if (j == pattern_.length) {
        return bytesRead;
      }
    }

    // No dice. Note that the stream is now completely consumed.
    return -1;
  }

  /**
   * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
   * and aids in implementation of the Knuth-Morris-Pratt string search.
   * <p>
   * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
   */
  protected void preProcess() {
    int i = 0;
    int j = -1;
    borders_[i] = j;
    while (i < pattern_.length) {
      while (j >= 0 && pattern_[i] != pattern_[j]) {
        j = borders_[j];
      }
      borders_[++i] = ++j;
    }
  }
}

Similar question: Efficient way to search a stream for a string

Fastest Text search method in a large text file

  1. Load the whole text in RAM at once. Don't read line by line.
  2. Search for the pattern in the blob. If you find it, use text.count('\n',0,pos) to get the line number.
  3. If you don't need the line number, look for the previous and next EOL to cut the line out of the text.

The loop in Python is slow. String searching is very fast. If you need to look for several strings, use regular expressions.
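A minimal sketch of this approach (the file name and the search string are placeholders):

with open('largeFile', 'r') as f:
    text = f.read()                      # read the whole file into RAM at once

pos = text.find('myString')              # fast C-level substring search
if pos != -1:
    line_number = text.count('\n', 0, pos) + 1   # 1-based line number
    start = text.rfind('\n', 0, pos) + 1          # character after the previous EOL
    end = text.find('\n', pos)                    # next EOL, or -1 at end of file
    line = text[start:] if end == -1 else text[start:end]
    print(line_number, line)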

If that's not fast enough, use an external program like grep.


