Fastest Way to Find Strings in a File

I would read it line by line and check the conditions. Once you have seen a group you can quit. This way you don't need to read the whole file into memory. Like this:

    public bool ContainsGroup(string file)
    {
        using (var reader = new StreamReader(file))
        {
            var hasAction = false;
            var hasInput = false;
            var hasResult = false;
            while (!reader.EndOfStream)
            {
                var line = reader.ReadLine();
                if (!hasAction)
                {
                    if (line.StartsWith("ACTION:"))
                        hasAction = true;
                }
                else if (!hasInput)
                {
                    if (line.StartsWith("INPUT:"))
                        hasInput = true;
                }
                else if (!hasResult)
                {
                    if (line.StartsWith("RESULT:"))
                        hasResult = true;
                }

                if (hasAction && hasInput && hasResult)
                    return true;
            }
            return false;
        }
    }

This code checks if there is a line starting with ACTION, then one with INPUT, and then one with RESULT. If the order of those is not important, you can omit the if () else if () checks. And if the markers can appear anywhere in a line rather than at the start, replace StartsWith with Contains.
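For illustration, here is roughly what the order-independent variant might look like, sketched in Java (the class name, the `containsGroup` helper, and the `Reader` parameter are just for this sketch; the marker strings are the same as above):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class GroupCheck {
    // Order-independent variant: set each flag whenever its marker
    // appears, no matter which marker comes first.
    public static boolean containsGroup(Reader source) throws IOException {
        boolean hasAction = false, hasInput = false, hasResult = false;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("ACTION:")) hasAction = true;
                if (line.startsWith("INPUT:")) hasInput = true;
                if (line.startsWith("RESULT:")) hasResult = true;
                // Quit as soon as all three have been seen.
                if (hasAction && hasInput && hasResult) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // The markers appear out of order, but the group is still found.
        System.out.println(containsGroup(
            new StringReader("INPUT: a\nRESULT: b\nACTION: c"))); // prints "true"
    }
}
```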

Fastest way to search string in large text file

The standard way to do this is to implement the Aho-Corasick algorithm. It reads the file one time and finds all occurrences of all the strings you give it. See https://www.informit.com/guides/content.aspx?g=dotnet&seqNum=869 for an article that provides an implementation and some examples.
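In case the link goes stale, the core of Aho-Corasick is small enough to sketch. Here is a minimal, hedged Java version (the class and method names are my own, not from the linked article): a trie of the search terms, failure links built breadth-first, then a single left-to-right scan of the text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AhoCorasick {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;
        final List<String> out = new ArrayList<>();
    }

    private final Node root = new Node();

    // Add one search term to the trie.
    public void add(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.out.add(word);
    }

    // Build the failure links breadth-first (the "failure function").
    public void build() {
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.poll();
            for (Map.Entry<Character, Node> e : current.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = current.fail;
                while (f != null && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(c);
                // Inherit outputs so overlapping terms are reported too.
                child.out.addAll(child.fail.out);
                queue.add(child);
            }
        }
    }

    // One left-to-right pass over the text; reports every term that
    // ends at each position.
    public List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        Node n = root;
        for (char c : text.toCharArray()) {
            while (n != root && !n.next.containsKey(c)) {
                n = n.fail;
            }
            n = n.next.getOrDefault(c, root);
            hits.addAll(n.out);
        }
        return hits;
    }

    public static void main(String[] args) {
        AhoCorasick matcher = new AhoCorasick();
        matcher.add("he");
        matcher.add("she");
        matcher.add("his");
        matcher.add("hers");
        matcher.build();
        System.out.println(matcher.search("ushers")); // prints "[she, he, hers]"
    }
}
```

A production version would use a denser node representation than a HashMap per node, but the structure is the same.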

Update after more info

Assuming that the list of numbers in your file A is small enough to fit in memory, here's what you'd do, using the implementation in the above-linked article:

    // Construct the automaton
    AhoCorasickStringSearcher matcher = new AhoCorasickStringSearcher();
    foreach (var searchWord in File.ReadLines(File_a))
    {
        matcher.AddItem(searchWord);
    }
    matcher.CreateFailureFunction();

    // And then do the search on each file
    foreach (var fileName in listOfFiles)
    {
        foreach (var line in File.ReadLines(fileName))
        {
            var matches = matcher.Search(line);
            foreach (var m in matches)
            {
                // output match
            }
        }
    }

Note that it only makes one pass through each file, and it never has to load more than one line of the file into memory at any time. The limiting factor here is the memory it takes to build the automaton.

I've used this to search files that totaled over 100 gigabytes, for about 15 million different strings. It takes a few minutes to build the automaton, but then it searches very quickly. One really nice property of the algorithm is that its complexity is O(n + m), where n is the size of the input files, and m is the number of matched items. The number of strings it's searching for doesn't matter. It can search for a million different strings just as quickly as it can search for one or two.

100 gigabytes will take you ... something on the order of about 40 minutes to read. I'd be really surprised if it took an hour for this to find all occurrences of 15 million different strings in 100 gigabytes of data.
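As a sanity check on that estimate, 100 GB in roughly 40 minutes works out to a sustained read rate on the order of 40 MB/s:

```java
public class Throughput {
    public static void main(String[] args) {
        long bytes = 100_000_000_000L; // 100 GB
        long seconds = 40L * 60L;      // 40 minutes
        System.out.println(bytes / seconds / 1_000_000 + " MB/s"); // prints "41 MB/s"
    }
}
```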

Matching whole words

Another option, if you're searching for whole words, is to ditch the Aho-Corasick algorithm. Instead, load all of the numbers you're looking for into a HashSet<string>. Then read each line, use a regular expression to find all of the numbers in the line, and check whether each one exists in the hash set. For example:

    Regex re = new Regex(@"\w+");
    foreach (var line in File.ReadLines(fileName))
    {
        var matches = re.Matches(line);
        foreach (Match m in matches)
        {
            if (hashSetOfValues.Contains(m.Value))
            {
                // output match
            }
        }
    }

This will likely be somewhat slower than the Aho-Corasick algorithm, but it still makes only one pass through the data. This assumes, of course, that you have enough memory to hold all of those numbers in a hash set.

There are other options for whole words, as I mention in the comments.

Another option, if you know that the words you're looking for are always separated by spaces, is to add spaces to the start and end of the words that you add to the automaton. Or, with some modification to the implementation itself, you could force the matcher's Search method to return only matches that occur as whole words. That would also handle matches at the start and end of lines, and other non-word separator characters, more gracefully.
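The space-padding trick can be sketched in a few lines. Here plain substring checks stand in for the automaton, and the numbers are made up; the point is that padding both the terms and the line makes only whole, space-delimited words match:

```java
public class PaddingDemo {
    public static void main(String[] args) {
        String term = "1234";
        String padded = " " + term + " ";     // term wrapped in spaces
        String line = "1234 56789 91234";
        String paddedLine = " " + line + " "; // pad the line too, so words at
                                              // either end can still match
        System.out.println(paddedLine.contains(padded));                // prints "true"
        System.out.println((" " + "91234 56" + " ").contains(padded));  // prints "false"
    }
}
```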

Fastest way to find text in large online text file

Below is the fastest method I can think of. That's not to say you can't shave a few ticks here and there... Happy for other feedback to see if we can streamline this any more.

    // WebClient is very bulky with a lot of stuff we don't need.
    // By dealing with the request, response and stream ourselves
    // we can get the string a bit faster.
    WebRequest request = WebRequest.Create("http://www.UrlToDownloadStringFrom.com");
    string result;
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader streamReader = new StreamReader(stream))
    {
        // the result should be "firstWord~:::~secondWord" as expected
        result = streamReader.ReadToEnd();
    }

    // split the string apart wherever ~:::~ appears within it
    string[] resultSplit = result.Split(new string[] { "~:::~" }, StringSplitOptions.None);

    // resultSplit[0] is firstWord, resultSplit[1] is secondWord
    string secondWord = resultSplit[1];

Fastest way to find a string in a text file with java

Have a look at the Scanner class that ships with the JDK (see the official documentation). It lets you skip certain parts of the input (in this case, the text file) and match against a regular expression of your choosing. I'm not sure this is the most efficient way, but it certainly is simple. You might also take a look at this example, which will help you get started.
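As a minimal sketch of that approach (the sample text and search term are made up; note that Scanner also works over a plain String, which is handy for testing, while for a file you would pass a java.io.File instead):

```java
import java.util.Scanner;

public class FindInText {
    // Return the 1-based line number of the first line containing
    // target, or -1 if it never appears.
    public static int findLine(Scanner scanner, String target) {
        int lineNo = 0;
        while (scanner.hasNextLine()) {
            lineNo++;
            if (scanner.nextLine().contains(target)) {
                return lineNo;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Over a file you would use: new Scanner(new java.io.File("input.txt"))
        Scanner s = new Scanner("alpha\nbravo charlie\ndelta");
        System.out.println(findLine(s, "charlie")); // prints "2"
    }
}
```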
