Regexp Search Through a Very Large File

Regex search pattern in very large file

I think the solution for you would be to implement CharSequence as a wrapper over very large text files.

Why? Because Pattern.matcher(), which builds a Matcher from a Pattern, takes a CharSequence as its argument.

Of course, easier said than done... But then you only have three methods to implement (length(), charAt() and subSequence()), so that shouldn't be too hard...


EDIT I took the plunge and I ate my own dog's food. The "worst part" is that it actually works!
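
The same idea (hand the regex engine a file-backed sequence instead of an in-memory string) carries over to other languages. As a rough illustration in Python rather than Java, and so not the CharSequence implementation this answer refers to, a memory-mapped file can be passed straight to the re module. The function name find_all is made up for this sketch, and the pattern has to be a bytes pattern because the mapping exposes raw bytes:

```python
import mmap
import re

def find_all(path, pattern):
    """Yield (offset, matched bytes) for every match in a memory-mapped file."""
    regex = re.compile(pattern)  # must be a bytes pattern, e.g. rb"\d+"
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The OS pages the file in on demand, so it is never read in one go.
            for m in regex.finditer(mm):
                yield m.start(), m.group()
```

Usage would look like: for offset, value in find_all("big.log", rb"\d+"): ...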

Regexp search through a very large file

  1. Traverse the file in chunks instead of line by line, where chunks are delimited by occurrences of a frequently occurring character or pattern, say "X".
  2. "X" must be something that never appears inside a match of your regex, i.e. a place where your regex can never match the string.
  3. Match your regex against the current chunk, extract the matches, and proceed to the next chunk.

Example:

This is string with multline numbers -2000
2223434
34356666
444564646
. These numbers can occur at 34345
567567 places, and on 67
87878 pages . The problem is to find a good
way to extract these more than 100
0 regexes without memory hogging.

In this text, assume the desired pattern is numeric strings, e.g. /\d+/s (match digits, multiline).
Then, instead of loading and processing the whole file, you can choose a chunk-delimiting pattern, say the full stop (".") in this case, and read and process only up to that delimiter before moving on to the next chunk.

CHUNK#1:

This is string with multline numbers -2000
2223434
34356666
444564646
.

CHUNK#2:

These numbers can occur at 34345
567567 places, and on 67
87878 pages .

and so on.
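
The chunking steps above can be sketched in Python roughly as follows. This is only an illustration under the assumption stated in step 2 (the delimiter never occurs inside a match). The names iter_chunks and find_in_chunks and the block size are made up for the example:

```python
import re

def iter_chunks(f, delimiter=".", block_size=64 * 1024):
    """Yield pieces of the text file f, each ending with delimiter
    (the final piece may lack it)."""
    pending = ""
    while True:
        block = f.read(block_size)
        if not block:
            break
        pending += block
        # Everything before the last delimiter is complete; the tail is
        # kept and joined with the next block.
        *complete, pending = pending.split(delimiter)
        for chunk in complete:
            yield chunk + delimiter
    if pending:
        yield pending

def find_in_chunks(f, pattern, delimiter="."):
    """Yield every regex match, searching one chunk at a time."""
    regex = re.compile(pattern)
    for chunk in iter_chunks(f, delimiter):
        for m in regex.finditer(chunk):
            yield m.group()
```

Memory use is bounded by the longest chunk plus one block, so this only helps if the delimiter really does occur often.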

EDIT:
Adding @Ranty's suggestion from the comments as well:

Or simply read by some amount of lines, say 20. When you find the
match within, clear up to the match end and append another 20 lines.
No need for figuring frequently occurring 'X'.
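
A possible reading of that suggestion, sketched in Python. The function name is invented, and one detail is an addition not found in the comment: a match that touches the end of the buffer is held back until more lines arrive, since it might continue in the next batch. Note that the buffer keeps growing as long as no match is found, so this suits files where matches are reasonably frequent.

```python
import re
from itertools import islice

def find_with_line_window(f, pattern, lines_per_read=20):
    """Yield regex matches from text file f, reading a few lines at a time."""
    regex = re.compile(pattern)
    buffer = ""
    eof = False
    while not eof:
        new_lines = list(islice(f, lines_per_read))
        eof = not new_lines
        buffer += "".join(new_lines)
        consumed = 0
        for m in regex.finditer(buffer):
            # A match ending exactly at the buffer end may be incomplete,
            # so wait for more text unless the file is exhausted.
            if m.end() == len(buffer) and not eof:
                break
            yield m.group()
            consumed = m.end()
        # Clear everything up to the end of the last reported match.
        buffer = buffer[consumed:]
```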

How can I find and replace text in a larger file (150MB-250MB) with regular expressions in C#?

If you can load the whole string data into a single string variable, there is no need to first match and then append text to matches in a loop. You can use a single Regex.Replace operation:

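// Requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
string text = File.ReadAllText(srcFile);   // loads the whole file into memory
// 5242880 bytes = 5 MB write buffer
using (StreamWriter sw = new StreamWriter(destfile, false, Encoding.UTF8, 5242880))
{
    // "$&" inserts the whole match; two form feeds are appended after it
    sw.Write(myregex.Replace(text, "$&\f\f"));
}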

Details:

  • string text = File.ReadAllText(srcFile); - reads the whole srcFile file into the text variable (calling this variable match would be confusing)
  • myregex.Replace(text, "$&\f\f") - replaces every myregex match with itself ($& in the replacement string stands for the whole match value), followed by two \f chars appended right after each match.

Is there a fast way to parse through a large file with regex?

Memory-mapped files (MMF) and the Task Parallel Library (TPL) can help here.

  1. Create a persisted MMF with multiple random-access views. Each view corresponds to a particular part of the file.
  2. Define a parsing method with a parameter such as IEnumerable<string>, which abstracts a set of not-yet-parsed lines.
  3. Create and start one TPL task per MMF view, with Parse(IEnumerable<string>) as the task action.
  4. Each worker task adds its parsed data to a shared queue of type BlockingCollection.
  5. Another task listens to the BlockingCollection (via GetConsumingEnumerable()) and processes the data already parsed by the worker tasks.

See Pipelines pattern on MSDN

Note that this solution requires .NET Framework 4 or later.
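
To show the shape of that pipeline, here is a rough Python analogue; it is not .NET code. It swaps in Python's mmap for the MMF views, threading.Thread for the TPL tasks, and queue.Queue for the BlockingCollection. All function names are invented for the sketch. In CPython, threads running regex searches will not actually run in parallel, so this only illustrates the structure (line-aligned ranges, workers feeding a shared queue, one consumer).

```python
import mmap
import queue
import re
import threading

def line_aligned_ranges(mm, parts):
    """Split the mapping into at most `parts` byte ranges, each ending after a newline."""
    size = len(mm)
    step = max(1, -(-size // parts))             # ceiling division
    ranges, start = [], 0
    while start < size:
        nl = mm.find(b"\n", start + step - 1)    # extend to the end of the current line
        end = size if nl == -1 else nl + 1
        ranges.append((start, end))
        start = end
    return ranges

def worker(mm, start, end, regex, out):
    """Scan one range of the mapping and put each match on the shared queue."""
    for m in regex.finditer(mm, start, end):
        out.put(m.group())

def consumer(out, results):
    """Drain the queue until the None sentinel arrives."""
    while (item := out.get()) is not None:
        results.append(item)

def parallel_search(path, pattern, flags=0, workers=4):
    regex = re.compile(pattern, flags)           # bytes pattern: the mapping exposes bytes
    out, results = queue.Queue(), []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        sink = threading.Thread(target=consumer, args=(out, results))
        sink.start()
        tasks = [threading.Thread(target=worker, args=(mm, s, e, regex, out))
                 for s, e in line_aligned_ranges(mm, workers)]
        for t in tasks:
            t.start()
        for t in tasks:
            t.join()
        out.put(None)                            # signal that no more data is coming
        sink.join()
    return results
```

Results arrive in whatever order the workers produce them, so sort them, or tag each with its offset, if order matters.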

superfast regexmatch in large text file

Check if this matches your requirement:

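# Python 3; the file is read lazily, one line at a time
with open('largetextfile.txt') as f:
    for line in f:
        # a plain prefix test is cheaper than a regex here
        if line.startswith('1234567'):
            print(line, end='')  # line already ends with a newline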

