Extremely Large Single-Line File Parse

That's an extremely inefficient way to read a text file, let alone a large one. If you only need one pass, replacing or adding individual characters, you should use a StreamReader. If you only need one character of lookahead, you only need to maintain a single intermediate state, something like:

enum ReadState
{
    Start,
    SawOpen
}

using (var sr = new StreamReader(@"path\to\clinic.txt"))
using (var sw = new StreamWriter(@"path\to\output.txt"))
{
    var rs = ReadState.Start;
    while (true)
    {
        var r = sr.Read();           // read one character; -1 means end of file
        if (r < 0)
        {
            if (rs == ReadState.SawOpen)
                sw.Write('<');       // flush a pending '<' before finishing
            break;
        }

        char c = (char) r;
        if ((c == '\r') || (c == '\n'))
            continue;                // drop any existing line breaks

        if (rs == ReadState.SawOpen)
        {
            if (c == 'C')
                sw.WriteLine();      // start a new output line before "<C..." elements

            sw.Write('<');           // emit the '<' we held back
            rs = ReadState.Start;
        }

        if (c == '<')
        {
            rs = ReadState.SawOpen;  // hold the '<' until we can see the next character
            continue;
        }

        sw.Write(c);
    }
}

How to find a pattern and surrounding content in a very large SINGLE line file?

You could try the -o option:

-o, --only-matching
Show only the part of a matching line that matches PATTERN.

and use a regular expression to match your pattern and the 3 preceding/following characters, i.e.

grep -o -P ".{3}pattern.{3}" very_large_file 

In the example you gave, it would be

echo "1234567890abcdefghijklmnopqrstuvwxyz" > tmp.txt
grep -o -P ".{3}890abc.{3}" tmp.txt
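
With -P (Perl-compatible regexes) available, that should print 567890abcdef: the match plus the three characters on either side of it.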

Reading Very Large One Liner Text File

I used the gmpy2 module to convert the string to a number.

import time
import gmpy2

start = time.perf_counter()   # time.clock() was removed in Python 3.8; perf_counter() is the modern equivalent
z = open('Number.txt', 'r+')
data = z.read()               # the whole single-line file as one big digit string
global a
a = gmpy2.mpz(data)           # convert the string to an arbitrary-precision integer
end = time.perf_counter()
secs = end - start
print("Number read in", "%s" % (secs), "seconds.", file=f)   # f is a log file opened earlier in the original script
print("Number read in", "%s" % (secs), "seconds.")
f.flush()
del end, secs, start, z, data

It ran in 3 seconds, which is much slower, but at least it gave me an integer value.

Thank you all for your invaluable answers; however, I'm going to mark this one as accepted as soon as possible.

Large one-line XML file parsing: most efficient approach?

Since you have what are basically document fragments rather than normal documents, you could use the underlying XmlReader classes to process them:

// just a test string... XmlTextReader can take a Stream as first argument instead
var elements = @"<E2ETraceEvent/><E2ETraceEvent/>";

using (var reader = new XmlTextReader(elements, XmlNodeType.Element, null))
{
    while (reader.Read())
    {
        Console.WriteLine(reader.Name);
    }
}

This will read the XML file one element at a time, and won't keep the whole document in memory. Whatever you do in the read loop is specific to your use case :)

Parse a very large text file with Python?

It could be that the extra lines that are not being filtered out start with whitespace other than a plain space character, such as a tab. As a minimal change that might work, try filtering out lines that start with any whitespace rather than specifically a space character.

To check for whitespace in general rather than a space char, you'll need to use regular expressions. Try if not re.match(r'^\s', line) and ...
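
As a minimal sketch of that change (assuming a hypothetical input file huge_input.txt, with passes_other_filters() standing in for the rest of your existing condition):

import re

def passes_other_filters(line):
    # placeholder for whatever other checks your real filter applies
    return True

kept = []
with open('huge_input.txt') as fh:   # hypothetical file name
    for line in fh:
        # re.match only looks at the start of the string, so r'^\s' means
        # "begins with any whitespace character", not just a literal space
        if not re.match(r'^\s', line) and passes_other_filters(line):
            kept.append(line)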

Efficient way to parse through huge file

The biggest gain is likely to come from calling split only once per line:

size, path = line.strip().split("\t")
# or ...split("\t", 3)[0:2] if there are extra fields to ignore
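
For example, with a hypothetical tab-separated record:

size, path = "4096\t/var/log/syslog\n".strip().split("\t")
# size == "4096", path == "/var/log/syslog"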

You can at least simplify your code by treating the input file as an iterator and using the csv module. This might give you a speed-up as well, as it eliminates the need for an explicit call to split:

import csv
import os

# filepath and streamed_file come from the surrounding code in the question
with open(filepath, "r") as open_file:
    reader = csv.reader(open_file, delimiter="\t")
    writer = csv.writer(streamed_file)
    for size, path in reader:
        is_dir = os.path.isdir(path)
        writer.writerow([is_dir, size, path])
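
One caveat on the writer side: if streamed_file is a regular file object you open yourself, the csv module recommends opening it with newline='' so the writer doesn't insert extra blank lines on Windows.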

How to read large files (a single continuous string) in Java?

So first and foremost, based on comments on your question, as Joachim Sauer stated:

If there are no newlines, then there is only a single line and thus only one line number.

So your use case is faulty, at best.

Let's move past that and assume there may be newline characters, or better yet, assume that the . character you're splitting on is intended as a pseudo-newline replacement.

Scanner is not a bad approach here, though there are others. Since you provided a Scanner, let's continue with that, but you want to make sure you're wrapping it around a BufferedReader. You clearly don't have a lot of memory, and a BufferedReader lets you read the file in buffered chunks while still using the Scanner's functionality; as the caller, the buffering is completely transparent to you:

Scanner sc = new Scanner(new BufferedReader(new FileReader(new File("a.txt")), 10*1024));

What this is basically doing is letting the Scanner function as you expect, while the BufferedReader buffers 10K characters at a time (the size passed to its constructor), keeping your memory footprint small. Now, you just keep calling:

sc.useDelimiter("\\.");
for (int i = 0; sc.hasNext(); i++) {
    String pseudoLine = sc.next();
    // store line 'i' in your database for this pseudo-line
    // DO NOT store pseudoLine anywhere else - you don't have memory for it
}

Since you don't have enough memory, the point to state (and restate) is: don't keep any part of the file in your JVM's heap space after reading it. Read it, use it however you need to, and let it become eligible for JVM garbage collection. In your case, you mention you want to store the pseudo-lines in a database, so you read a pseudo-line, store it in the database, and simply discard it.

There are other things to point out here, such as configuring your JVM arguments, but I hesitate to even mention that, because simply setting your JVM's maximum heap size higher is another brute-force approach. There's nothing wrong with raising the max heap size, but learning memory management is better if you're still learning how to write software. You'll run into less trouble later when you move into professional development.

Also, I mentioned Scanner and BufferedReader because you mentioned them in your question, but I think checking out java.nio.file.Files.lines(), as pointed out by deHaar, is also a good idea. It basically does the same thing as the code I've explicitly laid out, with the caveat that it still only gives you one line at a time, without the ability to change what you're 'splitting' on. So if your text file has one single line in it, this will still cause you a problem and you will still need something like a Scanner to fragment the line out.

Parsing huge logfiles in Node.js - read in line-by-line

I searched for a solution to parse very large files (GBs) line by line using a stream. All the third-party libraries and examples I found did not suit my needs, since they either did not process the file line by line or read the entire file into memory.

The following solution can parse very large files, line by line, using stream & pipe. For testing I used a 2.1 GB file with 17,000,000 records. RAM usage did not exceed 60 MB.

First, install the event-stream package:

npm install event-stream

Then:

var fs = require('fs')
    , es = require('event-stream');

var lineNr = 0;

var s = fs.createReadStream('very-large-file.csv')
    .pipe(es.split())
    .pipe(es.mapSync(function(line){

        // pause the readstream
        s.pause();

        lineNr += 1;

        // process line here and call s.resume() when rdy
        // function below was for logging memory usage
        logMemoryUsage(lineNr);

        // resume the readstream, possibly from a callback
        s.resume();
    })
    .on('error', function(err){
        console.log('Error while reading file.', err);
    })
    .on('end', function(){
        console.log('Read entire file.')
    })
);

Please let me know how it goes!