Read Large Files in Java

Read large files in Java

First, if your file contains binary data, then using BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it's text data and you need to split it along line breaks, then using BufferedReader is fine (assuming the file contains lines of a sensible length).

Regarding memory, there shouldn't be any problem if you use a decently sized buffer (I'd use at least 1MB to make sure the HD is doing mostly sequential reading and writing).
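For illustration, here is a minimal sketch of that idea for the binary case: a buffered copy loop with an explicit 1 MB buffer. The file names and chunk size are placeholders, not values from the question.

    // imports needed: java.io.*
    try (BufferedInputStream in = new BufferedInputStream(new FileInputStream("input.bin"), 1024 * 1024);
         BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream("output.bin"), 1024 * 1024)) {
        byte[] chunk = new byte[64 * 1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);   // move the bytes along without ever decoding them to a String
        }
    }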

If speed turns out to be a problem, you could have a look at the java.nio packages - those are supposedly faster than java.io.

Java Read Large Text File With 70 Million Lines of Text

1) I am sure there is no difference speed-wise; both use a FileInputStream internally with buffering

2) You can take measurements and see for yourself

3) Though there's no performance benefit, I like the Java 7 approach

    try (BufferedReader br = Files.newBufferedReader(Paths.get("test.txt"), StandardCharsets.UTF_8)) {
        for (String line; (line = br.readLine()) != null; ) {
            // process the line
        }
    }

4) Scanner based version

    try (Scanner sc = new Scanner(new File("test.txt"), "UTF-8")) {
        while (sc.hasNextLine()) {
            String line = sc.nextLine();
            // process the line
        }
        // note that Scanner suppresses exceptions
        if (sc.ioException() != null) {
            throw sc.ioException();
        }
    }

5) This may be faster than the rest

    try (SeekableByteChannel ch = Files.newByteChannel(Paths.get("test.txt"))) {
        ByteBuffer bb = ByteBuffer.allocateDirect(1000);
        StringBuilder line = new StringBuilder();
        for (;;) {
            int n = ch.read(bb);
            if (n == -1) {
                break;        // end of file
            }
            bb.flip();
            // decode the bytes, add chars to line, emit a String at each line break
            // ...
            bb.clear();       // make the buffer ready for the next read
        }
    }

It requires a bit of coding, but it can be really fast because of ByteBuffer.allocateDirect: it allows the OS to read bytes from the file into the ByteBuffer directly, without an extra copy through a heap array.

6) Parallel processing would definitely increase speed: make a big byte buffer, run several tasks that read bytes from the file into that buffer in parallel; when a chunk is ready, find the first end of line, make a String, find the next, and so on (a rough sketch follows).
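One hedged way to sketch that idea is to split the file into byte ranges and let each task do a positioned read of its own region. The path, pool size, and the line-reassembly step are assumptions added here, not part of the original answer.

    // imports needed: java.io.*, java.nio.*, java.nio.channels.*, java.nio.file.*, java.util.concurrent.*
    Path path = Paths.get("test.txt");
    long size = Files.size(path);
    int tasks = Runtime.getRuntime().availableProcessors();
    long chunk = (size + tasks - 1) / tasks;              // bytes per task, rounded up
    ExecutorService pool = Executors.newFixedThreadPool(tasks);
    for (int i = 0; i < tasks; i++) {
        final long start = i * chunk;
        final long end = Math.min(start + chunk, size);
        pool.submit(() -> {
            try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
                ByteBuffer bb = ByteBuffer.allocate((int) (end - start));   // assumes each chunk fits in an int
                ch.read(bb, start);                       // positioned read: this task touches only its own region
                bb.flip();
                // find line boundaries inside bb, stitch lines that cross chunk borders, hand them off
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
    pool.shutdown();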

How to read large files (a single continuous string) in Java?

So first and foremost, based on comments on your question, as Joachim Sauer stated:

If there are no newlines, then there is only a single line and thus only one line number.

So your usecase is faulty, at best.

Let's move past that, and assume maybe there are newline characters - or better yet, assume that the . character you're splitting on is intended to be a newline pseudo-replacement.

Scanner is not a bad approach here, though there are others. Since you provided a Scanner, let's continue with that, but you want to make sure you're wrapping it around a BufferedReader. You clearly don't have a lot of memory, and a BufferedReader lets you read 'chunks' of the file while keeping the buffering completely transparent to you as a caller of the Scanner:

Scanner sc = new Scanner(new BufferedReader(new FileReader(new File("a.txt")), 10*1024));

What this is basically doing is letting the Scanner function as you expect, while buffering 10 * 1024 characters at a time (the second constructor argument), keeping your memory footprint small. Now, you just keep calling

    sc.useDelimiter("\\.");
    for (int i = 0; sc.hasNext(); i++) {
        String pseudoLine = sc.next();
        // store line 'i' in your database for this pseudo-line
        // DO NOT store pseudoLine anywhere else - you don't have memory for it
    }

Since you don't have enough memory, the point to reiterate is: don't store any part of the file within your JVM's heap space after reading it. Read it, use it how you need it, and allow it to be marked for JVM garbage collection. In your case, you mention you want to store the pseudo-lines in a database, so you read a pseudo-line, store it in the database, and just discard it.

There are other things to point out here, such as configuring your JVM arguments, but I hesitate to even mention it because just setting your JVM memory high is a bad idea too - another brute-force approach. There's nothing wrong with setting your JVM's maximum heap size higher, but learning memory management is better if you're still learning how to write software. You'll get into less trouble later when you get into professional development.

Also, I mentioned Scanner and BufferedReader because you mentioned them in your question, but I think checking out java.nio.file.Files.lines(...) as pointed out by deHaar is also a good idea. It basically does the same thing as the code I've explicitly laid out, with the caveat that it still only does one line at a time without the ability to change what you're 'splitting' on. So if your text file has one single line in it, this will still cause you a problem and you will still need something like a Scanner to fragment the line.
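For reference, a minimal sketch of that approach (Java 8+; the file name is a placeholder):

    // imports needed: java.nio.charset.StandardCharsets, java.nio.file.*, java.util.stream.Stream
    try (Stream<String> lines = Files.lines(Paths.get("a.txt"), StandardCharsets.UTF_8)) {
        lines.forEach(line -> {
            // store the line in your database, then let it become eligible for garbage collection
        });
    }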

Java read huge file ( ~100GB ) efficiently

If this is a binary file, then reading in "lines" does not make a lot of sense.

If the file is really binary, then use a BufferedInputStream and read bytes one at a time into a byte[]. When you get to the byte that marks your end of "line", add the byte[] and the count of bytes in the line to a queue for your worker threads to process (a sketch follows the tips below).

And repeat.

Tips:

  • Use a bounded buffer in case you can read lines faster than you can process them.
  • Recycle the byte[] objects to reduce garbage generation.
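Here is a hedged sketch of that producer side, assuming '\n' marks the end of a "line" and that worker threads drain the queue elsewhere; the file name and queue capacity are placeholders. The bounded queue makes the reader block when it gets too far ahead of the workers.

    // imports needed: java.io.*, java.util.concurrent.*
    BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);    // bounded: the reader blocks when it is full
    try (BufferedInputStream in = new BufferedInputStream(new FileInputStream("huge.bin"))) {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {                           // end-of-"line" marker (an assumption)
                queue.put(line.toByteArray());         // hand the line to the workers; blocks if they lag behind
                line.reset();
            } else {
                line.write(b);
            }
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();            // restore the interrupt flag; IOExceptions propagate to the caller
    }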

If the file is (really) text, then you could use BufferedReader and the readLine() method instead of calling read().


The above will give you reasonable performance. Depending on how much work has to be done to process each line, it may be good enough that there is no point optimizing the file reading. You can check this by profiling.

If your profiling tells you that reading is the bottleneck, then consider using NIO with ByteBuffer or CharBuffer. It is more complicated but potentially faster than read() or readLine().


Does reading in chunks work?

BufferedReader or BufferedInputStream both read in chunks, under the covers.

What will be the optimum buffer size?

The exact buffer size is probably not that important. I'd make it a few KB or tens of KB.

Any formula for that?

No, there isn't a formula for an optimum buffer size. It will depend on variables that you can't easily quantify.

Reading very large text files in java

InputStreamReader is a facility to convert a raw InputStream (a stream of bytes) into a stream of characters, according to some charset. FileInputStream is a stream of bytes (it extends InputStream) from a given file. You can use InputStreamReader to read text from a socket as well, for instance, since socket.getInputStream() also gives an InputStream.

InputStreamReader is a Reader, the abstract class for a stream of characters. Using an InputStreamReader alone would be inefficient, as each read would actually hit the file. When you decorate it with a BufferedReader, it reads a chunk of bytes, keeps it in memory, and serves subsequent reads from that buffer.
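A minimal sketch of that decoration chain follows; the file name and buffer size are assumptions, not values from the question.

    // imports needed: java.io.*, java.nio.charset.StandardCharsets
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream("big.txt"), StandardCharsets.UTF_8),
            64 * 1024)) {                              // 64 KB char buffer instead of the 8 KB default
        String line;
        while ((line = br.readLine()) != null) {
            // process one line at a time; only the current line is held on the heap
        }
    }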

About the size: the documentation does not state the default value:

https://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html

The buffer size may be specified, or the default size may be used. The
default is large enough for most purposes.

You must check the source file to find the value.

https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/io/BufferedReader.java

This is the implementation in the OpenJDK:

 private static int defaultCharBufferSize = 8192;

Oracle's closed-source JDK implementation may be different.

Modify content of large file

TL;DR

Do not read and write the same file concurrently.

The issue

Your code starts reading, and then immediately truncates the file it is reading.

    reader = new BufferedReader(new FileReader(firstFile));
    writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
    writer = new BufferedWriter(writerFile);

The first line opens a read handle to the file.
The second line opens a write handle to the same file.
It is not very obvious from the documentation of the FileWriter constructors, but when you use a constructor that does not let you specify the append parameter, it defaults to false, meaning you immediately truncate the file if it already exists.

At this point (line 2) you have just erased the file you were about to read. So you end up with an empty file.
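To make the difference concrete, here are the two constructor forms side by side (using the same path expression as the snippet above):

    writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);        // no append flag: truncates an existing file right away
    writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName, true);  // append = true: keeps the existing contents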

What about using append=true

Well, then the file is not erased when the writer is created, which is "good". So your program starts reading the first line, and outputs (to the same file) the filtered version.

So each time a line is read, another is appended.

No wonder your program never reaches the end of the file: each time it advances a line, it appends another line to process. Generally speaking, you'll never reach the end of the file (of course, if the file is a single line to begin with, you might, but that's a corner case).

The solution

Write to a temporary file, and IF (and only IF) you succeed, then swap the files if you really need to.

An advantage of this solution: if for whatever reason your process crashes, you'll have the original file untouched and you can retry later, which is usually a good thing. Your process is "repeatable".

A disadvantage: you'll need twice the space at some point (although you could compress the temp file to reduce that factor).
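A hedged sketch of that write-to-temp-then-swap approach, using try-with-resources; the source path and the filter(...) step are placeholders standing in for the original program's names:

    // imports needed: java.io.*, java.nio.file.*
    Path source = Paths.get("C:/sqlite/db/tables/" + fileName);
    Path temp = Files.createTempFile(source.getParent(), "filtered-", ".tmp");
    try (BufferedReader reader = Files.newBufferedReader(source);
         BufferedWriter writer = Files.newBufferedWriter(temp)) {
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(filter(line));       // filter(...) is a placeholder for your actual modification
            writer.newLine();
        }
    }
    // Only replace the original once the whole rewrite has succeeded.
    Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);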

About out of memory issues

When working with arbitrarily large files, the path you chose (using buffered readers and writers) is the right one, because you only use one line-worth of memory at a time.

Therefore it generally avoids memory usage issues (unless of course, you have a file without line breaks, in which case it makes no difference at all).

Other solutions, involving reading the whole file at once, then performing the search/replace in memory, then writing the contents back do not scale that well, so it's good you avoided this kind of computation.

Not related but important

Check out the try-with-resources syntax to properly close your resources (reader / writer). Here you forgot to close the reader, and you are not closing the writer appropriately anyway (that is, in a finally clause).

Another thing: I'm pretty sure no Java program written by a mere mortal will beat tools like sed or awk, which are available on most Unix platforms (and then some). Maybe you'd want to check whether rolling your own in Java is worth it when the task is a shell one-liner.


