Java Read Large Text File with 70 Million Lines of Text

1) I am fairly sure there is no difference in speed: both approaches use FileInputStream internally, and both buffer.

2) You can take measurements and see for yourself

3) Though there is no performance benefit, I like the Java 1.7 approach

try (BufferedReader br = Files.newBufferedReader(Paths.get("test.txt"), StandardCharsets.UTF_8)) {
    for (String line = null; (line = br.readLine()) != null;) {
        // process the line here
    }
}

4) Scanner based version

try (Scanner sc = new Scanner(new File("test.txt"), "UTF-8")) {
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
}

5) This may be faster than the rest

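try (SeekableByteChannel ch = Files.newByteChannel(Paths.get("test.txt"))) {
    ByteBuffer bb = ByteBuffer.allocateDirect(1000);
    // kept outside the loop because a line can span two reads
    StringBuilder line = new StringBuilder();
    while (ch.read(bb) != -1) { // -1 signals end of file
        bb.flip(); // switch the buffer from filling to draining
        // add chars to line
        // ...
        bb.clear(); // make room for the next read
    }
}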

It requires a bit of coding, but it can be noticeably faster because of ByteBuffer.allocateDirect, which lets the OS read bytes from the file into the ByteBuffer directly, without an intermediate copy.
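To make the "bit of coding" concrete, here is a minimal sketch of how the channel loop could be completed. The class name ChannelLineReader, the method forEachLine, and the 64 KB buffer size are my own choices for illustration, and the sketch assumes UTF-8 text with "\n" or "\r\n" line endings. It has not been benchmarked against the other approaches.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical helper, not part of the original answer.
class ChannelLineReader {

    // Reads the file through a direct ByteBuffer and passes each line to the consumer.
    static void forEachLine(Path path, Consumer<String> consumer) throws IOException {
        try (SeekableByteChannel ch = Files.newByteChannel(path)) {
            ByteBuffer bb = ByteBuffer.allocateDirect(64 * 1024); // assumed size
            // Collect raw bytes per line, so a multi-byte character that is
            // split across two reads is still decoded correctly.
            ByteArrayOutputStream lineBytes = new ByteArrayOutputStream();
            while (ch.read(bb) != -1) {
                bb.flip();
                while (bb.hasRemaining()) {
                    byte b = bb.get();
                    if (b == '\n') {
                        consumer.accept(decode(lineBytes));
                        lineBytes.reset();
                    } else {
                        lineBytes.write(b);
                    }
                }
                bb.clear();
            }
            // Last line when the file does not end with a newline.
            if (lineBytes.size() > 0) {
                consumer.accept(decode(lineBytes));
            }
        }
    }

    // Decodes one line as UTF-8 and drops a trailing '\r' from "\r\n" endings.
    private static String decode(ByteArrayOutputStream bytes) {
        String s = new String(bytes.toByteArray(), StandardCharsets.UTF_8);
        return s.endsWith("\r") ? s.substring(0, s.length() - 1) : s;
    }
}
```

Usage would look like ChannelLineReader.forEachLine(Paths.get("test.txt"), line -> { /* process */ }). Reading the buffer one byte at a time keeps the sketch short; a bulk get into a byte[] would probably be quicker.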

6) Parallel processing should increase speed. Allocate a big byte buffer and run several tasks that read bytes from the file into that buffer in parallel. When they are done, find the first end of line, make a String, find the next, and so on.
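The idea in point 6 could be sketched roughly as follows. ParallelFileReader, readLines and the tasks parameter are hypothetical names, and the sketch assumes the whole file fits into a single byte array (under 2 GB) of UTF-8 text. A 70-million-line file may well be larger than that, in which case the same steps would have to be applied window by window. Whether this beats a plain BufferedReader depends on the disk and should be measured.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not part of the original answer.
class ParallelFileReader {

    static List<String> readLines(Path path, int tasks) throws Exception {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size > Integer.MAX_VALUE - 8) {
                throw new IllegalArgumentException("file too large for one buffer");
            }
            byte[] data = new byte[(int) size];              // the "big byte buffer"
            int chunk = (int) ((size + tasks - 1) / tasks);  // bytes per task, rounded up

            ExecutorService pool = Executors.newFixedThreadPool(tasks);
            try {
                List<Future<Void>> futures = new ArrayList<>();
                for (int start = 0; start < data.length; start += chunk) {
                    final int from = start;
                    final int len = Math.min(chunk, data.length - from);
                    Callable<Void> task = () -> {
                        // Each task fills its own region of the shared array.
                        ByteBuffer region = ByteBuffer.wrap(data, from, len);
                        long pos = from;
                        while (region.hasRemaining()) {
                            // Positional read: does not move the channel's own position.
                            int n = ch.read(region, pos);
                            if (n < 0) {
                                throw new IOException("unexpected end of file");
                            }
                            pos += n;
                        }
                        return null;
                    };
                    futures.add(pool.submit(task));
                }
                for (Future<Void> f : futures) {
                    f.get(); // wait for the task and rethrow any failure
                }
            } finally {
                pool.shutdown();
            }

            // Buffer is ready: find each end of line and make a String.
            List<String> lines = new ArrayList<>();
            int lineStart = 0;
            for (int i = 0; i < data.length; i++) {
                if (data[i] == '\n') {
                    lines.add(new String(data, lineStart, i - lineStart, StandardCharsets.UTF_8));
                    lineStart = i + 1;
                }
            }
            if (lineStart < data.length) {
                lines.add(new String(data, lineStart, data.length - lineStart, StandardCharsets.UTF_8));
            }
            return lines;
        }
    }
}
```

Only the reads run in parallel here; the line splitting is sequential. Splitting could be parallelized too, but each chunk boundary would then need care because a line can straddle two chunks.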

Fastest way to write huge data in text file Java

You might try removing the BufferedWriter and just using the FileWriter directly. On a modern system there's a good chance you're just writing to the drive's cache memory anyway.

It takes me in the range of 4-5 seconds to write 175MB (4 million strings) -- this is on a dual-core 2.4GHz Dell running Windows XP with an 80GB, 7200-RPM Hitachi disk.

Can you isolate how much of the time is record retrieval and how much is file writing?

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

public class FileWritingPerfTest {

    private static final int ITERATIONS = 5;
    private static final double MEG = (Math.pow(1024, 2));
    private static final int RECORD_COUNT = 4000000;
    private static final String RECORD = "Help I am trapped in a fortune cookie factory\n";
    private static final int RECSIZE = RECORD.getBytes().length;

    public static void main(String[] args) throws Exception {
        List<String> records = new ArrayList<String>(RECORD_COUNT);
        int size = 0;
        for (int i = 0; i < RECORD_COUNT; i++) {
            records.add(RECORD);
            size += RECSIZE;
        }
        System.out.println(records.size() + " 'records'");
        System.out.println(size / MEG + " MB");

        for (int i = 0; i < ITERATIONS; i++) {
            System.out.println("\nIteration " + i);

            writeRaw(records);
            writeBuffered(records, 8192);
            writeBuffered(records, (int) MEG);
            writeBuffered(records, 4 * (int) MEG);
        }
    }

    private static void writeRaw(List<String> records) throws IOException {
        File file = File.createTempFile("foo", ".txt");
        try {
            FileWriter writer = new FileWriter(file);
            System.out.print("Writing raw... ");
            write(records, writer);
        } finally {
            // comment this out if you want to inspect the files afterward
            file.delete();
        }
    }

    private static void writeBuffered(List<String> records, int bufSize) throws IOException {
        File file = File.createTempFile("foo", ".txt");
        try {
            FileWriter writer = new FileWriter(file);
            BufferedWriter bufferedWriter = new BufferedWriter(writer, bufSize);

            System.out.print("Writing buffered (buffer size: " + bufSize + ")... ");
            write(records, bufferedWriter);
        } finally {
            // comment this out if you want to inspect the files afterward
            file.delete();
        }
    }

    private static void write(List<String> records, Writer writer) throws IOException {
        long start = System.currentTimeMillis();
        for (String record: records) {
            writer.write(record);
        }
        // writer.flush(); // close() should take care of this
        writer.close();
        long end = System.currentTimeMillis();
        System.out.println((end - start) / 1000f + " seconds");
    }
}

Java read URLConnection with many lines efficiently

For this particular case, I would cache the file locally. Using Java, you can transfer the file to your computer with very little memory, then go through it line by line without loading the whole file into memory and pull out the data you need, or load it all at once.

EDIT: Changed the variable names; I pulled this from my own code and forgot to neutralize them. Also, FileChannel transferTo/transferFrom can be much more efficient, as there are potentially fewer copies and, depending on the operation, the data could go straight from the socket buffer to disk. See the FileChannel API.

String fileUrlString = "http://update.domain.com/file.json"; // file URL
Path diskSaveLocation = Paths.get("file.json"); // relative path, so the file lands in your working directory

final URL url = new URL(fileUrlString);
final URLConnection conn = url.openConnection();
final long fileLength = conn.getContentLengthLong(); // -1 if the server does not report a length
System.out.println(String.format("Downloading file... %s, Size: %d bytes.", fileUrlString, fileLength));
try (
    FileOutputStream stream = new FileOutputStream(diskSaveLocation.toFile(), false);
    FileChannel fileChannel = stream.getChannel();
    ReadableByteChannel inChannel = Channels.newChannel(conn.getInputStream());
) {
    long readerPosition = 0;
    while (readerPosition < fileLength) {
        long read = fileChannel.transferFrom(inChannel, readerPosition, fileLength - readerPosition);
        if (read <= 0) {
            break; // the source ended early; the size check below catches this
        }
        readerPosition += read;
    }
}
// check the size only after the channels are closed
if (fileLength != Files.size(diskSaveLocation)) {
    Files.delete(diskSaveLocation);
    System.out.println(String.format("File... %s did not download correctly, deleting file artifact!", fileUrlString));
} else {
    System.out.println(String.format("File Download... %s completed!", fileUrlString));
}
((HttpURLConnection) conn).disconnect();

You can now read this same file using an NIO.2 method that lets you read line by line without loading the whole file into memory. Scanner or RandomAccessFile likewise let you avoid holding the whole file on the heap. If you want to read the whole file in, you can also do so locally from the cached file using the utility methods of Java's Files class.
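As one possible illustration of the line-by-line option, the sketch below streams the cached file with Files.lines (Java 8 and later) and counts the lines that contain a given substring. CachedFileScan, countMatching and needle are made-up names, and UTF-8 is an assumption about the file's encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper, not part of the original answer.
class CachedFileScan {

    // Counts the lines containing the given substring. Lines are read lazily,
    // so the whole file is never held in memory at once.
    static long countMatching(Path file, String needle) throws IOException {
        // try-with-resources closes the underlying file when the stream is done
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.filter(line -> line.contains(needle)).count();
        }
    }
}
```

If the whole file really is needed at once, Files.readAllLines or Files.readAllBytes on the same path would do that instead.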

How to read a huge HTML file in Java?

@user811433, I did some testing with Apache Commons IO, reading a log file of around 800 MB, and no error occurred during execution.

This method opens an InputStream for the file. When you have finished
with the iterator you should close the stream to free internal
resources. This can be done by calling the LineIterator.close() or
LineIterator.closeQuietly(LineIterator) method.

If you process the file line by line, like a stream, the recommended usage pattern is something like this:

File file = new File("C:\\Users\\lucas\\Desktop\\file-with-800MB.log");

LineIterator it = FileUtils.lineIterator(file, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line, here just sysout...
        System.out.println( line );
    }
} finally {
    LineIterator.closeQuietly(it);
}

