How to Read Large Text Files Line by Line Without Loading Them into Memory

How can I read large text files partially (out-of-core)?

I provided this answer because Keith's, while succinct, doesn't close the file explicitly:

with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)

Scala: Reading a huge zipped text file line by line without loading into memory

If the file is gzipped, Java's GZIPInputStream gives you streaming access:

import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

val lines: Iterator[String] = Source
  .fromInputStream(new GZIPInputStream(new FileInputStream("foo.gz")))
  .getLines()
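For comparison, the same streaming idea can be sketched in Python, where the standard-library gzip module decompresses data as it is read. The helper name iter_gzip_lines and the UTF-8 default are assumptions made for this illustration, not part of the original answer.

```python
import gzip
from typing import Iterator


def iter_gzip_lines(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield lines from a gzip-compressed text file one at a time."""
    # "rt" opens the stream in text mode; data is decompressed incrementally
    # as the file is iterated, so the whole file is never held in memory.
    with gzip.open(path, "rt", encoding=encoding) as fh:
        for line in fh:
            yield line.rstrip("\n")
```

A caller would loop with something like `for line in iter_gzip_lines("foo.gz"): ...`, which only ever holds the current line plus the decompressor's internal buffer.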

If it is a zip archive, as your question suggests, that's more complicated. Zip archives are more like folders than individual files. You'd have to read the table of contents first, and then scan through the entries to find the one you want to read (or read all of them). Something like this
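The approach described here (list the archive's entries, then stream each one) can be sketched in Python with the standard-library zipfile module. This is an illustrative sketch, not the answer's own code; the helper name iter_zip_lines and the UTF-8 default are assumptions.

```python
import io
import zipfile
from typing import Iterator, Tuple


def iter_zip_lines(path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Yield (entry_name, line) pairs for every file entry in a zip archive."""
    with zipfile.ZipFile(path) as archive:
        # infolist() returns the entries recorded in the archive's
        # central directory (its table of contents).
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip folder entries
            # archive.open() returns a binary stream that decompresses lazily.
            with archive.open(info) as raw:
                for line in io.TextIOWrapper(raw, encoding=encoding):
                    yield info.filename, line.rstrip("\n")
```

To read a single entry instead of all of them, the loop could compare `info.filename` against the wanted name and skip the rest.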

How do I properly read large text files in Python so I don't clog up memory?

Usually the best thing to do is to iterate over the file directly. The file handle acts as a lazy iterator, producing lines one at a time rather than aggregating them all into a list in memory at once (as fh.readlines() does):

with open("somefile") as fh:
    for line in fh:
        pass  # do something with line

Furthermore, file handles allow you to read specific amounts of data if you so choose:

with open("somefile") as fh:
    chunk = fh.read(15)  # up to 15 characters in text mode (15 bytes in binary mode)
    while chunk:
        # do something with chunk
        chunk = fh.read(15)
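The read-until-empty loop above can also be written with the two-argument form of iter(), which calls a function repeatedly until it returns a sentinel value. The following is a minimal sketch; the helper name iter_chunks is an assumption, and it expects a file opened in text mode (for binary mode the sentinel would be b"").

```python
from functools import partial
from typing import Iterator, TextIO


def iter_chunks(fh: TextIO, size: int = 15) -> Iterator[str]:
    """Yield successive chunks of at most `size` characters from a text-mode file."""
    # iter(callable, sentinel) keeps calling fh.read(size) until it returns "",
    # which is what read() returns at end of file.
    return iter(partial(fh.read, size), "")
```

Usage would look like `for chunk in iter_chunks(fh, 4096): ...` inside a `with open(...)` block.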

Or, if you want to read a specific number of lines:

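with open('somefile') as fh:
    while True:
        chunk_of_lines = [fh.readline() for _ in range(5)]  # read up to 5 lines at a time
        chunk_of_lines = [line for line in chunk_of_lines if line]  # drop the "" returned at end of file
        if not chunk_of_lines:
            break
        # do something with chunk_of_lines here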

Here fh.readline() is analogous to calling next(fh), except that at the end of the file it returns an empty string instead of raising StopIteration.

A while loop is used in the latter two examples because once the file has been read to the end, fh.readline() and fh.read(some_integer) return an empty string, which is falsy and lets the loop terminate.
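Reading a fixed number of lines at a time can also be expressed with itertools.islice, which avoids the end-of-file bookkeeping entirely. This is a sketch under the assumption that a helper named iter_line_batches is acceptable; it works on any iterable of lines, including an open file handle.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_line_batches(lines: Iterable[str], batch_size: int = 5) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` lines until the input is exhausted."""
    it = iter(lines)
    while True:
        # islice pulls at most batch_size items and stops early at end of input.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Only one batch is held in memory at a time, so memory use is bounded by the batch size rather than the file size.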

How to read a large file - line by line?

The correct, fully Pythonic way to read a file is the following:

with open(...) as f:
    for line in f:
        # Do something with 'line'

The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management so you don't have to worry about large files.
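The claim that the file is closed even when the inner block raises can be checked with a small experiment. The function name file_closed_after_error and the simulated ValueError are assumptions made purely for this demonstration.

```python
def file_closed_after_error(path: str) -> bool:
    """Return True if the file was closed even though the with-body raised."""
    try:
        with open(path) as f:
            for line in f:
                # Simulate a failure while processing the first line.
                raise ValueError("simulated failure while processing a line")
    except ValueError:
        pass
    # The with statement has already run f.close() by the time we get here.
    return f.closed
```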

There should be one -- and preferably only one -- obvious way to do it.

How to find difference (line-based) in sorted large text files in Java without loading them in full into memory?

I thought this might be an interesting problem, so I put something together to illustrate how a difference application might work.

I had a file of words for a different application. So, I grabbed the first 100 words and reduced the size of each down to something I could test with easily.

Word List 1

aback
abandon
abandoned
abashed
abatement
abbey
abbot
abbreviate
abdomen
abducted
aberrant
aberration
abetted
abeyance

Word List 2

aardvark
aback
abacus
abandon
abatement
abbey
abbot
abbreviate
abdicate
abdomen
aberrant
aberration

My example application produces two different outputs. Here's the first output from my test run, the full difference output.

Differences between /word1.txt and /word2.txt
-----------------------------------------------------

------ Inserted ----- | aardvark
aback                 | aback
------ Inserted ----- | abacus
abandon               | abandon
abandoned             | ------ Deleted ------
abashed               | ------ Deleted ------
abatement             | abatement
abbey                 | abbey
abbot                 | abbot
abbreviate            | abbreviate
------ Inserted ----- | abdicate
abdomen               | abdomen
abducted              | ------ Deleted ------
aberrant              | aberrant
aberration            | aberration
abetted               | ------ Deleted ------
abeyance              | ------ Deleted ------

Now, for two really long files, where most of the text will match, this output would be hard to read. So, I also created an abbreviated output.

Differences between /word1.txt and /word2.txt
-----------------------------------------------------

------ Inserted ----- | aardvark
--------------- 1 line is the same --------------
------ Inserted ----- | abacus
--------------- 1 line is the same --------------
abandoned             | ------ Deleted ------
abashed               | ------ Deleted ------
-------------- 4 lines are the same -------------
------ Inserted ----- | abdicate
--------------- 1 line is the same --------------
abducted              | ------ Deleted ------
-------------- 2 lines are the same -------------
abetted               | ------ Deleted ------
abeyance              | ------ Deleted ------
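The abbreviation step (replacing each run of matching lines with a single count) can be sketched on its own in Python. This sketch assumes the comparison has already produced a stream of (tag, text) pairs with the hypothetical tags "equal", "insert" and "delete"; the function name abbreviate and the simplified output format are also assumptions.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def abbreviate(events: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Collapse each run of consecutive 'equal' events into one summary line."""
    # groupby is lazy, so only the current run is consumed at any time.
    for tag, group in groupby(events, key=lambda event: event[0]):
        if tag == "equal":
            count = sum(1 for _ in group)
            noun = "line is" if count == 1 else "lines are"
            yield f"{count} {noun} the same"
        else:
            for _, text in group:
                yield f"{tag}: {text}"
```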

With these small test files, there's not much difference between the two reports.

With two large text files, the abbreviated report would be a lot easier to read.

Here's the example code.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class Difference {

    public static void main(String[] args) {
        String file1 = "/word1.txt";
        String file2 = "/word2.txt";

        try {
            new Difference().compareFiles(file1, file2);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void compareFiles(String file1, String file2)
            throws IOException {
        int columnWidth = 25;
        int pageWidth = columnWidth + columnWidth + 3;
        boolean isFullReport = true;

        System.out.println(getTitle(file1, file2));
        System.out.println(getDashedLine(pageWidth));
        System.out.println();

        URL url1 = getClass().getResource(file1);
        URL url2 = getClass().getResource(file2);

        BufferedReader br1 = new BufferedReader(new InputStreamReader(
                url1.openStream()));
        BufferedReader br2 = new BufferedReader(new InputStreamReader(
                url2.openStream()));

        int countEqual = 0;
        String line1 = br1.readLine();
        String line2 = br2.readLine();

        while (line1 != null && line2 != null) {
            int result = line1.compareTo(line2);
            if (result == 0) {
                countEqual++;
                if (isFullReport) {
                    System.out.println(getFullEqualsLine(columnWidth,
                            line1, line2));
                }
                line1 = br1.readLine();
                line2 = br2.readLine();
            } else if (result < 0) {
                printEqualsLine(pageWidth, countEqual, isFullReport);
                countEqual = 0;
                System.out.println(getDifferenceLine(columnWidth,
                        line1, ""));
                line1 = br1.readLine();
            } else {
                printEqualsLine(pageWidth, countEqual, isFullReport);
                countEqual = 0;
                System.out.println(getDifferenceLine(columnWidth,
                        "", line2));
                line2 = br2.readLine();
            }
        }

        printEqualsLine(pageWidth, countEqual, isFullReport);

        while (line1 != null) {
            System.out.println(getDifferenceLine(columnWidth,
                    line1, ""));
            line1 = br1.readLine();
        }

        while (line2 != null) {
            System.out.println(getDifferenceLine(columnWidth,
                    "", line2));
            line2 = br2.readLine();
        }

        br1.close();
        br2.close();
    }

    private void printEqualsLine(int pageWidth, int countEqual,
            boolean isFullReport) {
        if (!isFullReport && countEqual > 0) {
            System.out.println(getEqualsLine(countEqual, pageWidth));
        }
    }

    private String getTitle(String file1, String file2) {
        return "Differences between " + file1 + " and " + file2;
    }

    private String getEqualsLine(int count, int length) {
        String lines = "lines are";
        if (count == 1) {
            lines = "line is";
        }
        String output = " " + count + " " + lines +
                " the same ";
        return getTextLine(length, output);
    }

    private String getFullEqualsLine(int columnWidth, String line1,
            String line2) {
        String format = "%-" + columnWidth + "s";
        return String.format(format, line1) + " | " +
                String.format(format, line2);
    }

    private String getDifferenceLine(int columnWidth, String line1,
            String line2) {
        String format = "%-" + columnWidth + "s";
        String deleted = getTextLine(columnWidth, " Deleted ");
        String inserted = getTextLine(columnWidth, " Inserted ");

        if (line1.isEmpty()) {
            return inserted + " | " + String.format(format, line2);
        } else {
            return String.format(format, line1) + " | " + deleted;
        }
    }

    private String getTextLine(int length, String output) {
        int half2 = (length - output.length()) / 2;
        int half1 = length - output.length() - half2;
        output = getDashedLine(half1) + output;
        output += getDashedLine(half2);
        return output;
    }

    private String getDashedLine(int count) {
        String output = "";
        for (int i = 0; i < count; i++) {
            output += "-";
        }
        return output;
    }

}
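The heart of compareFiles is a merge-style walk over two sorted inputs: advance both when the lines match, otherwise advance whichever side has the smaller line. As a compact illustration of just that walk, here is a Python sketch. The function name diff_sorted and the tag strings are assumptions, and it presumes both inputs are sorted in the order Python's string comparison uses and have had their trailing newlines stripped.

```python
from typing import Iterable, Iterator, Optional, Tuple


def diff_sorted(left: Iterable[str], right: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield ('equal' | 'delete' | 'insert', line) for two sorted line streams."""
    left_it, right_it = iter(left), iter(right)
    a: Optional[str] = next(left_it, None)
    b: Optional[str] = next(right_it, None)
    while a is not None and b is not None:
        if a == b:
            yield "equal", a
            a, b = next(left_it, None), next(right_it, None)
        elif a < b:
            yield "delete", a  # present only in the first input
            a = next(left_it, None)
        else:
            yield "insert", b  # present only in the second input
            b = next(right_it, None)
    # One input is exhausted; whatever remains in the other is unmatched.
    while a is not None:
        yield "delete", a
        a = next(left_it, None)
    while b is not None:
        yield "insert", b
        b = next(right_it, None)
```

With two open files, the inputs could be passed as generators such as `(line.rstrip("\n") for line in fh1)`, so only one line from each file is held at a time.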

