How can I read large text files partially (out-of-core)?
I provided this answer because Keith's, while succinct, doesn't close the file explicitly
with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)
Scala: Reading a huge zipped text file line by line without loading into memory
If the file is gzipped, Java's GZIPInputStream gives you streaming access:
import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

val lines: Iterator[String] = Source
  .fromInputStream(new GZIPInputStream(new FileInputStream("foo.gz")))
  .getLines()
If it is a zip archive, as your question suggests, that's more complicated. Zip archives are more like folders than individual files. You'd have to read the table of contents first, and then scan through the entries to find the one you want to read (or to read all of them).
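For comparison, here is a minimal Python sketch of the same idea, assuming a zip archive whose entries are UTF-8 text. The helper name iter_zip_lines is hypothetical. zipfile.ZipFile reads the archive's table of contents, ZipFile.open returns a binary stream for a single entry, and io.TextIOWrapper decodes that stream lazily, so only a small buffer is held in memory at any time.

```python
import io
import zipfile


def iter_zip_lines(archive_path, encoding="utf-8"):
    """Yield (entry_name, line) pairs for every file entry in a zip archive.

    Hypothetical helper: each entry is decompressed as a stream, so the
    archive's contents are never loaded into memory all at once.
    """
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():      # the archive's table of contents
            if info.is_dir():                # skip directory entries
                continue
            with archive.open(info) as raw:  # binary, decompressed on the fly
                for line in io.TextIOWrapper(raw, encoding=encoding):
                    yield info.filename, line.rstrip("\n")
```

To read only one entry, compare info.filename against the name you want before opening it.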
How do I properly read large text files in Python so I don't clog up memory?
Usually the best thing to do is to iterate over the file directly. The file handle acts as a lazy iterator, producing lines one at a time rather than reading them all into memory at once as a list (as fh.readlines() does):
with open("somefile") as fh:
    for line in fh:
        pass  # do something with line
Furthermore, file handles allow you to read specific amounts of data if you so choose:
with open("somefile") as fh:
    chunk = fh.read(15)  # read up to 15 characters (15 bytes if opened in binary mode)
    while chunk:
        # do something with chunk
        chunk = fh.read(15)
Or, if you want to read a specific number of lines:
with open('somefile') as fh:
    while True:
        # read up to 5 lines; readline() returns '' at end of file, so drop those
        chunk_of_lines = [line for line in (fh.readline() for _ in range(5)) if line]
        if not chunk_of_lines:
            break
        # do something else here
Here fh.readline() is analogous to calling next(fh) in a for loop, except that at end of file it returns an empty string instead of raising StopIteration.
A while loop is used in the latter two examples because, once the file has been read to the end, fh.readline() or fh.read(some_integer) returns an empty string, which is falsy and so can be used to terminate the loop.
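Both patterns can also be wrapped in small generators so the calling code stays a plain for loop. This is only a sketch: the names read_in_chunks and read_line_batches are hypothetical, and the first helper assumes a text-mode handle, where end of file is signalled by an empty string.

```python
from functools import partial
from itertools import islice


def read_in_chunks(fh, size=4096):
    """Yield successive blocks of at most `size` characters from a text handle."""
    # iter(callable, sentinel) keeps calling fh.read(size) until it returns
    # the sentinel "", which is what a text-mode handle returns at end of file.
    return iter(partial(fh.read, size), "")


def read_line_batches(fh, n=5):
    """Yield lists of up to `n` lines; the final batch may be shorter."""
    while True:
        batch = list(islice(fh, n))
        if not batch:  # an empty list means the file is exhausted
            break
        yield batch
```

For a handle opened in binary mode, the sentinel in read_in_chunks would need to be b"" instead.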
How to read a large file - line by line?
The correct, fully Pythonic way to read a file is the following:
with open(...) as f:
    for line in f:
        # Do something with 'line'
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management so you don't have to worry about large files.
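As a small illustration of that laziness, the sketch below counts the lines containing a substring without ever building a list of lines. The function name count_matching_lines and its parameters are hypothetical, and the UTF-8 default is an assumption about the input file.

```python
def count_matching_lines(path, needle, encoding="utf-8"):
    """Count the lines in `path` that contain `needle`."""
    with open(path, encoding=encoding) as f:
        # The generator expression pulls one line at a time from f, so peak
        # memory depends on the longest line, not on the size of the file.
        return sum(1 for line in f if needle in line)
```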
There should be one -- and preferably only one -- obvious way to do it.
How to find difference (line-based) in sorted large text files in Java without loading them in full into memory?
I thought this might be an interesting problem, so I put something together to illustrate how a difference application might work.
I had a file of words for a different application. So, I grabbed the first 100 words and reduced the size of each down to something I could test with easily.
Word List 1
aback
abandon
abandoned
abashed
abatement
abbey
abbot
abbreviate
abdomen
abducted
aberrant
aberration
abetted
abeyance
Word List 2
aardvark
aback
abacus
abandon
abatement
abbey
abbot
abbreviate
abdicate
abdomen
aberrant
aberration
My example application produces two different outputs. Here's the first output from my test run, the full difference output.
Differences between /word1.txt and /word2.txt
-----------------------------------------------------
------ Inserted ----- | aardvark
aback                 | aback
------ Inserted ----- | abacus
abandon               | abandon
abandoned             | ------ Deleted ------
abashed               | ------ Deleted ------
abatement             | abatement
abbey                 | abbey
abbot                 | abbot
abbreviate            | abbreviate
------ Inserted ----- | abdicate
abdomen               | abdomen
abducted              | ------ Deleted ------
aberrant              | aberrant
aberration            | aberration
abetted               | ------ Deleted ------
abeyance              | ------ Deleted ------
Now, for two really long files, where most of the text will match, this output would be hard to read. So, I also created an abbreviated output.
Differences between /word1.txt and /word2.txt
-----------------------------------------------------
------ Inserted ----- | aardvark
--------------- 1 line is the same --------------
------ Inserted ----- | abacus
--------------- 1 line is the same --------------
abandoned             | ------ Deleted ------
abashed               | ------ Deleted ------
-------------- 4 lines are the same -------------
------ Inserted ----- | abdicate
--------------- 1 line is the same --------------
abducted              | ------ Deleted ------
-------------- 2 lines are the same -------------
abetted               | ------ Deleted ------
abeyance              | ------ Deleted ------
With these small test files, there's not much difference between the two reports.
With two large text files, the abbreviated report would be a lot easier to read.
Here's the example code.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class Difference {

    public static void main(String[] args) {
        String file1 = "/word1.txt";
        String file2 = "/word2.txt";

        try {
            new Difference().compareFiles(file1, file2);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void compareFiles(String file1, String file2)
            throws IOException {
        int columnWidth = 25;
        int pageWidth = columnWidth + columnWidth + 3;
        boolean isFullReport = true; // false prints the abbreviated report

        System.out.println(getTitle(file1, file2));
        System.out.println(getDashedLine(pageWidth));
        System.out.println();

        // The files are looked up as classpath resources.
        URL url1 = getClass().getResource(file1);
        URL url2 = getClass().getResource(file2);

        // try-with-resources closes both readers even if an exception is thrown.
        try (BufferedReader br1 = new BufferedReader(
                     new InputStreamReader(url1.openStream()));
             BufferedReader br2 = new BufferedReader(
                     new InputStreamReader(url2.openStream()))) {

            int countEqual = 0;
            String line1 = br1.readLine();
            String line2 = br2.readLine();

            // Merge-style walk: both files are sorted, so only the current
            // line of each file has to be held in memory.
            while (line1 != null && line2 != null) {
                int result = line1.compareTo(line2);
                if (result == 0) {
                    countEqual++;
                    if (isFullReport) {
                        System.out.println(getFullEqualsLine(columnWidth,
                                line1, line2));
                    }
                    line1 = br1.readLine();
                    line2 = br2.readLine();
                } else if (result < 0) {
                    // line1 sorts first, so it is missing from file 2.
                    printEqualsLine(pageWidth, countEqual, isFullReport);
                    countEqual = 0;
                    System.out.println(getDifferenceLine(columnWidth,
                            line1, ""));
                    line1 = br1.readLine();
                } else {
                    // line2 sorts first, so it is missing from file 1.
                    printEqualsLine(pageWidth, countEqual, isFullReport);
                    countEqual = 0;
                    System.out.println(getDifferenceLine(columnWidth,
                            "", line2));
                    line2 = br2.readLine();
                }
            }

            printEqualsLine(pageWidth, countEqual, isFullReport);

            // One file is exhausted; the rest of the other is unmatched.
            while (line1 != null) {
                System.out.println(getDifferenceLine(columnWidth,
                        line1, ""));
                line1 = br1.readLine();
            }
            while (line2 != null) {
                System.out.println(getDifferenceLine(columnWidth,
                        "", line2));
                line2 = br2.readLine();
            }
        }
    }

    private void printEqualsLine(int pageWidth, int countEqual,
            boolean isFullReport) {
        if (!isFullReport && countEqual > 0) {
            System.out.println(getEqualsLine(countEqual, pageWidth));
        }
    }

    private String getTitle(String file1, String file2) {
        return "Differences between " + file1 + " and " + file2;
    }

    private String getEqualsLine(int count, int length) {
        String lines = "lines are";
        if (count == 1) {
            lines = "line is";
        }
        String output = " " + count + " " + lines +
                " the same ";
        return getTextLine(length, output);
    }

    private String getFullEqualsLine(int columnWidth, String line1,
            String line2) {
        String format = "%-" + columnWidth + "s";
        return String.format(format, line1) + " | " +
                String.format(format, line2);
    }

    // An empty line1 marks an insertion, so this assumes that the input
    // files themselves contain no blank lines.
    private String getDifferenceLine(int columnWidth, String line1,
            String line2) {
        String format = "%-" + columnWidth + "s";
        String deleted = getTextLine(columnWidth, " Deleted ");
        String inserted = getTextLine(columnWidth, " Inserted ");

        if (line1.isEmpty()) {
            return inserted + " | " + String.format(format, line2);
        } else {
            return String.format(format, line1) + " | " + deleted;
        }
    }

    // Centers the text between two runs of dashes.
    private String getTextLine(int length, String output) {
        int half2 = (length - output.length()) / 2;
        int half1 = length - output.length() - half2;
        output = getDashedLine(half1) + output;
        output += getDashedLine(half2);
        return output;
    }

    private String getDashedLine(int count) {
        StringBuilder output = new StringBuilder();
        for (int i = 0; i < count; i++) {
            output.append('-');
        }
        return output.toString();
    }
}
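For readers who want only the core of the comparison without the report formatting, here is a minimal Python sketch of the same merge-style walk. The name diff_sorted and the tag strings are hypothetical, and it assumes both inputs are already sorted in the order used by plain string comparison. Using None as the end marker, rather than an empty string, means blank lines in the input are handled like any other line.

```python
def diff_sorted(lines1, lines2):
    """Yield (tag, line) pairs for two iterables of sorted lines.

    tag is "same", "deleted" (only in the first input) or "inserted"
    (only in the second input).
    """
    it1, it2 = iter(lines1), iter(lines2)
    line1, line2 = next(it1, None), next(it2, None)
    while line1 is not None and line2 is not None:
        if line1 == line2:
            yield "same", line1
            line1, line2 = next(it1, None), next(it2, None)
        elif line1 < line2:
            yield "deleted", line1
            line1 = next(it1, None)
        else:
            yield "inserted", line2
            line2 = next(it2, None)
    # One input is exhausted; whatever remains in the other is unmatched.
    while line1 is not None:
        yield "deleted", line1
        line1 = next(it1, None)
    while line2 is not None:
        yield "inserted", line2
        line2 = next(it2, None)
```

Because the function accepts any iterables, two open file objects can be passed in directly, and only the current line of each file is held in memory.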