Only read specific line numbers from a large file in Python?
Here are some options:
- Go over the file at least once and record the byte offsets of the lines you are interested in. This is a good approach if you might be seeking to these lines multiple times and the file won't be changed.
- Consider changing the data format, for example CSV instead of JSON (see comments).
- If you have no other alternative, use the traditional:
def get_lines(..., linenums: list):
    linenums = set(linenums)  # O(1) membership tests
    with open(...) as f:
        for lno, ln in enumerate(f):
            if lno in linenums:
                yield ln
On a 4 GB file this took ~6 s for linenums = [n // 4, n // 2, n - 1], where n is the number of lines in the file.
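The offset-index idea from the first bullet can be sketched as follows. The helper names build_index and read_line are illustrative, and the sketch assumes the file is not modified between indexing and reading; opening in binary mode keeps byte offsets exact:

```python
def build_index(path):
    """Scan the file once, recording the byte offset where each line starts."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_line(path, offsets, lineno):
    """Jump straight to a 0-based line number using the saved offsets."""
    with open(path, "rb") as f:
        f.seek(offsets[lineno])
        return f.readline()
```

After the single indexing pass, every lookup is one seek plus one readline, no matter how large the file is.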
Read lines by number from a large file
The trick is to use a connection AND open it before read.table:
con <- file('filename')
open(con)
read.table(con, skip = 5, nrow = 1)   # 6th line
read.table(con, skip = 20, nrow = 1)  # 27th line
...
close(con)
You may also try scan; it is faster and gives more control.
How to read specific lines from a file (by line number)?
If the file to read is big and you don't want to read the whole file into memory at once:
fp = open("file")
for i, line in enumerate(fp):
    if i == 25:
        # 26th line
        pass
    elif i == 29:
        # 30th line
        pass
    elif i > 29:
        break
fp.close()
Note that i == n - 1 for the nth line.
In Python 2.6 or later:
with open("file") as fp:
    for i, line in enumerate(fp):
        if i == 25:
            # 26th line
            pass
        elif i == 29:
            # 30th line
            pass
        elif i > 29:
            break
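If the lines you want form a contiguous range, itertools.islice is a compact alternative. This is a sketch with an illustrative helper lines_between; indices are 0-based, so 25..29 covers the 26th through 30th lines:

```python
from itertools import islice

def lines_between(path, start, stop):
    """Yield lines with 0-based indices start..stop-1,
    e.g. start=25, stop=30 for the 26th through 30th lines."""
    with open(path) as fp:
        yield from islice(fp, start, stop)
```

islice advances the file iterator lazily, so nothing before start is kept in memory and iteration stops as soon as stop is reached.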
How to read a large file - line by line?
The correct, fully Pythonic way to read a file is the following:
with open(...) as f:
    for line in f:
        # Do something with 'line'
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management so you don't have to worry about large files.
There should be one -- and preferably only one -- obvious way to do it.
How can I read large text files line by line, without loading them into memory?
Use a for loop on a file object to read it line by line. Use with open(...) so that a context manager ensures the file is closed after reading:
with open("log.txt") as infile:
    for line in infile:
        print(line)
How To Read Line Numbers x Through (x+y) From A Very Large File
When you write:
lines.skip(startLine)
you create a new stream, but you don't save a reference to it, so you lose the operation.
I suspect you want something like:
return lines.skip(startLine)
            .limit(100000)
            .map(fileReader::populateMyModel)
            .collect(toList());
Reading a particular line by line number in a very large file
pack 'N' creates a 32-bit integer. The maximum 32-bit integer is 4 GB, so using that to store indexes into a file that's 100 GB in size won't work.
Some builds use 64-bit integers. On those, you could use 'j'.
Some builds use 32-bit integers. On those, tell returns a floating-point number, allowing you to index files up to 8,388,608 GB in size losslessly. On those, you should use 'F'.
Portable code would look as follows:
use Config qw( %Config );
my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';
...
print $index_file pack($off_t, $offset);
...
Note: I'm assuming the index file is only used by the same Perl that built it (or at least one with the same integer size, seek size and machine endianness). Let me know if that assumption doesn't hold for you.
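For comparison, the same fixed-width offset index can be sketched in Python with the struct module; '<q' is a little-endian signed 64-bit integer (8 bytes), which comfortably indexes files far larger than 4 GB. The names here are illustrative:

```python
import struct

OFFSET_FMT = "<q"                          # little-endian signed 64-bit integer
OFFSET_SIZE = struct.calcsize(OFFSET_FMT)  # 8 bytes per index entry

def pack_offset(offset):
    """Encode one byte offset as a fixed-width 8-byte record."""
    return struct.pack(OFFSET_FMT, offset)

def unpack_offset(raw):
    """Decode one 8-byte record back into an integer offset."""
    return struct.unpack(OFFSET_FMT, raw)[0]
```

Because every entry is exactly OFFSET_SIZE bytes, the record for line n sits at byte n * OFFSET_SIZE in the index file, so lookups need no scanning. Pinning the byte order ('<') also sidesteps the endianness caveat above.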