file.tell() inconsistency
Iterating over an open file uses a read-ahead buffer to increase efficiency. As a result, the file pointer advances in large steps across the file as you loop over the lines, and tell() reports the position of the buffer rather than of the line you just received.
From the File Objects documentation:
In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.
If you need to rely on .tell(), don't use the file object as an iterator. You can turn .readline() into an iterator instead (at the price of some performance loss):
for line in iter(f.readline, ''):
    print(f.tell())
This uses the sentinel argument of the iter() function to turn any callable into an iterator: f.readline is called repeatedly until it returns the sentinel '' at end of file.
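As a rough illustration of the same pattern in Python 3 syntax, the sketch below records the offset at which each line starts. It assumes the file is opened in binary mode so that tell() returns plain byte offsets; the helper name line_offsets is made up for this example.

```python
import os
import tempfile


def line_offsets(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(path, 'rb') as f:
        pos = f.tell()
        # iter(callable, sentinel) calls f.readline() until it returns b''.
        for _line in iter(f.readline, b''):
            offsets.append(pos)
            # readline() keeps tell() in step with the data actually returned.
            pos = f.tell()
    return offsets


if __name__ == '__main__':
    # Hypothetical sample data written to a temporary file.
    with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
        tmp.write(b'one\ntwo\nthree\n')
    try:
        print(line_offsets(tmp.name))  # [0, 4, 8]
    finally:
        os.remove(tmp.name)
```

Each recorded offset can later be passed to seek() to jump straight back to the start of that line.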
Python file.tell gives wrong value location
The cause is (rather obscurely) explained in the docs for a file object's next() method:
When a file is used as an iterator, typically in a for loop (for example,
for line in f: print line), the next() method is called repeatedly.
This method returns the next input line, or raises StopIteration when
EOF is hit. In order to make a for loop the most efficient way of looping
over the lines of a file (a very common operation), the next() method
uses a hidden read-ahead buffer. As a consequence of using a read-ahead
buffer, combining next() with other file methods (like readline()) does
not work right. However, using seek() to reposition the file to an
absolute position will flush the read-ahead buffer.
The values returned by tell() reflect how far this hidden read-ahead buffer has gotten, which will typically be up to a few thousand bytes beyond the characters your program has actually retrieved.
There's no portable way around this. If you need to mix tell() with reading lines, then use the file's readline() method instead. The tradeoff is that, in return for getting usable tell() results, iterating over a large file with readline() is typically significantly slower than using for line in file_object:.
Code
Concretely, change the loop to this:
line = self.fh.readline()
while line:
    if p.search(line):
        # tell() now points just past the matching line
        self.porSnipStartFPtr = self.fh.tell()
        sys.stdout.write("found regPorSnip")
    line = self.fh.readline()
I'm not sure that's what you really want, though: tell() is capturing the position of the start of the next line. If you want the position of the start of the matching line itself, then you need to change the logic, like so:
pos = self.fh.tell()
line = self.fh.readline()
while line:
    if p.search(line):
        # pos was recorded before this line was read
        self.porSnipStartFPtr = pos
        sys.stdout.write("found regPorSnip")
    pos = self.fh.tell()
    line = self.fh.readline()
or do it with a "loop and a half":
while True:
    pos = self.fh.tell()
    line = self.fh.readline()
    if not line:
        break
    if p.search(line):
        self.porSnipStartFPtr = pos
        sys.stdout.write("found regPorSnip")
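For reference, here is the "loop and a half" version pulled out into a standalone function. This is only a sketch: the name find_line_start is invented, and it assumes a file-like object opened in binary mode plus a compiled bytes pattern, which is not necessarily how the original class was set up.

```python
import io
import re


def find_line_start(fh, pattern):
    """Return the offset of the first line matching pattern, or None.

    Assumes fh is opened in binary mode and pattern is a compiled
    bytes regular expression.
    """
    while True:
        pos = fh.tell()        # offset before reading = start of this line
        line = fh.readline()
        if not line:           # empty bytes object means end of file
            return None
        if pattern.search(line):
            return pos


if __name__ == '__main__':
    # In-memory stand-in for a real file.
    fh = io.BytesIO(b'alpha\nbeta\ngamma\n')
    offset = find_line_start(fh, re.compile(br'gam'))
    print(offset)              # 11
    fh.seek(offset)
    print(fh.readline())       # b'gamma\n'
```

Because the offset is taken before each readline() call, seeking to the returned value puts you at the start of the matching line rather than the line after it.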
Processing large files in chunks: inconsistent seek with readline
You were so close! A relatively simple change to your final code (reading in the data as bytes and not str) makes it all (almost) work.
The main issue is that reads from a binary file are counted in bytes, while reads from a text file are counted in characters. You did your first count in bytes and your second in characters, so your assumptions about how much data had already been read were wrong. It has nothing to do with an internal, hidden buffer.
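To make the bytes-versus-characters mismatch concrete, here is a small sketch using in-memory streams. The sample string is made up, and UTF-8 is assumed as the encoding.

```python
import io

# Sample text with two non-ASCII characters (i-diaeresis and e-acute).
text = 'na\u00efve caf\u00e9\n'
data = text.encode('utf-8')

print(len(text))   # 11 characters
print(len(data))   # 13 bytes, since each accented letter takes two bytes

# read(4) means "4 bytes" on a binary stream ...
raw_chunk = io.BytesIO(data).read(4)
# ... but "4 characters" on a text stream.
text_chunk = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8').read(4)

print(len(raw_chunk.decode('utf-8')), len(text_chunk))   # 3 4
```

So a size measured with stat() (bytes) and a size passed to read() on a text-mode file (characters) drift apart as soon as the data contains multi-byte characters.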
Other changes:
- The code needs to split on b'\n' instead of using bytes.splitlines(), and only remove blank lines after the relevant detection code.
- Unless the size of the file changes (in which case your existing code will break anyway), chunkify can be replaced by a simpler, faster loop that's functionally identical without having to keep the file open.
This gives the final code:
from os import stat

def chunkify(pfin, buf_size=1024**2):
    file_end = stat(pfin).st_size

    i = -buf_size
    for i in range(0, file_end - buf_size, buf_size):
        yield i, buf_size, False

    leftover = file_end % buf_size
    if leftover == 0:  # if the last section is buf_size in size
        leftover = buf_size
    yield i + buf_size, leftover, True

def process_batch(pfin, chunk_start, chunk_size, is_last, leftover):
    with open(pfin, 'rb') as f:
        f.seek(chunk_start)
        chunk = f.read(chunk_size)

    # Add previous leftover to current chunk
    chunk = leftover + chunk
    batch = chunk.split(b'\n')

    # If this chunk is not the last one,
    # pop the last item as that will be an incomplete sentence
    # We return this leftover to use in the next chunk
    if not is_last:
        leftover = batch.pop(-1)

    return [s.decode('utf-8') for s in filter(None, batch)], leftover

if __name__ == '__main__':
    fin = r'ep+gutenberg+news+wiki.txt'

    lines_n = 0
    left = b''
    for start, size, last in chunkify(fin):
        lines, left = process_batch(fin, start, size, last, left)

        if not lines:
            continue

        for line in lines:
            print(line)
        print('\n')

        numberlines = len(lines)
        lines_n += numberlines

    print(lines_n)
What are the differences among `next(f)`, `f.readline()` and `f.next()` in Python?
Quoting official Python documentation,
A file object is its own iterator, for example iter(f) returns f (unless f is closed). When a file is used as an iterator, typically in a for loop (for example, for line in f: print line.strip()), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit when the file is open for reading (behavior is undefined when the file is open for writing). In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right.
Basically, when next is called on a Python file object, it fetches a certain number of bytes from the file, processes them, and returns only the current line (the end of the current line is determined by the newline character). So the file pointer has moved past the point where the returned line ends, and calling readline on the file afterwards gives inconsistent results. That is why mixing the two is not supported.
Unexpected file pointer position using ftell() in python traversing through for loop
tell() does not work when you iterate over a file object.
Because of optimizations for faster reads, the actual current position in the file is not meaningful once you start iterating.
Python 3 provides more help here:
OSError: telling position disabled by next() call
Using readline() works better:
from __future__ import print_function
f1 = open('sample3.txt')
line = f1.readline()
while line:
    print(line)
    print("position of the file pointer", f1.tell())
    line = f1.readline()
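To see the Python 3 behaviour mentioned above for yourself, something like the following sketch should do. It assumes CPython 3 and a text-mode file; the helper name tell_after_next and the temporary sample file are made up for the demonstration.

```python
import os
import tempfile


def tell_after_next(path):
    """Call next() on a text-mode file, then try tell().

    Returns the position on success, or the error message if tell()
    is refused.
    """
    with open(path, 'r', encoding='utf-8') as f:
        next(f)                      # starts iteration and read-ahead
        try:
            return f.tell()
        except OSError as exc:       # Python 3 text files refuse to guess
            return str(exc)


if __name__ == '__main__':
    # Hypothetical sample file with two lines.
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                     encoding='utf-8') as tmp:
        tmp.write('first\nsecond\n')
    try:
        print(tell_after_next(tmp.name))
    finally:
        os.remove(tmp.name)
```

Instead of silently returning a misleading offset as Python 2 does, the text-mode file object raises the OSError quoted above, which makes the mistake much easier to spot.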
Java : File.exists() inconsistencies when setting user.dir
Setting user.dir is unsupported. It should be considered a read-only property.
For example the evaluation of Bug 4117557 in the Sun Bug Parade contains this text:
"user.dir", which is initialized during jvm startup, should be used as an
informative/readonly system property, try to customize it via command line
-Duser.dir=xyz will end up at implementation dependend/unspecified behavior.
While this text is about setting it on the command line, setting it via setProperty() is most likely equally undefined.
If you can reproduce the problem without setting user.dir manually, then you've found a genuine problem.
The environment is inconsistent, please check the package plan carefully
I faced the same problem. Simply running
conda install anaconda
solved it for me.