File.Tell() Inconsistency

file.tell() inconsistency

Using an open file as an iterator relies on a hidden read-ahead buffer to increase efficiency. As a result, the file pointer advances across the file in large steps as you loop over the lines.

From the File Objects documentation:

In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.
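
Under CPython 2, where the built-in file type has this hidden buffer, the effect is easy to see directly. A minimal sketch (assuming some plain text file sample.txt) typically prints offsets that jump ahead in multi-kilobyte steps rather than advancing line by line:

f = open('sample.txt')
for line in f:
    # tell() reports how far the hidden read-ahead buffer has read,
    # not the end of the line that was just returned
    print len(line), f.tell()
f.close()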

If you need to rely on .tell(), don't use the file object as an iterator. You can turn .readline() into an iterator instead (at the price of some performance loss):

for line in iter(f.readline, ''):
    print f.tell()

This uses the sentinel argument of the iter() function to turn any callable into an iterator.
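
For instance, in a minimal sketch (again assuming a text file named sample.txt), iter() keeps calling readline() until it returns the empty-string sentinel at EOF, and tell() stays usable because no read-ahead is involved:

with open('sample.txt') as f:
    for line in iter(f.readline, ''):   # readline() returns '' at EOF
        print(f.tell())                 # consistent: no read-ahead buffer is involved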

Python file.tell gives wrong value location

The cause is (rather obscurely) explained in the docs for a file object's next() method:

When a file is used as an iterator, typically in a for loop (for example, for line in f: print line), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit. In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.

The values returned by tell() reflect how far this hidden read-ahead buffer has gotten, which will typically be up to a few thousand bytes beyond the characters your program has actually retrieved.

There's no portable way around this. If you need to mix tell() with reading lines, then use the file's readline() method instead. The tradeoff is that, in return for getting usable tell() results, iterating over a large file with readline() is typically significantly slower than using for line in file_object:.

Code

Concretely, change the loop to this:

line = self.fh.readline()
while line:
    if p.search(line):
        self.porSnipStartFPtr = self.fh.tell()
        sys.stdout.write("found regPorSnip")
    line = self.fh.readline()

I'm not sure that's what you really want, though: tell() is capturing the position of the start of the next line. If you want the position of the start of the matching line, then you need to change the logic, like so:

pos = self.fh.tell()
line = self.fh.readline()
while line:
    if p.search(line):
        self.porSnipStartFPtr = pos
        sys.stdout.write("found regPorSnip")
    pos = self.fh.tell()
    line = self.fh.readline()

or do it with a "loop and a half":

while True:
    pos = self.fh.tell()
    line = self.fh.readline()
    if not line:
        break
    if p.search(line):
        self.porSnipStartFPtr = pos
        sys.stdout.write("found regPorSnip")
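
As a self-contained sketch of the same idea (the file name and pattern below are placeholders, not taken from the original code), the pattern is: record tell() before reading, then attribute that offset to the line if it matches:

import re

p = re.compile(r'regPorSnip')        # placeholder pattern
with open('data.txt') as fh:         # placeholder file name
    while True:
        pos = fh.tell()              # offset of the line about to be read
        line = fh.readline()
        if not line:
            break
        if p.search(line):
            print("found regPorSnip at offset %d" % pos)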

Processing large files in chunks: inconsistent seek with readline

You were so close! A relatively simple change to your final code (reading in the data as bytes and not str) makes it all (almost) work.

The main issue is that reading from binary files counts bytes, while reading from text files counts characters. You did your first counting in bytes and your second in characters, so your assumptions about what data had already been read were wrong. It has nothing to do with an internal, hidden buffer.
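
To make that concrete (a small illustrative sketch under Python 3, not part of the original answer): a line containing multi-byte UTF-8 characters has fewer characters than bytes, so a seek() offset computed from character counts lands in the wrong place in the byte stream.

line = 'naïve résumé\n'
print(len(line))                     # 13 characters
print(len(line.encode('utf-8')))     # 16 bytes on disk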

Other changes:

  • The code needs to split on b'\n' instead of using bytes.splitlines(), and only remove blank lines after the relevant detection code.
  • Unless the size of the file changes (in which case your existing code will break anyway), chunkify can be replaced by a simpler, faster loop that's functionally identical without having to keep the file open.

This gives the final code:

from os import stat

def chunkify(pfin, buf_size=1024**2):
    file_end = stat(pfin).st_size

    i = -buf_size
    for i in range(0, file_end - buf_size, buf_size):
        yield i, buf_size, False

    leftover = file_end % buf_size
    if leftover == 0:  # if the last section is buf_size in size
        leftover = buf_size
    yield i + buf_size, leftover, True

def process_batch(pfin, chunk_start, chunk_size, is_last, leftover):
    with open(pfin, 'rb') as f:
        f.seek(chunk_start)
        chunk = f.read(chunk_size)

    # Add previous leftover to current chunk
    chunk = leftover + chunk
    batch = chunk.split(b'\n')

    # If this chunk is not the last one,
    # pop the last item as that will be an incomplete sentence
    # We return this leftover to use in the next chunk
    if not is_last:
        leftover = batch.pop(-1)

    return [s.decode('utf-8') for s in filter(None, batch)], leftover

if __name__ == '__main__':
    fin = r'ep+gutenberg+news+wiki.txt'

    lines_n = 0
    left = b''
    for start, size, last in chunkify(fin):
        lines, left = process_batch(fin, start, size, last, left)

        if not lines:
            continue

        for line in lines:
            print(line)
        print('\n')

        numberlines = len(lines)
        lines_n += numberlines

    print(lines_n)

What are the differences among `next(f)`, `f.readline()` and `f.next()` in Python?

Quoting official Python documentation,

A file object is its own iterator, for example iter(f) returns f (unless f is closed). When a file is used as an iterator, typically in a for loop (for example, for line in f: print line.strip()), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit when the file is open for reading (behavior is undefined when the file is open for writing). In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right.

Basically, when next() is called on a Python file object, it fetches a certain number of bytes from the file, processes them, and returns only the current line (the end of the line is determined by the newline character). The file pointer therefore ends up past the point where the returned line ends, so calling readline() afterwards gives inconsistent results. That is why mixing the two is discouraged.
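
One concrete difference between the calls is easiest to see at end of file: readline() signals EOF with an empty string, while the iterator protocol (next(f) in Python 3, f.next() in Python 2, where the method was later renamed __next__()) signals it with StopIteration. A minimal sketch, assuming some text file sample.txt:

with open('sample.txt') as f:
    f.read()                          # consume the whole file
    print(repr(f.readline()))         # prints '' -- readline() returns an empty string at EOF
    try:
        next(f)                       # the iterator protocol signals EOF differently
    except StopIteration:
        print('next(f) raised StopIteration')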

Unexpected file pointer position using ftell() in python traversing through for loop

tell() does not work while you iterate over a file object.
Due to optimizations for faster reads, the actual current position in the file does not make sense once you start iterating.

Python 3 provides more help here:

OSError: telling position disabled by next() call
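
That error is raised as soon as tell() is called while iterating over a text-mode file, for example (a minimal sketch):

with open('sample3.txt') as f1:
    for line in f1:
        print(f1.tell())   # OSError: telling position disabled by next() call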

Using readline() works better:

from __future__ import print_function

f1 = open('sample3.txt')
line = f1.readline()
while line:
    print(line)
    print("position of the file pointer", f1.tell())
    line = f1.readline()

Java : File.exists() inconsistencies when setting user.dir

Setting user.dir is unsupported. It should be considered a read-only property.

For example the evaluation of Bug 4117557 in the Sun Bug Parade contains this text:

"user.dir", which is initialized during jvm startup, should be used as an
informative/readonly system property, try to customize it via command line
-Duser.dir=xyz will end up at implementation dependend/unspecified behavior.

While this text is about setting it on the command line, setting it via setProperty() is most likely equally undefined.

If you can reproduce the problem without setting user.dir manually, then you've found a genuine problem.

The environment is inconsistent, please check the package plan carefully

I had faced the same problem. Simply running

conda install anaconda

solved the problem for me.


