Python: Read Lines from Compressed Text Files

Have you tried using gzip.GzipFile? Its arguments are similar to those of open.
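For example, a minimal sketch (the filename here is just a placeholder):

import gzip

# GzipFile takes a filename and mode much like the built-in open()
with gzip.GzipFile('example.log.gz', 'rb') as f:
    for line in f:  # lines come back as bytes in 'rb' mode
        print(line)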

Reading lines from a gzipped text file in Python and getting the number of compressed bytes read

You should open the underlying file in binary mode, f = open('filename.gz', 'rb'), and then open the gzip file on top of that: g = gzip.GzipFile(fileobj=f). You perform your read operations on g, and to tell how far along you are, you can call f.tell() to get the position in the compressed file.

EDIT2: BTW, you can of course also call tell() on the GzipFile instance to see how far along (in bytes read) the uncompressed file you are.

EDIT: Now I see that this is only a partial answer to your problem: you'd also need the total. There, I'm afraid, you are a bit out of luck, especially for files over 4GB as you've noted. gzip stores the uncompressed size in the last four bytes of the file, so you could seek there, read it, and seek back (GzipFile does not seem to expose this information itself). But since it's only four bytes, the biggest size it can record is 4GB; anything larger is truncated to the lower 32 bits of the value. In that case, I'm afraid you won't know the total until you reach the end.

Anyway, the hints above give you the current position in both the compressed and uncompressed streams; hopefully that lets you at least partly achieve what you set out to do.
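Putting the above together, a minimal sketch. Note that f.tell() reflects how far GzipFile has read ahead in the compressed stream, so it is only approximate; the compressed total, unlike the uncompressed one, is cheap to get with os.path.getsize:

import gzip
import os

compressed_total = os.path.getsize('filename.gz')  # total compressed bytes
with open('filename.gz', 'rb') as f:
    with gzip.GzipFile(fileobj=f) as g:
        for line in g:
            # f.tell(): position in the compressed file (approximate, since
            # GzipFile buffers reads ahead)
            # g.tell(): bytes of uncompressed data returned so far
            print(f.tell(), compressed_total, g.tell())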

Reading large compressed files

Instead of using for line in vcf.readlines(), you can do:

line = vcf.readline()
while line:
    # do stuff with the line here
    line = vcf.readline()

This will load only a single line into memory at a time.
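For context, a complete sketch assuming the file was opened with gzip.open in text mode (the filename is a placeholder):

import gzip

with gzip.open('input.vcf.gz', 'rt') as vcf:
    line = vcf.readline()
    while line:
        # do stuff with the line here
        line = vcf.readline()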

Reading line by line the contents of an 80GB .gz file without uncompressing it

You can use zcat to stream the uncompressed contents into grep (or whatever filter you want) without incurring any space overhead, e.g.:

zcat bigfile.gz | grep PATTERN_I_NEED > much_smaller_sample

Also, if it's just grep you're streaming to, you can use zgrep, e.g.:

zgrep PATTERN_I_NEED bigfile.gz > much_smaller_sample

but zgrep doesn't support 100% of the features of grep on some systems.
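If you'd rather stay in Python, the same streaming idea works with the gzip module, reading one line at a time and never materializing the uncompressed file on disk; a minimal sketch:

import gzip

# stream-filter the compressed file without any temporary storage
with gzip.open('bigfile.gz', 'rt') as src, open('much_smaller_sample', 'w') as out:
    for line in src:
        if 'PATTERN_I_NEED' in line:
            out.write(line)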

Read from a gzip file in python

Try gzipping some data through the gzip library like this...

import gzip

content = b"Lots of content here"  # gzip files hold bytes, so use a bytes literal
with gzip.open('Onlyfinnaly.log.gz', 'wb') as f:
    f.write(content)

... then run your read code ...

import gzip

with gzip.open('Onlyfinnaly.log.gz', 'rb') as f:
    file_content = f.read()
print(file_content)

This method worked for me; for whatever reason, the gzip library fails to read some files.

How to efficiently read the last line of very big gzipped log file?

If you have no control over the generation of the gzip file, then there is no way to read the last line of the uncompressed data without decoding all of the lines. The time it takes will be O(n), where n is the size of the file. There is no way to make it O(1).
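You can at least keep memory use constant during that O(n) pass by streaming and retaining only the most recent line; a minimal sketch (the filename is a placeholder):

import gzip

last_line = None
with gzip.open('big.log.gz', 'rt') as f:
    for line in f:
        last_line = line  # only the current line is ever held in memory
print(last_line)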

If you do have control over the compression end, then you can create a gzip file that facilitates random access, and you can also keep track of random-access entry points to enable jumping to the end of the file.

Read a large zipped text file line by line in python

Python file objects are iterators that yield one line at a time. file.readlines() reads them all and returns a list, which means everything must be read into memory. The better approach (which should always be preferred over readlines()) is to loop over the file object itself, e.g.:

import zipfile

with zipfile.ZipFile(...) as z:
    with z.open(...) as f:
        for line in f:
            print(line)

Note my use of the with statement: file objects are context managers, and the with statement lets us easily write readable code that ensures files are closed when the block is exited (even upon exceptions). This, again, should always be used when dealing with files.

How do you (line by line) read multiple .gz files that are inside a zipped folder in Python without creating temporary files?

You can do something like this:

from zipfile import ZipFile
import gzip

with ZipFile("storage.zip") as zf:
    for name in zf.namelist():       # avoid shadowing the built-in 'file'
        with zf.open(name) as f:
            with gzip.open(f, 'rt') as g:
                for line in g:       # iterate lazily rather than readlines()
                    print(line)

Only read specific line numbers from a large file in Python?

Here are some options:

  1. Go over the file at least once and keep track of the file offsets of the lines you are interested in. This is a good approach if you might seek to these lines multiple times and the file won't change (see the sketch after this list).
  2. Consider changing the data format, for example CSV instead of JSON (see comments).
  3. If you have no other alternative, use the traditional:

def get_lines(path, linenums: list):
    with open(path) as f:
        for lno, ln in enumerate(f):
            if lno in linenums:
                yield ln

On a 4GB file this took ~6s for linenums = [n // 4, n // 2, n - 1], where n is the number of lines in the file.
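As a sketch of option 1 (all names here are illustrative): scan the file once recording byte offsets, then seek straight to any requested line on later lookups.

def build_line_index(path):
    # one O(n) pass: record the byte offset of the start of each line
    offsets = []
    with open(path, 'rb') as f:
        offset = f.tell()
        line = f.readline()
        while line:
            offsets.append(offset)
            offset = f.tell()
            line = f.readline()
    return offsets

def read_lines(path, linenums, offsets):
    # each lookup is now a seek plus a single readline
    with open(path, 'rb') as f:
        for lno in linenums:
            f.seek(offsets[lno])
            yield f.readline()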


