How to Read a Gzip File Line by Line

python: read lines from compressed text files

Have you tried using gzip.GzipFile? Arguments are similar to open.
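A minimal sketch of that approach (the filename is hypothetical, and the sample file is written first so the snippet runs as-is):

```python
import gzip

# Hypothetical sample file, created here so the example is self-contained.
with gzip.open("example.txt.gz", "wt") as out:
    out.write("first line\nsecond line\n")

# gzip.GzipFile takes arguments much like open(); gzip.open() is the
# convenience wrapper and also accepts text mode ("rt").
with gzip.open("example.txt.gz", "rt") as f:
    for line in f:          # iterate line by line, decompressing on the fly
        print(line, end="")
```

Iterating over the file object directly avoids loading the whole decompressed file into memory.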

Reading line by line the contents of an 80GB .gz file without uncompressing it

You can use zcat to stream the uncompressed contents into grep or whatever filter you want, without incurring space overhead. E.g.

zcat bigfile.gz | grep PATTERN_I_NEED > much_smaller_sample

Also, if it's just grep you're streaming to, you can use zgrep e.g.

zgrep PATTERN_I_NEED bigfile.gz > much_smaller_sample

but zgrep doesn't support 100% of the features of grep on some systems.

Reading lines from gzipped text file in Python and get number of original compressed bytes read

You should open the underlying file in binary mode, f = open('filename.gz', 'rb'), then open the gzip stream on top of that: g = gzip.GzipFile(fileobj=f). You perform your read operations on g, and to tell how far along you are, you call f.tell() to get the position in the compressed file.

EDIT2: BTW, you can of course also call tell() on the GzipFile instance to see how far along (in bytes read) the uncompressed data you are.
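The two positions side by side, as a self-contained sketch (the filename is hypothetical; note that f.tell() can run slightly ahead of the decoded position because GzipFile reads the compressed stream in chunks):

```python
import gzip

# Hypothetical file; sample data is written first so this runs as-is.
with gzip.open("sample.gz", "wt") as w:
    w.write("hello world\n" * 1000)

f = open("sample.gz", "rb")      # underlying compressed stream
g = gzip.GzipFile(fileobj=f)     # gzip layer on top of it

for line in g:
    pass                         # consume all lines

pos_compressed = f.tell()        # bytes consumed from the .gz file
pos_uncompressed = g.tell()      # bytes of decoded data read so far
g.close()
f.close()
print(pos_compressed, pos_uncompressed)
```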

EDIT: Now I see this is only a partial answer to your problem: you'd also need the total. There, I'm afraid, you are a bit out of luck, especially for files over 4GB, as you've noted. gzip stores the uncompressed size in the last four bytes of the file, so you could seek there, read them, and seek back (GzipFile does not expose this information itself). But since it is only four bytes, the largest value it can hold is 4GB; anything bigger is truncated to the low 32 bits of the true size. In that case you won't know the total until you decompress to the end.
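Reading those last four bytes looks like this (self-contained demo with a made-up filename; the trick is only trustworthy for single-member gzip files smaller than 4GB):

```python
import gzip
import struct

# Write 5000 bytes of known data so we know the expected answer.
with open("size_demo.gz", "wb") as out:
    out.write(gzip.compress(b"x" * 5000))

# A gzip member ends with ISIZE: the uncompressed size modulo 2**32,
# stored little-endian in the last four bytes.
with open("size_demo.gz", "rb") as f:
    f.seek(-4, 2)  # whence=2: seek relative to end of file
    isize, = struct.unpack("<I", f.read(4))

print(isize)  # 5000; the true size modulo 2**32
```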

Anyway, the hints above give you the current position in both the compressed and uncompressed streams, which hopefully lets you at least partly achieve what you set out to do.

How to efficiently read the last line of very big gzipped log file?

If you have no control over the generation of the gzip file, then there is no way to read the last line of the uncompressed data without decoding all of the lines. The time it takes will be O(n), where n is the size of the file. There is no way to make it O(1).

If you do have control over the compression end, then you can create a gzip file that facilitates random access, and you can also keep track of random-access entry points to enable jumping to the end of the file.

How do you (line by line) read multiple .gz files that are inside a zipped folder in Python without creating temporary files?

You can do something like this:

from zipfile import ZipFile
import gzip

with ZipFile("storage.zip") as zf:
    for name in zf.namelist():
        # zf.open() returns a file-like object; gzip.open() accepts it directly,
        # so nothing is extracted to disk.
        with zf.open(name) as f, gzip.open(f, 'rt') as g:
            for line in g:
                print(line, end='')

How do I read a gzip file line by line?

You should be able to simply loop over the gzip reader as you do with regular streams (according to the docs):

require 'zlib'

infile = File.open("file.log.gz", "rb")
gz = Zlib::GzipReader.new(infile)
gz.each_line do |line|
  puts line
end
gz.close

