python: read lines from compressed text files
Have you tried using gzip.GzipFile? Its arguments are similar to those of open.
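A minimal sketch of the idea above. The file name example.gz is a placeholder; gzip.open in 'rt' mode decompresses and decodes to text on the fly, so you can iterate line by line just as with a regular file:

```python
import gzip

# Create a small sample file so the example is self-contained.
with gzip.open("example.gz", "wt", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# gzip.open takes the same mode/encoding arguments as the built-in open;
# iterating the file object yields one decoded line at a time.
lines = []
with gzip.open("example.gz", "rt", encoding="utf-8") as f:
    for line in f:
        lines.append(line.rstrip("\n"))
```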
Reading line by line the contents of an 80GB .gz file without uncompressing it
You can use zcat to stream the uncompressed contents into grep (or whatever filter you want) without incurring any space overhead, e.g.
zcat bigfile.gz | grep PATTERN_I_NEED > much_smaller_sample
Also, if it's just grep you're streaming to, you can use zgrep, e.g.
zgrep PATTERN_I_NEED bigfile.gz > much_smaller_sample
but on some systems zgrep doesn't support 100% of grep's features.
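If you would rather stay in Python, the same streaming idea (no temporary uncompressed file ever touches disk) can be sketched like this; the function name grep_gz and the file name are illustrative, not part of any library:

```python
import gzip

def grep_gz(path, pattern):
    """Yield lines containing pattern from a gzipped text file,
    decompressing lazily so only one line is in memory at a time."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if pattern in line:
                yield line

# Sample data so the sketch is runnable end to end.
with gzip.open("big_test.gz", "wt", encoding="utf-8") as w:
    w.write("foo\nneedle here\nbar\n")

matches = list(grep_gz("big_test.gz", "needle"))
```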
Reading lines from gzipped text file in Python and get number of original compressed bytes read
You should open the underlying file for reading in binary mode: f = open('filename.gz', 'rb'). Then open a gzip file on top of that: g = gzip.GzipFile(fileobj=f). You perform your read operations on g, and to tell how far along you are, you call f.tell() to get the position in the compressed file.
EDIT2: BTW, you can of course also use tell() on the GzipFile instance to see how far along (in bytes read) the uncompressed data you are.
EDIT: Now I see that this is only a partial answer to your problem. You'd also need the total. There, I'm afraid, you are a bit out of luck, especially for files over 4 GB, as you've noted. gzip stores the uncompressed size in the last four bytes of the file, so you could jump there, read them, and jump back (GzipFile does not seem to expose this information itself). But since it's only four bytes, the largest number it can hold is 4 GB; anything bigger is truncated to the lower 32 bits, i.e. the size modulo 2**32. In that case, I'm afraid you won't know the total until you reach the end.
Anyway, the hints above give you the current position, both compressed and uncompressed, which hopefully lets you at least partly achieve what you set out to do.
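The approach above can be sketched as follows; the file name is a placeholder, and note that GzipFile reads the compressed stream through a buffer, so f.tell() can run slightly ahead of what has strictly been decoded:

```python
import gzip
import os
import struct

# Sample data so the sketch is self-contained.
with gzip.open("sample.gz", "wt") as w:
    w.write("hello\nworld\n")

f = open("sample.gz", "rb")
g = gzip.GzipFile(fileobj=f)
line = g.readline()            # reads b"hello\n"
compressed_pos = f.tell()      # position in the .gz file; buffered
                               # read-ahead means this may overshoot
uncompressed_pos = g.tell()    # bytes of uncompressed data read so far
g.close()
f.close()

# The last four bytes (ISIZE) hold the uncompressed size modulo 2**32,
# which is why this trick fails for files over 4 GB.
with open("sample.gz", "rb") as fh:
    fh.seek(-4, os.SEEK_END)
    isize = struct.unpack("<I", fh.read(4))[0]
```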
How to efficiently read the last line of very big gzipped log file?
If you have no control over the generation of the gzip file, then there is no way to read the last line of the uncompressed data without decoding all of the lines. The time it takes will be O(n), where n is the size of the file. There is no way to make it O(1).
If you do have control over the compression end, then you can create a gzip file that facilitates random access, and you can also keep track of random-access entry points to enable jumping straight to the end of the file.
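To make the O(n) case concrete: without random-access entry points, getting the last line means decoding every compressed byte, even though only one line is kept in memory. A minimal sketch, with log.gz as a placeholder name:

```python
import gzip

# Sample log so the sketch is runnable.
with gzip.open("log.gz", "wt") as w:
    w.write("line 1\nline 2\nline 3\n")

# Every compressed byte must be decoded to reach the end,
# so this loop is O(n) in the size of the file.
last = None
with gzip.open("log.gz", "rt") as f:
    for line in f:
        last = line
```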
How do you (line by line) read multiple .gz files that are inside a zipped folder in Python without creating temporary files?
You can do something like this:
from zipfile import ZipFile
import gzip

with ZipFile("storage.zip") as zf:
    for name in zf.namelist():
        with zf.open(name) as f:
            with gzip.open(f, 'rt') as g:
                for line in g:
                    print(line, end='')
Iterating over g directly avoids loading a whole file into memory the way readlines() would, and print(line, end='') avoids doubling the newline each line already carries.
How do I read a gzip file line by line?
You should be able to simply loop over the gzip reader like you do with regular streams (according to the docs):
require 'zlib'

infile = File.open("file.log.gz", "rb")
gz = Zlib::GzipReader.new(infile)
gz.each_line do |line|
  puts line
end
gz.close