python: read lines from compressed text files
Have you tried using gzip.GzipFile? Its arguments are similar to those of open().
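A minimal sketch of reading a gzipped file line by line (the filename is hypothetical; the file is written first so the example is self-contained):

```python
import gzip

# Write a small gzip file first so the read below has something to work with
# ("example.txt.gz" is a made-up name for illustration).
with gzip.open("example.txt.gz", "wt") as f:
    f.write("first line\nsecond line\n")

# gzip.open takes arguments much like open(); mode "rt" yields decoded text lines.
with gzip.open("example.txt.gz", "rt") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines)  # -> ['first line', 'second line']
```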
Reading lines from gzipped text file in Python and get number of original compressed bytes read
You should open the underlying file for reading in binary mode: f = open('filename.gz', 'rb'). Then open a gzip file on top of that: g = gzip.GzipFile(fileobj=f). You perform your read operations on g, and to tell how far along you are, you call f.tell() to get the position in the compressed file.
EDIT2: BTW, of course you can also use tell() on the GzipFile instance to see how far along the uncompressed file you are (bytes read).
EDIT: Now I see this is only a partial answer to your problem; you'd also need the total. There, I'm afraid, you are a bit out of luck, especially for files over 4GB as you've noted. gzip stores the uncompressed size in the last four bytes of the file, so you could jump there, read them, and jump back (GzipFile does not seem to expose this information itself). But since it is only four bytes, the largest size it can record is 4GB; anything bigger is truncated to the lower 32 bits of the value. In that case, I'm afraid you won't know the total until you reach the end.
Anyway, the hints above give you the current position in both the compressed and uncompressed streams, which hopefully lets you at least partly achieve what you set out to do.
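The two-layer setup described above can be sketched like this (the filename is hypothetical, and the file is created first so the example runs standalone):

```python
import gzip

# Create a sample compressed file ("data.txt.gz" is a made-up name).
with gzip.open("data.txt.gz", "wt") as w:
    w.write("hello\nworld\n")

# Open the raw file to track the compressed offset, then layer GzipFile on top.
f = open("data.txt.gz", "rb")
g = gzip.GzipFile(fileobj=f)

first = g.readline()
uncompressed_pos = g.tell()  # bytes of uncompressed data consumed so far
compressed_pos = f.tell()    # position in the compressed file (internal
                             # read-ahead buffering may push this a bit past
                             # the "true" point for the line just read)
g.close()
f.close()
```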
Reading large compressed files
Instead of using for line in vcf.readlines()
, you can do:
line = vcf.readline()
while line:
    # Do stuff
    line = vcf.readline()
This will load only one line into memory at a time.
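The same one-line-at-a-time behavior can also be had by iterating over the file object directly, which is the more idiomatic form (the filename here is hypothetical and the file is created first so the sketch runs standalone):

```python
import gzip

# Build a small sample file ("sample.vcf.gz" is a made-up name).
with gzip.open("sample.vcf.gz", "wt") as w:
    w.write("a\nb\nc\n")

# Iterating over the file object reads one line at a time, just like the
# explicit readline() loop, without loading the whole file into memory.
count = 0
with gzip.open("sample.vcf.gz", "rt") as vcf:
    for line in vcf:
        count += 1
```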
Reading line by line the contents of an 80GB .gz file without uncompressing it
You can use zcat
to stream the uncompressed contents into grep
or whatever filter you want, without incurring space overhead. E.g.
zcat bigfile.gz | grep PATTERN_I_NEED > much_smaller_sample
Also, if it's just grep you're streaming to, you can use zgrep
e.g.
zgrep PATTERN_I_NEED bigfile.gz > much_smaller_sample
but zgrep
doesn't support 100% of the features of grep
on some systems.
Read from a gzip file in python
Try gzipping some data through the gzip library like this...
import gzip
content = "Lots of content here"
f = gzip.open('Onlyfinnaly.log.gz', 'wb')
f.write(content.encode())  # gzip files are binary; encode the string first
f.close()
... then run your code as posted ...
import gzip
f = gzip.open('Onlyfinnaly.log.gz', 'rb')
file_content = f.read()
print(file_content)
This method worked for me as for some reason the gzip library fails to read some files.
How to efficiently read the last line of very big gzipped log file?
If you have no control over the generation of the gzip file, then there is no way to read the last line of the uncompressed data without decoding all of the lines. The time it takes will be O(n), where n is the size of the file. There is no way to make it O(1).
If you do have control over the compression end, then you can create a gzip file that facilitates random access, and you can also keep track of random-access entry points to enable jumping to the end of the file.
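If an O(n) scan is acceptable, the last line can be read in constant memory by streaming through a one-element deque (the filename is hypothetical; the file is written first so the sketch runs standalone):

```python
import gzip
from collections import deque

# Write a small sample log ("big.log.gz" is a made-up name).
with gzip.open("big.log.gz", "wt") as w:
    w.write("line 1\nline 2\nline 3\n")

# A deque with maxlen=1 keeps only the most recent line while streaming,
# so memory stays O(1) even though the scan is still O(n) in file size.
with gzip.open("big.log.gz", "rt") as f:
    last_line = deque(f, maxlen=1)[0]

print(last_line)  # -> line 3
```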
Read a large zipped text file line by line in python
Python file objects provide iterators, which read line by line. file.readlines() reads them all and returns a list, which means it must read everything into memory. The better approach (which should always be preferred over readlines()) is to loop over the object itself, e.g.:
import zipfile
with zipfile.ZipFile(...) as z:
    with z.open(...) as f:
        for line in f:
            print(line)
Note my use of the with
statement - file objects are context managers, and the with statement lets us easily write readable code that ensures files are closed when the block is exited (even upon exceptions). This, again, should always be used when dealing with files.
How do you (line by line) read multiple .gz files that are inside a zipped folder in Python without creating temporary files?
You can do something like this
from zipfile import ZipFile
import gzip

with ZipFile("storage.zip") as zf:
    files = zf.namelist()
    for file in files:
        with zf.open(file) as f:
            with gzip.open(f, 'rt') as g:
                for line in g:
                    print(line)
Only read specific line numbers from a large file in Python?
Here are some options:
- Go over the file at least once and keep track of the file offsets of the lines you are interested in. This is a good approach if you might seek to these lines multiple times and the file won't change.
- Consider changing the data format. For example csv instead of json (see comments).
- If you have no other alternative, use the traditional:
def get_lines(..., linenums: list):
    with open(...) as f:
        for lno, ln in enumerate(f):
            if lno in linenums:
                yield ln
On a 4GB file this took ~6s for linenums = [n // 4, n // 2, n - 1], where n = lines_in_file.
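The first option (recording offsets on one pass, then seeking) can be sketched as follows; the filename, contents, and the line_at helper are all hypothetical, invented for illustration:

```python
# Sketch of option 1: index byte offsets in one pass, then seek directly.
# Write a small sample file so the example runs standalone.
with open("data.txt", "w") as w:
    w.write("alpha\nbeta\ngamma\ndelta\n")

# One pass over the file, recording the offset at the start of each line.
offsets = []
with open("data.txt", "rb") as f:
    while True:
        offsets.append(f.tell())
        if not f.readline():
            break

def line_at(path, lineno):
    # After the one-time pass above, each lookup is a single seek + readline.
    with open(path, "rb") as f:
        f.seek(offsets[lineno])
        return f.readline().decode()

third = line_at("data.txt", 2)
print(third)  # -> gamma
```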