Reading Tar File Contents Without Untarring It, in Python Script

You can use getmembers():

>>> import tarfile
>>> tar = tarfile.open("test.tar")
>>> tar.getmembers()

After that, you can use extractfile() to read a member as a file object. For example:

import os
import tarfile

os.chdir("/tmp/foo")
with tarfile.open("test.tar") as tar:
    for member in tar.getmembers():
        f = tar.extractfile(member)
        if f is None:  # directories and similar entries have no content
            continue
        content = f.read()  # bytes in Python 3
        print("%s has %d newlines" % (member.name, content.count(b"\n")))
        print("%s has %d spaces" % (member.name, content.count(b" ")))
        print("%s has %d bytes" % (member.name, len(content)))

With the file object f in the above example, you can use read(), readlines(), and the other usual read methods.
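As a minimal sketch of working with that file object as text, the following wraps it in io.TextIOWrapper and iterates over decoded lines. The helper names make_demo_tar and read_member_lines, the in-memory archive, and the member name hello.txt are made up for illustration, and UTF-8 is an assumption about the content.

```python
import io
import tarfile


def make_demo_tar():
    """Build a small in-memory tar archive (hypothetical demo data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = "first line\nsecond line\n".encode("utf-8")
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def read_member_lines(fileobj, member_name, encoding="utf-8"):
    """Return the lines of one archive member as str, without extracting to disk."""
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        f = tar.extractfile(member_name)
        if f is None:
            raise ValueError("%s is not a regular file" % member_name)
        # extractfile() yields bytes; TextIOWrapper decodes them on the fly.
        with io.TextIOWrapper(f, encoding=encoding) as text:
            return [line.rstrip("\n") for line in text]


lines = read_member_lines(make_demo_tar(), "hello.txt")
print(lines)
```

For a real archive on disk, tarfile.open("test.tar") would take the place of the in-memory buffer.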

Read *.tar.gz file in python without extracting

When you call tarfile.open like this:

tarfile.open('arhivename.tar.gz', encoding='utf-8')

The encoding parameter controls the encoding of the filenames, not the encoding of the file contents. It would not make sense for it to control the contents, because different files inside the tar file can be encoded differently. As far as tarfile is concerned, the members just contain binary data.

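You can decode this data by wrapping each member's file object with the UTF-8 stream reader from the codecs module:

import codecs
import tarfile

utf8reader = codecs.getreader('utf-8')
with tarfile.open('arhivename.tar.gz') as tar:
    for member in tar.getmembers():
        if member.isfile():
            fp = utf8reader(tar.extractfile(member))  # fp.read() returns str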
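To illustrate why the contents have to be decoded per member, here is a small sketch with two members that hold the same text in different encodings. The member names, the members mapping, and the in-memory archive are invented for the demonstration. In practice the caller has to know or detect each file's encoding.

```python
import io
import tarfile

# Hypothetical demo data: the same text stored in two different encodings.
members = {
    "utf8.txt": ("caf\u00e9", "utf-8"),
    "latin1.txt": ("caf\u00e9", "latin-1"),
}

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, (text, enc) in members.items():
        data = text.encode(enc)
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

decoded = {}
raw_sizes = {}
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar.getmembers():
        raw = tar.extractfile(member).read()  # always bytes
        raw_sizes[member.name] = len(raw)
        # tarfile cannot know this; the encoding comes from our own mapping.
        decoded[member.name] = raw.decode(members[member.name][1])

print(raw_sizes)
print(decoded)
```

The two members differ in byte length even though they decode to the same string, which is why a single archive-wide content encoding would not work.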

Read .gz files inside .tar files without extracting

You need to use tar.extractfile(member) instead of tarfile.extractfile(member). tarfile is the module, and doesn't know about the tar file you opened. tar is the TarFile object, which references the .tar file you opened.

To do it right, use next() instead of getmembers() or getnames(), so that you don't have to read the entire tar file twice:

import gzip
import sys
import tarfile

with tarfile.open(sys.argv[1]) as tar:
    while ent := tar.next():
        if ent.name.endswith(".gz"):
            print(gzip.GzipFile(fileobj=tar.extractfile(ent)).read())
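A TarFile object can also be iterated directly, which visits members one at a time much like repeated next() calls and does not need the walrus operator. The sketch below assumes a made-up in-memory archive. The helper names build_archive and read_gz_members are hypothetical.

```python
import gzip
import io
import tarfile


def build_archive():
    """Create an in-memory tar with one gzip member and one plain member (demo data)."""
    payloads = {
        "notes.txt.gz": gzip.compress(b"compressed payload"),
        "readme.txt": b"plain payload",
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in payloads.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def read_gz_members(fileobj):
    """Map each *.gz member name to its decompressed bytes, in one pass."""
    result = {}
    with tarfile.open(fileobj=fileobj) as tar:
        for ent in tar:  # iterating a TarFile yields TarInfo objects one by one
            if ent.isfile() and ent.name.endswith(".gz"):
                with gzip.GzipFile(fileobj=tar.extractfile(ent)) as gz:
                    result[ent.name] = gz.read()
    return result


found = read_gz_members(build_archive())
print(found)
```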

How do I list contents of a tar file without extracting it in python?

You can use TarFile.getnames() like this:

#!/usr/bin/env python3

import tarfile
tarf = tarfile.open('foo.tar.gz', 'r:gz')
print(tarf.getnames())

http://docs.python.org/3.3/library/tarfile.html#tarfile.TarFile.getnames

And if you want mtime values you can use getmembers().

print([(member.name, member.mtime) for member in tarf.getmembers()])
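The mtime values are Unix timestamps, so they usually need converting before display. Here is a small sketch that lists name, size, and a formatted UTC time. The helper name list_members and the demo archive are assumptions made for illustration.

```python
import io
import tarfile
from datetime import datetime, timezone


def list_members(fileobj):
    """Return (name, size, ISO-formatted UTC mtime) for every member."""
    rows = []
    with tarfile.open(fileobj=fileobj) as tar:
        for member in tar.getmembers():
            stamp = datetime.fromtimestamp(member.mtime, tz=timezone.utc)
            rows.append((member.name, member.size, stamp.isoformat()))
    return rows


# Hypothetical demo archive with one member and a fixed mtime.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="a.txt")
    info.size = 3
    info.mtime = 0  # the Unix epoch
    tar.addfile(info, io.BytesIO(b"abc"))
buf.seek(0)

rows = list_members(buf)
print(rows)
```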

Python read file within tar archive

Try this:

import tarfile
tar = tarfile.open("docs.tar.gz")
f = tar.extractfile("docs.json")

# do something like f.read()
# since your file is json, you'll probably want to do this:

import json
json.loads(f.read())
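As a variation, json.load() can read from the member's file object directly. The sketch below builds a throwaway in-memory archive so that it runs on its own. The payload and its keys are invented, and only the docs.json member name comes from the snippet above.

```python
import io
import json
import tarfile

# Hypothetical demo archive: a gzip-compressed tar holding one JSON document.
payload = json.dumps({"title": "docs", "pages": 3}).encode("utf-8")
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="docs.json")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    f = tar.extractfile("docs.json")
    # json.load() reads from the file object; bytes input works on Python 3.6+.
    doc = json.load(f)

print(doc)
```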

Read .tar.gz file in Python

The docs tell us that extractfile() returns None if the member is not a regular file or link.

One possible solution is to skip over the None results:

import tarfile

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if f is not None:
        content = f.read()
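To see the None case in action, this sketch builds an in-memory archive containing a directory entry and a regular file, then records which members were skipped. The names docs and docs/a.txt are made up for the demonstration.

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    # A directory entry: it has a name but no content.
    folder = tarfile.TarInfo(name="docs")
    folder.type = tarfile.DIRTYPE
    tar.addfile(folder)
    # A regular file inside that directory.
    data = b"hello"
    info = tarfile.TarInfo(name="docs/a.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

contents = {}
skipped = []
with tarfile.open(fileobj=buf) as tar:
    for member in tar.getmembers():
        f = tar.extractfile(member)
        if f is None:  # the directory entry ends up here
            skipped.append(member.name)
            continue
        contents[member.name] = f.read()

print(skipped, contents)
```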

Reading a file from a concatenated (tar) file directly without untarring the tar file

The tarfile module gives you access to tarballs. It won't be random access, but you can read out any files you need and put them in a temporary directory, or just store them in strings.
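A rough sketch of both options: read only the wanted members into a dict of bytes, then optionally write them into a temporary directory. The helper name load_members, the member names, and the in-memory archive are hypothetical.

```python
import io
import os
import tarfile
import tempfile


def load_members(fileobj, wanted):
    """Read the named members into a dict of bytes without unpacking the archive."""
    out = {}
    with tarfile.open(fileobj=fileobj) as tar:
        for member in tar:
            if member.isfile() and member.name in wanted:
                out[member.name] = tar.extractfile(member).read()
    return out


# Hypothetical demo archive with three small members.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name in ("a.txt", "b.txt", "c.txt"):
        data = name.encode("ascii")
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

picked = load_members(buf, {"a.txt", "c.txt"})

# Optionally park the bytes in a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    for name, data in picked.items():
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(data)
    on_disk = sorted(os.listdir(tmp))

print(picked, on_disk)
```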

Extracting compressed gz file from tar archive in python

You can use gzip.decompress:

import gzip
import tarfile

tar = tarfile.open("arXiv_src_9107_001a.tar")
n = 0
for member in tar.getmembers():
    # Skip the directory entry at the top of the archive
    if n == 0:
        n = 1
        continue
    f = tar.extractfile(member)
    print(member)
    content = f.read()
    expanded = gzip.decompress(content)
    # do whatever with expanded here
tar.close()
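Skipping the first member by position works only if the directory entry always comes first. A possible alternative, sketched here with invented member names and a hypothetical expand_members helper, is to skip anything extractfile() returns None for and to decompress only data that starts with the gzip magic bytes.

```python
import gzip
import io
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def expand_members(fileobj):
    """Return {name: bytes}, gunzipping members that start with the gzip magic."""
    expanded = {}
    with tarfile.open(fileobj=fileobj) as tar:
        for member in tar.getmembers():
            f = tar.extractfile(member)
            if f is None:  # directory or other non-file entry
                continue
            content = f.read()
            if content.startswith(GZIP_MAGIC):
                content = gzip.decompress(content)
            expanded[member.name] = content
    return expanded


# Hypothetical demo archive: a directory entry, a gzipped member, a plain member.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    folder = tarfile.TarInfo(name="9107")
    folder.type = tarfile.DIRTYPE
    tar.addfile(folder)
    for name, data in (
        ("9107/paper.gz", gzip.compress(b"some TeX source")),
        ("9107/notes.txt", b"plain text"),
    ):
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

result = expand_members(buf)
print(result)
```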

