Does Zgrep Unzip a File Before Searching

Does zgrep unzip a file before searching?

This source of zgrep uncompresses the file with zcat and pipes the result to grep.

So no, it does not write a temporary file, but yes, it decompresses before searching: the data is streamed through a pipe rather than fully unpacked to disk.
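
The same idea can be sketched in Python. This is only an illustration of streaming decompression, not how zgrep itself is implemented, and `gz_contains` is a hypothetical helper name. The gzip data is decompressed incrementally as lines are read, and the search stops at the first hit, so nothing is written to disk and the rest of the file is never decompressed.

```python
import gzip


def gz_contains(path, needle):
    """Return True as soon as a line of the gzip file at `path` contains `needle`."""
    # gzip.open decompresses lazily while iterating; no temporary file is created.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if needle in line:
                return True  # stop early, similar in spirit to `grep -q`
    return False
```
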

How to search for a string in .gz file?

The statement

    if 'Alas!':

merely checks whether the string value 'Alas!' is "truthy" (it is, like every non-empty string); you want to check whether the variable line contains this substring:

    if 'Alas!' in line:

Another problem is that you are opening the output file multiple times, overwriting any results from previous input files. You want to open it only once, at the beginning (or open for appending; but repeatedly opening and closing the same file is unnecessary and inefficient).

A better design altogether might be to simply print to standard output, and let the user redirect the output to a file if they like. (Also, probably accept the input files as command-line arguments, rather than hardcoding a fugly complex relative path.)

A third problem is that the input line already contains a newline, but print() will add another. Either strip the newline before printing, or tell print not to supply another (or switch to write which doesn't add one).

    import gzip
    import glob

    with open('file1.txt', 'w') as o:
        for file in glob.glob('myfiles/all*/input.gz'):
            with gzip.open(file, 'rt') as f:
                for line in f:
                    if 'Alas!' in line:
                        print(line, file=o, end='')

Demo: https://ideone.com/rTXBSS
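
The alternative design suggested above (print to standard output, take the input files from the caller) might look roughly like this. The function name `search_gz_files` and its parameters are my own illustration, not part of the original answer.

```python
import gzip
import sys


def search_gz_files(paths, needle, out=None):
    """Write every line containing `needle` from each gzip file in `paths` to `out`."""
    if out is None:
        out = sys.stdout  # let the user redirect the output if they want a file
    count = 0
    for path in paths:
        with gzip.open(path, "rt") as f:
            for line in f:
                if needle in line:
                    out.write(line)  # write() does not add a second newline
                    count += 1
    return count
```

A script could call `search_gz_files(sys.argv[2:], sys.argv[1])` and leave it to the shell to redirect the result, for example with `> file1.txt`.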

grep on zipped files without zgrep

This seems to be a bug in zgrep. Try xzgrep.

$ xzgrep -q hello *; echo $?
0
$ zgrep -q hello *;echo $?
1
$ grep -q hello *;echo $?
0

You can also use zcat and grep together, if files are always gzipped.

$ zcat * | grep -q hello; echo $?
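
If the files are a mix of gzipped and plain text, one option is to check each file for the gzip magic bytes (0x1f 0x8b) and pick the opener accordingly. The sketch below assumes UTF-8 text; the helper names are hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Open `path` as text, decompressing on the fly if it looks like gzip."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    return opener(path, "rt", encoding="utf-8", errors="replace")


def any_file_contains(paths, needle):
    """Return True if any file (gzipped or not) has a line containing `needle`."""
    for path in paths:
        with open_maybe_gzip(path) as f:
            if any(needle in line for line in f):
                return True
    return False
```
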

How to use grep command on zip files

zipgrep will work with zip files only.
If you want to grep all files, not only zipped ones, you could use ugrep, which can do that with the -z flag.
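
For comparison, a minimal Python sketch that searches every member of a zip archive with the standard zipfile module. It assumes the members are UTF-8 text, and `zip_grep` is a hypothetical name, not a real tool.

```python
import io
import zipfile


def zip_grep(zip_path, needle):
    """Yield (member_name, line) for each line in the archive containing `needle`."""
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries have no content to search
            with zf.open(info) as raw:
                # Decode the member as text while it is being decompressed.
                text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
                for line in text:
                    if needle in line:
                        yield info.filename, line.rstrip("\n")
```
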

how to search for a particular string from a .gz file?

gunzip -c mygzfile.gz | grep "string to be searched"

But this only works if the .gz file contains a text file, which is true in your case.

How to grep on the content of a zipped non-standard textfile

In this post on grepping non-standard text files, I found the answer:

unzip -c zipfile.zip error.log | grep -a "A.c.c.e.s.s"

Now I have something to start from.

Thanks, everyone, for your cooperation.
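
The dots in "A.c.c.e.s.s" match an extra byte between the characters, which suggests the log is UTF-16 encoded. That is my assumption; the post does not say. Under that assumption, a Python sketch can decode the member properly and search for the plain string. The function name and the default encoding are illustrative only.

```python
import zipfile


def grep_zip_member(zip_path, member, needle, encoding="utf-16"):
    """Return the lines of one archive member that contain `needle`."""
    with zipfile.ZipFile(zip_path) as zf:
        # Assumption: the member is UTF-16 text (with a byte-order mark).
        text = zf.read(member).decode(encoding, errors="replace")
    return [line for line in text.splitlines() if needle in line]
```
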

How to zgrep the last line of a gz file without tail

The easiest solution would be to alter your log rotation to create smaller files.

The second easiest solution would be to use a compression tool that supports random access.

Projects like dictzip, BGZF, and csio each add sync flush points at various intervals within gzip-compressed data, which a program aware of that extra information can seek to. Although sync flush points exist in the standard, vanilla gzip does not add such markers, either by default or by option.

Files compressed by these random-access-friendly utilities are slightly larger (by perhaps 2-20%) due to the markers themselves, but fully support decompression with gzip or another utility that is unaware of these markers.

You can learn more at this question about random access in various compression formats.

There's also a "Blasted Bioinformatics" blog by Peter Cock with several posts on this topic, including:

  • BGZF - Blocked, Bigger & Better GZIP! – gzip with random access (like dictzip)
  • Random access to BZIP2? – An investigation (result: can't be done, though I do it below)
  • Random access to blocked XZ format (BXZF) – xz with improved random access support

Experiments with xz

xz (an LZMA compression format) actually has random access support on a per-block level, but you will only get a single block with the defaults.

File creation

xz can concatenate multiple archives together, in which case each archive would have its own block. The GNU split can do this easily:

split -b 50M --filter 'xz -c' big.log > big.log.sp.xz

This tells split to break big.log into 50MB chunks (before compression) and run each one through xz -c, which outputs the compressed chunk to standard output. We then collect that standard output into a single file named big.log.sp.xz.

To do this without GNU split, you'd need a loop:

split -b 50M big.log big.log-part
for p in big.log-part*; do xz -c "$p"; done > big.log.sp.xz
rm big.log-part*
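
A rough Python equivalent of the chunk-and-concatenate step, using the standard lzma module. `compress_in_chunks` is a hypothetical helper, and the 50 MiB default only mirrors the split size used above.

```python
import lzma


def compress_in_chunks(src_path, dst_path, chunk_size=50 * 1024 * 1024, preset=6):
    """Compress each `chunk_size` slice of src as its own .xz stream, appended to dst."""
    streams = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # Each call produces one complete, independent .xz stream.
            dst.write(lzma.compress(chunk, preset=preset))
            streams += 1
    return streams
```

The result is still an ordinary .xz file: decompressing the whole thing returns the original data, because concatenated streams are decoded one after another.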

Parsing

You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ). Fetch that block using tail -c and pipe that through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1:

SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1
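
Instead of assuming a fixed overhead, the size of the last stream can be computed from the stream footer and index at the end of the file. The sketch below follows my reading of the .xz file format (12-byte header and footer, index size stored in the footer, block sizes listed in the index). It assumes there is no stream padding after the last stream, and the function names are hypothetical.

```python
import lzma
import os
import struct


def _read_varint(buf, pos):
    """Decode one .xz multibyte integer at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def last_stream_size(path):
    """Return the byte length of the last .xz stream in a concatenated file."""
    with open(path, "rb") as f:
        f.seek(-12, os.SEEK_END)
        footer = f.read(12)
        if footer[10:12] != b"YZ":
            raise ValueError("no .xz stream footer at end of file")
        # Footer bytes 4..7 hold the index size as (size / 4) - 1, little-endian.
        index_size = (struct.unpack("<I", footer[4:8])[0] + 1) * 4
        f.seek(-(12 + index_size), os.SEEK_END)
        index = f.read(index_size)
    if index[0] != 0:
        raise ValueError("index indicator byte not found")
    count, pos = _read_varint(index, 1)
    blocks = 0
    for _ in range(count):
        unpadded, pos = _read_varint(index, pos)
        _uncompressed, pos = _read_varint(index, pos)
        blocks += (unpadded + 3) // 4 * 4  # blocks are padded to 4 bytes
    return 12 + blocks + index_size + 12  # header + blocks + index + footer


def last_line_xz(path):
    """Decompress only the last stream and return its last line as bytes."""
    size = last_stream_size(path)
    with open(path, "rb") as f:
        f.seek(-size, os.SEEK_END)
        lines = lzma.decompress(f.read()).splitlines()
    return lines[-1] if lines else b""
```
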

Side note

Version 5.1.1 introduced support for the --block-size flag:

xz --block-size=50M big.log

However, I have not been able to extract a specific block since it doesn't include full headers between blocks. I suspect this is nontrivial to do from the command line.

Experiments with gzip

gzip also supports concatenation. I (briefly) tried mimicking this process for gzip without any luck. gzip --verbose --list doesn't give enough information and it appears the headers are too variable to find.

This would require adding sync flush points, and since their size depends on the size of the last buffer in the previous compression, that's too hard to do on the command line (use dictzip or another of the previously discussed tools).
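
One way to sidestep the header-finding problem is to record where each gzip member starts while writing the file. This is a different technique from anything I tried above: plain gzip produces no such index, and the offset list in this sketch is a hypothetical side index that you would have to store yourself.

```python
import gzip


def write_members(chunks, dst_path):
    """Write each bytes chunk as its own gzip member; return each member's start offset."""
    offsets = []
    with open(dst_path, "wb") as dst:
        for chunk in chunks:
            offsets.append(dst.tell())
            dst.write(gzip.compress(chunk))
    return offsets


def read_from_member(path, offset):
    """Decompress everything from the member starting at `offset` to the end of the file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return gzip.decompress(f.read())
```

The output remains a valid concatenated gzip file, so tools unaware of the offsets can still decompress it in full.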

I ran apt-get install dictzip and played with dictzip, but only briefly. Run without any options, it did not work for me: it created a (massive!) .dz archive that neither dictunzip nor gunzip could understand.

Experiments with bzip2

bzip2 has headers we can find. This is still a bit messy, but it works.

Creation

This is just like the xz procedure above:

split -b 50M --filter 'bzip2 -c' big.log > big.log.sp.bz2

I should note that this is considerably slower than xz (48 min for bzip2 vs 17 min for xz vs 1 min for xz -0) as well as considerably larger (97M for bzip2 vs 25M for xz -0 vs 15M for xz), at least for my test log file.

Parsing

This is a little harder because we don't have the nice index. We have to guess at where to go, and we have to err on the side of scanning too much, but with a massive file, we'd still save I/O.

My guess for this test was 50000000 (out of the original 52428800, a pessimistic guess that isn't pessimistic enough for e.g. an H.264 movie.)

GUESS=50000000
LAST=$(tail -c$GUESS big.log.sp.bz2 \
|grep -abo 'BZh91AY&SY' |awk -F: 'END { print '$GUESS'-$1 }')
tail -c $LAST big.log.sp.bz2 |bunzip2 -c |tail -n1

This takes just the last 50 million bytes, finds the binary offset of the last BZIP2 header, subtracts that from the guess size, and pulls that many bytes off of the end of the file. Just that part is decompressed and thrown into tail.
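
The same guess-and-scan approach can be sketched in Python with the standard bz2 module. It assumes the streams were written at the default level 9 (hence the "BZh9" prefix) and that the guess is large enough to cover the last stream. `last_line_bz2` is a hypothetical name.

```python
import bz2
import os

BZ2_MAGIC = b"BZh91AY&SY"  # level-9 stream header followed by the first block magic


def last_line_bz2(path, guess=50_000_000):
    """Return the last line (bytes) by decompressing only the last bzip2 stream."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(0, size - guess))
        tail = f.read()
    pos = tail.rfind(BZ2_MAGIC)
    while pos != -1:
        try:
            data = bz2.decompress(tail[pos:])
        except (OSError, ValueError):
            # The magic bytes appeared inside compressed data: try an earlier match.
            pos = tail.rfind(BZ2_MAGIC, 0, pos)
            continue
        lines = data.splitlines()
        return lines[-1] if lines else b""
    raise ValueError("no bzip2 stream header found in the last %d bytes" % guess)
```
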

Because the shell pipeline above has to query the compressed file twice and needs an extra scan (the grep call seeking the header, which examines the whole guessed space), it is a suboptimal solution. See also the section below on how slow bzip2 really is.

Perspective

xz is easily the best bet here: its fastest setting (xz -0) compresses and decompresses quickly and still creates a smaller file than gzip or bzip2 on the log file I was testing with. Other tests (as well as various sources online) suggest that xz -0 is preferable to bzip2 in all scenarios.


             ---- No Random Access ----    ------ Random Access -------
FORMAT        SIZE   RATIO  WRITE  READ     SIZE   RATIO  WRITE    SEEK
----------   --------------------------    ----------------------------
(original)   7211M  1.0000      -  0:06    7211M  1.0000      -    0:00
bzip2          96M  0.0133  48:31  3:15      97M  0.0134  47:39    0:00
gzip           79M  0.0109   0:59  0:22
dictzip                                     605M  0.0839   1:36  (fail)
xz -0          25M  0.0034   1:14  0:12      25M  0.0035   1:08    0:00
xz             14M  0.0019  16:32  0:11      14M  0.0020  16:44    0:00

Timing tests were not comprehensive: I did not average anything, and disk caching was in use. Still, they look correct; there is a very small amount of overhead from split plus launching 145 compression instances rather than just one (this may even be a net gain if it allows an otherwise non-multithreaded utility to consume multiple threads).
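
To repeat a comparison like this on your own data, a small Python sketch using the standard library codecs could be a starting point. In-process library timings will not match the command-line tools exactly, and the labels simply mirror the table rows; `compare_compressors` is a hypothetical helper.

```python
import bz2
import gzip
import lzma
import time


def compare_compressors(data):
    """Return {name: (compressed_size, ratio, seconds)} for a non-empty bytes sample."""
    codecs = {
        "bzip2": lambda d: bz2.compress(d),
        "gzip": lambda d: gzip.compress(d),
        "xz -0": lambda d: lzma.compress(d, preset=0),
        "xz": lambda d: lzma.compress(d),
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        size = len(compress(data))
        elapsed = time.perf_counter() - start
        results[name] = (size, size / len(data), elapsed)
    return results
```
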


