how to search for a particular string from a .gz file?
gunzip -c mygzfile.gz | grep "string to be searched"
This only works if the .gz file contains a text file, which is true in your case.
How to search for a string in .gz file?
The statement
if 'Alas!':
merely checks whether the string value 'Alas!' is "truthy" (every non-empty string is); you want to check whether the variable line contains this substring:
if 'Alas!' in line:
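A quick way to see the difference is to run both tests against a line that does not contain the substring. The sample line below is made up for illustration.

```python
line = "To be, or not to be\n"  # hypothetical sample line without 'Alas!'

# A non-empty string literal is always truthy, whatever `line` holds.
literal_is_truthy = bool('Alas!')

# The membership test actually inspects `line`.
found = 'Alas!' in line

print(literal_is_truthy, found)
```

The first value is True for any input, so the original condition lets every line through; only the second depends on the line's content.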
Another problem is that you are opening the output file multiple times, overwriting any results from previous input files. You want to open it only once, at the beginning (or open for appending; but repeatedly opening and closing the same file is unnecessary and inefficient).
A better design altogether might be to simply print to standard output, and let the user redirect the output to a file if they like. (Also, probably accept the input files as command-line arguments, rather than hardcoding a fugly complex relative path.)
A third problem is that the input line already contains a newline, but print() will add another. Either strip the newline before printing, or tell print not to supply another (or switch to write, which doesn't add one).
import gzip
import glob
with open('file1.txt', 'w') as o:
    for file in glob.glob('myfiles/all*/input.gz'):
        with gzip.open(file, 'rt') as f:
            for line in f:
                if 'Alas!' in line:
                    print(line, file=o, end='')
Demo: https://ideone.com/rTXBSS
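The alternative design mentioned above (print to standard output, take the input files as command-line arguments) could look roughly like this. It is only a sketch: the function name matching_lines and the script name in the usage note are made up, and the search string is still hardcoded.

```python
import gzip
import sys


def matching_lines(paths, needle):
    """Yield every line containing `needle` from the gzipped text files in `paths`."""
    for path in paths:
        with gzip.open(path, 'rt') as f:
            for line in f:
                if needle in line:
                    yield line


def main(argv):
    # Lines keep their trailing newline, so writelines() adds nothing extra.
    sys.stdout.writelines(matching_lines(argv, 'Alas!'))


if __name__ == '__main__':
    main(sys.argv[1:])
```

Saved as, say, findalas.py, it would be run as python findalas.py myfiles/all*/input.gz > file1.txt, leaving the choice of output file to the shell.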
unix commands to search a string in .gz file
You can use zgrep, if installed:
zgrep -e search_pattern file.gz
or you can use zcat with any regular filter:
zcat file.gz | grep search_pattern
find string inside a gzipped file in a folder
zgrep looks inside gzipped files. Where the installed version supports them, it has a -R recursive option and a -H show-the-filename option:
zgrep -R --include=*.gz -H "pattern match" .
OS-specific commands, since not all arguments work everywhere:
Mac 10.5+: zgrep -R --include=\*.gz -H "pattern match" .
Ubuntu 16+: zgrep -i -H "pattern match" *.gz
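If the local zgrep lacks the recursive option, a portable fallback is a short Python walk over the directory tree. This is a sketch with a made-up function name (search_tree); it does plain substring matching rather than regular expressions, and assumes every *.gz file holds text.

```python
import gzip
from pathlib import Path


def search_tree(root, needle):
    """Yield (path, line) for each line containing `needle` in any *.gz under `root`."""
    for path in sorted(Path(root).rglob('*.gz')):
        # errors='replace' keeps going if a file holds undecodable bytes.
        with gzip.open(path, 'rt', errors='replace') as f:
            for line in f:
                if needle in line:
                    yield path, line.rstrip('\n')
```

Looping over search_tree('.', 'pattern match') and printing f'{path}:{line}' for each pair gives output shaped like zgrep -H.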
Recursive grep for gz files search string from an output string
You can pipe the find results through a second grep:
find . -name "*.gz" -exec zgrep -H "PATTERN1" {} \; | grep "PATTERN2"
How to extract a specific text from gz file?
Another approach, using zgrep and a positive lookbehind:
$ zgrep -oP "(?<=^[ACTGN]{4})[ACTGN]{6}" foo.gz
TNACGG
CNACCT
Explained:
zgrep : man zgrep: search possibly compressed files for a regular expression
-o : Print only the matched (non-empty) parts of a matching line
-P : Interpret the pattern as a Perl-compatible regular expression (PCRE).
(?<=^[ACTGN]{4}) : positive lookbehind
[ACTGN]{6} : match 6 of the named characters that are preceded by the above
foo.gz : my test file
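The same pattern also works in Python's re module, because the lookbehind has a fixed width. A minimal sketch follows; the function name extract is made up.

```python
import gzip
import re

# Same pattern as the zgrep call: six bases preceded by four bases at line start.
PATTERN = re.compile(r'(?<=^[ACTGN]{4})[ACTGN]{6}')


def extract(path):
    """Return the matched part of every matching line in the gzipped file at `path`."""
    matches = []
    with gzip.open(path, 'rt') as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                matches.append(m.group(0))
    return matches
```

Because of the ^ anchor there is at most one match per line, so search() is enough and findall() is not needed.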
grep several strings from gz file
Use zgrep to search compressed files. There are also other commands like bzgrep (for bzip2 files), xzgrep, etc. for other compressed formats.
zgrep -f match_strings.txt file.gz
-f is the flag for reading the patterns from a specified file.
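For comparison, a rough Python counterpart reads the pattern file once and tests each line against all entries. Note the assumptions: the function names are made up, the patterns are treated as fixed strings (closer to zgrep -F -f than to plain -f), and blank lines in the pattern file are skipped.

```python
import gzip


def load_patterns(pattern_file):
    """Read one search string per line, skipping blank lines."""
    with open(pattern_file) as f:
        return [p for p in (line.rstrip('\n') for line in f) if p]


def grep_any(gz_path, patterns):
    """Yield lines of the gzipped file that contain at least one of `patterns`."""
    with gzip.open(gz_path, 'rt') as f:
        for line in f:
            if any(p in line for p in patterns):
                yield line
```

A call such as grep_any('file.gz', load_patterns('match_strings.txt')) yields the matching lines with their newlines intact.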
quickest way to select/copy lines containing string from huge txt.gz file
Untested, but likely pretty close to this with GNU Parallel.
First make output directory so as not to overwrite any valuable data:
mkdir -p output
Now declare a function that does one file and export it to subprocesses so jobs started by GNU Parallel can find it:
doit(){
    echo "Processing $1"
    gzcat "$1" | awk '
        /^[ST]\|/ || /^#D=/ || /^##/ {next}   # ignore lines starting S|, T| etc
        /^H\|/ {printf ","}                    # prefix "H|" with ","
        /^Q\|/ {printf ",,"}                   # prefix "Q|" with ",,"
        1                                      # print every remaining line
    ' | gzip > output/"$1"
}
export -f doit
Now process all txt.gz files in parallel and show a progress bar too:
parallel --bar doit ::: *txt.gz
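If GNU Parallel is not available, roughly the same pipeline can be sketched in Python with a process pool. This is equally untested against the real data. The function names are made up, and the prefixing follows the intent stated in the awk comments ("," before H| lines, ",," before Q| lines).

```python
import glob
import gzip
import os
from concurrent.futures import ProcessPoolExecutor


def transform(line):
    """Apply the awk rules to one line; return None for lines that are dropped."""
    if line.startswith(('S|', 'T|', '#D=', '##')):
        return None
    if line.startswith('H|'):
        return ',' + line
    if line.startswith('Q|'):
        return ',,' + line
    return line


def process_file(name, out_dir='output'):
    """Filter one gzipped text file into out_dir under the same file name."""
    os.makedirs(out_dir, exist_ok=True)
    target = os.path.join(out_dir, os.path.basename(name))
    with gzip.open(name, 'rt') as src, gzip.open(target, 'wt') as dst:
        for line in src:
            result = transform(line)
            if result is not None:
                dst.write(result)
    return target


if __name__ == '__main__':
    files = glob.glob('*txt.gz')
    # One worker process per CPU by default; each handles whole files.
    with ProcessPoolExecutor() as pool:
        for done in pool.map(process_file, files):
            print('Processed', done)
```

Parallelising per file rather than per line keeps each gzip stream in a single process, which is the same split GNU Parallel makes above.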
Unix script to search within a compressed .gz file
The essence of how to accomplish this is to get the names of the files within the tarball to search over, and extract their content to be searched, while not extracting anything else. Because we don't want to write to the file system, we can use the -O flag to instead extract to standard output.
tar -tzf file.tar.gz | grep '\.txt' | xargs tar -Oxzf file.tar.gz | grep -B 3 "string-or-regex"
will concatenate all of the files in the .tar.gz with names ending in ".txt", and grep them for the given string, also outputting the 3 previous lines. It won't tell you which file in the tarball any match came from, and the "three previous lines" may in fact come from the previous file.
You can instead do:
for file in $(tar -tzf file.tar.gz | grep '\.txt'); do
    tar -Oxzf file.tar.gz "$file" | grep -B 3 --label="$file" -H "string-or-regex"
done
which will respect file boundaries, and report the file names, but be much less efficient.
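Where Python is available, its tarfile module can read each member in a single pass over the archive without writing anything to disk, which avoids re-reading the tarball once per file. The following is a sketch only: the function name grep_tarball is made up, it does substring matching, and it leaves out the three lines of leading context.

```python
import io
import tarfile


def grep_tarball(tar_path, needle, suffix='.txt'):
    """Yield (member name, line) for matches in members whose name ends with `suffix`."""
    with tarfile.open(tar_path, 'r:gz') as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(suffix)):
                continue
            # extractfile() returns a binary file object; nothing is written to disk.
            with tar.extractfile(member) as raw:
                for line in io.TextIOWrapper(raw, errors='replace'):
                    if needle in line:
                        yield member.name, line.rstrip('\n')
```

Each result carries the member name, so file boundaries are respected and matches can be reported per file.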
(-z tells tar it is gzip compressed. -t lists the contents. -x extracts. -O redirects to standard output rather than the file system. Older tars may not have the -O or -z flag, and will want the flags without -: e.g. tar tzf file.tar.gz)
Okay, so you have an unusable grep. We can fix that with awk!
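#!/usr/bin/awk -f
# context = lines printed per match, including the matching line itself
# (grep -B 3 would correspond to context=4)
BEGIN { context=3; }
{ add_buffer($0) }
/pattern/ { print_buffer() }
# store the current line in a ring buffer of size context
function add_buffer(line)
{
    buffer[NR % context]=line
}
# print the buffered lines, oldest first, ending with the current line
function print_buffer()
{
    for(i = max(1, NR-context+1); i <= NR; i++) {
        print buffer[i % context]
    }
}
function max(a,b)
{
    if (a > b) { return a } else { return b }
}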
This will not coalesce adjacent matches, unlike grep -B, and can thus repeat lines that
are within 3 lines of two different matches.
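For comparison, here is a minimal Python sketch of a variant that does coalesce adjacent matches, so no line is emitted twice. The function name grep_before is made up, it matches a plain substring rather than a regular expression, and it does not print grep's "--" group separators.

```python
from collections import deque


def grep_before(lines, needle, before=3):
    """Yield matching lines plus up to `before` preceding lines, each line once."""
    pending = deque(maxlen=before)  # recent lines that have not been emitted yet
    for line in lines:
        if needle in line:
            # Flush the buffered context, then the match itself.
            yield from pending
            pending.clear()
            yield line
        else:
            pending.append(line)
```

Emitted lines are removed from the buffer as soon as they are printed, which is what prevents the repeats described above.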