Quickest Way to Select/Copy Lines Containing String from Huge Txt.Gz File

The following is untested, but something close to it should work with GNU Parallel.

First, create an output directory so that no valuable data gets overwritten:

mkdir -p output

Now declare a function that processes one file, and export it so that the subprocesses started by GNU Parallel can find it:

doit(){
  echo "Processing $1"
  gzip -cd "$1" | awk '
    /^[ST]\|/ || /^#D=/ || /^##/ {next}   # skip lines starting with S|, T|, #D= or ##
    /^H\|/ {printf ","}                   # prefix "H|" lines with "," (no newline)
    /^Q\|/ {printf ",,"}                  # prefix "Q|" lines with ",," (no newline)
    1                                     # print the line itself
  ' | gzip > output/"$1"
}
export -f doit

Now process all txt.gz files in parallel, with a progress bar:

parallel --bar doit ::: *txt.gz
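
For readers who prefer to stay in Python, the same per-line filter could be sketched roughly as follows. This is an illustration under assumptions, not part of the original answer: the names `transform` and `process_file` and the `out_dir` parameter are hypothetical, and a process pool stands in for GNU Parallel.

```python
import gzip
import re
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

# Lines to drop: those starting with "S|", "T|", "#D=" or "##".
SKIP = re.compile(r"^(?:[ST]\||#D=|##)")


def transform(line):
    """Return the rewritten line, or None if the line should be dropped."""
    if SKIP.match(line):
        return None
    if line.startswith("H|"):
        return "," + line      # prefix "H|" lines with ","
    if line.startswith("Q|"):
        return ",," + line     # prefix "Q|" lines with ",,"
    return line                # all other lines pass through unchanged


def process_file(src, out_dir="output"):
    """Stream one .gz file through transform() into out_dir (hypothetical layout)."""
    dst = Path(out_dir) / Path(src).name
    dst.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in fin:
            new_line = transform(line)
            if new_line is not None:
                fout.write(new_line)
    return str(dst)


if __name__ == "__main__":
    files = sorted(str(p) for p in Path(".").glob("*txt.gz"))
    # One worker process per CPU by default; each handles whole files.
    with ProcessPoolExecutor() as pool:
        for done in pool.map(process_file, files):
            print("Processed", done)
```

Both input and output are streamed line by line, so memory use should stay small even for very large archives.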

How to get a few lines from a .gz compressed file without uncompressing

zcat(1) can be supplied by either compress(1) or gzip(1). On your system it appears to come from compress(1), which is why it is looking for a file with a .Z extension.

Switch to gzip -cd in place of zcat and your command should work fine:

 gzip -cd CONN.20111109.0057.gz | head

Explanation

-c --stdout --to-stdout
    Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.

-d --decompress --uncompress
    Decompress.
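
The same "decompress to a stream and stop early" idea can be sketched in Python with the standard gzip module. The helper name `gz_head` is hypothetical; the file is read incrementally, so only the beginning of the archive should need to be decompressed.

```python
import gzip
from itertools import islice


def gz_head(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    # "rt" opens the archive in text mode, so lines come back as str.
    with gzip.open(path, "rt") as fh:
        # islice stops after n lines, or earlier if the file is shorter.
        return list(islice(fh, n))
```

For example, `gz_head("CONN.20111109.0057.gz")` would play the role of `gzip -cd CONN.20111109.0057.gz | head`.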

How to search for a particular string in a .gz file?

gunzip -c mygzfile.gz | grep "string to be searched"

This only works if the .gz file contains plain text, which is true in your case.
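
A rough Python counterpart to the pipeline above, for cases where the matching lines need further processing, might look like this. The function name `gz_grep` is hypothetical, and it does a plain substring test rather than grep's regular-expression match.

```python
import gzip


def gz_grep(path, needle):
    """Yield the lines of a gzip-compressed text file that contain needle."""
    # errors="replace" keeps the scan going if a few bytes are not valid text.
    with gzip.open(path, "rt", errors="replace") as fh:
        for line in fh:
            if needle in line:           # substring test, not a regex
                yield line.rstrip("\n")
```

Because this is a generator, lines are decompressed and tested one at a time, so the whole file is never held in memory.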

Read .gz files from a list and print lines

You are passing the complete list of filtered log file names at once, which is why you are getting the error. Iterate over the list, open the files one at a time, and search each one:

import re
import os
import glob
import gzip
from datetime import datetime

# Python 3: input() replaces raw_input()
date_entry = input('Give a date in format YEAR, MONTH, DAY \n')
date = datetime.strptime(re.sub(r"\s+", "", date_entry), "%Y,%m,%d").date()

path = "/applis/tacacs/log/"

# keep only the .gz files last modified on the given date
list_of_files = [
    file for file in glob.glob(path + '*.gz')
    if date == datetime.fromtimestamp(os.path.getmtime(file)).date()
]

print("Files found: ")
print(list_of_files)
Adresse_IP = input('IP Address \n')
# escape the dots so they match literally instead of any character
ip_pattern = re.compile(re.escape(Adresse_IP))

for fname in list_of_files:               # iterate over the log file names
    with gzip.open(fname, 'rt') as file:  # open one file, in text mode
        for line in file:                 # iterate over its lines
            if ip_pattern.search(line):   # search the line
                print(line, end='')       # print the line if it matches

How to delete from a text file, all lines that contain a specific string?

To remove the line and print the output to standard out:

sed '/pattern to match/d' ./infile

To directly modify the file – does not work with BSD sed:

sed -i '/pattern to match/d' ./infile

Same, but for BSD sed (Mac OS X and FreeBSD) – does not work with GNU sed:

sed -i '' '/pattern to match/d' ./infile

To directly modify the file (and create a backup) – works with BSD and GNU sed:

sed -i.bak '/pattern to match/d' ./infile
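
If sed is not available, the backup-then-rewrite behaviour of the last command can be approximated in Python. This is a minimal sketch under assumptions: the function name `delete_matching_lines` and the temporary `.tmp` suffix are hypothetical, and it tests for a literal substring where sed would apply a regular expression.

```python
import os
import shutil


def delete_matching_lines(path, needle, backup_suffix=".bak"):
    """Rewrite path without the lines containing needle; keep a backup copy."""
    backup = path + backup_suffix
    shutil.copy2(path, backup)           # like sed -i.bak: keep the original
    tmp = path + ".tmp"                  # hypothetical temporary file name
    with open(backup) as src, open(tmp, "w") as dst:
        for line in src:
            if needle not in line:       # literal substring, not a regex
                dst.write(line)
    os.replace(tmp, path)                # swap the filtered file into place
```

Writing to a temporary file and renaming it at the end means an interrupted run should leave the original file intact.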

How to read first N lines of a file?

Python 3:

with open("datafile") as myfile:
    head = [next(myfile) for x in range(N)]
print(head)

Python 2:

with open("datafile") as myfile:
    head = [next(myfile) for x in xrange(N)]
print head

Here's another way (both Python 2 & 3):

from itertools import islice

with open("datafile") as myfile:
    head = list(islice(myfile, N))
print(head)

How to extract a specific text from gz file?

Another approach, using zgrep and a positive lookbehind:

$ zgrep -oP "(?<=^[ACTGN]{4})[ACTGN]{6}" foo.gz
TNACGG
CNACCT

Explained:

  • zgrep: searches possibly compressed files for a regular expression (see man zgrep)
  • -o: print only the matched (non-empty) parts of a matching line
  • -P: interpret the pattern as a Perl-compatible regular expression (PCRE)
  • (?<=^[ACTGN]{4}): positive lookbehind; the match must be immediately preceded by four of these characters at the start of the line
  • [ACTGN]{6}: match six of the listed characters that are preceded by the above
  • foo.gz: my test file
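
The same extraction can be sketched in Python. Note the swap in technique: instead of a lookbehind, this version anchors the four leading characters and uses a capturing group for the six that follow, which should select the same text. The function name `extract_codes` is hypothetical.

```python
import gzip
import re

# Four leading characters at the start of the line, then capture the next six.
PATTERN = re.compile(r"^[ACTGN]{4}([ACTGN]{6})")


def extract_codes(path):
    """Return the captured six-character strings, one per matching line."""
    found = []
    with gzip.open(path, "rt") as fh:
        for line in fh:
            match = PATTERN.match(line)
            if match:
                found.append(match.group(1))
    return found
```

Lines that do not start with at least ten of the listed characters are skipped, just as zgrep -o prints nothing for them.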

