Grep from Tar.Gz Without Extracting [Faster One]

Performing grep operation in tar files without extracting

The tar command has a -O switch that extracts files to standard output, so you can pipe that output to grep or awk:

tar xvf test.tar -O | awk '/pattern/{print}'

tar xvf test.tar -O | grep "pattern"

For example, to print the name of each file in which the pattern is found:

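# List the members (skipping directory entries), then test each one in turn.
tar tf test.tar | grep -v '/$' | while IFS= read -r FILE
do
    # Extract a single member to stdout; grep prints any matching lines.
    if tar xf test.tar -O "$FILE" | grep "pattern"; then
        echo "found pattern in : $FILE"
    fi
done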
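
The loop above runs tar once per member, so the archive is re-read from the start for every file. With GNU tar, a single pass may be possible using --to-command, which pipes each member to a command and exposes the member name in the TAR_FILENAME environment variable. A minimal sketch, assuming GNU tar and GNU grep (for --label); the sample archive, file names, and pattern are made up for illustration:

```shell
# Build a small throwaway archive so the example is self-contained
# (hypothetical file names and contents).
workdir=$(mktemp -d)
mkdir "$workdir/src"
printf 'alpha\nneedle here\n' > "$workdir/src/a.txt"
printf 'nothing to see\n'     > "$workdir/src/b.txt"
tar -cf "$workdir/test.tar" -C "$workdir/src" a.txt b.txt

# One pass over the archive: tar feeds every member to grep on stdin.
# -H --label makes grep print the member name instead of "(standard input)".
# "|| true" keeps tar from reporting grep's exit status 1 (no match) as an error.
single_pass=$(tar -xf "$workdir/test.tar" \
    --to-command='grep -H --label="$TAR_FILENAME" -e "needle" || true')

printf '%s\n' "$single_pass"
rm -rf "$workdir"
```

For a .tar.gz archive the same idea should apply with -z added, and it decompresses the archive only once.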

How to grep for a pattern in the files in tar archive without filling up disk space

Here's my take on this:

while read filename; do tar -xOf file.tar "$filename" | grep 'pattern' | sed "s|^|$filename:|"; done < <(tar -tf file.tar | grep -v '/$')

Broken out for explanation:

  • while read filename; do -- it's a loop...
  • tar -xOf file.tar "$filename" -- this extracts each file...
  • | grep 'pattern' -- here's where you put your pattern...
  • | sed "s|^|$filename:|"; -- prepend the filename, so this looks like grep output. Salt to taste.
  • done < <(tar -tf file.tar | grep -v '/$') -- end the loop, and get the list of files (skipping directories) to feed to your while read.

One proviso: this breaks if you have OR bars (|) in your filenames.
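
The proviso comes from splicing the file name into the sed expression. One way around it, sketched here, is to add the prefix with printf in a small read loop, so the name is never interpreted as part of a sed script. The helper name grep_members and the archive contents are made up for illustration; names containing newlines are still not handled.

```shell
# grep_members PATTERN ARCHIVE  (hypothetical helper name)
# Prints "member:matching line" for every match, one tar run per member.
grep_members() {
    tar -tf "$2" | grep -v '/$' | while IFS= read -r filename; do
        tar -xOf "$2" "$filename" | grep -e "$1" |
            while IFS= read -r line; do
                # printf treats the name as plain data, so "|" is harmless.
                printf '%s:%s\n' "$filename" "$line"
            done
    done
}

# Throwaway archive with a "|" in one member name.
workdir=$(mktemp -d)
printf 'needle one\n' > "$workdir/odd|name.txt"
printf 'no match\n'   > "$workdir/plain.txt"
tar -cf "$workdir/file.tar" -C "$workdir" 'odd|name.txt' plain.txt

members_out=$(grep_members 'needle' "$workdir/file.tar")
printf '%s\n' "$members_out"
rm -rf "$workdir"
```

This is slower than sed for very large outputs, since the prefix is added line by line in the shell, but it avoids having to pick a delimiter that never occurs in a file name.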

Hmm. In fact, this makes a nice little bash function, which you can append to your .bashrc file:

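targrep() {
    # Usage: targrep pattern file.tar[.gz] ...
    local pattern="$1" taropt archive filename

    if [[ $# -lt 2 ]]; then
        echo "Usage: targrep pattern file ..." >&2
        return 1
    fi
    shift    # drop the pattern; "$@" now holds only the archives

    for archive in "$@"; do
        if [[ ! -f "$archive" ]]; then
            echo "targrep: $archive: No such file" >&2
            continue
        fi

        case "$archive" in
            *.tar.gz|*.tgz) taropt="-z" ;;
            *) taropt="" ;;
        esac

        # Extract one member at a time to stdout and prefix matches with its name.
        while IFS= read -r filename; do
            tar $taropt -xOf "$archive" "$filename" \
                | grep -- "$pattern" \
                | sed "s|^|$filename:|"
        done < <(tar $taropt -tf "$archive" | grep -v '/$')
    done
}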

Grep Pattern From TAR Output

With GNU tar:

tar -xvzf your_file.tgz --wildcards "*/index.php"

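Update: to list the index.php files one directory level below httpdocs while excluding any that sit deeper: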

tar -tvzf your_file.tgz --wildcards "httpdocs/*/index.php" --exclude="httpdocs/*/*/index.php"

Fastest way to find whether a file exists in a number of gzipped tarballs?

There aren't many shortcuts: tar files are sequential in nature, so the best you can do is process each tar file at most once (and possibly process multiple files in parallel). With GNU tar, you can search a tar file like this:

tar --wildcards -tzf file.tgz pattern [pattern...]
parallel -tk --tag tar --wildcards -tzvf ::: file*.tgz ::: "pattern"

With multiple patterns, matching file names are displayed, and the exit code is 0 if any are found. Remember to use "**" for a glob to match across directories.
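
If GNU parallel is not installed, the same exit-status check can be sketched as a plain loop that visits each archive once. The function name archives_with_member, the sample archives, and the pattern are hypothetical; GNU tar is assumed for --wildcards.

```shell
# archives_with_member PATTERN ARCHIVE...  (hypothetical helper name)
# Prints each archive in which at least one member name matches PATTERN.
archives_with_member() {
    pattern=$1
    shift
    for archive in "$@"; do
        # tar exits non-zero when the pattern matches nothing in the archive.
        if tar --wildcards -tzf "$archive" "$pattern" >/dev/null 2>&1; then
            printf '%s\n' "$archive"
        fi
    done
}

# Two throwaway archives; only the first contains report.csv.
workdir=$(mktemp -d)
printf 'x\n' > "$workdir/report.csv"
printf 'y\n' > "$workdir/notes.txt"
tar -czf "$workdir/one.tgz" -C "$workdir" report.csv notes.txt
tar -czf "$workdir/two.tgz" -C "$workdir" notes.txt

found_in=$(cd "$workdir" && archives_with_member 'report.*' one.tgz two.tgz)
printf '%s\n' "$found_in"
rm -rf "$workdir"
```

Because errors are sent to /dev/null, a corrupt archive is reported the same way as one without a match; dropping 2>&1 makes the difference visible.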

However, if you're only looking for a single pattern per tar file, this really won't be measurably faster than your existing approach. GNU tar has optimizations for seekable tar files, but compression will counteract any benefits. Tar files can be incremental, split, or updated, and can even contain multiple copies of the same file, so there is no alternative to scanning the whole file (even though most tar files are not so complex).

If this is a recurring task, you might consider keeping an index file when the archives are created:

tar -czvf file.tgz files [...]  > file.idx 

Or, if you use GNU tar, add --index-file=file.idx instead. A single -v records file names only; with -vv the index file contains the full details, as would be shown by -tv. (There does not appear to be a nul-delimited --index-file0 option at this time.)
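
With an index written at creation time, later lookups only need to read small text files instead of decompressing every archive. A rough sketch, assuming GNU tar (which writes the -v listing to standard output when the archive itself goes to a file) and one .idx file per archive; all file names are made up:

```shell
# Throwaway input files (hypothetical names).
workdir=$(mktemp -d)
mkdir "$workdir/data"
printf 'x\n' > "$workdir/data/report.csv"
printf 'y\n' > "$workdir/data/notes.txt"

# Create the archives and keep the -v listing of each as its index.
tar -czvf "$workdir/one.tgz" -C "$workdir/data" report.csv notes.txt > "$workdir/one.idx"
tar -czvf "$workdir/two.tgz" -C "$workdir/data" notes.txt            > "$workdir/two.idx"

# Later: find which archives hold report.csv by searching the indexes only.
# -l lists matching index files, -x requires a whole-line match, -F disables regex.
hits=$(cd "$workdir" && grep -lxF 'report.csv' ./*.idx)
printf '%s\n' "$hits"
rm -rf "$workdir"
```

The index is only as trustworthy as the process that writes it: if an archive is later updated or replaced, its .idx file has to be regenerated as well.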

(In case it is useful, there are also alternatives to tar for this, see https://serverfault.com/questions/59795/is-there-a-smarter-tar-or-cpio-out-there-for-efficiently-retrieving-a-file-store )

Search large tar.gz file for keywords, copy and delete

If you are trying to search the files for a keyword and extract only those that match, this can take a long time with files this large, especially if the keyword is somewhere in the middle.

The best advice I can give is probably to use a combination of an inverted-index lookup tool such as Solr (based on the Lucene index) and Apache Tika, a content analysis toolkit.

Using these tools, you can index the tar.gz files; when you search for a keyword, the relevant documents containing the keyword are returned.


