Performing grep operation in tar files without extracting
The tar command has a -O switch that extracts files to standard output, so you can pipe that output to grep or awk:
tar xvf test.tar -O | awk '/pattern/{print}'
tar xvf test.tar -O | grep "pattern"
For example, to print the name of each file in which the pattern is found:
tar tf test.tar | while IFS= read -r FILE
do
    # Extract one member to stdout and search it
    if tar xf test.tar -O "$FILE" | grep "pattern" ;then
        echo "found pattern in : $FILE"
    fi
done
How to grep for a pattern in the files in tar archive without filling up disk space
Here's my take on this:
while read filename; do tar -xOf file.tar "$filename" | grep 'pattern' | sed "s|^|$filename:|"; done < <(tar -tf file.tar | grep -v '/$')
Broken out for explanation:
while read filename; do
-- it's a loop...
tar -xOf file.tar "$filename"
-- this extracts each file...
| grep 'pattern'
-- here's where you put your pattern...
| sed "s|^|$filename:|";
-- prepend the filename, so this looks like grep. Salt to taste.
done < <(tar -tf file.tar | grep -v '/$')
-- end the loop, and get the list of files to feed to your while read.
One proviso: this breaks if you have OR bars (|) in your filenames.
Hmm. In fact, this makes a nice little bash function, which you can append to your .bashrc file:
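targrep() {
    local taropt="" pattern="$1" filename
    if [[ ! -f "$2" ]]; then
        echo "Usage: targrep pattern file ..." >&2
        return 1
    fi
    shift    # drop the pattern; the remaining arguments are archives
    while [[ -n "$1" ]]; do
        if [[ ! -f "$1" ]]; then
            echo "targrep: $1: No such file" >&2
            shift
            continue
        fi
        case "$1" in
            *.tar.gz) taropt="-z" ;;
            *) taropt="" ;;
        esac
        # Extract one member at a time so each match can be labelled
        while IFS= read -r filename; do
            tar $taropt -xOf "$1" "$filename" \
                | grep -- "$pattern" \
                | sed "s|^|$filename:|"
        done < <(tar $taropt -tf "$1" | grep -v '/$')
        shift
    done
}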
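The loop above reads the archive once per member, which gets slow on big archives. If you have GNU tar and GNU grep, a single-pass variant might look like the sketch below. It relies on tar's --to-command option, which pipes each member to a command and puts the member name in the TAR_FILENAME environment variable. The function name targrep_once and the TARGREP_PATTERN variable are made up for this example.

```shell
# Sketch only: assumes GNU tar (--to-command, TAR_FILENAME) and GNU grep (--label).
# targrep_once and TARGREP_PATTERN are hypothetical names used for illustration.
# Usage: targrep_once pattern archive.tar
targrep_once() {
    local pattern=$1 archive=$2
    # The pattern is passed through the environment so the child shell
    # started by tar can read it without extra quoting.
    # "|| true" keeps tar from warning when grep finds nothing in a member.
    TARGREP_PATTERN=$pattern tar -xf "$archive" \
        --to-command='grep -H --label="$TAR_FILENAME" -e "$TARGREP_PATTERN" || true'
}
```

Output has the same name:line shape as the sed version, and bars in filenames are no longer a problem because sed is not involved.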
Grep Pattern From TAR Output
With GNU tar:
tar -xvzf your_file.tgz --wildcards "*/index.php"
Update
tar -tvzf your_file.tgz --wildcards "httpdocs/*/index.php" --exclude="httpdocs/*/*/index.php"
Fastest way to find whether a file exists in a number of gzipped tarballs?
There aren't many shortcuts. tar files are sequential in nature, so the best you can do is to process each tar file at most once (and possibly process multiple files in parallel). With GNU tar, when searching a tar file you can do:
tar --wildcards -tzf file.tgz pattern [pattern...]
parallel -tk --tag tar --wildcards -tzvf ::: file*.tgz ::: "pattern"
When using multiple patterns, matching file names will be displayed, and the exit code is 0 if any are found. Remember to use "**" for a glob to match across directories.
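If you only need to know which tarballs contain a given member, the exit status can drive a small loop. This is a sketch assuming GNU tar and a single pattern; the helper name find_in_tarballs is made up for illustration.

```shell
# Sketch only: assumes GNU tar; find_in_tarballs is a hypothetical helper name.
# Usage: find_in_tarballs 'member-pattern' file1.tgz file2.tgz ...
find_in_tarballs() {
    local pattern=$1 archive
    shift
    for archive in "$@"; do
        # GNU tar exits non-zero when the pattern matches no member,
        # so both the listing and the error message can be discarded.
        if tar --wildcards -tzf "$archive" "$pattern" >/dev/null 2>&1; then
            printf '%s\n' "$archive"
        fi
    done
}
```

Each archive is still scanned from start to finish, so this is no faster per file; it just prints the names of the archives that matched.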
However, if you're only looking for a single pattern per tar file, this really won't be measurably faster than your existing approach. GNU tar has optimizations for seekable tar files, but compression will counteract any benefits. Tar files can be incremental, split, updated and even contain multiple copies of the same file, so there is no alternative to scanning the whole file (even though most tar files are not so complex).
If this is a recurring task, you might consider keeping an index file when the archives are created:
tar -czvf file.tgz files [...] > file.idx
or, if you use GNU tar, add --index-file=file.idx instead. With one -v the index contains filenames only; with -vv the index file will contain the full details as would be shown by -tv. (There does not appear to be a nul-delimited --index-file0 option at this time.)
(In case it is useful, there are also alternatives to tar for this; see https://serverfault.com/questions/59795/is-there-a-smarter-tar-or-cpio-out-there-for-efficiently-retrieving-a-file-store )
Search large tar.gz file for keywords, copy and delete
If you are trying to search for a keyword in the files and extract only those that match, it might take time because your files are huge, especially if the keyword is somewhere in the middle.
The best advice I can give is probably to use a powerful combination of an inverted-index lookup tool such as Solr (based on the Lucene index) and Apache Tika, a content analysis toolkit.
Using these tools you can index the tar.gz files, and when you search for a keyword, relevant documents containing the keyword will be returned.