How to Grep for a Pattern in the Files in Tar Archive Without Filling Up Disk Space

Here's my take on this:

while read filename; do tar -xOf file.tar "$filename" | grep 'pattern' | sed "s|^|$filename:|"; done < <(tar -tf file.tar | grep -v '/$')

Broken out for explanation:

  • while read filename; do -- start a loop over the archive's member names...
  • tar -xOf file.tar "$filename" -- extract each file to standard output...
  • | grep 'pattern' -- here's where you put your pattern...
  • | sed "s|^|$filename:|" -- prepend the filename, so the output looks like grep's. Salt to taste.
  • done < <(tar -tf file.tar | grep -v '/$') -- end the loop; list the archive's files (skipping directory entries) to feed your while read.

One proviso: this breaks if you have pipe characters (|) in your filenames, since | is used as the sed delimiter here.
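If that bothers you, awk can do the prepending instead of sed, so no delimiter clashes with the filename. Here is a self-contained sketch (the archive and file names are made up for the demo); piping the listing into the loop also avoids bash-only process substitution:

```shell
# Demo setup: build a small throwaway archive to search.
workdir=$(mktemp -d)
printf 'alpha\nneedle here\n' > "$workdir/a.txt"
printf 'no match\n' > "$workdir/b.txt"
tar -cf "$workdir/demo.tar" -C "$workdir" a.txt b.txt

# Same loop as above, but awk prepends the filename, so a '|' (or any
# sed-delimiter character) in the name is harmless.
matches=$(tar -tf "$workdir/demo.tar" | grep -v '/$' \
  | while read -r filename; do
      tar -xOf "$workdir/demo.tar" "$filename" \
        | awk -v f="$filename" '/needle/ { print f ":" $0 }'
    done)

echo "$matches"
rm -rf "$workdir"
```

(One remaining caveat: awk -v still interprets backslash escapes in the value, so backslashes in filenames would need extra care.)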

Hmm. In fact, this makes a nice little bash function, which you can append to your .bashrc file:

targrep() {
    # Remember the pattern before shifting; the original draft shifted it
    # away after the first archive.
    local pattern="$1"
    local taropt=""

    if [[ -z "$1" || -z "$2" ]]; then
        echo "Usage: targrep pattern file ..." >&2
        return 1
    fi
    shift

    while [[ -n "$1" ]]; do

        if [[ ! -f "$1" ]]; then
            echo "targrep: $1: No such file" >&2
            shift
            continue
        fi

        case "$1" in
            *.tar.gz) taropt="-z" ;;
            *)        taropt="" ;;
        esac

        while read -r filename; do
            tar $taropt -xOf "$1" "$filename" \
                | grep "$pattern" \
                | sed "s|^|$filename:|"
        done < <(tar $taropt -tf "$1" | grep -v '/$')

        shift

    done
}
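The .tar.gz branch boils down to the same pipeline with -z added. A quick self-contained demo of that case (all names here are invented for the example):

```shell
# Build a small gzipped archive to search.
workdir=$(mktemp -d)
printf 'hello pattern world\n' > "$workdir/notes.txt"
tar -czf "$workdir/demo.tar.gz" -C "$workdir" notes.txt

# This is what targrep runs when taropt="-z":
found=$(tar -z -tf "$workdir/demo.tar.gz" | grep -v '/$' \
  | while read -r filename; do
      tar -z -xOf "$workdir/demo.tar.gz" "$filename" \
        | grep 'pattern' | sed "s|^|$filename:|"
    done)

echo "$found"
rm -rf "$workdir"
```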

Performing a grep operation on files in a tar archive without extracting them

The tar command has a -O switch to extract files to standard output, so you can pipe that output to grep or awk:

tar xvf  test.tar -O | awk '/pattern/{print}'

tar xvf test.tar -O | grep "pattern"

For example, to report which file the pattern was found in:

tar tf myarchive.tar | while read -r FILE
do
    if tar xf myarchive.tar "$FILE" -O | grep "pattern"; then
        echo "found pattern in: $FILE"
    fi
done

Getting contents of a particular file in the tar archive

This is usually documented in man pages, try running this command:

man tar

Unfortunately, Linux does not have the best set of man pages. There is an online copy of the tar man page from this OS: http://linux.die.net/man/1/tar and it is terrible. But it links to the info tar command, which accesses the "info" documentation system widely used in the GNU world (many programs in Linux user space come from GNU projects, for example gcc). Here is a direct link to the section of the online info tar manual about extracting specific files: http://www.gnu.org/software/tar/manual/html_node/extracting-files.html#SEC27

I can also recommend the documentation from BSD (e.g. FreeBSD) or opengroup.org. The utilities differ in detail but behave the same in general.

For example, here is a rather old but good man page from opengroup (XCU means 'Commands and Utilities' of the Single UNIX Specification, Version 2, 1997):
http://pubs.opengroup.org/onlinepubs/7908799/xcu/tar.html

tar key [file...]

The following operands are supported:

key --
The key operand consists of a function letter followed immediately by zero or more modifying letters. The function letter is one of the following:

x --
Extract the named file or files from the archive. If a named file matches a directory whose contents had been written onto the archive, this directory is (recursively) extracted. If a named file in the archive does not exist on the system, the file is created with the same mode as the one in the archive, except that the set-user-ID and set-group-ID modes are not set unless the user has appropriate privileges. If the files exist, their modes are not changed except as described above. The owner, group, and modification time are restored (if possible). If no file operand is given, the entire content of the archive is extracted. Note that if several files with the same name are in the archive, the last one overwrites all earlier ones.

And to fully understand command tar xf test.tar $FILE you should also read about f option:

f --
Use the first file operand (or the second, if b has already been specified) as the name of the archive instead of the system-dependent default.

So test.tar in your command is consumed by the f key as the archive name; then x uses the next operand ($FILE) as the name of the file or directory to extract from the archive.
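Putting those pieces together, here is a minimal sketch of extracting one member's contents to standard output (the archive and member names below are made up for the demo):

```shell
# Build a two-member archive to work with.
workdir=$(mktemp -d)
printf 'first\n'  > "$workdir/a.txt"
printf 'second\n' > "$workdir/b.txt"
tar -cf "$workdir/t.tar" -C "$workdir" a.txt b.txt

# x = extract, f = next operand names the archive, -O = write to stdout,
# final operand = the member to extract.
content=$(tar -xOf "$workdir/t.tar" b.txt)

echo "$content"
rm -rf "$workdir"
```

Nothing is written to disk; the member's bytes just flow to stdout, which is why this combines so well with grep.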

How can I grep for a text pattern in a zipped text file?

Use zgrep on Linux. If you're on Windows, you can download GnuWin, which contains a Windows port of zgrep.
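zgrep is essentially "decompress, then grep": it feeds the decompressed stream to grep. A small sketch of the equivalent pipeline, which works even where zgrep itself isn't installed (file names invented for the demo):

```shell
# Make a small gzipped text file.
workdir=$(mktemp -d)
printf 'alpha\nbeta pattern\n' > "$workdir/log.txt"
gzip "$workdir/log.txt"          # produces log.txt.gz

# zgrep 'pattern' log.txt.gz is, in effect:
hit=$(gzip -cd "$workdir/log.txt.gz" | grep 'pattern')

echo "$hit"
rm -rf "$workdir"
```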

Listing (or counting) files in .tar/.tar.gz archives: what is the time complexity?

Depends on your storage!

uncompressed tar

For tape archives (you know, "tar"s), it is linear in the byte length in every case, because fast-forwarding a tape is itself linear in time in the distance you need to fast-forward.

For small files on modern storage: the same. You don't ask your SSD for 20 bytes of storage; you get a 4 kB block at once. In theory, this means you could skip over that 1 GB file almost instantly. In practice, my experience tells me that doesn't happen, and I honestly don't know why; to me, the "next_block_after" function should just skip forward. shrugs

compressed tar

Yes, in general you'll have to uncompress the data to know how long the content is before you can seek anywhere. I don't think there's a common compression format that keeps some kind of table of "intermediate" lengths to speed up seeking.
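Either way, counting members means reading the whole stream; there is no index to consult. A quick sketch (names invented for the demo):

```shell
# Build a gzipped archive with three members.
workdir=$(mktemp -d)
for i in 1 2 3; do printf 'x\n' > "$workdir/f$i.txt"; done
tar -czf "$workdir/demo.tar.gz" -C "$workdir" f1.txt f2.txt f3.txt

# Listing (-t) a .tar.gz decompresses and scans the entire archive,
# even though all we keep is the count.
count=$(tar -tzf "$workdir/demo.tar.gz" | wc -l)

echo "$count"
rm -rf "$workdir"
```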

Grep the JSON value of a key name (busybox, without option -P)

With busybox awk:

busybox awk  -F '[:,]' '/"one"/ {gsub("[[:blank:]]+", "", $2); print $2}'
  • -F '[:,]' sets the field separator as : or ,

  • /"one"/ {gsub("[[:blank:]]+", "", $2); print $2} matches if the line contains "one"; if so, it strips all horizontal whitespace from the second field and then prints that field

If you want to strip off the quotes too:

busybox awk  -F '[:,]' '/"one"/ {gsub("[[:blank:]\"]+", "", $2); print $2}'

Example:

$ cat file.json 
{
"one": "apple",
"two": "banana"
}

$ busybox awk -F '[:,]' '/"one"/ {gsub("[[:blank:]]+", "", $2); print $2}' file.json
"apple"

$ busybox awk -F '[:,]' '/"one"/ {gsub("[[:blank:]\"]+", "", $2); print $2}' file.json
apple

