How to Use Awk for a Compressed File

You need to read the compressed files through process substitution, like this:

awk '{ ... }' <(gzip -dc input1.vcf.gz) <(gzip -dc input2.vcf.gz)

Try this:

awk 'FNR==NR { sub(/AA=\.;/,""); array[$1,$2]=$8; next } ($1,$2) in array { print $0 ";" array[$1,$2] }' <(gzip -dc input1.vcf.gz) <(gzip -dc input2.vcf.gz) | gzip > output.vcf.gz
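To see what the command above does, here is a minimal sketch on toy data. The file names and field values are made up for illustration, and the array is renamed info for readability. It assumes bash, because process substitution is not available in plain sh.

```shell
#!/usr/bin/env bash
# Process substitution, <( ... ), needs bash (or zsh/ksh), not plain sh.
# Toy data: the file names and INFO values below are made up for illustration.
workdir=$(mktemp -d)
printf 'chr1\t100\trs1\tA\tG\t50\tPASS\tAA=.;DP=10\n' >  "$workdir/first.vcf"
printf 'chr1\t200\trs2\tC\tT\t60\tPASS\tAA=.;DP=20\n' >> "$workdir/first.vcf"
printf 'chr1\t100\trs1\tA\tG\t50\tPASS\tAF=0.5\n'     >  "$workdir/second.vcf"
printf 'chr1\t300\trs3\tG\tA\t70\tPASS\tAF=0.1\n'     >> "$workdir/second.vcf"
gzip "$workdir/first.vcf" "$workdir/second.vcf"

# First file: strip "AA=.;" and remember column 8, keyed by columns 1 and 2.
# Second file: for keys seen in the first file, append the remembered value.
merged=$(awk '
    FNR == NR { sub(/AA=\.;/, ""); info[$1, $2] = $8; next }
    ($1, $2) in info { print $0 ";" info[$1, $2] }
' <(gzip -dc "$workdir/first.vcf.gz") <(gzip -dc "$workdir/second.vcf.gz"))

printf '%s\n' "$merged"
```

Only the position present in both files (chr1, 100) is printed, with DP=10 appended to its last column. Note that FNR==NR is true only while awk reads its first input, so an empty first file would make the condition match the second file instead.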

Use awk on zipped files found by the find command

Find all files under the current directory, recursively, whose names start with GAUR and end with .zip. Read the output line by line, create the target directory, unzip each file to standard output, and pipe that into awk, which prints the 2nd and 3rd columns into a file under ./gaur/ at the original file path. sed cuts the .zip extension from the file name, so the output file has no .zip ending.

find . -name 'GAUR*.zip' | while IFS= read -r line; do mkdir -p "gaur/$(dirname "$line")" && unzip -p "$line" | awk -F'|' '{ print $2 "," $3 }' > "./gaur/$(echo "$line" | sed 's/\.zip$//')"; done

You have to unzip the file first; only then can you run awk on its contents. So I made this ugly one-liner to do it. But it is hard to modify, so I would use a regular shell script for this.
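As a rough sketch of what such a regular shell script could look like: extract_columns is a hypothetical helper name, and the script assumes the same layout as the one-liner (output under gaur/, fields separated by "|"). File names containing newlines are not handled.

```shell
#!/bin/sh
# extract_columns is a hypothetical helper name; the gaur/ output root and the
# "|" separator are taken from the one-liner above.
extract_columns() {
    zipfile=$1
    outfile="gaur/${zipfile%.zip}"      # same path, minus the .zip ending
    mkdir -p "$(dirname "$outfile")" &&
        unzip -p "$zipfile" | awk -F'|' '{ print $2 "," $3 }' > "$outfile"
}

# Read one path per line; quoting keeps spaces in file names intact.
find . -name 'GAUR*.zip' | while IFS= read -r zipfile; do
    extract_columns "$zipfile"
done
```

Keeping the per-file work in a function makes it easier to change the awk program or the output location without touching the loop.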

AWK to process compressed files and printing original (compressed) file names

Assuming you are looping over all the files and piping their decompressed contents directly into awk, something like the following will work.

for file in *.gz; do
    gunzip -c "$file" | awk -v origname="$file" '.... {print origname " whatever"}'
done
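A concrete instance of that loop, as a minimal sketch: the sample files are created only so there is something to read, and the awk action (prefix every line with the file name) is just one possible choice.

```shell
# Hypothetical sample files, created only so the loop has something to read.
workdir=$(mktemp -d)
printf 'alpha\nbeta\n' | gzip > "$workdir/one.gz"
printf 'gamma\n' | gzip > "$workdir/two.gz"

report=$(
    for file in "$workdir"/*.gz; do
        # awk reads stdin here, so the name is passed in with -v instead.
        gunzip -c "$file" |
            awk -v origname="${file##*/}" '{ print origname ": " $0 }'
    done
)
printf '%s\n' "$report"
```

One caveat: awk processes escape sequences in -v values, so a file name containing a backslash may not come through unchanged.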

Edit: To use a list of filenames from some source other than a direct glob, something like the following can be used.

$ ls *.awk
a.awk e.awk
$ while IFS= read -d '' filename; do
echo "$filename";
done < <(find . -name \*.awk -printf '%P\0')
e.awk
a.awk

To use xargs instead of the above loop, the body of the command has to be something xargs can execute. I believe that means a pre-written script file (or an inline sh -c command) which xargs calls with the filename.
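A pre-written script file may not be strictly necessary: the loop body can be handed to sh -c inline, with xargs appending the file names as positional parameters. A minimal sketch, assuming a find and xargs that support -print0 and -0 (GNU, BSD, and busybox do); the sample files are made up.

```shell
# Hypothetical sample files for the demonstration.
workdir=$(mktemp -d)
printf 'x\ny\n' | gzip > "$workdir/a.gz"
printf 'z\n' | gzip > "$workdir/b.gz"

counts=$(
    find "$workdir" -name '*.gz' -print0 |
        xargs -0 sh -c '
            # xargs appends the file names; they arrive here as "$@".
            for file in "$@"; do
                gunzip -c "$file" |
                    awk -v origname="${file##*/}" "END { print origname, NR }"
            done
        ' sh | sort
)
printf '%s\n' "$counts"
```

The trailing sh fills $0 of the inline script, so the first file name is not swallowed. The sort is only there to make the output order predictable.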

awk for many compressed files

find's -exec can invoke a single program and pass arguments to it. The challenge here is that two commands (zcat | awk) need to be combined with a pipe. There are two possible paths: construct a shell command, or use the more flexible xargs.

# Using the 'shell -c' command
find . -iname '*.fastq.gz' -exec sh -c "zcat {} | awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}'" \;

# OR, using process substitution
find . -iname '*.fastq.gz' -exec bash -c "awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}' <(zcat {})" \;
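Both commands above embed {} inside the shell string, which can misbehave for file names containing quotes or spaces. A possible variant, sketched here under the assumption that passing the names as positional parameters is acceptable, keeps the awk program in an exported variable (gc_prog, a made-up name) to avoid nested quoting. It uses gzip -dc, which behaves like zcat on .gz files.

```shell
# Hypothetical two-read FASTQ sample.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' | gzip > "$workdir/sample.fastq.gz"

# Same calculation as above: fraction of non-A/T characters on sequence lines.
gc_prog='NR % 4 == 2 { total += length($0); gsub(/[AT]/, ""); gc += length($0) }
END { if (total > 0) print gc / total }'
export gc_prog

gc_fraction=$(
    find "$workdir" -iname '*.fastq.gz' -exec sh -c '
        # File names arrive as "$@"; one awk run (and one result) per file.
        for file in "$@"; do
            gzip -dc "$file" | awk "$gc_prog"
        done
    ' sh {} +
)
printf '%s\n' "$gc_fraction"
```

For the sample, 6 of the 8 sequence characters are not A or T, so the result is 0.75.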

See the many references to find/xargs on Stack Overflow.

How to use awk script to generate a file

awk field numbers start at 1, and $0 represents the full record. So the column numbers would be 1, 3, and 6.

You may use this awk:

awk 'BEGIN{FS=OFS=","} !$6{$6=$1} {print $1, $3, $6}' file

Time,MsgType,RTime
7:20:13,A,7:20:13
7:20:13,C,7:20:14
7:20:14,E,7:20:15
7:20:16,A,7:20:17
7:20:17,C,7:20:17
7:20:17,D,7:20:18
7:20:18,F,7:20:18
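The original input is not shown, so the following is only a sketch with a hypothetical six-column file (the Seq, Src, and Dst column names are invented). It illustrates how !$6 fills an empty sixth field from the first one.

```shell
# Hypothetical six-column input; only columns 1, 3 and 6 matter here.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Time,Seq,MsgType,Src,Dst,RTime
7:20:13,1,A,x,y,
7:20:13,2,C,x,y,7:20:14
EOF

# An empty column 6 is replaced by column 1 before printing.
result=$(awk 'BEGIN { FS = OFS = "," } !$6 { $6 = $1 } { print $1, $3, $6 }' "$sample")
printf '%s\n' "$result"
```

Keep in mind that !$6 is also true when the field is 0, not only when it is empty.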

Getting FILENAME in awk for multiple compressed files

Your command is parsing stdin, which is fed by the output of the previous command, so FILENAME is not available. One way to deal with it is this:

for f in *.tsv.gz; do
    zcat "$f" | awk -F, -v f="$f" '$1=="aaa" || $1=="bbb"{print f (NF?", ":"") $0}'
done

Use zcat and sed or awk to edit compressed .gz text file

You can't bypass compression, but you can chain the decompress/edit/recompress together in an automated fashion:

for f in /dir/*; do
    cp "$f" "$f~" &&
        gzip -cd "$f~" | sed '2~4s/^.\{6\}//' | gzip > "$f"
done

If you're quite confident in the operation, you can remove the backup files by adding rm "$f~" to the end of the loop body.
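The 2~4 address is a GNU sed extension. Where that is not available, an awk version of the same edit might look like the sketch below. trim_gz is a hypothetical helper name, and writing to a temporary file before replacing the original is a design choice of this sketch, not something the loop above does.

```shell
# trim_gz is a hypothetical helper: it removes the first six characters from
# lines 2, 6, 10, ... of a gzipped file and replaces the file with the result.
trim_gz() {
    src=$1
    gzip -t "$src" || return 1                  # refuse to touch a corrupt file
    tmp=$(mktemp "${src}.XXXXXX") || return 1
    if gzip -cd "$src" | awk 'NR % 4 == 2 { sub(/^....../, "") } 1' | gzip > "$tmp"; then
        mv "$tmp" "$src"                        # replace only after success
    else
        rm -f "$tmp"
        return 1
    fi
}

# Hypothetical four-line record to try it on.
workdir=$(mktemp -d)
printf 'id1\nXXXXXXpayload\nmeta\nmore\n' | gzip > "$workdir/records.txt.gz"
trim_gz "$workdir/records.txt.gz"
trimmed=$(gzip -cd "$workdir/records.txt.gz")
printf '%s\n' "$trimmed"
```

The original is only overwritten once the new archive has been written, which plays the same role as the backup copy in the loop above.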

Split a large, compressed file into multiple outputs using AWK and BASH

This little Perl script does the job nicely:

  • keeping all destination files open for performance
  • doing elementary error handling
  • Edit: now also pipes output through gzip on the fly

There is a bit of a kludge with $fh, because using the hash entry directly as the filehandle in print doesn't work; Perl would need it wrapped in a block, as in print {$pipes{$id}} $line.

#!/usr/bin/perl
use strict;
use warnings;

my $suffix = ".txt.gz";

my %pipes;
while (my ($id, $line) = split /\t/,(<>),2)
{
    exists $pipes{$id}
        or open ($pipes{$id}, "|gzip -9 > '$id$suffix'")
        or die "can't open/create $id$suffix, or cannot spawn gzip";

    my $fh = $pipes{$id};
    print $fh $line;
}

print STDERR "Created: " . join(', ', map { "$_$suffix" } keys %pipes) . "\n"

Use it like this:

zcat input.gz | ./myscript.pl
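Since the question asks about awk, here is a rough awk counterpart to the Perl script, as a sketch only. It assumes the first tab-separated field is safe to use as a file name (no quotes or slashes) and that the number of distinct IDs stays below the awk implementation's limit on open pipes. The sample data is made up.

```shell
# Hypothetical input: an ID, a tab, then the payload.
workdir=$(mktemp -d)
printf 'a\tfirst\nb\tsecond\na\tthird\n' | gzip > "$workdir/input.gz"

(
    cd "$workdir" &&
    gzip -dc input.gz | awk -F '\t' -v suffix=".txt.gz" '
    {
        # One gzip pipe per distinct ID; awk keeps each pipe open between lines.
        cmd = "gzip -9 > \"" $1 suffix "\""
        cmds[cmd]
        print substr($0, length($1) + 2) | cmd
    }
    END { for (c in cmds) close(c) }   # wait for every gzip to finish
    '
)

group_a=$(gzip -dc "$workdir/a.txt.gz")
group_b=$(gzip -dc "$workdir/b.txt.gz")
printf '%s\n' "$group_a" "$group_b"
```

As in the Perl version, everything after the first tab goes to the file named after the ID.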

How to replace a value with another value in a specific column of a gzipped file using awk?

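You could check for X in the first column only, and check that the row number is greater than 1 so the header is skipped.

Then you can replace X at the start of the line, using ^X, with 23. The command reads the decompressed data from standard input, so pipe the output of zcat into it.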

awk 'NR > 1 && $1=="X" {sub(/^X/,"23")}1' > out.txt
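Put together with decompression and recompression, the whole pipeline could look like this sketch. The file names and sample rows are hypothetical.

```shell
# Hypothetical gzipped table with a header row.
workdir=$(mktemp -d)
printf 'CHR\tPOS\nX\t100\n1\tX\nX\t200\n' | gzip > "$workdir/input.txt.gz"

# Decompress, rewrite X in column 1 (skipping the header), recompress.
gzip -dc "$workdir/input.txt.gz" |
    awk 'NR > 1 && $1 == "X" { sub(/^X/, "23") } 1' |
    gzip > "$workdir/output.txt.gz"

recoded=$(gzip -dc "$workdir/output.txt.gz")
printf '%s\n' "$recoded"
```

The X in the second column of the third line is left alone, because only lines whose first field is X are rewritten.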

