Awk One Liner Select Only Rows Based on Value of a Column

awk -F"\t" '$2 == "victorian" { print $1"\t"$3 }' file.in
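
For instance, with hypothetical tab-separated data standing in for file.in (the contents here are made up for illustration), the one-liner keeps only the victorian rows and prints their first and third columns:

```shell
# Made-up tab-separated input piped in place of file.in
printf 'sofa\tvictorian\tred\nchair\tmodern\tblue\n' |
awk -F"\t" '$2 == "victorian" { print $1"\t"$3 }'
# prints: sofa	red
```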

Using awk to select rows with a specific value in column greater than x

Okay: if you have a series of files matching *input.file and you want to select the lines where $3 > 98, writing them to output files that keep the same prefix but end in output.file instead, you can use:

awk '$3 > 98 {
    match(FILENAME, /input\.file$/)
    print $0 > substr(FILENAME, 1, RSTART-1) "output.file"
}' *input.file

This uses match() to find the index where input.file begins, then uses substr() to take the part of the filename before that index and appends "output.file" to the substring for the final output filename.

match() sets RSTART to the index where input.file begins in the current filename, which substr() then uses to truncate the filename at that point. See GNU awk String Functions for complete details.
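
A standalone sketch of that filename surgery, using a hypothetical v1input.file name held in a plain variable rather than the real FILENAME:

```shell
awk 'BEGIN {
    fname = "v1input.file"            # stand-in for FILENAME
    match(fname, /input\.file$/)      # sets RSTART to 3 here
    print substr(fname, 1, RSTART-1) "output.file"
}'
# prints: v1output.file
```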

For example, if you had these input files:

$ ls -1 *input.file
v1input.file
v2input.file

Both with your example content:

$ cat v1input.file
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20

Running the awk command above would result in two output files:

$ ls -1 *output.file
v1output.file
v2output.file

Containing the records where the third field was greater than 98:

$ cat v1output.file
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20

awk: Selecting rows based on the first 3 different values of a given column

Your question is a little confusing, but if I understand correctly you want to print any row whose day-of-week field matches one of the first 3 distinct values the script finds in the file. You could do that with awk like so:

BEGIN { FS="," }

{
    if (dayCount < 3 && !($2 in days)) { days[$2] = 1; ++dayCount }
    if ($2 in days) { print }
}
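
For example, with made-up data where the day is the second comma-separated field, running the script inline might look like:

```shell
printf '%s\n' 'a,Mon' 'b,Tue' 'c,Wed' 'd,Thu' 'e,Mon' |
awk 'BEGIN { FS="," }
{
    if (dayCount < 3 && !($2 in days)) { days[$2] = 1; ++dayCount }
    if ($2 in days) { print }
}'
# prints every row except d,Thu (Thu is the 4th distinct day seen)
```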

extract rows with specific value in a specific column using bash

Please read Correct Bash and shell script variable capitalization and https://mywiki.wooledge.org/Quotes to understand some of the issues in your script and copy/paste any shell script you write into https://www.shellcheck.net/ until you get the fundamentals down.

Regarding "But this produces empty files" - sure, for any given command cmd with

for f in *; do
cmd "$f" > "out$f"
done

you're creating an output file for each input file in the shell loop, so if any input file doesn't match $4==12 in your awk script (the cmd in this case) you'll still get an output file; it'll just be empty. If you don't want that you could do:

tmp=$(mktemp)
for f in *; do
cmd "$f" > "$tmp" &&
mv -- "$tmp" "out$f"
done

and write cmd to exit with a success/failure status like grep does when it finds a match (trivial in awk), or you could check the size of "$tmp" before the mv:

tmp=$(mktemp)
for f in *; do
cmd "$f" > "$tmp" &&
[[ -s "$tmp" ]] &&
mv -- "$tmp" "out$f"
done
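
Putting that together with the OP's $4==12 filter standing in for cmd (the *.txt glob and tab field separator are assumptions here):

```shell
tmp=$(mktemp)
for f in *.txt; do
    # only keep an output file if the filter produced any lines
    awk -F'\t' '$4 == 12' "$f" > "$tmp" &&
    [[ -s "$tmp" ]] &&
    mv -- "$tmp" "out$f"
done
rm -f -- "$tmp"
```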

You don't need a shell loop or other commands for this, though - just one call to awk to process all of your files at once. Using any awk in any shell on every Unix box, do only this:

awk -v file='random' -F'\t' '
FNR == 1 {
    close(out)
    f = FILENAME
    sub(".*/", "", f)
    out = "/home/user/extra/" file "/" f
}
$4 == 12 {
    print > out
}
' /home/user/data/*.txt

If you want a string comparison instead of a numeric one, so that 12. doesn't match 12, use $4 == "12" instead of $4 == 12.
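
A quick illustration of the difference, on a made-up one-line record whose fourth field is 12.:

```shell
printf 'a\tb\tc\t12.\n' |
awk -F'\t' '$4 == 12   { print "numeric: matches" }
            $4 == "12" { print "string: matches" }'
# prints only: numeric: matches
```

Because "12" is a string constant, the second comparison is done as a string comparison, and "12." is not the same string as "12".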

In the above, file is a poor choice of variable name for something that holds the name of a directory, but I left it alone to avoid changing anything I didn't have to.

Filter out rows from column A based on values in column B

With an array, assuming there are no duplicates in the first column:

awk -F ',' 'NR>1{
array[$1]++; array[$2]--
}
END{
for(i in array){ if(array[i]==1){ print i } }
}' file

As one line:

awk -F ',' 'NR>1{ array[$1]++; array[$2]-- } END{for(i in array){ if(array[i]==1){ print i } } }' file

Output:

esther@example.com
daisy@example.com
bill@example.com
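
For reference, a hypothetical input (not the OP's actual file) that would produce that output - each printed address appears once in the first column and never in the second:

```shell
cat <<'EOF' > file
keep,remove
esther@example.com,alice@example.com
daisy@example.com,bob@example.com
bill@example.com,alice@example.com
EOF
```

Note that for (i in array) iterates in an unspecified order, so the three addresses may print in any order.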

Loop through files in a directory and select rows based on column value using awk for large files

Assuming:

  • The input filenames have the form "*.txt".
  • The columns are separated by a tab character.
  • Each of the five values is compared with the target column (COL_ABC or COL_DEF) one by one, and an individual
    result file is created per value, so 15 x 5 = 75 files will be created. (If this is not what you want, please let me know.)

Then would you please try:

awk -F"\t" '
BEGIN {
    values["a123"]                      # assign values
    values["b234"]
    values["c345"]
    values["d456"]
    values["e567"]
}
FNR==1 {                                # header line
    for (i in values) {                 # loop over values
        if (outfile[i] != "") close(outfile[i])   # close previous file
        outfile[i] = "result_" i "_" FILENAME     # filename to create
        print > outfile[i]              # print the header
    }
    abc = def = 0                       # reset the indexes
    for (i = 1; i <= NF; i++) {         # loop over the column names
        if ($i == "COL_ABC") abc = i         # "COL_ABC" found: remember its index
        else if ($i == "COL_DEF") def = i    # "COL_DEF" found: remember its index
    }
    next
}
{
    for (i in values) {
        if (abc > 0 && $abc == i || def > 0 && $def == i)
            print > outfile[i]          # the abc-th or def-th column matches i
    }
}
' *.txt

If your 15 text files are located in the directory, e.g. /path/to/the/dir/ and you want to specify the directory as an argument, change the *.txt in the last line to /path/to/the/dir/*.txt.

Getting all values of various rows which have the same value in one column with awk

You may use this GNU awk program:

awk '
BEGIN {
    FS = OFS = ","
    printf "%s,%s,%s,%s\n", "contact", "mail", "count", "nodes"
}
NR > 1 {
    ++counts[$3]                        # increment count of lines
    name[$3] = $2
    map[$3] = ($3 in map ? map[$3] " " : "") $1
}
END {
    # Iterate over all third-column values in sorted order.
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (k in counts)
        print name[k], k, counts[k], map[k]
}
' test-file.csv

Output:

contact,mail,count,nodes
Dieter,dieter@anything.com,1,CCCC
Hans,hans@anything.com,2,BBBB CCDDA
Peter,peter@anything.com,2,AAAA ABABA

how to select rows with max value based on one column and group by second column using awk?

Credit to sofan's idea in the comments; with some extra manipulation, the logic below will do the trick as the OP wants.

sort -r -n -k2 < file | awk '!x[$1 FS $3]++' | sort -k1
  • The idea is to sort the file numerically on column 2 (sort -n -k2) in reverse order (-r), so the largest values come first.
  • awk '!x[$1 FS $3]++' then keeps only the first line seen for each unique combination of columns 1 and 3, and the final sort -k1 orders the result by column 1.
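
A small made-up demonstration, where column 1 is the group, column 2 is the value to maximize, and column 3 is a label:

```shell
printf '%s\n' 'a 1 x' 'a 5 x' 'b 2 y' 'b 7 y' |
sort -r -n -k2 | awk '!x[$1 FS $3]++' | sort -k1
# prints:
# a 5 x
# b 7 y
```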

Select row and element in awk

To print the second line:

awk 'FNR == 2 {print}'

To print the second field:

awk '{print $2}'

To print the third field of the fifth line:

awk 'FNR == 5 {print $3}'

Here's an example with a header line and (redundant) field descriptions:

awk 'BEGIN {print "Name\t\tAge"}  FNR == 5 {print "Name: "$3"\tAge: "$2}'

There are better ways to align columns than "\t\t" by the way.
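
For instance, printf with a field width left-justifies each column to a fixed width (a sketch, using the same assumed fields as above):

```shell
awk 'BEGIN { printf "%-10s%s\n", "Name", "Age" }
     FNR == 5 { printf "%-10s%s\n", $3, $2 }'
```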

Use exit to stop as soon as you've printed the desired record if there's no reason to process the whole file:

awk 'FNR == 2 {print; exit}'

Filtering Rows Based On Number of Columns with AWK

You need to use the NF (number of fields) variable to control the actions, such as in the following transcript:

$ echo '0333 foo
> bar
> 23243 qux' | awk 'NF==2{print}{}'
0333 foo
23243 qux

This will print the line if the number of fields is two, otherwise it will do nothing. The reason I have the (seemingly) strange construct NF==2{print}{} is because some implementations of awk will print by default if no rules are matched for a line. The empty command {} guarantees that this will not happen.

If you're lucky enough to have one of those that doesn't do this, you can get away with:

awk 'NF==2'

but the first solution above will work in both cases.


