Awk: combine two files
This solution doesn't hardcode the assumption that there are 3 extra fields in File2:
awk '
BEGIN { FS = OFS = "\t" }
NR == FNR {                 # first file (file2): stash its extra fields by key
    key = $1
    $1 = ""                 # emptying $1 rebuilds $0, leaving a leading OFS
    store[key] = $0
    num_extra_fields = NF - 1
    next
}
FNR == 1 {                  # header of file1: extend it with ValueN columns
    printf "%s", $0
    for (i = 1; i <= num_extra_fields; i++)
        printf "%sValue%d", OFS, i + (NF - 1)
    print ""
    next
}
$1 in store {               # key found in file2: append its stored fields
    print $0 store[$1]
    next
}
{                           # key missing from file2: pad with "-"
    for (i = 1; i <= num_extra_fields; i++)
        $(++NF) = "-"
    print
}
' file2 file1
The output below looks a bit odd due to how Stack Overflow displays tabs:
Key Value1 Value2 Value3 Value4
A 10000 - - -
B 20000 20000 10000 50000
C 30000 20000 10000 50000
D 40000 - - -
To fix your code, you need to keep track of the keys in file2 that update the results. Change
{ for (i=2; i <= NF; i++) result[key] = result[key] FS $i }
to
{ updated[key]=1; for (i=2; i <= NF; i++) result[key] = result[key] FS $i }
and, in the END block, change
for (key in result) print result[key]
to
for (key in result) {
if (!(key in updated)) result[key] = result[key] FS "-" FS "-" FS "-"
print result[key]
}
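For reference, here is a minimal self-contained sketch with the two changes slotted in. It is only a sketch: the OP's full script isn't shown, so the seeding of result from file1, the key variable, and the file names are assumptions based on the snippets above.
awk '
NR == FNR { result[$1] = $0; next }   # assumed: file1 seeds result, keyed on $1
{
    key = $1; updated[key] = 1        # remember keys that file2 updates
    for (i = 2; i <= NF; i++) result[key] = result[key] FS $i
}
END {
    for (key in result) {
        if (!(key in updated)) result[key] = result[key] FS "-" FS "-" FS "-"
        print result[key]
    }
}' file1 file2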
How to merge two files using AWK?
$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
4046 S00344 31322 4000 1
Explanation (partly based on another question; a bit late though):
FNR refers to the record number (typically the line number) in the current file, and NR refers to the total record number. The operator == is a comparison operator, which returns true when the two surrounding operands are equal. So FNR==NR{commands} means that the commands inside the braces are only executed while processing the first file (file2 here).
FS refers to the field separator, and $1, $2, etc. are the 1st, 2nd, etc. fields in a line. a[$1]=$2 FS $3 fills a dictionary/array (named a), using $1 as the key and $2 FS $3 as the value. The ; separates the commands, and next means that any remaining commands are skipped for the current line (processing continues on the next line).
$0 is the whole line, so {print $0, a[$1]} simply prints the whole line followed by the value of a[$1] (if $1 is in the dictionary; otherwise only $0 is printed). This block is only executed for the 2nd file (file1 here), because of the FNR==NR{...;next} above.
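To see FNR versus NR side by side, a quick throwaway experiment (the file names f1 and f2 are made up for the demo):
$ printf 'a\nb\n' > f1; printf 'x\ny\nz\n' > f2
$ awk '{print FILENAME, NR, FNR}' f1 f2
f1 1 1
f1 2 2
f2 3 1
f2 4 2
f2 5 3
FNR==NR holds only while the first file is being read (assuming it isn't empty), which is exactly what the idiom exploits.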
Merge two files together with awk
If you're OK using GNU tools, you can merge both files with the join, sort and column commands:
$ join -1 1 -2 6 -o "2.1 2.2 2.3 2.4 2.5 2.6 2.7 1.3 1.4" <(sort file1) <(sed 's/ \+/ /g' file2 | sort -k6) | column -t
APC 5 112838773 ENST00000257430 15 c.000A>C p.Gln1062Ter p.Thr102 E
APC 5 1128395514 ENST00000257430 15 c.001A>C p.Glu1309AspfsT p.His103 A
APC 5 112835056 ENST00000507379 13 c.001A>C p.Val599Phe p.His103 A
The join command merges the files based on the first column of file1 and the 6th column of file2. join expects sorted input on both files, which is what the <(...) process substitutions provide. The -o option lists all the columns you want to display. Note the sed command removes duplicate whitespace so that sort sees the right column numbers (on the 2nd file). Finally, column -t displays the fields nicely in columns.
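To see the core mechanic in isolation, here is a toy run (the files A and B are made up):
$ printf 'k1 a\nk2 b\n' > A; printf 'k1 x\nk2 y\n' > B
$ join A B
k1 a x
k2 b y
By default, join matches on the first field of each file and prints the join field followed by the remaining fields of both lines.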
Merge two files using awk in Linux
If your actual input file(s) are the same as the samples shown, then the following awk may help:
awk -v s1="||o||" '
FNR==NR{                    # 1.txt: collect fields keyed on $9
  a[$9]=$1 s1 $5;
  b[$9]=$13 s1 $17 s1 $21;
  next
}
($1 in a){                  # 2.txt: print only ids also present in 1.txt
  print a[$1] s1 $2 FS $3 s1 b[$1]
}
' FS="|" 1.txt FS=":" 2.txt
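The trailing FS="|" 1.txt FS=":" 2.txt assignments set a different field separator for each input file; each assignment takes effect just before the file that follows it is read. A minimal demonstration (p.txt and q.txt are made-up names):
$ printf 'a|b\n' > p.txt; printf 'x:y\n' > q.txt
$ awk '{print $2}' FS="|" p.txt FS=":" q.txt
b
y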
EDIT: Since the OP changed the requirement a bit, here is code for the new ask. In addition to the merged output, it creates 2 files: one with the ids present in 1.txt but NOT in 2.txt, and the other vice versa.
awk -v s1="||o||" '
FNR==NR{                    # 1.txt: collect fields keyed on $9
  a[$9]=$1 s1 $5;
  b[$9]=$13 s1 $17 s1 $21;
  c[$9]=$0;                 # keep the whole line for the leftover report
  next
}
($1 in a){                  # id present in both files
  val=$1;
  $1="";
  sub(/:/,"");              # drop the separator left behind by emptying $1
  print a[val] s1 $0 s1 b[val];
  d[val]=$0;                # remember which ids matched
  next
}
{
  print > "NOT_present_in_2.txt"
}
END{
  for(i in d){
    delete c[i]             # drop matched ids from the 1.txt set
  };
  for(j in c){
    print j,c[j] > "NOT_present_in_1.txt"
  }
}
' FS="|" 1.txt FS=":" OFS=":" 2.txt
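The END block boils down to a common pattern: remember every key from the first file, delete the ones the second file matched, and whatever remains was never matched. Stripped to its core (the file names and output name here are placeholders):
$ awk 'NR==FNR{seen[$1]=$0; next} $1 in seen{delete seen[$1]} END{for (k in seen) print seen[k] > "unmatched.txt"}' first second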
How to merge two files with awk?
You can use the join command:
$ join file1.txt file2.txt
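Keep in mind that join expects both inputs to be sorted on the join field; if yours aren't, a common pattern is to sort them on the fly:
$ join <(sort file1.txt) <(sort file2.txt)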
How can I merge two files by column with awk?
The way to efficiently do what your updated question describes:
Suppose I have a directory with pairs of files, ending with two
extensions: .ext1 and .ext2. Those files have parameters included in
their names, for example file_0_par1_par2.ext1 has its pair,
file_0_par1_par2.ext2. Each file contains 5 values. I have a function
to extract its serial number and its parameters from its name. My goal
is to write, on a single csv file (file_out.csv), the values present
in the files along with the parameters extracted from their names.
for file1 in *.ext1 ; do
for file2 in *.ext2 ; do
# for each file ending with .ext2, verify if it is file1's corresponding pair
# I know this is extremely time inefficient, since it's a O(n^2) operation, but I couldn't find another alternative
if [[ "${file1%.*}" == "${file2%.*}" ]] ; then
# extract file_number, and par1, par2 based on some conditions, then append to the csv file
paste -d ',' "$file1" "$file2" | while IFS="," read -r var1 var2;
do
echo "$par1,$par2,$var1,$var2,$file_number" >> "file_out.csv"
done
fi
done
done
would be (untested):
for file1 in *.ext1; do
base="${file1%.*}"
file2="${base}.ext2"
paste -d ',' "$file1" "$file2" |
awk -v base="$base" '
BEGIN { split(base,b,/_/); FS=OFS="," }
{ print b[3], b[4], $1, $2, b[2] }
'
done > 'file_out.csv'
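To check how split() carves the parameters out of the basename, a quick test with a made-up name of the stated form:
$ awk -v base="file_0_par1_par2" 'BEGIN{split(base,b,/_/); print b[2], b[3], b[4]}'
0 par1 par2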
Doing base="${file1%.*}"; file2="${base}.ext2" is by itself N^2 times more efficient (given N pairs of files) than for file2 in *.ext2 ; do if [[ "${file1%.*}" == "${file2%.*}" ]] ; then, and doing | awk '...' is by itself an order of magnitude more efficient than | while IFS="," read -r var1 var2; do echo ...; done (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice), so you can expect a huge improvement in performance over your existing script.
How can we combine two files based on a condition with the awk command?
In awk:
$ awk '
NR==FNR { # process location.txt
a[$1]=$2 OFS $3 # hash using $1 as key
next # next record
}
$4 in a { # process data.txt
print $0,a[$4] # output record and related location
}' location.txt data.txt # mind the file order
2004-03-31 03:38:15.757551 2 1 122.153 -3.91901 11.04 2.03397 21.5 23
2004-02-28 00:59:16.02785 3 2 19.9884 37.0933 45.08 2.69964 24.5 20
2004-02-28 01:03:16.33393 11 3 19.3024 38.4629 45.08 2.68742 19.5 19
2004-02-28 01:06:16.013453 17 4 19.1652 38.8039 45.08 2.68742 22.5 15
2004-02-28 01:06:46.778088 18 5 19.175 38.8379 45.08 2.69964 24.5 12
2004-02-28 01:08:45.992524 22 6 19.1456 38.9401 45.08 2.68742 19.5 12
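If you also need the data.txt records whose $4 has no entry in location.txt, a variant with a ternary substitutes placeholders for the missing location (the "- -" padding is an assumption about the desired output):
$ awk 'NR==FNR{a[$1]=$2 OFS $3; next} {print $0, ($4 in a ? a[$4] : "- -")}' location.txt data.txt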
Using AWK to merge two files based on multiple conditions
Please try this (GNU awk):
awk 'BEGIN{RS="\r\n";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}'
This is where the BEGIN block kicks in, and where OFS kicks in too. When we print many fields separated by the same string, we can set OFS and simply put commas between the things we want to print.
There's no need to check for the key in arr once you've assigned a value for that key in the array: when arr[somekey] hasn't been assigned before, it's empty ("") by default, which evaluates to false in awk (0 in scalar context), while a non-empty string evaluates to true. (There's no literal true and false in awk.)
(You also used the wrong array name: $2,$6,$7 is the key in the array arr here. It's confusing to use key as an array name.)
You can test a simple concept like this:
awk 'BEGIN{print arr["newkey"]}'
You don't need an input file to execute the BEGIN block.
Also, you can use quotes sometimes to avoid confusion and underlying problems.
Update:
Your files actually end in \n. If you can't be sure what the line ending is, use this:
awk 'BEGIN{RS="\r\n|\n|\r";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv
or this (this one will ignore empty lines):
awk 'BEGIN{RS="[\r\n]+";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv
Also, it's better to convert the files first to avoid such situations:
sed -i 's/\r//' files
Or you can use the dos2unix command:
dos2unix file
It's a handy command-line tool that does just this conversion; you can install it if you don't have it on your system yet. Once converted, you don't need to assign RS in normal situations.
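Before reaching for a multi-ending RS, it can help to check what line endings a file actually has. Two quick probes (file_a.csv stands in for your file; cat -A assumes GNU coreutils):
$ file file_a.csv                 # reports "with CRLF line terminators" for \r\n files
$ cat -A file_a.csv | head -n 1   # CRLF lines show a trailing ^M$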