How to Filter Data Between 2 Dates with Awk in a Bash Script

Using awk to check between two dates

The key observation is that you can compare your timestamps using alphanumeric comparisons and get the correct answer - that is the beauty of ISO 8601 notation.

Thus, adapting your code slightly - and formatting to avoid scroll bars:

awk 'BEGIN {
    FS = "\n"
    RS = ""
    OFS = ";"
    ORS = "\n"
    t1 = "2010-03-23T07:45:00"
    t2 = "2010-03-23T08:00:00"
    m1 = "eventTimestamp: " t1
    m2 = "eventTimestamp: " t2
}
$1 ~ /eventTimestamp:/ && $4 ~ /SMS-MO-FSM(-INFO)?$/ {
    if ($1 >= m1 && $1 <= m2) print $1, $2, $3, $4;
}' "$@"

Obviously, you could put this into a script file - you wouldn't want to type it often. And getting the date range entered accurately and conveniently is one of the hard parts. Note that I've adjusted the time range to match the data.

When run on the sample data, it outputs one record:

eventTimestamp: 2010-03-23T07:56:19.186;result: Allowed;protocol: SMS;payload: SMS-MO-FSM
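
Because entering the range is the awkward part, one option is to pass both bounds in from the shell with -v instead of hard-coding them. The following is only a minimal sketch under that assumption: the wrapper name filter_events, its argument order, and the sample records are made up for illustration, while the record layout (blank-line separated, four lines each) is the one used above.

```shell
# filter_events START END [FILE...]
# Hypothetical wrapper, not part of the original answer.
# START and END are ISO 8601 timestamps; both bounds are inclusive.
filter_events() {
    t1=$1 t2=$2
    shift 2
    awk -v t1="$t1" -v t2="$t2" '
    BEGIN {
        FS = "\n"; RS = ""; OFS = ";"
        m1 = "eventTimestamp: " t1
        m2 = "eventTimestamp: " t2
    }
    # Plain string comparison is enough: ISO 8601 sorts lexicographically.
    $1 ~ /^eventTimestamp:/ && $4 ~ /SMS-MO-FSM(-INFO)?$/ && $1 >= m1 && $1 <= m2 {
        print $1, $2, $3, $4
    }' "$@"
}

# Made-up sample records in the assumed layout (an empty argument yields the blank separator line).
printf '%s\n' \
    'eventTimestamp: 2010-03-23T07:56:19.186' 'result: Allowed' 'protocol: SMS' 'payload: SMS-MO-FSM' '' \
    'eventTimestamp: 2010-03-23T09:10:00.000' 'result: Allowed' 'protocol: SMS' 'payload: SMS-MO-FSM' |
    filter_events 2010-03-23T07:45:00 2010-03-23T08:00:00
```

With no file arguments the function reads standard input, so it can sit at the end of a pipeline as shown.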

How to select date range in awk

The answer is that awk does not have any knowledge of what a date is. Awk knows numbers and strings and can only compare those. So when you want to select dates and times you have to ensure that the date-format you compare is sortable and there are many formats out there:

| type       | example                   | sortable |
|------------+---------------------------+----------|
| ISO-8601   | 2019-11-19T10:05:15       | string   |
| RFC-2822   | Tue, 19 Nov 2019 10:05:15 | not      |
| RFC-3339   | 2019-11-19 10:05:15       | string   |
| Unix epoch | 1574157915                | numeric  |
| AM/PM      | 2019-11-19 10:05:15 am    | not      |
| MM/DD/YYYY | 11/19/2019 10:05:15       | not      |
| DD/MM/YYYY | 19/11/2019 10:05:15       | not      |

So you would have to convert your non-sortable formats into a sortable format, mainly using string manipulations. A template awk program that would achieve what you want is written down here:

# function to convert a string into a sortable format
function convert_date(str) {
return sortable_date
}
# function to extract the date from the record
function extract_date(str) {
return extracted_date
}
# convert the range
(FNR==1) { t1 = convert_date(begin); t2 = convert_date(end) }
# extract the date from the record
{ date_string = extract_date($0) }
# convert the date of the record
{ t = convert_date(date_string) }
# make the selection
(t1 <= t && t < t2) { print }

Most of the time, this program can be heavily reduced. If the above is stored in extract_date_range.awk, you could run it as:

$ awk -f extract_date_range.awk begin="date-in-known-format" end="date-in-known-format" logfile

Note: the above assumes single-line log entries. With a minor adaptation, you can process multi-line log entries.
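
To make the template concrete, here is a minimal sketch that fills in both functions for one case. It assumes every log line starts with a MM/DD/YYYY HH:MM:SS timestamp and that the range bounds use the same format; that format, the wrapper name select_range, and the sample lines are assumptions for illustration only. The bounds are passed with -v so they can be converted once in BEGIN.

```shell
# select_range BEGIN END [FILE...]
# Hypothetical helper: keeps lines with BEGIN <= timestamp < END.
select_range() {
    b=$1 e=$2
    shift 2
    awk -v begin="$b" -v end="$e" '
    # "11/19/2019 10:05:15" -> "20191119100515" (fixed width, so it sorts as a string)
    function convert_date(str,    a) {
        split(str, a, "[/ :]")
        return sprintf("%04d%02d%02d%02d%02d%02d", a[3], a[1], a[2], a[4], a[5], a[6])
    }
    # the timestamp is assumed to be the first 19 characters of the line
    function extract_date(str) { return substr(str, 1, 19) }

    BEGIN { t1 = convert_date(begin); t2 = convert_date(end) }
    { t = convert_date(extract_date($0)) }
    (t1 <= t && t < t2) { print }
    ' "$@"
}

# Made-up sample lines; the range crosses a year boundary on purpose.
printf '%s\n' \
    '11/18/2019 23:59:59 too early' \
    '12/31/2019 08:00:00 old year' \
    '01/05/2020 09:30:00 new year' \
    '03/01/2020 00:00:00 too late' |
    select_range '11/19/2019 00:00:00' '02/01/2020 00:00:00'
```

Comparing the raw MM/DD/YYYY strings would put 01/05/2020 before 11/19/2019, which is why the conversion step is needed.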


In the original problem, the following formats were presented:

EEE MMM dd yy HH:mm     # not sortable
EEE MMM dd HH:mm        # not sortable
yyyy-MM-dd hh:mm        # sortable
dd MMM yyyy HH:mm:ss    # not sortable

From the above, all but the second format can be easily converted to a sortable format. The second format lacks the year, so we would have to infer it with an elaborate check based on the day of the week. This is extremely difficult and never 100% bulletproof.

Excluding the second format, we can write the following functions:

BEGIN {
    # EEE MMM dd yy HH:mm
    datefmt1="^[A-Za-z][A-Za-z][A-Za-z] [A-Za-z][A-Za-z][A-Za-z] [0-9][0-9] [0-9][0-9] [0-9][0-9]:[0-9][0-9]"
    # yyyy-MM-dd hh:mm
    datefmt3="^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]"
    # dd MMM yyyy HH:mm:ss
    datefmt4="^[0-9][0-9] [A-Za-z][A-Za-z][A-Za-z] [0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]"
}
# convert the range
(FNR==1) { t1 = convert_date(begin); t2 = convert_date(end) }
# extract the date from the record
{ date_string = extract_date($0) }
# skip if date string is empty
(date_string == "") { next }
# convert the date of the record
{ t = convert_date(date_string) }
# make the selection
(t1 <= t && t < t2) { print }

# function to extract the date from the record
# (the formats are anchored, so the date must start the line)
function extract_date(str,   date_string) {
    date_string=""
    if (match(str,datefmt1)) { date_string=substr(str,RSTART,RLENGTH) }
    else if (match(str,datefmt3)) { date_string=substr(str,RSTART,RLENGTH) }
    else if (match(str,datefmt4)) { date_string=substr(str,RSTART,RLENGTH) }
    return date_string
}
# function to convert a string into a sortable format
# converts it in the format YYYYMMDDhhmmss
function convert_date(str,   a, YYYY,MM,DD,T, sortable_date) {
    sortable_date=""
    if (match(str,datefmt1)) {
        split(str,a,"[ ]")
        # two-digit year: 00-69 -> 20xx, 70-99 -> 19xx
        YYYY=(a[4] < 70 ? "20" : "19")a[4]
        MM=get_month(a[2]); DD=a[3]
        T=a[5]; gsub(/[^0-9]/,"",T); T=T"00"   # HH:mm -> HHmm00
        sortable_date = YYYY MM DD T
    }
    else if (match(str,datefmt3)) {
        sortable_date = str"00"                # add the missing seconds
        gsub(/[^0-9]/,"",sortable_date)
    }
    else if (match(str,datefmt4)) {
        split(str,a,"[ ]")
        YYYY=a[3]
        MM=get_month(a[2]); DD=a[1]
        T=a[4]; gsub(/[^0-9]/,"",T)            # HH:mm:ss -> HHmmss
        sortable_date = YYYY MM DD T
    }
    return sortable_date
}
# function to convert Jan->01, Feb->02, Mar->03 ... Dec->12
function get_month(str) {
    return sprintf("%02d",(match("JanFebMarAprMayJunJulAugSepOctNovDec",str)+2)/3)
}

ISO 8601 was published on 06/05/88 and most recently amended on 12/01/04.

Awk between two dates in a logfile - almost working

You should transform the date into the format YYYYMMDD so that it can be ordered lexicographically. You can do it with gawk and a regex, or with substring operations in plain awk. Here is the gawk way:

more text_B_14_FEB_03.dt | grep TMYO | gawk 'match($5, "([0-9]+)/([0-9]+)/([0-9]+)", ary) {
    B = ary[3] ary[2] ary[1]; if (B < 20140213 && B > 20130104) print }'
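
For comparison, the substring route works in any POSIX awk, because split() does not need gawk's array form of match(). The sketch below assumes, as the gawk version implies, that the fifth field holds a DD/MM/YYYY date; the function name between_dates and the sample lines are hypothetical. It builds a numeric key, so days and months do not have to be zero-padded, and it folds the grep for TMYO into the awk pattern.

```shell
# between_dates LO HI [FILE...]
# Hypothetical helper; LO and HI are YYYYMMDD and both bounds are exclusive.
between_dates() {
    lo=$1 hi=$2
    shift 2
    awk -v lo="$lo" -v hi="$hi" '
    /TMYO/ && split($5, d, "/") == 3 {
        # numeric key YYYYMMDD built from day/month/year
        key = d[3] * 10000 + d[2] * 100 + d[1]
        if (key > lo && key < hi) print
    }' "$@"
}

# Made-up sample lines with the date in field 5.
printf '%s\n' \
    'r1 x y TMYO 03/01/2013' \
    'r2 x y TMYO 15/06/2013' \
    'r3 x y TMYO 5/1/2014' \
    'r4 x y TMYO 13/02/2014' \
    'r5 x y OTHER 15/06/2013' |
    between_dates 20130104 20140213
```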

awk filter values from array based on date validation and print correct output if there is match with text at START and END including match

awk -v d="$(date --date="7 days ago" "+%Y%m%d")" 'BEGIN{ i=999999 }$6 < d && i >=$3{ if(i>$3){ if (i!=999999) print "END"; print "START" }; print $0; i=$3 }END{ print "END"}' file1

output:

START
A B 25320 FX M.1 20200429
A B 25320 FX M.1 20200421
A B 25320 FX M.1 20200429
A B 25320 FX M.1 20200423
END
START
A B 25276 FX M.1 20200421
A B 25276 FX M.1 20200328
A B 25276 FX M.1 20200328
A B 25276 FX M.1 20200328
A B 25276 FX M.1 20200328
A B 25276 FX M.1 20200328
A B 25276 FX M.1 20200423
A B 25276 FX M.1 20200423
A B 25276 FX M.1 20200423
A B 25276 FX M.1 20200423
A B 25276 FX M.1 20200423
A B 25276 FX M.1 20200423
END
START
A B 25172 FX M.1 20200421
END
START
A B 25060 FX M.1 20200421
END

Filter lines containing date between a range in csv file in shell

In awk:

$ cat program.awk
function mkdt(str) { # functionize dt conversion
split(str, a, "[/ ]") # split dt
return sprintf( "%s-%02d-%02d %s\n" ,a[3], a[2], a[1], a[4]) # zeropad and reorganize
}
mkdt($3) > mkdt(start) && mkdt($3) < mkdt(end) # compare and print

Run it:

$ awk -v start="10/2/2016 23:00" -v end="11/2/2016 20:45" -F, -f program.awk temp.csv
ABHA_BSC,11DPM12-1-7-C1,10/2/2016 23:15,6623893225,42756482355,Juniper_GBE_ABHA_BSC-1-7-C1_JIZAN-1-7-C1_JIZ1AH1-01 | (SOUTHERN_ABHA_ABH0027-MX480-1 TO SOUTHERN_JIZAN_JIZ0005-MX104-1),1GbE
ABHA_BSC,11DPM12-1-7-C1,10/2/2016 23:30,6781639211,44625787536,Juniper_GBE_ABHA_BSC-1-7-C1_JIZAN-1-7-C1_JIZ1AH1-01 | (SOUTHERN_ABHA_ABH0027-MX480-1 TO SOUTHERN_JIZAN_JIZ0005-MX104-1),1GbE
ABHA_BSC,11DPM12-1-7-C1,10/2/2016 23:45,6586403766,41882620412,Juniper_GBE_ABHA_BSC-1-7-C1_JIZAN-1-7-C1_JIZ1AH1-01 | (SOUTHERN_ABHA_ABH0027-MX480-1 TO SOUTHERN_JIZAN_JIZ0005-MX104-1),1GbE
ABHA_BSC,11DPM12-1-7-C11,10/2/2016 23:15,8440733035,54114599426,Juniper_GBE_ABHA_BSC-1-7-C11_JIZAN-1-7-C11_JIZ1AH1-03 | (SOUTHERN_ABHA_ABH0027-MX480-2 TO SOUTHERN_JIZAN_JIZ0005-MX104-2),1GbE
ABHA_BSC,11DPM12-1-7-C11,10/2/2016 23:30,8051347485,49383381691,Juniper_GBE_ABHA_BSC-1-7-C11_JIZAN-1-7-C11_JIZ1AH1-03 | (SOUTHERN_ABHA_ABH0027-MX480-2 TO SOUTHERN_JIZAN_JIZ0005-MX104-2),1GbE

I only zero-pad the day and month (1/1/2016 -> 2016-01-01), not the hours or minutes. There is no sanity checking for missing or distorted datetimes. Add = to the comparisons if needed (i.e. > -> >=).
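
If the data can contain single-digit hours or malformed values, both caveats can be handled inside the conversion function. This is a sketch under assumptions, not the original program: it assumes D/M/YYYY H:MM in the third column, and the wrapper name filter_csv and the sample rows are invented.

```shell
# filter_csv START END [FILE...]
# Hypothetical wrapper; keeps rows with START < column 3 < END.
filter_csv() {
    s=$1 e=$2
    shift 2
    awk -F, -v start="$s" -v end="$e" '
    # "10/2/2016 9:15" -> "2016-02-10 09:15"; returns "" when the shape is wrong
    function mkdt(str,    a) {
        if (split(str, a, "[/ :]") != 5) return ""
        return sprintf("%04d-%02d-%02d %02d:%02d", a[3], a[2], a[1], a[4], a[5])
    }
    BEGIN { lo = mkdt(start); hi = mkdt(end) }
    { dt = mkdt($3) }
    dt != "" && dt > lo && dt < hi      # rows with an unparsable date are skipped
    ' "$@"
}

# Made-up rows: one single-digit hour, one out of range, one malformed.
printf '%s\n' \
    'x,y,10/2/2016 9:15,1' \
    'x,y,10/2/2016 23:15,2' \
    'x,y,not a date,3' \
    'x,y,10/2/2016 10:00,4' |
    filter_csv '10/2/2016 8:00' '10/2/2016 10:30'
```

Without the hour padding, "9:15" would compare as later than "10:30", so the first row would be dropped.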

How to filter csv file by date column using awk whenever date format constraint does not match date format column?

You can use a regex to match the start of your field, i.e. match the first 10 characters (YYYY-MM-DD) of the field.

today=$(date '+%Y-%m-%d')
awk -v regex="^$today" -F';' '$25 ~ regex' input.csv > today.csv

This passes the value of the $today variable with -v to awk and prepends a ^ to match the start of the field.
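
The same prefix idea extends from a single day to a range of days: take the first 10 characters of the field with substr() and compare them as strings. The following is a sketch only; the helper name rows_on_days, the column-number parameter, and the sample rows are assumptions, and the field is assumed to start with YYYY-MM-DD.

```shell
# rows_on_days FIRST LAST COL [FILE...]
# Hypothetical helper; FIRST and LAST are YYYY-MM-DD, both inclusive.
rows_on_days() {
    first=$1 last=$2 col=$3
    shift 3
    awk -F';' -v first="$first" -v last="$last" -v col="$col" '
    {
        day = substr($col, 1, 10)       # keep only the YYYY-MM-DD prefix
        if (day >= first && day <= last) print
    }' "$@"
}

# Made-up rows with the timestamp in column 2.
printf '%s\n' \
    'a;2024-01-01 23:59:59' \
    'b;2024-01-02 00:00:01' \
    'c;2024-01-03 12:00:00' \
    'd;2024-01-04 00:00:00' |
    rows_on_days 2024-01-02 2024-01-03 2
```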

Awk to find lines within date range in a file with custom date format

With awk, where 0101 is January 1st and 0210 is February 10th:

awk -v start="0101" -v stop="0210" \
'BEGIN{m["Jan"]="01"; m["Feb"]="02"; m["Mar"]="03"; m["Apr"]="04"}
{original = $0; $1 = m[$1]; $2 = sprintf("%.2d", $2)}
$1$2 >= start && $1$2 <= stop {print original}' file

Output:


Jan 5 11:34:00 log messages here
Jan 13 16:21:00 log messages here
Feb 1 01:14:00 log messages here
Feb 10 16:32:00 more messages
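
The month table above stops at April, which is enough for this sample. A sketch that fills in all twelve months with split() and leaves the input line untouched could look like this; the function name month_range and the sample lines are made up, and because the log has no year the range cannot wrap around New Year.

```shell
# month_range START STOP [FILE...]
# Hypothetical helper; START and STOP are MMDD, both inclusive.
month_range() {
    start=$1 stop=$2
    shift 2
    awk -v start="$start" -v stop="$stop" '
    BEGIN {
        # build Jan -> "01" ... Dec -> "12"
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
        for (i = 1; i <= n; i++) m[name[i]] = sprintf("%02d", i)
    }
    ($1 in m) {
        key = m[$1] sprintf("%02d", $2)     # e.g. "Jan 5" -> "0105"
        if (key >= start && key <= stop) print
    }' "$@"
}

# Made-up log lines in the same layout as the sample output.
printf '%s\n' \
    'Dec 30 10:00:00 old message' \
    'Jan 5 11:34:00 log messages here' \
    'Feb 10 16:32:00 more messages' \
    'Nov 2 08:00:00 later message' |
    month_range 0101 0210
```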

Filter dates within a text file

Using GNU awk for time functions:

$ cat tst.awk
BEGIN {
    tgtDays = 10
    tgtSecs = tgtDays * 24 * 60 * 60

    endTime = strftime("%Y %m %d 12 00 00")
    endSecs = mktime(endTime,1)
}
{
    mthNr = (index("JanFebMarAprMayJunJulAugSepOctNovDec",$4)+2)/3
    begTime = sprintf("%04d %02d %02d 12 00 00", $7, mthNr, $5)
    begSecs = mktime(begTime,1)
}
(endSecs - begSecs) < tgtSecs


$ awk -f tst.awk sample.txt
system system_data8 Thu Jul 29 22:36:38 2021

Note that the script replaces the time of day with noon in both the input data and the current time. When you count the days between 2 dates by converting each timestamp to seconds since the epoch and comparing against (or dividing by) the number of seconds in a day, both timestamps must use the same time of day; otherwise the time-of-day difference can throw the day count off.

For example, the following tries to determine whether 2 dates which ARE 10 days apart are less than 10 days apart. The full timestamps are only 9 days 23 hours (860400 seconds) apart, which is below the 10-day threshold of 864000 seconds:

$ cat diffDatesDemo.awk
BEGIN {
    tgtDays = 10
    tgtSecs = tgtDays * 24 * 60 * 60

    begTime = "2021/08/01 09:00:00"
    endTime = "2021/08/11 08:00:00"

    begDate = gensub(/([ :][0-9]{2}){3}$/,"",1,begTime)
    endDate = gensub(/([ :][0-9]{2}){3}$/,"",1,endTime)

    print "Is", begTime, "less than", tgtDays, "days before", endTime "?"

    ####
    print "\nWrong: Compare 2 timestamps including date plus time of day:"
    begSecs = mktime(gensub("[/:]"," ","g",begTime),1)
    endSecs = mktime(gensub("[/:]"," ","g",endTime),1)

    print begDate, "->", endDate, "is", ((endSecs - begSecs) < tgtSecs ? "<" : ">="), tgtDays, "days"
    ####

    ####
    print "\nRight: Compare 2 dates at the same time each day:"
    begSecs = mktime(gensub("[/:]"," ","g",begDate)" 12 00 00",1)
    endSecs = mktime(gensub("[/:]"," ","g",endDate)" 12 00 00",1)

    print begDate, "->", endDate, "is", ((endSecs - begSecs) < tgtSecs ? "<" : ">="), tgtDays, "days"
    ####
}


$ awk -f diffDatesDemo.awk
Is 2021/08/01 09:00:00 less than 10 days before 2021/08/11 08:00:00?

Wrong: Compare 2 timestamps including date plus time of day:
2021/08/01 -> 2021/08/11 is < 10 days

Right: Compare 2 dates at the same time each day:
2021/08/01 -> 2021/08/11 is >= 10 days

I also used the UTC flag for mktime() above to make sure that any local DST changes didn't impact the number of days calculation.
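
If gawk's mktime() is not available, whole-day differences can also be computed with no time of day involved at all, by converting each date to a Julian Day Number. This is a different technique from the one used above and is shown only as a sketch; the function names days_between and jdn are hypothetical, and the inputs are assumed to be valid Gregorian dates in YYYY/MM/DD form.

```shell
# days_between YYYY/MM/DD YYYY/MM/DD
# Hypothetical helper: prints the number of whole days from the first date to the second.
days_between() {
    awk -v d1="$1" -v d2="$2" '
    # Julian Day Number of a Gregorian date (standard integer formula;
    # int() truncates toward zero, which the formula relies on)
    function jdn(y, m, d,    a, n) {
        a = int((m - 14) / 12)              # -1 for Jan and Feb, 0 otherwise
        n  = int(1461 * (y + 4800 + a) / 4)
        n += int(367 * (m - 2 - 12 * a) / 12)
        n -= int(3 * int((y + 4900 + a) / 100) / 4)
        return n + d - 32075
    }
    BEGIN {
        split(d1, p, "/"); split(d2, q, "/")
        print jdn(q[1], q[2], q[3]) - jdn(p[1], p[2], p[3])
    }'
}

days_between 2021/08/01 2021/08/11
days_between 2020/02/28 2020/03/01
```

Since no clock time enters the calculation, neither the time of day nor DST can shift the result; a "less than 10 days" test becomes a plain integer comparison on the printed value.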


