Python Regex to Match Dates

Python regex to match dates

Instead of using regex, it is generally better to parse the string as a datetime.datetime object:

In [140]: datetime.datetime.strptime("11/12/98","%m/%d/%y")
Out[140]: datetime.datetime(1998, 11, 12, 0, 0)

In [141]: datetime.datetime.strptime("11/12/98","%d/%m/%y")
Out[141]: datetime.datetime(1998, 12, 11, 0, 0)

You could then access the day, month, and year (and hour, minutes, and seconds) as attributes of the datetime.datetime object:

In [142]: date = datetime.datetime.strptime("11/12/98","%m/%d/%y")

In [143]: date.year
Out[143]: 1998

In [144]: date.month
Out[144]: 11

In [145]: date.day
Out[145]: 12

To test if a sequence of digits separated by forward-slashes represents a valid date, you could use a try..except block. Invalid dates will raise a ValueError:

In [159]: try:
   .....:     datetime.datetime.strptime("99/99/99","%m/%d/%y")
   .....: except ValueError as err:
   .....:     print(err)
   .....:
   .....:
time data '99/99/99' does not match format '%m/%d/%y'

If you need to search a longer string for a date,
you could use regex to search for digits separated by forward-slashes:

In [146]: import re
In [152]: match = re.search(r'(\d+/\d+/\d+)','The date is 11/12/98')

In [153]: match.group(1)
Out[153]: '11/12/98'

Of course, invalid dates will also match:

In [154]: match = re.search(r'(\d+/\d+/\d+)','The date is 99/99/99')

In [155]: match.group(1)
Out[155]: '99/99/99'

To check that match.group(1) returns a valid date string, you could then parse it using datetime.datetime.strptime as shown above.
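Putting the two steps together, a minimal sketch might look like this. The helper name find_first_valid_date is made up for illustration, and the default format assumes month/day/two-digit-year as in the examples above:

```python
import datetime
import re

# Loose pattern: anything that looks like digits/digits/digits.
DATE_LIKE = re.compile(r'\d+/\d+/\d+')

def find_first_valid_date(text, fmt="%m/%d/%y"):
    """Return the first date-like substring that strptime accepts, else None.

    Hypothetical helper; fmt is assumed to be month/day/two-digit year.
    """
    for match in DATE_LIKE.finditer(text):
        try:
            return datetime.datetime.strptime(match.group(0), fmt)
        except ValueError:
            continue  # looked like a date but is not one, keep searching
    return None

print(find_first_valid_date("The date is 11/12/98"))
print(find_first_valid_date("The date is 99/99/99"))
```

The regex only finds candidates; strptime decides whether a candidate is a real date.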

Python regex for date and numbers, find the date format

Your format string '%Y-%m-%d' does not match the date in your input. For "birthday on 20/12/2018" (dd/mm/yyyy) it should be '%d/%m/%Y'.

Change this:

date = datetime.datetime.strptime(match.group(), '%Y-%m-%d').date()

To this:

date = datetime.datetime.strptime(match.group(), '%d/%m/%Y').date()

Your Fix:

import datetime
import re

s = "birthday on 20/12/2018"
match = re.search(r'\d{2}/\d{2}/\d{4}', s)
# the format string follows the input: dd/mm/yyyy
date = datetime.datetime.strptime(match.group(), '%d/%m/%Y').date()
print(date)

But:

Why go to all that trouble when there are easier and more elegant ways to do it?

Using dparser:

import dateutil.parser as dparser
dt_1 = "birthday on 20/12/2018"
print("Date: {}".format(dparser.parse(dt_1,fuzzy=True).date()))

OUTPUT:

Date: 2018-12-20

EDIT:

With your edited question which now has multiple dates, you could extract them using regex:

import re
s = "birthday on 20/12/2018 and wedding aniversry on 04/01/1997 and dob is on 09/07/1897"
pattern = r'\d{2}/\d{2}/\d{4}'
print("\n".join(re.findall(pattern,s)))

OUTPUT:

20/12/2018
04/01/1997
09/07/1897

OR

Using dateutil:

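from dateutil.parser import parse

# s is the string from the regex example above.
# Note: dateutil reads an ambiguous date such as 04/01/1997 month-first
# unless you pass dayfirst=True.
for token in s.split():
    try:
        print(parse(token))
    except ValueError:
        # not a date (e.g. "birthday", "on")
        pass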

OUTPUT:

2018-12-20 00:00:00
1997-04-01 00:00:00
1897-09-07 00:00:00

match dates using python regular expressions

You can use the datetime module to parse dates:

import datetime

print(datetime.datetime.strptime('2010-08-27', '%Y-%m-%d'))
print(datetime.datetime.strptime('2010-15-27', '%Y-%m-%d'))

output:

2010-08-27 00:00:00
Traceback (most recent call last):
  ...
ValueError: time data '2010-15-27' does not match format '%Y-%m-%d'

So catching ValueError will tell you if the date matches:

def valid_date(datestring):
    try:
        datetime.datetime.strptime(datestring, '%Y-%m-%d')
        return True
    except ValueError:
        return False

To allow for various formats you could either test for all possibilities, or use re to parse out the fields first:

import datetime
import re

def valid_date(datestring):
    try:
        # dd, mm, yyyy separated by "/", "." or "-"
        mat = re.match(r'(\d{2})[/.-](\d{2})[/.-](\d{4})$', datestring)
        if mat is not None:
            # reverse the groups to (yyyy, mm, dd) for the datetime constructor
            datetime.datetime(*map(int, mat.groups()[::-1]))
            return True
    except ValueError:
        pass
    return False
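The other option mentioned above, testing each possible format in turn, could be sketched roughly like this. The list of formats is an assumption; use whichever ones your data actually contains:

```python
import datetime

# Candidate formats, tried in order (an assumed list; adjust to your data).
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%d-%m-%Y")

def parse_date(datestring, formats=FORMATS):
    """Return a datetime.date for the first format that fits, else None."""
    for fmt in formats:
        try:
            return datetime.datetime.strptime(datestring, fmt).date()
        except ValueError:
            continue  # this format did not fit, try the next one
    return None

print(parse_date('2010-08-27'))
print(parse_date('27/08/2010'))
print(parse_date('2010-15-27'))
```

Order matters when formats can overlap: the first one that parses wins, so put the format you trust most first.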

Python Regex Date

I agree with jonrsharpe that the way to do this is to combine regex with datetime. I used a simple regex that is going to match anything that could be a date in the format, then try to parse them with datetime.

import re
import datetime

def yield_valid_dates(dateStr):
    for match in re.finditer(r"\d{1,2}-\d{1,2}-\d{4}", dateStr):
        try:
            date = datetime.datetime.strptime(match.group(0), "%m-%d-%Y")
            yield date
            # or you can yield match.group(0) if you just want to
            # yield the date as the string it was found like 05-04-1999
        except ValueError:
            # date couldn't be parsed by datetime... invalid date
            pass


testStr = """05-04-1999 here is some filler text in between the two dates 4-5-2016 then finally an invalid
date 32-2-2016 here is also another invalid date, there is no 32d day of the month 6-32-2016. You can also not
include the leading zeros like 4-2-2016 and it will still be detected"""

for date in yield_valid_dates(testStr):
    print(date)

This prints the three valid dates:

1999-05-04 00:00:00
2016-04-05 00:00:00
2016-04-02 00:00:00

Matching dates with regular expressions in Python?

Here's one way to make a regular expression that will match any date of your desired format (though you could obviously tweak whether commas are optional, add month abbreviations, and so on):

import re

years = r'((?:19|20)\d\d)'
pattern = r'(%%s) +(%%s), *%s' % years

thirties = pattern % (
    "September|April|June|November",
    r'0?[1-9]|[12]\d|30')

thirtyones = pattern % (
    "January|March|May|July|August|October|December",
    r'0?[1-9]|[12]\d|3[01]')

fours = '(?:%s)' % '|'.join('%02d' % x for x in range(4, 100, 4))

feb = r'(February) +(?:%s|%s)' % (
    r'(?:(0?[1-9]|1\d|2[0-8])), *%s' % years,  # 1-28 any year
    r'(?:(29), *((?:(?:19|20)%s)|2000))' % fours)  # 29 leap years only

result = '|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb))
r = re.compile(result)
print(result)

Then we have:

>>> r.match('January 30, 2001') is not None
True
>>> r.match('January 31, 2001') is not None
True
>>> r.match('January 32, 2001') is not None
False
>>> r.match('February 32, 2001') is not None
False
>>> r.match('February 29, 2001') is not None
False
>>> r.match('February 28, 2001') is not None
True
>>> r.match('February 29, 2000') is not None
True
>>> r.match('April 30, 1908') is not None
True
>>> r.match('April 31, 1908') is not None
False

And what is this glorious regexp, you may ask?

>>> print(result)
(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:(February) +(?:(?:(0?[1-9]|1\d|2[0-8])), *((?:19|20)\d\d)|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))

(I initially intended to do a tongue-in-cheek enumeration of the possible dates, but I basically ended up hand-writing that whole gross thing except for the multiples of four, anyway.)
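For comparison, here is a rough sketch of the less heroic route: a loose regex to check the shape, then datetime to check the calendar. The function name is invented for this example. It assumes English month names (%B depends on the locale), and unlike the pattern above it does not restrict the year to 19xx or 20xx:

```python
import datetime
import re

# Loose shape check: a word, a 1-2 digit day, a comma, a 4-digit year.
LOOSE = re.compile(r'([A-Za-z]+) +(\d{1,2}), *(\d{4})$')

def is_valid_long_date(text):
    m = LOOSE.match(text)
    if m is None:
        return False
    try:
        # %B is locale-dependent; English month names are assumed here.
        datetime.datetime.strptime("{} {} {}".format(*m.groups()), "%B %d %Y")
    except ValueError:
        return False  # right shape, but not a real calendar date
    return True

for candidate in ("January 31, 2001", "February 29, 2001", "February 29, 2000", "April 31, 1908"):
    print(candidate, is_valid_long_date(candidate))
```

The leap-year logic comes for free from datetime instead of being spelled out in the pattern.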

RegEx for matching a datetime followed by spaces and any chars

Don't bother with a regular expression. You know the format of the line. Just split it:

from datetime import datetime

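# sample input, reconstructed from the expected output below
lines = ['2018-09-08 10:34:49 10.0 MiB path/of/a/directory']

for l in lines:
    line_date, line_time, rest_of_line = l.split(maxsplit=2)
    print([line_date, line_time, rest_of_line])
    # ['2018-09-08', '10:34:49', '10.0 MiB path/of/a/directory']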

Take special note of the use of the maxsplit argument. This prevents it from splitting the size or the path. We can do this because we know the date has one space in the middle and one space after it.

If the size will always have one space in the middle and one space following it, we can increase it to 4 splits to separate the size, too:

for l in lines:
    line_date, line_time, size_quantity, size_units, line_path = l.split(maxsplit=4)
    print([line_date, line_time, size_quantity, size_units, line_path])
    # ['2018-09-08', '10:34:49', '10.0', 'MiB', 'path/of/a/directory']

Note that extra contiguous spaces and spaces in the path don't screw it up:

l = "2018-09-08 10:34:49     10.0   MiB    path/of/a/direct       ory"
line_date, line_time, size_quantity, size_units, line_path = l.split(maxsplit=4)
print([line_date, line_time, size_quantity, size_units, line_path])
# ['2018-09-08', '10:34:49', '10.0', 'MiB', 'path/of/a/direct       ory']

You can concatenate parts back together if needed:

line_size = size_quantity + ' ' + size_units


If you want the timestamp for something, you can parse it:

# 'T' could be anything, but 'T' is standard for the ISO 8601 format
timestamp = datetime.strptime(line_date + 'T' + line_time, '%Y-%m-%dT%H:%M:%S')
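If it helps, the pieces above can be wrapped into one function. This is only a sketch: parse_line is a name made up here, and it assumes every line follows the date, time, size number, size unit, path layout from the example:

```python
from datetime import datetime

def parse_line(line):
    """Split one listing line into (timestamp, size, path).

    Assumes the layout shown above: date, time, size number, size unit, path.
    """
    line_date, line_time, size_quantity, size_units, line_path = line.split(maxsplit=4)
    timestamp = datetime.strptime(line_date + 'T' + line_time, '%Y-%m-%dT%H:%M:%S')
    return timestamp, size_quantity + ' ' + size_units, line_path

print(parse_line("2018-09-08 10:34:49 10.0 MiB path/of/a/directory"))
```

A line with fewer than five fields raises ValueError on the unpacking, and so does a malformed timestamp, so one except ValueError covers both.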

Match dates and hour using Python Regular Expressions

If you want a mandatory time in your format, you could use:

^(0?[1-9]|[12][0-9]|3[01])[.\/-](0?[1-9]|1[0-2])[.\/-]([0-9]{4}|[0-9]{2})\s+([01]?\d|2[0-3]):([0-5]?\d)(:[0-5]?\d)?$

If you want the time to be optional, you could use:

^(0?[1-9]|[12][0-9]|3[01])[.\/-](0?[1-9]|1[0-2])[.\/-]([0-9]{4}|[0-9]{2})(\s+([01]?\d|2[0-3]):([0-5]?\d)(:[0-5]?\d)?)?$

Of course capture groups can be changed depending on what data you need to extract, if any.

That being said, I would recommend a date library to handle things like this in most cases, although sometimes you might want a regex, e.g. for form validation in frameworks that only accept regex.
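As a quick illustration in Python, here is the optional-time pattern run against a few inputs. The day part is written as [12] and 3[01], because a | inside square brackets is a literal character rather than alternation. Note the last input: the pattern only checks the shape, so 31 February still passes.

```python
import re

# dd, mm, yy or yyyy, then an optional hh:mm or hh:mm:ss
DATE_TIME = re.compile(
    r'^(0?[1-9]|[12][0-9]|3[01])[./-](0?[1-9]|1[0-2])[./-]([0-9]{4}|[0-9]{2})'
    r'(\s+([01]?\d|2[0-3]):([0-5]?\d)(:[0-5]?\d)?)?$'
)

for candidate in ("31/12/2018 23:59", "31/12/2018", "32/12/2018",
                  "31/12/2018 24:00", "31/02/2018"):
    print(candidate, bool(DATE_TIME.match(candidate)))
```

This is why a date library is still worth running on anything the regex lets through.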

Regular expression to match range of dates with months included

This regex validates date ranges that follow the format MONTH YEAR (MONTH YEAR | PRESENT):

import re
# the first line deliberately contains two valid ranges, to make things harder
text = """
February 2016 - March 2019 February 2017 - March 2019
September 2015 to August 2019
April 2015 to present
September 2018 - present
George Mason University august 2019
Stratusburg university February 2018
Some text and month followed by year
"""
# writing the regex on a single line would make it very ugly
MONTHS_RE = ['Jan(?:uary)?', 'Feb(?:ruary)?', 'Mar(?:ch)?', 'Apr(?:il)?', 'May', 'Jun(?:e)?', 'Jul(?:y)?',
             'Aug(?:ust)?', 'Sep(?:tember)?', 'Oct(?:ober)?', '(?:Nov|Dec)(?:ember)?']
# match a month name and capture it: (Jan(?:uary)?|Feb(?:ruary)?|...|(?:Nov|Dec)(?:ember)?)
RE_MONTH = '({})'.format('|'.join(MONTHS_RE))
# a month followed by a 2- to 4-digit year; used twice in the final regex
RE_DATE = r'{RE_MONTH}(?:[\s]+)(\d{{2,4}})'.format(RE_MONTH=RE_MONTH)
# final regex
RE_VALID_RANGE = re.compile('{RE_DATE}.+?(?:{RE_DATE}|(present))'.format(RE_DATE=RE_DATE), flags=re.IGNORECASE)


# if you want to extract both the valid and the invalid lines
valid_ranges = []
invalid_ranges = []
for line in text.split('\n'):
    if line:
        groups = re.findall(RE_VALID_RANGE, line)
        if groups:
            # All valid ranges found on this line are here (one or two in this sample).
            # Every tuple has 5 elements because there are 5 capturing groups.
            # Either M2 and Y2 are filled and present is empty, or the reverse,
            # because of (?:{RE_DATE}|(present)).
            M1, Y1, M2, Y2, present = groups[0]  # loop over groups to check every range
            valid_ranges.append(line)
        else:
            invalid_ranges.append(line)

print('VALID: ', valid_ranges)
print('INVALID:', invalid_ranges)


# finditer yields every valid range, so a line containing two ranges yields two matches;
# if you are working line by line, this is not what you want
valid_ranges = []
for match in re.finditer(RE_VALID_RANGE, text):
    # unpack the groups if you want to check the range
    M1, Y1, M2, Y2, present = match.groups()
    valid_ranges.append(match.group(0))  # the matched text
print('VALID USING <finditer>: ', valid_ranges)

OUTPUT:

VALID:  ['February 2016 - March 2019 February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']
INVALID: ['George Mason University august 2019', 'Stratusburg university February 2018', 'Some text and month followed by year']
VALID USING <finditer>: ['February 2016 - March 2019', 'February 2017 - March 2019', 'September 2015 to August 2019', 'April 2015 to present', 'September 2018 - present']

I hate writing a long regular expression in a single str variable. I prefer to break it up so that I can still understand what it does when I read my code six months later. Note how finditer splits the first line into two valid range strings.

If you want just to extract ranges you can use this:

valid_ranges = re.findall(RE_VALID_RANGE, text)

But this returns the groups [(M1, Y1, M2, Y2, present), ...], not the matched text:

[('February', '2016', 'March', '2019', ''), ('February', '2017', 'March', '2019', ''), ('September', '2015', 'August', '2019', ''), ('April', '2015', '', '', 'present'), ('September', '2018', '', '', 'present')]
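If you need real dates rather than strings, each of those tuples can be converted afterwards. A possible sketch (the names to_date and to_range are invented here, years are assumed to be four digits, and the day is arbitrarily set to the 1st of the month):

```python
import datetime

# Month number keyed by the first three letters of the month name.
MONTH_NUMBER = {abbr: i for i, abbr in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def to_date(month, year):
    # Only the first three letters matter, so "Feb" and "February" both work.
    return datetime.date(int(year), MONTH_NUMBER[month[:3].lower()], 1)

def to_range(groups):
    """Convert one (M1, Y1, M2, Y2, present) tuple to (start, end).

    end is None when the range is open-ended ("present").
    """
    m1, y1, m2, y2, present = groups
    start = to_date(m1, y1)
    end = None if present else to_date(m2, y2)
    return start, end

print(to_range(('February', '2016', 'March', '2019', '')))
print(to_range(('April', '2015', '', '', 'present')))
```

With dates in hand you can also reject ranges whose end comes before their start, which the regex cannot do.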

date matching using python regex

[1-31] matches 1-3 and 1, which is basically 1, 2 or 3. You cannot match a number range unless it's a subset of 0-9. The same applies to [1981-2011], which matches exactly one character that is 0, 1, 2, 8 or 9.
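You can see this for yourself with a small check (matching_digits is just a throwaway helper for this demonstration):

```python
import re

def matching_digits(char_class):
    """Return the single digits that the given character class accepts."""
    return [d for d in "0123456789" if re.fullmatch(char_class, d)]

print(matching_digits(r'[1-31]'))       # ['1', '2', '3']
print(matching_digits(r'[1981-2011]'))  # ['0', '1', '2', '8', '9']
print(re.fullmatch(r'[1-31]', '25'))    # None: a class matches one character only
```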

The best solution is simply matching any number and then checking the numbers later using python itself. A date such as 31-02-2012 would not make any sense - and making your regex check that would be hard. Making it also handle leap years properly would make it even harder or impossible. Here's a regex matching anything that looks like a dd-mm-yyyy date: \b\d{1,2}[-/:]\d{1,2}[-/:]\d{4}\b

However, I would highly suggest not allowing any of -, : and / as : is usually used for times, / usually for the US way of writing a date (mm/dd/yyyy) and - for the ISO way (yyyy-mm-dd). The EU dd.mm.yyyy syntax is not handled at all.

If the string does not contain anything but the date, you don't need a regex at all - use strptime() instead.

All in all, tell the user what date format you expect and parse that one, rejecting anything else. Otherwise you'll get ambiguous cases such as 04/05/2012 (is it April 5th or May 4th?).


