Best Way to Identify and Extract Dates from Text Python

Best way to identify and extract dates from text Python?

I was also looking for a solution to this and couldn't find one, so a friend and I built a tool to do it. I thought I would come back and share it in case others find it helpful.

datefinder -- find and extract dates inside text

Here's an example:

import datefinder

string_with_dates = '''
Central design committee session Tuesday 10/22 6:30 pm
Th 9/19 LAB: Serial encoding (Section 2.2)
There will be another one on December 15th for those who are unable to make it today.
Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm
He will be flying in Sept. 15th.
We expect to deliver this between late 2021 and early 2022.
'''

matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)

Extracting date from a string in Python

If the date is given in a fixed form, you can simply use a regular expression to extract the date and "datetime.datetime.strptime" to parse the date:

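import re
from datetime import datetime

text = 'The release is planned for 2021-03-04.'  # sample input
match = re.search(r'\d{4}-\d{2}-\d{2}', text)
if match:  # re.search returns None when nothing matches
    date = datetime.strptime(match.group(), '%Y-%m-%d').date()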

Otherwise, if the date is given in an arbitrary form, you can't extract it easily.
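When a handful of fixed forms are known in advance, a middle ground is to try a short list of pattern/format pairs. The following is only a sketch using the standard library; KNOWN_FORMATS and extract_fixed_dates are hypothetical names, and the two formats listed are assumptions about what the input looks like.

```python
import re
from datetime import datetime

# Hypothetical pattern/format pairs; extend the list for the forms you expect.
KNOWN_FORMATS = [
    (re.compile(r'\b\d{4}-\d{2}-\d{2}\b'), '%Y-%m-%d'),
    (re.compile(r'\b\d{2}/\d{2}/\d{4}\b'), '%d/%m/%Y'),
]

def extract_fixed_dates(text):
    """Return every date in text that matches one of KNOWN_FORMATS."""
    found = []
    for pattern, fmt in KNOWN_FORMATS:
        for match in pattern.finditer(text):
            try:
                found.append(datetime.strptime(match.group(), fmt).date())
            except ValueError:
                # The digits fit the pattern but are not a real date (e.g. 2021-13-45).
                pass
    return found

print(extract_fixed_dates('Released 2021-03-04, patched 15/04/2021, ignored 2021-13-45.'))
```

Results are grouped by format rather than by position in the text, so sort them afterwards if order matters.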

python - extract dates from text by giving as parameter the date of reference which is not the current date

dateutil.parser.parse accepts a default parameter which you can use to specify a reference date:

import datetime as DT
import dateutil.parser as DP

today = DT.datetime(2016, 4, 13)
for text in ('today', 'tomorrow', 'this Sunday', 'Wednesday next week',
             'next week Wednesday',
             'next thursday', 'next tuesday in June', '11/28',
             'Concert this Saturday',
             "lunch with Andrew @ Mon Mar 7, 2016",
             'meeting on Tuesday, 3/29'):
    dp_date = DP.parse(text, default=today, fuzzy=True)
    print('{:35} --> {}'.format(text, dp_date))

yields

today                               --> 2016-04-13 00:00:00
tomorrow                            --> 2016-04-13 00:00:00  (should be 2016-04-14)
this Sunday                         --> 2016-04-17 00:00:00
Wednesday next week                 --> 2016-04-13 00:00:00
next week Wednesday                 --> 2016-04-13 00:00:00
next thursday                       --> 2016-04-14 00:00:00
next tuesday in June                --> 2016-06-14 00:00:00  (should be 2016-06-07)
11/28                               --> 2016-11-28 00:00:00
Concert this Saturday               --> 2016-04-16 00:00:00
lunch with Andrew @ Mon Mar 7, 2016 --> 2016-03-07 00:00:00
meeting on Tuesday, 3/29            --> 2016-03-29 00:00:00

Note, however, that not all phrases are parsed correctly.
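One possible workaround for the simplest misparsed phrases is to resolve them by hand against the reference date before handing anything else to the parser. This is a minimal sketch using only the standard library; RELATIVE_DAYS and resolve_relative are hypothetical names, and it only covers whole-string matches of the three words listed.

```python
import datetime as DT

# Hypothetical table of relative words and their offset in days.
RELATIVE_DAYS = {'yesterday': -1, 'today': 0, 'tomorrow': 1}

def resolve_relative(text, reference):
    """Return reference shifted by the word's offset, or None if text is not a known word."""
    offset = RELATIVE_DAYS.get(text.strip().lower())
    if offset is None:
        return None
    return reference + DT.timedelta(days=offset)

today = DT.datetime(2016, 4, 13)
print(resolve_relative('tomorrow', today))       # 2016-04-14 00:00:00
print(resolve_relative('next thursday', today))  # None, so fall back to the parser
```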

How to correctly extract various Date formats from Text in Python

I resolved the issue.
There was an encoding problem in my text content: it contained zero-width space characters (\u200b).

dateContent = dateContent.replace(u'\u200b', '')

Removing \u200b fixed the issue.
The datefinder module does the rest of the work of finding all the different date formats.
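If other invisible characters may be present as well, the same cleanup can be done in one pass with str.translate. This is a sketch only: the set of code points in INVISIBLE is an assumption, and strip_invisible is a hypothetical helper name.

```python
# Assumed set of invisible characters; adjust it to what actually appears in your data.
INVISIBLE = {
    0x200B: None,  # zero-width space
    0x200C: None,  # zero-width non-joiner
    0x200D: None,  # zero-width joiner
    0xFEFF: None,  # byte order mark / zero-width no-break space
}

def strip_invisible(text):
    """Delete every code point listed in INVISIBLE from text."""
    return text.translate(INVISIBLE)

print(strip_invisible('Due\u200b 12/\u200b05/2019'))  # Due 12/05/2019
```

Note that the zero-width joiner and non-joiner are meaningful in some scripts, so drop them from the table if the text is not purely Latin.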

How to extract specific date format from a plain text (python)?

This has been asked a dozen times. Imo the best way is to use a library, e.g. datefinder:

import datefinder
text = "Psg January 1, 2020 hsjkfsdlkfhshdfh January 2, 1908 hdhahhajshjdjoi December 31, 2019 fafsfafagherhea"
matches = datefinder.find_dates(text)

for match in matches:
    print(match)

Which yields

2020-01-01 00:00:00
1908-01-02 00:00:00
2019-12-31 00:00:00

How to extract date from string using Python 3.x

There are two things that prevent datefinder from parsing your samples correctly:

  1. the bill amount: numbers are interpreted as years, so a 3- or 4-digit amount produces a date
  2. characters that datefinder treats as delimiters can prevent it from finding a suitable date format (in this case ':')

The idea is to first sanitize the text by removing the parts that prevent datefinder from identifying all the dates. Unfortunately, this takes a bit of trial and error, as the regex used by this package is too big for me to analyze thoroughly.

import re
import datefinder

def extract_duedate(text):
    # Sanitize the text for datefinder by replacing the tricky parts
    # with a non-delimiter character
    text = re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)

    # The due date is assumed to be the last date found in the text
    return list(datefinder.find_dates(text))[-1]

Rs[\d,\. ]+ matches the bill amount, which is then replaced by '|' so it is not mistaken for part of a date. It matches strings of the form 'Rs[.][ ][12,]345[.67]' (it actually matches more variations, but this is just to illustrate).
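To see what the sanitizing step does on its own, here is a small sketch that applies the same regex without calling datefinder. The sample sentence is made up for illustration, and sanitize is a hypothetical helper name.

```python
import re

# Same pattern as in extract_duedate: a colon, or 'Rs' followed by digits, commas, dots, spaces.
SANITIZE_RE = re.compile(r':|Rs[\d,\. ]+', flags=re.IGNORECASE)

def sanitize(text):
    """Replace colons and 'Rs <amount>' runs with '|'."""
    return SANITIZE_RE.sub('|', text)

print(sanitize('Bill of Rs. 1,234.50 due on: 05 Jul 2017'))
# Bill of |due on| 05 Jul 2017
```

Because of re.IGNORECASE, the pattern also fires on any word ending in "rs" that is followed by a space (for example "hours "), which is one more reason to treat the regex as a starting point rather than a finished filter.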

Obviously, this is a rough example function.
Here are the results I get:

1 : 2017-07-03 00:00:00
2 : 2017-06-06 00:00:00 # Wrong result: first date instead of today
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00

There is one problem with sample 2: 'today' on its own is not recognized by datefinder.

Example:

>>> list(datefinder.find_dates('Rs 219 is due today'))
[datetime.datetime(219, 7, 13, 0, 0)]
>>> list(datefinder.find_dates('is due today'))
[]

So, to handle this case, we could simply replace the token 'today' with the current date as a first step. This would give the following function:

import datetime
import re
import datefinder

def extract_duedate(text):
    if 'today' in text:
        text = text.replace('today', datetime.date.today().isoformat())

    # Sanitize the text for datefinder by replacing the tricky parts
    # with a non-delimiter character
    text = re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)

    return list(datefinder.find_dates(text))[-1]

Now the results are good for all samples:

1 : 2017-07-03 00:00:00
2 : 2017-07-18 00:00:00 # Well, this is the date of my test
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00

If you need to, you can have the function return all dates, and they should all be correct.
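The 'today' substitution is case-sensitive and would also hit the substring inside a longer word. A slightly more defensive variant might look like the sketch below; replace_today and its reference parameter are hypothetical and not part of datefinder.

```python
import datetime
import re

# Match 'today' only as a whole word, in any letter case.
TODAY_RE = re.compile(r'\btoday\b', flags=re.IGNORECASE)

def replace_today(text, reference=None):
    """Replace the standalone word 'today' with an ISO-formatted date."""
    if reference is None:
        reference = datetime.date.today()
    return TODAY_RE.sub(reference.isoformat(), text)

print(replace_today('Rs 219 is due Today', datetime.date(2017, 7, 18)))
# Rs 219 is due 2017-07-18
```

Passing the reference date explicitly also makes the function easy to test, since the output no longer depends on the day the test runs.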


