Is There a Fast Parser for Date

Is there a fast parser for date

Given

## the following two (here three) lines are all of fasttime's R/time.R
fastPOSIXct <- function(x, tz=NULL, required.components = 3L)
.POSIXct(if (is.character(x)) .Call("parse_ts", x, required.components)
else .Call("parse_ts", as.character(x), required.components), tz)

hence

## so we suggest to just use it, and convert later
fastDate <- function(x, tz=NULL)
as.Date(fastPOSIXct(x, tz=tz))

which at least beats as.Date():

R> library(microbenchmark)
R> library(fasttime)
R> d <- rep("2010-11-12", n=1e4)
R> microbenchmark(fastDate(d), as.Date(d), times=100)
Unit: microseconds
expr min lq mean median uq max neval cld
fastDate(d) 47.469 48.8605 54.3232 55.7270 57.1675 104.447 100 a
as.Date(d) 77.194 79.4120 85.3020 85.2585 87.3135 121.979 100 b

R>

If you wanted to go super fast, you could start with tparse.c to create the date-only subset you want.

Is there a way to improve speed of parsing date for large file?

@TemporalWolf had the excellent suggestion of using ciso8601. I've never heard of it so I figured I'd give it a try.

First, I benchmarked my laptop with your sample line. I made a CSV file with 10 million rows of that exact line and it took about 6 seconds to read everything. Using your date parsing code brought that up to 48 seconds which made sense because you also reported it taking 8x longer. Then I scaled the file down to 1 million rows and I could read it in 0.6 seconds and parse dates in 4.8 seconds so everything looked right.

Then I switched over to ciso8601 and, almost like magic, the time for 1 million rows went from 4.8 seconds to about 1.9 seconds:

import datetime
import ciso8601

t0 = datetime.datetime.now()
i = 0
with open('input.csv') as file:
for line in file:
strings = line.split(",")
d = ciso8601.parse_datetime('%sT%s' % (strings[0], strings[1]))
i+=1
print(i)
t1 = datetime.datetime.now() - t0
print(t1)

Note that your data is almost in iso8601 format already. I just had to stick the date and time together with a "T" in the middle.

Fastest way to parse a date-time string to class Date

Note that as.Date will ignore junk after the date so this takes less than 10 seconds on my not particularly fast laptop:

xx <- rep("10/17/2017 12:00:00 AM", 5000000) # test input
system.time(as.Date(xx, "%m/%d/%Y"))
## user system elapsed
## 9.57 0.20 9.82

Java or Scala fast way to parse dates with many different formats using java.time

No, there is no nice/magic way to do this, for two main reasons:

  1. There are variations and ambiguities in data formats that make a generic parser very difficult. e.g. 11/11/11

  2. You are looking for very high performance, which rules out any brute-force methods. 1us per date means only a few thousand instructions to do the full parsing.

At some level you are going to have to specify what formats are valid and how to interpret them. The best way to do this is probably one or more regular expressions that extract the appropriate fields from all the allowable combinations of characters that might form a date, and then much simpler validation of the individual fields.

Here is an example that deals with all dates you listed:

val DateMatch = """(\d\d\d\d)[-/ ]?((?:\d\d)|(?:\w\w\w))?[-/ ]?(\d\d)?T?(\d\d)?:?(\d\d)?:?(\d\d)?[\.]*(\d+)?(.*)?""".r

date match {
case DateMatch(year, month, day, hour, min, sec, usec, timezone) =>
(year, Option(month).getOrElse("1"), Option(day).getOrElse(1), Option(hour).getOrElse(0), Option(min).getOrElse(0), Option(sec).getOrElse(0), Option(usec).getOrElse(0), Option(timezone).getOrElse(""))
case _ =>
throw InvalidDateException
}

As you can see it is going to get very hairy once all the possible dates are included. But if the regex engine can handle it then it should be efficient because the regex should compile to a state machine that looks at each character once.

Is DateTime.ParseExact() faster than DateTime.Parse()

You are asking for the difference in speed for two methods that are functionally different (close but still different).

Just pick the one that is most appropriate.

And no, performance of library methods is almost never documented.

But I can tell you something about that difference:

  • it is small,
  • it might change in future versions

Inferring date format versus passing a parser

You've identified the two important functions: read_csv prepares a function to parse the date columns using _make_date_converter, and this is always going make a call to_datetime (pandas' primary string-to-date conversion tool).

The answers by @WillAyd and @bmbigbang both seem correct to me in that they identify the cause of the slowness as repeated calls of the lambda function.

Since you ask for more details about pandas' source code, I'll try and examine each read_test call in more detail below to find out where we end up in to_datetime and ultimately why the timings are as you observed for your data.


read_test()

This is very fast because, without any prompts about a possible date format, pandas is going to try and parse the list-like column of strings as though they're approximately in ISO8601 format (which is a very common case).

Plunging into to_datetime, we quickly reach this code branch:

if result is None and (format is None or infer_datetime_format):
result = tslib.array_to_datetime(...)

From here on, it's compiled Cython code all the way.

array_to_datetime iterates through the column of strings to convert each one to datetime format. For each row, we hit _string_to_dts at this line; then we go to another short snippet of inlined code (_cstring_to_dts) which means parse_iso_8601_datetime is called to do the actual parsing of the string to a datetime object.

This function is more than capable of parsing dates in the YYYY-MM-DD format and so there is just some housekeeping to finish the job (the C struct filled by parse_iso_8601_datetime becomes a proper datetime object, some bounds are checked).

As you can see, dateutil.parser.parser is not called at all.


read_test(infer_datetime_format=True)

Let's see why this is almost as fast as read_test().

Asking pandas to infer the datetime format (and passing no format argument) means we land here in to_datetime:

if infer_datetime_format and format is None:
format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)

This calls _guess_datetime_format_for_array, which takes the first non-null value in the column and gives it to _guess_datetime_format. This tries to build a datetime format string to use for future parsing. (My answer here has more detail above the formats it is able recognise.)

Fortunately, the YYYY-MM-DD format is one that can be recognised by this function. Even more fortunately, this particular format has a fast-path through the pandas code!

You can see pandas sets infer_datetime_format back to False here:

if format is not None:
# There is a special fast-path for iso8601 formatted
# datetime strings, so in those cases don't use the inferred
# format because this path makes process slower in this
# special case
format_is_iso8601 = _format_is_iso(format)
if format_is_iso8601:
require_iso8601 = not infer_datetime_format
format = None

This allows the code to take the same path as above to the parse_iso_8601_datetime function.


read_test(date_parser=lambda dt: strptime(dt, '%Y-%m-%d'))

We've provided a function to parse the date with, so pandas executes this code block.

However, this raises as exception internally:

strptime() argument 1 must be str, not numpy.ndarray

This exception is immediately caught, and pandas falls back to using try_parse_dates before calling to_datetime.

try_parse_dates means that instead of being called on an array, the lambda function is called repeatedly for each value of the array in this loop:

for i from 0 <= i < n:
if values[i] == '':
result[i] = np.nan
else:
result[i] = parse_date(values[i]) # parse_date is the lambda function

Despite being compiled code, we pay the penalty of having function calls to Python code. This makes it very slow in comparison to the other approaches above.

Back in to_datetime, we now have an object array filled with datetime objects. Again we hit array_to_datetime, but this time pandas sees a date object and uses another function (pydate_to_dt64) to make it into a datetime64 object.

The cause of the slowdown is really due to the repeated calls to the lambda function.


About your update and the MM/DD/YYYY format

The Series s has date strings in the MM/DD/YYYY format.

This is not an ISO8601 format. pd.to_datetime(s, infer_datetime_format=False) tries to parse the string using parse_iso_8601_datetime but this fails with a ValueError. The error is handled here: pandas is going to use parse_datetime_string instead. This means that dateutil.parser.parse is used to convert the string to datetime. This is why it is slow in this case: repeated use of a Python function in a loop.

There's not much difference between pd.to_datetime(s, format='%m/%d/%Y') and pd.to_datetime(s, infer_datetime_format=True) in terms of speed. The latter uses _guess_datetime_format_for_array again to infer the MM/DD/YYYY format. Both then hit array_strptime here:

if format is not None:
...
if result is None:
try:
result = array_strptime(arg, format, exact=exact, errors=errors)

array_strptime is a fast Cython function for parsing an array of strings to datetime structs given a specific format.

Parsing large amount of dates with pandas - scalability - performance drops faster than linear

UPDATE:

look at this comparison:

In [507]: fn
Out[507]: 'D:\\download\\slow.csv.tar.gz'

In [508]: fn2
Out[508]: 'D:\\download\\slow_filtered.csv.gz'

In [509]: %timeit df = pd.read_csv(fn, parse_dates=['from'], index_col=0)
1 loop, best of 3: 15.7 s per loop

In [510]: %timeit df2 = pd.read_csv(fn2, parse_dates=['from'], index_col=0)
1 loop, best of 3: 399 ms per loop

In [511]: len(df)
Out[511]: 99831

In [512]: len(df2)
Out[512]: 99831

In [513]: df.dtypes
Out[513]:
from object
dtype: object

In [514]: df2.dtypes
Out[514]:
from datetime64[ns]
dtype: object

The only difference between those two DFs is in the row# 36867, which i've manually corrected in the D:\\download\\slow_filtered.csv.gz file:

In [518]: df.iloc[36867]
Out[518]:
from 20124-10-20 10:12:00
Name: 36867, dtype: object

In [519]: df2.iloc[36867]
Out[519]:
from 2014-10-20 10:12:00
Name: 36867, dtype: datetime64[ns]

Conclusion: it took Pandas 39 times longer because of one row with a "bad" date and at the end Pandas left from column in the df DF as a string

OLD answer:

it works pretty fair for me (pandas 0.18.0):

setup:

start_ts = '2000-01-01 00:00:00'

pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**4)}).to_csv('d:/temp/10k.csv', index=False)

pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**5)}).to_csv('d:/temp/100k.csv', index=False)

pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**6)}).to_csv('d:/temp/1m.csv', index=False)

pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**7)}).to_csv('d:/temp/10m.csv', index=False)

dt_parser = lambda x: pd.to_datetime(x, format="%Y-%m-%d %H:%M:%S")

checks:

In [360]: fn = 'd:/temp/10m.csv'

In [361]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime}, date_parser=dt_parser)
1 loop, best of 3: 22.6 s per loop

In [362]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime})
1 loop, best of 3: 29.9 s per loop

In [363]: %timeit pd.read_csv(fn, parse_dates=['date'])
1 loop, best of 3: 29.9 s per loop

In [364]: fn = 'd:/temp/1m.csv'

In [365]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime}, date_parser=dt_parser)
1 loop, best of 3: 2.32 s per loop

In [366]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime})
1 loop, best of 3: 3.06 s per loop

In [367]: %timeit pd.read_csv(fn, parse_dates=['date'])
1 loop, best of 3: 3.06 s per loop

In [368]: %timeit pd.read_csv(fn)
1 loop, best of 3: 1.53 s per loop

conclusion: it's a bit faster when i'm using date_parser where i'm specifying the date format, so read_csv don't have to guess it. The difference is approx. 30%



Related Topics



Leave a reply



Submit