Is there a fast parser for date
Given
## the following two (here three) lines are all of fasttime's R/time.R
fastPOSIXct <- function(x, tz=NULL, required.components = 3L)
.POSIXct(if (is.character(x)) .Call("parse_ts", x, required.components)
else .Call("parse_ts", as.character(x), required.components), tz)
hence
## so we suggest to just use it, and convert later
fastDate <- function(x, tz=NULL)
as.Date(fastPOSIXct(x, tz=tz))
which at least beats as.Date()
:
R> library(microbenchmark)
R> library(fasttime)
R> d <- rep("2010-11-12", n=1e4)
R> microbenchmark(fastDate(d), as.Date(d), times=100)
Unit: microseconds
expr min lq mean median uq max neval cld
fastDate(d) 47.469 48.8605 54.3232 55.7270 57.1675 104.447 100 a
as.Date(d) 77.194 79.4120 85.3020 85.2585 87.3135 121.979 100 b
R>
If you wanted to go super fast, you could start with tparse.c to create the date-only subset you want.
Is there a way to improve speed of parsing date for large file?
@TemporalWolf had the excellent suggestion of using ciso8601. I've never heard of it so I figured I'd give it a try.
First, I benchmarked my laptop with your sample line. I made a CSV file with 10 million rows of that exact line and it took about 6 seconds to read everything. Using your date parsing code brought that up to 48 seconds which made sense because you also reported it taking 8x longer. Then I scaled the file down to 1 million rows and I could read it in 0.6 seconds and parse dates in 4.8 seconds so everything looked right.
Then I switched over to ciso8601
and, almost like magic, the time for 1 million rows went from 4.8 seconds to about 1.9 seconds:
import datetime
import ciso8601
t0 = datetime.datetime.now()
i = 0
with open('input.csv') as file:
for line in file:
strings = line.split(",")
d = ciso8601.parse_datetime('%sT%s' % (strings[0], strings[1]))
i+=1
print(i)
t1 = datetime.datetime.now() - t0
print(t1)
Note that your data is almost in iso8601 format already. I just had to stick the date and time together with a "T" in the middle.
Fastest way to parse a date-time string to class Date
Note that as.Date
will ignore junk after the date so this takes less than 10 seconds on my not particularly fast laptop:
xx <- rep("10/17/2017 12:00:00 AM", 5000000) # test input
system.time(as.Date(xx, "%m/%d/%Y"))
## user system elapsed
## 9.57 0.20 9.82
Java or Scala fast way to parse dates with many different formats using java.time
No, there is no nice/magic way to do this, for two main reasons:
There are variations and ambiguities in data formats that make a generic parser very difficult. e.g.
11/11/11
You are looking for very high performance, which rules out any brute-force methods. 1us per date means only a few thousand instructions to do the full parsing.
At some level you are going to have to specify what formats are valid and how to interpret them. The best way to do this is probably one or more regular expressions that extract the appropriate fields from all the allowable combinations of characters that might form a date, and then much simpler validation of the individual fields.
Here is an example that deals with all dates you listed:
val DateMatch = """(\d\d\d\d)[-/ ]?((?:\d\d)|(?:\w\w\w))?[-/ ]?(\d\d)?T?(\d\d)?:?(\d\d)?:?(\d\d)?[\.]*(\d+)?(.*)?""".r
date match {
case DateMatch(year, month, day, hour, min, sec, usec, timezone) =>
(year, Option(month).getOrElse("1"), Option(day).getOrElse(1), Option(hour).getOrElse(0), Option(min).getOrElse(0), Option(sec).getOrElse(0), Option(usec).getOrElse(0), Option(timezone).getOrElse(""))
case _ =>
throw InvalidDateException
}
As you can see it is going to get very hairy once all the possible dates are included. But if the regex engine can handle it then it should be efficient because the regex should compile to a state machine that looks at each character once.
Is DateTime.ParseExact() faster than DateTime.Parse()
You are asking for the difference in speed for two methods that are functionally different (close but still different).
Just pick the one that is most appropriate.
And no, performance of library methods is almost never documented.
But I can tell you something about that difference:
- it is small,
- it might change in future versions
Inferring date format versus passing a parser
You've identified the two important functions: read_csv
prepares a function to parse the date columns using _make_date_converter
, and this is always going make a call to_datetime
(pandas' primary string-to-date conversion tool).
The answers by @WillAyd and @bmbigbang both seem correct to me in that they identify the cause of the slowness as repeated calls of the lambda function.
Since you ask for more details about pandas' source code, I'll try and examine each read_test
call in more detail below to find out where we end up in to_datetime
and ultimately why the timings are as you observed for your data.
read_test()
This is very fast because, without any prompts about a possible date format, pandas is going to try and parse the list-like column of strings as though they're approximately in ISO8601 format (which is a very common case).
Plunging into to_datetime
, we quickly reach this code branch:
if result is None and (format is None or infer_datetime_format):
result = tslib.array_to_datetime(...)
From here on, it's compiled Cython code all the way.
array_to_datetime
iterates through the column of strings to convert each one to datetime format. For each row, we hit _string_to_dts
at this line; then we go to another short snippet of inlined code (_cstring_to_dts
) which means parse_iso_8601_datetime
is called to do the actual parsing of the string to a datetime object.
This function is more than capable of parsing dates in the YYYY-MM-DD format and so there is just some housekeeping to finish the job (the C struct filled by parse_iso_8601_datetime
becomes a proper datetime object, some bounds are checked).
As you can see, dateutil.parser.parser
is not called at all.
read_test(infer_datetime_format=True)
Let's see why this is almost as fast as read_test()
.
Asking pandas to infer the datetime format (and passing no format
argument) means we land here in to_datetime
:
if infer_datetime_format and format is None:
format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)
This calls _guess_datetime_format_for_array
, which takes the first non-null value in the column and gives it to _guess_datetime_format
. This tries to build a datetime format string to use for future parsing. (My answer here has more detail above the formats it is able recognise.)
Fortunately, the YYYY-MM-DD format is one that can be recognised by this function. Even more fortunately, this particular format has a fast-path through the pandas code!
You can see pandas sets infer_datetime_format
back to False
here:
if format is not None:
# There is a special fast-path for iso8601 formatted
# datetime strings, so in those cases don't use the inferred
# format because this path makes process slower in this
# special case
format_is_iso8601 = _format_is_iso(format)
if format_is_iso8601:
require_iso8601 = not infer_datetime_format
format = None
This allows the code to take the same path as above to the parse_iso_8601_datetime
function.
read_test(date_parser=lambda dt: strptime(dt, '%Y-%m-%d'))
We've provided a function to parse the date with, so pandas executes this code block.
However, this raises as exception internally:
strptime() argument 1 must be str, not numpy.ndarray
This exception is immediately caught, and pandas falls back to using try_parse_dates
before calling to_datetime
.
try_parse_dates
means that instead of being called on an array, the lambda function is called repeatedly for each value of the array in this loop:
for i from 0 <= i < n:
if values[i] == '':
result[i] = np.nan
else:
result[i] = parse_date(values[i]) # parse_date is the lambda function
Despite being compiled code, we pay the penalty of having function calls to Python code. This makes it very slow in comparison to the other approaches above.
Back in to_datetime
, we now have an object array filled with datetime
objects. Again we hit array_to_datetime
, but this time pandas sees a date object and uses another function (pydate_to_dt64
) to make it into a datetime64 object.
The cause of the slowdown is really due to the repeated calls to the lambda function.
About your update and the MM/DD/YYYY format
The Series s
has date strings in the MM/DD/YYYY format.
This is not an ISO8601 format. pd.to_datetime(s, infer_datetime_format=False)
tries to parse the string using parse_iso_8601_datetime
but this fails with a ValueError
. The error is handled here: pandas is going to use parse_datetime_string
instead. This means that dateutil.parser.parse
is used to convert the string to datetime. This is why it is slow in this case: repeated use of a Python function in a loop.
There's not much difference between pd.to_datetime(s, format='%m/%d/%Y')
and pd.to_datetime(s, infer_datetime_format=True)
in terms of speed. The latter uses _guess_datetime_format_for_array
again to infer the MM/DD/YYYY format. Both then hit array_strptime
here:
if format is not None:
...
if result is None:
try:
result = array_strptime(arg, format, exact=exact, errors=errors)
array_strptime
is a fast Cython function for parsing an array of strings to datetime structs given a specific format.
Parsing large amount of dates with pandas - scalability - performance drops faster than linear
UPDATE:
look at this comparison:
In [507]: fn
Out[507]: 'D:\\download\\slow.csv.tar.gz'
In [508]: fn2
Out[508]: 'D:\\download\\slow_filtered.csv.gz'
In [509]: %timeit df = pd.read_csv(fn, parse_dates=['from'], index_col=0)
1 loop, best of 3: 15.7 s per loop
In [510]: %timeit df2 = pd.read_csv(fn2, parse_dates=['from'], index_col=0)
1 loop, best of 3: 399 ms per loop
In [511]: len(df)
Out[511]: 99831
In [512]: len(df2)
Out[512]: 99831
In [513]: df.dtypes
Out[513]:
from object
dtype: object
In [514]: df2.dtypes
Out[514]:
from datetime64[ns]
dtype: object
The only difference between those two DFs is in the row# 36867, which i've manually corrected in the D:\\download\\slow_filtered.csv.gz
file:
In [518]: df.iloc[36867]
Out[518]:
from 20124-10-20 10:12:00
Name: 36867, dtype: object
In [519]: df2.iloc[36867]
Out[519]:
from 2014-10-20 10:12:00
Name: 36867, dtype: datetime64[ns]
Conclusion: it took Pandas 39 times longer because of one row with a "bad" date and at the end Pandas left from
column in the df
DF as a string
OLD answer:
it works pretty fair for me (pandas 0.18.0):
setup:
start_ts = '2000-01-01 00:00:00'
pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**4)}).to_csv('d:/temp/10k.csv', index=False)
pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**5)}).to_csv('d:/temp/100k.csv', index=False)
pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**6)}).to_csv('d:/temp/1m.csv', index=False)
pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**7)}).to_csv('d:/temp/10m.csv', index=False)
dt_parser = lambda x: pd.to_datetime(x, format="%Y-%m-%d %H:%M:%S")
checks:
In [360]: fn = 'd:/temp/10m.csv'
In [361]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime}, date_parser=dt_parser)
1 loop, best of 3: 22.6 s per loop
In [362]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime})
1 loop, best of 3: 29.9 s per loop
In [363]: %timeit pd.read_csv(fn, parse_dates=['date'])
1 loop, best of 3: 29.9 s per loop
In [364]: fn = 'd:/temp/1m.csv'
In [365]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime}, date_parser=dt_parser)
1 loop, best of 3: 2.32 s per loop
In [366]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime})
1 loop, best of 3: 3.06 s per loop
In [367]: %timeit pd.read_csv(fn, parse_dates=['date'])
1 loop, best of 3: 3.06 s per loop
In [368]: %timeit pd.read_csv(fn)
1 loop, best of 3: 1.53 s per loop
conclusion: it's a bit faster when i'm using date_parser
where i'm specifying the date format, so read_csv
don't have to guess it. The difference is approx. 30%
Related Topics
How to Count Sequences of Ones in a Logical Vector
Print a Data Frame with Columns Aligned (As Displayed in R)
How to Create a Pie Chart with Percentage Labels Using Ggplot2
Edit Individual Ggplots in Ggally::Ggpairs: How to Have the Density Plot Not Filled in Ggpairs
Rmarkdown Removes Citation Hyperlink
Differencebetween Scale Transformation and Coordinate System Transformation
Condition Filter in Dplyr Based on Shiny Input
How to Add "Author" Metadata to a PDF Created from R
R Data.Table Join: SQL "Select *" Alike Syntax in Joined Tables
R - Download Filtered Datatable
Fuzzyjoin Two Data Frames Using Data.Table
How to Merge Multiple Data.Frames and Sum and Average Columns at the Same Time in R
Specify Function Parameters in Do.Call
Ggplot Piecharts on a Ggmap: Labels Destroy the Small Plots
How to Draw a Contour Plot When Data Are Not on a Regular Grid
Note or Warning from Package Check When Readme.Md Includes Images