How to Fix Corrupted Dates in R

How can I fix corrupted dates in R?

You need none of the packages you've loaded, nor do you need to use Reduce, as functions we're using here are naturally "vectorized".

Here's a sample of your data. (A good question includes data in an easily copied format such as this.)

x <- c("19/9/1997", "22/9/1997", "23/9/1997", "24/9/1997", "25/9/1997",
"26/9/1997", "29/9/1997", "30/9/1997",
"35440", "35471", "35499", "35591", "35621",
"35652", "35683", "35713")

dates <- as.Date(x, format="%d/%m/%Y")
dates
# [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
# [6] "1997-09-26" "1997-09-29" "1997-09-30" NA NA
# [11] NA NA NA NA NA
# [16] NA

Not surprisingly, the second-half of the dates are not recognized given format="%d/%m/%Y". You mentioned the use of "%m/%d/%Y" in your question, so we can (1) do a literal second-pass for this format (un-utilized with this example, but still relevant for your work?):

dates[is.na(dates)] <- as.Date(x[is.na(dates)], format="%m/%d/%Y")

where [is.na(dates)] only works on the un-converted elements.

(2) If we have more than one other format, you can always use a vector of them and loop over them. (For this, I'll start over, since this loop would replace/augment the first steps above.)

formats <- c("%m/%d/%Y", "%d/%m/%Y", "%Y/%m/%d")
dates <- as.Date(rep(NA, length(x)))
for (fmt in formats) {
nas <- is.na(dates)
dates[nas] <- as.Date(x[nas], format=fmt)
}
dates
# [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
# [6] "1997-09-26" "1997-09-29" "1997-09-30" NA NA
# [11] NA NA NA NA NA
# [16] NA

This still leaves us with NAs for the integer-looking ones. For these you need to specify the origin= to be able to figure it out (as well as converting to an integer). R typically works with an origin of "1970-01-01", which you can confirm with

as.integer(Sys.Date())
# [1] 17787
Sys.Date() - 17787
# [1] "1970-01-01"

but it appears that your dates have an origin of "1900-01-01", I think that's Excel's default storage of dates (but it doesn't matter here):

x[9]    # the first integer-looking element
# [1] "35440"
dates[1] - as.integer(x[9])
# [1] "1900-09-08"

(I'm assuming that your dates are from the same relative period of time.)

From here:

nas <- is.na(dates)
dates[nas] <- as.Date(as.integer(x[nas]), origin="1900-01-01")
dates
# [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
# [6] "1997-09-26" "1997-09-29" "1997-09-30" "1997-01-12" "1997-02-12"
# [11] "1997-03-12" "1997-06-12" "1997-07-12" "1997-08-12" "1997-09-12"
# [16] "1997-10-12"

(Working on the indices of only NA elements is relatively efficient in that it only works on and replaces the not-yet-matched entries. If there is nothing left when it gets to another call to as.Date, it does still call it but with an argument of length 0, with which the function works rather efficiently. I don't think adding a conditional of if (any(nas)) ... would help, but if there are further methods you need that might be more "expensive", you can consider it.)

R: Why is class Date lost upon subsetting

This is not a final solution, but I think that can help to understand.

Here your data :

Data <- data.frame(date = 
as.Date(c('2000/01/01', '2012/01/02', '2013/01/03')))

Take this 2 vectors , one typed by default as numeric and the second as Date.

vv <- vector("numeric",3)
vv.Date <- vector("numeric",3)
class(vv.Date) <- 'Date'
vv
[1] 0 0 0
> vv.Date
[1] "1970-01-01" "1970-01-01" "1970-01-01" ## type dates is initialized by the origin 01-01-1970

Now if I try to assign the first element of each vector as you do in the first step of your loop:

vv[1] <- Data$date[1]
vv.Date[1] <- Data$date[1]
vv
[1] 10957 0 0
> vv.Date
[1] "2000-01-01" "1970-01-01" "1970-01-01"

As you see the typed vector is well created. What happen, when you assign a vector by a scalar value , R try internally to convert it to the type of the vector. To return to your example, When you do this :

You a creating a numeric vector (vv), and you try to assign dates to it:

for(i in 1:3){
Data[i, "date3"] <- as.Date(Data[i, "date"])
}

If you type your date3 , for example:

Data$date3 <- vv.Date

then you try again

for(i in 1:3){
Data[i, "date3"] <- as.Date(Data[i, "date"])
}

You will get a good result:

       date      date3
1 2000-01-01 2000-01-01
2 2012-01-02 2012-01-02
3 2013-01-03 2013-01-03

converting dates in R is changing to future dates, not past

Welcome to modern computers, all shaped after the early Unix systems of the 1970s. The start of time, so to speak, is the epoch, aka 1 Jan 1970.

Your problem here, in a nutshell, is the inferior input data. You only supply two years and by a widely followed convention, values less than 70 are taken for the next century. It's all about the epoch.

So you have two choices. You could preprend '19' to the year part and parse via %Y, or you could just take the year value out of the date and reduce it by 100 if need be.

Some example code for the second (and IMHO better) option, makeing 1970 the cutoff date:

> datestr <- "1-Jan-52" 
> d <- as.Date(datestr, '%d-%b-%y')
>
> d
[1] "2052-01-01"
>
> if (as.integer(strftime(d, "%Y")) >= 1970) {
+ dp <- as.POSIXlt(d)
+ dp$year <- dp$year - 100
+ d <- as.Date(dp)
+}
> d
[1] "1952-01-01"
>

You need to go via POSIXlt to get the components easily.

Converting a date in R returns NA


 as.Date.character(gsub("/", "-",td3$date), '%m-%d-%Y')
[1] "2016-05-06" "2016-05-07" "2016-04-13" "2016-04-14"

A script to correct corrupted date values

This is because of the limitations of timestamps : see this post about out of bounds nanosecond timestamp.

Therefore, I suggest correcting the column as a string before turning it into a datetime column, as follows:

import pandas as pd
import re
df = pd.DataFrame({"A": ["2019-10-04", "0019-04-02", "0018-06-08", "2018-07-08"]})

# I look for every date starting with zero and another number and replace by 20
r = re.compile(r"^0[0-9]{1}")
df["A"] = df["A"].apply(lambda x: r.sub('20', x))
# then I transform to datetime
df["A"] = pd.to_datetime(df["A"], format='%Y-%m-%d')
df

Here is the result

          A
0 2019-10-04
1 2019-04-02
2 2018-06-08
3 2018-07-08

You need to make sure that you can only have dates in 20XX (where X is any number) and not dates in 19XX or other before applying this.

Modify the date in a data frame in R


df$NewDate[grepl("today",df$Date)]<-Sys.Date() # Convert today to date
df$NewDate[grepl("yesterday",df$Date)]<-Sys.Date()-1 # covert yesterday to date
df$NewDate[is.na(df$NewDate)]<-df$Date[is.na(df$NewDate)] %>% as.Date(format="%d %b %Y") # Convert explicit dates to date format
class(df$NewDate)<-"Date" # Convert column to Date class

df
Variable1 Variable2 Date NewDate
1 10 a today 10:45 2018-10-31
2 20 b yesterday 3:10 2018-10-30
3 30 c 28 october 2018 5:32 2018-10-28
4 40 d 28 october 2018 8:32 2018-10-28
5 50 d 27 october 2018 5:32 2018-10-27


Related Topics



Leave a reply



Submit