Extract Date from Given String in R

Extract date from given string in R

Six solutions are shown below. The first shows how to fix the code in the question to give the desired answer. The next two take the same approach but use different regular expressions. The fourth does it with gsub, the fifth breaks the gsub into two sub calls, and the sixth uses read.table.

1) Escape parens The problem is that ( and ) have special meaning in regular expressions, so you must escape them to match them literally. Writing them as "[(]" and "[)]", as we do below (or as "\\(" and "\\)"), matches them literally. The inner parentheses define the capture group, since we don't want that group to include the literal parentheses themselves:

library(gsubfn)
string <- "Posted 69 months ago (7/4/2011)"
strapplyc(string, "[(](.*)[)]", simplify = TRUE)
## [1] "7/4/2011"
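For readers without gsubfn, here is a minimal base R sketch of the same idea. It swaps in regmatches and regexec (not the strapplyc call used above) and checks that the "[(]" and "\\(" spellings behave the same way:

```r
# Base R sketch: capture the text between literal parentheses.
string <- "Posted 69 months ago (7/4/2011)"

# regexec() returns the full match followed by each capture group.
m_class  <- regmatches(string, regexec("[(](.*)[)]", string))[[1]]
m_escape <- regmatches(string, regexec("\\((.*)\\)", string))[[1]]

full_match <- m_class[1]  # includes the literal parentheses
captured   <- m_class[2]  # the capture group only
print(captured)
```

The first element of the result still contains the parentheses, which is why the capture group is the part worth keeping.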

2) Match content Another way to do it is to match the data itself rather than the surrounding parentheses. Here "\\d+" matches one or more digits:

strapplyc(string, "\\d+/\\d+/\\d+", simplify = TRUE)
## [1] "7/4/2011"

You could specify the number of digits if you want to be even more specific but it seems unnecessary here if the data looks similar to that in the question.
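A minimal sketch of such a tighter pattern, assuming one or two digits in the first two fields and a four-digit year. That digit layout is an assumption, and the answer does not say which field is the day, so no Date conversion is attempted. Base R's regmatches and regexpr are used here instead of strapplyc:

```r
# Sketch: constrain the number of digits in each field (assumed layout).
string <- "Posted 69 months ago (7/4/2011)"

pattern  <- "\\d{1,2}/\\d{1,2}/\\d{4}"
date_chr <- regmatches(string, regexpr(pattern, string, perl = TRUE))
print(date_chr)

# The "69" earlier in the string is not followed by a slash, so it is skipped.
```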

3) Match 8 or more digits and slashes Given that there are no other sequences of 8 or more characters consisting only of slashes and digits in the rest of the string, we could just pick that out:

strapplyc(string, "[0-9/]{8,}", simplify = TRUE)
## [1] "7/4/2011"

4) Remove text before and after Another way of doing it is to remove everything up to the ( and after the ) like this:

gsub(".*[(]|[)].*", "", string)
## [1] "7/4/2011"

5) sub This is the same as (4) except it breaks the gsub into two sub invocations, one removing everything up to ( and the other removing ) onwards. The regular expressions are therefore slightly simpler.

sub(".*\\(", "", sub("\\).*", "", string))
## [1] "7/4/2011"

6) read.table This solution uses no regular expressions at all. It defines sep and comment.char in read.table so that the second column of the result of read.table is the required date or dates.

read.table(text = string, sep = "(", comment.char = ")", as.is = TRUE)$V2
## [1] "7/4/2011"
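Solutions (4) and (6) are both vectorized, so they should also handle several strings at once. A small sketch, where the second input string is made up for illustration:

```r
# Sketch: apply solutions (4) and (6) to a character vector.
# The second element is a hypothetical extra input.
strings <- c("Posted 69 months ago (7/4/2011)",
             "Posted 2 months ago (12/25/2016)")

via_gsub <- gsub(".*[(]|[)].*", "", strings)

# Each element of `text` is read as one line; "(" separates the columns
# and ")" starts a comment, so V2 holds just the date text.
via_read_table <- read.table(text = strings, sep = "(", comment.char = ")",
                             as.is = TRUE)$V2

print(via_gsub)
print(via_read_table)
```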

Note: You don't need c() to define string, since a single string is already a length-one character vector:

string <- c("Posted 69 months ago (7/4/2011)")
string2 <- "Posted 69 months ago (7/4/2011)"
identical(string, string2)
## [1] TRUE

Extract date from a given string in R

Your regular expression uses a slash / where the date in the string has a dash -. With dashes it works:

library(stringr)
as.Date(str_extract(string, "[0-9]{4}-[0-9]{2}-[0-9]{2}"), format="%Y-%m-%d")
# [1] "2013-08-21"

But you can also replace the [0-9] bits with \d, which represents the same thing. I'm not sure why, but regex pros seem to always use the \d version (note that you'll have to escape the backslash with another backslash):

as.Date(str_extract(string, "\\d{4}-\\d{2}-\\d{2}"), format="%Y-%m-%d")
# [1] "2013-08-21"
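To see the three patterns side by side, here is a base R sketch. The input string is hypothetical, since the original question's string is not shown here, and regmatches/regexpr stand in for str_extract:

```r
# Sketch with a made-up input string containing a dash-separated date.
string <- "Report generated 2013-08-21 at noon"

with_class <- regmatches(string, regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", string))
with_d     <- regmatches(string, regexpr("\\d{4}-\\d{2}-\\d{2}", string))

# A slash-based pattern finds nothing in a dash-separated date.
with_slash <- regmatches(string, regexpr("[0-9]{4}/[0-9]{2}/[0-9]{2}", string))

parsed <- as.Date(with_d, format = "%Y-%m-%d")
print(parsed)
```

Unlike str_extract, which returns NA when nothing matches, regmatches drops non-matching elements, so with_slash has length zero.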

Extract DateTime from a string in R

You are right that you should extract the character form of the datetime first. Here is a method that works with that format. It just uses a regular expression that matches 4 digits, then groups of two digits separated by -, T and : where appropriate. You can then use lubridate::ymd_hms as an alternative to as.Date, since it's a good Swiss army knife for different date formats.

library(stringr)
library(lubridate)
string <- "<13>1 2018-04-18T10:29:00.581243+10:00 KOI-QWE-HUJ vmon 2318 - - Some Description..."
string %>%
  str_extract("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}") %>%
  ymd_hms()
#> [1] "2018-04-18 10:29:00 UTC"

Created on 2018-05-02 by the reprex package (v0.2.0).
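The same extraction can be sketched in base R, assuming as.POSIXct with an explicit format is an acceptable stand-in for ymd_hms. Note that the pattern stops before the fractional seconds and the +10:00 offset, so the result is the wall-clock time labelled as UTC, as in the output above:

```r
# Base R sketch: extract the timestamp text, then parse it.
string <- "<13>1 2018-04-18T10:29:00.581243+10:00 KOI-QWE-HUJ vmon 2318 - - Some Description..."

stamp <- regmatches(
  string,
  regexpr("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}", string)
)

# tz = "UTC" only labels the value; the +10:00 offset was not extracted.
parsed <- as.POSIXct(stamp, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
print(format(parsed, "%Y-%m-%d %H:%M:%S"))
```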

Extracting dates in R, from a string variable with different date formats exhibiting lack of general structure / difficult pattern

Based on @mnist's comment and a pattern I noted in my subsequent comment, I split the data with grepl (let myData denote my data frame and String the column of all 1300 string observations):

myData <- myData %>% filter(grepl("Eff|eff|Ef",String))

Then I again split myData into 2 subsets, with Case 1 (nice case) corresponding to filter(grepl("\\d+/\\d+/\\d+", String)) and Case 2 corresponding to filter(!grepl("\\d+/\\d+/\\d+", String)) respectively. As it turns out, Case 2 (annoying case) only amounts to 3% of the observations (<50 obs) which I suppose I will deal with manually since it is not much.

Turns out Case 1 only had one observation like string8 so I corrected that manually.
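The two-stage split can be sketched in base R as follows. The four rows are made-up stand-ins for the real 1300 observations, and logical indexing is used in place of dplyr::filter:

```r
# Sketch of the split on hypothetical data.
myData <- data.frame(
  String = c("Eff 01/02/2019 renewal",
             "effective Jan 2019",
             "no date here",
             "Eff. date 3/4/20"),
  stringsAsFactors = FALSE
)

# Stage 1: keep rows that mention an effective date.
kept <- myData[grepl("Eff|eff|Ef", myData$String), , drop = FALSE]

# Stage 2: Case 1 has a d/m/y-like pattern, Case 2 does not.
is_case1 <- grepl("\\d+/\\d+/\\d+", kept$String)
case1 <- kept[is_case1, , drop = FALSE]
case2 <- kept[!is_case1, , drop = FALSE]

# Share of rows left for manual handling.
share_case2 <- mean(!is_case1)
print(share_case2)
```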

Extract date from string in R

1) Assuming that the reason you want to extract that string is so that you can convert it to Date class, remove everything up to and including the underscore and then convert to Date class. This uses the fact that as.Date ignores junk characters at the end. This uses only a simple regular expression and uses no packages.

as.Date(sub(".*_", "", string))
## [1] "2019-12-09"
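To make the two steps visible, here is a short sketch with a hypothetical file name (the question's actual string is not shown here):

```r
# Sketch: hypothetical file name of the form <prefix>_<date>.csv
string <- "sales_2019-12-09.csv"

# Step 1: drop everything up to and including the last underscore.
tail_part <- sub(".*_", "", string)
print(tail_part)

# Step 2: as.Date reads the leading yyyy-mm-dd and ignores the rest.
parsed <- as.Date(tail_part)
print(parsed)
```

Because .* is greedy, the sub call removes text up to the last underscore, which matters if the prefix itself contains underscores.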

2) strapplyc To get a string result with strapplyc, as was attempted in the question, use this code, which is likely sufficient:

library(gsubfn)

strapplyc(string, "....-..-..", simplify = TRUE)
## [1] "2019-12-09"

or you can be even more specific with this pattern:

strapplyc(string, "\\d{4}-\\d{2}-\\d{2}", simplify = TRUE)
## [1] "2019-12-09"

3) trimws Using R 3.6 or later we can use trimws to trim away all non-digits from the beginning and end. This will work as long as there are no digits before or after the date (which is satisfied in the example in the question). This does not use any packages.

trimws(string, whitespace = "\\D")
## [1] "2019-12-09"
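The caveat about stray digits can be illustrated with two hypothetical file names (R 3.6 or later is assumed, as above):

```r
# Sketch: trimws only strips non-digits from the two ends.
ok  <- "report_2019-12-09.csv"   # no digits outside the date
bad <- "report2_2019-12-09.csv"  # a digit appears before the date

trimmed_ok  <- trimws(ok,  whitespace = "\\D")
trimmed_bad <- trimws(bad, whitespace = "\\D")

print(trimmed_ok)
print(trimmed_bad)  # trimming stops at the first digit, so "2_" survives
```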

4) file_path_sans_ext Use the indicated function to remove the extension and then remove everything up to the underscore. Note that the tools package is included with R so there is nothing to install. The regular expression is the same simple one used in (1).

library(tools)
sub(".*_", "", file_path_sans_ext(string))
## [1] "2019-12-09"

5) Remove everything before and after the date. No packages are used.

gsub(".*_|\\.csv$", "", string)
## [1] "2019-12-09"

Extract date after string in R

Here is one possible method. It assumes you need the value after either S/DATE or START, since your expected new column is named Start_date. If you don't need all such values, the syntax is easy to modify.

Explanation:

  • In the innermost expression, the Notes column is split into a list on any character in the class "[: | \n]": colon, space, | or newline.
  • Blank elements are then removed from each list item.
  • The element following START or S/DATE (compared in upper case via toupper) is extracted with sapply, which simplifies the list into a vector where possible.
  • Finally, lubridate::dmy is applied in the outermost expression (second snippet).

sapply(strsplit(dates$Notes, "[: | \n]"),
       function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])

[1] "09/01/2019" "28/08/19" "04/12/2018"

If you wrap the above in lubridate::dmy, the dates are parsed into Date objects too:

library(lubridate)
dmy(sapply(strsplit(dates$Notes, "[: | \n]"),
           function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))]))

[1] "2019-01-09" "2019-08-28" "2018-12-04"

This can also go inside a dplyr pipeline to create the new column in dates directly:

dates %>% mutate(Start_Date = dmy(sapply(strsplit(Notes, "[: | \n]"),
  function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])))

       col1                                             Notes End_date Start_Date
1  customer DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020       NA 2019-01-09
2 customer2              \nS/DATE: 28/08/19\nR/DATE: 27/08/20       NA 2019-08-28
3 customer3                 DOB: 13/01/1980\nStart:04/12/2018       NA 2018-12-04
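As an alternative sketch (not the method above), a single PCRE pattern can pull out the same values. The data frame is rebuilt from the printed output, and the use of \K to discard the matched label is my own substitution for the strsplit approach:

```r
# Sketch: rebuild the example data from the printed output above.
dates <- data.frame(
  col1  = c("customer", "customer2", "customer3"),
  Notes = c("DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020",
            "\nS/DATE: 28/08/19\nR/DATE: 27/08/20",
            "DOB: 13/01/1980\nStart:04/12/2018"),
  stringsAsFactors = FALSE
)

# (?i) makes the label case-insensitive; \K drops the label from the match,
# leaving only the run of digits and slashes that follows it.
pattern <- "(?i)(?:S/DATE|START):\\s*\\K[0-9/]+"
start_chr <- regmatches(dates$Notes, regexpr(pattern, dates$Notes, perl = TRUE))
print(start_chr)
```

regmatches silently drops rows without a match, so this shortcut is only safe when every Notes value contains one of the labels. The resulting strings can be passed to lubridate::dmy as before.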

Extract dates in a complex string

Solution using base R functions. Works as long as the format is always "yyyymmdd" and the relevant string appears before the first underscore:

file.name <- c("AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif",
               "RINCAODOSSOARES051B20210107T133231_20190518T133919_T22JSM_2021_05_19_01_18_22",
               "VILAPALMA33K20181018T133231_20190518T133919_T23JCM_2020_05_19_01_18_22.tif")

Using gsub twice: first (in the inner call) to get rid of everything from the first underscore onwards, and then to extract the sequence of eight digits ([0-9]{8}):

dates <- gsub(".*([0-9]{8}).*", "\\1", gsub("^([^_]*)_.*", "\\1", file.name))

Finally, use as.Date to convert the strings to an R Date object (which can be re-cast to a string using format):

dates_as_actual_date <- as.Date(dates, format = "%Y%m%d")

dates_as_actual_date is an R Date object and looks like this:

[1] "2019-05-18" "2021-01-07" "2018-10-18"
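Putting the steps together, here is a self-contained sketch that also shows the re-cast to a string with format. The output format "%d/%m/%Y" is just an illustrative choice, and sub is used in place of gsub since each pattern only needs to match once:

```r
# Sketch: extract yyyymmdd from the part before the first underscore.
file.name <- c("AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif",
               "RINCAODOSSOARES051B20210107T133231_20190518T133919_T22JSM_2021_05_19_01_18_22",
               "VILAPALMA33K20181018T133231_20190518T133919_T23JCM_2020_05_19_01_18_22.tif")

first_part <- sub("_.*", "", file.name)                 # text before first "_"
dates      <- sub(".*([0-9]{8}).*", "\\1", first_part)  # the eight-digit run
as_date    <- as.Date(dates, format = "%Y%m%d")

# Re-cast to a string in another layout (illustrative format).
as_text <- format(as_date, "%d/%m/%Y")
print(as_text)
```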


