Extract date from given string in R
Six solutions follow. The first shows how to fix the code in the question so that it gives the desired answer. The next two are the same except that they use different regular expressions. The fourth shows how to do it with gsub. The fifth breaks the gsub into two sub calls, and the sixth uses read.table.
1) Escape parens The problem is that ( and ) have special meaning in regular expressions, so you must escape them if you want to match them literally. By writing them as "[(]" and "[)]" as we do below (or as "\\(" and "\\)") they are matched literally. The inner parentheses define the capture group, because we don't want that group to include the literal parentheses themselves:
library(gsubfn)
strapplyc(string, "[(](.*)[)]", simplify = TRUE)
## [1] "7/4/2011"
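As a rough base R illustration of the same point (not the code from the answer), the sketch below uses regexec and regmatches instead of strapplyc, and assumes the string from the question. It contrasts an unescaped pattern, where the parentheses only form a capture group, with an escaped one, where the outer parentheses are matched literally.

```r
# The string from the question.
string <- "Posted 69 months ago (7/4/2011)"

# Unescaped: "(.*)" is just a capture group around ".*",
# so the group captures the whole string.
m_unescaped <- regmatches(string, regexec("(.*)", string))
unescaped <- m_unescaped[[1]][2]

# Escaped: "\\(" and "\\)" match literal parentheses and the inner
# "(.*)" captures only what sits between them.
m_escaped <- regmatches(string, regexec("\\((.*)\\)", string))
escaped <- m_escaped[[1]][2]

unescaped
escaped
```

Element 1 of each regexec result is the full match and element 2 is the first capture group, which is why the index 2 is used here.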
2) Match content Another way to do it is to match the data itself rather than the surrounding parentheses. Here "\\d+" matches one or more digits:
strapplyc(string, "\\d+/\\d+/\\d+", simplify = TRUE)
## [1] "7/4/2011"
You could specify the number of digits if you want to be even more specific but it seems unnecessary here if the data looks similar to that in the question.
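If you do want to pin down the number of digits, one possible sketch in base R is shown below. The digit counts (one or two for the first two fields, four for the year) and the month/day/year reading of 7/4/2011 are assumptions; the question does not say which field is the month.

```r
# The string from the question.
string <- "Posted 69 months ago (7/4/2011)"

# Assumed layout: 1-2 digits, slash, 1-2 digits, slash, 4 digits.
pattern <- "[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}"
date_text <- regmatches(string, regexpr(pattern, string))

# Assumption: the first field is the month. Use "%d/%m/%Y" if it is the day.
date_value <- as.Date(date_text, format = "%m/%d/%Y")

date_text
date_value
```

The "69" earlier in the string is not picked up because the pattern requires the slashes.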
3) Match 8 or more digits and slashes Given that there are no other sequences of 8 or more characters consisting only of slashes and digits in the rest of the string, we could just pick that sequence out:
strapplyc(string, "[0-9/]{8,}", simplify = TRUE)
## [1] "7/4/2011"
4) Remove text before and after Another way of doing it is to remove everything up to the ( and after the ) like this:
gsub(".*[(]|[)].*", "", string)
## [1] "7/4/2011"
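One thing to keep in mind with the removal approach is that gsub returns the input unchanged when nothing matches. The sketch below uses a second, made-up string without parentheses to show this, and adds an optional grepl guard (an addition for illustration, not part of the answer) that turns such cases into NA.

```r
# First element is from the question; the second is a hypothetical
# string with no parenthesised date.
strings <- c("Posted 69 months ago (7/4/2011)", "Posted today")

# Same pattern as in (4): strip everything up to "(" and from ")" onwards.
stripped <- gsub(".*[(]|[)].*", "", strings)

# Guard: only keep results for strings that actually contain "(...)".
has_parens <- grepl("[(].*[)]", strings)
guarded <- ifelse(has_parens, stripped, NA_character_)

stripped
guarded
```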
5) sub This is the same as (4) except that it breaks the gsub into two sub invocations, one removing everything up to ( and the other removing ) onwards. The regular expressions are therefore slightly simpler.
sub(".*\\(", "", sub("\\).*", "", string))
6) read.table This solution uses no regular expressions at all. It defines sep and comment.char in read.table so that the second column of the result of read.table is the required date or dates.
read.table(text = string, sep = "(", comment.char = ")", as.is = TRUE)$V2
## [1] "7/4/2011"
Note: You don't need the c when defining string:
string <- c("Posted 69 months ago (7/4/2011)")
string2 <- "Posted 69 months ago (7/4/2011)"
identical(string, string2)
## [1] TRUE
Extract date from a given string in R
Your regular expression uses a slash / where it should use a dash -.
as.Date(str_extract(string, "[0-9]{4}-[0-9]{2}-[0-9]{2}"), format="%Y-%m-%d")
# [1] "2013-08-21"
But you can also replace the [0-9] bits with \d, which is shorthand for a digit. I'm not sure why, but regex pros seem to always use the \d version (note that you'll have to escape the backslash with another backslash):
as.Date(str_extract(string, "\\d{4}-\\d{2}-\\d{2}"), format="%Y-%m-%d")
# [1] "2013-08-21"
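For comparison, here is a base R sketch of the same extraction that does not need stringr. The input strings are made up for illustration. Unlike str_extract, regmatches drops elements without a match, so the sketch fills a vector of NA values to keep the result aligned with the input.

```r
# Hypothetical inputs: one with an ISO date, one without.
strings <- c("report_2013-08-21.txt", "no date here")

pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"
m <- regexpr(pattern, strings)  # -1 where there is no match

# Keep the output the same length as the input.
date_text <- rep(NA_character_, length(strings))
date_text[m > 0] <- regmatches(strings, m)

dates <- as.Date(date_text, format = "%Y-%m-%d")
dates
```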
Extract DateTime from a string in R
You are right that you should extract the character form of the datetime first. Here is a method that works with that format. It just uses a regular expression that matches 4 digits, then groups of two digits separated by -, T and : where appropriate. You can then use lubridate::ymd_hms as an alternative to as.Date, since it's a good Swiss army knife for different date formats.
library(stringr)
library(lubridate)
string <- "<13>1 2018-04-18T10:29:00.581243+10:00 KOI-QWE-HUJ vmon 2318 - - Some Description..."
string %>%
str_extract("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}") %>%
ymd_hms()
#> [1] "2018-04-18 10:29:00 UTC"
Created on 2018-05-02 by the reprex package (v0.2.0).
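A base R sketch of the same idea, for cases where lubridate is not available, could look like the following. It uses as.POSIXct in place of ymd_hms, which is a substitution rather than the method shown above. Note that the pattern stops before the fractional seconds and the +10:00 offset, so the offset is not applied; declaring the result as UTC is an assumption that simply mirrors the output shown above.

```r
string <- "<13>1 2018-04-18T10:29:00.581243+10:00 KOI-QWE-HUJ vmon 2318 - - Some Description..."

# Date, literal T, then time down to whole seconds.
pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"
stamp_text <- regmatches(string, regexpr(pattern, string))

# Assumption: treat the wall-clock time as UTC; the +10:00 offset is ignored.
stamp <- as.POSIXct(stamp_text, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

format(stamp, "%Y-%m-%d %H:%M:%S")
```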
Extracting dates in R, from a string variable with different date formats exhibiting lack of general structure / difficult pattern
Based on @mnist's comment and a pattern I recognized in my subsequent comment, I filtered the data with grepl (let myData denote my data frame and String denote the column of all 1300 string observations):
myData <- myData %>% filter(grepl("Eff|eff|Ef",String))
Then I split myData again into 2 subsets, with Case 1 (nice case) corresponding to filter(grepl("\\d+/\\d+/\\d+", String)) and Case 2 corresponding to filter(!grepl("\\d+/\\d+/\\d+", String)). As it turns out, Case 2 (annoying case) only amounts to 3% of the observations (<50 obs), which I suppose I will deal with manually since it is not much.
It turns out Case 1 only had one observation like string8, so I corrected that manually.
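The two-step split described above can be sketched in base R as follows. The three rows are invented stand-ins for the real 1300 observations, and logical indexing is used instead of dplyr::filter so that the sketch runs without extra packages.

```r
# Hypothetical stand-in for the real data frame.
myData <- data.frame(
  String = c("Eff 1/2/2020", "eff. January 2020", "Expires 3/4/2021"),
  stringsAsFactors = FALSE
)

# Step 1: keep rows that mention an effective date.
has_eff <- grepl("Eff|eff|Ef", myData$String)
eff <- myData[has_eff, , drop = FALSE]

# Step 2: split on whether a d/m/y-style date is present.
has_slash_date <- grepl("\\d+/\\d+/\\d+", eff$String)
case1 <- eff[has_slash_date, , drop = FALSE]   # nice case
case2 <- eff[!has_slash_date, , drop = FALSE]  # handle manually

case1
case2
```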
Extract date from string in R
1) Assuming that the reason you want to extract that string is so that you can convert it to Date class, remove everything up to and including the underscore and then convert to Date class. This relies on the fact that as.Date ignores junk characters at the end. It needs only a simple regular expression and no packages.
as.Date(sub(".*_", "", string))
## [1] "2019-12-09"
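To make the trailing-junk behaviour concrete, here is a small sketch. The file name is hypothetical, since the original string is not shown here; it is assumed to have the form prefix_YYYY-MM-DD.csv.

```r
# Hypothetical file name of the assumed form prefix_YYYY-MM-DD.csv.
string <- "data_2019-12-09.csv"

# After removing everything up to the underscore, ".csv" is still attached.
remainder <- sub(".*_", "", string)

# as.Date parses the leading "2019-12-09" and ignores the trailing ".csv".
date_value <- as.Date(remainder)

remainder
date_value
```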
2) strapplyc To use strapplyc, as was attempted in the question, to get a string result, use this code, which is likely sufficient:
library(gsubfn)
strapplyc(string, "....-..-..", simplify = TRUE)
## [1] "2019-12-09"
or you can be even more specific with this pattern:
strapplyc(string, "\\d{4}-\\d{2}-\\d{2}", simplify = TRUE)
## [1] "2019-12-09"
3) trimws Using R 3.6 or later we can use trimws to trim away all non-digits from the beginning and end. This will work as long as there are no digits before or after the date (which is satisfied in the example in the question). This does not use any packages.
trimws(string, whitespace = "\\D")
## [1] "2019-12-09"
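The caveat about stray digits can be seen in a short sketch. Both file names are hypothetical; the second one has a digit in its prefix, which is exactly the case where trimming non-digits stops too early.

```r
# Hypothetical file names; requires R 3.6 or later for the whitespace argument.
clean_name <- "data_2019-12-09.csv"
digit_name <- "run2_2019-12-09.csv"

# Only non-digits at the two ends are trimmed.
clean <- trimws(clean_name, whitespace = "\\D")
unclean <- trimws(digit_name, whitespace = "\\D")

clean    # the date on its own
unclean  # trimming stopped at the "2" in "run2"
```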
4) file_path_sans_ext Use the indicated function to remove the extension and then remove everything up to the underscore. Note that the tools package is included with R so there is nothing to install. The regular expression is the same simple one used in (1).
library(tools)
sub(".*_", "", file_path_sans_ext(string))
## [1] "2019-12-09"
5) Remove everything before and after the date. No packages are used.
gsub(".*_|\\.csv$", "", string)
## [1] "2019-12-09"
Extract date after string in R
One method is the following, assuming that you need either S/DATE or START, since your expected new column name is Start_date. If not all such values are required, you can easily modify this syntax.
Explanation -
- In the innermost expression, the Notes column is split into a list by any of these separators: a colon, a space, or \n (the | inside the brackets is also treated as a literal separator).
- In this list, blanks are removed.
- In the modified list, the item next to Start or S/Date is extracted using sapply, which simplifies the list into a vector (if possible).
- Lastly, lubridate::dmy is used in the outermost expression.
sapply(strsplit(dates$Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])
[1] "09/01/2019" "28/08/19" "04/12/2018"
If you wrap the above in lubridate::dmy, the dates will be correctly formatted too:
dmy(sapply(strsplit(dates$Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))]))
[1] "2019-01-09" "2019-08-28" "2018-12-04"
Further, this can be passed into dplyr pipes, so as to create a new column in your dates data frame at the same time:
dates %>% mutate(Start_Date = dmy(sapply(strsplit(Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])))
col1 Notes End_date Start_Date
1 customer DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020 NA 2019-01-09
2 customer2 \nS/DATE: 28/08/19\nR/DATE: 27/08/20 NA 2019-08-28
3 customer3 DOB: 13/01/1980\nStart:04/12/2018 NA 2018-12-04
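To see what each step in the explanation produces, the sketch below walks through a single Notes value (the first row of the output above) with the intermediate results kept in named variables. It uses base as.Date with an assumed day/month/four-digit-year format for this one value; lubridate::dmy, as used in the answer, is more forgiving and also handles two-digit years such as 28/08/19.

```r
# One Notes value, taken from the first row shown above.
notes <- "DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020"

# Step 1: split on any single colon, space, pipe or newline.
pieces <- strsplit(notes, "[: | \n]")[[1]]

# Step 2: drop the empty strings left between adjacent separators.
pieces <- pieces[pieces != ""]

# Step 3: take the element right after the START or S/DATE label.
start_value <- pieces[1 + which(toupper(pieces) %in% c("S/DATE", "START"))]

# Step 4: convert; "%d/%m/%Y" is an assumption that fits this value only.
start_date <- as.Date(start_value, format = "%d/%m/%Y")

pieces
start_value
start_date
```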
Extract dates in a complex string
Solution using base R functions. Works as long as the format is always "yyyymmdd" and the relevant string appears before the first underscore:
file.name<- c("AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif",
"RINCAODOSSOARES051B20210107T133231_20190518T133919_T22JSM_2021_05_19_01_18_22",
"VILAPALMA33K20181018T133231_20190518T133919_T23JCM_2020_05_19_01_18_22.tif")
Using gsub twice: first (in the inner call) to get rid of everything after the first underscore, and then to extract the sequence of eight digits ([0-9]{8}):
dates <- gsub(".*([0-9]{8}).*", "\\1", gsub("^([^_]*)_.*", "\\1", file.name))
Finally, using as.Date to convert the strings to an R Date object (which can be re-cast to a string using format):
dates_as_actual_date <- as.Date(dates, format = "%Y%m%d")
dates_as_actual_date is an R Date object and looks like this:
[1] "2019-05-18" "2021-01-07" "2018-10-18"
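Splitting the nested call into separate steps may make it easier to follow. The sketch below does this for the first file name only; the variable names prefix, date_text and date_value are introduced here for illustration.

```r
# First file name from the example above.
file.name <- "AZAMBUJAI002A20190518T133231_20190518T133919_T22JCM_2021_05_19_01_18_22.tif"

# Inner gsub: keep only the part before the first underscore.
prefix <- gsub("^([^_]*)_.*", "\\1", file.name)

# Outer gsub: keep the run of eight consecutive digits in that part.
date_text <- gsub(".*([0-9]{8}).*", "\\1", prefix)

# Convert the yyyymmdd string to a Date.
date_value <- as.Date(date_text, format = "%Y%m%d")

prefix
date_text
date_value
```

The leading .* is greedy, so if the prefix held more than one run of eight digits the last one would be returned; in these file names there is only one such run.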