column name with brackets or other punctuations for dplyr group_by
I think you can make this work if you enclose the "illegal" column names in backticks. For example, let's say I start with this data frame (called df):
BILLING.STATUS.(COMPLETED./.INCOMPLETE) ORDER.VALUE.(USD)
1 A 0.01544196
2 A 0.95522706
3 B 1.13479303
4 B 1.22848285
Then I can summarise it like this:
df %>% group_by(`BILLING.STATUS.(COMPLETED./.INCOMPLETE)`) %>%
summarise(count=n(),
mean = mean(`ORDER.VALUE.(USD)`))
Giving:
BILLING.STATUS.(COMPLETED./.INCOMPLETE) count mean
1 A 2 0.4853345
2 B 2 1.1816379
Backticks also come in handy for referring to or creating variable names with whitespace. You can find a number of questions related to dplyr and backticks on SO, and there's also some discussion of backticks in the help for Quotes (see ?Quotes).
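As a minimal sketch of the whitespace case, the example below refers to an existing column whose name contains a space and creates a new one in the same way. The data frame sales and its column names are made up for illustration; check.names = FALSE is used so that data.frame() keeps the space instead of converting it.

```r
library(dplyr)

# Hypothetical data; check.names = FALSE preserves the non-syntactic name
sales <- data.frame(`order value` = c(10, 20, 30), check.names = FALSE)

# Backticks on the right refer to the existing column,
# backticks on the left create a new column whose name contains spaces
result <- sales %>%
  mutate(`order value doubled` = `order value` * 2)

names(result)
```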
How to deal with nonstandard column names (white space, punctuation, starts with numbers)
You may select the variable by using backticks (`).
select(df, `a a`)
# a a
# 1 1
# 2 2
# 3 3
However, if your main objective is to rename the column, you may use rename in the plyr package, in which you can use both "" and ``.
# plyr:: prefix avoids a clash with dplyr::rename, which has no replace argument
plyr::rename(df, replace = c("a a" = "a"))
plyr::rename(df, replace = c(`a a` = "a"))
Or in base R:
names(df)[names(df) == "a a"] <- "a"
For a more thorough description of the use of various quotes, see ?Quotes. The 'Names and Identifiers' section is especially relevant here:
"other [syntactically invalid] names can be used provided they are quoted. The preferred quote is the backtick".
See also ?make.names about valid names.
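If renaming is acceptable, make.names() can convert names to syntactically valid ones, so that no quoting is needed afterwards. The following is only a sketch; the vector raw_names is invented for illustration.

```r
# Hypothetical column names, three of which are not syntactically valid
raw_names <- c("a a", "ORDER VALUE (USD)", "2nd col", "ok_name")

# Invalid characters are replaced by "." and an "X" is prepended
# where a name would otherwise start with a digit
clean_names <- make.names(raw_names, unique = TRUE)

# A name that make.names() leaves unchanged was already valid
already_valid <- raw_names == clean_names

clean_names
```

To apply this to a data frame, names(df) <- make.names(names(df), unique = TRUE) should work in the same way.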
See also this post about renaming in dplyr
renaming columns in R with `-` symbol
This can be done using rename. You just have to put the column names with special characters inside backticks (`):
temp <- temp %>% dplyr::rename(`Re-ply` = re_ply,
total_id = total_ID,
`Re-ask` = re_ask)
names(temp)
[1] "Re-ply" "total_id" "Re-ask"
Remove parentheses and text within from strings in R
A gsub should work here:
gsub("\\s*\\([^\\)]+\\)","",as.character(companies$Name))
# or using "raw" strings as of R 4.0
gsub(r"{\s*\([^\)]+\)}","",as.character(companies$Name))
# [1] "Company A Inc" "Company B" "Company C Inc."
# [4] "Company D Inc." "Company E"
Here we just replace occurrences of "(...)" with nothing (also removing any leading space). R makes it look worse than it is because of all the escaping we have to do for the parentheses, since they are special characters in regular expressions.
compute sum for space string column
Does this work?
df %>% group_by(`a 1`) %>% summarise(tx = sum(`t t`))
case_when() issue with evaluating multiple conditions
Here is a completely different, database-like approach which uses a lookup table of fruits and fruit types. This approach can handle an arbitrary number of fruits and fruit types.
# create or read lookup table
lut <- readr::read_table(
"fruit fruit_type
banana 34
apple 45
orange 88")
library(dplyr)
library(tidyr)
df %>%
mutate(fruit = fruits) %>%
separate_rows(fruit, sep = "\\s+") %>%
left_join(lut, by = "fruit") %>%
group_by(ID) %>%
mutate(rowid = row_number(ID)) %>%
pivot_wider(id_cols = c(ID, fruits), values_from = fruit_type,
names_prefix = "fruit_type", names_from = rowid)
ID fruits fruit_type1 fruit_type2 fruit_type3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 banana apple orange 34 45 88
2 2 apple orange 45 88 NA
3 3 orange 88 NA NA
4 4 orange apple 88 45 NA
5 5 nothing NA NA NA
The fruits column is copied and then split. Now, column fruit contains a single fruit on separate rows. These are joined with the lookup table lut to get the matching fruit_type value. Before this result can be reshaped to wide format, the new columns need to be numbered. This is achieved by numbering the rows within each ID.
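The numbering step can be looked at in isolation. The sketch below uses base R's ave() rather than the dplyr code from the answer, and the id vector is made up, but it produces the same kind of within-group counter that later becomes the column name suffix.

```r
# Hypothetical IDs after the rows have been separated: one row per fruit
id <- c(1, 1, 1, 2, 2, 3)

# Number the rows within each ID: the counter restarts for every new ID
rowid <- ave(seq_along(id), id, FUN = seq_along)

# These counters are what names_prefix turns into the wide column names
new_names <- paste0("fruit_type", rowid)
```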
Edit:
According to OP's comment, the production dataset contains paragraphs where the keywords aren't separated only by white space but also by punctuation marks like commas, or appear in their plural form with a trailing s. In addition, the keywords may be written in upper case or may appear multiple times in a paragraph.
Instead of separating all words we can try to extract the keywords from the paragraphs. This can be achieved by combining all keywords into one regular expression with alternation (|). So, the regular expression banana|apple|orange will match any of the fruits.
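To see what the combined pattern does on its own, here is a small base R sketch. It uses regmatches() and gregexpr() instead of stringr, and the keywords vector simply stands in for lut$fruit.

```r
# Stand-in for lut$fruit
keywords <- c("banana", "apple", "orange")

# Collapse the keywords into a single alternation pattern
pattern <- paste(keywords, collapse = "|")

# Lower-case the text first so that "Oranges" is found as well
text <- tolower("There are bananas, Oranges, and also apples here")

# Extract every match, in order of appearance
matches <- regmatches(text, gregexpr(pattern, text))[[1]]
```

Because only the keyword itself is matched, the plural s and the surrounding punctuation are simply left out of the result.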
For testing we need a more complex use case:
df <- tibble(fruits = readr::read_lines(
"There are bananas, oranges, and also apples here
One Orange and another orange make two Oranges
apples and pineapples go together
But pineapples alone must not be counted
banana apple orange
apple orange
orange
orange apple
nothing")
) %>%
mutate(ID = row_number())
With the modified code
df %>%
mutate(fruit = fruits %>%
tolower() %>%
stringr::str_extract_all(paste(lut$fruit, collapse = "|")) %>%
lapply(unique)) %>%
unnest(fruit, keep_empty = TRUE) %>%
left_join(lut, by = "fruit") %>%
group_by(ID) %>%
mutate(rowid = row_number(ID)) %>%
pivot_wider(id_cols = c(ID, fruits), values_from = fruit_type,
names_prefix = "fruit_type", names_from = rowid)
we get
ID fruits fruit_type1 fruit_type2 fruit_type3
<int> <chr> <dbl> <dbl> <dbl>
1 1 "There are bananas, oranges, and also apples here" 34 88 45
2 2 "One Orange and another Orange make two Oranges " 88 NA NA
3 3 "apples and pineapples go together" 45 NA NA
4 4 "But pineapples alone must not be counted" 45 NA NA
5 5 "banana apple orange" 34 45 88
6 6 "apple orange" 45 88 NA
7 7 "orange" 88 NA NA
8 8 "orange apple" 88 45 NA
9 9 "nothing" NA NA NA
This approach detects keywords in plural form and regardless of upper/lower case.
Note that I have deliberately chosen to count multiple occurrences of a keyword in a paragraph only once, by means of lapply(unique). If each occurrence is to be counted separately, just remove that line of code.
However, this approach has at least one drawback: the word pineapple is counted as apple because it contains apple as a substring.
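One possible way around this drawback, which is not part of the original answer, is to require a word boundary (\\b) in front of the keyword. The sketch below uses base R with perl = TRUE; the same pattern should also work with stringr::str_extract_all(), but that is an assumption to verify against the real data.

```r
keywords <- c("banana", "apple", "orange")

# \\b demands a word boundary before the keyword; no boundary is required
# after it, so plural forms such as "apples" are still matched
pattern <- paste0("\\b(?:", paste(keywords, collapse = "|"), ")")

extract_keywords <- function(text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

mixed <- extract_keywords("apples and pineapples go together")
only_pineapple <- extract_keywords("But pineapples alone must not be counted")
```

Here mixed contains a single apple (from apples), and only_pineapple is empty because pineapples no longer matches.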
Validate name with email in dataframe
Maybe you can try
within(
df,
consistent <- mapply(
# check containment in both directions: name part in email part, and vice versa
function(x, y) 1 - any(mapply(grepl, x, y) | mapply(grepl, y, x)),
strsplit(name, ","),
strsplit(gsub("@.*", "", email), "\\.")
)
)
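which gives the following result, where 0 marks rows in which some part of the name matches the part of the email before the @, and 1 marks rows without any match: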
name email consistent
1 maay,bhtr maay.bhtr@email.com 0
2 nsgu,nhuts thang nsgu.nhuts@gmail.com 0
3 affat,nurfs asfa.1234@gmail.com 1
4 nukhyu,biyts nukhyu.biyts@gmail.com 0
5 ngyst,muun ngyst.muun@gmail.com 0
6 nsgyu,noon nsgyu.noon@gmail.com 0
7 utrs guus,book utrs.book@hotmail.com 0
8 thum,cryant thum.cryant@live.com 0
9 mumt,cant mumt.cant@gmail.com 0
10 bhan,btan bhan.btan@gmail.com 0
11 khtri,ntuk khtri.ntuk@gmail.c.om 0
12 ghaan,rstu chang.lee@gmail.com 1
13 shaan,btqaan shaan.btqaan@gmail.com 0
14 nhue,bjtraan nhue.bjtraan@gmail.com 0
15 wutys,cyun wutys.cyun@gmailcom 0
16 hrtsh,jaan hrtsh.jaan@gmail.com 0
R - Splitting strings in a column on a character and keeping specific results
We can capture the part we want as a group. From the beginning (^) of the string, match one or more characters that are not a dot ([^.]+), followed by a literal dot (\\.), then another run of non-dot characters captured as a group (([^.]+)), followed by a dot and any remaining characters (\\..*). The whole match is then replaced with the backreference (\\1) to the captured group.
library(dplyr)
df1 %>%
mutate(D= sub("^[^.]+\\.([^.]+)\\..*", "\\1", A))
# A B C D
#1 awer.ttp.net Code 554 ttp
#2 abcd.ttp.net Code 747 ttp
#3 asdf.ttp.net Part 554 ttp
#4 xyz.ttp.net Part 747 ttp
Or using extract
library(tidyr)
df1 %>%
extract(A, into = 'D', "^[^.]+\\.([^.]+).*", remove = FALSE)
Note that dplyr isn't needed for this; base R is enough:
df1$D <- sub("^[^.]+\\.([^.]+)\\..*", "\\1", df1$A)
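As an alternative to the regular expression, the strings can be split on the dot and the second piece picked out. This is only a sketch; the vector A repeats two values from the example above, and fixed = TRUE makes strsplit() treat the dot literally.

```r
# Two values taken from column A of the example
A <- c("awer.ttp.net", "abcd.ttp.net")

# Split on a literal dot; fixed = TRUE avoids regex interpretation of "."
parts <- strsplit(A, ".", fixed = TRUE)

# Keep the second piece of every string (NA if there is no second piece)
D <- vapply(parts, function(p) p[2], character(1))
```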