Column Name with Brackets or Other Punctuations for Dplyr Group_By

I think you can make this work if you enclose the "illegal" column names in backticks. For example, let's say I start with this data frame (called df):

  BILLING.STATUS.(COMPLETED./.INCOMPLETE) ORDER.VALUE.(USD)
1                                       A        0.01544196
2                                       A        0.95522706
3                                       B        1.13479303
4                                       B        1.22848285

Then I can summarise it like this:

df %>%
  group_by(`BILLING.STATUS.(COMPLETED./.INCOMPLETE)`) %>%
  summarise(count = n(),
            mean = mean(`ORDER.VALUE.(USD)`))

Giving:

  BILLING.STATUS.(COMPLETED./.INCOMPLETE) count      mean
1                                       A     2 0.4853345
2                                       B     2 1.1816379

Backticks also come in handy for referring to or creating variable names with whitespace. You can find a number of questions related to dplyr and backticks on SO, and there's also some discussion of backticks in the help for Quotes.
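
As a minimal sketch of how such a data frame can arise and how backticks behave outside dplyr: the code below rebuilds the example data by hand, assuming the values shown above. The object names values, same_values and group_means are only illustrative.

```r
# Rebuild the example data; check.names = FALSE stops data.frame()
# from rewriting the nonstandard column names
df <- data.frame(
  `BILLING.STATUS.(COMPLETED./.INCOMPLETE)` = c("A", "A", "B", "B"),
  `ORDER.VALUE.(USD)` = c(0.01544196, 0.95522706, 1.13479303, 1.22848285),
  check.names = FALSE
)

# Backticks work with $ ...
values <- df$`ORDER.VALUE.(USD)`
# ... while [[ ]] takes the name as an ordinary string, so no backticks are needed
same_values <- df[["ORDER.VALUE.(USD)"]]

# Group means in base R, grouped by the billing status column
group_means <- tapply(values, df$`BILLING.STATUS.(COMPLETED./.INCOMPLETE)`, mean)
print(group_means)
```

The rule of thumb is that backticks are needed wherever R parses the name as code, and plain strings suffice wherever a function expects a character value.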

How to deal with nonstandard column names (white space, punctuation, starts with numbers)

You may select the variable by using backticks `.

select(df, `a a`)
# a a
# 1 1
# 2 2
# 3 3

However, if your main objective is to rename the column, you can use rename() from the plyr package, which accepts the old name in either double quotes ("") or backticks (``).

rename(df, replace = c("a a" = "a"))
rename(df, replace = c(`a a` = "a"))

Or in base R:

names(df)[names(df) == "a a"] <- "a"

For a more thorough description on the use of various quotes, see ?Quotes. The 'Names and Identifiers' section is especially relevant here:

"other [syntactically invalid] names can be used provided they are quoted. The preferred quote is the backtick".

See also ?make.names about valid names.
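
To get a feel for what counts as a valid name, here is a short sketch using make.names() on a few of the names from this page (the vector names nonstandard, valid and is_valid are only illustrative):

```r
nonstandard <- c("a a", "ORDER.VALUE.(USD)", "Re-ply", "1st")

# Invalid characters are replaced by ".", and a name that starts
# with a digit gets an "X" prefix
valid <- make.names(nonstandard)
print(valid)
# [1] "a.a"               "ORDER.VALUE..USD." "Re.ply"            "X1st"

# A name is syntactically valid if make.names() leaves it unchanged
is_valid <- nonstandard == valid
print(is_valid)
```

Any name for which is_valid is FALSE has to be wrapped in backticks when it is used as a symbol.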

See also this post about renaming in dplyr

renaming columns in R with `-` symbol

This can be done using rename. You just have to wrap the column names that contain special characters in backticks (`):

temp <- temp %>% dplyr::rename(`Re-ply` = re_ply,
                               total_id = total_ID,
                               `Re-ask` = re_ask)
names(temp)
[1] "Re-ply" "total_id" "Re-ask"

Remove parentheses and text within from strings in R

A gsub should work here

gsub("\\s*\\([^\\)]+\\)","",as.character(companies$Name))
# or using "raw" strings as of R 4.0
gsub(r"{\s*\([^\)]+\)}","",as.character(companies$Name))

# [1] "Company A Inc" "Company B" "Company C Inc."
# [4] "Company D Inc." "Company E"

Here we just replace occurrences of "(...)" with nothing (also removing any leading space). R makes it look worse than it is because of all the escaping needed for the parentheses, which are special characters in regular expressions.
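
The pattern can be read piece by piece. The sketch below uses a made-up input vector, since the original companies$Name values are not shown; company_names and cleaned are hypothetical names.

```r
# Hypothetical input; the real companies$Name column is not shown in the answer
company_names <- c("Company A Inc (AI)", "Company B (BC)",
                   "Company C Inc. (Europe)", "Company E")

# \\s*      optional whitespace before the opening parenthesis
# \\(       a literal "("
# [^\\)]+   one or more characters that are not ")"
# \\)       a literal ")"
cleaned <- gsub("\\s*\\([^\\)]+\\)", "", company_names)
print(cleaned)
# [1] "Company A Inc"  "Company B"      "Company C Inc." "Company E"
```

Because the middle part requires at least one character, an empty pair "()" would be left in place by this pattern.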

compute sum for space string column

Does this work?

df %>% group_by(`a 1`) %>% summarise(tx = sum(`t t`))

case_when() issue with evaluating multiple conditions

Here is a completely different, database-like approach which uses a lookup table of fruits and fruit types. This approach can handle an arbitrary number of fruits and fruit types.

# create or read lookup table
lut <- readr::read_table(
"fruit fruit_type
banana 34
apple 45
orange 88")

library(dplyr)
library(tidyr)
df %>%
  mutate(fruit = fruits) %>%
  separate_rows(fruit, sep = "\\s+") %>%
  left_join(lut, by = "fruit") %>%
  group_by(ID) %>%
  mutate(rowid = row_number(ID)) %>%
  pivot_wider(id_cols = c(ID, fruits), values_from = fruit_type,
              names_prefix = "fruit_type", names_from = rowid)
     ID fruits              fruit_type1 fruit_type2 fruit_type3
  <dbl> <chr>                     <dbl>       <dbl>       <dbl>
1     1 banana apple orange          34          45          88
2     2 apple orange                 45          88          NA
3     3 orange                       88          NA          NA
4     4 orange apple                 88          45          NA
5     5 nothing                      NA          NA          NA

The fruits column is copied and then split. Now, column fruit contains a single fruit on separate rows. These are joined with the lookup table lut to get the matching fruit_type value. Before this result can be reshaped to wide format, the new columns need to be numbered. This is achieved by numbering the rows within each ID.
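
The split-and-lookup idea can also be sketched in base R, where match() plays the role of the join against the lookup table. This is only an illustration with a hand-built lookup table and input vector (both assumed to mirror the example), not a replacement for the pipeline above; fruit_types is a hypothetical name.

```r
# Lookup table and input, typed in by hand to mirror the example
lut <- data.frame(
  fruit = c("banana", "apple", "orange"),
  fruit_type = c(34, 45, 88)
)
fruits <- c("banana apple orange", "apple orange", "orange",
            "orange apple", "nothing")

# Split each string on whitespace, then look every word up in the table;
# words without an entry yield NA, like unmatched rows in a left join
fruit_types <- lapply(strsplit(fruits, "\\s+"), function(words) {
  lut$fruit_type[match(words, lut$fruit)]
})
names(fruit_types) <- fruits
print(fruit_types)
```

The result is a list with one numeric vector per input string, so it still needs padding with NA if a wide table like the one above is wanted.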

Edit:

According to OP's comment, the production dataset contains paragraphs in which the keywords are separated not only by white space but also by punctuation marks such as commas, or appear in their plural form with a trailing s. In addition, the keywords may be written in upper case or may appear multiple times in a paragraph.

Instead of separating all words we can try to extract the keywords from the paragraphs. This can be achieved by combining all keywords into one regular expression with alternation |. So, the regular expression banana|apple|orange will match either of the fruits.

For testing we need a more complex use case:

df <- tibble(fruits = readr::read_lines(
"There are bananas, oranges, and also apples here
One Orange and another orange make two Oranges
apples and pineapples go together
But pineapples alone must not be counted
banana apple orange
apple orange
orange
orange apple
nothing")
) %>%
mutate(ID = row_number())

With the modified code

df %>%
  mutate(fruit = fruits %>%
           tolower() %>%
           stringr::str_extract_all(paste(lut$fruit, collapse = "|")) %>%
           lapply(unique)) %>%
  unnest(fruit, keep_empty = TRUE) %>%
  left_join(lut, by = "fruit") %>%
  group_by(ID) %>%
  mutate(rowid = row_number(ID)) %>%
  pivot_wider(id_cols = c(ID, fruits), values_from = fruit_type,
              names_prefix = "fruit_type", names_from = rowid)

we get

     ID fruits                                             fruit_type1 fruit_type2 fruit_type3
  <int> <chr>                                                    <dbl>       <dbl>       <dbl>
1     1 "There are bananas, oranges, and also apples here"          34          88          45
2     2 "One Orange and another orange make two Oranges"            88          NA          NA
3     3 "apples and pineapples go together"                         45          NA          NA
4     4 "But pineapples alone must not be counted"                  45          NA          NA
5     5 "banana apple orange"                                       34          45          88
6     6 "apple orange"                                              45          88          NA
7     7 "orange"                                                    88          NA          NA
8     8 "orange apple"                                              88          45          NA
9     9 "nothing"                                                   NA          NA          NA

This approach detects keywords in their plural form and regardless of upper or lower case.

Note that I have deliberately chosen to count multiple occurrences of a keyword in a paragraph only once by lapply(unique). If each occurrence is to be counted separately then just remove that line of code.

However, there is one (at least) drawback of this approach: The word pineapple is counted as apple because it contains apple as substring.
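
One possible way around the substring problem, offered only as a sketch and not as part of the original answer, is to anchor each keyword at word boundaries with \\b and to allow an optional plural s. The example below tests each keyword separately in base R, so unlike the pipeline above it reports which keywords occur but not the order in which they appear. The names keywords, texts and found are hypothetical.

```r
keywords <- c("banana", "apple", "orange")
texts <- c(
  "apples and pineapples go together",
  "But pineapples alone must not be counted",
  "One Orange and another orange make two Oranges"
)

# For each keyword, build a pattern such as \bapples?\b:
# \b anchors the match at word boundaries, s? allows an optional plural s
found <- sapply(keywords, function(kw) {
  grepl(paste0("\\b", kw, "s?\\b"), texts, ignore.case = TRUE, perl = TRUE)
})
rownames(found) <- seq_along(texts)
print(found)
```

With this pattern the second text no longer counts as containing apple, because the "apple" inside "pineapples" is not preceded by a word boundary.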

Validate name with email in dataframe

Maybe you can try

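within(
  df,
  # 0 if any name part is found in the corresponding email part, 1 otherwise
  consistent <- mapply(
    function(x, y) 1 - any(mapply(grepl, x, y)),
    strsplit(name, ","),                     # name parts, split on commas
    strsplit(gsub("@.*", "", email), "\\.")  # local part of the email, split on dots
  )
)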

which gives

               name                  email consistent
1         maay,bhtr    maay.bhtr@email.com          0
2  nsgu,nhuts thang   nsgu.nhuts@gmail.com          0
3       affat,nurfs    asfa.1234@gmail.com          1
4      nukhyu,biyts nukhyu.biyts@gmail.com          0
5        ngyst,muun   ngyst.muun@gmail.com          0
6        nsgyu,noon   nsgyu.noon@gmail.com          0
7    utrs guus,book  utrs.book@hotmail.com          0
8       thum,cryant   thum.cryant@live.com          0
9         mumt,cant    mumt.cant@gmail.com          0
10        bhan,btan    bhan.btan@gmail.com          0
11       khtri,ntuk  khtri.ntuk@gmail.c.om          0
12       ghaan,rstu    chang.lee@gmail.com          1
13     shaan,btqaan shaan.btqaan@gmail.com          0
14     nhue,bjtraan nhue.bjtraan@gmail.com          0
15       wutys,cyun    wutys.cyun@gmailcom          0
16       hrtsh,jaan   hrtsh.jaan@gmail.com          0

R - Splitting strings in a column on a character and keeping specific results

We can capture the part we want as a group. The pattern matches one or more characters that are not a dot ([^.]+) from the beginning of the string (^), followed by a literal dot (\\.), followed by another run of non-dot characters captured as a group (([^.]+)), followed by a dot and any remaining characters (\\..*). The whole match is then replaced with the backreference (\\1) to the captured group.

library(dplyr)
df1 %>%
  mutate(D = sub("^[^.]+\\.([^.]+)\\..*", "\\1", A))
#             A    B   C   D
#1 awer.ttp.net Code 554 ttp
#2 abcd.ttp.net Code 747 ttp
#3 asdf.ttp.net Part 554 ttp
#4  xyz.ttp.net Part 747 ttp

Or using extract

library(tidyr)
df1 %>%
  extract(A, into = 'D', "^[^.]+\\.([^.]+).*", remove = FALSE)

Note that dplyr is not needed for this; base R is enough:

df1$D <- sub("^[^.]+\\.([^.]+)\\..*", "\\1", df1$A)
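
If a regular expression feels heavy for this task, splitting on the dot and keeping the second piece is another option. This is a sketch that assumes every value contains at least two dot-separated parts; A is typed in by hand from the example output, and parts and D are illustrative names.

```r
# Same values as column A in the example above
A <- c("awer.ttp.net", "abcd.ttp.net", "asdf.ttp.net", "xyz.ttp.net")

# fixed = TRUE treats "." as a literal dot rather than a regex wildcard
parts <- strsplit(A, ".", fixed = TRUE)

# Take the second piece of each split string
D <- vapply(parts, `[`, character(1), 2)
print(D)
# [1] "ttp" "ttp" "ttp" "ttp"
```

Changing the index 2 to another position picks a different piece, which is the main convenience over rewriting the capture group.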

