Group Data Frame by Pattern in R

R Identifying Dataframe Change Patterns by Groups

With the help of data.table rleid you can do -

library(dplyr)

df %>%
arrange(person, year) %>%
group_by(person) %>%
mutate(val = data.table::rleid(location)) %>%
arrange(person, location) %>%
group_by(location, .add = TRUE) %>%
summarise(move_back = any(val != lag(val, default = first(val)))) %>%
summarise(move_back = as.integer(any(move_back)))

# person move_back
# <chr> <int>
#1 Harry 1
#2 Lily 1
#3 Peter 0

How to create a group based on pattern from another column?

We can use findInterval like below

> transform(dt, group = 1 + findInterval(seq_along(value), which(value == 2), left.open = TRUE))
id value group
1 a 1 1
2 b 2 1
3 c 1 2
4 d 2 2
5 e 1 3
6 f 1 3
7 g 1 3
8 h 2 3
9 i 1 4
10 j 2 4

or cut

> transform(dt, group = as.integer(cut(seq_along(value), c(-Inf, which(value == 2)))))
id value group
1 a 1 1
2 b 2 1
3 c 1 2
4 d 2 2
5 e 1 3
6 f 1 3
7 g 1 3
8 h 2 3
9 i 1 4
10 j 2 4

R function for grouping rows based on patterns across columns?

Try this approach. Create a variable to collect the values across rows using c_across() and toString(). After that, format as factor and assign the suffix Group.. Here the code using tidyverse functions:

library(tidyverse)
#Code
dfnew <- df %>% group_by(gene) %>%
mutate(Var=toString(c_across(stg.1:stg.4))) %>%
ungroup() %>%
mutate(Var=paste0('Group.',as.numeric(factor(Var,levels = unique(Var),ordered = T))))

Output:

# A tibble: 5 x 6
gene stg.1 stg.2 stg.3 stg.4 Var
<int> <fct> <fct> <fct> <fct> <chr>
1 1 up up up down Group.1
2 2 up up up down Group.1
3 3 NA NA NA up Group.2
4 4 NA NA NA up Group.2
5 5 NA NA NA NA Group.3

If you only need a pattern, try this:

#Code 2
dfnew <- df %>% group_by(gene) %>%
mutate(Var=toString(c_across(stg.1:stg.4)))

Output:

# A tibble: 5 x 6
# Groups: gene [5]
gene stg.1 stg.2 stg.3 stg.4 Var
<int> <fct> <fct> <fct> <fct> <chr>
1 1 up up up down up, up, up, down
2 2 up up up down up, up, up, down
3 3 NA NA NA up NA, NA, NA, up
4 4 NA NA NA up NA, NA, NA, up
5 5 NA NA NA NA NA, NA, NA, NA

Grouping a mass spectrometry data in R by loop

Let's write this script with a dummy data

df<-data.frame(rep(data.frame(n1=c(0:6),n2=c(0:6)),3))

stack_scipt1<-function(df,replicate_value){ #df= dataframe, replicate value= no, of replicates input by user
df<-read.csv(file.choose(),header = TRUE) #loads csv file into R as dataframe from pc folder where it is stored
df_col<-ncol(df) #calculates no. of columns in dataframe
groups<-sort(rep(0:((df_col/replicate_value)-1),replicate_value)) #creates user determined groups
id<-list() #creates empty list
for (i in 1:length(unique(groups))){
id[[i]]<-which(groups == unique(groups)[i])} #creates list of groups
names(id)<-paste0("id",unique(groups)) #assigns group based names to the list "id"
data<-list() #creates empty list
for (i in 1:length(id)){
data[[i]]<-df[,id[[i]]]} #creates list of dataframe columns sorted by groups
names(data)<-paste0("data",unique(groups)) #assigns group based names to the list "data"
return(data)}

Executing the script:

>stack_script1(df,3)

Note:

  1. This script will work for dataframes of any column number as long as column number is a multiple of replicate value
  2. The line df<-read.csv(file.choose(),header = TRUE) is not compulsory if you have imported the data into R previously. I prefer to use it because it allows me to import data from save folder without needing to import it additionally
  3. The output is a list containing grouped data frames. It's not possible to return multiple dataframe vectors from a script

Find pattern of combination by group in R

The problem is essentially one of constructing an adjacency table based on common memberships, e.g. Working with Bipartite/Affiliation Network Data in R. To do that, we make a table out of the data (after eliminating duplicates), and then take the cross-product.

dd <- unique(df)
tab <- table(dd)
dd <- crossprod(t(tab))
diag(dd) <- 0
# ID
# ID 1 2 3 4 5
# 1 0 3 3 2 2
# 2 3 0 2 1 3
# 3 3 2 0 2 1
# 4 2 1 2 0 0
# 5 2 3 1 0 0

The table above allows us to see the number of categories that IDs share. Now we just have to go through the rows; for each row, I select the first ID that has a value of at least 3 (matched).

matched <- apply(dd >= 3, MAR = 1, function(x) which(x == TRUE)[1])   
# 1 2 3 4 5
# 2 1 1 NA 2

So "1" matched with "2", "2" matched with "1", "3" matched with "1", "4" has no matches, "5" matched with "2". Finish off by manipulating this output to get the desired final product:

out <- apply(cbind(as.numeric(names(matched)), matched), MAR = 1, function(x) {
if (any(is.na(x))) {
data.frame(var2 = "No", var3 = x[1])
} else {
data.frame(var2 = "Yes", var3 = paste(sort(x), collapse = "-"))
}
})
out <- plyr::ldply(out, .id = "ID")

merge(df, out, all.x = TRUE)
# ID var1 var2 var3
# 1 1 A Yes 1-2
# 2 1 B Yes 1-2
# 3 1 C Yes 1-2
# 4 1 A Yes 1-2
# 5 1 D Yes 1-2
# 6 2 D Yes 1-2
# 7 2 C Yes 1-2
# 8 2 D Yes 1-2
# 9 2 B Yes 1-2
# 10 2 F Yes 1-2
# 11 3 A Yes 1-3
# 12 3 B Yes 1-3
# 13 3 C Yes 1-3
# 14 3 C Yes 1-3
# 15 4 A No 4
# 16 4 B No 4
# 17 5 D Yes 2-5
# 18 5 D Yes 2-5
# 19 5 C Yes 2-5
# 20 5 C Yes 2-5
# 21 5 F Yes 2-5

Group by and keep columns with matching pattern

You can do this by finding columns that have a bijection from gp to the values for that column. That is, for every value in gp (x or y) there is exactly one matching value in the variable column e.g. V1 (0 or 1). The reverse is also true, for every value in a variable column like V1 there is exactly one match in gp.

To operate on all columns at once, start by pivoting to a longer form. This will also remove the easy duplicates right away.

uniq <- df %>% pivot_longer(-gp) %>% distinct(name, gp, value)
# name gp value
# <chr> <fct> <int>
# 1 V1 x 0
# 2 V2 x 1
# 3 V3 x 0
# 4 V4 x 0
# 5 V5 x 0
# 6 V6 x 0
# 7 V2 x 0
# 8 V5 x 1
# ...
# 14 V6 y 1
# 15 V6 x 1

Then you can find the "exactly once" matches in one direction by counting how often each gp value shows up per name. It will be exactly once if it always matches a constant value for the variable column.

match_left <- uniq %>%
count(name, gp) %>%
group_by(name) %>%
filter(max(n) == 1) %>%
distinct(name)

match_left
# # A tibble: 3 x 1
# # Groups: name [3]
# name
# <chr>
# 1 V1
# 2 V3
# 3 V4

Do the same thing, but in reverse for the value column.

match_right <- uniq %>%
count(name, value) %>%
group_by(name) %>%
filter(max(n) == 1) %>%
distinct(name)

match_right
# # A tibble: 2 x 1
# # Groups: name [2]
# name
# <chr>
# 1 V1
# 2 V3

Now that we know which variables to keep, we can merge everything back together and reshape to the wide form.

matches <- df %>%
mutate(i = row_number()) %>%
pivot_longer(-c(i, gp)) %>%
inner_join(match_left, on='name') %>%
inner_join(match_right, on='name') %>%
spread(name, value) %>%
arrange(i) %>%
select(-i)

matches
# # A tibble: 4 x 3
# gp V1 V3
# <fct> <int> <int>
# 1 x 0 0
# 2 x 0 0
# 3 y 1 1
# 4 x 0 0

Group_by for consecutive patterns in R

library(dplyr)
df %>% mutate(flag=case_when(grepl(paste(g1,collapse = '|'),col1)~1,
grepl(paste(g2,collapse = '|'),col1)~2,
TRUE~3),
group=data.table::rleid(flag))

num col1 flag group
1 1 SENSOR_01 2 1
2 2 SENSOR_05 1 2
3 3 SENSOR_05, SENSOR_07 1 2
4 4 SENSOR_05, SENSOR_07 1 2
5 5 SENSOR_07 1 2
6 6 SENSOR_05 1 2
7 7 SENSOR_01, SENSOR_03 2 3
8 8 SENSOR_01 2 3
9 9 SENSOR_03 2 3
10 10 SENSOR_01 2 3
11 11 SENSOR_05 1 4

PS: I used SENSOR_05 or SENSOR_07 not SENSOR_05 and SENSOR_07

Group string values in column by their beginning

First we extract the first 3 words or 2 words that are followed by :, using stringr::str_extract or you could just use sub to match the full value and only capture the given expression i.e sub('^(expre).+$', '\\1', value) , the regex pattern is as follows \w+ \w+(:| \w+) i.e match two words \w+ \w+ then either match : or another word.

library(stringr)
df %>%
mutate(beginnings= str_extract(value, "\\w+ \\w+(:| \\w+)")) %>%
group_by(beginnings)
# A tibble: 7 x 3
# Groups: beginnings [3]
ID value beginnings
<int> <fct> <chr>
1 1 request body: <?xml version=2.0> values received request body:
2 2 request body: <code> jnwg3425 request body:
3 3 request body: <?xml version=2.0, <PlatCode>, <code> qwefn2 request body:
4 4 Error in message received Error in message
5 5 Error in message received Error in message
6 6 Push forward message x3535 Push forward message
7 7 Push forward message <MarkCheckMSG> Push forward message

Using a different regular expression

(\w+ )+[a-z]{2,}:? => match as much words followed by space as possible ((\w+ )+) followed by more then two letters [a-z]{2,} and : if it exists.

df %>%
mutate(beginings= str_extract(value, "(\\w+ )+[a-z]{2,}:?")) %>%
group_by(beginings)
# A tibble: 7 x 3
# Groups: beginings [3]
ID value beginings
<int> <fct> <chr>
1 1 request body: <?xml version=2.0> values received request body:
2 2 request body: <code> jnwg3425 request body:
3 3 request body: <?xml version=2.0, <PlatCode>, <code> qwefn2 request body:
4 4 Error in message received Error in message received
5 5 Error in message received Error in message received
6 6 Push forward message x3535 Push forward message
7 7 Push forward message <MarkCheckMSG> Push forward message


Related Topics



Leave a reply



Submit