R aggregate and handle missing combinations
You can use xtabs from base:
as.data.frame(xtabs(~ subj + task + correct, data = newdf))
   subj task correct Freq
1     1    A       0    1
2     2    A       0    1
3     3    A       0    1
4     1    B       0    1
5     2    B       0    1
6     3    B       0    1
7     1    A       1    1
8     2    A       1    0
9     3    A       1    1
10    1    B       1    1
11    2    B       1    1
12    3    B       1    1
Even simpler, again in base from @Frank:
as.data.frame(table(newdf[1:3]))
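To check that both calls really report unobserved combinations, here is a minimal sketch. The original newdf is not shown, so the one below is a hypothetical stand-in: it contains every subj/task/correct combination except one, which is consistent with the output above.

```r
# Hypothetical stand-in for newdf: all 12 combinations of subj, task and
# correct, minus the single combination (subj 2, task "A", correct 1).
newdf <- expand.grid(subj = 1:3, task = c("A", "B"), correct = 0:1,
                     stringsAsFactors = FALSE)
newdf <- newdf[!(newdf$subj == 2 & newdf$task == "A" & newdf$correct == 1), ]

# Both approaches tabulate over the full cross of the observed values,
# so the missing combination appears with Freq 0 instead of vanishing.
by_xtabs <- as.data.frame(xtabs(~ subj + task + correct, data = newdf))
by_table <- as.data.frame(table(newdf[1:3]))

# Show only the combinations that never occur
by_table[by_table$Freq == 0, ]
```

Note that a value which never occurs at all in a column cannot be recovered this way unless the column is a factor that already carries it as a level.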
Add missing rows within combinations of factors
To get only the unique combinations that already exist in df, it might be better to use by to create a new reference data.table and then merge that back with the original one.
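Using (transtype is assumed to be defined already as the vector of transport types each combination should have; the output below is consistent with c('bus', 'train')):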
df2 <- df[, .(transp = transtype), by = .(var1,var2)]
merge(df, df2, by = c('var1','var2','transp'), all = TRUE)
gives:
   var1 var2 transp  z  y sample1 sample2
1:    a    y    bus  z st       4       3
2:    a    y  plane  z st      10       7
3:    a    y  train NA NA      NA      NA
4:    b    y    bus NA NA      NA      NA
5:    b    y  train  z co       8       9
6:    b    z    bus  z co       1       5
7:    b    z  train NA NA      NA      NA
8:    c    x    bus  z fu       6       4
9:    c    x  train NA NA      NA      NA
If you don't want the z and y columns to have NA values, you could do:
df2 <- df[, .(transp = transtype), by = .(var1,var2,z,y)]
merge(df, df2, by = c('var1','var2','transp','z','y'), all = TRUE)
which gives:
   var1 var2 transp  z  y sample1 sample2
1:    a    y    bus  z st       4       3
2:    a    y  plane  z st      10       7
3:    a    y  train  z st      NA      NA
4:    b    y    bus  z co      NA      NA
5:    b    y  train  z co       8       9
6:    b    z    bus  z co       1       5
7:    b    z  train  z co      NA      NA
8:    c    x    bus  z fu       6       4
9:    c    x  train  z fu      NA      NA
NOTE: If the z and y columns have more than one unique value for each var1/var2 combo, it is better to use na.locf from the zoo package to fill the NA values in the z and y columns.
Used data:
library(data.table)
df <- fread("z y var1 var2 transp sample1 sample2
z st a y bus 4 3
z st a y plane 10 7
z co b y train 8 9
z co b z bus 1 5
z fu c x bus 6 4")
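For readers without data.table, the same "reference table plus full merge" idea can be sketched in base R. This is only an illustration, not the answer's method: the data is retyped as a plain data.frame, and transtype is an assumption inferred from the output above.

```r
# Same rows as the data above, as a plain data.frame
df <- data.frame(
  z = "z",
  y = c("st", "st", "co", "co", "fu"),
  var1 = c("a", "a", "b", "b", "c"),
  var2 = c("y", "y", "y", "z", "x"),
  transp = c("bus", "plane", "train", "bus", "bus"),
  sample1 = c(4, 10, 8, 1, 6),
  sample2 = c(3, 7, 9, 5, 4)
)

# Assumption: the transport types every var1/var2 pair should have
transtype <- c("bus", "train")

# Reference table: each existing var1/var2 pair crossed with transtype
# (merge() with by = NULL returns the Cartesian product)
ref <- merge(unique(df[c("var1", "var2")]),
             data.frame(transp = transtype), by = NULL)

# A full outer merge adds the missing rows, with NA in the other columns,
# and keeps rows such as "plane" that are not part of the reference table
full <- merge(df, ref, by = c("var1", "var2", "transp"), all = TRUE)
full
```

Only pairs of var1 and var2 that already occur are expanded, which is the difference from crossing all values of all three columns.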
List all combinations of factors (interactions) with no observations in a dataframe, up to a given dimension, removing redundancies
Here's how you can continue your algo to pick out those sequences. First let's convert your list to a matrix, with NA's filled in. I find this easier to deal with, but I'm sure with some effort you can make it work with a list as well:
library(plyr)  # provides rbind.fill
m = as.matrix(rbind.fill(lapply(zz, as.data.frame)))
#      y  z  w  x
#[1,]  1  1 NA NA
#[2,] NA  1  1  1
#[3,]  1  1  1 NA
#[4,]  1  1  2 NA
#[5,]  1  1 NA  1
#[6,]  1  1 NA  2
Now let's introduce a function which will tell us if each row of a matrix given by subseq is a "subsequence" of seq, meaning it is already covered by seq as per OP's definitions:
is.subsequence = function(seq, subseq) {
  comp = seq == t(subseq)
  rowSums(t(is.na(comp) == is.na(seq) &
            matrix(!(comp %in% FALSE), nrow = length(seq)))) == length(seq)
}
All that's left is to iterate over the matrix and throw out the covered sequences. We can do this going from top to bottom because of the automatic arrangement of zz from OP.
i = 1
while(i < nrow(m)) {
  m = rbind(m[1:i,], tail(m, -i)[!is.subsequence(m[i,], tail(m, -i)),])
  i = i+1
}
m
#      y  z  w  x
#[1,]  1  1 NA NA
#[2,] NA  1  1  1
And you can go back to a list if you like:
apply(m, 1, na.omit)
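The pruning above can also be written out more explicitly. The sketch below is a self-contained restatement under assumptions: the matrix is typed in by hand from the output shown earlier (zz itself is not available here), and covers is a hypothetical helper name, not part of the original answer. It treats a row as covered when it agrees with the current row on every position where the current row is not NA.

```r
# Matrix from the answer above, typed in by hand (NA = factor not involved)
m <- rbind(c(1, 1, NA, NA),
           c(NA, 1, 1, 1),
           c(1, 1, 1, NA),
           c(1, 1, 2, NA),
           c(1, 1, NA, 1),
           c(1, 1, NA, 2))
colnames(m) <- c("y", "z", "w", "x")

# TRUE for each row of `rows` that matches `pattern` on every position
# where `pattern` is not NA (hypothetical helper, for illustration only)
covers <- function(pattern, rows) {
  fixed <- which(!is.na(pattern))
  vapply(seq_len(nrow(rows)), function(k) {
    vals <- rows[k, fixed]
    all(!is.na(vals) & vals == pattern[fixed])
  }, logical(1))
}

# Walk top to bottom, dropping later rows covered by the current row
i <- 1
while (i < nrow(m)) {
  rest <- m[-seq_len(i), , drop = FALSE]
  keep <- !covers(m[i, ], rest)
  m <- rbind(m[seq_len(i), , drop = FALSE], rest[keep, , drop = FALSE])
  i <- i + 1
}
m
```

Using drop = FALSE keeps every intermediate result a matrix, so the loop does not depend on how rbind treats a single leftover row.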
Complete with all combinations after counting on data.table
Here is one possible way to solve your problem. Note that the argument with=FALSE in the data.table context lets you select columns using the standard data.frame rules. In the example below, I assumed that the columns used to compute all combinations are passed to myfun as a character vector.
Keep in mind that no column in your dataset should be named gcases. Using .EACHI in by performs the operation once for each row in i.
myfun = function(d, g) {
  # get levels (for factors) and unique values for other types.
  fn <- function(x) if(is.factor(x)) levels(x) else unique(x)
  gcases <- lapply(setDT(d, key=g)[, g, with=FALSE], fn)
  # count based on all combinations
  d[do.call(CJ, gcases), .N, keyby=.EACHI]
}
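To make the "levels for factors, unique values otherwise" rule concrete, here is a rough base-R sketch of the same counting idea. It is not the data.table implementation above: count_all and the small data frame d are hypothetical names introduced only for this illustration.

```r
# Count rows for every combination of the columns named in g, including
# combinations that never occur (hypothetical base-R counterpart of myfun)
count_all <- function(d, g) {
  # value set per column: levels for factors, sorted unique values otherwise
  vals <- lapply(d[g], function(x) if (is.factor(x)) levels(x) else sort(unique(x)))
  # recode each column as a factor over its full value set, so table()
  # reports unobserved combinations with a count of 0
  as_f <- Map(function(x, v) factor(x, levels = v), d[g], vals)
  as.data.frame(table(as_f), responseName = "N", stringsAsFactors = FALSE)
}

# Made-up example: level "c" of grp is declared but never observed
d <- data.frame(grp = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
                yr  = c(2020L, 2021L, 2020L))
res <- count_all(d, c("grp", "yr"))
res
```

Because grp is a factor, its unused level "c" still contributes rows with N equal to 0, whereas yr only contributes the years that actually appear.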
How to get combinations of two factors and convert into a new factor in R
Just use paste0 to combine the vectors:
factor(paste0(y, x))
Or
factor(paste(y, x, sep=""))
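As a quick illustration, the sketch below applies this to two made-up vectors (x and y here are hypothetical, since the question's data is not shown) and compares the result with base R's interaction, which builds the combined factor directly.

```r
# Hypothetical input vectors
x <- c("a", "b", "a", "b")
y <- c("u", "u", "v", "u")

# Paste the values element-wise and turn the result into a factor
f <- factor(paste0(y, x))
levels(f)

# interaction() gives the same labels; drop = TRUE discards combinations
# that never occur (the order of the levels may differ from factor())
g <- interaction(y, x, sep = "", drop = TRUE)
as.character(g)
```

With an empty separator, different pairs can collapse to the same label (for example "ab" with "c" and "a" with "bc"), so a separator such as "_" may be safer when values vary in length.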
return ID's of unique combinations
df %>% group_by_at(.vars=-1) %>% summarize(IDs=list(ID))
This is similar to Sotos' solution, but it simplifies selecting the ID column by assuming that all other columns define the unique combinations, and the IDs column is a list column rather than a string.
# A tibble: 2 x 4
# Groups:   Var1, Var2 [2]
   Var1  Var2  Var3 IDs
  <int> <int> <int> <list>
1     0     0     1 <chr [2]>
2     1     1     0 <chr [1]>
Just for fun, you can further simplify it using tidyr's nest function:
require(tidyr)
nest(df,IDs=ID)
# A tibble: 2 x 4
   Var1  Var2  Var3 IDs
  <int> <int> <int> <S3: vctrs_list_of>
1     0     0     1 1_1, 1_3
2     1     1     0 1_2
This still leaves IDs as a list, which may or may not be useful for you, but displays it more clearly in the tibble. An extra benefit of keeping the column as a list rather than a string is that you can easily recreate the original table using unnest:
unnest(nest(df,IDs=ID),cols=IDs)
# A tibble: 3 x 4
   Var1  Var2  Var3 ID
  <int> <int> <int> <chr>
1     0     0     1 1_1
2     0     0     1 1_3
3     1     1     0 1_2
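If the tidyverse is not available, the grouping step can be approximated in base R with split. This is a sketch under assumptions: dd is reconstructed from the unnest output above (so its row order is a guess), and the "_"-joined key is an illustrative choice rather than anything from the original answers.

```r
# Data reconstructed from the unnest output above (row order assumed)
dd <- data.frame(ID   = c("1_1", "1_3", "1_2"),
                 Var1 = c(0L, 0L, 1L),
                 Var2 = c(0L, 0L, 1L),
                 Var3 = c(1L, 1L, 0L))

# One key per row, built from every column except ID
key <- do.call(paste, c(dd[-1], sep = "_"))

# List of ID vectors, one element per unique combination
ids <- split(dd$ID, key)
ids

# Flatten to one string per combination if a list is inconvenient
vapply(ids, paste, character(1), collapse = ", ")
```

The result is a named list, so the IDs stay separate values, much like the list column produced by nest.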
Data.table: Add rows for missing combinations of 2 factors without losing associated descriptive factors
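Here's one way: within each station/station.type group, join .SD to a small table holding both stages, so a stage missing from the group comes back as a row with NA in the remaining columns: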
dt[, .SD[.(stage=c("A", "B")), on="stage"], by=.(station, station.type)]
Filtering according to combination of matching data across variables in R
You can filter by the Group column after this:
df <- as.data.frame(df)
df$v <- sapply(seq(df[,1]), function(x)
  paste(sort(c(df[x,1], df[x,2])), collapse=""))
l <- data.frame(v=unique(df$v),
                Group=paste0("Group", seq(unique(df$v))))
df <- merge(df, l, by="v")[,-1]
df
  Word1 Word2 distance speaker session  Group
1 WordA WordX     1.40      JB       1 Group1
2 WordX WordA     0.23      JB       1 Group1
3 WordB WordY     2.10      JB       1 Group2
4 WordY WordB     2.30      JB       1 Group2
5 WordC WordZ     4.70      JB       1 Group3
6 WordZ WordC     0.51      JB       1 Group3
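The grouping key can also be built without a row-wise sapply, and the final filtering step is then a plain subset. The sketch below is one possible variant, assuming the data shown in the output above (retyped here, with the speaker and session columns left out for brevity); it is not the code from the answer.

```r
# Word pairs and distances retyped from the output above
df <- data.frame(
  Word1 = c("WordA", "WordX", "WordB", "WordY", "WordC", "WordZ"),
  Word2 = c("WordX", "WordA", "WordY", "WordB", "WordZ", "WordC"),
  distance = c(1.40, 0.23, 2.10, 2.30, 4.70, 0.51)
)

# Order-independent key: the smaller word first, then the larger one,
# so (WordA, WordX) and (WordX, WordA) produce the same key
key <- paste0(pmin(df$Word1, df$Word2), pmax(df$Word1, df$Word2))

# Number the keys in order of first appearance
df$Group <- paste0("Group", match(key, unique(key)))

# Filter on the new column
group2 <- subset(df, Group == "Group2")
group2
```

Here the groups are numbered by first appearance of each pair, so the labels depend on the row order of the input.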