Handling Missing Combinations of Factors in R

R aggregate and handle missing combinations

You can use xtabs from base:

as.data.frame(xtabs(~ subj + task + correct, data = newdf))

   subj task correct Freq
1     1    A       0    1
2     2    A       0    1
3     3    A       0    1
4     1    B       0    1
5     2    B       0    1
6     3    B       0    1
7     1    A       1    1
8     2    A       1    0
9     3    A       1    1
10    1    B       1    1
11    2    B       1    1
12    3    B       1    1

Even simpler, again in base from @Frank:

as.data.frame(table(newdf[1:3]))
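Since newdf itself is not shown here, the following is a minimal sketch on made-up data (the values in newdf below are hypothetical). It illustrates that xtabs counts every combination of factor levels, so a combination that never occurs shows up with Freq 0 rather than being dropped:

```r
# Hypothetical data: subject 2 never answered task A correctly
newdf <- data.frame(
  subj    = factor(c(1, 1, 2, 2)),
  task    = factor(c("A", "B", "A", "B")),
  correct = factor(c(1, 1, 0, 1), levels = c(0, 1))
)

# Cross-tabulate all three factors, then flatten to one row per combination
counts <- as.data.frame(xtabs(~ subj + task + correct, data = newdf))

nrow(counts)  # 2 subjects * 2 tasks * 2 outcomes = 8 rows

# The unobserved combination is present with a zero count
counts[counts$subj == "2" & counts$task == "A" & counts$correct == "1", ]
```

Note that this relies on the columns being factors with all levels declared; a level that never appears in a plain character column cannot be counted.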

Add missing rows within combinations of factors

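To get only the unique combinations that already exist in df, it might be better to use by to create a new reference data.table and then merge that back with the original one. Here transtype is a character vector of the transport types to fill in; its definition is not shown, but judging from the output below it holds "bus" and "train".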

Using:

df2 <- df[, .(transp = transtype), by = .(var1,var2)]
merge(df, df2, by = c('var1','var2','transp'), all = TRUE)

gives:

   var1 var2 transp  z  y sample1 sample2
1:    a    y    bus  z st       4       3
2:    a    y  plane  z st      10       7
3:    a    y  train NA NA      NA      NA
4:    b    y    bus NA NA      NA      NA
5:    b    y  train  z co       8       9
6:    b    z    bus  z co       1       5
7:    b    z  train NA NA      NA      NA
8:    c    x    bus  z fu       6       4
9:    c    x  train NA NA      NA      NA

If you don't want the z and y columns to have NA-values, you could do:

df2 <- df[, .(transp = transtype), by = .(var1,var2,z,y)]
merge(df, df2, by = c('var1','var2','transp','z','y'), all = TRUE)

which gives:

   var1 var2 transp z  y sample1 sample2
1:    a    y    bus z st       4       3
2:    a    y  plane z st      10       7
3:    a    y  train z st      NA      NA
4:    b    y    bus z co      NA      NA
5:    b    y  train z co       8       9
6:    b    z    bus z co       1       5
7:    b    z  train z co      NA      NA
8:    c    x    bus z fu       6       4
9:    c    x  train z fu      NA      NA

NOTE: If the z and y columns have more than one unique value for each var1/var2 combo, it is better to use na.locf from the zoo package to fill the NA-values in the z and y columns.


Used data:

df <- fread("z  y var1 var2 transp sample1 sample2
z st a y bus 4 3
z st a y plane 10 7
z co b y train 8 9
z co b z bus 1 5
z fu c x bus 6 4")
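Putting the pieces together, here is a self-contained sketch of the first variant. The value of transtype is an assumption (it is inferred from the output above, not given in the answer), and the data is the same as in the fread call above:

```r
library(data.table)

df <- fread("z y var1 var2 transp sample1 sample2
z st a y bus 4 3
z st a y plane 10 7
z co b y train 8 9
z co b z bus 1 5
z fu c x bus 6 4")

# Assumption: the transport types that every var1/var2 pair should have
transtype <- c("bus", "train")

# Reference table: one row per existing var1/var2 pair and transport type
df2 <- df[, .(transp = transtype), by = .(var1, var2)]

# Full outer join: keeps the original rows and adds the missing ones with NA
res <- merge(df, df2, by = c("var1", "var2", "transp"), all = TRUE)
res
```

The row for plane survives because all = TRUE keeps rows from both tables, even though plane is not in transtype.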

List all combinations of factors (interactions) with no observations in a dataframe, up to a given dimension, removing redundancies

Here's how you can continue your algorithm to pick out those sequences. First let's convert your list to a matrix, with NA's filled in (rbind.fill comes from the plyr package). I find this easier to deal with, but I'm sure with some effort you can make it work with a list as well:

m = as.matrix(rbind.fill(lapply(zz, as.data.frame)))
# y z w x
#[1,] 1 1 NA NA
#[2,] NA 1 1 1
#[3,] 1 1 1 NA
#[4,] 1 1 2 NA
#[5,] 1 1 NA 1
#[6,] 1 1 NA 2

Now let's introduce a function which will tell us if each row of a matrix given by subseq is a "subsequence" of seq, meaning it is already covered by seq as per OP's definitions:

is.subsequence = function(seq, subseq) {
  comp = seq == t(subseq)

  rowSums(t(is.na(comp) == is.na(seq) &
            matrix(!(comp %in% FALSE), nrow = length(seq)))) == length(seq)
}
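To see what the function returns, here is a small check on the first three rows of m. The matrix is typed in by hand for this sketch (so it runs without zz or plyr), and the function body is copied from above:

```r
is.subsequence <- function(seq, subseq) {
  comp <- seq == t(subseq)

  rowSums(t(is.na(comp) == is.na(seq) &
            matrix(!(comp %in% FALSE), nrow = length(seq)))) == length(seq)
}

# First three rows of m from above, entered manually for illustration
m3 <- matrix(c( 1, 1, NA, NA,
               NA, 1,  1,  1,
                1, 1,  1, NA),
             ncol = 4, byrow = TRUE,
             dimnames = list(NULL, c("y", "z", "w", "x")))

# Which of rows 2 and 3 are already covered by row 1 (y = 1, z = 1)?
covered <- is.subsequence(m3[1, ], m3[-1, ])
covered
# FALSE TRUE: row 2 has no y value, row 3 matches on both y and z
```

A row counts as covered when it agrees with seq in every position where seq is not NA; positions where seq is NA are ignored.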

All that's left is to iterate over the matrix and throw out the covered sequences. We can do this going from top to bottom because of the automatic arrangement of zz from OP.

i = 1
while(i < nrow(m)) {
  m = rbind(m[1:i,], tail(m, -i)[!is.subsequence(m[i,], tail(m, -i)),])

  i = i+1
}

m
# y z w x
#[1,] 1 1 NA NA
#[2,] NA 1 1 1

And you can go back to a list if you like:

apply(m, 1, na.omit)

Complete with all combinations after counting on data.table

Here is one possible way to solve your problem. Note that the argument with=FALSE in the data.table context lets you select columns using the standard data.frame rules. In the example below, I assumed that the columns used to compute all combinations are passed to myfun as a character vector.
Keep in mind that no column in your dataset should be named gcases. Grouping by .EACHI performs the operation once for each row in i.

myfun = function(d, g) {
  # get levels (for factors) and unique values for other types.
  fn <- function(x) if(is.factor(x)) levels(x) else unique(x)
  gcases <- lapply(setDT(d, key=g)[, g, with=FALSE], fn)

  # count based on all combinations
  d[do.call(CJ, gcases), .N, keyby=.EACHI]
}
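The same idea can be sketched without the wrapper function. This is a simplified variant, not the function above: it uses an explicit on= join instead of a key, and the toy table dt is made up for illustration:

```r
library(data.table)

# Hypothetical toy data: the combination a = "y", b = "q" never occurs
dt <- data.table(a = c("x", "x", "y"), b = c("p", "q", "p"))

# All combinations of the observed values (CJ is data.table's cross join)
grid <- CJ(a = unique(dt$a), b = unique(dt$b))

# For each row of grid, count the matching rows of dt;
# combinations without a match get N = 0
counts <- dt[grid, on = .(a, b), .N, by = .EACHI]
counts
```

For factor columns you would pass levels() instead of unique() to CJ, as fn does in myfun, so that levels with no observations at all are also counted.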

How to get combinations of two factors and convert into a new factor in R

Just use paste0 to combine the vectors

factor(paste0(y, x))

Or

factor(paste(y, x, sep=""))
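A short illustration on made-up vectors (x and y below are hypothetical). It also shows base R's interaction(), which is not part of the answer above but is a closely related option: by default it keeps every level combination, including unobserved ones.

```r
# Hypothetical example vectors
y <- c("a", "b", "a")
x <- c("1", "2", "1")

# paste0 approach: only combinations that occur become levels
f1 <- factor(paste0(y, x))
levels(f1)   # "a1" "b2"

# interaction() keeps all level combinations unless drop = TRUE
f2 <- interaction(y, x, sep = "")
nlevels(f2)  # 4

f3 <- interaction(y, x, sep = "", drop = TRUE)
nlevels(f3)  # 2
```

Which one fits depends on whether the unobserved combinations should exist as (empty) levels.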

return ID's of unique combinations

df %>% group_by_at(.vars=-1) %>% summarize(IDs=list(ID))

Similar to Sotos' solution, but simplifies selection of the ID column assuming all other columns need to be unique, and IDs column will be a column of lists rather than a string.

# A tibble: 2 x 4
# Groups:   Var1, Var2 [2]
   Var1  Var2  Var3 IDs
  <int> <int> <int> <list>
1     0     0     1 <chr [2]>
2     1     1     0 <chr [1]>

Just for fun, you can further simplify it using tidyr's nest function:

require(tidyr)
nest(df,IDs=ID)
# A tibble: 2 x 4
   Var1  Var2  Var3 IDs
  <int> <int> <int> <S3: vctrs_list_of>
1     0     0     1 1_1, 1_3
2     1     1     0 1_2

This still leaves IDs as a list, which may or may not be useful for you, but displays it more clearly in the tibble. An extra benefit of keeping the column as a list rather than a string is that you can easily recreate the original table using unnest:

unnest(nest(df,IDs=ID),cols=IDs)
# A tibble: 3 x 4
   Var1  Var2  Var3 ID
  <int> <int> <int> <chr>
1     0     0     1 1_1
2     0     0     1 1_3
3     1     1     0 1_2

Data.table: Add rows for missing combinations of 2 factors without losing associated descriptive factors

Here's one way:

dt[, .SD[.(stage=c("A", "B")), on="stage"], by=.(station, station.type)]
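Since dt is not shown, here is a small sketch on made-up data (the column value and all cell contents are hypothetical). Within each station/station.type group, .SD is joined against the full list of stages, so a missing stage is added with NA in the remaining columns while the grouping columns are carried along:

```r
library(data.table)

# Hypothetical data: station "s2" has no row for stage "B"
dt <- data.table(
  station      = c("s1", "s1", "s2"),
  station.type = c("river", "river", "lake"),
  stage        = c("A", "B", "A"),
  value        = c(1.5, 2.5, 3.5)
)

# Per group, look up both stages in .SD; unmatched stages yield NA in value
res <- dt[, .SD[.(stage = c("A", "B")), on = "stage"],
          by = .(station, station.type)]
res
```

Because station.type is part of by, the added row for s2 keeps its descriptive value ("lake") instead of getting NA.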

Filtering according to combination of matching data across variables in R

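The code below builds a key from each sorted word pair, so that WordA/WordX and WordX/WordA fall into the same group. You can filter by the Group column after this: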

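df <- as.data.frame(df)

# Key for each row: the two words sorted and pasted together,
# so the order within a pair does not matter
df$v <- sapply(seq_len(nrow(df)), function(x)
  paste(sort(c(df[x, 1], df[x, 2])), collapse = ""))
# Lookup table mapping each distinct key to a group label
l <- data.frame(v = unique(df$v),
                Group = paste0("Group", seq_along(unique(df$v))))
df <- merge(df, l, by = "v")[, -1]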

df

  Word1 Word2 distance speaker session  Group
1 WordA WordX     1.40      JB       1 Group1
2 WordX WordA     0.23      JB       1 Group1
3 WordB WordY     2.10      JB       1 Group2
4 WordY WordB     2.30      JB       1 Group2
5 WordC WordZ     4.70      JB       1 Group3
6 WordZ WordC     0.51      JB       1 Group3
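A vectorized alternative, offered as a sketch rather than as the method used above: pmin and pmax work element-wise on character vectors, so the order-independent key can be built without looping over rows. The small data frame is a cut-down, hypothetical version of the one shown, and it assumes the word columns are character, not factor:

```r
# Cut-down example data (character columns assumed)
df <- data.frame(
  Word1 = c("WordA", "WordX", "WordB", "WordY"),
  Word2 = c("WordX", "WordA", "WordY", "WordB"),
  stringsAsFactors = FALSE
)

# Alphabetically smaller word first, larger second, for every row at once
key <- paste(pmin(df$Word1, df$Word2), pmax(df$Word1, df$Word2), sep = "|")

# Number the distinct keys in order of first appearance
df$Group <- paste0("Group", match(key, unique(key)))
df$Group
# "Group1" "Group1" "Group2" "Group2"
```

Unlike the merge-based version, this keeps the original row order, since no merge (which sorts by the key) is involved.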

