R - How to Add Row Index to a Data Frame, Based on Combination of Factors

This is probably going to look like cheating since I am passing a vector into a function which I then totally ignore except to get its length:

 df$Index <- ave( 1:nrow(df), df$Dim1, factor( df$Dim2), FUN=function(x) 1:length(x) )

The ave function returns a vector of the same length as its first argument, computed within the categories defined by all of the factors that appear between the first argument and the argument named FUN. (I often forget to put the "FUN=" in for my function and get a cryptic error message along the lines of "unique() applies only to vectors": ave treats the unnamed function as one more grouping factor, tries to determine how many unique values it has, and fails.)

There's actually another, even more compact way of expressing function(x) 1:length(x): the seq_along function. It is probably safer as well, since it behaves properly when passed a vector of length zero, whereas the anonymous function form would misbehave by returning 1:0 (that is, c(1, 0)) instead of integer(0):

ave( 1:nrow(df), df$Dim1, factor( df$Dim2), FUN=seq_along )
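As a small self-contained check of the behaviour described above, the following sketch builds a toy data frame (the values are invented for illustration) and numbers the rows within each Dim1/Dim2 combination. The last lines show the zero-length edge case that makes seq_along the safer choice.

```r
# Toy data; the values are invented for illustration only
df <- data.frame(Dim1 = c("A", "A", "A", "B", "B", "A"),
                 Dim2 = c(100, 100, 200, 100, 100, 100))

# Number the rows within each Dim1/Dim2 combination
df$Index <- ave(seq_len(nrow(df)), df$Dim1, df$Dim2, FUN = seq_along)
df$Index
# 1 2 1 1 2 3

# Edge case: an empty input
empty <- integer(0)
length(1:length(empty))   # 2, because 1:0 counts down to c(1, 0)
length(seq_along(empty))  # 0
```

Note that the index follows the existing row order within each group, so sort the data frame first if a particular ordering is wanted.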

r - Adding a row index based on a combination of multiple columns in a large dataframe

Use a data.table:

library(data.table)
DT <- as.data.table(dat)
DT[, index := seq_len(.N), by = user_id]
DT
#     timestamp      user_id index
# 1: 2013-11-07 ff268cef0c29     1
# 2: 2013-11-02 12bb7af7a842     1
# 3: 2013-11-30 e45abb10ae0b     1
# 4: 2013-11-06 e45abb10ae0b     2
# 5: 2013-11-25 f266f8c9580e     1

R add index column to data frame based on row values

If you use data.table, there is a special symbol, .GRP, that records this information (a simple group counter):

library(data.table)
DT <- data.table(temp)
DT[, index := .GRP, by = list(Dim1, Dim2)]
DT
#    Dim1 Dim2 Value index
# 1:    A  100    10     1
# 2:    A  100     2     1
# 3:    A  100     9     1
# 4:    A  100     4     1
# 5:    A  200     6     2
# 6:    A  200     1     2
# 7:    B  100     8     3
# 8:    B  200     7     4

R - Add row index to a data frame but handle ties with minimum rank

You would want to use the rank function with ties.method="min" within your ave call:

df$Index <- ave(-df$fant.pts.passing, df$season, df$week,
                FUN=function(x) rank(x, ties.method="min"))
df
#   season week      player.name fant.pts.passing Index
# 3   2014    1       Cam Newton               29     1
# 1   2014    1        Matt Ryan               28     2
# 4   2014    1 Matthew Stafford               28     2
# 2   2014    1   Peyton Manning               19     4
# 7   2014    2    Aaron Rodgers               29     1
# 6   2014    2      Andrew Luck               22     2
# 8   2014    2       Chad Henne               22     2
# 5   2014    2    Carson Palmer               18     4
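To see the tie handling in isolation, here is a minimal sketch on made-up scores (the data frame and its column names are hypothetical). Negating the points turns rank's ascending order into "highest score gets rank 1", and ties.method = "min" gives tied rows the lowest rank they share, leaving a gap afterwards.

```r
# Hypothetical scores: two weeks, with ties inside each week
scores <- data.frame(week = c(1, 1, 1, 1, 2, 2, 2),
                     pts  = c(29, 28, 28, 19, 22, 22, 18))

# Rank within each week; the highest score gets rank 1, ties share the minimum
scores$rank <- ave(-scores$pts, scores$week,
                   FUN = function(x) rank(x, ties.method = "min"))
scores$rank
# 1 2 2 4 1 1 3
```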

Create subsets from a dataframe by a combination of factors

combn accepts a function so you can perform t.test for every combination in the function itself. With sapply you can do this on every column in ls2.

sapply(ls2, function(y) combn(c("a", "b", "c"), 2, function(x) {
    data.x <- subset(df, T %in% x)
    t.test(reformulate('T', y), data = data.x, var.equal = TRUE)[["p.value"]]
}))

#          G      H      I
#[1,] 0.0155 0.1599 0.0434
#[2,] 0.0086 0.0383 0.0282
#[3,] 0.6681 0.0804 0.5531
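The rows of that matrix are unlabeled, so it can help to know the order in which combn visits the pairs. The sketch below reuses the same three group labels and only pastes each pair together; the " vs " separator and the variable names are arbitrary choices for illustration.

```r
groups <- c("a", "b", "c")

# combn applies FUN to each pair, in the same order as the rows above
pair_labels <- combn(groups, 2, FUN = function(x) paste(x, collapse = " vs "))
pair_labels
# "a vs b" "a vs c" "b vs c"
```

Under that assumption, the labels could be attached to a result matrix with rownames(result) <- pair_labels, where result is whatever name the sapply output was stored under.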

Inserting data into a data frame based on the unique combination of two factors

Let's suppose you have the file names in a vector datafiles such that files 1-4 are the data for all assays for samples 1-384, 5-8 for all assays for samples 385-768, and so on, and that you want to end up with a data frame that is 1536 rows by 162 columns.

library(reshape)
## read all files into a list of data frames:
alldata <- lapply(datafiles,read.table)

Split into four chunks:

splitdata <- split(alldata,rep(1:4,each=4))

A function to take a list of n data sets, each containing m assays from k individuals (i.e. each one is k*m rows by 4 columns: SampleID, Well, Assay, Value) and combine them into a single data set that is k rows by n*m+2 columns:

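mergefun <- function(X) {
    ## reshape each data set from long to wide: one column per assay
    cdata <- lapply(X,
                    cast,
                    formula=SampleID+Well~Assay,
                    value="Value")
    ## produces data sets of the form
    ##   SampleID Well V3 V4
    ## 1     SID1  A01  0  0
    ## 2     SID2  A02  1  2
    ## ...
    ## Reduce takes the function first, then the list
    Reduce(merge, cdata)
}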

Now apply this to each of the chunks:

merged_data <- lapply(splitdata,mergefun)

Now combine the chunks:

final <- do.call(rbind,merged_data)

I'm not sure this will work, but it might. You should take the pieces apart and examine what they do separately if it doesn't work on the first try -- I may have screwed up somewhere.
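In the spirit of taking the pieces apart, the merging step can be checked on its own with base R. This sketch uses three tiny, made-up wide data sets (the sample IDs, wells, and V1 to V3 values are invented) and folds merge over them with Reduce, which expects the function first and the list second.

```r
# Three hypothetical wide data sets that share the SampleID and Well columns
parts <- list(
  data.frame(SampleID = c("SID1", "SID2"), Well = c("A01", "A02"), V1 = c(0, 1)),
  data.frame(SampleID = c("SID1", "SID2"), Well = c("A01", "A02"), V2 = c(2, 3)),
  data.frame(SampleID = c("SID1", "SID2"), Well = c("A01", "A02"), V3 = c(4, 5))
)

# merge joins on the columns the data frames have in common (SampleID, Well)
wide <- Reduce(merge, parts)
dim(wide)
# 2 5
names(wide)
# "SampleID" "Well" "V1" "V2" "V3"
```

Because merge matches on every shared column name, assay columns need distinct names across the data sets for this to behave as intended.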

R - find row indices where each combination of factors occurs

We can try data.table. Convert the data.frame to a data.table with setDT(df1), group by 'Dim1' and 'Dim2', and collect the row indices (.I) of each group in a list column, which we can then extract.

library(data.table)
res <- setDT(df1)[, list(Rows = list(.I)), by = .(Dim1, Dim2)]
res
#   Dim1 Dim2    Rows
#1:    A  100 1, 3, 4
#2:    A  200    2, 5
#3:    B  200    6, 7
#4:    B  100       8
res$Rows
#[[1]]
#[1] 1 3 4

#[[2]]
#[1] 2 5

#[[3]]
#[1] 6 7

#[[4]]
#[1] 8
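For comparison, a base R sketch of the same idea uses split on the row numbers. The data frame below is a reconstruction chosen to be consistent with the output shown above, not the original data, and the result is a named list rather than a data.table.

```r
# Hypothetical data, chosen to match the row groupings shown above
df1 <- data.frame(Dim1 = c("A", "A", "A", "A", "A", "B", "B", "B"),
                  Dim2 = c(100, 200, 100, 100, 200, 200, 200, 100))

# Split the row numbers by every observed Dim1/Dim2 combination;
# drop = TRUE omits combinations that never occur
rows <- split(seq_len(nrow(df1)), list(df1$Dim1, df1$Dim2), drop = TRUE)
rows[["A.100"]]
# 1 3 4
rows[["B.100"]]
# 8
```

The list names are the factor levels joined with ".", so a combination is looked up by name rather than by position.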

Create an Index of a combination of data.frame columns in R

The interaction function will work well:

foo = structure(list(avg = c(0.246985988921473, 0.481522354272779, 0.575400762275067, 0.14651009243539, 0.489308880181752, 0.523678968337178), i_ID = c("H", "H", "C", "C", "H", "S"), j_ID = c("P", "P", "P", "P", "P", "P")), .Names = c("avg", "i_ID", "j_ID"), row.names = 7:12, class = "data.frame")

foo$idx <- as.integer(interaction(foo$i_ID, foo$j_ID))

> foo
         avg i_ID j_ID idx
7  0.2469860    H    P   2
8  0.4815224    H    P   2
9  0.5754008    C    P   1
10 0.1465101    C    P   1
11 0.4893089    H    P   2
12 0.5236790    S    P   3
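Note that interaction numbers the combinations by sorted factor level, which is why the first row gets 2 rather than 1. If the index should instead follow the order of first appearance, one possible sketch is to build a key and match it against its unique values. The variable names here are illustrative, and the "." separator assumes that character never occurs inside the IDs.

```r
i_ID <- c("H", "H", "C", "C", "H", "S")
j_ID <- c("P", "P", "P", "P", "P", "P")

# Numbering by sorted factor level, as in the answer above
by_level <- as.integer(interaction(i_ID, j_ID))
by_level
# 2 2 1 1 2 3

# Numbering by order of first appearance
key <- paste(i_ID, j_ID, sep = ".")  # assumes "." does not occur in the IDs
by_appearance <- match(key, unique(key))
by_appearance
# 1 1 2 2 1 3
```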

Ah, I didn't read carefully enough. There is probably a more elegant solution, but you can use the outer function together with the upper and lower triangles of the matrix it returns, so that both orderings of a pair map to the same label:

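# let's assign some test values
x <- c('a', 'b', 'c')
foo$idx <- c('a b', 'b a', 'b c', 'c b', 'a a', 'b a')

mat <- outer(x, x, FUN = 'paste') # gives all possible combinations
uppr_ok <- mat[upper.tri(mat, diag=TRUE)] # the labels that will be kept
mat_ok <- mat
# overwrite each lower-triangle entry with its mirror image from the upper
# triangle; indexing t(mat) keeps the pairs matched for any length of x
mat_ok[lower.tri(mat_ok)] <- t(mat)[lower.tri(mat)]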

Then you can match indexes found in mat with those found in mat_ok:

foo$idx <- mat_ok[match(foo$idx, mat)]
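A possibly simpler way to get the same order-insensitive labels, offered as a sketch and not as the approach used above, is to split each label, sort its two parts, and paste them back together. It assumes the two parts are separated by a single space and contain no spaces themselves.

```r
pairs <- c("a b", "b a", "b c", "c b", "a a", "b a")

# Sort the two halves of every label so "b a" and "a b" become identical
canonical <- vapply(strsplit(pairs, " ", fixed = TRUE),
                    function(p) paste(sort(p), collapse = " "),
                    character(1))
canonical
# "a b" "a b" "b c" "b c" "a a" "a b"
```

This avoids building the full matrix of combinations, which grows quadratically with the number of distinct values.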

Add variable to group data by unique combinations of variables

We can use .GRP from data.table after grouping by 'Date', 'Location'

library(data.table)
setDT(df)[, Combo := .GRP, .(Date, Location)]
df
#   Date   Location Var1 Var2 Combo
#1: 2018       Ohio    A    1     1
#2: 2018       Ohio    B    2     1
#3: 2018    Arizona    C    3     2
#4: 2018    Arizona    D    4     2
#5: 2018   Nebraska    E    5     3
#6: 2017   Nebraska    F    6     4
#7: 2017 New Mexico    G    7     5
#8: 2016      Idaho    H    8     6

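Or use rleid, which gives the same result here because rows with the same combination sit next to each other (rleid starts a new id whenever the values change from one row to the next):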

setDT(df)[, Combo := rleid(Date, Location)]
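A short sketch of how .GRP and rleid can differ when identical combinations are not stored in adjacent rows. It assumes the data.table package is installed; the three-row table and the column names grp and run are made up for illustration.

```r
library(data.table)

# Hypothetical data in which the 2018/Ohio combination is interrupted
dt <- data.table(Date     = c(2018, 2017, 2018),
                 Location = c("Ohio", "Idaho", "Ohio"))

dt[, grp := .GRP, by = .(Date, Location)]  # one id per distinct combination
dt[, run := rleid(Date, Location)]         # new id at every change between rows

dt$grp
# 1 2 1
dt$run
# 1 2 3
```

If one id per combination is wanted regardless of row order, .GRP is the safer choice; alternatively, sort by the grouping columns before calling rleid.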

