Applying Function (Ks.Test) Between Two Data Frames Column-Wise in R

Applying function (ks.test) between two data frames column-wise in R

Created two data.frames D1 and D2 with some random numbers and same column names.

set.seed(12)
D1 = data.frame(A=rnorm(n = 30,mean = 5,sd = 2.5),B=rnorm(n = 30,mean = 4.5,sd = 2.2),C=rnorm(n = 30,mean = 2.5,sd = 12))
D2 = data.frame(A=rnorm(n = 30,mean = 5,sd = 2.49),B=rnorm(n = 30,mean = 4.4,sd = 2.2),C=rnorm(n = 30,mean = 2,sd = 12))

Now we can use the column names to loop through and pass it to D1 and D2 to perform the ks.test on the corresponding columns of the respective data.frames.

col.names = colnames(D1)
lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},D1,D2)

#[[1]]

#Two-sample Kolmogorov-Smirnov test

#data: d1[, t] and d2[, t]
#D = 0.167, p-value = 0.81
#alternative hypothesis: two-sided

#[[2]]

#Two-sample Kolmogorov-Smirnov test

#data: d1[, t] and d2[, t]
#D = 0.233, p-value = 0.39
#alternative hypothesis: two-sided

#[[3]]

#Two-sample Kolmogorov-Smirnov test

#data: d1[, t] and d2[, t]
#D = 0.2, p-value = 0.59
#alternative hypothesis: two-sided

In the notation you have used in the question description, ideally the following code should work:

col.names =colnames(S)
lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},D,S)

R apply KS Test to 2 Matrices Row by Row

You can get what you want with sapply used on the row indices

sapply(1:nrow(A), function(i) ks.test(as.vector(A[i,]), as.vector(B[i,])))

actually, it looks like the only interesting part is the p values so this could be simplified with

sapply(1:nrow(A), function(i) ks.test(as.vector(A[i,]), as.vector(B[i,]))$p)
[1] 0.01587302 0.01587302 0.01587302 0.01587302

Applying a function to values from two data frames simultaneously, to generate a third

The mistake was attempting to use grep as grepl. This was corrected after Joran pointed it out.
The working function is as follows.

>df1

V1 V2 V3 V4 V5
1 R is fantastic language <NA>
2 Java is far from good
3 Data mining is fascinating <NA>

>df2

V1 V2 V3 V4 V5
1 NN VBZ JJ NN <NA>
2 NNP VBZ RB IN JJ
3 NNP NN VBZ JJ <NA>

#Function to match words and assign sentiment score
checkLexicon<- function(word,tag){
if (grepl("JJ|JJR|JJS",tag)){
ifelse(word %in% posAdj, +1, ifelse(word %in% negAdj, -1, 0))
}
else if(grepl("NN|NNP|NNPS|NNS",tag)){
ifelse(word %in% posNoun, +1, ifelse(word %in% negNoun, -1, 0))
}
else if(grepl("VBZ",tag)){
ifelse(word %in% "is","ok","none")
}
else if(grepl("RB",tag)){
ifelse(word %in% "not",-1,0)
}
else if(grepl("IN",tag)){
ifelse(word %in% "far",-1,0)
}
}

#Method to output a single value when used in conjuction with apply
justShow<- function(x){
x
}

#Main method that intends to extract word/POS tag pair, and determine sentiment score
myObject<- mapply(FUN=checkLexicon, word=apply(df1,2,justShow),tag=apply(df2,2,justShow))

#Shaping the final dataframe
scoredDF<- as.data.frame(matrix(myObject,nrow=3))

V1 V2 V3 V4 V5
1 1 ok 1 0 NULL
2 -1 ok 0 0 1
3 0 0 ok 1 NULL

How to return the range of values shared between two data frames in R?

We can play with the idea that the starting and ending points are in different columns and the ranges for the same group (a and b) do not overlap. This is my solution. I have called 'point_1' and 'point_2' your mutated 'from' and 'to' for clarity.

You can bind the two dataframes and compare the from col with the previous value lag(from) to see if the actual value is smaller. Also you compare the previous lag(to) to the actual to col to see if the max value of the range overlap the previous range or not.

Important, these operations do not distinguish if the two rows they are comparing are from the same group (a or b). Therefore, filtering the NAs in point_1 (the new mutated 'from' column) you will remove wrong mutated values.
Also, note that I assume that, for example, a range in 'a' cannot overlap two rows in 'b'. In your 'results' table that doesn't happen but you should check that in your dataframes.

res = rbind(a,b) %>%  # Bind by rows
arrange(ID,from) %>% # arrange by ID and starting point (from)
group_by(ID) %>% # perform the following operations grouped by IDs
# Here is the trick. If the ranges for the same ID and group (i.e. 1,a) do
# not overlap, when you mutate the following cols the result will be NA for
# point_1.
mutate(point_1 = ifelse(from <= lag(to), from, NA),
point_2 = ifelse(lag(to)>=to, to, lag(to)),
groups = paste(lag(group), group, sep = ',')) %>%
filter(! is.na(point_1)) %>% # remove NAs in from
select(ID,point_1, point_2, groups) # get the result dataframe

If you play a bit with the code, not using the filter() and select() you will see how that's work.

> res
# A tibble: 5 x 4
# Groups: ID [3]
ID point_1 point_2 groups
<dbl> <dbl> <dbl> <chr>
1 1 300 400 a,b
2 1 500 500 b,a
3 1 1200 1400 a,b
4 2 1400 1400 a,b
5 3 1300 1400 a,b

Apply over two data frames

Since you are using data frames, it might be faster to use lapply or sapply to do this (specially given the scope of your data frames). For example,

x <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8), col3=c(9,10,11,12))
y <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8))
bl <- lapply(x, function(u){
lapply(y, function(v){
f(u,v) # Function with column from x and column from y as inputs
})
})
out = matrix(unlist(bl), ncol=ncol(y), byrow=T)

apply a function to each element of a column of a dataframe

The issue is with if/else which is not vectorized. If we change the function to ifelse, it would work. Another issue is that apply with MARGIN it expects a data.frame/matrix. Here, it is extracting a vector 'G3'

fun_pass <- function(calif) ifelse(calif >= 10, 1, 0)

Here we don't need ifelse also

fun_pass <- function(calif) as.integer(calif >= 10)

If it is a single column, use

mat_data$pass <- fun_pass(mat_data$G3)

Using apply using multiple sources of data?

Not quite sure of your needs but consider transposing, t() to run plots column-wise for row data. And consider using mapply(), the multivariate type of the apply family which runs an operation element-wise at the same time for equal-length objects. Even break apart the operations as running them together may only print/plot the last iteration to screen.

Transpose (data used were slight variations of posted dput matrix)

pop1 <- data.frame(t(data))
pop2 <- data.frame(t(data))
pop3 <- data.frame(t(data))
pop4 <- data.frame(t(data))

Histograms

hist_fx <- function(w,x,y,z) {

whist <- hist(w,prob=TRUE,col="green",xlim=c(-1,1),ylim=c(0,3))
lines(density(w),col="red")
abline(v=c(mean(w)),col="red")

xhist <- hist(x,prob=TRUE,col="blue",xlim=c(-1,1),ylim=c(0,3))
lines(density(x),col="red")
abline(v=c(mean(x)),col="red")

yhist <- hist(y,prob=TRUE,col="yellow",xlim=c(-1,1),ylim=c(0,3))
lines(density(y),col="red")
abline(v=c(mean(y)),col="red")

zhist <- hist(z,prob=TRUE,col="purple",xlim=c(-1,1),ylim=c(0,3))
lines(density(z),col="red")
abline(v=c(mean(z)),col="red")

}

# HISTOGRAM PLOTS FOR EACH DF COLUMN
output <- mapply(hist_fx, w=pop1, x=pop2, y=pop3, z=pop4)

Kolmogorov-Smirnov tests (using slight variations of dput data)

hist_fx <- function(w,x,y,z) {
t1 <- ks.test(w,x)
t2 <- ks.test(w,y)
t3 <- ks.test(w,z)

if(t1$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(x),col="blue",add=TRUE)
}
if(t2$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(y),col="red",add=TRUE)
}
if(t3$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(z),col="purple",add=TRUE)
}

return(c(t1, t2, t3))
}

output <- mapply(hist_fx, w=pop1, x=pop2, y=pop3, z=pop4)

output
# X1
# statistic 0.1666667
# p.value 0.9962552
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and x"
# statistic 0.25
# p.value 0.8474885
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and y"
# statistic 0.08333333
# p.value 1
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and z"
# X2
# statistic 0.25
# p.value 0.8474885
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and x"
# statistic 0.08333333
# p.value 1
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and y"
# statistic 0.1666667
# p.value 0.9962552
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and z"
# ...

A good way for pairwise t-test for many rows? (R)

You can try this (Archaea_Other is zero, so no output is produced). I hope this helped.

library(reshape2)
library(rstatix)
#Melt
Melted <- reshape2::melt(data,id.vars = 'label_Group')
#Stat test
pwc1 <- Melted %>% group_by(variable) %>%
pairwise_t_test(value ~ label_Group, p.adjust.method = "bonferroni")

# A tibble: 3 x 10
variable .y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
* <fct> <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> <chr>
1 Bacteria_Firmicutes value EPI HC 2 3 0.273 ns 0.273 ns
2 Archaea_Euryarchaeota value EPI HC 2 3 0.536 ns 0.536 ns
3 Bacteria_Other value EPI HC 2 3 0.761 ns 0.761 ns

Data

data <- structure(list(label_Group = c("HC", "HC", "HC", "EPI", "EPI"
), Bacteria_Firmicutes = c(6.771703e-05, 3.362588e-05, 0, 0,
0), Archaea_Other = c(0L, 0L, 0L, 0L, 0L), Archaea_Euryarchaeota = c(0,
0.016835356, 0, 0.001121252, 0), Bacteria_Other = c(0.0009480385,
5.604313e-05, 0.0002209945, 0.0002466755, 0.0003335038)), class = "data.frame", row.names = c(NA,
-5L))

Compare character rows of a df matching NA to everything and create new column or df based on comparison

A very ugly while loop solution but I think it works.

#Change sets to dataframe
sets <- data.frame(sets)
result <- integer(nrow(sets))
group_count <- 1
x <- 1

while(any(result == 0)) {
a <- sets[-x, !is.na(sets[x, ])]
b <- na.omit(unlist(sets[x, ]))
inds <- which(rowSums(sweep(a, 2, as.matrix(b), `==`), na.rm = TRUE) == length(b))
#If a complete match is found
if(length(inds)) {
#Need to adjust the match since we are dropping 1 row from original df
if(all(inds > x)) {
result[c(x, inds + 1)] <- group_count
} else {
result[c(x, inds)] <- group_count
}
} else {
result[x] <- group_count
}
group_count <- group_count + 1
#Get the next row number to check.
x <- which(result == 0)[1]
}

#Reset so you get counts in order 1, 2, 3...
result <- match(result, unique(result))
result
[1] 1 2 1 2 3 4

The logic here is to compare every row value with every other row in the dataframe dropping their NA values and if there is a match we update the rows with group_count value.

Applying a Chi-square test to a list of data frames in R

You can try this:

#Data
codons_list <- list(gct = structure(list(homo = c(1546L, 2991L), all = c(0.2857718,
0.7142282)), class = "data.frame", row.names = c("1", "2")),
gcc = structure(list(homo = c(2181L, 2356L), all = c(0.2468482,
0.7531518)), class = "data.frame", row.names = c("1", "2"
)))

#Just test
lapply(codons_list,function(x) chisq.test(x=x$homo, p=x$all))

$gct

Chi-squared test for given probabilities

data: x$homo
X-squared = 67.198, df = 1, p-value = 2.456e-16

$gcc

Chi-squared test for given probabilities

data: x$homo
X-squared = 1334.7, df = 1, p-value < 2.2e-16

Or if you want other element like p-vals:

#Just p vals
lapply(codons_list,function(x) chisq.test(x=x$homo, p=x$all)$p.value)

$gct
[1] 2.456177e-16

$gcc
[1] 3.222034e-292


Related Topics



Leave a reply



Submit