Applying function (ks.test) between two data frames column-wise in R
Created two data.frames D1
and D2
with some random numbers and same column names.
set.seed(12)
D1 = data.frame(A=rnorm(n = 30,mean = 5,sd = 2.5),B=rnorm(n = 30,mean = 4.5,sd = 2.2),C=rnorm(n = 30,mean = 2.5,sd = 12))
D2 = data.frame(A=rnorm(n = 30,mean = 5,sd = 2.49),B=rnorm(n = 30,mean = 4.4,sd = 2.2),C=rnorm(n = 30,mean = 2,sd = 12))
Now we can use the column names to loop through and pass it to D1
and D2
to perform the ks.test
on the corresponding columns of the respective data.frames.
col.names = colnames(D1)
lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},D1,D2)
#[[1]]
#Two-sample Kolmogorov-Smirnov test
#data: d1[, t] and d2[, t]
#D = 0.167, p-value = 0.81
#alternative hypothesis: two-sided
#[[2]]
#Two-sample Kolmogorov-Smirnov test
#data: d1[, t] and d2[, t]
#D = 0.233, p-value = 0.39
#alternative hypothesis: two-sided
#[[3]]
#Two-sample Kolmogorov-Smirnov test
#data: d1[, t] and d2[, t]
#D = 0.2, p-value = 0.59
#alternative hypothesis: two-sided
In the notation you have used in the question description, ideally the following code should work:
col.names =colnames(S)
lapply(col.names,function(t,d1,d2){ks.test(d1[,t],d2[,t])},D,S)
R apply KS Test to 2 Matrices Row by Row
You can get what you want with sapply
used on the row indices
sapply(1:nrow(A), function(i) ks.test(as.vector(A[i,]), as.vector(B[i,])))
actually, it looks like the only interesting part is the p values so this could be simplified with
sapply(1:nrow(A), function(i) ks.test(as.vector(A[i,]), as.vector(B[i,]))$p)
[1] 0.01587302 0.01587302 0.01587302 0.01587302
Applying a function to values from two data frames simultaneously, to generate a third
The mistake was attempting to use grep
as grepl
. This was corrected after Joran pointed it out.
The working function is as follows.
>df1
V1 V2 V3 V4 V5
1 R is fantastic language <NA>
2 Java is far from good
3 Data mining is fascinating <NA>
>df2
V1 V2 V3 V4 V5
1 NN VBZ JJ NN <NA>
2 NNP VBZ RB IN JJ
3 NNP NN VBZ JJ <NA>
#Function to match words and assign sentiment score
checkLexicon<- function(word,tag){
if (grepl("JJ|JJR|JJS",tag)){
ifelse(word %in% posAdj, +1, ifelse(word %in% negAdj, -1, 0))
}
else if(grepl("NN|NNP|NNPS|NNS",tag)){
ifelse(word %in% posNoun, +1, ifelse(word %in% negNoun, -1, 0))
}
else if(grepl("VBZ",tag)){
ifelse(word %in% "is","ok","none")
}
else if(grepl("RB",tag)){
ifelse(word %in% "not",-1,0)
}
else if(grepl("IN",tag)){
ifelse(word %in% "far",-1,0)
}
}
#Method to output a single value when used in conjuction with apply
justShow<- function(x){
x
}
#Main method that intends to extract word/POS tag pair, and determine sentiment score
myObject<- mapply(FUN=checkLexicon, word=apply(df1,2,justShow),tag=apply(df2,2,justShow))
#Shaping the final dataframe
scoredDF<- as.data.frame(matrix(myObject,nrow=3))
V1 V2 V3 V4 V5
1 1 ok 1 0 NULL
2 -1 ok 0 0 1
3 0 0 ok 1 NULL
How to return the range of values shared between two data frames in R?
We can play with the idea that the starting and ending points are in different columns and the ranges for the same group (a and b) do not overlap. This is my solution. I have called 'point_1' and 'point_2' your mutated 'from' and 'to' for clarity.
You can bind the two dataframes and compare the from
col with the previous value lag(from)
to see if the actual value is smaller. Also you compare the previous lag(to)
to the actual to
col to see if the max value of the range overlap the previous range or not.
Important, these operations do not distinguish if the two rows they are comparing are from the same group (a or b). Therefore, filtering the NAs in point_1
(the new mutated 'from' column) you will remove wrong mutated values.
Also, note that I assume that, for example, a range in 'a' cannot overlap two rows in 'b'. In your 'results' table that doesn't happen but you should check that in your dataframes.
res = rbind(a,b) %>% # Bind by rows
arrange(ID,from) %>% # arrange by ID and starting point (from)
group_by(ID) %>% # perform the following operations grouped by IDs
# Here is the trick. If the ranges for the same ID and group (i.e. 1,a) do
# not overlap, when you mutate the following cols the result will be NA for
# point_1.
mutate(point_1 = ifelse(from <= lag(to), from, NA),
point_2 = ifelse(lag(to)>=to, to, lag(to)),
groups = paste(lag(group), group, sep = ',')) %>%
filter(! is.na(point_1)) %>% # remove NAs in from
select(ID,point_1, point_2, groups) # get the result dataframe
If you play a bit with the code, not using the filter()
and select()
you will see how that's work.
> res
# A tibble: 5 x 4
# Groups: ID [3]
ID point_1 point_2 groups
<dbl> <dbl> <dbl> <chr>
1 1 300 400 a,b
2 1 500 500 b,a
3 1 1200 1400 a,b
4 2 1400 1400 a,b
5 3 1300 1400 a,b
Apply over two data frames
Since you are using data frames, it might be faster to use lapply or sapply to do this (specially given the scope of your data frames). For example,
x <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8), col3=c(9,10,11,12))
y <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8))
bl <- lapply(x, function(u){
lapply(y, function(v){
f(u,v) # Function with column from x and column from y as inputs
})
})
out = matrix(unlist(bl), ncol=ncol(y), byrow=T)
apply a function to each element of a column of a dataframe
The issue is with if/else
which is not vectorized. If we change the function to ifelse
, it would work. Another issue is that apply
with MARGIN
it expects a data.frame/matrix. Here, it is extracting a vector
'G3'
fun_pass <- function(calif) ifelse(calif >= 10, 1, 0)
Here we don't need ifelse
also
fun_pass <- function(calif) as.integer(calif >= 10)
If it is a single column, use
mat_data$pass <- fun_pass(mat_data$G3)
Using apply using multiple sources of data?
Not quite sure of your needs but consider transposing, t()
to run plots column-wise for row data. And consider using mapply()
, the multivariate type of the apply family which runs an operation element-wise at the same time for equal-length objects. Even break apart the operations as running them together may only print/plot the last iteration to screen.
Transpose (data used were slight variations of posted dput matrix)
pop1 <- data.frame(t(data))
pop2 <- data.frame(t(data))
pop3 <- data.frame(t(data))
pop4 <- data.frame(t(data))
Histograms
hist_fx <- function(w,x,y,z) {
whist <- hist(w,prob=TRUE,col="green",xlim=c(-1,1),ylim=c(0,3))
lines(density(w),col="red")
abline(v=c(mean(w)),col="red")
xhist <- hist(x,prob=TRUE,col="blue",xlim=c(-1,1),ylim=c(0,3))
lines(density(x),col="red")
abline(v=c(mean(x)),col="red")
yhist <- hist(y,prob=TRUE,col="yellow",xlim=c(-1,1),ylim=c(0,3))
lines(density(y),col="red")
abline(v=c(mean(y)),col="red")
zhist <- hist(z,prob=TRUE,col="purple",xlim=c(-1,1),ylim=c(0,3))
lines(density(z),col="red")
abline(v=c(mean(z)),col="red")
}
# HISTOGRAM PLOTS FOR EACH DF COLUMN
output <- mapply(hist_fx, w=pop1, x=pop2, y=pop3, z=pop4)
Kolmogorov-Smirnov tests (using slight variations of dput data)
hist_fx <- function(w,x,y,z) {
t1 <- ks.test(w,x)
t2 <- ks.test(w,y)
t3 <- ks.test(w,z)
if(t1$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(x),col="blue",add=TRUE)
}
if(t2$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(y),col="red",add=TRUE)
}
if(t3$p.value < 0.05) {
plot(ecdf(w),col="green")
plot(ecdf(z),col="purple",add=TRUE)
}
return(c(t1, t2, t3))
}
output <- mapply(hist_fx, w=pop1, x=pop2, y=pop3, z=pop4)
output
# X1
# statistic 0.1666667
# p.value 0.9962552
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and x"
# statistic 0.25
# p.value 0.8474885
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and y"
# statistic 0.08333333
# p.value 1
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and z"
# X2
# statistic 0.25
# p.value 0.8474885
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and x"
# statistic 0.08333333
# p.value 1
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and y"
# statistic 0.1666667
# p.value 0.9962552
# alternative "two-sided"
# method "Two-sample Kolmogorov-Smirnov test"
# data.name "w and z"
# ...
A good way for pairwise t-test for many rows? (R)
You can try this (Archaea_Other
is zero, so no output is produced). I hope this helped.
library(reshape2)
library(rstatix)
#Melt
Melted <- reshape2::melt(data,id.vars = 'label_Group')
#Stat test
pwc1 <- Melted %>% group_by(variable) %>%
pairwise_t_test(value ~ label_Group, p.adjust.method = "bonferroni")
# A tibble: 3 x 10
variable .y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
* <fct> <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> <chr>
1 Bacteria_Firmicutes value EPI HC 2 3 0.273 ns 0.273 ns
2 Archaea_Euryarchaeota value EPI HC 2 3 0.536 ns 0.536 ns
3 Bacteria_Other value EPI HC 2 3 0.761 ns 0.761 ns
Data
data <- structure(list(label_Group = c("HC", "HC", "HC", "EPI", "EPI"
), Bacteria_Firmicutes = c(6.771703e-05, 3.362588e-05, 0, 0,
0), Archaea_Other = c(0L, 0L, 0L, 0L, 0L), Archaea_Euryarchaeota = c(0,
0.016835356, 0, 0.001121252, 0), Bacteria_Other = c(0.0009480385,
5.604313e-05, 0.0002209945, 0.0002466755, 0.0003335038)), class = "data.frame", row.names = c(NA,
-5L))
Compare character rows of a df matching NA to everything and create new column or df based on comparison
A very ugly while
loop solution but I think it works.
#Change sets to dataframe
sets <- data.frame(sets)
result <- integer(nrow(sets))
group_count <- 1
x <- 1
while(any(result == 0)) {
a <- sets[-x, !is.na(sets[x, ])]
b <- na.omit(unlist(sets[x, ]))
inds <- which(rowSums(sweep(a, 2, as.matrix(b), `==`), na.rm = TRUE) == length(b))
#If a complete match is found
if(length(inds)) {
#Need to adjust the match since we are dropping 1 row from original df
if(all(inds > x)) {
result[c(x, inds + 1)] <- group_count
} else {
result[c(x, inds)] <- group_count
}
} else {
result[x] <- group_count
}
group_count <- group_count + 1
#Get the next row number to check.
x <- which(result == 0)[1]
}
#Reset so you get counts in order 1, 2, 3...
result <- match(result, unique(result))
result
[1] 1 2 1 2 3 4
The logic here is to compare every row value with every other row in the dataframe dropping their NA
values and if there is a match we update the rows with group_count
value.
Applying a Chi-square test to a list of data frames in R
You can try this:
#Data
codons_list <- list(gct = structure(list(homo = c(1546L, 2991L), all = c(0.2857718,
0.7142282)), class = "data.frame", row.names = c("1", "2")),
gcc = structure(list(homo = c(2181L, 2356L), all = c(0.2468482,
0.7531518)), class = "data.frame", row.names = c("1", "2"
)))
#Just test
lapply(codons_list,function(x) chisq.test(x=x$homo, p=x$all))
$gct
Chi-squared test for given probabilities
data: x$homo
X-squared = 67.198, df = 1, p-value = 2.456e-16
$gcc
Chi-squared test for given probabilities
data: x$homo
X-squared = 1334.7, df = 1, p-value < 2.2e-16
Or if you want other element like p-vals:
#Just p vals
lapply(codons_list,function(x) chisq.test(x=x$homo, p=x$all)$p.value)
$gct
[1] 2.456177e-16
$gcc
[1] 3.222034e-292
Related Topics
How to Install Doredis Package Version 1.0.5 into R 3.0.1 on Windows
Customise The Infowindow/Tooltip in R -> Plotly
How to Specify Certificate, Key and Root Certificate with Httr for Certificate Based Authentication
Convert Unicode to Readable Characters in R
Removing "Nul" Characters (Within R)
R: Gradient Fill for Geom_Rect in Ggplot2
How to Plot Classification Borders on an Linear Discrimination Analysis Plot in R
R - Column Names in Read.Table and Write.Table Starting with Number and Containing Space
Combining Multiple Identically-Named Columns in R
Creating a Stacked Bar Chart Centered on Zero Using Ggplot
Na.Locf and Inverse.Rle in Rcpp
Ggplot and Axis Numbers and Labels
Split Line by Multiple Points Using Sf Package