Minus operation of data frames
I remember coming across this exact issue quite a few months back. Managed to sift through my Evernote one-liners.
Note: This is not my solution. Credit goes to whoever wrote it (whom I can't seem to find at the moment).
If you don't worry about rownames
then you can do:
df1[!duplicated(rbind(df2, df1))[-seq_len(nrow(df2))], ]
# c1 c2
# 1 a 1
# 2 b 2
Edit: A data.table
solution:
dt1 <- data.table(df1, key="c1")
dt2 <- data.table(df2)
dt1[!dt2]
or better one-liner (from v1.9.6+):
setDT(df1)[!df2, on="c1"]
This returns all rows in df1
where df2$c1
doesn't have a match with df1$c1
.
Subtracting columns of data frame by name
Try this base R
solution without loop. Just have in mind the position of columns:
#Data
df <- as.data.frame(matrix(seq(1,20,1),nrow=4), byrow=TRUE)
colnames(df) <- c("X1","X2","X3","X4","X5")
rownames(df) <- as.Date(c("2020-01-02","2020-01-03","2020-01-04","2020-01-05"))
#Set columns for difference
df[,2:5] <- df[,2:5]-df[,1]
Output:
X1 X2 X3 X4 X5
2020-01-02 1 4 8 12 16
2020-01-03 2 4 8 12 16
2020-01-04 3 4 8 12 16
2020-01-05 4 4 8 12 16
Or a more sophisticated way would be:
#Create index
#Var to substract
i1 <- which(names(df)=='X1')
#Vars to be substracted with X1
i2 <- which(names(df)!='X1')
#Compute
df[,i2]<-df[,i2]-df[,i1]
Output:
X1 X2 X3 X4 X5
2020-01-02 1 4 8 12 16
2020-01-03 2 4 8 12 16
2020-01-04 3 4 8 12 16
2020-01-05 4 4 8 12 16
Subtract rows of data frames column-wise preserving multiple factor column
IIUC, you have columns of the same name in df_1
and df_2
(eg int_samp_X
for some integer X
), and you want to get the difference of matching column names, grouped by gene_nm
(eg df_1[df_1$gene_nm == 'A', int_samp_1] - df_2[df_2$gene_nm == 'A', int_samp_1]
).
We can use the tidyverse
family of packages to solve this problem, notably, dplyr
and purrr
.
First, merge df_1
and df_2
with a left_join
, to ensure all the many entries in df_1
are preserved when matched against the gene-level entries in df_2
:
library(tidyverse)
df_3 <- df_1 %>% left_join(df_2, by = "gene_nm")
df_3
pep_seq int_samp_1.x int_samp_2.x int_samp_3.x gene_nm int_samp_1.y int_samp_2.y int_samp_3.y
1 Minus Operation of Data Framesa 2421432 NA 11351 A 2421432 NA 11351
2 ababababba 24242424 5342353 NA A 2421432 NA 11351
3 dfsfsfsfds NA 14532556 NA A 2421432 NA 11351
4 xbbcbcncncc 4684757849 43566 NA A 2421432 NA 11351
5 fbbdsgffhhh NA 46367367 1354151345 A 2421432 NA 11351
6 dggdgdgegeggerr 10485040 768769769 1351351354 A 2421432 NA 11351
7 dfgthrgfgf NA 797899 314534 B 24242424 5342353 NA
8 wegregegg 6849400 NA 1535 B 24242424 5342353 NA
9 egegegergewge 40300 NA 3145354 B 24242424 5342353 NA
10 sfngegebser NA NA 4353455 C NA 14532556 NA
11 qegqeefbew NA 686899 324535 C NA 14532556 NA
12 qegqetegqt NA 7898979 3543445 C NA 14532556 NA
13 qwtqtewr 556456466 678568 34535 C NA 14532556 NA
14 etghsfrgf 4646456466 NA 34535534 C NA 14532556 NA
15 sfsdfbdfbergeagaegr 246464266 68886 NA C NA 14532556 NA
16 wasfqertsdfaefwe 4564242646 488 NA C NA 14532556 NA
Then map
over the column names of interest, taking the difference from each column-pair. (Note that you'll need to convert the int_samp
columns from factor
to numeric
first.)
Update (per OP comments):
To convert NA
to 0
before computing differences, we can use mutate_if()
and replace()
, adding the following to the method chain:
mutate_if(is.numeric, funs(replace(., is.na(.), 0)))
Finally, join
back to df_1
:
var_names <- df_1 %>% select(starts_with("int_samp")) %>% names()
var_names # [1] "int_samp_1" "int_samp_2" "int_samp_3"
var_names %>%
map_dfc(~df_3 %>%
mutate_at(vars(matches(.x)), funs(as.numeric(as.character(.)))) %>%
mutate_if(is.numeric, funs(replace(., is.na(.), 0))) %>%
select(matches(.x)) %>%
reduce(`-`)) %>%
set_names(paste0(var_names, "_diff")) %>%
bind_cols(df_1)
Output:
int_samp_1_diff int_samp_2_diff int_samp_3_diff pep_seq int_samp_1 int_samp_2 int_samp_3 gene_nm
<dbl> <dbl> <dbl> <fct> <fct> <fct> <fct> <fct>
1 0. 0. 0. Minus Operation of Data Framesa 2421432 NA 11351 A
2 21820992. 5342353. -11351. ababababba 24242424 5342353 NA A
3 -2421432. 14532556. -11351. dfsfsfsfds NA 14532556 NA A
4 4682336417. 43566. -11351. xbbcbcncncc 4684757849 43566 NA A
5 -2421432. 46367367. 1354139994. fbbdsgffhhh NA 46367367 1354151345 A
6 8063608. 768769769. 1351340003. dggdgdgegeggerr 10485040 768769769 1351351354 A
7 -24242424. -4544454. 314534. dfgthrgfgf NA 797899 314534 B
8 -17393024. -5342353. 1535. wegregegg 6849400 NA 1535 B
9 -24202124. -5342353. 3145354. egegegergewge 40300 NA 3145354 B
10 0. -14532556. 4353455. sfngegebser NA NA 4353455 C
11 0. -13845657. 324535. qegqeefbew NA 686899 324535 C
12 0. -6633577. 3543445. qegqetegqt NA 7898979 3543445 C
13 556456466. -13853988. 34535. qwtqtewr 556456466 678568 34535 C
14 4646456466. -14532556. 34535534. etghsfrgf 4646456466 NA 34535534 C
15 246464266. -14463670. 0. sfsdfbdfbergeagaegr 246464266 68886 NA C
16 4564242646. -14532068. 0. wasfqertsdfaefwe 4564242646 488 NA C
Note: This answer is largely a derivation from akrun's answer here.
How to use SQL minus query equivalent between two dataframes properly
Yes, you can use the indicator like this:
df1.merge(df2, how='left', indicator='ind').query('ind=="left_only"')
Where df1 is:
col1 col2 col3
0 1.0 2.0 3.0
1 2.0 3.0 4.0
2 5.0 6.0 6.0
3 8.0 9.0 9.0
4 10.0 10.0 10.0
and df2 is:
col1 col2 col3
0 5 6 6
1 8 9 9
2 1 2 3
3 2 3 4
Output:
col1 col2 col3 ind
4 10.0 10.0 10.0 left_only
Pandas analogue to SQL MINUS / EXCEPT operator, using multiple columns
We can use pandas.concat
with drop_duplicates
here and pass it the argument to drop all duplicates with keep=False
:
pd.concat([d1, d2]).drop_duplicates(['a', 'b'], keep=False)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
Edit after comment by OP
If you want to make sure that unique rows in df2
arnt taken into account, we can duplicate that df
:
pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
subtracting two dataframes
Move the City column into the index. The DataFrames will align by both index and columns first and then do subtraction. Any combination not present will result in NaN
.
df2.set_index('City').subtract(df1.set_index('City'), fill_value=0)
how to subtract 2 data frame with the same size?
I guess this is what you want.
df1 = pd.DataFrame({"A": [4,0], "B": [5,6]})
df2 = pd.DataFrame({"A": [6,7], "B": [0,1]})
df = df1 - df2
df
Out[4]:
A B
0 -2 5
1 -7 5
Spark: subtract two DataFrames
According to the Scala API docs, doing:
dataFrame1.except(dataFrame2)
will return a new DataFrame containing rows in dataFrame1 but not in dataframe2.
Related Topics
Rcpp Warning: "Directory Not Found for Option '-L/Usr/Local/Cellar/Gfortran/4.8.2/Gfortran'"
Text-Mining with the Tm-Package - Word Stemming
How to Read Huge CSV File into R by Row Condition
Configuration Failed Because Libcurl Was Not Found
"Adding Missing Grouping Variables" Message in Dplyr in R
What Does the Error "Arguments Imply Differing Number of Rows: X, Y" Mean
Remove 'Search' Option But Leave 'Search Columns' Option
Stop Lapply from Printing to Console
What Are the Differences Between Concatenating Strings with Cat() and Paste()
Sum of Two Columns of Data Frame with Na Values
Get the Size of the Window in Shiny
How to Read the Source Code for an R Function
Configuration Failed Because Libcurl Was Not Found
Different Colour Palettes for Two Different Colour Aesthetic Mappings in Ggplot2