Minus Operation of Data Frames

Minus operation of data frames

I remember coming across this exact issue quite a few months back. Managed to sift through my Evernote one-liners.

Note: This is not my solution. Credit goes to whoever wrote it (whom I can't seem to find at the moment).

If you don't worry about rownames then you can do:

df1[!duplicated(rbind(df2, df1))[-seq_len(nrow(df2))], ]
#   c1 c2
# 1  a  1
# 2  b  2

Edit: A data.table solution:

dt1 <- data.table(df1, key="c1")
dt2 <- data.table(df2)
dt1[!dt2]

or better one-liner (from v1.9.6+):

setDT(df1)[!df2, on="c1"]

This returns all rows in df1 where df2$c1 doesn't have a match with df1$c1.

Subtracting columns of data frame by name

Try this base R solution without loop. Just have in mind the position of columns:

#Data
df <- as.data.frame(matrix(seq(1,20,1),nrow=4), byrow=TRUE)
colnames(df) <- c("X1","X2","X3","X4","X5")
rownames(df) <- as.Date(c("2020-01-02","2020-01-03","2020-01-04","2020-01-05"))
#Set columns for difference
df[,2:5] <- df[,2:5]-df[,1]

Output:

           X1 X2 X3 X4 X5
2020-01-02  1  4  8 12 16
2020-01-03  2  4  8 12 16
2020-01-04  3  4  8 12 16
2020-01-05  4  4  8 12 16

Or a more sophisticated way would be:

#Create index
#Var to substract
i1 <- which(names(df)=='X1')
#Vars to be substracted with X1
i2 <- which(names(df)!='X1')
#Compute
df[,i2]<-df[,i2]-df[,i1]

Output:

           X1 X2 X3 X4 X5
2020-01-02  1  4  8 12 16
2020-01-03  2  4  8 12 16
2020-01-04  3  4  8 12 16
2020-01-05  4  4  8 12 16

Subtract rows of data frames column-wise preserving multiple factor column

IIUC, you have columns of the same name in df_1 and df_2 (eg int_samp_X for some integer X), and you want to get the difference of matching column names, grouped by gene_nm (eg df_1[df_1$gene_nm == 'A', int_samp_1] - df_2[df_2$gene_nm == 'A', int_samp_1]).

We can use the tidyverse family of packages to solve this problem, notably, dplyr and purrr.

First, merge df_1 and df_2 with a left_join, to ensure all the many entries in df_1 are preserved when matched against the gene-level entries in df_2:

library(tidyverse)

df_3 <- df_1 %>% left_join(df_2, by = "gene_nm")

df_3
               pep_seq int_samp_1.x int_samp_2.x int_samp_3.x gene_nm int_samp_1.y int_samp_2.y int_samp_3.y
1            Minus Operation of Data Framesa      2421432           NA        11351       A      2421432           NA        11351
2           ababababba     24242424      5342353           NA       A      2421432           NA        11351
3           dfsfsfsfds           NA     14532556           NA       A      2421432           NA        11351
4          xbbcbcncncc   4684757849        43566           NA       A      2421432           NA        11351
5          fbbdsgffhhh           NA     46367367   1354151345       A      2421432           NA        11351
6      dggdgdgegeggerr     10485040    768769769   1351351354       A      2421432           NA        11351
7           dfgthrgfgf           NA       797899       314534       B     24242424      5342353           NA
8            wegregegg      6849400           NA         1535       B     24242424      5342353           NA
9        egegegergewge        40300           NA      3145354       B     24242424      5342353           NA
10         sfngegebser           NA           NA      4353455       C           NA     14532556           NA
11          qegqeefbew           NA       686899       324535       C           NA     14532556           NA
12          qegqetegqt           NA      7898979      3543445       C           NA     14532556           NA
13            qwtqtewr    556456466       678568        34535       C           NA     14532556           NA
14           etghsfrgf   4646456466           NA     34535534       C           NA     14532556           NA
15 sfsdfbdfbergeagaegr    246464266        68886           NA       C           NA     14532556           NA
16    wasfqertsdfaefwe   4564242646          488           NA       C           NA     14532556           NA

Then map over the column names of interest, taking the difference from each column-pair. (Note that you'll need to convert the int_samp columns from factor to numeric first.)

Update (per OP comments):
To convert NA to 0 before computing differences, we can use mutate_if() and replace(), adding the following to the method chain:

mutate_if(is.numeric,  funs(replace(., is.na(.), 0)))

Finally, join back to df_1:

var_names <- df_1 %>% select(starts_with("int_samp")) %>% names()

var_names # [1] "int_samp_1" "int_samp_2" "int_samp_3"

var_names %>%
  map_dfc(~df_3 %>%
            mutate_at(vars(matches(.x)), funs(as.numeric(as.character(.)))) %>%
            mutate_if(is.numeric,  funs(replace(., is.na(.), 0))) %>%
            select(matches(.x)) %>%
            reduce(`-`)) %>%
  set_names(paste0(var_names, "_diff")) %>%
  bind_cols(df_1)

Output:

   int_samp_1_diff int_samp_2_diff int_samp_3_diff pep_seq             int_samp_1 int_samp_2 int_samp_3 gene_nm
             <dbl>           <dbl>           <dbl> <fct>               <fct>      <fct>      <fct>      <fct>  
 1              0.              0.              0. Minus Operation of Data Framesa           2421432    NA         11351      A      
 2       21820992.        5342353.         -11351. ababababba          24242424   5342353    NA         A      
 3       -2421432.       14532556.         -11351. dfsfsfsfds          NA         14532556   NA         A      
 4     4682336417.          43566.         -11351. xbbcbcncncc         4684757849 43566      NA         A      
 5       -2421432.       46367367.     1354139994. fbbdsgffhhh         NA         46367367   1354151345 A      
 6        8063608.      768769769.     1351340003. dggdgdgegeggerr     10485040   768769769  1351351354 A      
 7      -24242424.       -4544454.         314534. dfgthrgfgf          NA         797899     314534     B      
 8      -17393024.       -5342353.           1535. wegregegg           6849400    NA         1535       B      
 9      -24202124.       -5342353.        3145354. egegegergewge       40300      NA         3145354    B      
10              0.      -14532556.        4353455. sfngegebser         NA         NA         4353455    C      
11              0.      -13845657.         324535. qegqeefbew          NA         686899     324535     C      
12              0.       -6633577.        3543445. qegqetegqt          NA         7898979    3543445    C      
13      556456466.      -13853988.          34535. qwtqtewr            556456466  678568     34535      C      
14     4646456466.      -14532556.       34535534. etghsfrgf           4646456466 NA         34535534   C      
15      246464266.      -14463670.              0. sfsdfbdfbergeagaegr 246464266  68886      NA         C      
16     4564242646.      -14532068.              0. wasfqertsdfaefwe    4564242646 488        NA         C

Note: This answer is largely a derivation from akrun's answer here.

How to use SQL minus query equivalent between two dataframes properly

Yes, you can use the indicator like this:

df1.merge(df2, how='left', indicator='ind').query('ind=="left_only"')

Where df1 is:

   col1  col2  col3
0   1.0   2.0   3.0
1   2.0   3.0   4.0
2   5.0   6.0   6.0
3   8.0   9.0   9.0
4  10.0  10.0  10.0

and df2 is:

   col1  col2  col3
0     5     6     6
1     8     9     9
2     1     2     3
3     2     3     4

Output:

   col1  col2  col3        ind
4  10.0  10.0  10.0  left_only

Pandas analogue to SQL MINUS / EXCEPT operator, using multiple columns

We can use pandas.concat with drop_duplicates here and pass it the argument to drop all duplicates with keep=False:

pd.concat([d1, d2]).drop_duplicates(['a', 'b'], keep=False)

   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7

Edit after comment by OP

If you want to make sure that unique rows in df2 arnt taken into account, we can duplicate that df:

pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)

   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7

subtracting two dataframes

Move the City column into the index. The DataFrames will align by both index and columns first and then do subtraction. Any combination not present will result in NaN.

df2.set_index('City').subtract(df1.set_index('City'), fill_value=0)

how to subtract 2 data frame with the same size?

I guess this is what you want.

df1 = pd.DataFrame({"A": [4,0], "B": [5,6]})
df2 = pd.DataFrame({"A": [6,7], "B": [0,1]})

df = df1 - df2

df
Out[4]: 
   A  B
0 -2  5
1 -7  5

Spark: subtract two DataFrames

According to the Scala API docs, doing:

dataFrame1.except(dataFrame2)

will return a new DataFrame containing rows in dataFrame1 but not in dataframe2.

Minus Operation of Data Frames