Minus Operation of Data Frames

Minus operation of data frames

I remember coming across this exact issue quite a few months back. Managed to sift through my Evernote one-liners.

Note: This is not my solution. Credit goes to whoever wrote it (whom I can't seem to find at the moment).

If you don't worry about rownames then you can do:

df1[!duplicated(rbind(df2, df1))[-seq_len(nrow(df2))], ]
# c1 c2
# 1 a 1
# 2 b 2

Edit: A data.table solution:

dt1 <- data.table(df1, key="c1")
dt2 <- data.table(df2)
dt1[!dt2]

or better one-liner (from v1.9.6+):

setDT(df1)[!df2, on="c1"]

This returns all rows in df1 where df2$c1 doesn't have a match with df1$c1.

Subtracting columns of data frame by name

Try this base R solution without loop. Just have in mind the position of columns:

#Data
df <- as.data.frame(matrix(seq(1,20,1),nrow=4), byrow=TRUE)
colnames(df) <- c("X1","X2","X3","X4","X5")
rownames(df) <- as.Date(c("2020-01-02","2020-01-03","2020-01-04","2020-01-05"))
#Set columns for difference
df[,2:5] <- df[,2:5]-df[,1]

Output:

           X1 X2 X3 X4 X5
2020-01-02 1 4 8 12 16
2020-01-03 2 4 8 12 16
2020-01-04 3 4 8 12 16
2020-01-05 4 4 8 12 16

Or a more sophisticated way would be:

#Create index
#Var to substract
i1 <- which(names(df)=='X1')
#Vars to be substracted with X1
i2 <- which(names(df)!='X1')
#Compute
df[,i2]<-df[,i2]-df[,i1]

Output:

           X1 X2 X3 X4 X5
2020-01-02 1 4 8 12 16
2020-01-03 2 4 8 12 16
2020-01-04 3 4 8 12 16
2020-01-05 4 4 8 12 16

Subtract rows of data frames column-wise preserving multiple factor column

IIUC, you have columns of the same name in df_1 and df_2 (eg int_samp_X for some integer X), and you want to get the difference of matching column names, grouped by gene_nm (eg df_1[df_1$gene_nm == 'A', int_samp_1] - df_2[df_2$gene_nm == 'A', int_samp_1]).

We can use the tidyverse family of packages to solve this problem, notably, dplyr and purrr.

First, merge df_1 and df_2 with a left_join, to ensure all the many entries in df_1 are preserved when matched against the gene-level entries in df_2:

library(tidyverse)

df_3 <- df_1 %>% left_join(df_2, by = "gene_nm")

df_3
pep_seq int_samp_1.x int_samp_2.x int_samp_3.x gene_nm int_samp_1.y int_samp_2.y int_samp_3.y
1 Minus Operation of Data Framesa 2421432 NA 11351 A 2421432 NA 11351
2 ababababba 24242424 5342353 NA A 2421432 NA 11351
3 dfsfsfsfds NA 14532556 NA A 2421432 NA 11351
4 xbbcbcncncc 4684757849 43566 NA A 2421432 NA 11351
5 fbbdsgffhhh NA 46367367 1354151345 A 2421432 NA 11351
6 dggdgdgegeggerr 10485040 768769769 1351351354 A 2421432 NA 11351
7 dfgthrgfgf NA 797899 314534 B 24242424 5342353 NA
8 wegregegg 6849400 NA 1535 B 24242424 5342353 NA
9 egegegergewge 40300 NA 3145354 B 24242424 5342353 NA
10 sfngegebser NA NA 4353455 C NA 14532556 NA
11 qegqeefbew NA 686899 324535 C NA 14532556 NA
12 qegqetegqt NA 7898979 3543445 C NA 14532556 NA
13 qwtqtewr 556456466 678568 34535 C NA 14532556 NA
14 etghsfrgf 4646456466 NA 34535534 C NA 14532556 NA
15 sfsdfbdfbergeagaegr 246464266 68886 NA C NA 14532556 NA
16 wasfqertsdfaefwe 4564242646 488 NA C NA 14532556 NA

Then map over the column names of interest, taking the difference from each column-pair. (Note that you'll need to convert the int_samp columns from factor to numeric first.)

Update (per OP comments):
To convert NA to 0 before computing differences, we can use mutate_if() and replace(), adding the following to the method chain:

mutate_if(is.numeric,  funs(replace(., is.na(.), 0)))

Finally, join back to df_1:

var_names <- df_1 %>% select(starts_with("int_samp")) %>% names()

var_names # [1] "int_samp_1" "int_samp_2" "int_samp_3"

var_names %>%
map_dfc(~df_3 %>%
mutate_at(vars(matches(.x)), funs(as.numeric(as.character(.)))) %>%
mutate_if(is.numeric, funs(replace(., is.na(.), 0))) %>%
select(matches(.x)) %>%
reduce(`-`)) %>%
set_names(paste0(var_names, "_diff")) %>%
bind_cols(df_1)

Output:

   int_samp_1_diff int_samp_2_diff int_samp_3_diff pep_seq             int_samp_1 int_samp_2 int_samp_3 gene_nm
<dbl> <dbl> <dbl> <fct> <fct> <fct> <fct> <fct>
1 0. 0. 0. Minus Operation of Data Framesa 2421432 NA 11351 A
2 21820992. 5342353. -11351. ababababba 24242424 5342353 NA A
3 -2421432. 14532556. -11351. dfsfsfsfds NA 14532556 NA A
4 4682336417. 43566. -11351. xbbcbcncncc 4684757849 43566 NA A
5 -2421432. 46367367. 1354139994. fbbdsgffhhh NA 46367367 1354151345 A
6 8063608. 768769769. 1351340003. dggdgdgegeggerr 10485040 768769769 1351351354 A
7 -24242424. -4544454. 314534. dfgthrgfgf NA 797899 314534 B
8 -17393024. -5342353. 1535. wegregegg 6849400 NA 1535 B
9 -24202124. -5342353. 3145354. egegegergewge 40300 NA 3145354 B
10 0. -14532556. 4353455. sfngegebser NA NA 4353455 C
11 0. -13845657. 324535. qegqeefbew NA 686899 324535 C
12 0. -6633577. 3543445. qegqetegqt NA 7898979 3543445 C
13 556456466. -13853988. 34535. qwtqtewr 556456466 678568 34535 C
14 4646456466. -14532556. 34535534. etghsfrgf 4646456466 NA 34535534 C
15 246464266. -14463670. 0. sfsdfbdfbergeagaegr 246464266 68886 NA C
16 4564242646. -14532068. 0. wasfqertsdfaefwe 4564242646 488 NA C

Note: This answer is largely a derivation from akrun's answer here.

How to use SQL minus query equivalent between two dataframes properly

Yes, you can use the indicator like this:

df1.merge(df2, how='left', indicator='ind').query('ind=="left_only"')

Where df1 is:

   col1  col2  col3
0 1.0 2.0 3.0
1 2.0 3.0 4.0
2 5.0 6.0 6.0
3 8.0 9.0 9.0
4 10.0 10.0 10.0

and df2 is:

   col1  col2  col3
0 5 6 6
1 8 9 9
2 1 2 3
3 2 3 4

Output:

   col1  col2  col3        ind
4 10.0 10.0 10.0 left_only

Pandas analogue to SQL MINUS / EXCEPT operator, using multiple columns

We can use pandas.concat with drop_duplicates here and pass it the argument to drop all duplicates with keep=False:

pd.concat([d1, d2]).drop_duplicates(['a', 'b'], keep=False)

a b c
1 0 1 2
2 1 0 3
6 2 2 7

Edit after comment by OP

If you want to make sure that unique rows in df2 arnt taken into account, we can duplicate that df:

pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)

a b c
1 0 1 2
2 1 0 3
6 2 2 7

subtracting two dataframes

Move the City column into the index. The DataFrames will align by both index and columns first and then do subtraction. Any combination not present will result in NaN.

df2.set_index('City').subtract(df1.set_index('City'), fill_value=0)

how to subtract 2 data frame with the same size?

I guess this is what you want.

df1 = pd.DataFrame({"A": [4,0], "B": [5,6]})
df2 = pd.DataFrame({"A": [6,7], "B": [0,1]})

df = df1 - df2

df
Out[4]:
A B
0 -2 5
1 -7 5

Spark: subtract two DataFrames

According to the Scala API docs, doing:

dataFrame1.except(dataFrame2)

will return a new DataFrame containing rows in dataFrame1 but not in dataframe2.



Related Topics



Leave a reply



Submit