Efficient Multiplication of Columns in a Data Frame

Efficient multiplication of columns in a data frame

As Blue Magister said in comments,

df$new_column <- df$column1 * df$column2

should work just fine. Of course, we can't know for sure without an example of the data.
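
To illustrate, here is a minimal sketch with made-up data (the column names are just placeholders):

df <- data.frame(column1 = 1:3, column2 = c(10, 20, 30))
df$new_column <- df$column1 * df$column2
df
#   column1 column2 new_column
# 1       1      10         10
# 2       2      20         40
# 3       3      30         90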

faster column-multiply in dataframe

To check which is faster, you can time each case.
In IPython or Jupyter this would be:

%%timeit
d['a'] * d['b']

For a dataframe like this one:

import numpy as np
import pandas as pd

# a 10,000-row test frame with columns "a" and "b"
a = np.arange(0, 10000)
b = np.ones(10000)

d = pd.DataFrame(np.vstack([a, b]).T, columns=["a", "b"])

Then time the multiplication:

1 - in pandas

d['a'] * d['b']
81.2 µs ± 977 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

2 - in NumPy, avoiding the pandas overhead

d['a'].values * d['b'].values
9.21 µs ± 41.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

... If you are really worried about speed, just use NumPy. Take advantage of the fact that pandas lets you access the underlying NumPy array via .values, as above.

How to simply multiply two columns of a dataframe?

In Base R:

df$c <- df$a * df$b

or

df$c <- with(df, a * b)

In dplyr:

df <- df %>% mutate(c = a * b)

Efficient ways to multiply all columns in data frame with each other

Here is another option with combn: take the column names two at a time, multiply the corresponding columns after subsetting, and cbind the result with the square of the original dataset.
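
The input df1 isn't shown in the answer; judging from the output below, it is a 10-row data frame of uniform random values, so a hypothetical setup like the following lets you run the snippet (your random draws will differ):

df1 <- data.frame(a = runif(10), b = runif(10), c = runif(10))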

res <- cbind(df1^2, do.call(cbind,combn(colnames(df1), 2, 
FUN= function(x) list(df1[x[1]]*df1[x[2]]))))
colnames(res)[-(seq_len(ncol(df1)))] <- combn(colnames(df1), 2,
FUN = paste, collapse=":")
res
# a b c a:b a:c b:c
#1 0.08559952 0.365890531 0.008823729 0.17697473 0.02748285 0.056820059
#2 0.05057603 0.137444401 0.304984209 0.08337501 0.12419698 0.204739766
#3 0.49592997 0.451167798 0.525871254 0.47301970 0.51068123 0.487089495
#4 0.26925425 0.452905189 0.019023202 0.34920860 0.07156869 0.092820832
#5 0.43906475 0.102675746 0.049713853 0.21232357 0.14774167 0.071445132
#6 0.84721676 0.817486693 0.472890881 0.83221898 0.63296215 0.621757189
#7 0.07825199 0.039249934 0.005850588 0.05542008 0.02139673 0.015153719
#8 0.58342170 0.001953909 0.359676293 0.03376319 0.45808619 0.026509902
#9 0.64261164 0.250923183 0.397086073 0.40155468 0.50514566 0.315655035
#10 0.06488487 0.019260683 0.002174826 0.03535148 0.01187911 0.006472142

Fastest way to multiply multiple columns in Dataframe based on conditions

If you select multiple columns by a list of column names and multiply with DataFrame.mul, it is fast:

cols = ['a','c','e']
df[cols] = df[cols].mul(df['d'], axis=0)
print (df)
      a   b     c     d     e
0  1.20  23  3.40  0.10  2.50
1  0.26  26  0.76  0.02  0.52
2  0.76  28  1.24  0.04  0.88

Numpy alternative, but not faster:

cols = ['a','c','e']
df[cols] = df[cols].to_numpy() * df['d'].to_numpy()[:, None]


Timing on a larger frame (the 3-row example above repeated 100,000 times, i.e. 300k rows):

df = pd.DataFrame(data)
df = pd.concat([df] * 100000, ignore_index=True)
print (df)

In [113]: %%timeit
...: cols = ['a','c','e']
...: df[cols] = df[cols].mul(df['d'], axis=0)
...:
...:
14.5 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [114]: %%timeit
...: cols = ['a','c','e']
...: df[cols] = df[cols].to_numpy() * df['d'].to_numpy()[:, None]
...:
138 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Multiply all values in each column of a data frame by another value based on matching column names

We can replicate the values of the second dataset across the matching columns and do the multiplication. Using 'd.2':

dd[names(d.2)] <-  dd[names(d.2)] * d.2[col(dd[names(d.2)])]

With 'd.1':

dd[as.character(d.1$p)] <-  dd[as.character(d.1$p)] * d.1$q[col(dd[d.1$p])]
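
For a concrete picture of how the col() recycling works, here is a minimal sketch assuming dd is a data frame and d.2 a named numeric vector of multipliers (these toy objects are stand-ins for the ones in the question):

dd  <- data.frame(a = 1:3, b = 4:6, c = 7:9)
d.2 <- c(a = 10, c = 100)

# col() returns the column position of every cell of dd[names(d.2)],
# so d.2[col(...)] repeats each multiplier down its whole column
dd[names(d.2)] <- dd[names(d.2)] * d.2[col(dd[names(d.2)])]
dd
#    a b   c
# 1 10 4 700
# 2 20 5 800
# 3 30 6 900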

Most efficient way to multiply every column of a large pandas dataframe with every other column of the same dataframe

You could use itertools.combinations for this:

>>> import pandas as pd
>>> from itertools import combinations
>>> df = pd.DataFrame({
... "A": [1,1,1,0,1],
... "B": [1,1,0,0,1],
... "C": [.75,1,.35,1,0]
... })
>>> df.head()
   A  B     C
0  1  1  0.75
1  1  1  1.00
2  1  0  0.35
3  0  0  1.00
4  1  1  0.00
>>> for col1, col2 in combinations(df.columns, 2):
...     df[f"{col1}_{col2}"] = df[col1] * df[col2]
...
>>> df.head()
   A  B     C  A_B   A_C   B_C
0  1  1  0.75    1  0.75  0.75
1  1  1  1.00    1  1.00  1.00
2  1  0  0.35    0  0.35  0.00
3  0  0  1.00    0  0.00  0.00
4  1  1  0.00    1  0.00  0.00

If you need to vectorize an arbitrary function over the pairs of columns, you could use:

import numpy as np

def fx(x, y):
    return np.multiply(x, y)

for col1, col2 in combinations(df.columns, 2):
    df[f"{col1}_{col2}"] = np.vectorize(fx)(df[col1], df[col2])

Most efficient way to multiply a data frame by a vector

You could try (using df and v from Richard Scriven's answer):

df[-1] <- t(t(df[-1]) * v)
df
#   a  x  y   z
# 1 a  5 40 105
# 2 b 10 50 120
# 3 c 15 60 135

When you multiply a matrix by a vector, it multiplies columnwise. Since you want to multiply your rows by the vector, we transpose df[-1] using t, multiply by v, and transpose back using t.
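
For a self-contained run: the df and v from the referenced answer aren't reproduced here, but inputs consistent with the printed output above would be (reconstructed, so treat them as an assumption):

df <- data.frame(a = c("a", "b", "c"), x = 1:3, y = 4:6, z = 7:9)
v  <- c(5, 10, 15)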

It seems like this approach has a slight edge in benchmarking over the Map approach, and a significant advantage over sweep:

library(microbenchmark)
rscriven <- function(df, v) cbind(df[1], Map(`*`, df[-1], v))
josilber <- function(df, v) cbind(df[1], t(t(df[-1]) * v))
dardisco <- function(df, v) cbind(df[1], sweep(df[-1], MARGIN=2, STATS=v, FUN="*"))
df2 <- cbind(data.frame(rep("a", 1000)), matrix(rnorm(100000), nrow=1000))
v2 <- rnorm(100)
all.equal(rscriven(df2, v2), josilber(df2, v2))
# [1] TRUE
all.equal(rscriven(df2, v2), dardisco(df2, v2))
# [1] TRUE

microbenchmark(rscriven(df2, v2), josilber(df2, v2), dardisco(df2, v2))
# Unit: milliseconds
#                expr       min        lq    median        uq        max neval
#  rscriven(df2, v2)  5.276458  5.378436  5.451041  5.587644   9.470207   100
#  josilber(df2, v2)  2.545144  2.753363  3.099589  3.704077   8.955193   100
#  dardisco(df2, v2) 11.647147 12.761184 14.196678 16.581004 132.428972   100

Thanks to @thelatemail for pointing out that the Map approach is a good deal faster for 100x larger data frames:

df2 <- cbind(data.frame(rep("a", 10000)), matrix(rnorm(10000000), nrow=10000))
v2 <- rnorm(1000)
microbenchmark(rscriven(df2, v2), josilber(df2, v2), dardisco(df2, v2))
# Unit: milliseconds
#                expr       min         lq     median        uq       max neval
#  rscriven(df2, v2)  75.74051   90.20161   97.08931  115.7789  259.0855   100
#  josilber(df2, v2) 340.72774  388.17046  498.26836  514.5923  623.4020   100
#  dardisco(df2, v2) 928.81128 1041.34497 1156.39293 1271.4758 1506.0348   100

It seems like you'll need to benchmark to determine which approach is fastest for your application.

Multiply columns using values in a list

You could do:

df * coef[col(df)]

or even

data.frame(t(t(df) * coef))
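
Neither df nor coef is shown in the question; here is a minimal sketch assuming coef has been unlisted to a plain numeric vector with one multiplier per column (hypothetical data):

df   <- data.frame(x = 1:3, y = 4:6, z = 7:9)   # hypothetical data
coef <- c(2, 10, 100)                           # one multiplier per column

df * coef[col(df)]
#   x  y   z
# 1 2 40 700
# 2 4 50 800
# 3 6 60 900

data.frame(t(t(df) * coef))   # same result via the double-transpose idiom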

