Efficient multiplication of columns in a data frame
As Blue Magister said in comments,
df$new_column <- df$column1 * df$column2
should work just fine. Of course we can never know for sure if we don't have an example of the data.
faster column-multiply in dataframe
To check which is faster, time each case.
In IPython or Jupyter, that would be:
%%timeit
d['a'] * d['b']
For a dataframe like this one:
a = np.arange(0,10000)
b = np.ones(10000)
d = pd.DataFrame(np.vstack([a,b]).T, columns=["a","b"])
Get your multiplication:
1 - in pandas
d['a'] * d['b']
81.2 µs ± 977 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
2 - in NumPy, avoiding the pandas overhead
d['a'].values * d['b'].values
9.21 µs ± 41.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
If you are that worried about speed, use plain NumPy. Take advantage of the pandas feature that gives you direct access to the underlying array: the values attribute.
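Outside IPython the %%timeit magic is not available. A minimal sketch of the same comparison with the standard timeit module follows; the helper names multiply_pandas and multiply_numpy and the loop count are assumptions made for illustration, and the timings will differ from the numbers quoted above depending on machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Same frame as in the answer above
a = np.arange(0, 10000)
b = np.ones(10000)
d = pd.DataFrame(np.vstack([a, b]).T, columns=["a", "b"])


def multiply_pandas():
    # Series * Series: keeps the index, pays the pandas overhead
    return d["a"] * d["b"]


def multiply_numpy():
    # Plain arrays: no index alignment, returns an ndarray
    return d["a"].to_numpy() * d["b"].to_numpy()


# Both approaches should give the same numbers
assert np.array_equal(multiply_pandas().to_numpy(), multiply_numpy())

n_loops = 1000  # hypothetical loop count, adjust as needed
pandas_time = timeit.timeit(multiply_pandas, number=n_loops) / n_loops
numpy_time = timeit.timeit(multiply_numpy, number=n_loops) / n_loops
print(f"pandas: {pandas_time * 1e6:.1f} us per loop")
print(f"numpy:  {numpy_time * 1e6:.1f} us per loop")
```

Note that the NumPy version returns a bare array, so the index is lost unless you assign the result back into the frame.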
How to simply multiply two columns of a dataframe?
In Base R:
df$c <- df$a * df$b
or df$c <- with(df, a * b)
In dplyr:
df <- df %>% mutate(c = a * b)
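For readers doing the same thing in pandas rather than R, a rough counterpart might look like the sketch below. The frame contents are made up for illustration; the column names a, b and c simply mirror the R snippets above.

```python
import pandas as pd

# Hypothetical data, only to make the example runnable
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Counterpart of the base R form: assign a new column in place
df["c"] = df["a"] * df["b"]

# Counterpart of the dplyr form: assign returns a new frame
df2 = df[["a", "b"]].assign(c=lambda frame: frame["a"] * frame["b"])
```

The assign form leaves the original frame untouched, which is closer in spirit to mutate.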
Efficient ways to multiply all columns in data frame with each other
Here is another option with combn: take the combinations of column names two at a time, multiply the corresponding columns after subsetting, and cbind the result with the square of the original dataset.
res <- cbind(df1^2, do.call(cbind,combn(colnames(df1), 2,
FUN= function(x) list(df1[x[1]]*df1[x[2]]))))
colnames(res)[-(seq_len(ncol(df1)))] <- combn(colnames(df1), 2,
FUN = paste, collapse=":")
res
# a b c a:b a:c b:c
#1 0.08559952 0.365890531 0.008823729 0.17697473 0.02748285 0.056820059
#2 0.05057603 0.137444401 0.304984209 0.08337501 0.12419698 0.204739766
#3 0.49592997 0.451167798 0.525871254 0.47301970 0.51068123 0.487089495
#4 0.26925425 0.452905189 0.019023202 0.34920860 0.07156869 0.092820832
#5 0.43906475 0.102675746 0.049713853 0.21232357 0.14774167 0.071445132
#6 0.84721676 0.817486693 0.472890881 0.83221898 0.63296215 0.621757189
#7 0.07825199 0.039249934 0.005850588 0.05542008 0.02139673 0.015153719
#8 0.58342170 0.001953909 0.359676293 0.03376319 0.45808619 0.026509902
#9 0.64261164 0.250923183 0.397086073 0.40155468 0.50514566 0.315655035
#10 0.06488487 0.019260683 0.002174826 0.03535148 0.01187911 0.006472142
Fastest way to multiply multiple columns in Dataframe based on conditions
If you select multiple columns by a list of column names and multiply them with DataFrame.mul, it is fast:
cols = ['a','c','e']
df[cols] = df[cols].mul(df['d'], axis=0)
print (df)
a b c d e
0 1.20 23 3.40 0.10 2.50
1 0.26 26 0.76 0.02 0.52
2 0.76 28 1.24 0.04 0.88
NumPy alternative, but not faster:
cols = ['a','c','e']
df[cols] = df[cols].to_numpy() * df['d'].to_numpy()[:, None]
df = pd.DataFrame(data)
#300k rows
df = pd.concat([df] * 100000, ignore_index=True)
print (df)
In [113]: %%timeit
...: cols = ['a','c','e']
...: df[cols] = df[cols].mul(df['d'], axis=0)
...:
...:
14.5 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [114]: %%timeit
...: cols = ['a','c','e']
...: df[cols] = df[cols].to_numpy() * df['d'].to_numpy()[:, None]
...:
138 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
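To see that the two variants compute the same thing, here is a small sketch on a made-up frame (the values are hypothetical, since the original data dictionary is not shown). It compares DataFrame.mul with axis=0 against NumPy broadcasting with an added trailing axis.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names follow the answer above
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10, 20, 30],
    "c": [4.0, 5.0, 6.0],
    "d": [0.5, 2.0, 10.0],
})
cols = ["a", "c"]

# pandas: multiply each selected column row-wise by column d
via_mul = df[cols].mul(df["d"], axis=0)

# NumPy: turn d into a column vector so it broadcasts across the columns
via_numpy = df[cols].to_numpy() * df["d"].to_numpy()[:, None]

assert np.allclose(via_mul.to_numpy(), via_numpy)
```

The [:, None] indexing reshapes d from shape (n,) to (n, 1), which is what lets it broadcast against the (n, 2) block of selected columns.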
Multiply all values in each column of a data frame by another value based on matching column names
We can replicate the second dataset so that it lines up with the matching columns and then multiply. Using 'd.2':
dd[names(d.2)] <- dd[names(d.2)] * d.2[col(dd[names(d.2)])]
Using 'd.1':
dd[as.character(d.1$p)] <- dd[as.character(d.1$p)] * d.1$q[col(dd[as.character(d.1$p)])]
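In pandas the same idea can lean on label alignment instead of explicit replication. The sketch below is an illustration under assumptions: the frame dd and the Series factors are invented, and columns without a matching factor are left unchanged by filling in a factor of 1.

```python
import pandas as pd

# Hypothetical data frame and per-column factors keyed by column name
dd = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
factors = pd.Series({"z": 10, "x": 2})

# Align factors to the column order; unmatched columns get a factor of 1
aligned = factors.reindex(dd.columns, fill_value=1)

# axis=1 matches the Series index against the column labels
scaled = dd.mul(aligned, axis=1)
```

Because the match is by label, the order of the entries in factors does not matter.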
Most efficient way to multiply every column of a large pandas dataframe with every other column of the same dataframe
You could use itertools.combinations for this:
>>> import pandas as pd
>>> from itertools import combinations
>>> df = pd.DataFrame({
... "A": [1,1,1,0,1],
... "B": [1,1,0,0,1],
... "C": [.75,1,.35,1,0]
... })
>>> df.head()
A B C
0 1 1 0.75
1 1 1 1.00
2 1 0 0.35
3 0 0 1.00
4 1 1 0.00
>>> for col1, col2 in combinations(df.columns, 2):
...     df[f"{col1}_{col2}"] = df[col1] * df[col2]
...
>>> df.head()
A B C A_B A_C B_C
0 1 1 0.75 1 0.75 0.75
1 1 1 1.00 1 1.00 1.00
2 1 0 0.35 0 0.35 0.00
3 0 0 1.00 0 0.00 0.00
4 1 1 0.00 1 0.00 0.00
If you need to apply an arbitrary function to the pairs of columns, you could use np.vectorize (note that it is essentially a for loop provided for convenience, not for performance):
import numpy as np
from itertools import combinations

def fx(x, y):
    # any function of two scalar values can go here
    return np.multiply(x, y)

for col1, col2 in combinations(df.columns, 2):
    df[f"{col1}_{col2}"] = np.vectorize(fx)(df[col1], df[col2])
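For a large frame where only plain multiplication is needed, the Python-level loop can be avoided. The following is a sketch of one possible alternative, not part of the original answer: it builds all pairwise products in a single NumPy operation using np.triu_indices. It assumes every column is numeric, and the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 0, 1],
    "B": [1, 1, 0, 0, 1],
    "C": [0.75, 1, 0.35, 1, 0],
})

values = df.to_numpy(dtype=float)

# Index pairs (i, j) with i < j, in the same order as itertools.combinations
i, j = np.triu_indices(df.shape[1], k=1)

# One fancy-indexed multiplication yields every pairwise product
products = values[:, i] * values[:, j]
names = [f"{df.columns[p]}_{df.columns[q]}" for p, q in zip(i, j)]

result = pd.concat(
    [df, pd.DataFrame(products, columns=names, index=df.index)],
    axis=1,
)
```

The number of product columns grows quadratically with the number of input columns, so memory can become the limiting factor before speed does.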
Most efficient way to multiply a data frame by a vector
You could try (using df and v from Richard Scriven's answer):
df[-1] <- t(t(df[-1]) * v)
df
# a x y z
# 1 a 5 40 105
# 2 b 10 50 120
# 3 c 15 60 135
When you multiply a matrix by a vector, R recycles the vector down the columns, so the multiplication runs columnwise. Since you want each row multiplied element by element by the vector, we transpose df[-1] using t, multiply by v, and transpose back using t.
It seems like this approach has a slight edge in benchmarking over the Map approach, and a significant advantage over sweep:
library(microbenchmark)
rscriven <- function(df, v) cbind(df[1], Map(`*`, df[-1], v))
josilber <- function(df, v) cbind(df[1], t(t(df[-1]) * v))
dardisco <- function(df, v) cbind(df[1], sweep(df[-1], MARGIN=2, STATS=v, FUN="*"))
df2 <- cbind(data.frame(rep("a", 1000)), matrix(rnorm(100000), nrow=1000))
v2 <- rnorm(100)
all.equal(rscriven(df2, v2), josilber(df2, v2))
# [1] TRUE
all.equal(rscriven(df2, v2), dardisco(df2, v2))
# [1] TRUE
microbenchmark(rscriven(df2, v2), josilber(df2, v2), dardisco(df2, v2))
# Unit: milliseconds
# expr min lq median uq max neval
# rscriven(df2, v2) 5.276458 5.378436 5.451041 5.587644 9.470207 100
# josilber(df2, v2) 2.545144 2.753363 3.099589 3.704077 8.955193 100
# dardisco(df2, v2) 11.647147 12.761184 14.196678 16.581004 132.428972 100
Thanks to @thelatemail for pointing out that the Map approach is a good deal faster for 100x larger data frames:
df2 <- cbind(data.frame(rep("a", 10000)), matrix(rnorm(10000000), nrow=10000))
v2 <- rnorm(1000)
microbenchmark(rscriven(df2, v2), josilber(df2, v2), dardisco(df2, v2))
# Unit: milliseconds
# expr min lq median uq max neval
# rscriven(df2, v2) 75.74051 90.20161 97.08931 115.7789 259.0855 100
# josilber(df2, v2) 340.72774 388.17046 498.26836 514.5923 623.4020 100
# dardisco(df2, v2) 928.81128 1041.34497 1156.39293 1271.4758 1506.0348 100
It seems like you'll need to benchmark to determine which approach is fastest for your application.
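For comparison, in pandas with NumPy the double transpose is not needed, because broadcasting matches a 1-D vector against the last axis, that is, against the columns. A minimal sketch with invented data (the id column and the values of v are assumptions, not the data from the answers above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one label column followed by numeric columns
df = pd.DataFrame({
    "id": list("abc"),
    "x": [1, 2, 3],
    "y": [4, 5, 6],
    "z": [7, 8, 9],
})
v = np.array([10, 100, 1000])  # one factor per numeric column

num_cols = ["x", "y", "z"]

# v has shape (3,), the block has shape (3 rows, 3 columns):
# element k of v scales column k
df[num_cols] = df[num_cols].to_numpy() * v
```

As with the R variants, it is worth benchmarking this against DataFrame.mul on data of realistic size before settling on one.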
Multiply columns using values in a list
You could do:
df * coef[col(df)]
or even
data.frame(t(t(df) * coef))