Is There Any Advantage to Using Pow(X,2) Instead of X*X, with X Double

Is there any advantage to using pow(x,2) instead of x*x, with x double?

FWIW, with gcc-4.2 on MacOS X 10.6 and -O3 compiler flags,

x = x * x;

and

y = pow(y, 2);

result in the same assembly code:

#include <cmath>

void test(double& x, double& y) {
x = x * x;
y = pow(y, 2);
}

Assembles to:

    pushq   %rbp
movq %rsp, %rbp
movsd (%rdi), %xmm0
mulsd %xmm0, %xmm0
movsd %xmm0, (%rdi)
movsd (%rsi), %xmm0
mulsd %xmm0, %xmm0
movsd %xmm0, (%rsi)
leave
ret

So as long as you're using a decent compiler, write whichever makes more sense to your application, but consider that pow(x, 2) can never be more optimal than the plain multiplication.

What is more efficient? Using pow to square or just multiply it with itself?

UPDATE 2021

I've modified the benchmark code as follows:

  • std::chrono used for timing measurements instead of boost
  • C++11 <random> used instead of rand()
  • Avoid repeated operations that can get hoisted out. The base parameter is ever-changing.

I get the following results with GCC 10 -O2 (in seconds):

exp     c++ pow     c pow       x*x*x...
2 0.204243 1.39962 0.0902527
3 1.36162 1.38291 0.107679
4 1.37717 1.38197 0.106103
5 1.3815 1.39139 0.117097

GCC 10 -O3 is almost identical to GCC 10 -O2.

With GCC 10 -O2 -ffast-math:

exp     c++ pow     c pow       x*x*x...
2 0.203625 1.4056 0.0913414
3 0.11094 1.39938 0.108027
4 0.201593 1.38618 0.101585
5 0.102141 1.38212 0.10662

With GCC 10 -O3 -ffast-math:

exp     c++ pow     c pow       x*x*x...
2 0.0451995 1.175 0.0450497
3 0.0470842 1.20226 0.051399
4 0.0475239 1.18033 0.0473844
5 0.0522424 1.16817 0.0522291

With Clang 12 -O2:

exp     c++ pow     c pow       x*x*x...
2 0.106242 0.105435 0.105533
3 1.45909 1.4425 0.102235
4 1.45629 1.44262 0.108861
5 1.45837 1.44483 0.1116

Clang 12 -O3 is almost identical to Clang 12 -O2.

With Clang 12 -O2 -ffast-math:

exp     c++ pow     c pow       x*x*x...
2 0.0233731 0.0232457 0.0231076
3 0.0271074 0.0266663 0.0278415
4 0.026897 0.0270698 0.0268115
5 0.0312481 0.0296402 0.029811

Clang 12 -O3 -ffast-math is almost identical to Clang 12 -O2 -ffast-math.

Machine is Intel Core i7-7700K on Linux 5.4.0-73-generic x86_64.

Conclusions:

  • With GCC 10 (no -ffast-math), x*x*x... is always faster
  • With GCC 10 -O2 -ffast-math, std::pow is as fast as x*x*x... for odd exponents
  • With GCC 10 -O3 -ffast-math, std::pow is as fast as x*x*x... for all test cases, and is around twice as fast as -O2.
  • With GCC 10, C's pow(double, double) is always much slower
  • With Clang 12 (no -ffast-math), x*x*x... is faster for exponents greater than 2
  • With Clang 12 -ffast-math, all methods produce similar results
  • With Clang 12, pow(double, double) is as fast as std::pow for integral exponents
  • Writing benchmarks without having the compiler outsmart you is hard.

I'll eventually get around to installing a more recent version of GCC on my machine and will update my results when I do so.

Here's the updated benchmark code:

#include <cmath>
#include <chrono>
#include <iostream>
#include <random>

using Moment = std::chrono::high_resolution_clock::time_point;
using FloatSecs = std::chrono::duration<double>;

inline Moment now()
{
return std::chrono::high_resolution_clock::now();
}

#define TEST(num, expression) \
double test##num(double b, long loops) \
{ \
double x = 0.0; \
\
auto startTime = now(); \
for (long i=0; i<loops; ++i) \
{ \
x += expression; \
b += 1.0; \
} \
auto elapsed = now() - startTime; \
auto seconds = std::chrono::duration_cast<FloatSecs>(elapsed); \
std::cout << seconds.count() << "\t"; \
return x; \
}

TEST(2, b*b)
TEST(3, b*b*b)
TEST(4, b*b*b*b)
TEST(5, b*b*b*b*b)

template <int exponent>
double testCppPow(double base, long loops)
{
double x = 0.0;

auto startTime = now();
for (long i=0; i<loops; ++i)
{
x += std::pow(base, exponent);
base += 1.0;
}
auto elapsed = now() - startTime;

auto seconds = std::chrono::duration_cast<FloatSecs>(elapsed); \
std::cout << seconds.count() << "\t"; \

return x;
}

double testCPow(double base, double exponent, long loops)
{
double x = 0.0;

auto startTime = now();
for (long i=0; i<loops; ++i)
{
x += ::pow(base, exponent);
base += 1.0;
}
auto elapsed = now() - startTime;

auto seconds = std::chrono::duration_cast<FloatSecs>(elapsed); \
std::cout << seconds.count() << "\t"; \

return x;
}

int main()
{
using std::cout;
long loops = 100000000l;
double x = 0;
std::random_device rd;
std::default_random_engine re(rd());
std::uniform_real_distribution<double> dist(1.1, 1.2);
cout << "exp\tc++ pow\tc pow\tx*x*x...";

cout << "\n2\t";
double b = dist(re);
x += testCppPow<2>(b, loops);
x += testCPow(b, 2.0, loops);
x += test2(b, loops);

cout << "\n3\t";
b = dist(re);
x += testCppPow<3>(b, loops);
x += testCPow(b, 3.0, loops);
x += test3(b, loops);

cout << "\n4\t";
b = dist(re);
x += testCppPow<4>(b, loops);
x += testCPow(b, 4.0, loops);
x += test4(b, loops);

cout << "\n5\t";
b = dist(re);
x += testCppPow<5>(b, loops);
x += testCPow(b, 5.0, loops);
x += test5(b, loops);

std::cout << "\n" << x << "\n";
}

Old Answer, 2010

I tested the performance difference between x*x*... vs pow(x,i) for small i using this code:

#include <cstdlib>
#include <cmath>
#include <boost/date_time/posix_time/posix_time.hpp>

inline boost::posix_time::ptime now()
{
return boost::posix_time::microsec_clock::local_time();
}

#define TEST(num, expression) \
double test##num(double b, long loops) \
{ \
double x = 0.0; \
\
boost::posix_time::ptime startTime = now(); \
for (long i=0; i<loops; ++i) \
{ \
x += expression; \
x += expression; \
x += expression; \
x += expression; \
x += expression; \
x += expression; \
x += expression; \
x += expression; \
x += expression; \
x += expression; \
} \
boost::posix_time::time_duration elapsed = now() - startTime; \
\
std::cout << elapsed << " "; \
\
return x; \
}

TEST(1, b)
TEST(2, b*b)
TEST(3, b*b*b)
TEST(4, b*b*b*b)
TEST(5, b*b*b*b*b)

template <int exponent>
double testpow(double base, long loops)
{
double x = 0.0;

boost::posix_time::ptime startTime = now();
for (long i=0; i<loops; ++i)
{
x += std::pow(base, exponent);
x += std::pow(base, exponent);
x += std::pow(base, exponent);
x += std::pow(base, exponent);
x += std::pow(base, exponent);
x += std::pow(base, exponent);
x += std::pow(base, exponent);
x += std::pow(base, exponent);
x += std::pow(base, exponent);
x += std::pow(base, exponent);
}
boost::posix_time::time_duration elapsed = now() - startTime;

std::cout << elapsed << " ";

return x;
}

int main()
{
using std::cout;
long loops = 100000000l;
double x = 0.0;
cout << "1 ";
x += testpow<1>(rand(), loops);
x += test1(rand(), loops);

cout << "\n2 ";
x += testpow<2>(rand(), loops);
x += test2(rand(), loops);

cout << "\n3 ";
x += testpow<3>(rand(), loops);
x += test3(rand(), loops);

cout << "\n4 ";
x += testpow<4>(rand(), loops);
x += test4(rand(), loops);

cout << "\n5 ";
x += testpow<5>(rand(), loops);
x += test5(rand(), loops);
cout << "\n" << x << "\n";
}

Results are:

1 00:00:01.126008 00:00:01.128338 
2 00:00:01.125832 00:00:01.127227
3 00:00:01.125563 00:00:01.126590
4 00:00:01.126289 00:00:01.126086
5 00:00:01.126570 00:00:01.125930
2.45829e+54

Note that I accumulate the result of every pow calculation to make sure the compiler doesn't optimize it away.

If I use the std::pow(double, double) version, and loops = 1000000l, I get:

1 00:00:00.011339 00:00:00.011262 
2 00:00:00.011259 00:00:00.011254
3 00:00:00.975658 00:00:00.011254
4 00:00:00.976427 00:00:00.011254
5 00:00:00.973029 00:00:00.011254
2.45829e+52

This is on an Intel Core Duo running Ubuntu 9.10 64bit. Compiled using gcc 4.4.1 with -o2 optimization.

So in C, yes x*x*x will be faster than pow(x, 3), because there is no pow(double, int) overload. In C++, it will be the roughly same. (Assuming the methodology in my testing is correct.)


This is in response to the comment made by An Markm:

Even if a using namespace std directive was issued, if the second parameter to pow is an int, then the std::pow(double, int) overload from <cmath> will be called instead of ::pow(double, double) from <math.h>.

This test code confirms that behavior:

#include <iostream>

namespace foo
{

double bar(double x, int i)
{
std::cout << "foo::bar\n";
return x*i;
}

}

double bar(double x, double y)
{
std::cout << "::bar\n";
return x*y;
}

using namespace foo;

int main()
{
double a = bar(1.2, 3); // Prints "foo::bar"
std::cout << a << "\n";
return 0;
}

How raise to power works? Is it worth to use pow(x, 2)?

If you're comparing multiplication with the pow() standard library function then yes, multiplication is definitely faster.

When not to use pow()

May I suggest one thing? pow( x, 2.0 ) may be as fast as x * x when you have good compiler and proper optimization level. This will NOT be faster, though. It may be slower. For that particular reason I would still do x * x.

Why does pow(n,2) return 24 when n=5, with my compiler and OS?

Here is what may be happening here. You should be able to confirm this by looking at your compiler's implementation of the pow function:

Assuming you have the correct #include's, (all the previous answers and comments about this are correct -- don't take the #include files for granted), the prototype for the standard pow function is this:

double pow(double, double);

and you're calling pow like this:

pow(5,2);

The pow function goes through an algorithm (probably using logarithms), thus uses floating point functions and values to compute the power value.

The pow function does not go through a naive "multiply the value of x a total of n times", since it has to also compute pow using fractional exponents, and you can't compute fractional powers that way.

So more than likely, the computation of pow using the parameters 5 and 2 resulted in a slight rounding error. When you assigned to an int, you truncated the fractional value, thus yielding 24.

If you are using integers, you might as well write your own "intpow" or similar function that simply multiplies the value the requisite number of times. The benefits of this are:

  1. You won't get into the situation where you may get subtle rounding errors using pow.

  2. Your intpow function will more than likely run faster than an equivalent call to pow.

Unary operator in C or C++ for the second power of a number

"it is more efficient to use x*x instead of pow(x,2)"

Not certainly more efficient. It might be the same. C allows analyze-ability of such functions like pow() and both may emit the same code.

A compiler may not analyze your pow2() and create sub-optimal code as compared to pow(a,2).

If truly concerned, profile your code. Yet what is best on one platform may differ on others.

How to implement a genuine unary implementation of pow2?

inline double pow2(double a){return a*a;} is OK.

Would it be the most efficient way to obtain the second power of a double?

"most efficient" --> I suggest no function, just x*x.


Also note that C allows FP to evaluate at higher precision that required. Research FLT_EVL_METHOD. The goal for most efficient way to obtain the second power of a double with a function may defeat overall performance.

Exponentials in python: x**y vs math.pow(x, y)

Using the power operator ** will be faster as it won’t have the overhead of a function call. You can see this if you disassemble the Python code:

>>> dis.dis('7. ** i')
1 0 LOAD_CONST 0 (7.0)
3 LOAD_NAME 0 (i)
6 BINARY_POWER
7 RETURN_VALUE
>>> dis.dis('pow(7., i)')
1 0 LOAD_NAME 0 (pow)
3 LOAD_CONST 0 (7.0)
6 LOAD_NAME 1 (i)
9 CALL_FUNCTION 2 (2 positional, 0 keyword pair)
12 RETURN_VALUE
>>> dis.dis('math.pow(7, i)')
1 0 LOAD_NAME 0 (math)
3 LOAD_ATTR 1 (pow)
6 LOAD_CONST 0 (7)
9 LOAD_NAME 2 (i)
12 CALL_FUNCTION 2 (2 positional, 0 keyword pair)
15 RETURN_VALUE

Note that I’m using a variable i as the exponent here because constant expressions like 7. ** 5 are actually evaluated at compile time.

Now, in practice, this difference does not matter that much, as you can see when timing it:

>>> from timeit import timeit
>>> timeit('7. ** i', setup='i = 5')
0.2894785532627111
>>> timeit('pow(7., i)', setup='i = 5')
0.41218495570683444
>>> timeit('math.pow(7, i)', setup='import math; i = 5')
0.5655053168791255

So, while pow and math.pow are about twice as slow, they are still fast enough to not care much. Unless you can actually identify the exponentiation as a bottleneck, there won’t be a reason to choose one method over the other if clarity decreases. This especially applies since pow offers an integrated modulo operation for example.


Alfe asked a good question in the comments above:

timeit shows that math.pow is slower than ** in all cases. What is math.pow() good for anyway? Has anybody an idea where it can be of any advantage then?

The big difference of math.pow to both the builtin pow and the power operator ** is that it always uses float semantics. So if you, for some reason, want to make sure you get a float as a result back, then math.pow will ensure this property.

Let’s think of an example: We have two numbers, i and j, and have no idea if they are floats or integers. But we want to have a float result of i^j. So what options do we have?

  • We can convert at least one of the arguments to a float and then do i ** j.
  • We can do i ** j and convert the result to a float (float exponentation is automatically used when either i or j are floats, so the result is the same).
  • We can use math.pow.

So, let’s test this:

>>> timeit('float(i) ** j', setup='i, j = 7, 5')
0.7610865891750791
>>> timeit('i ** float(j)', setup='i, j = 7, 5')
0.7930400942188385
>>> timeit('float(i ** j)', setup='i, j = 7, 5')
0.8946636625872202
>>> timeit('math.pow(i, j)', setup='import math; i, j = 7, 5')
0.5699394063529439

As you can see, math.pow is actually faster! And if you think about it, the overhead from the function call is also gone now, because in all the other alternatives we have to call float().


In addition, it might be worth to note that the behavior of ** and pow can be overridden by implementing the special __pow__ (and __rpow__) method for custom types. So if you don’t want that (for whatever reason), using math.pow won’t do that.

Why is x**3 slower than x*x*x?

As per this answer, it's because the implementation of exponentiation has some overhead that multiplication does not. However, naive multiplication will get slower and slower as the exponent increases. An empirical demonstration:

 In [3]: x = np.random.rand(1e6)

In [15]: %timeit x**2
100 loops, best of 3: 11.9 ms per loop

In [16]: %timeit x*x
100 loops, best of 3: 12.7 ms per loop

In [17]: %timeit x**3
10 loops, best of 3: 132 ms per loop

In [18]: %timeit x*x*x
10 loops, best of 3: 27.2 ms per loop

In [19]: %timeit x**4
10 loops, best of 3: 132 ms per loop

In [20]: %timeit x*x*x*x
10 loops, best of 3: 42.4 ms per loop

In [21]: %timeit x**10
10 loops, best of 3: 132 ms per loop

In [22]: %timeit x*x*x*x*x*x*x*x*x*x
10 loops, best of 3: 137 ms per loop

In [24]: %timeit x**15
10 loops, best of 3: 132 ms per loop

In [25]: %timeit x*x*x*x*x*x*x*x*x*x*x*x*x*x*x
1 loops, best of 3: 212 ms per loop

Note the exponentiation time stays more or less constant, except for the x**2 case which I suspect is special-cased, while multiplication gets slower and slower. It seems you could exploit this to get faster integer exponentiation... for example:

In [26]: %timeit x**16
10 loops, best of 3: 132 ms per loop

In [27]: %timeit x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x
1 loops, best of 3: 225 ms per loop

In [28]: def tosixteenth(x):
....: x2 = x*x
....: x4 = x2*x2
....: x8 = x4*x4
....: x16 = x8*x8
....: return x16
....:

In [29]: %timeit tosixteenth(x)
10 loops, best of 3: 49.5 ms per loop

It seems you could apply this technique generically by splitting any integer into a sum of the powers of two, computing each power of two as above, and summing:

In [93]: %paste
def smartintexp(x, exp):
result = np.ones(len(x))
curexp = np.array(x)
while True:
if exp%2 == 1:
result *= curexp
exp >>= 1
if not exp: break
curexp *= curexp
return result
## -- End pasted text --

In [94]: x
Out[94]:
array([ 0.0163407 , 0.57694587, 0.47336487, ..., 0.70255032,
0.62043303, 0.0796748 ])

In [99]: x**21
Out[99]:
array([ 3.01080670e-38, 9.63466181e-06, 1.51048544e-07, ...,
6.02873388e-04, 4.43193256e-05, 8.46721060e-24])

In [100]: smartintexp(x, 21)
Out[100]:
array([ 3.01080670e-38, 9.63466181e-06, 1.51048544e-07, ...,
6.02873388e-04, 4.43193256e-05, 8.46721060e-24])

In [101]: %timeit x**21
10 loops, best of 3: 132 ms per loop

In [102]: %timeit smartintexp(x, 21)
10 loops, best of 3: 70.7 ms per loop

It's fast for small even powers of two:

In [106]: %timeit x**32
10 loops, best of 3: 131 ms per loop

In [107]: %timeit smartintexp(x, 32)
10 loops, best of 3: 57.4 ms per loop

But gets slower as the exponent gets larger:

In [97]: %timeit x**63
10 loops, best of 3: 133 ms per loop

In [98]: %timeit smartintexp(x, 63)
10 loops, best of 3: 110 ms per loop

And not faster for large worst-cases:

In [115]: %timeit x**511
10 loops, best of 3: 135 ms per loop

In [114]: %timeit smartintexp(x, 511)
10 loops, best of 3: 192 ms per loop


Related Topics



Leave a reply



Submit