How True Is "Want Speed? Pass by Value"

The idea of "Want speed? Pass by value"⁽¹⁾ is that sometimes the copy can be elided. Taking your classes X and Y, consider this use case:

#include <string>
#include <utility>

// Simulating a complex operation returning a temporary:
std::string foo() { return "a" + std::string("b"); }


struct X
{
  std::string mem_name;
  X(std::string name): mem_name(std::move(name)) {}
};

struct Y
{
  std::string mem_name;
  Y(const std::string &name): mem_name(name) {}
};


int main()
{
  X x(foo());  // named objects: a bare "X(foo());" would parse as a redeclaration of foo
  Y y(foo());
}

Now let's analyse both the construction cases.

X first. foo() returns a temporary, which is used to initialise the object name. That object is then moved into mem_name. Notice that the compiler can apply Return Value Optimisation and construct the return value of foo() (actually even the return value of operator+) directly in the space of name. So no copying actually happens, only a move.

Now let's analyse Y. foo() returns a temporary again, which is bound to the reference name. Now there's no "externally supplied" space for the return value, so it has to be constructed in its own space and bound to the reference. It is then copied into mem_name. So we are doing a copy, no way around it.

In short, the outcome is:

If an lvalue is being passed in, both X and Y will perform a copy (X when initialising name, Y when initialising mem_name). In addition, X will perform a move (when initialising mem_name).
If an rvalue is being passed in, X will potentially only perform a move, while Y has to perform a copy.

Generally, a move is expected to be an operation whose time requirements are comparable to those of passing a pointer (which is what passing by reference does). So in effect, X is no worse than Y for lvalues, and better for rvalues.
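
The tally above can be checked with a small instrumented sketch. Tracked is a hypothetical stand-in for std::string that counts its copy and move constructions, make() plays the role of foo(), and the helper names are mine. It assumes the compiler elides the copy of the returned temporary, which C++17 guarantees and earlier compilers do in practice.

```cpp
#include <utility>

// Hypothetical stand-in for std::string that counts copy and move constructions.
struct Tracked {
    static int copies;
    static int moves;
    Tracked() = default;
    Tracked(const Tracked&) { ++copies; }
    Tracked(Tracked&&) noexcept { ++moves; }
};
int Tracked::copies = 0;
int Tracked::moves = 0;

Tracked make() { return Tracked{}; }  // plays the role of foo()

struct X {  // pass by value, then move into the member
    Tracked mem;
    X(Tracked t) : mem(std::move(t)) {}
};

struct Y {  // pass by reference-to-const, then copy into the member
    Tracked mem;
    Y(const Tracked& t) : mem(t) {}
};

struct Counts { int copies; int moves; };

// Constructs a T from a temporary and reports what that cost.
template <class T>
Counts from_rvalue() {
    Tracked::copies = Tracked::moves = 0;
    T obj(make());
    (void)obj;
    return {Tracked::copies, Tracked::moves};
}

// Constructs a T from a named object and reports what that cost.
template <class T>
Counts from_lvalue() {
    Tracked src;
    Tracked::copies = Tracked::moves = 0;
    T obj(src);
    (void)obj;
    return {Tracked::copies, Tracked::moves};
}
```

With elision in effect, from_rvalue<X>() should report no copies and one move, from_rvalue<Y>() one copy, from_lvalue<X>() one copy plus one move, and from_lvalue<Y>() one copy.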

Of course, it's not an absolute rule, and must be taken with a grain of salt. When in doubt, profile.

(1) The link is prone to being temporarily unavailable, and as of 11-12-2014, it seems broken (404). A copy of the contents (albeit with weird formatting) seems available at several blog sites:

blog on csdn.net
blog on blogspot.cz

Alternatively, the original content might be accessible through the wayback machine.

Also note that the topic has in general stirred up quite a discussion. Googling the paper title brings up a lot of follow-ups and counter-points. One example is "Want speed? Don't (always) pass by value" by SO member juanchopanza.

Performance cost of passing by value vs. by reference or by pointer?

It depends on what you mean by "cost", and properties of the host system (hardware, operating system) with respect to operations.

If your cost measure is memory usage, then the calculation of cost is obvious - add up the sizes of whatever is being copied.

If your measure is execution speed (or "efficiency") then the game is different. Hardware (and, with it, operating systems and compilers) tends to be optimised for copying things of particular sizes, by virtue of dedicated circuits (machine registers, and how they are used).

It is common, for example, for a machine to have an architecture (machine registers, memory architecture, etc) which results in a "sweet spot": copying variables of some size is most "efficient", but copying larger OR SMALLER variables is less so. Larger variables will cost more to copy, because there may be a need to do multiple copies of smaller chunks. Smaller ones may also cost more, because the compiler needs to copy the smaller value into a larger variable (or register), do the operations on it, then copy the value back.

Examples with floating point include some Cray supercomputers, which natively support double precision floating point (aka double in C++), while all operations on single precision (aka float in C++) are emulated in software. Some older 32-bit x86 CPUs also worked internally with 32-bit integers, and operations on 16-bit integers required more clock cycles due to translation to/from 32-bit (this is not true with more modern 32-bit or 64-bit x86 processors, as they allow copying 16-bit integers to/from 32-bit registers, and operating on them, with fewer such penalties).

It is a bit of a no-brainer that copying a very large structure by value will be less efficient than creating and copying its address. But, because of factors like the above, the cross-over point between "best to copy something of that size by value" and "best to pass its address" is less clear.

Pointers and references tend to be implemented in a similar manner (e.g. pass by reference can be implemented in the same way as passing a pointer) but that is not guaranteed.

The only way to be sure is to measure it. And realise that the measurements will vary between systems.

Which is faster? Pass by reference vs pass by value C++

The answer to "Which is faster" is usually "It depends".

If instead of passing four bytes of data you are passing an eight byte pointer to data, then you can't really expect that to make things faster. If instead of passing 100 bytes of data you are passing an eight byte pointer to data, that's different.

But now the function doesn't have the data, it only has a reference. So whenever it needs to read the data, it has to do that indirectly through the reference. That takes longer. If you pass a 100 byte object and only read eight bytes of it, you still are likely to win. But if you actually read all the data, and maybe multiple times, then it could easily be faster to pass the value even for large objects.

The real difference comes when you pass an object, and passing by value means a more or less complex constructor will be called. Passing by reference means no constructor. But int has no constructor anyway.

And then there is optimisation. Passing by value means the compiler knows your function is the only one with access to the data. Passing by reference means the data could be anywhere. If you have two int& parameters, I could pass the same int twice. So increasing row might increase pos. Or it might not. That kills optimisations.
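
To make the aliasing point concrete, here is a minimal sketch. The function names are made up for illustration; row and pos are the parameter names from the question.

```cpp
// Reference parameters may alias: the store to row can change pos,
// so the compiler has to re-read pos from memory after the increment.
int bump_by_ref(int& row, int& pos) {
    ++row;
    return pos;
}

// Value parameters are private copies: ++row cannot affect pos,
// so pos can stay in a register and the increment can be dropped entirely.
int bump_by_val(int row, int pos) {
    ++row;
    return pos;
}
```

If the same int n is passed for both reference parameters, bump_by_ref(n, n) returns the incremented value, whereas bump_by_val(n, n) returns the original one.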

And then there is the rule of optimisation: "Measure it". You measured it and found what's faster. Sometimes things are faster or slower for no good reason whatsoever.

Pass by value faster than pass by reference

A good way to find out why there are any differences is to check the disassembly. Here are the results I got on my machine with Visual Studio 2012.

With optimization flags, both functions generate the same code:

009D1270 57                   push        edi  
009D1271 FF 15 D4 30 9D 00    call        dword ptr ds:[9D30D4h]  
009D1277 8B F8                mov         edi,eax  
009D1279 FF 15 D4 30 9D 00    call        dword ptr ds:[9D30D4h]  
009D127F 8B 0D 48 30 9D 00    mov         ecx,dword ptr ds:[9D3048h]  
009D1285 2B C7                sub         eax,edi  
009D1287 50                   push        eax  
009D1288 E8 A3 04 00 00       call        std::operator<<<std::char_traits<char> > (09D1730h)  
009D128D 8B C8                mov         ecx,eax  
009D128F FF 15 2C 30 9D 00    call        dword ptr ds:[9D302Ch]  
009D1295 33 C0                xor         eax,eax  
009D1297 5F                   pop         edi  
009D1298 C3                   ret

This is basically equivalent to:

int main ()
{
    clock_t start, stop ;
    start = clock () ;
    stop = clock () ;
    cout << "time: " << stop - start ;
    return 0 ;
}

Without optimization flags, you will probably get different results.

function (no optimizations):

00114890 55                   push        ebp  
00114891 8B EC                mov         ebp,esp  
00114893 81 EC C0 00 00 00    sub         esp,0C0h  
00114899 53                   push        ebx  
0011489A 56                   push        esi  
0011489B 57                   push        edi  
0011489C 8D BD 40 FF FF FF    lea         edi,[ebp-0C0h]  
001148A2 B9 30 00 00 00       mov         ecx,30h  
001148A7 B8 CC CC CC CC       mov         eax,0CCCCCCCCh  
001148AC F3 AB                rep stos    dword ptr es:[edi]  
001148AE 8B 45 08             mov         eax,dword ptr [ptr]  
001148B1 8B 08                mov         ecx,dword ptr [eax]  
001148B3 6B C9 05             imul        ecx,ecx,5  
001148B6 8B 55 08             mov         edx,dword ptr [ptr]  
001148B9 89 0A                mov         dword ptr [edx],ecx  
001148BB 5F                   pop         edi  
001148BC 5E                   pop         esi  
001148BD 5B                   pop         ebx  
001148BE 8B E5                mov         esp,ebp  
001148C0 5D                   pop         ebp  
001148C1 C3                   ret

function2 (no optimizations)

00FF4850 55                   push        ebp  
00FF4851 8B EC                mov         ebp,esp  
00FF4853 81 EC C0 00 00 00    sub         esp,0C0h  
00FF4859 53                   push        ebx  
00FF485A 56                   push        esi  
00FF485B 57                   push        edi  
00FF485C 8D BD 40 FF FF FF    lea         edi,[ebp-0C0h]  
00FF4862 B9 30 00 00 00       mov         ecx,30h  
00FF4867 B8 CC CC CC CC       mov         eax,0CCCCCCCCh  
00FF486C F3 AB                rep stos    dword ptr es:[edi]  
00FF486E 8B 45 08             mov         eax,dword ptr [val]  
00FF4871 6B C0 05             imul        eax,eax,5  
00FF4874 89 45 08             mov         dword ptr [val],eax  
00FF4877 5F                   pop         edi  
00FF4878 5E                   pop         esi  
00FF4879 5B                   pop         ebx  
00FF487A 8B E5                mov         esp,ebp  
00FF487C 5D                   pop         ebp  
00FF487D C3                   ret

Why is pass by value faster (in the no optimization case)?

Well, function() has two extra mov operations. Let's take a look at the first extra mov operation:

001148AE 8B 45 08             mov         eax,dword ptr [ptr]  
001148B1 8B 08                mov         ecx,dword ptr [eax]  
001148B3 6B C9 05             imul        ecx,ecx,5

Here we are dereferencing the pointer. We first load the pointer itself (the address it holds) into register eax. Then we load the value stored at that address into register ecx. Finally, we multiply the value by five. In function2 (), we already have the value, so we avoid the extra load.

Let's look at the second extra mov operation:

001148B3 6B C9 05             imul        ecx,ecx,5  
001148B6 8B 55 08             mov         edx,dword ptr [ptr]  
001148B9 89 0A                mov         dword ptr [edx],ecx

Now we are going the other way. We have just finished multiplying the value by 5, and we need to store the result back at the pointed-to address, which means reloading the pointer into edx first.

Because function2 () does not have to deal with referencing and dereferencing a pointer, it gets to skip these two extra mov operations.
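
The question's source is not reproduced here, so the following is only my reconstruction from the listings. The parameter names ptr and val appear in the disassembly; the exact function bodies and signatures are an assumption.

```cpp
// Reconstructed from the unoptimized listings; bodies are an assumption.

// Loads the pointer, loads the value, multiplies, reloads the pointer, stores.
void function(int* ptr) { *ptr = *ptr * 5; }

// Loads the value, multiplies, stores to the function's own local copy.
void function2(int val) { val = val * 5; }
```

Note that function2 only changes its private copy, so the caller never sees the result. That is also why the optimizer is free to remove the call altogether.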

Is it better in C++ to pass by value or pass by reference-to-const?

It used to be generally recommended best practice¹ to use pass by const ref for all types, except for builtin types (char, int, double, etc.), for iterators and for function objects (lambdas, classes deriving from std::*_function).

This was especially true before the existence of move semantics. The reason is simple: if you passed by value, a copy of the object had to be made and, except for very small objects, this is always more expensive than passing a reference.

With C++11, we have gained move semantics. In a nutshell, move semantics permit that, in some cases, an object can be passed “by value” without copying it. In particular, this is the case when the object that you are passing is an rvalue.

In itself, moving an object is still at least as expensive as passing by reference. However, in many cases a function will internally copy an object anyway — i.e. it will take ownership of the argument.²

In these situations we have the following (simplified) trade-off:

We can pass the object by reference, then copy internally.
We can pass the object by value.

“Pass by value” still causes the object to be copied, unless the object is an rvalue. In the case of an rvalue, the object can be moved instead, so that the second case is suddenly no longer “copy, then move” but “move, then (potentially) move again”.

For large objects that implement proper move constructors (such as vectors, strings …), the second case is then vastly more efficient than the first. Therefore, it is recommended to use pass by value if the function takes ownership of the argument, and if the object type supports efficient moving.

A historical note:

In fact, any modern compiler should be able to figure out when passing by value is expensive, and implicitly convert the call to use a const ref if possible.

In theory. In practice, compilers can’t always change this without breaking the function’s binary interface. In some special cases (when the function is inlined) the copy will actually be elided if the compiler can figure out that the original object won’t be changed through the actions in the function.

But in general the compiler can’t determine this, and the advent of move semantics in C++ has made this optimisation much less relevant.

¹ E.g. in Scott Meyers, Effective C++.

² This is especially often true for object constructors, which may take arguments and store them internally to be part of the constructed object’s state.

Advantages of pass-by-value and std::move over pass-by-reference

Did I understand correctly what is happening here?

Yes.

Is there any upside of using std::move over passing by reference and just calling m_name{name}?

An easy to grasp function signature without any additional overloads. The signature immediately reveals that the argument will be copied - this saves callers from wondering whether a const std::string& reference might be stored as a data member, possibly becoming a dangling reference later on. And there is no need to overload on std::string&& name and const std::string& arguments to avoid unnecessary copies when rvalues are passed to the function. Passing an lvalue

std::string nameString("Alex");
Creature c(nameString);

to the function that takes its argument by value causes one copy and one move construction. Passing an rvalue to the same function

std::string nameString("Alex");
Creature c(std::move(nameString));

causes two move constructions. In contrast, when the function parameter is const std::string&, there will always be a copy, even when passing an rvalue argument. This is clearly an advantage as long as the argument type is cheap to move-construct (this is the case for std::string).

But there is a downside to consider: the reasoning doesn't work for functions that assign the function argument to another variable (instead of initializing it):

void setName(std::string name)
{
    m_name = std::move(name);
}

will cause a deallocation of the resource that m_name refers to before it's reassigned. I recommend reading Item 41 in Effective Modern C++ and also this question.
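
A rough way to observe this is to check whether m_name still uses the same character buffer after the call. The sketch below assumes a typical standard library in which copy assignment reuses the existing buffer when it is large enough (the standard does not promise this), and the setter and helper names are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical class; only the two setters matter here.
class Creature {
    std::string m_name;
public:
    explicit Creature(std::string name) : m_name(std::move(name)) {}

    // By value + move assignment: m_name drops its old buffer and adopts the argument's.
    void setNameByValue(std::string name) { m_name = std::move(name); }

    // By reference-to-const + copy assignment: m_name may reuse its existing buffer.
    void setNameByRef(const std::string& name) { m_name = name; }

    const std::string& name() const { return m_name; }
};

// Returns true if the setter left m_name's character buffer where it was.
template <class Setter>
bool keeps_buffer(Setter set) {
    Creature c(std::string(64, 'a'));  // long enough to defeat small-string optimisation
    const std::string next(32, 'b');   // fits into the existing capacity
    const auto before = reinterpret_cast<std::uintptr_t>(c.name().data());
    set(c, next);
    return reinterpret_cast<std::uintptr_t>(c.name().data()) == before;
}

bool by_ref_keeps_buffer() {
    return keeps_buffer([](Creature& c, const std::string& s) { c.setNameByRef(s); });
}

bool by_value_keeps_buffer() {
    return keeps_buffer([](Creature& c, const std::string& s) { c.setNameByValue(s); });
}
```

On the implementations I would expect this to run on, the by-reference setter keeps the buffer while the by-value setter pays for a fresh allocation plus a deallocation of the old one.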

Pass by reference more expensive than pass by value

Prefer passing primitive types (int, char, float, ...) and POD structs that are cheap to copy (Point, complex) by value.

This will be more efficient than the indirection required when passing by reference.

See Boost's Call Traits.

The template class call_traits<T> encapsulates the "best" method to pass a parameter of some type T to or from a function, and consists of a collection of typedefs defined as in the table below. The purpose of call_traits is to ensure that problems like "references to references" never occur, and that parameters are passed in the most efficient manner possible.
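
Boost itself is not needed to see the idea. The following is a simplified sketch using only the standard library; param and Holder are hypothetical names, and the selection rule (scalars by value, everything else by reference-to-const) is a deliberate simplification rather than what call_traits actually implements.

```cpp
#include <string>
#include <type_traits>

// Simplified, hypothetical stand-in for a "best parameter type" trait:
// scalars are passed by value, everything else by reference-to-const.
template <class T>
struct param {
    using type = typename std::conditional<
        std::is_scalar<T>::value, T, const T&>::type;
};

static_assert(std::is_same<param<int>::type, int>::value,
              "int is passed by value");
static_assert(std::is_same<param<std::string>::type, const std::string&>::value,
              "std::string is passed by reference-to-const");

// Generic code can then pick the parameter type per T.
template <class T>
struct Holder {
    T value;
    void set(typename param<T>::type v) { value = v; }
};
```

Holder<int>::set takes an int, while Holder<std::string>::set takes a const std::string&, without the generic code having to spell out either choice.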

When is overloading pass by reference (l-value and r-value) preferred to pass-by-value?

For types whose copy assignment operator can recycle resources, swapping with a copy is almost never the best way to implement the copy assignment operator. For example look at std::vector:

This class manages a dynamically sized buffer and maintains both a capacity (maximum length the buffer can hold), and a size (the current length). If the vector copy assignment operator is implemented with swap, then no matter what, a new buffer is always allocated if rhs.size() != 0.

However, if lhs.capacity() >= rhs.size(), no new buffer need be allocated at all. One can simply assign/construct the elements from rhs to lhs. When the element type is trivially copyable, this may boil down to nothing but memcpy. This can be much, much faster than allocating and deallocating a buffer.
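
The difference can be made visible by checking whether the lhs buffer survives the assignment. This is a sketch under the assumption that the standard library's vector copy assignment reuses sufficient capacity, as the mainstream implementations do; Plain, Swapped and the helper are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical wrapper whose copy assignment uses the copy/swap idiom.
struct Swapped {
    std::vector<int> v;
    explicit Swapped(unsigned n) : v(n) {}
    Swapped(const Swapped&) = default;
    Swapped& operator=(Swapped other) { v.swap(other.v); return *this; }
};

// Same data, but with the implicitly generated copy assignment.
struct Plain {
    std::vector<int> v;
    explicit Plain(unsigned n) : v(n) {}
};

// Returns true if assigning a smaller object left the lhs buffer in place.
template <class T>
bool assignment_reuses_buffer() {
    T lhs(1000);
    const T rhs(10);  // lhs.v.capacity() >= rhs.v.size()
    const auto before = reinterpret_cast<std::uintptr_t>(lhs.v.data());
    lhs = rhs;
    return reinterpret_cast<std::uintptr_t>(lhs.v.data()) == before;
}
```

Under that assumption, assignment_reuses_buffer<Plain>() is true and assignment_reuses_buffer<Swapped>() is false, because the by-value parameter always brings its own freshly allocated buffer.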

Same issue for std::string.

Same issue for MyType when MyType has data members that are std::vector and/or std::string.

There are only two cases in which you want to consider implementing copy assignment with swap:

1. You know that the swap method (including the obligatory copy construction when the rhs is an lvalue) will not be terribly inefficient.
2. You know that you will always need the copy assignment operator to have the strong exception safety guarantee.

If you're not sure about 2, in other words you think the copy assignment operator might sometimes need the strong exception safety guarantee, don't implement assignment in terms of swap. It is easy for your clients to achieve the same guarantee if you provide one of:

A noexcept swap.
A noexcept move assignment operator.

For example:

template <class T>
T&
strong_assign(T& x, T y)
{
    using std::swap;
    swap(x, y);
    return x;
}

or:

template <class T>
T&
strong_assign(T& x, T y)
{
    x = std::move(y);
    return x;
}

Now there will be some types where implementing copy assignment with swap will make sense. However these types will be the exception, not the rule.

On:

void push_back(const value_type& val);
void push_back(value_type&& val);

Imagine vector<big_legacy_type> where:

class big_legacy_type
{
 public:
      big_legacy_type(const big_legacy_type&);  // expensive
      // no move members ...
};

If we had only:

void push_back(value_type val);

Then push_backing an lvalue big_legacy_type into a vector would require 2 copies instead of 1, even when capacity was sufficient. That would be a disaster, performance-wise.
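
The 2-versus-1 count is easy to reproduce. In this sketch, Legacy is a hypothetical copy-counting stand-in for big_legacy_type, and the two sink classes are made-up wrappers that mimic the two possible push_back signatures.

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for big_legacy_type: copyable, no move members.
struct Legacy {
    static int copies;
    Legacy() = default;
    Legacy(const Legacy&) { ++copies; }  // declaring this suppresses the implicit move constructor
};
int Legacy::copies = 0;

// What a single by-value push_back would look like.
struct ByValueSink {
    std::vector<Legacy> items;
    void push_back(Legacy val) { items.push_back(std::move(val)); }  // std::move falls back to a copy
};

// The overload pair that std::vector actually provides.
struct OverloadedSink {
    std::vector<Legacy> items;
    void push_back(const Legacy& val) { items.push_back(val); }
    void push_back(Legacy&& val) { items.push_back(std::move(val)); }
};

// Counts the copies needed to push one lvalue into a sink with spare capacity.
template <class Sink>
int copies_for_lvalue() {
    Sink sink;
    sink.items.reserve(1);  // capacity is sufficient, so no reallocation copies
    Legacy x;
    Legacy::copies = 0;
    sink.push_back(x);
    return Legacy::copies;
}
```

copies_for_lvalue<ByValueSink>() yields 2 (one into the parameter, one into the vector), while copies_for_lvalue<OverloadedSink>() yields 1.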

Update

Here is a HelloWorld that you should be able to run on any C++11 conforming platform:

#include <vector>
#include <random>
#include <chrono>
#include <iostream>

class X
{
    std::vector<int> v_;
public:
    explicit X(unsigned s) : v_(s) {}

#if SLOW_DOWN
    X(const X&) = default;
    X(X&&) = default;
    X& operator=(X x)
    {
        v_.swap(x.v_);
        return *this;
    }
#endif
};

std::mt19937_64 eng;
std::uniform_int_distribution<unsigned> size(0, 1000);

std::chrono::high_resolution_clock::duration
test(X& x, const X& y)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    x = y;
    auto t1 = std::chrono::high_resolution_clock::now();
    return t1-t0;
}

int
main()
{
    const int N = 1000000;
    typedef std::chrono::duration<double, std::nano> nano;
    nano ns(0);
    for (int i = 0; i < N; ++i)
    {
        X x1(size(eng));
        X x2(size(eng));
        ns += test(x1, x2);
    }
    ns /= N;
    std::cout << ns.count() << "ns\n";
}

I've coded X's copy assignment operator two ways:

Implicitly, which is equivalent to calling vector's copy assignment operator.
With the copy/swap idiom, suggestively under the macro SLOW_DOWN. I thought about naming it SLEEP_FOR_AWHILE, but this way is actually much worse than sleep statements if you're on a battery powered device.

The test constructs randomly sized vector<int>s, with sizes between 0 and 1000, and assigns them a million times. It times each assignment, sums the times, and then finds the average time in floating point nanoseconds and prints that out. If two consecutive calls to your high resolution clock don't return something less than 100 nanoseconds, you may want to raise the length of the vectors.

Here are my results:

$ clang++ -std=c++11 -stdlib=libc++ -O3 test.cpp
$ a.out
428.348ns
$ a.out
438.5ns
$ a.out
431.465ns
$ clang++ -std=c++11 -stdlib=libc++ -O3 -DSLOW_DOWN test.cpp
$ a.out
617.045ns
$ a.out
616.964ns
$ a.out
618.808ns

I'm seeing a 43% performance hit for the copy/swap idiom with this simple test. YMMV.

The above test, on average, has sufficient capacity on the lhs half the time. If we take this to either extreme:

lhs has sufficient capacity all of the time.
lhs has sufficient capacity none of the time.

then the performance advantage of the default copy assignment over the copy/swap idiom varies from about 560% to 0%. The copy/swap idiom is never faster, and can be dramatically slower (for this test).

Want Speed? Measure.
