Is It Better to Use std::memcpy() or std::copy() in Terms of Performance

Efficiency of std::copy vs memcpy

A reasonably decent implementation will have std::copy compile to a call to memmove in the situations where this is possible (i.e. the element type is a POD).

If your implementation doesn't have contiguous storage (the C++03 standard requires it), memmove might be faster than std::copy, but probably not by much. I would start worrying only once you have measurements showing that it is actually an issue.
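To make the comparison concrete, here is a minimal sketch, assuming a std::vector<int> source (contiguous storage, trivially copyable elements). The helper names copy_with_std_copy and copy_with_memcpy are made up for illustration. Both produce the same result; whether std::copy is actually lowered to memmove depends on your standard library and optimization level, so inspect the generated assembly if it matters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper: element-wise copy through the standard algorithm.
void copy_with_std_copy(const std::vector<int>& src, std::vector<int>& dst) {
    assert(dst.size() >= src.size());  // destination must already be large enough
    std::copy(src.begin(), src.end(), dst.begin());
}

// Hypothetical helper: raw byte copy, only valid for trivially copyable types.
void copy_with_memcpy(const std::vector<int>& src, std::vector<int>& dst) {
    static_assert(std::is_trivially_copyable<int>::value,
                  "memcpy requires a trivially copyable element type");
    assert(dst.size() >= src.size());
    if (!src.empty()) {
        // memcpy counts bytes, so scale the element count by sizeof(int).
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(int));
    }
}
```

Unlike the memcpy version, the std::copy version keeps working unchanged if the element type later stops being trivially copyable.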

Why is `std::copy` 5x (!) slower than `memcpy` for reading one int from a char buffer, in my test program?

Those are not the results I get:

> g++ -O3 XX.cpp 
> ./a.out
cast:      5 ms
memcpy:    4 ms
std::copy: 3 ms
(counter:  1264720400)

Hardware: 2GHz Intel Core i7
Memory:   8G 1333 MHz DDR3
OS:       Mac OS X 10.7.5
Compiler: i686-apple-darwin11-llvm-g++-4.2 (GCC) 4.2.1

On a Linux box I get different results:

> g++ -std=c++0x -O3 XX.cpp 
> ./a.out 
cast:      3 ms
memcpy:    4 ms
std::copy: 21 ms
(counter:  731359744)


Hardware:  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Memory:    61363780 kB
OS:        Linux ip-10-58-154-83 3.2.0-29-virtual #46-Ubuntu SMP
Compiler:  g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

memcpy vs for loop - What's the proper way to copy an array from a pointer?

memcpy will probably be faster, but you are more likely to make a mistake using it.
It may depend on how smart your optimizing compiler is.

Your code is incorrect though. It should be:

memcpy(myGlobalArray, nums, 10 * sizeof(int) );
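The size argument is the usual place to slip: memcpy counts bytes, while std::copy counts elements. A small sketch of both forms, assuming a ten-element destination as in the line above; myGlobalArray and nums follow the names used there, while kCount and the two wrapper functions are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

const std::size_t kCount = 10;  // assumed array length
int myGlobalArray[kCount];

// memcpy takes a size in BYTES, so the element count must be scaled.
void store_with_memcpy(const int* nums) {
    assert(nums != nullptr);
    std::memcpy(myGlobalArray, nums, kCount * sizeof(int));
}

// std::copy takes an element range, so no sizeof arithmetic is needed.
void store_with_std_copy(const int* nums) {
    assert(nums != nullptr);
    std::copy(nums, nums + kCount, myGlobalArray);
}
```

Forgetting the `* sizeof(int)` in the memcpy version silently copies only part of the array, which is the kind of mistake the std::copy form rules out.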

In what cases should I use memcpy over standard operators in C++?

Efficiency should not be your concern.

Write clean maintainable code.

It bothers me that so many answers indicate that memcpy() is inefficient. It is designed to be the most efficient way of copying blocks of memory (for C programs).

So I wrote the following as a test:

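#include <algorithm>   // std::copy
#include <cstring>     // std::memcpy

extern float a[3];
extern float b[3];
extern void base();    // defined in another file so the copy is not optimized away

int main()
{
    base();

#if defined(M1)
    // Copy by hand, one element at a time.
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[2];
#elif defined(M2)
    std::memcpy(a, b, 3*sizeof(float));
#elif defined(M3)
    // Note: this copies a into b, the reverse of M1/M2; the amount of work is the same.
    std::copy(&a[0], &a[3], &b[0]);
#endif

    base();
}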

Then, to compare the generated code:

g++ -O3 -S xr.cpp -o s0.s
g++ -O3 -S xr.cpp -o s1.s -DM1
g++ -O3 -S xr.cpp -o s2.s -DM2
g++ -O3 -S xr.cpp -o s3.s -DM3

echo "=======" >  D
diff s0.s s1.s >> D
echo "=======" >> D
diff s0.s s2.s >> D
echo "=======" >> D
diff s0.s s3.s >> D

This resulted in (comments added by hand):

=======   // Copy by hand
10a11,18
>   movq    _a@GOTPCREL(%rip), %rcx
>   movq    _b@GOTPCREL(%rip), %rdx
>   movl    (%rdx), %eax
>   movl    %eax, (%rcx)
>   movl    4(%rdx), %eax
>   movl    %eax, 4(%rcx)
>   movl    8(%rdx), %eax
>   movl    %eax, 8(%rcx)

=======    // memcpy()
10a11,16
>   movq    _a@GOTPCREL(%rip), %rcx
>   movq    _b@GOTPCREL(%rip), %rdx
>   movq    (%rdx), %rax
>   movq    %rax, (%rcx)
>   movl    8(%rdx), %eax
>   movl    %eax, 8(%rcx)

=======    // std::copy()
10a11,14
>   movq    _a@GOTPCREL(%rip), %rsi
>   movl    $12, %edx
>   movq    _b@GOTPCREL(%rip), %rdi
>   call    _memmove

Added timing results for running the above inside a loop of 1,000,000,000 iterations.

   g++ -c -O3 -DM1 X.cpp
   g++ -O3 X.o base.o -o m1
   g++ -c -O3 -DM2 X.cpp
   g++ -O3 X.o base.o -o m2
   g++ -c -O3 -DM3 X.cpp
   g++ -O3 X.o base.o -o m3
   time ./m1

   real 0m2.486s
   user 0m2.478s
   sys  0m0.005s
   time ./m2

   real 0m1.859s
   user 0m1.853s
   sys  0m0.004s
   time ./m3

   real 0m1.858s
   user 0m1.851s
   sys  0m0.006s

Performance of memmove compared to memcpy twice?

From http://en.cppreference.com/w/cpp/string/byte/memmove

Despite being specified "as if" a temporary buffer is used, actual implementations of this function do not incur the overhead of double copying or extra memory. For small count, it may load up and write out registers; for larger blocks, a common approach (glibc and bsd libc) is to copy bytes forwards from the beginning of the buffer if the destination starts before the source, and backwards from the end otherwise, with a fall back to std::memcpy when there is no overlap at all.

Therefore the overhead in all likelihood is a couple of conditional branches. Hardly worth worrying about for large blocks.
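For context on why memmove exists at all, here is a minimal sketch of an overlapping copy, where memcpy would be undefined behaviour and memmove is required. The helper shift_right_by_one is a hypothetical name used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: shift the first `count` chars of `buf` right by one.
// Source [buf, buf + count) and destination [buf + 1, buf + 1 + count)
// overlap, so memmove (not memcpy) must be used. The caller must make sure
// the buffer holds at least count + 1 chars.
void shift_right_by_one(char* buf, std::size_t count) {
    assert(buf != nullptr);
    std::memmove(buf + 1, buf, count);
}
```

For example, applying it with count 6 to an 8-byte buffer holding "abcdef" leaves "aabcdef": the first character is duplicated and the rest is moved up intact.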

However, it is worth remembering that std::memcpy is a 'magic' function: before C++20 introduced std::bit_cast, it was the only legal way to reinterpret the bytes of an object as a dissimilar type.

In C++, this is illegal (undefined behaviour):

union {
  float a;
  int b;
} u;

u.a = 10.0;
int x = u.b;

This is legal:

float a = 10.0;
int b;
std::memcpy(std::addressof(b), std::addressof(a), sizeof(b));

and does what you'd expect the union to do if you were a C programmer.
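The same idea as a small reusable sketch: float_bits is a hypothetical helper name, and the code assumes a 32-bit float (checked at compile time). The concrete bit pattern you get back additionally assumes IEEE 754 single precision, which is typical but not mandated by the standard.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: return the object representation of a float as a
// 32-bit unsigned integer without violating the aliasing rules.
std::uint32_t float_bits(float value) {
    static_assert(sizeof(std::uint32_t) == sizeof(float),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));  // copy the bytes, not the value
    return bits;
}
```

On an IEEE 754 platform, float_bits(10.0f) yields 0x41200000. Compilers typically turn a fixed-size memcpy like this into a single register move, so there is usually no cost compared with the union trick.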

Is there still a performance advantage to redefining standard functions like memcpy?

Functions like memcpy belong to the standard library and are almost certainly implemented in assembler, not in C.

If you redefine them, your version will most likely be slower. If you want to optimize a copy, either call memcpy rather than memmove when the regions cannot overlap, or declare the pointers as restrict to tell the compiler that they do not overlap, so that it can treat the copy as a plain memcpy.

The engineers who wrote the standard C library for a given architecture surely used the existing assembler routines to move memory as fast as possible.

EDIT:

Taking into account the remarks from some comments: any code generation that preserves the semantics of copying (including replacing memcpy with mov instructions or other code) is allowed.

For copying algorithms (including the one that newlib uses), you can check this article. A quote from it:

Special situations: If you know all about the data you're copying as
well as the environment in which memcpy runs, you may be able to
create a specialized version that runs very fast.
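As one illustration of the quoted idea, here is a sketch of a copy specialized for a size known at compile time. copy_fixed is a hypothetical name, not something taken from the article or from newlib. With a constant size, optimizing compilers commonly expand the std::memcpy call inline (as in the assembly diff earlier), but that is something to verify with your own compiler rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper: copy between two arrays whose length N is part of the
// type, so the byte count passed to memcpy is a compile-time constant.
template <typename T, std::size_t N>
void copy_fixed(T (&dst)[N], const T (&src)[N]) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy requires a trivially copyable element type");
    std::memcpy(dst, src, N * sizeof(T));
}
```

Because both parameters are array references of the same length, passing mismatched arrays is a compile error, which removes the size-mismatch mistakes that a raw memcpy call allows.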