Why Are Memcpy() and Memmove() Faster Than Pointer Increments

Why are memcpy() and memmove() faster than pointer increments?

Because memcpy copies whole words at a time instead of single bytes. In addition, memcpy implementations are often written with SIMD instructions, which make it possible to move 128 bits at a time.

SIMD instructions are assembly instructions that can perform the same operation on each element in a vector up to 16 bytes long. That includes load and store instructions.
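
As a rough illustration, here is what a word-wide copy loop looks like. This is a simplified sketch, not a real libc implementation: it assumes the size is a multiple of the word size and that both pointers are suitably aligned, whereas real implementations also handle the unaligned head and tail and use SIMD loads/stores for the bulk of the copy.

#include <stddef.h>

/* Simplified sketch of a word-wide copy, not a real libc implementation.
 * Assumes n is a multiple of sizeof(size_t) and both pointers are
 * suitably aligned. */
static void copy_words(void *dst, const void *src, size_t n)
{
    size_t *d = dst;
    const size_t *s = src;
    for (size_t i = 0; i < n / sizeof(size_t); i++)
        d[i] = s[i];   /* one word-sized move per iteration instead of one byte */
}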

What are real significant cases when memcpy() is faster than memmove()?

At best, calling memcpy rather than memmove will save a pointer comparison and a conditional branch. For a large copy, this is completely insignificant. If you are doing many small copies, then it might be worth measuring the difference; that is the only way you can tell whether it's significant or not.

It is definitely a microoptimisation, but that doesn't mean you shouldn't use memcpy when you can easily prove that it is safe. Premature pessimisation is the root of much evil.
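
For reference, the extra work memmove does is essentially an overlap check that chooses the copy direction. A minimal sketch follows; real implementations are far more optimized, copy in larger units, and compare addresses more carefully than the naive pointer comparison used here.

#include <stddef.h>

/* Illustrative sketch of what memmove does beyond memcpy: pick a copy
 * direction that is safe when the source and destination overlap. */
static void *memmove_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (d < s) {
        for (size_t i = 0; i < n; i++)       /* forward copy is safe */
            d[i] = s[i];
    } else {
        for (size_t i = n; i > 0; i--)       /* copy backwards to avoid clobbering src */
            d[i - 1] = s[i - 1];
    }
    return dst;
}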

memcpy vs for loop - What's the proper way to copy an array from a pointer?

Memcpy will probably be faster, but it's more likely you will make a mistake using it.
It may depend on how smart your optimizing compiler is.

Your code is incorrect though. It should be:

memcpy(myGlobalArray, nums, 10 * sizeof(int));
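
For comparison, here is the loop form, as a minimal sketch; the array names follow the question's code and the wrapper function is just for illustration.

#define N 10
int myGlobalArray[N];   /* name follows the question's code (assumed) */

void copy_with_loop(const int *nums)
{
    /* A modern optimizing compiler will frequently turn this loop into
     * the same code it emits for memcpy(myGlobalArray, nums, N * sizeof(int)). */
    for (int i = 0; i < N; i++)
        myGlobalArray[i] = nums[i];
}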

Can memcpy or memmove return a different pointer than dest?

memmove will never return anything other than dest.

Returning dest, as opposed to making memmove void, is useful when the first argument is a computed expression, because it lets you avoid computing the same value up front and storing it in a variable. This lets you write in a single line

void *dest = memmove(&buf[offset] + copiedSoFar, src + offset, sizeof(buf)-offset-copiedSoFar);

what you would otherwise need to do on two lines:

void *dest = &buf[offset] + copiedSoFar;
memmove(dest, src + offset, sizeof(buf)-offset-copiedSoFar);

Why memcpy/memmove reverse data when copying int to bytes buffer?

It's because the processor architecture you use is little endian. Multibyte numbers (anything bigger than a uint8_t) are stored with the least significant byte at the lowest address.
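
A small demonstration of what that looks like in memory (assuming a little-endian machine):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t value = 0x11223344;
    uint8_t buffer[4];
    memcpy(buffer, &value, sizeof(value));
    /* On a little-endian machine this prints 44 33 22 11:
     * the least significant byte ends up at buffer[0]. */
    for (int i = 0; i < 4; i++)
        printf("%02x ", buffer[i]);
    printf("\n");
    return 0;
}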

Edit

What you do about it really depends on what the buffer is for. If you are only going to use the buffer internally, forget about byte swapping: you would have to do it in both directions, and it's a waste of time.

If it is for some external entity, e.g. a file or a network protocol, the specification of the file format or network protocol will say what the endianness is. For example, network byte order for all the Internet protocols is effectively big endian. The networking library provides a family of functions to convert values for use in sending and receiving Internet protocol messages. See, for instance

https://linux.die.net/man/3/htonl
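
For example, a conversion into network byte order using htonl might look like this; the function name is just an illustration:

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

void writeUInt32NetworkOrder(uint32_t number, uint8_t *buffer)
{
    uint32_t be = htonl(number);        /* host order -> network (big-endian) order */
    memcpy(buffer, &be, sizeof(be));    /* byte order is now the same on every host */
}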

If you want to roll your own, the portable way is to use bit shifts e.g.

void writeUInt32ToBufferBigEndian(uint32_t number, uint8_t* buffer)
{
    buffer[0] = (uint8_t) ((number >> 24) & 0xff);
    buffer[1] = (uint8_t) ((number >> 16) & 0xff);
    buffer[2] = (uint8_t) ((number >> 8) & 0xff);
    buffer[3] = (uint8_t) ((number >> 0) & 0xff);
}
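
A matching reader can be written the same way; the function name here is just an illustration:

#include <stdint.h>

uint32_t readUInt32FromBufferBigEndian(const uint8_t *buffer)
{
    return ((uint32_t) buffer[0] << 24) |
           ((uint32_t) buffer[1] << 16) |
           ((uint32_t) buffer[2] << 8)  |
           ((uint32_t) buffer[3] << 0);
}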

Why is memcpy slower than copying data at byte granularity?

It is likely slower to use memcpy because the size of the chunk you're copying isn't known at compile time. Usually GCC will optimize calls to memcpy with known sizes into an appropriate type of copy. That could use any specific size and algorithm that the compiler thinks to be optimal, and it can and will adjust it based on any compiler optimization flags you specify.
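
For instance, with a size known at compile time, a call like the following is typically inlined rather than dispatched to the library (a minimal sketch):

#include <stdint.h>
#include <string.h>

/* With a constant size, the call below is usually turned into a single
 * unaligned 8-byte load and store instead of a call into libc. */
uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));   /* constant size: the compiler can specialize it */
    return v;
}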

For example, when copying object IDs, Git knows that they must be either 20 bytes (SHA-1) or 32 bytes (SHA-256) in size, and calls memcpy with one of those two constant sizes so the compiler can specialize the copy.

If I slightly modify your code (formatted as well), option 3 is the fastest on my system, since it uses a 128-byte chunk with memcpy. Both 2 and 4 perform similarly.

Note that in this case, I've made the exit code depend on the value of the data copied because otherwise the compiler will optimize out my memcpy calls since it determines that they're useless. I've also added an additional option which involves copying the full buffer, which is slower. The other options copy the same portion of the buffer over and over again which means that the performance is much better, since much of the data stays in the CPU's cache.

In general, unless you really know you need substantial memcpy performance, you should just use memcpy itself. If your data has a known size, do indeed use that, since that can help performance. However, in most cases, memcpy isn't the bottleneck in your code, and therefore you shouldn't optimize it until you've measured your code and determined that it's the performance problem.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define GB (1UL << 30)
#define MB (1UL << 20)
#define BUF (1 * GB)
#define TIMES (2)
#define CACHELINE 64

int main(int argc, char **argv) {
    assert(argc > 1);
    int flag = atoi(argv[1]);
    int memcpy_sz = 0;
    if (argc > 2)
        memcpy_sz = atoi(argv[2]);
    char *a = (char *)aligned_alloc(64, BUF);
    char *b = (char *)aligned_alloc(64, BUF);
    memset(a, 1, BUF);
    memset(b, 20, BUF);
    unsigned long i = 0, j;
    struct timespec before, after;
    clock_gettime(CLOCK_MONOTONIC, &before);
    if (flag == 1) { // memcpy with a chunk size taken from the command line (must be > 0)
        for (j = 0; j < TIMES; j++) {
            for (i = 0; i < BUF; i += memcpy_sz) {
                memcpy(a + i, b + i, memcpy_sz);
            }
        }
    } else if (flag == 2) { // word-at-a-time copy
        for (j = 0; j < TIMES; j++) {
            size_t *ap = (size_t *)a;
            size_t *bp = (size_t *)b;
            for (i = 0; i < BUF / sizeof(size_t); i++) {
                ap[i] = bp[i];
            }
        }
    } else if (flag == 3) { // 128-byte memcpy chunks
        for (j = 0; j < TIMES; j++) {
            for (i = 0; i < BUF / 128; i++) {
                // offsets only reach BUF / 128, so a small region is copied
                // over and over and stays in the CPU's cache
                memcpy(a + (i % BUF), b + (i % BUF), 128);
            }
        }
    } else if (flag == 4) { // 64-byte memcpy chunks (same small region as option 3)
        for (j = 0; j < TIMES; j++) {
            for (i = 0; i < BUF / 64; i++) {
                memcpy(a + (i % BUF), b + (i % BUF), 64);
            }
        }
    } else if (flag == 5) { // one memcpy of the full buffer
        for (j = 0; j < TIMES; j++) {
            memcpy(a, b, BUF);
        }
    } else { // unrolled word-at-a-time copy, one cache line per iteration
        for (j = 0; j < TIMES; j++) {
            size_t *ap = (size_t *)a;
            size_t *bp = (size_t *)b;
            size_t xlen = BUF / CACHELINE; // reset for every pass
            while (xlen > 0) {
                ap[0] = bp[0];
                ap[1] = bp[1];
                ap[2] = bp[2];
                ap[3] = bp[3];
                ap[4] = bp[4];
                ap[5] = bp[5];
                ap[6] = bp[6];
                ap[7] = bp[7];
                ap += 8;
                bp += 8;
                xlen -= 1;
            }
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &after);
    double elapse =
        (after.tv_sec - before.tv_sec) + (after.tv_nsec - before.tv_nsec) / 1e9;
    printf("time = %f s , bw = %.2f GB/s\n", elapse, (1.0) / (elapse / TIMES));
    return a[10] == 20; // exit code depends on the copied data so the copies aren't optimized away
}
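
To try the comparison, build with optimizations enabled (for example with gcc -O2) and pass the option number as the first argument, plus a chunk size for option 1, e.g. ./a.out 1 64 or ./a.out 3. The exact figures will of course vary by machine, compiler, and optimization flags.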

Why does copying a specific buffer size with memcpy and sprintf, prints more chars in new buffer than there are in the original buffer?

The code is attempting to print a character array as if it were a string, leading to undefined behavior.
smallbuf[] is not guaranteed to contain a null character, so it is not a string.

"%s" expects a matching pointer to a string.

Either account for a null character

char smallbuf[8+1];
memcpy(smallbuf, input, 8);
smallbuf[8] = '\0';
printf("%s", smallbuf);

or limit the output with a precision, which prints the character array up to N characters or until a null character is reached:

char smallbuf[8];
memcpy(smallbuf, input, 8);
printf("%.8s", smallbuf);

A similar issue applies to printf(input);.

Beyond the missing null character, do not code printf(input); at all, as that may lead to undefined behavior when input[] contains a %. Pass the data as an argument to a fixed format string instead:

// printf(input);
printf("%s", input);

Better code would examine the return value of read(0, input, 64).
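
A minimal sketch of that check; the buffer name and size follow the question's code and are assumptions here:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char input[64];
    /* Only print the bytes that read() actually returned; n can be
     * less than 64, 0 on end of file, or -1 on error. */
    ssize_t n = read(0, input, sizeof(input));
    if (n > 0)
        printf("%.*s", (int) n, input);
    return 0;
}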


