Why are memcpy() and memmove() faster than pointer increments?
Because memcpy implementations typically copy a machine word at a time instead of a byte at a time. They are also often written with SIMD instructions, which makes it possible to move 128 bits at a time.
SIMD instructions are assembly instructions that perform the same operation on every element of a vector, typically up to 16 bytes long (wider with newer extensions such as AVX). That includes load and store instructions.
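To make the word-versus-byte point concrete, here is a minimal sketch that contrasts a byte-at-a-time loop with a loop that moves eight bytes per iteration. The names copy_bytes and copy_words are made up for this illustration, and real library implementations are far more elaborate (alignment handling, SIMD, size-specific paths).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies one byte per iteration. */
static void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Hypothetical helper: copies eight bytes per iteration, then the tail.
 * The fixed-size memcpy calls are a portable way to express an unaligned
 * 64-bit load and store; optimizing compilers usually turn each one into
 * a single move instruction. */
static void copy_words(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (; i + sizeof(uint64_t) <= n; i += sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, src + i, sizeof w);
        memcpy(dst + i, &w, sizeof w);
    }
    for (; i < n; i++)          /* remaining 0..7 bytes */
        dst[i] = src[i];
}
```

Both helpers produce the same result for non-overlapping buffers; the second simply needs roughly one eighth as many loop iterations.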
What are real significant cases when memcpy() is faster than memmove()?
At best, calling memcpy rather than memmove will save a pointer comparison and a conditional branch. For a large copy, this is completely insignificant. If you are doing many small copies, then it might be worth measuring the difference; that is the only way you can tell whether it's significant or not.
It is definitely a microoptimisation, but that doesn't mean you shouldn't use memcpy when you can easily prove that it is safe. Premature pessimisation is the root of much evil.
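The extra comparison exists because memmove has to cope with overlapping source and destination. A small sketch of both situations, with helper names invented for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: moves the first len bytes of buf `by` bytes to the
 * right. Source and destination overlap whenever by < len, so memmove is
 * required; memcpy would be undefined behavior here. The caller must
 * guarantee that buf holds at least len + by bytes. */
static void shift_right(char *buf, size_t len, size_t by)
{
    assert(buf != NULL);
    memmove(buf + by, buf, len);
}

/* Hypothetical helper: the caller passes two distinct 8-byte arrays, which
 * cannot overlap, so memcpy is provably safe. */
static void copy_disjoint(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, 8);
}
```

For example, with char buf[8] = "abcdef", calling shift_right(buf, 4, 2) leaves "ababcd" in the buffer.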
memcpy vs for loop - What's the proper way to copy an array from a pointer?
memcpy will probably be faster, but you are also more likely to make a mistake using it, typically by getting the byte count wrong.
It may depend on how smart your optimizing compiler is.
Your code is incorrect though. It should be:
memcpy(myGlobalArray, nums, 10 * sizeof(int) );
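As a sketch of how the two approaches compare, assuming myGlobalArray is a real array of 10 ints (the helper names and the NUM_COUNT constant are invented for this example):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { NUM_COUNT = 10 };            /* assumed length, matching the call above */
static int myGlobalArray[NUM_COUNT];

/* Hypothetical helper: memcpy variant. sizeof myGlobalArray is the size of
 * the whole array in bytes; this only works because myGlobalArray is an
 * array here, not a pointer. */
static void set_array_memcpy(const int *nums)
{
    assert(nums != NULL);
    memcpy(myGlobalArray, nums, sizeof myGlobalArray);
}

/* Hypothetical helper: loop variant. It counts elements, not bytes, so
 * there is no sizeof to get wrong. */
static void set_array_loop(const int *nums)
{
    assert(nums != NULL);
    for (size_t i = 0; i < NUM_COUNT; i++)
        myGlobalArray[i] = nums[i];
}
```

The classic memcpy mistake is passing the element count (10) where a byte count (10 * sizeof(int)) is expected, which silently copies only part of the array.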
Can memcpy or memmove return a different pointer than dest?
memmove will never return anything other than dest.
Returning dest, as opposed to making memmove void, is useful when the first argument is a computed expression, because it lets you avoid computing the same value upfront and storing it in a variable. This lets you do in a single line
void *dest = memmove(&buf[offset] + copiedSoFar, src + offset, sizeof(buf)-offset-copiedSoFar);
what you would otherwise need to do on two lines:
void *dest = &buf[offset] + copiedSoFar;
memmove(dest, src + offset, sizeof(buf)-offset-copiedSoFar);
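A compact way to see the same idea in use is a helper that returns the result of memmove directly. The function name append_bytes is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes n bytes at buf + used and returns a pointer to
 * the freshly written region. It relies on memmove returning its first
 * argument, so the computed address never needs a named variable. */
static char *append_bytes(char *buf, size_t used, const char *src, size_t n)
{
    assert(buf != NULL && src != NULL);
    return memmove(buf + used, src, n);
}
```

Calling append_bytes(buf, 3, "def", 4) on a 16-byte buffer holding "abc" returns buf + 3 and leaves "abcdef" in the buffer.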
Why memcpy/memmove reverse data when copying int to bytes buffer?
It's because the processor architecture you use is little endian. Multibyte numbers (anything bigger than a uint8_t) are stored with the least significant byte at the lowest address.
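You can observe this directly with a small sketch like the following. The helper names are made up for illustration, and the byte order you see depends on the machine you run it on.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies the in-memory representation of value into
 * out[0..3], exactly as memcpy sees it. */
static void dump_uint32(uint32_t value, uint8_t out[4])
{
    assert(out != NULL);
    memcpy(out, &value, sizeof value);
}

/* Hypothetical helper: returns 1 if the least significant byte is stored at
 * the lowest address, i.e. the host is little endian. */
static int host_is_little_endian(void)
{
    uint8_t bytes[4];
    dump_uint32(0x11223344u, bytes);
    return bytes[0] == 0x44;
}
```

On a little-endian host, dump_uint32(0x11223344, out) fills out with 44 33 22 11, which is why the bytes look reversed when you read the buffer from left to right.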
Edit
What you do about it really depends on what the buffer is for. If you are only going to be using the buffer internally, forget about byte swapping: you would have to do it in both directions, and it's a waste of time.
If it is for some external entity, e.g. a file or a network protocol, the specification of the file or network protocol will say what the endianness is. For example, network byte order for all the Internet protocols is effectively big endian. The networking library provides a family of functions to convert values for use in sending and receiving Internet protocol messages. See for instance
https://linux.die.net/man/3/htonl
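A minimal sketch of using that function family together with memcpy, assuming a POSIX system where htonl and ntohl come from <arpa/inet.h>. The two wrapper names are invented for this example.

```c
#include <arpa/inet.h>   /* htonl, ntohl (POSIX) */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: stores value in buffer[0..3] in network byte order. */
static void put_u32_network(uint32_t value, uint8_t *buffer)
{
    assert(buffer != NULL);
    uint32_t be = htonl(value);      /* host order -> network (big endian) */
    memcpy(buffer, &be, sizeof be);
}

/* Hypothetical helper: reads a network-order value back into host order. */
static uint32_t get_u32_network(const uint8_t *buffer)
{
    assert(buffer != NULL);
    uint32_t be;
    memcpy(&be, buffer, sizeof be);
    return ntohl(be);                /* network -> host order */
}
```

After put_u32_network(0x11223344, buffer), the buffer holds 11 22 33 44 regardless of the host's own byte order.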
If you want to roll your own, the portable way is to use bit shifts e.g.
void writeUInt32ToBufferBigEndian(uint32_t number, uint8_t* buffer)
{
    buffer[0] = (uint8_t) ((number >> 24) & 0xff);
    buffer[1] = (uint8_t) ((number >> 16) & 0xff);
    buffer[2] = (uint8_t) ((number >> 8) & 0xff);
    buffer[3] = (uint8_t) ((number >> 0) & 0xff);
}
why memcpy is slower than copying data in bytes granularity?
It is likely slower to use memcpy because the size of the chunk you're copying isn't known at compile time. Usually GCC will optimize calls to memcpy with known sizes into an appropriate type of copy. That could use any specific size and algorithm that the compiler thinks to be optimal, and it can and will adjust it based on any compiler optimization flags you specify.
For example, when copying object IDs, Git knows that they must be either 20 bytes (SHA-1) or 32 bytes (SHA-256) in size, so it calls memcpy with one of those two constant sizes.
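The pattern looks roughly like the sketch below. This is not Git's actual code; the function and constant names are hypothetical, and only the two sizes come from the text above.

```c
#include <assert.h>
#include <string.h>

/* Sizes from the text above; the identifier names are invented. */
enum { SHA1_RAW_SIZE = 20, SHA256_RAW_SIZE = 32 };

/* Hypothetical helper: branches once, then calls memcpy with a compile-time
 * constant size in each arm, so the compiler is free to expand each call
 * into a few fixed-width moves instead of a general library call. */
static void copy_object_id(unsigned char *dst, const unsigned char *src,
                           int is_sha256)
{
    assert(dst != NULL && src != NULL);
    if (is_sha256)
        memcpy(dst, src, SHA256_RAW_SIZE);
    else
        memcpy(dst, src, SHA1_RAW_SIZE);
}
```

Whether the constant-size calls are actually inlined depends on the compiler and the optimization flags, so it is worth checking the generated assembly if it matters.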
If I slightly modify your code (formatted as well), option 3 is the fastest on my system, since it uses a 128-byte chunk with memcpy. Both 2 and 4 perform similarly.
Note that in this case, I've made the exit code depend on the value of the data copied because otherwise the compiler will optimize out my memcpy calls since it determines that they're useless. I've also added an additional option which involves copying the full buffer, which is slower. The other options copy the same portion of the buffer over and over again which means that the performance is much better, since much of the data stays in the CPU's cache.
In general, unless you really know you need substantial memcpy performance, you should just use memcpy itself. If your data has a known size, do indeed use that, since that can help performance. However, in most cases, memcpy isn't the bottleneck in your code, and therefore you shouldn't optimize it until you've measured your code and determined that it's the performance problem.
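#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define GB (1UL << 30)
#define MB (1UL << 20)
#define BUF (1 * GB)
#define TIMES (2)
#define CACHELINE 64

int main(int argc, char **argv) {
    assert(argc > 1);
    int flag = atoi(argv[1]);
    int memcpy_sz = 0;
    if (argc > 2)
        memcpy_sz = atoi(argv[2]);
    /* Option 1 steps by memcpy_sz; a size of 0 would loop forever. */
    if (flag == 1 && memcpy_sz <= 0) {
        fprintf(stderr, "option 1 needs a positive chunk size\n");
        return 2;
    }
    char *a = (char *)aligned_alloc(64, BUF);
    char *b = (char *)aligned_alloc(64, BUF);
    if (!a || !b) {
        perror("aligned_alloc");
        return 2;
    }
    memset(a, 1, BUF);
    memset(b, 20, BUF);
    unsigned long i = 0, j;
    struct timespec before, after;
    clock_gettime(CLOCK_MONOTONIC, &before);
    if (flag == 1) { // memcpy with a size only known at run time
        for (j = 0; j < TIMES; j++) {
            // Stop before a chunk would run past the end of the buffers.
            for (i = 0; i + (unsigned long)memcpy_sz <= BUF; i += memcpy_sz) {
                memcpy(a + i, b + i, memcpy_sz);
            }
        }
    } else if (flag == 2) { // word-by-word assignment
        for (j = 0; j < TIMES; j++) {
            size_t *ap = (size_t *)a;
            size_t *bp = (size_t *)b;
            for (i = 0; i < BUF / sizeof(size_t); i++) {
                ap[i] = bp[i];
            }
        }
    } else if (flag == 3) { // memcpy with a constant 128-byte size
        for (j = 0; j < TIMES; j++) {
            // i % BUF == i here, so successive copies start one byte apart
            // and only about the first 8 MB of each buffer is touched.
            for (i = 0; i < BUF / 128; i++) {
                memcpy(a + (i % BUF), b + (i % BUF), 128);
            }
        }
    } else if (flag == 4) { // memcpy with a constant 64-byte size
        for (j = 0; j < TIMES; j++) {
            // Same access pattern as option 3, over about the first 16 MB.
            for (i = 0; i < BUF / 64; i++) {
                memcpy(a + (i % BUF), b + (i % BUF), 64);
            }
        }
    } else if (flag == 5) { // one memcpy over the full buffer
        for (j = 0; j < TIMES; j++) {
            memcpy(a, b, BUF);
        }
    } else { // manually unrolled, one cache line per iteration
        for (j = 0; j < TIMES; j++) {
            // Reset the counter on every pass; otherwise only the first
            // pass would copy anything.
            size_t xlen = BUF / CACHELINE;
            size_t *ap = (size_t *)a;
            size_t *bp = (size_t *)b;
            while (xlen > 0) {
                ap[0] = bp[0];
                ap[1] = bp[1];
                ap[2] = bp[2];
                ap[3] = bp[3];
                ap[4] = bp[4];
                ap[5] = bp[5];
                ap[6] = bp[6];
                ap[7] = bp[7];
                ap += 8;
                bp += 8;
                xlen -= 1;
            }
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &after);
    double elapse =
        (after.tv_sec - before.tv_sec) + (after.tv_nsec - before.tv_nsec) / 1e9;
    // Each option moves about BUF = 1 GB per pass, hence the 1.0 below.
    printf("time = %f s , bw = %.2f GB/s\n", elapse, (1.0) / (elapse / TIMES));
    // The exit code depends on the copied data so the copies are not optimized out.
    int result = a[10] == 20;
    free(a);
    free(b);
    return result;
}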
Why does copying a specific buffer size with memcpy and sprintf, prints more chars in new buffer than there are in the original buffer?
Code is attempting to print a character array as if it was a string, leading to undefined behavior. smallbuf[] does not necessarily contain a null character, so it is not a string. "%s" expects a matching pointer to a string.
Either account for a null character
char smallbuf[8+1];
memcpy(smallbuf, input, 8);
smallbuf[8] = '\0';
printf("%s", smallbuf);
or limit output with a precision. That prints a character array up to N characters or a null character.
char smallbuf[8];
memcpy(smallbuf, input, 8);
printf("%.8s", smallbuf);
A similar issue applies to printf(input). Do not code printf(input); as that may lead to undefined behavior when input[] contains a %.
// printf(input);
printf("%s", input);
Better code would examine the return value of read(0, input, 64).
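One way that check could look, as a sketch assuming a POSIX system. The helper name read_and_print and its parameters are invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>   /* read, ssize_t (POSIX) */

/* Hypothetical helper: reads up to cap bytes from fd into input and prints
 * at most the bytes that were actually received. Returns what read returned:
 * the byte count, 0 at end of file, or -1 on error. */
static ssize_t read_and_print(int fd, char *input, size_t cap)
{
    assert(input != NULL);
    ssize_t n = read(fd, input, cap);
    if (n < 0) {
        perror("read");
        return n;
    }
    /* The precision limits output to n characters, so input does not need
     * a terminating null character. */
    printf("%.*s", (int)n, input);
    return n;
}
```

With this shape, the original call becomes read_and_print(0, input, 64), and any later memcpy can use the returned count instead of assuming that all 64 bytes were filled.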