Bytewise Reading of Memory: "Signed Char *" VS "Unsigned Char *"

Bytewise reading of memory: signed char * vs unsigned char *

You should use unsigned char. The C99 standard says that unsigned char is the only type guaranteed to be dense (no padding bits), and it also specifies that you may copy any object (except bit-fields) exactly by copying it into an unsigned char array, which gives you the object representation in bytes.

The sensible interpretation of this, to me, is that if you use a pointer to access an object as bytes, you should use unsigned char.

Reference: http://blackshell.com/~msmud/cstd.html#6.2.6.1 (from a C1x draft; formerly C99)
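
As a minimal sketch of what that clause permits (the variable names are just for illustration), copying an object into an unsigned char buffer and back is exact and well-defined:

#include <cstdio>
#include <cstring>

int main() {
    double d = 3.14159;

    unsigned char bytes[sizeof d];          // the object representation of d
    std::memcpy(bytes, &d, sizeof d);

    for (unsigned char byte : bytes)        // inspecting the bytes is well-defined
        std::printf("%02x ", static_cast<unsigned>(byte));
    std::printf("\n");

    double copy;
    std::memcpy(&copy, bytes, sizeof copy); // exact copy of the original object
}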

uint8_t vs unsigned char

It documents your intent - you will be storing small numbers, rather than a character.

Also, it looks nicer if you're using other typedefs such as uint16_t or int32_t.
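
For instance (a made-up struct, assuming <cstdint> provides uint8_t on the platform), the fixed-width names read consistently next to one another:

#include <cstdint>

struct PacketHeader {
    std::uint8_t  version;    // a small number, not a character
    std::uint16_t length;
    std::uint32_t checksum;
};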

C/C++ underlying representation of char, unsigned char and signed char

In the general case it is not correct to say that the pattern is the same, if the range of signed char does not cover 240. If 240 is out of range, the result of this out-of-range initialization is implementation-defined (and may raise an implementation-defined signal, see 6.3.1.3/3). The same applies to char initialization if char is signed.

The language guarantees matching representations only for the common part of the ranges of signed char and unsigned char. E.g. this is guaranteed to produce the same pattern:

char c1 = 10;
unsigned char c2 = 10;
signed char c3 = 10;

With 240 there's no such guarantee in the general case (assuming it is out of range).
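
A brief sketch of the distinction, assuming the usual 8-bit char with SCHAR_MAX equal to 127:

unsigned char u = 240;  /* always fine: 240 is within unsigned char's range     */
signed char   s = 240;  /* if SCHAR_MAX is 127, 240 is out of range: the result
                           is implementation-defined (or a signal is raised)    */
signed char   a = 10;   /* in the common range: same bit pattern guaranteed...  */
unsigned char b = 10;   /* ...as this one                                       */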

C - unsigned int to unsigned char array conversion

You can use memcpy in that case:

memcpy(ch, &num, 2); /* better: declare ch with sizeof num elements and copy sizeof num bytes */

Also, how would we convert the unsigned char[2] back to unsigned int?

The same way, just reverse the arguments of memcpy.
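
A minimal round-trip sketch (variable names are just for the example; the buffer is sized with sizeof so the copy cannot overflow):

#include <cstdio>
#include <cstring>

int main() {
    unsigned int num = 0xCAFE;
    unsigned char ch[sizeof num];

    std::memcpy(ch, &num, sizeof num);    /* unsigned int -> byte array */

    unsigned int back = 0;
    std::memcpy(&back, ch, sizeof back);  /* byte array -> unsigned int */

    std::printf("%x\n", back);            /* prints cafe */
}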

Is it better to use char or unsigned char array for storing raw data?

UPDATE: C++17 introduced std::byte, which is more suited to "raw" data buffers than any manner of char (see the sketch after the list below).

For earlier C++ versions:

  • unsigned char emphasises that the data is not "just" text

  • if you've got what's effectively "byte" data from e.g. a compressed stream, a database table backup file, an executable image, a jpeg... then unsigned is appropriate for the binary-data connotation mentioned above

    • unsigned works better for some of the operations you might want to do on binary data, e.g. there are undefined and implementation-defined behaviours for some bit operations on signed types, and unsigned values can be used directly as indices into arrays

    • you can't accidentally pass an unsigned char* to a function expecting char* and have it operated on as presumed text

    • in these situations it's usually more natural to think of the values as being in the range 0..255, after all - why should the "sign" bit have a different kind of significance to the other bits in the data?

  • if you're storing "raw data" that, at an application logic/design level, happens to be 8-bit numeric data, then by all means choose either unsigned or explicitly signed char as appropriate to your needs
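
To illustrate (a sketch only; the file name and buffer sizes are made up), a raw-data buffer in C++17 can use std::byte, while earlier code typically uses unsigned char:

#include <cstddef>    // std::byte (C++17)
#include <fstream>
#include <vector>

int main() {
    // C++17: std::byte says "opaque raw bytes" and supports only bitwise operations.
    std::vector<std::byte> raw(1024);
    raw[0] = std::byte{0xFF};
    raw[0] &= std::byte{0x0F};

    // Pre-C++17: unsigned char carries the same "binary, 0..255" connotation.
    std::vector<unsigned char> buffer(4096);
    std::ifstream in("image.jpg", std::ios::binary);   // hypothetical input file
    in.read(reinterpret_cast<char*>(buffer.data()),
            static_cast<std::streamsize>(buffer.size()));
}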

bitwise type convertion with AVX2 and range preservation

Yeah, the "andnot" definitely looks sketchy. Since _cst2 values are set to 0xFF, this operation will AND your _b vector with zero. I think you mixed up the order of arguments. It's the first argument that gets inverted. See the reference.

I don't understand the rest of the guff with conversions etc. either. You just need this:

__m256i _a, _b;
_a = _mm256_stream_load_si256( reinterpret_cast<__m256i*>(a) );
_b = _mm256_xor_si256( _a, _mm256_set1_epi8( 0x7f ) );
_b = _mm256_andnot_si256( _b, _mm256_set1_epi8( 0xff ) );
_mm256_stream_si256( reinterpret_cast<__m256i*>(b), _b );

An alternative solution is to just add 128, but I'm not certain of the implications of overflow in this case:

__m256i _a, _b;
_a = _mm256_stream_load_si256( reinterpret_cast<__m256i*>(a) );
_b = _mm256_add_epi8( _a, _mm256_set1_epi8( 0x80 ) );
_mm256_stream_si256( reinterpret_cast<__m256i*>(b), _b );

One final important thing is that your a and b arrays MUST have 32-byte alignment. If you are using C++11 you can use alignas:

alignas(32) signed char a[32] = { -1,-2,-3,4,5,6,-7,-8,9,10,-11,12,13,14,15,16,17,
-128,19,20,21,22,23,24,25,26,27,28,29,30,31,32 };
alignas(32) unsigned char b[32] = {0};

Otherwise you will need to use non-aligned load and store instructions, i.e. _mm256_loadu_si256 and _mm256_storeu_si256. But those don't have the same non-temporal cache properties as the stream instructions.
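
For reference, assuming the goal really is the usual bias conversion (mapping the signed range -128..127 onto the unsigned range 0..255), this scalar sketch computes the same thing per byte as either vector version above:

#include <cstdio>

int main() {
    signed char   a[4] = { -128, -1, 0, 127 };
    unsigned char b[4];

    for (int i = 0; i < 4; ++i)
        // XOR with 0x80 flips the sign bit, which is exactly what the
        // XOR/ANDNOT pair (and the wrapping add of 0x80) does to each byte.
        b[i] = static_cast<unsigned char>(a[i]) ^ 0x80u;

    for (int i = 0; i < 4; ++i)
        std::printf("%d -> %d\n", a[i], b[i]);   // -128 -> 0, -1 -> 127, 0 -> 128, 127 -> 255
}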

Count characters in UTF8 when plain char is unsigned

It should.

You are only using bitwise operators (plus a comparison), and those function the same irrespective of whether the underlying data type is signed or unsigned. The only exception may be the != operator, but you could replace this with a ^, whose result is non-zero exactly when the two sides differ, a la:

(*s & 0xc0) ^ 0x80

and then you have solely bitwise operators.

You can verify that the characters are promoted to integers by checking section 3.3.10 of the ANSI C Standard which states that "Each of the operands [of the bitwise AND] shall have integral type."

EDIT

I amend my answer. Bitwise operations are not the same on signed as on unsigned, as per 3.3 of the ANSI C Standard:

Some operators (the unary operator ~, and the binary operators <<, >>, &, ^, and |, collectively described as bitwise operators) shall have operands that have integral type. These operators return values that depend on the internal representations of integers, and thus have implementation-defined aspects for signed types.

In fact, performing bitwise operations on signed integers is listed as a possible security hole here.

In the Visual Studio compiler signed and unsigned are treated the same (see here).

As this SO question discusses, it is better to use unsigned char to do byte-wise reads of memory and manipulations of memory.
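
As a sketch of that byte-wise counting approach (the function name is made up), converting each byte to unsigned char up front sidesteps the signedness question entirely: continuation bytes match 10xxxxxx, and every other byte starts a new character:

#include <cstddef>

// Count UTF-8 code points by counting the bytes that are NOT
// continuation bytes (continuation bytes have the form 10xxxxxx).
std::size_t utf8_length(const char* s) {
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        unsigned char byte = static_cast<unsigned char>(*s);
        if ((byte & 0xc0) != 0x80)   // not a continuation byte -> new character
            ++count;
    }
    return count;
}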

Why is casting from char to std::byte potentially undefined behavior?

This is going to be fixed in the next standard:

A value of integral or enumeration type can be explicitly converted to a complete enumeration type. If the enumeration type has a fixed underlying type, the value is first converted to that type by integral conversion, if necessary, and then to the enumeration type. If the enumeration type does not have a fixed underlying type, the value is unchanged if the original value is within the range of the enumeration values ([dcl.enum]), and otherwise, the behavior is undefined

Here's the rationale behind the change from (C++11) unspecified to (C++17) undefined:

Although issue 1094 clarified that the value of an expression of enumeration type might not be within the range of the values of the enumeration after a conversion to the enumeration type (see 8.2.9 [expr.static.cast] paragraph 10), the result is simply an unspecified value. This should probably be strengthened to produce undefined behavior, in light of the fact that undefined behavior makes an expression non-constant.

And here's the rationale behind the C++2a fix:

The specifications of std::byte (21.2.5 [support.types.byteops]) and bitmask (20.4.2.1.4 [bitmask.types]) have revealed a problem with the integral conversion rules, according to which both those specifications have, in the general case, undefined behavior. The problem is that a conversion to an enumeration type has undefined behavior unless the value to be converted is in the range of the enumeration.

For enumerations with an unsigned fixed underlying type, this requirement is overly restrictive, since converting a large value to an unsigned integer type is well-defined.
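
In practice the round trip looks like this (a sketch; under the amended rule the static_cast is well-defined even when char is signed and holds a negative value, because std::byte's underlying type is unsigned char):

#include <cstddef>   // std::byte, std::to_integer (C++17)

int main() {
    char c = 'x';   // any char value, possibly negative on signed-char platforms

    std::byte b = static_cast<std::byte>(c);                // char -> std::byte
    unsigned int value = std::to_integer<unsigned int>(b);  // std::byte -> integer, 0..255
    (void)value;
}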

Use char* or void* or something else for byte fields in C++?

Using char* for memory blobs is "easy to use" (e.g. for byte-by-byte operations), but it is bad for reading and understanding the code (you still see it in various APIs, however).

If your data is just a blob of memory, then it is better to use void*.

Only if your data is an array of a specific type (char, int, uint8_t, some struct, ...) should you use a pointer of that type.

If you need to treat a struct as "byte data" (for example to calculate a hash), you can internally treat it as char* (or uint8_t*, or uint32_t*, or whatever you need there). However, the public API should still take void* if you don't require a specific memory layout.

The point is: if you have an API using void*, you can supply any type of pointer to it (which is the point of a hash function), whereas with char* you always need a reinterpret_cast at the call site.
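
A sketch of that shape of API (the names are made up; the body is a simple FNV-1a-style loop just to have something concrete): the public signature takes const void*, and the byte-wise walk happens internally:

#include <cstddef>
#include <cstdint>

// Public API: accepts any object's address without a cast at the call site.
std::uint32_t hash_bytes(const void* data, std::size_t size) {
    // Internal byte-wise view; unsigned char (or std::byte) is appropriate here.
    const unsigned char* p = static_cast<const unsigned char*>(data);

    std::uint32_t h = 2166136261u;   // FNV-1a offset basis
    for (std::size_t i = 0; i < size; ++i) {
        h ^= p[i];
        h *= 16777619u;              // FNV-1a prime
    }
    return h;
}

struct Point { int x, y; };

int main() {
    Point pt{1, 2};
    std::uint32_t h = hash_bytes(&pt, sizeof pt);   // no cast needed thanks to void*
    (void)h;
}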


