Do C and C++ Guarantee the ASCII Values of [a-f] and [A-F] Characters

Do C and C++ guarantee the ASCII values of the [a-f] and [A-F] characters?

There are no guarantees about the particular values, but in practice you need not care: your software will probably never encounter a system that is incompatible with ASCII in this respect. Assuming that space is always 32 and that A is always 65 works fine in the modern world.
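If you do rely on those values, a compile-time check makes the assumption explicit instead of silent. This is a minimal sketch, assuming a C++11 compiler; the helper name `is_ascii_upper` is hypothetical and exists only to show code that depends on the checks.

```cpp
#include <cassert>

// Refuse to compile on a platform whose execution character set does not
// put these characters at their ASCII positions.
static_assert(' ' == 32, "space is not 32 on this system");
static_assert('A' == 65 && 'Z' == 90, "upper-case letters are not at ASCII positions");
static_assert('a' == 97 && 'z' == 122, "lower-case letters are not at ASCII positions");

// Hypothetical helper: only meaningful because of the checks above.
constexpr bool is_ascii_upper(char c) {
    return c >= 65 && c <= 90;
}
```

On an EBCDIC system this would fail to build, which is arguably better than misbehaving at run time.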

The C standard only guarantees that letters A-Z and a-z exist and that they fit within a single byte.

It does guarantee that 0-9 are sequential.

In both the source and execution basic character sets, the
value of each character after 0 in the above list of decimal digits shall be one greater than
the value of the previous.

Justification

There are a lot of character encodings out in the world. If you care about portability, you can either make your program portable to different character sets, or you can choose one character set to use everywhere (e.g. Unicode). I'll go ahead and loosely categorize most existing character encodings for you:

  1. Single byte character encodings compatible with ISO/IEC 646. Digits 0-9 and letters A-Z and a-z always occupy the same positions.

  2. Multibyte character encodings (Big5, Shift JIS, ISO 2022-based). In these encodings, your program is probably already broken and you'll need to spend time fixing it if you care. However, parsing numbers will still work as expected.

  3. Unicode encodings. Digits 0-9 and letters A-Z, a-z always occupy the same positions. You can work with either code points or code units freely and you will get the same result, as long as you are working with code points below 128 (which you are). (Are you working with UTF-7? No, you should only use that for email.)

  4. EBCDIC. Digits and letters are assigned different values than in ASCII; however, 0-9, A-F and a-f are still contiguous. Even then, the chance that your code will run on an EBCDIC system is essentially zero.

So the question here is: Do you think that a hypothetical fifth option will be invented in the future, somehow less compatible / more difficult to use than Unicode?

Do you care about EBCDIC?

We could dream up bizarre systems all day... suppose CHAR_BIT is 11, or sizeof(long) = 100, or suppose we use one's complement arithmetic, or malloc() always returns NULL, or suppose the pixels on your monitor are arranged in a hexagonal grid. Suppose your floating-point numbers aren't IEEE 754, suppose all of your data pointers are different sizes. At the end of the day, this does not get us closer to our goals of writing working software on actual modern systems (with the occasional exception).

What is the order of characters when looping through

(b) what is the order of this?

It is the order specified by the native character encoding of the system that you use. It is probably one of ASCII, ISO/IEC 8859 or UTF-8, all of which are identical in the range [0, 128).

Why do they have some symbols after the numbers but not all of them?

Because some of the symbols are before the numbers, and some more are after the letters.

That's just the order that was chosen by whatever committee designed the encoding. There's not necessarily a deep philosophy behind that choice. It may be an anomaly inherited from teletype codes that preceded computer systems.

How can I (a) get all of the characters

You can use numeric limits to find the minimum and maximum values, and a loop to iterate over them:

for (int i = std::numeric_limits<char>::min();
     i <= std::numeric_limits<char>::max(); i++) {  // <= so that the maximum value is included
    char c = static_cast<char>(i);
    Chars.push_back(c);
}
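A self-contained version of such a loop might look like the sketch below. The function name `all_chars` is hypothetical, and the code assumes that `int` is wider than `char` (true on mainstream platforms), so the counter can pass the maximum `char` value without overflowing. Note the inclusive upper bound, which is needed to collect the maximum value as well.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Collect every value a char can hold, in increasing numeric order.
// Assumption: int is wider than char, so i <= hi terminates.
std::vector<char> all_chars() {
    const int lo = std::numeric_limits<char>::min();
    const int hi = std::numeric_limits<char>::max();
    std::vector<char> chars;
    for (int i = lo; i <= hi; ++i) {
        chars.push_back(static_cast<char>(i));
    }
    // Sanity check: one element per representable value.
    assert(chars.size() == static_cast<std::size_t>(hi - lo) + 1);
    return chars;
}
```

Whether the first element is negative depends on whether plain `char` is signed on your platform.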

Are the character digits ['0'..'9'] required to have contiguous numeric values?

Indeed, I had not looked hard enough. In 2.3 Character sets, item 3:

In both the source and execution basic character sets, the value of each character after 0 in the
above list of decimal digits shall be one greater than the value of the previous.

And this is the above list of decimal digits:

0 1 2 3 4 5 6 7 8 9

Therefore, an implementation must use a character set where the decimal digits have a contiguous representation. Thus, optimizations that rely on this property are safe; however, optimizations that rely on the contiguity of other characters (e.g. 'a'..'z') are not portable with respect to the standard (see also header <cctype>). If you do this, make sure to assert that property.
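One way to assert the property at compile time is sketched below, assuming C++14 or later (for the loop inside a constexpr function). The names `contiguous` and `letter_index` are hypothetical.

```cpp
#include <cassert>

// True if every character in s is exactly one greater than the previous one.
constexpr bool contiguous(const char* s) {
    for (; s[0] != '\0' && s[1] != '\0'; ++s) {
        if (s[1] - s[0] != 1) return false;
    }
    return true;
}

// Guaranteed by the standard, so this can never fire.
static_assert(contiguous("0123456789"), "digits must be contiguous");
// Not guaranteed: this stops the build on character sets such as EBCDIC.
static_assert(contiguous("abcdefghijklmnopqrstuvwxyz"),
              "'a'..'z' are not contiguous in this character set");

// Hypothetical helper that relies on the second assertion.
constexpr int letter_index(char c) {
    return (c >= 'a' && c <= 'z') ? c - 'a' : -1;
}
```

With the assertion in place, `letter_index` either works as intended or the program does not compile at all.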

How to shift a character by another character in C

To circularly shift an 8-bit object when the shift count can be large, like 112 ('p'), reduce the shift modulo 8u. Applying % to a negative char and 8 does not give a true modulus (the result can be negative), so use unsigned math.

Access plaintext[i] as an unsigned char [] to avoid sign extension on right shifts.

Use size_t to index the string so that even very long strings are handled.

Sample fix:

char *shift_encrypt2(char *plaintext, const char *password) {
    unsigned char *uplaintext = (unsigned char *) plaintext;
    for (size_t i = 0; uplaintext[i]; i++) {
        unsigned shift = password[i] % 8u;
        uplaintext[i] = (uplaintext[i] << shift) | (uplaintext[i] >> (8u - shift));
    }
    return plaintext;
}

Note: if the password string is shorter than the plaintext string, this reads past the end of password[]. A possible fix is to cycle through password[] repeatedly.


Advanced: use restrict to allow the compiler to assume plaintext[] and password[] do not overlap and emit potentially faster code.

char *shift_encrypt2(char * restrict plaintext, const char * restrict password) {

Advanced: Code really should access password[] as an unsigned char array too, yet with the ubiquitous two's complement representation, password[i] % 8u gives the same result either way.

char *shift_encrypt3(char * restrict plaintext, const char * restrict password) {
    if (password[0]) {
        unsigned char *uplaintext = (unsigned char *) plaintext;
        const unsigned char *upassword = (const unsigned char *) password;
        for (size_t i = 0; uplaintext[i]; i++) {
            if (*upassword == 0) {
                upassword = (const unsigned char *) password;
            }
            unsigned shift = *upassword++ % 8u;
            uplaintext[i] = (uplaintext[i] << shift) | (uplaintext[i] >> (8u - shift));
        }
    }
    return plaintext;
}
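To convince yourself that the rotation loses no bits, it helps to undo it and compare. The sketch below is written in C++ rather than C and is not the answer's code: the names `rotl8` and `rotate_with_password` are hypothetical, and the decrypt direction is added here only to check the round trip. It assumes 8-bit bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Rotate the low 8 bits of v left by n; n is reduced modulo 8 first.
unsigned char rotl8(unsigned char v, unsigned n) {
    n %= 8u;
    // v is promoted to int, so shifting by up to 8 is well defined.
    return static_cast<unsigned char>(((v << n) | (v >> (8u - n))) & 0xFFu);
}

// Rotate each byte of text by the matching password byte, cycling through
// the password. With decrypt == true the rotation is reversed.
std::string rotate_with_password(std::string text, const std::string& password,
                                 bool decrypt) {
    if (password.empty()) return text;  // nothing to key the rotation with
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned shift =
            static_cast<unsigned char>(password[i % password.size()]) % 8u;
        if (decrypt) shift = 8u - shift;  // rotating left by 8 - s undoes s
        text[i] = static_cast<char>(
            rotl8(static_cast<unsigned char>(text[i]), shift));
    }
    return text;
}
```

Note that a password character such as 'p' (112) yields a shift of 0, so the corresponding plaintext byte is left unchanged.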

Character to integer conversion off for alphabet characters

The ASCII code for 'A' is 65; for 'Z', it is 90.
The ASCII code for '0' is 48; for '9', it is 57. These codes are also used in Unicode (UTF-8), 8859-x, and many other codesets.

When you calculate 'A' - '0', you get 65 - 48 = 17, which is the 'off-by-seven' you are seeing.

To convert the alphabetic characters 'A' to 'F' to their hex equivalents, you need some variation on:

c - 'A' + 10;

Remembering that 'a' to 'f' are also allowed and for them you'd need:

c - 'a' + 10;

Or you'd need to convert to upper-case first. Or you can use:

const char hexdigits[] = "0123456789ABCDEF";

/* needs <string.h> for strchr and <ctype.h> for toupper */
int digit = strchr(hexdigits, toupper((unsigned char)c)) - hexdigits;

or any of a myriad other techniques. This last fragment assumes that c is known to contain a valid hex digit. It fails horribly if that is not the case.

Note that C does guarantee that the codes for the digits 0-9 are consecutive, but does not guarantee that the codes for the letters A-Z are consecutive. In particular, if the codeset is EBCDIC (mainly but not solely used on IBM mainframes), the codes for the letters are not contiguous.
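A conversion that does not depend on the letters being contiguous, and that rejects invalid input instead of failing, can be sketched as follows. The function name `hex_value` is hypothetical.

```cpp
#include <cassert>

// Convert one hex digit to its value (0..15); return -1 for anything else.
// The digits use c - '0' because '0'..'9' are guaranteed to be consecutive.
// The letters are listed one by one because no such guarantee exists.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    switch (c) {
        case 'a': case 'A': return 10;
        case 'b': case 'B': return 11;
        case 'c': case 'C': return 12;
        case 'd': case 'D': return 13;
        case 'e': case 'E': return 14;
        case 'f': case 'F': return 15;
        default:            return -1;
    }
}
```

This works unchanged on ASCII and EBCDIC systems, at the cost of a few more lines than the arithmetic version.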

C++ array declaration and initialization

It simply declares an integer array of N elements and initializes every element to zero. N is the value of the expression 'f' + '9' + 2. Every character literal has a corresponding integral value that depends on the encoding used. In ASCII, the character 'f' has the value 102 and the character '9' has the value 57, so the expression becomes 102 + 57 + 2, which equals 161. In other encodings those characters might have other values, giving a different N. With ASCII, the declaration is equivalent to:

int deca[161] = { 0 };  // If ASCII code page is used
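If you want to see what the expression evaluates to on your own system, you can compute the array length rather than assume it. A small sketch, assuming C++11; `kSize` and `deca_length` are hypothetical names, while `deca` is the array from the question.

```cpp
#include <cassert>
#include <cstddef>

// Character literals are integer constants, so this is a valid array bound.
constexpr int kSize = 'f' + '9' + 2;
int deca[kSize] = { 0 };

// Number of elements in deca, derived from the array itself.
constexpr std::size_t deca_length() {
    return sizeof(deca) / sizeof(deca[0]);
}
```

Printing `deca_length()` shows the bound that the compiler actually used for the character set in effect.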

Does C++17 allow a non-ascii character as an identifier?

From the link you provided, it quotes:

If the literal operator is a template, it must have an empty parameter
list and can have only one template parameter, which must be a
non-type template parameter pack with element type char

template <char...> double operator "" _x();

Let us see what this means.

The notation char... indicates that this template can be instantiated with 0, 1, 2 or more parameters of type char. This means that each time the compiler encounters a literal like 1234_km, it should treat it as the following function call:

operator"" _km<'1', '2', '3', '4'>();

The entire string representing the literal is chopped into individual characters, which are passed as template arguments.
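To make this concrete, here is a sketch of a literal operator template that turns the characters back into a number. It assumes C++14 or later and that the literal consists of decimal digits only (no prefix, suffix or decimal point); the `_km` suffix is reused from the example above purely for illustration.

```cpp
#include <cassert>

// Receives the characters of the literal as template arguments, for example
// 1234_km instantiates operator""_km<'1', '2', '3', '4'>().
// Assumption: every character is a decimal digit.
template <char... Digits>
constexpr unsigned long long operator""_km() {
    const char text[] = { Digits..., '\0' };
    unsigned long long value = 0;
    for (int i = 0; text[i] != '\0'; ++i) {
        // '0'..'9' are guaranteed consecutive, so this subtraction is portable.
        value = value * 10 + static_cast<unsigned long long>(text[i] - '0');
    }
    return value;
}
```

Because the function is constexpr, an expression such as `1234_km` can be evaluated entirely at compile time.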

As for the range of characters allowed in identifiers, see Annex E of the C++ standard:

00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF
0100-167F, 1681-180D, 180F-1FFF
200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F
2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF
3004-3007, 3021-302F, 3031-303F
3040-D7FF
F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD
10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD,
60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD,
B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD

