Calculate a Checksum for a String

Calculate a checksum for a string

That's not possible.

If you can't store previous values, it's not possible to create a unique checksum that is smaller than the information in the string.

Update:

The term "reasonably unique" doesn't make sense, either it's unique or it's not.

To get a reasonably low risk of hash collisions, you can use a resonably large hash code.

The MD5 algorithm for example produces a 16 byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class:

string theString = "asdf";

string hash;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create()) {
hash = BitConverter.ToString(
md5.ComputeHash(Encoding.UTF8.GetBytes(theString))
).Replace("-", String.Empty);
}

Console.WriteLine(hash);

Output:

912EC803B2CE49E4A541068D495AB570

Generate Checksum for String

A very common, fast checksum is the CRC-32, a 32-bit polynomial cyclic redundancy check. Here are three implementations in C, which vary in speed vs. complexity, of the CRC-32: (This is from http://www.hackersdelight.org/hdcodetxt/crc.c.txt)

#include <stdio.h>
#include <stdlib.h>

// ---------------------------- reverse --------------------------------

// Reverses (reflects) bits in a 32-bit word.
unsigned reverse(unsigned x) {
x = ((x & 0x55555555) << 1) | ((x >> 1) & 0x55555555);
x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333);
x = ((x & 0x0F0F0F0F) << 4) | ((x >> 4) & 0x0F0F0F0F);
x = (x << 24) | ((x & 0xFF00) << 8) |
((x >> 8) & 0xFF00) | (x >> 24);
return x;
}

// ----------------------------- crc32a --------------------------------

/* This is the basic CRC algorithm with no optimizations. It follows the
logic circuit as closely as possible. */

unsigned int crc32a(unsigned char *message) {
int i, j;
unsigned int byte, crc;

i = 0;
crc = 0xFFFFFFFF;
while (message[i] != 0) {
byte = message[i]; // Get next byte.
byte = reverse(byte); // 32-bit reversal.
for (j = 0; j <= 7; j++) { // Do eight times.
if ((int)(crc ^ byte) < 0)
crc = (crc << 1) ^ 0x04C11DB7;
else crc = crc << 1;
byte = byte << 1; // Ready next msg bit.
}
i = i + 1;
}
return reverse(~crc);
}

// ----------------------------- crc32b --------------------------------

/* This is the basic CRC-32 calculation with some optimization but no
table lookup. The the byte reversal is avoided by shifting the crc reg
right instead of left and by using a reversed 32-bit word to represent
the polynomial.
When compiled to Cyclops with GCC, this function executes in 8 + 72n
instructions, where n is the number of bytes in the input message. It
should be doable in 4 + 61n instructions.
If the inner loop is strung out (approx. 5*8 = 40 instructions),
it would take about 6 + 46n instructions. */

unsigned int crc32b(unsigned char *message) {
int i, j;
unsigned int byte, crc, mask;

i = 0;
crc = 0xFFFFFFFF;
while (message[i] != 0) {
byte = message[i]; // Get next byte.
crc = crc ^ byte;
for (j = 7; j >= 0; j--) { // Do eight times.
mask = -(crc & 1);
crc = (crc >> 1) ^ (0xEDB88320 & mask);
}
i = i + 1;
}
return ~crc;
}

// ----------------------------- crc32c --------------------------------

/* This is derived from crc32b but does table lookup. First the table
itself is calculated, if it has not yet been set up.
Not counting the table setup (which would probably be a separate
function), when compiled to Cyclops with GCC, this function executes in
7 + 13n instructions, where n is the number of bytes in the input
message. It should be doable in 4 + 9n instructions. In any case, two
of the 13 or 9 instrucions are load byte.
This is Figure 14-7 in the text. */

unsigned int crc32c(unsigned char *message) {
int i, j;
unsigned int byte, crc, mask;
static unsigned int table[256];

/* Set up the table, if necessary. */

if (table[1] == 0) {
for (byte = 0; byte <= 255; byte++) {
crc = byte;
for (j = 7; j >= 0; j--) { // Do eight times.
mask = -(crc & 1);
crc = (crc >> 1) ^ (0xEDB88320 & mask);
}
table[byte] = crc;
}
}

/* Through with table setup, now calculate the CRC. */

i = 0;
crc = 0xFFFFFFFF;
while ((byte = message[i]) != 0) {
crc = (crc >> 8) ^ table[(crc ^ byte) & 0xFF];
i = i + 1;
}
return ~crc;
}

If you simply google "CRC32", you will get more info than you could possibly absorb.

Calculate checksum in Lua

local someBytes = {0x55, 0xaa, 0x00, 0x00, 0xfe} -- and so forth
function checksum(bytes)
local SUM = 0
for _,v in ipairs(bytes) do
SUM = SUM + v
end

SUM = SUM + 0x5555

local SUM_H = (SUM & 0xFF << 8) >> 8
local SUM_L = SUM & 0xFF
return SUM_H, SUM_L
end

local highByte, lowByte = checksum(someBytes)
print(string.format("High Byte:\t0x%02X\nLow Byte:\t0x%02X", highByte, lowByte))

checksum for UTF-8 string

Replace the code

for (intCtr = 0; intCtr <= chrArray.Length - 1; intCtr++)
{
intAscSum = intAscSum + (chrArray[intCtr]);
}

chrArray[intCtr] is input ASCII String to ouput the ASCII code in decimal, for example "A" is 65. ASCII encoding only uses 1 byte. UTF-8 uses one byte or more than one byte to represent the UTF-8 char. I think chrArray[intCtr] is designed for ASCII - thus the input of UTF-8 (more than one byte) is not reasonable.

With

int i = 0;
for (i = 0; i < bytes.Length; i++)
{
intAscSum = intAscSum + bytes[i];
}
byte[] bytes = Encoding.UTF8.GetBytes(strMsg);

Add up all the bytes, because one UTF8 char can be more than one byte.

How to calculate the hash of a string literal using only the C preprocessor?

The question "How to calculate the hash of a string literal using only the C preprocessor?"
is valid, however I think you're adding a red-herring by including details about __FILE__ and logging ID's.

This means anyone answering needs to solve the problem you describe, or answer the question on hashing a string with the pre-processor (which may not be a good solution in your particular case!).

As it happens, __FILE__ expands to variable, not a literal string (GCC at least), so you will need to define the filename as a constant. You could use the build system to pass along a define for each for example.

As others have pointed out you can calculate the hash and pass this in via the build-system, although this avoids the question about hashing a sting literal.


Whatever the case, this question comes up when I searched for using the pre-processor for hashing, and none of the answers cover this, so heres is an answer that covers the string hashing part.

This is possible, albeit quite verbose

/**
* Implement compile-time string hashing on string literals.
*
* This macro implements the widely used "djb" hash apparently posted
* by Daniel Bernstein to comp.lang.c some time ago. The 32 bit
* unsigned hash value starts at 5381 and for each byte 'c' in the
* string, is updated: ``hash = hash * 33 + c``. This
* function uses the signed value of each byte.
*
* note: this is the same hash method that glib 2.34.0 uses.
*/

#define SEED 5381

#if 0
// correct but causes insane expansion
# define _SH(e, c) (((e) << 5) + (e) + (unsigned char)(c))
#elif defined(__GNUC__)
// Use statement-expression extension
# define _SH(e, c) ({ unsigned int _e = (unsigned int)(e); (_e << 5) + _e + (unsigned char)c; })
#else
// use an inline function, the compiler will be able to optimize this out.
static inline unsigned int _SH(unsigned int e, unsigned char c)
{
unsigned int _e = (unsigned int)e;
return (_e << 5) + _e + (unsigned char)c;
}
#endif

#define _SH_1(a) _SH(SEED, (a)[0])
#define _SH_2(a) _SH(_SH_1(a), (a)[1])
#define _SH_3(a) _SH(_SH_2(a), (a)[2])
#define _SH_4(a) _SH(_SH_3(a), (a)[3])
#define _SH_5(a) _SH(_SH_4(a), (a)[4])
#define _SH_6(a) _SH(_SH_5(a), (a)[5])
#define _SH_7(a) _SH(_SH_6(a), (a)[6])
#define _SH_8(a) _SH(_SH_7(a), (a)[7])
#define _SH_9(a) _SH(_SH_8(a), (a)[8])
#define _SH_10(a) _SH(_SH_9(a), (a)[9])
#define _SH_11(a) _SH(_SH_10(a), (a)[10])
#define _SH_12(a) _SH(_SH_11(a), (a)[11])
#define _SH_13(a) _SH(_SH_12(a), (a)[12])
#define _SH_14(a) _SH(_SH_13(a), (a)[13])
#define _SH_15(a) _SH(_SH_14(a), (a)[14])
#define _SH_16(a) _SH(_SH_15(a), (a)[15])
#define _SH_17(a) _SH(_SH_16(a), (a)[16])
#define _SH_18(a) _SH(_SH_17(a), (a)[17])
#define _SH_19(a) _SH(_SH_18(a), (a)[18])
#define _SH_20(a) _SH(_SH_19(a), (a)[19])
#define _SH_21(a) _SH(_SH_20(a), (a)[20])
#define _SH_22(a) _SH(_SH_21(a), (a)[21])
#define _SH_23(a) _SH(_SH_22(a), (a)[22])
#define _SH_24(a) _SH(_SH_23(a), (a)[23])
#define _SH_25(a) _SH(_SH_24(a), (a)[24])
#define _SH_26(a) _SH(_SH_25(a), (a)[25])
#define _SH_27(a) _SH(_SH_26(a), (a)[26])
#define _SH_28(a) _SH(_SH_27(a), (a)[27])
#define _SH_29(a) _SH(_SH_28(a), (a)[28])
#define _SH_30(a) _SH(_SH_29(a), (a)[29])
#define _SH_31(a) _SH(_SH_30(a), (a)[30])
#define _SH_32(a) _SH(_SH_31(a), (a)[31])

// initial check prevents too-large strings from compiling
#define STRHASH(a) ( \
(void)(sizeof(int[(sizeof(a) > 33 ? -1 : 1)])), \
(sizeof(a) == 1) ? SEED : \
(sizeof(a) == 2) ? _SH_1(a) : \
(sizeof(a) == 3) ? _SH_2(a) : \
(sizeof(a) == 4) ? _SH_3(a) : \
(sizeof(a) == 4) ? _SH_3(a) : \
(sizeof(a) == 5) ? _SH_4(a) : \
(sizeof(a) == 6) ? _SH_5(a) : \
(sizeof(a) == 7) ? _SH_6(a) : \
(sizeof(a) == 8) ? _SH_7(a) : \
(sizeof(a) == 9) ? _SH_8(a) : \
(sizeof(a) == 10) ? _SH_9(a) : \
(sizeof(a) == 11) ? _SH_10(a) : \
(sizeof(a) == 12) ? _SH_11(a) : \
(sizeof(a) == 13) ? _SH_12(a) : \
(sizeof(a) == 14) ? _SH_13(a) : \
(sizeof(a) == 15) ? _SH_14(a) : \
(sizeof(a) == 16) ? _SH_15(a) : \
(sizeof(a) == 17) ? _SH_16(a) : \
(sizeof(a) == 18) ? _SH_17(a) : \
(sizeof(a) == 19) ? _SH_18(a) : \
(sizeof(a) == 20) ? _SH_19(a) : \
(sizeof(a) == 21) ? _SH_20(a) : \
(sizeof(a) == 22) ? _SH_21(a) : \
(sizeof(a) == 23) ? _SH_22(a) : \
(sizeof(a) == 24) ? _SH_23(a) : \
(sizeof(a) == 25) ? _SH_24(a) : \
(sizeof(a) == 26) ? _SH_25(a) : \
(sizeof(a) == 27) ? _SH_26(a) : \
(sizeof(a) == 28) ? _SH_27(a) : \
(sizeof(a) == 29) ? _SH_28(a) : \
(sizeof(a) == 30) ? _SH_29(a) : \
(sizeof(a) == 31) ? _SH_30(a) : \
(sizeof(a) == 32) ? _SH_31(a) : \
(sizeof(a) == 33) ? _SH_32(a) : \
0)
// last zero is unreachable

// only for comparison
unsigned int strhash_func(const void *str)
{
const signed char *p;
unsigned int h = 5381;

for (p = str; *p != '\0'; p++) {
h = (h << 5) + h + (unsigned int)*p;
}

return h;
}

/* -------------------------------------------------------------------- */
#include <stdio.h>

#define TEST_STR1 "Hello World"
#define TEST_STR2 "Testing 123"
int main(void)
{
unsigned int A = STRHASH(TEST_STR1);
unsigned int B = STRHASH(TEST_STR2);

printf("String hash: const %u <- '%s'\n", STRHASH(TEST_STR1), TEST_STR1);
printf("String hash: const %u <- '%s'\n", STRHASH(TEST_STR2), TEST_STR2);
printf("String hash: dyn %u <- '%s'\n", strhash_func(TEST_STR1), TEST_STR1);
printf("String hash: dyn %u <- '%s'\n", strhash_func(TEST_STR2), TEST_STR2);

#if defined(__GNUC__)
printf("Is this known at compile time?, answer is: %d\n", __builtin_constant_p(A));
#endif
}

Note, for some reason Clang 5.0 prints answer is: 0, however on closer inspection is does in fact know the value at compile time, __builtin_constant_p just doesn't seem to work as GCC does.



Related Topics



Leave a reply



Submit