Implementing SSE 4.2's CRC32C in software

Here are both software and hardware versions of CRC-32C. The software version is optimized to process eight bytes at a time. The hardware version is optimized to run three crc32q instructions effectively in parallel on a single core, since that instruction has a throughput of one cycle but a latency of three cycles.

crc32c.c:

/* crc32c.c -- compute CRC-32C using the Intel crc32 instruction
 * Copyright (C) 2013, 2021 Mark Adler
 * Version 1.2  5 Jun 2021  Mark Adler
 */

/*
  This software is provided 'as-is', without any express or implied
  warranty. In no event will the author be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

  Mark Adler
  madler@alumni.caltech.edu
*/

/* Version History:
 1.0  10 Feb 2013  First version
 1.1  31 May 2021  Correct register constraints on assembly instructions
                   Include pre-computed tables to avoid use of pthreads
                   Return zero for the CRC when buf is NULL, as initial value
 1.2   5 Jun 2021  Make tables constant
 */

// Use hardware CRC instruction on Intel SSE 4.2 processors. This computes a
// CRC-32C, *not* the CRC-32 used by Ethernet and zip, gzip, etc. A software
// version is provided as a fall-back, as well as for speed comparisons.

#include <stddef.h>
#include <stdint.h>

// Tables for CRC word-wise calculation, definitions of LONG and SHORT, and CRC
// shifts by LONG and SHORT bytes.
#include "crc32c.h"

// Table-driven software version as a fall-back. This is about 15 times slower
// than using the hardware instructions. This assumes little-endian integers,
// as is the case on Intel processors that the assembler code here is for.
static uint32_t crc32c_sw(uint32_t crc, void const *buf, size_t len) {
    if (buf == NULL)
        return 0;
    unsigned char const *data = buf;
    while (len && ((uintptr_t)data & 7) != 0) {
        crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
        len--;
    }
    size_t n = len >> 3;
    for (size_t i = 0; i < n; i++) {
        uint64_t word = crc ^ ((uint64_t const *)data)[i];
        crc = crc32c_table[7][word & 0xff] ^
              crc32c_table[6][(word >> 8) & 0xff] ^
              crc32c_table[5][(word >> 16) & 0xff] ^
              crc32c_table[4][(word >> 24) & 0xff] ^
              crc32c_table[3][(word >> 32) & 0xff] ^
              crc32c_table[2][(word >> 40) & 0xff] ^
              crc32c_table[1][(word >> 48) & 0xff] ^
              crc32c_table[0][word >> 56];
    }
    data += n << 3;
    len &= 7;
    while (len) {
        len--;
        crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
    }
    return crc;
}

// Apply the zeros operator table to crc.
static uint32_t crc32c_shift(uint32_t const zeros[][256], uint32_t crc) {
    return zeros[0][crc & 0xff] ^ zeros[1][(crc >> 8) & 0xff] ^
           zeros[2][(crc >> 16) & 0xff] ^ zeros[3][crc >> 24];
}

// Compute CRC-32C using the Intel hardware instruction. Three crc32q
// instructions are run in parallel on a single core. This gives a
// factor-of-three speedup over a single crc32q instruction, since the
// throughput of that instruction is one cycle, but the latency is three
// cycles.
static uint32_t crc32c_hw(uint32_t crc, void const *buf, size_t len) {
    if (buf == NULL)
        return 0;

    // Pre-process the crc.
    uint64_t crc0 = crc ^ 0xffffffff;

    // Compute the crc for up to seven leading bytes, bringing the data pointer
    // to an eight-byte boundary.
    unsigned char const *next = buf;
    while (len && ((uintptr_t)next & 7) != 0) {
        __asm__("crc32b\t" "(%1), %0"
                : "+r"(crc0)
                : "r"(next), "m"(*next));
        next++;
        len--;
    }

    // Compute the crc on sets of LONG*3 bytes, making use of three ALUs in
    // parallel on a single core.
    while (len >= LONG*3) {
        uint64_t crc1 = 0;
        uint64_t crc2 = 0;
        unsigned char const *end = next + LONG;
        do {
            __asm__("crc32q\t" "(%3), %0\n\t"
                    "crc32q\t" LONGx1 "(%3), %1\n\t"
                    "crc32q\t" LONGx2 "(%3), %2"
                    : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                    : "r"(next), "m"(*next));
            next += 8;
        } while (next < end);
        crc0 = crc32c_shift(crc32c_long, crc0) ^ crc1;
        crc0 = crc32c_shift(crc32c_long, crc0) ^ crc2;
        next += LONG*2;
        len -= LONG*3;
    }

    // Do the same thing, but now on SHORT*3 blocks for the remaining data less
    // than a LONG*3 block.
    while (len >= SHORT*3) {
        uint64_t crc1 = 0;
        uint64_t crc2 = 0;
        unsigned char const *end = next + SHORT;
        do {
            __asm__("crc32q\t" "(%3), %0\n\t"
                    "crc32q\t" SHORTx1 "(%3), %1\n\t"
                    "crc32q\t" SHORTx2 "(%3), %2"
                    : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                    : "r"(next), "m"(*next));
            next += 8;
        } while (next < end);
        crc0 = crc32c_shift(crc32c_short, crc0) ^ crc1;
        crc0 = crc32c_shift(crc32c_short, crc0) ^ crc2;
        next += SHORT*2;
        len -= SHORT*3;
    }

    // Compute the crc on the remaining eight-byte units less than a SHORT*3
    // block.
    unsigned char const *end = next + (len - (len & 7));
    while (next < end) {
        __asm__("crc32q\t" "(%1), %0"
                : "+r"(crc0)
                : "r"(next), "m"(*next));
        next += 8;
    }
    len &= 7;

    // Compute the crc for up to seven trailing bytes.
    while (len) {
        __asm__("crc32b\t" "(%1), %0"
                : "+r"(crc0)
                : "r"(next), "m"(*next));
        next++;
        len--;
    }

    // Return the crc, post-processed.
    return ~(uint32_t)crc0;
}

// Check for SSE 4.2. SSE 4.2 was first supported in Nehalem processors
// introduced in November, 2008. This does not check for the existence of the
// cpuid instruction itself, which was introduced on the 486SL in 1992, so this
// will fail on earlier x86 processors. cpuid works on all Pentium and later
// processors.
#define SSE42(have) \
    do { \
        uint32_t eax, ecx; \
        eax = 1; \
        __asm__("cpuid" \
                : "=c"(ecx) \
                : "a"(eax) \
                : "%ebx", "%edx"); \
        (have) = (ecx >> 20) & 1; \
    } while (0)

// Compute a CRC-32C. If the crc32 instruction is available, use the hardware
// version. Otherwise, use the software version.
uint32_t crc32c(uint32_t crc, void const *buf, size_t len) {
    int sse42;
    SSE42(sse42);
    return sse42 ? crc32c_hw(crc, buf, len) : crc32c_sw(crc, buf, len);
}

Code to generate crc32c.h (Stack Overflow won't let me post the tables themselves, due to the 30,000-character limit on an answer):

// Generate crc32c.h for crc32c.c.

#include <stdio.h>
#include <stdint.h>

#define LONG 8192
#define SHORT 256

// Print a 2-D table of four-byte constants in hex.
static void print_table(uint32_t *tab, size_t rows, size_t cols, char *name) {
    printf("static uint32_t const %s[][%zu] = {\n", name, cols);
    size_t end = rows * cols;
    size_t k = 0;
    for (;;) {
        fputs(" {", stdout);
        size_t n = 0, j = 0;
        for (;;) {
            printf("0x%08x", tab[k + n]);
            if (++n == cols)
                break;
            putchar(',');
            if (++j == 6) {
                fputs("\n ", stdout);
                j = 0;
            }
            putchar(' ');
        }
        k += cols;
        if (k == end)
            break;
        puts("},");
    }
    puts("}\n};");
}

/* CRC-32C (iSCSI) polynomial in reversed bit order. */
#define POLY 0x82f63b78

static void crc32c_word_table(void) {
    uint32_t table[8][256];

    // Generate byte-wise table.
    for (unsigned n = 0; n < 256; n++) {
        uint32_t crc = ~n;
        for (unsigned k = 0; k < 8; k++)
            crc = crc & 1 ? (crc >> 1) ^ POLY : crc >> 1;
        table[0][n] = ~crc;
    }

    // Use byte-wise table to generate word-wise table.
    for (unsigned n = 0; n < 256; n++) {
        uint32_t crc = ~table[0][n];
        for (unsigned k = 1; k < 8; k++) {
            crc = table[0][crc & 0xff] ^ (crc >> 8);
            table[k][n] = ~crc;
        }
    }

    // Print table.
    print_table(table[0], 8, 256, "crc32c_table");
}

// Return a(x) multiplied by b(x) modulo p(x), where p(x) is the CRC
// polynomial. For speed, this requires that a not be zero.
static uint32_t multmodp(uint32_t a, uint32_t b) {
    uint32_t prod = 0;
    for (;;) {
        if (a & 0x80000000) {
            prod ^= b;
            if ((a & 0x7fffffff) == 0)
                break;
        }
        a <<= 1;
        b = b & 1 ? (b >> 1) ^ POLY : b >> 1;
    }
    return prod;
}

/* Take a length and build four lookup tables for applying the zeros operator
for that length, byte-by-byte, on the operand. */
static void crc32c_zero_table(size_t len, char *name) {
    // Generate operator for len zeros.
    uint32_t op = 0x80000000;           // 1 (x^0)
    uint32_t sq = op >> 4;              // x^4
    while (len) {
        sq = multmodp(sq, sq);          // x^2^(k+3), k == len bit position
        if (len & 1)
            op = multmodp(sq, op);
        len >>= 1;
    }

    // Generate table to update each byte of a CRC using op.
    uint32_t table[4][256];
    for (unsigned n = 0; n < 256; n++) {
        table[0][n] = multmodp(op, n);
        table[1][n] = multmodp(op, n << 8);
        table[2][n] = multmodp(op, n << 16);
        table[3][n] = multmodp(op, n << 24);
    }

    // Print the table to stdout.
    print_table(table[0], 4, 256, name);
}

int main(void) {
    puts(
        "// crc32c.h\n"
        "// Tables and constants for crc32c.c software and hardware calculations.\n"
        "\n"
        "// Table for a 64-bits-at-a-time software CRC-32C calculation. This table\n"
        "// has built into it the pre and post bit inversion of the CRC."
    );
    crc32c_word_table();
    puts(
        "\n// Block sizes for three-way parallel crc computation. LONG and SHORT\n"
        "// must both be powers of two. The associated string constants must be set\n"
        "// accordingly, for use in constructing the assembler instructions."
    );
    printf("#define LONG %d\n", LONG);
    printf("#define LONGx1 \"%d\"\n", LONG);
    printf("#define LONGx2 \"%d\"\n", 2 * LONG);
    printf("#define SHORT %d\n", SHORT);
    printf("#define SHORTx1 \"%d\"\n", SHORT);
    printf("#define SHORTx2 \"%d\"\n", 2 * SHORT);
    puts(
        "\n// Table to shift a CRC-32C by LONG bytes."
    );
    crc32c_zero_table(8192, "crc32c_long");
    puts(
        "\n// Table to shift a CRC-32C by SHORT bytes."
    );
    crc32c_zero_table(256, "crc32c_short");
    return 0;
}

Using the SSE 4.2 crc32 algorithm in C#? Is it possible?

Nowadays we've been spoiled with the System.Runtime.Intrinsics.X86 namespace available in .NET Core 3.0. Here's the full implementation of the CRC32-C algorithm using SSE 4.2:

using System;
using System.Runtime.Intrinsics.X86;
using System.Security.Cryptography;

/// <summary>
/// The hardware implementation of the CRC32-C polynomial
/// implemented on Intel CPUs supporting SSE4.2.
/// </summary>
public class Crc32HardwareAlgorithm : HashAlgorithm
{
    /// <summary>
    /// the current CRC value, bit-flipped
    /// </summary>
    private uint _crc;

    /// <summary>
    /// We can further optimize the algorithm when X64 is available.
    /// </summary>
    private bool _x64Available;

    /// <summary>
    /// Default constructor
    /// </summary>
    public Crc32HardwareAlgorithm()
    {
        if (!Sse42.IsSupported)
        {
            throw new NotSupportedException("SSE4.2 is not supported");
        }

        _x64Available = Sse42.X64.IsSupported;

        // The size, in bits, of the computed hash code.
        this.HashSizeValue = 32;
        this.Reset();
    }

    /// <summary>When overridden in a derived class, routes data written to the object into the hash algorithm for computing the hash.</summary>
    /// <param name="array">The input to compute the hash code for.</param>
    /// <param name="ibStart">The offset into the byte array from which to begin using data.</param>
    /// <param name="cbSize">The number of bytes in the byte array to use as data.</param>
    protected override void HashCore(byte[] array, int ibStart, int cbSize)
    {
        if (_x64Available)
        {
            while (cbSize >= 8)
            {
                _crc = (uint)Sse42.X64.Crc32(_crc, BitConverter.ToUInt64(array, ibStart));
                ibStart += 8;
                cbSize -= 8;
            }
        }

        while (cbSize > 0)
        {
            _crc = Sse42.Crc32(_crc, array[ibStart]);
            ibStart++;
            cbSize--;
        }
    }

    /// <summary>When overridden in a derived class, finalizes the hash computation after the last data is processed by the cryptographic stream object.</summary>
    /// <returns>The computed hash code.</returns>
    protected override byte[] HashFinal()
    {
        uint outputCrcValue = ~_crc;

        return BitConverter.GetBytes(outputCrcValue);
    }

    /// <summary>Initializes an implementation of the <see cref="T:System.Security.Cryptography.HashAlgorithm"></see> class.</summary>
    public override void Initialize()
    {
        this.Reset();
    }

    private void Reset()
    {
        _crc = uint.MaxValue;
    }
}

Javascript CRC32C implementation that is compatible with Intel's SSE4.2 hardware implementation

See this answer for a compatible software implementation and a fast implementation using the hardware instruction.

You have a few problems there. One is that in JavaScript you need to use a logical right shift (>>>) instead of an arithmetic right shift (>>). Second is that you are using charCodeAt, which returns the Unicode value of a character, which may be more than one byte; the CRC algorithm operates on a sequence of bytes, not a sequence of Unicode characters. Third is that you're computing the same table every time -- the table should only be computed once. Last is that you're jumping straight to a complicated implementation.

As a simple example, this will compute a CRC-32C in Javascript on an array of values expected to be integers in the range 0..255, i.e. bytes:

function crc32c(crc, bytes) {
    var POLY = 0x82f63b78;
    var n;

    crc ^= 0xffffffff;
    for (n = 0; n < bytes.length; n++) {
        crc ^= bytes[n];
        crc = crc & 1 ? (crc >>> 1) ^ POLY : crc >>> 1;
        crc = crc & 1 ? (crc >>> 1) ^ POLY : crc >>> 1;
        crc = crc & 1 ? (crc >>> 1) ^ POLY : crc >>> 1;
        crc = crc & 1 ? (crc >>> 1) ^ POLY : crc >>> 1;
        crc = crc & 1 ? (crc >>> 1) ^ POLY : crc >>> 1;
        crc = crc & 1 ? (crc >>> 1) ^ POLY : crc >>> 1;
        crc = crc & 1 ? (crc >>> 1) ^ POLY : crc >>> 1;
        crc = crc & 1 ? (crc >>> 1) ^ POLY : crc >>> 1;
    }
    return crc ^ 0xffffffff;
}

_mm_crc32 giving different results than manual version

Mark Adler's answer on Implementing SSE 4.2's CRC32C in software shows that you need to start with 0 ^ 0xffffffff and end with crc0 ^ 0xffffffff to pre- and post-process (or use the ~ operator, as you're doing in the software version).

Mark's answer uses GNU C inline asm, but porting it to intrinsics would be simple. (It unrolls with multiple accumulators to hide the latency of _mm_crc32_u64 over a big buffer.)

This version works on my system.

#include <iostream>
#include <nmmintrin.h>  // _mm_crc32_u32; compile with -msse4.2

int main(int argc, char **argv)
{
    const unsigned int val = 5;
    std::cout << std::hex << crc32c2(0, (const unsigned char*)&val, 4) << '\n';
    std::cout << (_mm_crc32_u32(0 ^ 0xffffffff, 5) ^ 0xffffffffU) << '\n';
}

(Note that std::endl is pointlessly slower than '\n', unless you actually need to force a flush in case the stream is fully buffered instead of line-buffered.)

How to implement a hash function for strings in C using CRC32C instruction from sse4.2 x86 extension?

I think you misunderstand the way those functions work. They are expected to be called in sequence for every character in the string you need to calculate the CRC for (or for every word, if you use the larger-argument version, which you should).

The idea remains the same: first you initialize the CRC, then you call the CRC function in a loop, giving the previous value of the CRC in the first argument and the hashed value in the second. You store the result in the CRC and keep calm and carry on. (Note that initializing with 0, as below, yields the raw instruction result; to match the standard CRC-32C check values, initialize with 0xffffffff and invert the final CRC, as described in the answers above.)

Here is a code example:

uint64_t byte_crc32(unsigned int crc, const char** buf, size_t len) {
    while (len > 0) {
        crc = _mm_crc32_u8(crc, *(const unsigned char*)(*buf));
        ++*buf;
        --len;
    }

    return crc;
}

uint64_t hw_crc32(unsigned int crc, const char** buf, size_t len) {
    while (len > 0) {
        crc = _mm_crc32_u16(crc, *(const uint16_t*)(*buf));
        *buf += 2;
        --len;
    }

    return crc;
}

uint64_t word_crc32(unsigned int crc, const char** buf, size_t len) {
    while (len > 0) {
        crc = _mm_crc32_u32(crc, *(const uint32_t*)(*buf));
        *buf += 4;
        --len;
    }

    return crc;
}

uint64_t dword_crc32(uint64_t crc, const char** buf, size_t len) {
    while (len > 0) {
        crc = _mm_crc32_u64(crc, *(const uint64_t*)(*buf));
        *buf += 8;
        --len;
    }

    return crc;
}


uint64_t crc(const char* buff, size_t len) {
    const size_t dword_chunks = len / 8;
    const size_t dword_diff = len % 8;

    const size_t word_chunks = dword_diff / 4;
    const size_t word_diff = dword_diff % 4;

    const size_t hw_chunks = word_diff / 2;
    const size_t hw_diff = word_diff % 2;

    uint64_t crc = byte_crc32(0, &buff, hw_diff);
    crc = hw_crc32(crc, &buff, hw_chunks);
    crc = word_crc32(crc, &buff, word_chunks);
    crc = dword_crc32(crc, &buff, dword_chunks);

    return crc;
}

Why SSE4.2 CRC32 hash value is different with software CRC32 hash value?

That table is for the "standard" CRC-32 used in Ethernet, zip, gzip, V.42, and many other places.

I need to point out that the code in your question is incorrect, with the crc = 127; statement. crc should not be set at all, since it is an input to the function that is then lost, and it certainly should not be initialized to 127. What should be there is crc = ~crc;. The same approach should be used at the end, instead of return crc ^ ~0U;, which may not be portable if unsigned is not the same size as uint32_t. The use of ~ is fine with uint32_t, but if a different type is used that is not 32 bits, then crc ^ 0xffffffff is more portable still.

Anyway, to answer your question, Intel chose a different 32-bit CRC to implement in their hardware instruction. That CRC is usually referred to as CRC-32C, using a polynomial discovered by Castagnoli (which is what the "C" refers to) with better properties than the polynomial used in the standard CRC-32. iSCSI and SCTP use CRC-32C instead of CRC-32. I presume that had something to do with Intel's choice.

See this answer for code that computes the CRC-32C in software, as well as in hardware if available.

CRC32 C or C++ implementation

Use the Boost C++ Libraries. They include a CRC library, and the license is good.

_mm_crc32_u64 poorly defined


Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks.

My friend and I wrote a C++ SSE intrinsics wrapper that contains a preferred usage of the crc32 instruction with a 64-bit source operand.

http://code.google.com/p/sse-intrinsics/

See the i_crc32() instruction.
(Sadly, there are even more flaws in Intel's SSE intrinsic specifications for other instructions; see that page for more examples of flawed intrinsic design.)


